In the next seven minutes I'm going to introduce the R package GSscraper, which is for scraping search results from Google Scholar.

Google Scholar is a useful resource in evidence synthesis and meta-analyses because it's a web-based academic search engine that's free to use. It's also very easy to use, which is why it's so popular. There's no human selection of the content, so unlike bibliographic databases, anything that looks like academic content is included in Google Scholar, and that content is regularly crawled from the internet, so it's very up to date; more up to date than many bibliographic databases, because it doesn't require manual inclusion. It also identifies grey literature, things like preprints, and it has been shown to be a useful addition to databases and other searching in systematic reviews.

It's not suitable as a single source of information in systematic reviews or other evidence syntheses, though, for a number of reasons. Firstly, it doesn't support full Boolean search strings, only basic ones. Secondly, you can only see 1,000 search results, and the order of those results is decided by a black-box ranking algorithm: if more than 1,000 results are returned, you can only see the first 1,000, and you don't know why those particular 1,000 are being shown. There's also no bulk export from Google Scholar, which makes it difficult to integrate into review management platforms. And there's a very sensitive bot detection algorithm, which makes it difficult to do anything in a patterned way.

One of the solutions to these kinds of problems with grey literature is web scraping, the extraction of patterned data from information on the internet. There's a range of ways of scraping information, from human-driven copy and paste through to more technical, machine-learning-type approaches, but perhaps the most common is HTML parsing, in which you extract chunks of content based on unique or repeated signatures in the HTML code. Within Google Scholar you can identify each result on a page of results by the tag class="gs_ri" in the HTML, and for every search result there is exactly one such tag, which makes it quite easy. On the left you can see the search results in Google Scholar, and on the right you can see the HTML: there's an entire div with the class gs_ri, and we can then extract each piece of information within that search result using this code.

So I've produced some functions within GSscraper. One is buildGSlinks, which builds Google Scholar URLs that take you to real pages of search results based on patterns in the URL itself. You specify the terms exactly as you would for the advanced search, and it generates functioning URLs for pages of search results. You input the information as you see on the left, and the function on the right lets you specify how many pages of search results you want brought back; in this case it provides 100 URLs covering the first 1,000 search results.

The next function, save_htmls, saves the pages behind those URLs, or any given URLs from Google Scholar or in fact any web page, and it uses two methods to avoid bot detection. The first is a consistent pause between calls, and the second is a multiplier applied to that pause depending on the response delay of the server: if the server takes 1.2 seconds to respond, that delay is multiplied by the pause and the function waits that long between the current call and the next, which just introduces some randomness.
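To make those two steps concrete, here is a minimal sketch in R of the same ideas: building paginated Google Scholar URLs from the search terms (using Scholar's q and start URL parameters, with ten results per page) and fetching each page with a pause multiplied by the server's response delay. The function names, arguments and URL handling here are illustrative only; this is not GSscraper's actual code.

```r
# Sketch 1: build paginated Google Scholar URLs from a set of search terms.
build_gs_urls <- function(terms, pages = 1) {
  query <- paste(terms, collapse = "+")        # join terms into a simple query string
  starts <- (seq_len(pages) - 1) * 10          # Scholar paginates in steps of 10
  paste0("https://scholar.google.com/scholar?hl=en&q=", query, "&start=", starts)
}

# Sketch 2: download each page, saving it locally and waiting between calls
# for (pause x server response delay) seconds, as described above.
save_htmls_sketch <- function(urls, pause = 5) {
  for (i in seq_along(urls)) {
    t0 <- Sys.time()
    page <- tryCatch(readLines(urls[i], warn = FALSE),   # fetch the raw HTML
                     error = function(e) character(0))
    writeLines(page, paste0("gs_page_", i, ".html"))     # save it locally
    delay <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
    Sys.sleep(pause * delay)   # pause scaled by the server's response delay
  }
}

# Example: URLs for the first three pages (30 results) of a two-term search
urls <- build_gs_urls(c("evidence", "synthesis"), pages = 3)
# save_htmls_sketch(urls, pause = 5)
```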
So if we take the five URLs shown here, the function pulls back the HTML pages for those five URLs, waiting, in this case, five seconds multiplied by whatever the server delay happened to be at the time, and saves them locally.

The next function gets the information out of those HTML files. It uses a sub-function called split_by_div, which breaks the HTML into chunks of code, each starting with 'div class'. It then uses regular expressions to extract each search result as a single chunk from within an HTML file, and it pulls out specific citation information based on the position of the sub-chunk, the subset of text within each major result chunk; we know its position because Google Scholar's layout is consistent (the basic idea is sketched at the end). The output is a data frame in which each line is a search result and each column is the title, authors, description, year, link or DOI. The DOI is really useful because DOIs are more often than not embedded within the publisher URLs, so extracting them lets us link very consistently to other systems, like Crossref.

This code here shows the function used to extract titles from each search result within the HTML. It uses a few substitutions, either global or single substitutions, to remove field codes, so that we end up with the plain text rather than the text with the embedded code as well.

The final function is the save-and-scrape wrapper for Google Scholar: it just takes the building, saving and scraping functions and pulls them together into one. Here you can see that you just input the terms you want in your search, and you can then create a global object, in this case 'info', which is a data frame of as many search results as there are in your HTML files. The function looks locally within your working directory for any HTML files and extracts all of the relevant information saved in them. In this case we're specifying 30 pages of search results, which would give us 300 results.

There are some challenges, though. Despite the pause and back-off, bot detection is still an issue, so that needs tweaking and improving. I tried working with IP addresses, but that's much harder for me; I'm not an expert in that. Another issue is that not all of the links provided by publishers contain DOIs, although I was surprised by how many did. Future work includes building a link to Crossref to validate the records that are extracted; I've done a bit of work on that, but it needs tweaking and improving. Some of the resources that already exist, like Publish or Perish, retrieve much of this information, but they don't get the description or the DOI, so I think that's quite a unique improvement. I also want to improve the bot evasion and allow exporting to RIS format files, probably using synthesisr from the metaverse.
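To make the splitting and extraction step concrete, here is a minimal sketch in R of that idea: split a saved results page on the div that opens each result and pull the title out of each chunk with regular expressions. The class names gs_ri and gs_rt reflect Google Scholar's markup as described in the talk and may change, and the function name is illustrative rather than GSscraper's actual code.

```r
# Sketch 3: split a saved results page into per-result chunks and extract titles.
get_titles_sketch <- function(html_file) {
  page <- paste(readLines(html_file, warn = FALSE), collapse = "\n")
  # one chunk per search result, split on the div that opens each result
  chunks <- strsplit(page, '<div class="gs_ri"', fixed = TRUE)[[1]][-1]
  # the title sits in an <h3 class="gs_rt"> element within each chunk
  hits <- regmatches(chunks,
                     regexpr('(?s)<h3 class="gs_rt".*?</h3>', chunks, perl = TRUE))
  trimws(gsub("<[^>]+>", "", hits))   # drop the remaining tags, keep the plain text
}

# Example: titles from every saved results page in the working directory
# unlist(lapply(list.files(pattern = "\\.html$"), get_titles_sketch))
```

Thanks very much for your time.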