Hi everybody, I'm back. Cyrus here. How many people work for an agency? Okay. Yeah, I can see all your hands. Thank you. So our next speaker spent 10 years running one of the most well-recognized and respected agencies here in the States: Distilled. After 10 years of running that agency, he joined Moz as VP of Product, and now he's helping Moz build the SEO tools of the future. Today he's going to talk about something that's kind of intimidating to some people, including myself, which is scraping: scraping for SEO. I'm really curious to see his presentation and I'm excited to watch. So let's kick it over to Rob.

Hi everyone. I'm Rob Ousbey, the VP of Product at Moz. I joined Moz at the end of last year, having spent almost 12 years working as an SEO consultant. I mention this because today I'm not going to be promoting Moz's tools, and I'm not going to be showing off a bunch of Moz features. In fact, I'm not even going to be talking about big strategic topics. Instead, I'm going to share a single technique with you, a technique that has so many uses and applications that I've pulled it out of my toolkit again and again over the last decade. It's been so useful that I thought it was high time I shared it with you as well. Some of you will see immediate applications for this, and it'll be a game changer for the work you're doing. My hope for everyone else is that I can show you how powerful and flexible this can be, and that it will stick in your mind for long enough that when the day comes that you need it, you'll remember it's in your toolkit, revisit this presentation, and get as much value from it as I have.

I'm going to cram a lot into the next 30 minutes, which is why I'm going to make sure these slides are available to you immediately afterwards. And since you're all going to get copies of the videos as well, I'd encourage you to go and watch again, and in particular to pause the video and step slowly through the code samples I'm going to share. During the live presentation, I'm going to live-tweet some links to extra content that you'll hopefully find useful. My Twitter handle is right there, @RobOusbey, and it'll be in the corner of every slide as well. If you're not watching this live, then follow the link later in the presentation and I'll have all of these resources wrapped up in one place for you.

This technique was on my mind because a colleague asked me the other day about g2.com. In case you're not familiar with G2, it's a fantastic site with real customer reviews of thousands of software products, and it's absolutely one of the places you go when you're thinking about making an investment in software for your business. They have a review score for each product, and this person had asked me: for a list of products that I want to compare, how can I easily see the scores for each of them side by side? The obvious answer is: search for each product, click through to the product's page, and write down the score. But I know we can do better than that. So once again I busted out my scraping technique, got the job done, and then realized that I should share how to do this with all of you. Which means that today, I'm going to show you how to use it to achieve three different things: I'm going to scrape a publisher site, G2; I'm going to build a tiny but powerful new SEO tool; and I'm going to blend together a bunch of SEO data. Now, fair warning: this requires writing a bit of code.
So if you're not already super confident with JavaScript, there are three paths you can take. You can absolutely just use the code that I've created, and hopefully you'll find some of that useful on its own; I'll share a link in a little bit for you to do all of this without writing a single line. Or you can find someone who's already comfortable writing code, share this presentation with them, and explain what the two of you want to achieve together. Or, the option that I would recommend: learn a little bit of JavaScript. Don't worry, that's not as tough as it sounds. I'm about to share a link to a fantastic set of resources for learning JavaScript fundamentals that MDN has put together.

While I'll be talking about JavaScript today, I'll also be using a lot of jQuery, which is one of those clever little libraries that does a bunch of useful things. Most useful for us is the fact that it basically has a bunch of prewritten functions to make our lives easier. I'm also going to be sharing some bonus content, which is Rob's six tips for becoming an effective JavaScript hacker, including my sixth tip: if it works, it works. We're writing code to help you get your job done as efficiently as possible. It doesn't need to look beautiful or be perfectly optimized. We're just trying to make things that work. Links to all of these are available right now on my Twitter feed, or in the link bundle that I'll share later.

Okay. To explain how I want to scrape websites differently, let's have a look at how we usually do it. Imagine an internet full of websites, and here's the one that we want to scrape. We might have our laptop sat on our desk at home, and we can scrape from there, requesting all the pages and all the content we need. This is how a crawler like Screaming Frog works: you run it from your own machine. In fact, that's not too different from how a browser like Chrome works; both are just software on your desktop, requesting pages. Sometimes we'll crawl not from our desktop but from a server somewhere in the cloud. Remember, of course, that the cloud isn't anything special; it's just other people's computers sat somewhere else, doing the same work for you. So when you use Moz Pro to crawl your site, you're really just running software on our machines. Or when you use STAT for rank tracking, again, you're just using our machines to scrape Google's search result pages.

But if we're trying to extract data from a site, particularly at scale, we do run into a couple of problems. One might be authentication. Lots of SaaS tools, for instance, will let you log in to look at your data, but you can't export it and they don't have an API to access it programmatically. You can only get that data if you're a human visiting the website via a browser. Secondly, there's the issue of rate limiting, or being banned from scraping. Google is a classic example of this: if you run a bot to scrape Google, they'll pretty quickly pick up on it and block you with CAPTCHAs, or flag your IP address so that you can't visit any longer.

So how can we do all this differently? Well, in contrast to using other software or servers, I think of this technique as having sites scrape themselves. The requests we make for the data on the site won't be coming from anywhere but code running on that site itself. We're going to trick the site into thinking that it has new JavaScript embedded and running on it, JavaScript that it asked to be there. And we're going to do all of this from Chrome, on your desktop.
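Before we dive in, a quick aside on those jQuery helpers I mentioned. Here's a tiny sketch of my own (not from the slides) of what they buy you over plain DOM code; both snippets collect the text of every h2 on a page:

```javascript
// Plain DOM: collect the text of every h2 on the page.
var headings = [];
document.querySelectorAll('h2').forEach(function (el) {
  headings.push(el.textContent);
});

// The same job with jQuery: select, iterate, and extract in one chained expression.
var headingsJq = $('h2').map(function () {
  return $(this).text();
}).get();

console.log(headingsJq);
```

That select-then-iterate pattern is exactly what we'll lean on for the rest of this talk.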
This means that authentication won't be a problem: you're already logged into the site. And it means that while rate limiting will still be a concern, it's much easier to avoid, because your browser has all the hallmarks of being a real browser run by a real user on a real computer. I'll reference an old SEO quote here: I'm not doing anything black hat. I'm just using computers to help me do the things that I would have been doing anyway, but much, much faster.

It starts with the premise that you want to be able to insert your code into any website, and I'll show you three ways that can happen. One simple option to execute JavaScript as if it's on the site is to use the console. You're almost certainly familiar with Chrome DevTools already: hit F12 and there it is. It lets you inspect all the elements on the page, it lets you look at all the network requests the page is making, and it has this Console tab where you can interact with the page's JavaScript. Just to prove that, here's an example on colorofchange.org. This website already has jQuery installed, so down here in the console I've written a line of JavaScript to find out about the heading of the page. I've used jQuery's amazingly simple way of selecting elements: in this case I put the dollar sign, and then in parentheses I put "h1", and that selected the h1 from the page, which let me access its contents, which are now printed down there in the console as well. Then I wrote a different line of code to alter the HTML of the page: I've put a yellow dashed box around that same element, the h1, to show that it's working. The point here is that you can write or paste code directly into the console, and it will run on the site.

But if you want to run some code more than just once, you're going to need something easier than pasting it into the console every time. This is where I love using JavaScript bookmarklets. I just looked this up, and I actually published my first blog post about JavaScript bookmarklets 11 years ago today. I'm still a fan. Basically, you can create bookmarklets that don't point to a URL, but instead run a bit of code. You do this by starting the URL field of the bookmark with "javascript:", followed by the code you want to run. It means you can take little code snippets, store them in a bookmark, and run them any time you want at the click of a button. It's great for bits of code that are never going to change again, though it can be tricky to share code this way.

So my third option, while slightly more involved, is the technique that I'm going to use today. Again, we're going to create a JavaScript bookmarklet, but instead of putting all our code in there, the bookmarklet is only going to do one thing: it's going to create a new script tag on the page that imports a JavaScript file from elsewhere. You can store that file on your own server, on Dropbox, on GitHub, or anywhere else, and then have a bookmarklet like this that adds it to the page as if the site itself had put it there. Your code now has all the permissions that native code written by the site's developers would have. So in the rest of this presentation, I'll focus on the code we write in that JavaScript file, which you inject into the site using a bookmarklet and run from there.
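To make that concrete, here's a minimal sketch of the injection bookmarklet. The script URL is a placeholder: point it at wherever you host your own file, and collapse the whole thing onto one line when you save it in the bookmark's URL field.

```javascript
javascript:(function () {
  // Create a <script> tag that pulls in our external file, exactly as if the
  // site's own developers had added it. The URL below is hypothetical.
  var s = document.createElement('script');
  s.src = 'https://example.com/my-scraper.js';
  document.body.appendChild(s);
})();
```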
I have three examples, and I'll start with that challenge that was posed to me earlier: how could we compare product stats from g2.com? Well, I suggest starting with pseudocode, or rather just writing out the process you would go through if you were doing this manually. So we might say: we have a list of products to compare. For each product, you'd search for it on the front page of the site. Then you'd look at the search results page and find the top result. Then you'd visit that page, which is all about the product, and you'd grab the rating from the top section. And while you're there, you can grab the specific ratings that G2 has about ease of use, quality of support, and ease of setup from further down the page as well. Then you'd put them all into some kind of table as an output, maybe in Excel or Google Sheets or whatever you want.

So let's turn that into code. One thing I noticed is that G2 already has jQuery installed, so all the fancy features that I like using are ready to go. Now, here's the results page when you search for Zendesk Chat. I'll zoom in, and when we look at the URL we realize it's a really straightforward address with a parameter called "query". So we know that if we replace the value of that query, we'll get to the G2 results page for any product name search. And we want to look at the top result here. That's when we fire up Chrome DevTools, inspect these elements, and find that each product listing has a div with a class called "product-listing". In each of those is a div for the product name, a div for the star rating, and a link to the specific page about the product itself. So the first thing we're going to do is crawl that search page, find the top result, and grab those bits of information.

Here's some code that looks a bit intimidating at first glance, but I'll show you three cool features we can use to write powerful code really fast. First, I just created a variable up at the top called searchQuery, which I set to "Zendesk"; this was just going to be useful while I tested things. Then I used a jQuery function that we're going to rely on again and again. It's called get, and it literally sends a GET request to any page: you give it a URL you want to fetch, and it gives you back a variable containing the entire HTML of the target page. You also pass it a function that processes that HTML however you want. We're not navigating to that page, we're not changing anything visible in the browser; we're just requesting the page in the background and sticking it in a variable.

Now, let's take a moment to appreciate the power of running this as code on the site. The website itself is making the request, so we have no cross-origin issues, we have no authentication challenges, and all the cookies that the site has set in your browser are available and working. So you get that HTML back in the blink of an eye, ready for you to parse and investigate however you want. Then we can do things like use the .find() function. Similar to earlier, this lets us use the CSS selectors we already know and love to find things in the page that was returned. Here I've started off by just finding the h1 of that page and writing it to the console. But then I use .find() again to find the very first listing on the results page, and then use it again within that element to collect the URL, the product name, and the rating of that first listing. You can see here that once we have an element we care about, we can use the .text() function to tell us what the text of each of those elements is. So let's grab this code and run it for real in the console.
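If you want to study it without squinting at the slide, here's a rough reconstruction of that snippet. Treat it as a sketch, not the exact published code: the search path and the product-name and star-rating class names are approximations of G2's markup at the time.

```javascript
var searchQuery = 'Zendesk';  // handy while testing

// Ask the site for its own results page; the request comes from G2 itself.
$.get('/search?query=' + encodeURIComponent(searchQuery), function (html) {
  // Wrap the returned HTML in a detached element so .find() works on it.
  var page = $('<div>').html(html);
  console.log(page.find('h1').text());

  // Grab the top listing and pull out the bits we care about.
  var first = page.find('.product-listing').first();
  var product = {
    url: first.find('a').attr('href'),
    name: first.find('.product-name').text().trim(),  // class name assumed
    rating: first.find('.star-rating').text().trim()  // class name assumed
  };
  console.log(product);
});
```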
There's my code. I pasted it into the DevTools console while we're anywhere on G2, and it outputs a couple of things. Let's zoom in. The first line is the h1 of the results page that I asked it to output. The second is the object we created, and it's correctly pulled in the URL, the name, and the rating of the product. We can do a similar iteration now: we can take the URL of the actual product page, do another get request, and then look for those three metrics of how users reviewed the product and extract the numbers.

Let's look at that code. Here it is, very similar. We start with a get request, but this time we're requesting the product page's URL that we found earlier. Then we look for those metrics and find they're not surrounded by any clean or semantic markup, so we have to hack our way around that. We use the :contains() pseudo-selector to look for an appropriate div that mentions "ease of use" or whatever else, we traverse up and down the DOM tree a little to find the element we really want, and we grab its score, again using the .text() function. I pasted that code into the console; let's see what it gives me. Excellent: it's created this object called ratingDetails that has those three scores.

So we can get data about a single product, but how do we output it? Well, just like reading data from the page we're on, JavaScript lets us change the content of the page, and jQuery makes it even easier. .text() can change the text of an element, .html() can change the entire HTML of an element, and .append() will add HTML to the end of an element. We're going to combine these for our purposes. Before we do any crawling, we're actually going to use the .html() function to overwrite the entire contents of the page. By running this code, we're replacing the contents of the page's html element, which is absolutely everything, with our own code. This gives us a text area, a Go button, and a table with some headings. When we run this, here's what the front page of G2 turns into. And I know what you're thinking. First you're saying, wow, that's really powerful. But also: my god, that's ugly. And you're right on both counts. This is ugly. It's raw, unstyled HTML. But this is for you, not for anyone else to see, so does it matter? This is going to help you get your job done fast and efficiently, and who cares what it looks like. Now, I'll also contradict myself here just a little, because later on I will share something with you that is a little better looking and a little more usable.

For now, once we have a table in there, we can just put data into it. What I've done here is this: you've got that input text area where you can type a load of product names. I've taken the contents of the text area using the .val() function and split it into an array, separating on each new line. Now we have an array of the product names you entered, and I use the .each() function from jQuery, which makes it easy to loop over that array one item at a time. For each term that was entered, we use the append function to add a new row to the table. Then we wrap all of this up in a click handler. This code selects the Go button that we added and says: when it's clicked, here's what to do next, which in our case is to fill that table up with the names that were entered and then start crawling the site. After that populate-table function runs, here's what we get: our ugly-ass table is now an ugly-ass table with five rows in it.
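Here's roughly what that page-replacement and row-population step looks like as code. The element IDs are placeholder names of mine, and the markup is deliberately bare-bones:

```javascript
// Overwrite absolutely everything on the page with our own bare-bones UI.
$('html').html(
  '<body>' +
  '<textarea id="terms" rows="6" cols="40"></textarea>' +
  '<button id="go">Go</button>' +
  '<table id="results"><tr><th>Product</th><th>Rating</th>' +
  '<th>Ease of use</th><th>Quality of support</th><th>Ease of setup</th></tr></table>' +
  '</body>'
);

// When the Go button is clicked: fill the table, then start crawling.
$('#go').on('click', function () {
  var terms = $('#terms').val().split('\n');  // one product name per line

  $.each(terms, function (i, term) {
    if (!term.trim()) { return; }  // skip blank lines
    // Add a placeholder row per product; the scraping loop fills it in later.
    $('#results').append(
      '<tr data-term="' + term + '"><td>' + term + '</td>' +
      '<td></td><td></td><td></td><td></td></tr>'
    );
  });
});
```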
After we've done that, we kick off another .each() loop. This time we'll loop over a set of elements from the page: we'll select the table's rows, and for each row we'll run some code, which is basically the code I showed you earlier that did the scraping. So now we have all of this wrapped up in a JavaScript file, and we have a bookmarklet that injects the file into the page.

So here we are. I tried doing a screen recording of this for you, but it all happened so fast that I've screenshotted things instead to slow it down. We're on the front page of G2. We hit our bookmarklet and the page gets replaced with this. And like I said earlier, I've actually made this look a little bit better than unstyled HTML: I found a tiny CSS framework, and with just one extra line in the HTML we can include an external CSS file that makes the buttons and the tables look a little nicer. Whether you spend time doing that kind of thing for yourself is entirely up to you. It's also important to note that, as far as the browser is concerned, we're still on the same website, so we still have permission to make requests from this site. You type some product names into the box, you hit the Go button, and quickly the code populates the table and starts scraping. It goes to the search results page for each product name and grabs the URL of the product page, then it goes to that product page URL and grabs the scores. Ten page requests happen, all the data is collected, and one row at a time it's written back into our table. This takes almost no time at all, and here we are. Now my friend who wanted to compare the scores for a bunch of different software options can see this table with all of the details. Have a different list of products to compare? Click the bookmarklet and run it again.

I've published this code, and there are comments all over it for you. I can't imagine many of you will have exactly this same use case and need to scrape data out of this site, but you can use it as a template for running the same process over almost any other site that you might want to scrape. The code and the bookmarklets are available at ousbey.com/mozcon.
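As a starting template, the heart of that scraping loop looks something like the sketch below. The selectors and the DOM traversal are approximations of G2's markup; in practice, use whatever you find in DevTools on the site you're scraping.

```javascript
// For each placeholder row: search G2, follow the top result, write scores back.
$('#results tr[data-term]').each(function () {
  var row = $(this);
  var term = row.attr('data-term');

  // Request 1: the search results page for this product name.
  $.get('/search?query=' + encodeURIComponent(term), function (html) {
    var listing = $('<div>').html(html).find('.product-listing').first();
    var productUrl = listing.find('a').attr('href');
    row.find('td').eq(1).text(listing.find('.star-rating').text().trim());

    // Request 2: the product page itself, for the three detailed scores.
    $.get(productUrl, function (productHtml) {
      var productPage = $('<div>').html(productHtml);
      $.each(['Ease of use', 'Quality of support', 'Ease of setup'], function (i, label) {
        // :contains() finds the label; the score sits in a nearby element,
        // so the exact traversal depends on the markup you find.
        var score = productPage.find('div:contains("' + label + '")')
                               .last().next().text().trim();
        row.find('td').eq(i + 2).text(score);
      });
    });
  });
});
```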
Now, I said I had three examples, so what else can we scrape? That leads us on to chapter two, which I call: when the scraper becomes the scraped. And who's the biggest scraper of them all? Yeah, you're damn right: let's jump straight to scraping Google. But not just for the sake of it; let's solve a real SEO problem. We've all used Moz Pro, or a site like it, to crawl our site. Here's my Moz Pro account for my old company, Distilled, and from the site crawl I can see that Distilled has pages under folders like /events, /blog, /resources, and /training. But my question is always: how many pages does Google know about in each of those folders? How many pages of each type have they indexed? A site crawl cannot tell you this, and I'd love to know it for some competitor sites as well. How are they structured, and how does Google understand that structure?

Here's the process I've been through in the past to answer this question. First, you do a site: search. Here I'm doing a search for site:moz.com to find all the indexed pages that Google has from Moz. Then we look through that list of results to find the unique folders, and we write those down: we've got moz.com/learn, /blog, /community. How many pages does Google have indexed in each of those folders? We do another site: search, site:moz.com/learn, and there's the number: 447 pages of amazing learning content. We do it for each of the folders we found, and eventually we have a list like this. It takes a while, but we get there. But we want to find more of the folders on the site, so we do another crafty little search: a site:moz.com search where we now exclude all the folders we already knew about, -site:moz.com/learn -site:moz.com/community. This gives us a search result with a whole bunch more folders that we can add to our list, and then we start the process for them again. This is the kind of thing that's arguably right at the boundary: time-consuming enough that you're not going to do it by hand for a load of different sites, but potentially very useful if you can just hit one button and get it all done for you.

So what would the code for this look like? Well, first we'll replace the Google page, again with our own code. When we go to Google and run our bookmarklet, this is what we get: an input box, the bare bones of a table, and a couple of buttons. Again, I wanted this to be a little less ugly, so it's one extra line of code to add a CSS library. What does the actual code do? This might sound familiar the second time around. First, we get the text of the domain name you typed into the box at the top, and we add "site:" to the front of it to make it into a search query. Then we use jQuery's get function to go and request that results page on Google, which is /search?q=query. And again, we're on Google, so as far as the browser knows this is Google doing its own Ajax request to other parts of the site; this happens all the time, and there's nothing out of the ordinary here. Once it gets the SERP back, it selects all the links on the page that point to the site, loops through them with .each(), and extracts the top-level folder. If it hasn't seen that folder before, it adds it to the table on the page.
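Stitched together, that folder-discovery step looks roughly like this. The input and table IDs are placeholder names of mine, and the way result links are picked out is an approximation of Google's SERP markup at the time:

```javascript
var domain = $('#domain-input').val().trim();  // e.g. "moz.com"
var seenFolders = {};

// We're on google.com, so /search is a same-origin request.
$.get('/search?q=' + encodeURIComponent('site:' + domain), function (html) {
  var serp = $('<div>').html(html);

  // Look at every result link that points at the target domain.
  serp.find('a[href*="' + domain + '"]').each(function () {
    var href = $(this).attr('href') || '';
    // Pull out the top-level folder: https://moz.com/learn/... -> "learn"
    var match = href.match(new RegExp(domain + '/([^/?#"]+)'));
    if (match && !seenFolders[match[1]]) {
      seenFolders[match[1]] = true;
      $('#folders-table').append(
        '<tr data-folder="' + match[1] + '"><td>' + match[1] + '</td><td></td></tr>'
      );
    }
  });
});
```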
So, for example, you enter moz.com in that box, hit the button, and boom: in the blink of an eye, here's what you get, a list of some of the top-level folders that Google has indexed pages from. But now we need to find out how many indexed pages it has in each. So we loop over the list of folders, and for each one we get another Google results page, this time for site:domain.com/folder. That's this get request here, which grabs this page, and as you'll see at the top, it has that line saying "About 7,400 results". We inspect the HTML for that and see that it's helpfully wrapped up in a div called result-stats. So with the result of that get request, we can use .find() to find the result-stats div, we can use .text() to grab the text from it, and then we use plain old JavaScript to extract the number from that text. We write that back into the appropriate row of our table, that folder's done, and we move on to the next one until they're all done.

Now, it'll be no surprise to you that running dozens of site: queries one after the other is likely to trigger some bot sensors at Google's end, and honestly, your first line of defense there is just to slow down a bit. So instead of running all of these get-folder-count functions at the same time, I use JavaScript's setTimeout function to run them one after the other, a second apart. This might be enough to prevent problems, but even so, you might sometimes see your code suddenly stop working, and that's because Google has you pegged as a non-human. But we don't have to stress. We're not running a load of cloud servers and churning through IP addresses or whatever. We are a human; our browser has all the fingerprints of being a regular user, so we can pass a CAPTCHA, no problem. If you get blocked, you just open up another tab, go to Google, click the reCAPTCHA button to prove you're human, and then carry on merrily collecting the data that you want. Remember: it's not black hat, it's just faster than the manual way of doing what you were going to do anyway.

Okay, so let's see how the whole thing looks when we put it together. Here we are on Google. We click the bookmarklet, and great, it loads the form and the table. We type our domain into the box and hit the blue button. It immediately grabs the first set of folders from Google, and then, once each second, it does another search to see how many pages Google has indexed from each folder. When it's complete, we hit the button again: here's another set of folders, and the process continues, and another. So what have we done here? We've built a very simple but very useful tool that runs entirely in your browser and collects data from Google. There are obviously some other really time-consuming ways of getting the same data, but with just a few lines of code we now have a brand new SEO tool that solves a particular question I've had for a long time. Again, the tool is available here if you'd like to use it. But more than that, I'd suggest you grab the code, take a look at it, and see how you can make it bend to your will for doing other things you might want to do.
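For reference, the counting-and-pacing part might look like this sketch, carrying on from the folder-discovery snippet above. The result-stats selector is whatever Google's markup called it at the time, so check DevTools before relying on it:

```javascript
// Fetch the "About 7,400 results" line for one folder and write the number back.
function getFolderCount(domain, folder, row) {
  var query = 'site:' + domain + '/' + folder;
  $.get('/search?q=' + encodeURIComponent(query), function (html) {
    var statsText = $('<div>').html(html).find('#result-stats').text();
    var match = statsText.replace(/,/g, '').match(/\d+/);  // plain old JavaScript
    row.find('td').eq(1).text(match ? match[0] : '?');
  });
}

// Run the lookups one per second rather than all at once, to stay under the radar.
$('#folders-table tr[data-folder]').each(function (i) {
  var row = $(this);
  setTimeout(function () {
    getFolderCount(domain, row.attr('data-folder'), row);
  }, i * 1000);
});
```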
Okay, one final example, to push home some ideas of things we can do here. What about some very specific SEO data? I got a Slack message from my colleague Britney Muller the other day. She asked me to look at the web.dev site, and particularly their implementation of Google Lighthouse. She said the Lighthouse API was the worst, and could I please scrape the site? I said I'd take a look, and exactly two hours and one minute later, I messaged her back to say I'd come up with something fun.

So let's dive in. This is the app that Britney linked me to. It's a site made by Google to support web developers, and this page is a tool that runs a Lighthouse report on any URL. You put in the URL of any web page, and it crawls it, renders the page, and discovers a lot of information about it. It takes about 15 to 20 seconds to run, and then it comes back to you with all of these data points. This is the report for the MozCon marketing page. It has these top-level metrics for Performance, Accessibility, Best Practices, and SEO, and these are all just roll-ups of dozens of even more detailed metrics. For example, under Performance we see the MozCon page's scores for things like First Contentful Paint, Time to Interactive, blocking time, and loads more. But let's just focus on those top-line metrics.

I thought at first that I'd want to crawl this results page to get all the metrics, but I realized that what was actually happening when you run a report is that the page itself was doing an Ajax request to get the data from elsewhere. So we fire up the Network tab of Chrome DevTools, we take a look at all the network traffic from this page, and lo and behold, there's a request that takes 15 to 20 seconds to complete. That must be it fetching the report data. Let's dig into that request and see what's going on. Well, we see three interesting things from the headers on that request. The first is that it's actually making this request to another domain: it's this appspot.com app that's doing all the heavy lifting here. Okay, so we know we're probably going to end up making a cross-domain request; this might cause some problems, and there might be cross-site tokens and stuff at play that could complicate matters. Then we see that this time it's a POST request rather than a GET; okay, we just have to remember that. And then we see down here that it's basically just sending the requested URL as the POST parameter, and nothing else of interest. That means we don't have to worry about any of those cross-site token issues, and it potentially means we could request this Lighthouse data from any site we want, not just from the web.dev site.

So we know that to make this request, we just POST to that URL and send it the URL of the page we want information about. Let's take a look at what the service sends back to us. Well, this is where I think I actually yelped out loud: there's no HTML being sent back, it's just a JSON object. Perfectly structured data, and right there under the "lhr" key is everything we want. So I thought: we have this easy-to-parse Lighthouse report that I think I can request from anywhere. What can we do with that? What data set would be improved by having performance and accessibility data blended into it? What about some of Moz's tools?

Hopefully most of you have at least one Moz Pro campaign up and running; if you don't, you can bounce your way over to moz.com/pro for a free trial. Get into your Moz Pro campaign, then go down to the Analyze a Keyword feature and select a keyword from your campaign. Here I'm analyzing the "SEO agency" keyword. When you scroll down, you'll get to the SERP Analysis table. Now, SERP analysis has typically focused on two huge, traditional types of ranking factors: the link equity of a page, represented by Domain Authority and Page Authority, and the keyword optimization of the page. Moz has all of those next to each other in our SERP Analysis table, right here. But in addition to those factors, Google has said that they don't want to send people to sites that offer a poor user experience,
which could mean that things like slow sites, or sites with poor accessibility, will suffer in the SERPs. So let's try to include those Lighthouse metrics in this very report. I wrote a file and created a bookmarklet that uses nothing but the tech I've already walked through in this deck. If you go to a page like this and click it, here's what happens. First, a bunch of additional columns and cells are added to the table, along with these "Run Lighthouse" links in each row. When you click one of those, it requests the Lighthouse report for the URL in that row. Since it takes around 20 seconds to run the report, I replaced the link with a little loading GIF to let you know that something's working, and then the data starts appearing.

We'll come back here in a second, but let's take a look at the new code I wrote for this feature. Remember, we navigated to a Moz campaign's SERP Analysis page. I started with an array of objects that specified the Lighthouse features I wanted. I ran a loop over it, added a heading and a bunch of cells to the table, and then added those "Run Lighthouse" links. When you click any of those links, we go and request the data from the appspot server. We're using the .ajax() function here; it's a bit more customizable than the get or post functions, and that's helpful when we're working across domains. Then, when the response comes back, I loop over my features once again and go into the JSON response to grab the particular data points I need. It turns out that instead of a 0 to 100 range, they're giving us numbers on a 0 to 1 range, so I just multiply the result by 100, round to the nearest whole number, and slap that data into the appropriate place in the table.
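The cross-domain request at the heart of it might look something like this sketch. The endpoint URL is a placeholder (copy the real one out of the Network tab), and whether the body goes as JSON or form data is something you'd confirm from the request headers you saw there:

```javascript
$.ajax({
  // Placeholder endpoint: use the appspot.com URL you found in the Network tab.
  url: 'https://example.appspot.com/lh/newaudit',
  method: 'POST',
  contentType: 'application/json',
  data: JSON.stringify({ url: 'https://moz.com/mozcon' }),  // the page to audit
  dataType: 'json',
  success: function (response) {
    // The full report lives under the "lhr" key; categories hold the top-line scores.
    var categories = response.lhr.categories;
    $.each(['performance', 'accessibility', 'best-practices', 'seo'], function (i, key) {
      // Scores arrive on a 0-1 scale, so multiply by 100 and round for display.
      var score = Math.round(categories[key].score * 100);
      console.log(key + ': ' + score);
    });
  }
});
```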
Now, let's take a step back and see what we've got. We now have a SERP analysis tool that presents more data in one view than has ever been seen before: link data, keyword data, site performance data, and accessibility, all stacked up next to each other. If you've been looking at the ranking sites for a particular keyword and wondering, "why is this well-optimized website doing so poorly?", maybe you want to run something like this and see if there's a non-traditional ranking factor coming into play. Maybe a site that seems strong, with lots of links and good optimization, is actually really slow or not very accessible, and that might be holding it back.

While this is really useful, I want to make two things very explicit. Firstly, this is not an official Moz product. This is not a feature release. This is not something Moz built, and it isn't something that we're going to provide any support for; in fact, my product engineers and designers are mostly seeing this for the first time right now. This is an example of the kind of powerful thing that you can create in two hours and one minute, just by injecting JavaScript into a site. And secondly, let me make it really clear that I didn't need to work for Moz to do this. I'm not using any secret Moz knowledge about our code or our infrastructure here, and I'm not using any of our APIs. I could have written this last year when I was an SEO consultant, and so could any of you. One thing you can take away here is the realization that if you think of a genius idea to extend your favorite tools or apps, you might not need to submit a feature request. Maybe you can just hack around with the site a bit, inject code, grab data from as many places as you need, and create the features or data sets that are most useful to you in your role right now.

Of course, I also want you to take away a link to this page, which includes the code for this bookmarklet. Like I say, it's not a Moz product and it's not supported, and for all I know web.dev will change their code in the next 24 hours and break the way I've been scraping this data. But for now, if you have a Moz Pro campaign up and running, you should 100% try going to a SERP Analysis page in your campaign and giving this a try.

To wrap up, I want to reiterate the items that you can put in your toolkit. You can run JavaScript on any site as if it were native to the site itself: run it in the DevTools console, store it in a bookmarklet for later use, or use a bookmarklet to import an external JavaScript file that you store wherever you want online. When you're writing JavaScript for this kind of work, there are some jQuery functions you'll want to use a lot: you can use get or post to request other URLs on the same site; you can select elements on the current page, or find them in the page that was returned from a get request; you can iterate through a whole bunch of elements that you find; and you can extract or edit the individual elements of a page, whether they were there to begin with or whether you put them there. And if you're going to start hacking around with JavaScript, don't worry about making it pretty: if it works, it works.

I hope I've inspired you to think about some of the things, and some of the ways, that you might be able to use this for yourself. I'll post more of this kind of thing on Twitter, so follow me there, @RobOusbey, if you'd like more free tools as I create them, and I look forward to you tagging me to show me what you've created. Have a great rest of your MozCon, and I'll see you all soon. Thanks very much.