How about crawl budget? I would love to talk about crawl budget. Yeah, I think it's a term that's thrown around a lot in the industry, and it seems like an abstraction of something that is much more technical and concrete, and I think we can get a lot of clarity through this video.

Hello and welcome to SEO Mythbusting. With me today is Alexis Sanders from Merkle, and you are a senior SEO account manager there. Yeah. Okay, so you work with a lot of different companies and accounts, and see lots of different kinds of work. Yesterday we were discussing fermented food, because that's a hobby of yours. Yeah, definitely. I went on a trip, came back, and my boyfriend had fermented, like, everything. So, lots of kimchi. Oh my god, we have so much kimchi. It's a huge jar. But we're not here today to talk about kimchi. So what are things that your clients and customers are dealing with, and what's the question that you often get, or what's the topic you would like to discuss? I would love to talk about crawl budget today.

So to get started, before we talk about anything else, can we start with what crawl budget is? Kind of recapping some of the stuff that Gary covered in his Google Webmaster blog article. That's a very good point. Actually, the article is a fantastic starting point for this entire journey. So when we are talking Google Search, and indexing, and thus crawling, we have a bit of a, what's it called, trade-off to make, and the trade-off is: we want to get as much information crawled as quickly as possible, but we don't want to overwhelm the server. Definitely. Right.

And would you categorize that part of it as the crawl limit? Yeah, pretty much. Almost like an abstraction under an abstraction. Yeah, that's the crawl rate, basically: how much stress can we put on your server without crashing anything or hurting your server too much. That's one thing. The other thing is, we just have to be reasonable with our resources, right? The web is really, really large, so instead of just crawling everything all the time, we need to make some distinctions there. A news website probably changes quite often, and we probably need to be catching up with it every now and then, whereas a website on the history of kimchi is probably not changing as often. I mean, nothing against kimchi, but I think the history isn't as fast-paced as the world news. Yeah, definitely. So that's the crawl demand: we try to figure out, is this something that we need to crawl more often, or is this something where it's okay if we check up on it every now and then?

And what goes into that decision-making process? So for instance, if you had that kimchi website, or if you had something like archive.org, where it's really just something that's going to be constant for a long period of time and there really aren't going to be any changes — how do you guys determine that? So we basically fingerprint content as we crawl it, right? We see what this page is about — we use that for deduplication later on — but we also look into when it was last changed. You can tell us: things like structured data have possibilities there, you can have date elements somewhere in your page. We more or less keep track of how frequently things are changing, and if we detect that the frequency of change is really low, then we don't have to crawl as much.
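To make that date hint concrete: below is a minimal sketch, not from the episode, of surfacing "when was this last changed" via schema.org dates in JSON-LD. The page data, values, and helper name are hypothetical, and it assumes a Node.js/TypeScript setup.

```typescript
// Hypothetical sketch: emitting schema.org dates as a change-frequency hint.
interface PageMeta {
  url: string;
  headline: string;
  published: Date;
  lastModified: Date;
}

function articleJsonLd(meta: PageMeta): string {
  const data = {
    "@context": "https://schema.org",
    "@type": "Article",
    mainEntityOfPage: meta.url,
    headline: meta.headline,
    datePublished: meta.published.toISOString(),
    // Only bump dateModified on real content changes; as discussed above,
    // dates that don't correspond to actual changes stop being useful hints.
    dateModified: meta.lastModified.toISOString(),
  };
  return `<script type="application/ld+json">${JSON.stringify(data)}</script>`;
}

// Example usage with made-up values:
console.log(
  articleJsonLd({
    url: "https://example.com/history-of-kimchi",
    headline: "A Brief History of Kimchi",
    published: new Date("2018-03-01"),
    lastModified: new Date("2019-06-12"),
  })
);
```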
That has nothing to do with quality. You can have fantastic content that ranks really high and never changes. It's more about: is this information that we need to revisit every now and then, or is this something that we can leave alone for a longer period of time?

Now, do you guys use something like the ETag or the Last-Modified HTTP header, or is it more about taking that fingerprint of the content? So, you can give us various hints, as I said. Structured data dates are one; the ETag is an interesting one as well; the HTTP headers are useful for this; the last modified date in the sitemap. Right? Yeah. And the more honestly you use these, the more useful they are. If you just automatically update the last modified date in the sitemap, and we figure out that it does not correspond to actual changes on the website, or the changes are minimal, then that's not helping us. Not only that — at some point we're just going to say, okay, this is not useful for this particular site. But as much information as we get, we will use to figure out what we can reasonably expect as a change frequency.
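As an illustration of the ETag and Last-Modified hints just mentioned — a minimal sketch using Node's built-in http module; the page content, date, and port are made up, and as noted in the conversation, the crawler may treat these only as hints.

```typescript
import { createServer } from "node:http";
import { createHash } from "node:crypto";

// Hypothetical page content and modification date, for illustration only.
const body = "<html><body>The history of kimchi...</body></html>";
const lastModified = new Date("2019-06-12T00:00:00Z");
const etag = `"${createHash("md5").update(body).digest("hex")}"`;

createServer((req, res) => {
  // A client (or crawler) that already has this version can revalidate
  // cheaply: a 304 confirms nothing changed without resending the body.
  if (req.headers["if-none-match"] === etag) {
    res.writeHead(304);
    res.end();
    return;
  }
  res.writeHead(200, {
    "Content-Type": "text/html",
    "ETag": etag,
    "Last-Modified": lastModified.toUTCString(),
  });
  res.end(body);
}).listen(8080);
```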
Nice. So going backwards a little bit: what size of sites should be worried about crawl budget, about hitting that crawl limit? Large sites. Like, if we were to give a number? Millions of pages. Okay, cool. So if you have under a million pages, crawl budget is really not something that you should be concerned about. Yes — unless you have a really flaky server setup, but then again, crawl budget isn't your problem; your problem is your server setup, right?

No, definitely. So do you typically see that the issue isn't really with the server setup but more with the crawl demand, or do you see it more with the server setup? Depends on what you say the issue is. Normally, crawl budget gets cited many, many times when there actually wasn't an issue to begin with; it was just that the quality of the content was bad, and that's why we didn't index it. So we crawled it, we figured out, oh, this is not worth keeping, and then we didn't index it, and people ask, so is this a crawl budget problem? And we're like, no, no — we have crawled it; it was just not good.

And is that what you would typically see in the Excluded section of the GSC report? These things are really helpful. Sometimes you also see people saying, oh, my crawl budget is really low, and I'm like, yeah, but is that a problem? Do you have lots of changing content? And they're like, actually no, it's a blog that updates once a week. And I'm like, how big of a problem really is that? Sometimes it's also people just barely missing the crawl window, where we don't discover the URL — and if you're not submitting it in sitemaps, then we have to crawl another page to find the link to that URL, and then we're going to crawl that one. So using sitemaps reasonably is a good idea for these things. But again, that's if you're not having millions of pages; otherwise crawl budget really isn't it.

Yeah. So let's say you're working with a major e-commerce site, or something like real estate, where you have hundreds of millions of pages that are a little bit smaller, maybe a little bit similar. Is there anything that those industries can do to make their pages crawled more often, or more appealing to Google? So, as you said — the thing is, being crawled more often is not necessarily helping. It's not giving us a signal for quality, and it doesn't mean, oh, this is fantastic. Having something crawled, and then indexed, and then never changing is okay as far as we're concerned. But for e-commerce — you were saying something very interesting: lots of smaller pages that are very similar to each other. If they are very similar to each other, should they exist is my first question. Or can you extend the content? For instance, if it's product variations, maybe you just describe them in a table inside one page rather than having ten pages for all the variations. Definitely — which colors are available.

Well, it almost brings up the fact that there are so many different issues that fit into crawl budget, whether it's duplication or finding the pages. Yes, and server speed is an issue, right? If you have a server that every now and then just flakes on us, then we don't know: is that because it's a flaky server, or is that because we are about to overwhelm you? You got a flaky server. Yes — like it's just wobbly and falling over every now and then. Then we have to be careful; we have to touch that server very lightly.

So let's say someone is switching their site over to a more powerful server. What is something they should expect to see in their log files if Google is testing their servers to see how much they can handle? They might see an increase in crawl activity and then a small decrease again — basically there's this wave motion that's going to happen. But normally, when they just switch the server and nothing else changes, you might not even see us doing anything; it just continues to go through. Definitely. Unless you change from something that was broken to something that works — then you'll probably see more crawling happening.

Okay, so what about in the situation of migrations? A lot of sites are really hyper-aware about how their site is being crawled during a migration. Are there any tips that you guys have for making sure that your site is being crawled accurately? What you can definitely do is make sure that you're more or less progressively updating your sitemap, saying: this has now changed, this has now changed, and this has now changed, and it's now time to crawl that. And we crawl it, we get a redirect — that's actually a change that is interesting for us — and then you can control a little bit how we discover the change in the migration.
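A minimal sketch of the kind of progressive sitemap updating described above, assuming the sitemaps.org XML format; the record type, helper name, and URLs are hypothetical.

```typescript
// Hypothetical helper: emit sitemap entries whose <lastmod> is bumped only
// when a URL actually changes (e.g. gains its redirect during a migration).
interface UrlRecord {
  loc: string;
  lastChanged: Date; // updated by your CMS/migration tooling, not on a timer
}

function sitemapXml(records: UrlRecord[]): string {
  const entries = records.map((r) => {
    const lastmod = r.lastChanged.toISOString().slice(0, 10);
    return `  <url>\n    <loc>${r.loc}</loc>\n    <lastmod>${lastmod}</lastmod>\n  </url>`;
  });
  return [
    `<?xml version="1.0" encoding="UTF-8"?>`,
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">`,
    ...entries,
    `</urlset>`,
  ].join("\n");
}

// Example usage with a made-up migrated URL:
console.log(
  sitemapXml([
    { loc: "https://example.com/old-page", lastChanged: new Date("2019-07-01") },
  ])
);
```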
But generally speaking, just make sure that both servers are on par and running smoothly, not flaking out every now and then or giving us error codes. Definitely make sure that your redirects are set up correctly as well, and that you're not accidentally blocking important things that we need to understand once we land there. If you have a robots.txt beforehand, and then you change it so that completely different URLs are being blocked, and put that on the other domain, then we're like, what's happening here? What is going on? Yeah, so that's not a great idea. So try to do one thing at a time. Do not say: we're going to change the tech stack, we're going to change the server, we're going to change the URLs, we're going to change the content, and we're going to migrate to a different domain. That's a lot — too much for one project. No, for sure.

Okay, so for crawl budget, what levels of a site does it affect? If you have a CDN, will it affect your site? If you have a subdomain or a domain — at what levels of your infrastructure is it operating? That depends. Generally speaking, we go on the site level, so everything that is on the same domain. Okay. Sometimes subdomains might be counted towards that, sometimes they are not. CDNs are a weird one, where it's kind of counted against your site, but not really, and you should normally not worry about it — I think it's stressing out so many SEOs unnecessarily. What's interesting when it comes to crawl budget, where I would say you should consider this, is especially if you're dealing with user-generated content: you can try to tell us not to index, or not to crawl, certain things that you know are bad. Okay, so it's more the content side of things rather than the technical infrastructure side of things. Yeah, definitely.
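For illustration, here is one way the "tell us not to crawl things you know are bad" idea for user-generated content could look — a hedged sketch serving a robots.txt dynamically from Node; the flagged paths and port are invented for the example.

```typescript
import { createServer } from "node:http";

// Hypothetical: paths that your own quality checks have flagged as bad
// user-generated content, so the crawler doesn't spend budget on them.
const flaggedUgcPaths = ["/ugc/post-12345", "/ugc/post-67890"];

const robotsTxt =
  "User-agent: *\n" +
  // Only list things you know are bad; never block resources
  // (CSS, JS, APIs) that are needed to render legitimate pages.
  flaggedUgcPaths.map((p) => `Disallow: ${p}`).join("\n") +
  "\n";

createServer((req, res) => {
  if (req.url === "/robots.txt") {
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end(robotsTxt);
    return;
  }
  res.writeHead(404);
  res.end();
}).listen(8080);
```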
Okay, so does crawl budget actually affect your crawling phase, your indexing phase, and your rendering phase? Crawl budget does actually touch on rendering, because as we render, we will fetch additional resources, and that comes out of your crawl budget. Because again, the trade-off is still there, right? When we're crawling initially, we don't want to overwhelm your server, and the same goes for when we're rendering: we don't want to overwhelm your server there either. What's the point in killing your e-commerce website just because we are making thousands of requests at the same time to crawl all these resources? So it can happen — especially if you have a large site with millions of URLs and you get your caching wrong — that we actually have to fetch everything over and over and over again, which means we have a lot of simultaneous requests to your server. If your server then slows down significantly, we'll be like, oh, okay, careful — and then resource fetches might fail. So we might be able to fetch and render the first, I don't know, 500,000 pages, but then suddenly the resources seem to fail, because we don't want to make more requests at that point, and then we can't really render those pages. So that would be bad. Yeah.

So what do we do in terms of caching for those resources? Obviously, with pages we can put them in XML sitemaps and all the stuff you mentioned before, with potentially structured data, and then maybe even adding some network headers. But what do you think about those resources that could be really heavy on Google's resources? Absolutely. So the thing is, we're trying to be as aggressive as possible when we're caching sub-resources such as CSS, JavaScript, API calls, all that kind of stuff. If you have API calls that are not GET requests, then we can't cache them, so you want to be very careful when you're doing POST requests or something like that. Some APIs do that by default, and I'm like, hey, that's an interesting one, because we can't cache it, so that's going to consume your crawl budget quicker. The other thing is, you can help us by making sure that URLs optimally never change. What's a really good idea is to have content hashes in your resource names. So instead of saying application.js, you can say application, dot, and then a hash that describes the content — like a SHA? Exactly, like these MD5 hashes — dot js. Then this will never change: when you make an update to your JavaScript, you get a different URL, so we can cache the old one forever. Yeah. And once we get the new HTML, we get the new URL for the JavaScript. So it's almost like versioning. It's like versioning in the URL, exactly. Nice.

So for that first version that you would have, or any particular version, how long is that typically cached for? It depends, but we're trying to cache as aggressively as possible, so this can be a day, this can be a week, this can be a month. And we might ignore caching headers — so if you're telling us this expires tomorrow and we're like, no, it does not, then we might just cache it nonetheless. Yeah. Awesome.
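A minimal sketch of the content-hash versioning just described, as a hypothetical build step; the file paths are invented, and the hash length and cache header are common conventions rather than anything stated in the episode.

```typescript
import { createHash } from "node:crypto";
import { copyFileSync, readFileSync } from "node:fs";

// Hypothetical build step: give a bundle a content-addressed name so its
// URL changes only when its bytes change, and can be cached "forever".
function fingerprint(srcPath: string): string {
  const content = readFileSync(srcPath);
  const hash = createHash("md5").update(content).digest("hex").slice(0, 8);
  const outPath = srcPath.replace(/\.js$/, `.${hash}.js`);
  copyFileSync(srcPath, outPath);
  return outPath;
}

// e.g. "dist/application.3f2a9c1b.js" -- the HTML then references this URL,
// and the file can be served with a long-lived header such as
//   Cache-Control: public, max-age=31536000, immutable
console.log(fingerprint("dist/application.js"));
```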
So do you find that any particular industries complain more about crawl budget than others, or talk more about it? That's true — I think e-commerce and publishers are quite prone to this, because they historically have large sites with lots of pages, and they might actually run into this. We had a really interesting situation that one webmaster presented at a Webmaster Conference in Japan once. It's a website that has lots of user-generated content, and they use machine learning to figure out whether the quality of the content that was produced is good or not, and if it's not good, they put it in noindex and they put it in robots.txt, so that we are not wasting crawl budget on it. And that way they steer it, because they had a crawl budget of something like 200,000 pages per day, but they had like a million coming in, and most of it was spam. So they're like, it's annoying if we just waste our crawl budget for the day on spam that will not get indexed anyway. So that's an interesting one.

So do you believe that there's an overall quality metric that Google uses? For that particular site in general, because they had so much non-high-quality content, did that affect Google's outlook on their content overall? Not really. It was more like: this page is not good, we're not going to put it in the index — that's really the outcome. Or if it's light on content: if someone just literally posts a picture and says, haha, then that's not great content necessarily, especially if the picture is just a repost of some, I don't know, meme or some stock photo. Right, yeah, definitely.

So what are some things that we as webmasters and SEOs and agency people can recommend to our clients, on how to help Googlebot out, and help your rendering system out as well? So, if you know that there are URLs that you don't need to render the page — and there's a very big "be careful" here, because if you block the entire API, and the JavaScript does need the API to render the page, then you don't get the full experience. Exactly — do not block that, because then we might actually not see much content. But if you have something like internal analytics, or some internal tools, or some chat widget that pops up or whatever, just don't let that be crawled, because what's the point? Yeah.

So do you recommend blocking those particular resources in the robots.txt, kind of walling them off entirely? Sure, yes — just be careful that you're not putting a rule in robots.txt that blocks something that turns out to be useful and important later. Yeah, big caution there. Yes. You can also — especially with client-side rendered apps, where you oftentimes have a bunch of API calls going back and forth — proxy that through a facade kind of thing. You basically build a little application, and all it does is take one request for one specific piece of data that you need to render the page, make all the API requests and all the back-end requests that it needs, maybe with a layer of caching as well to make this faster, and then send one response back. That way you have one request, one response cycle. It may sound a bit goofy, but is this something like GraphQL? Like GraphQL, exactly, yes. But with GraphQL, again, be careful to not use the POST mode — make sure that it uses GET requests, not POST requests.
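To sketch the facade pattern just described — one cacheable GET in, one response out, with fan-out hidden behind it. Everything here (the endpoint, the stand-in back-end calls, the cache lifetime) is hypothetical, not something named in the episode.

```typescript
import { createServer } from "node:http";

// Hypothetical facade: one GET request in, one response out. Behind the
// scenes it fans out to whatever back-end calls the page needs, with a
// tiny in-memory cache so repeated crawler fetches stay cheap.
const cache = new Map<string, { body: string; expires: number }>();

async function fetchPageData(id: string): Promise<string> {
  // Stand-ins for real back-end/API calls (product, price, reviews...).
  const [product, price] = await Promise.all([
    Promise.resolve({ id, name: "Demo product" }),
    Promise.resolve({ id, amount: 9.99 }),
  ]);
  return JSON.stringify({ product, price });
}

createServer(async (req, res) => {
  const url = new URL(req.url ?? "/", "http://localhost");
  const id = url.searchParams.get("id") ?? "unknown";
  // GET, not POST: cacheable for Google and for this little cache alike.
  const hit = cache.get(id);
  const body =
    hit && hit.expires > Date.now() ? hit.body : await fetchPageData(id);
  cache.set(id, { body, expires: Date.now() + 60_000 });
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(body);
}).listen(8080);
```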
But yeah, there's a bunch of stuff you can do: basically, cut off the bits that you don't want crawled, and give us information on the frequency of change in sitemaps and so on, so that we can get a better grasp of where we should be spending our time. Definitely.

And do you see any pitfalls that people fall into with crawl budget quite often? Well, definitely the robots.txt one — that happens all the time, that they block something that turns out to be an important resource. Like, "it does not need to load the CSS," and we're like, well, now we don't know what the page looks like, but we need the CSS to lay it out. Exactly. People sometimes also fall for it when they do A/B testing: something gets noindexed when they don't want it to, and then they waste crawl budget on something that doesn't end up in the index. Like, all sorts of things can go wrong. My favorite is when people are dealing with servers that are not configured correctly, and they just give you a 500-something, and we're like, okay, if this happens, then we're not crawling as much anymore, because we don't want to overwhelm your server. And people ask, why is Google not crawling this? And we're like, well, because your server constantly tells us that it's overwhelmed. Yeah.

So let's say you have a very large site, and you have these mega-powerful servers with gigabytes and gigabytes of data. Is there any way that we can tell Google, hey, you can crawl us more — like, we expect you to crawl us more? Right — you can't, really. We are detecting that. What you can do is limit us, but you can't really say, more please, more please. Okay. The crawl scheduler is relatively clever when it comes to these kinds of things, and normally, if we detect that there's lots of good content out there on this site, and the sitemap is full of URLs, then we'll try to grab as much as we can — and as long as your server doesn't tell us not to, we'll continue doing that. So eventually the crawl budget might ramp up to what you would expect to see. Oh, lovely.

So just keep creating fresh, great-quality content — is that really what it comes down to? Generally, that's the answer to pretty much everything: good-quality content, and if it's fresh, also great. Awesome. Alexis, thank you so much for being with me and having this fantastic conversation. I think there were a bunch of nuggets in there, and it's great having you here. So thank you so much. Thanks for watching — I hope you enjoyed it, and see you soon again.

Thanks so much, everyone. Next episode we have Eric with us, and we're going to be talking about page speed: how it works in ranking, what to look out for, what not to do, and how to not get it wrong. It's such a great topic, because so many people do get it wrong, I'm afraid. So don't miss out on the next episode!