Welcome, everyone, to the next episode of the Search Off the Record podcast. Our plan is to talk a bit about what's happening at Google Search, how things work behind the scenes, and who knows, maybe have some fun along the way. My name is John Mueller. I am a Search Advocate on the Search Relations team here at Google in Switzerland. And I'm joined here by Martin and Gary, who are also on the Search Relations team. But that's not all. We have Lizzie joining us here today. Whee, Lizzie. Do you want to introduce yourself briefly? Sure. My name's Lizzie, and I'm also on the team. And I work with John, Gary, and Martin. I'm a technical writer, which means that I help people write stuff down. And I work on the Search documentation.

So what does a workday look like when you're not writing technical documentation? Writing a lot of emails and responding to emails and meetings. Meetings. Like a normal person. Like a normal person. Well, I try to avoid meetings as much as I can. But I think I also saw you at events, right? We were at the PE Summit together a couple of years back, for instance, and you were at the Webmaster Conference in Zurich last year. So that's happening as well, right? Yeah. Sometimes I go to those. And not just go to them, because I remember that you also led a session at the Virtual Webmaster Unconference. Yes. Well, I was there. Led the session. I mean, you were a facilitator, weren't you? Yes. I was there, and I facilitated the session. Yeah, that's true. That's cool.

Will you make an appearance in our virtual Webmaster Conference that's coming up? Maybe. When will that happen? What's that? Well, OK. So here comes the confession. I know that I said it's going to happen in December, but it's not happening in December anymore. We are probably doing it in January or February next year. So yeah, that's the thing. But you would love it. Really? Yeah, I know it takes longer. But it takes longer because we want to run it across multiple time zones, and we want to have community speakers as well. So it'll be a little larger, but really, really cool. And I might need your help. What are you going to do at the conference, Martin? Is it like the same thing as the unconference again? Or tell me more. OK, so it's not the same as the unconference. The unconference is a different format, which was pretty cool. And Lizzie can probably testify to it being very different from normal events, right?

It felt different. Yeah, it was different. But it was nice. It was a nice change, because we haven't been to an event, or I haven't been to an event, since this whole thing started, I guess, since January. Or when was our last event in person? Yeah, I think it was just January, the last one. I think I was at a conference in February. Yeah, I was nervous about it. I wasn't sure how a virtual event would go, but it ended up being completely different from what I thought, and a lot less stressful, because it wasn't like a presentation you had to prep. Which is a big reason why I don't like leading a session. But you did, and I hear that the session was great. And did you get the feedback that you needed? Yeah, I thought it was pretty helpful, and I wish that it was longer and more. OK, so the good news is that we will do virtual unconference events next year as well. But this one is not an unconference. This one is actually one where you have sessions where people are preparing presentations and giving presentations.
But we also want to have live Q&A, and we want to encourage people to also have some sort of interactive session. We'll have to see what these interactive sessions would look like. But it's basically like a larger event. I want to get Googlers from multiple time zones in as well. It's probably stretching across two or three days. And yeah, it's open for everyone. The Virtual Webmaster Unconference was limited in the number of people who could attend, because we needed to make sure that the discussions didn't get too big and unproductive. But this time, everyone can join. And it's a virtual event. And if I can, I would like to see a session from maybe John. Or if you want to MC, that's also an option, just saying. And maybe Lizzie has something that she wants to contribute. And I know that I want Gary to do his Life of a Query, if possible. No.

Is this like the public pressuring place? It's like, OK. I'm trying to basically get a public commitment from y'all. But Gary put a stop to it. Hey, this is off the record, right? You can't get us to commit on the record. Damn it, you got me there. Oh. Oh, god. This was great. That was well done. Well done. We should promote Lizzie. Definitely. This was fantastic. Public relations work, A plus. You got him. Next level stuff. Oh, this was amazing. So Gary, why can't I have a Life of a Query session from you? Man, I really wish I could do it, but I really don't want to. Wait, that's all you have to say? You can just say, I really don't want to, and we'll leave you be? Don't teach her that. Why are you teaching her that? No. No, Lizzie. I will never step down just because someone tells me that they, oh, come on. OK, I love this episode already. Because it appears that I can team up with Lizzie against you, and this is just amazing. This is the worst day of my life. What is happening? Finally, I'm not alone. We are two against the one unicorn. And this is amazing. Help.

So Gary, why don't you tell us a bit about Life of a Query before the conference? Yeah, basically do it now, and then you don't have to record it again. Come on, do it. That's because the Life of a Query class, that's two hours long, and we don't have two hours. Then, OK, so we had a bit of crawling. What's next? What's up next in the queue, basically? Right, so we were talking about indexing, actually, not crawling. Somehow. Right, true. No, we had crawling and indexing. We had both, right? I think we silently skipped over crawling, which is really weird. But we can cover that at one point where perhaps there's something to announce around Googlebot, or something interesting happens with Googlebot, and then we can cover how Googlebot actually works. So I don't think that we lost all that much.

But we were talking about indexing and rendering, and we covered quite a bit about indexing. Namely, we talked about content conversion. We talked about collapsing, basically error page detection. You talked about rendering a little bit. We collected signals, and now we ended up with the next step, which is actually canonicalization and dupe detection. Wow, that's a big word. Isn't that the same, dupe detection and canonicalization, kind of? Well, it's not, right? Because first, you have to detect the dupes, basically cluster them together, saying that all of these pages are dupes of each other, and then you have to basically find a leader page for all of them. Right. And that is canonicalization. So you have deduplication, which is the overall term.
But within that, you have cluster building, like dupe cluster building, and canonicalization. Right. Got it. So for dupe detection, what we do is, well, we try to detect dupes. And how we do that is perhaps how most people at other search engines do it, which is basically reducing the content into a hash or checksum and then comparing the checksums. And that's because it's much easier to do that than comparing perhaps 3,000 words, which is the minimum to rank well in any search engine. Wait, wait. 3,000? What? Oh, I said that. Gary. We have some documentation on our developers site. That documentation, that's not 3,000 words. Should we add some fluff? Is it ranking well? No. See? It's always ranking well. Come on. I think it's doing OK. Yeah, I'm basically trolling. So Lizzie, do you want to reconsider which side you're on? Yeah, he's throwing the documentation under the bus. What the heck, Gary? I thought we were on the same team. We are. Nope. We totally are. Gary is on his own team. Yeah, that's true. OK, so 3,000 words. Is that like a real thing, or did you just make that up? No, I just made it up. That's bullcrap.

And so we are reducing the content into a checksum. And we do that because we don't want to scan the whole text, because it just doesn't make sense, essentially. It takes more resources, and the result would be pretty much the same. So we calculate multiple kinds of checksums about the textual content of the page, and then we compare the checksums. Does that catch near duplicates, or only exact duplicates? Good question. It can catch both. It can also catch near duplicates. We have several algorithms that, for example, try to detect and then remove the boilerplate from the pages. So for example, we exclude the navigation from the checksum calculation. We remove the footer as well, and then we are left with what we call the centerpiece, which is the central content of the page. Kind of like the meat of the page. That's where it's at. Considering you are vegetarian, yes. That's the meaty part of the page, yes.

These meat jokes again. I feel like that has been like the theme of the week from John. It is. Yeah, Sundar sent a memo about this on Monday. About meat jokes. Yes, it's the meat jokes week. And now Lizzie doesn't know if I'm serious or not. We're all out of cheese, I guess. Yeah, because I'm not checking those emails. I guess I should go check. But he's the boss. Well, I mean, what could you say, Gary? You filtered all of those emails, right? I do. I don't even see his emails because I filter them. And for context, for those listening and being very confused about the meat jokes: this week, we had several meetings with the team or members of the team, and somehow we always end up joking about meat, even though we have several vegetarians on the team.

But going back to our dupe detection. When we have calculated the checksums and we compare the checksums to each other, then those that are fairly similar, or at least a little bit similar, we will put together in a dupe cluster. I have kind of a dumb question. What is a checksum? And why do you keep mentioning it? So a checksum is basically a hash of the content. A fingerprint. A fingerprint, yes. That's a better word. Basically, it's a fingerprint of something. In this case, it's the content of the file. In the case of, for example, zip files, like packages, it could be, again, the content reduced into a checksum, so you can easily compare two different packages to each other, essentially.
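To make the checksum idea concrete, here is a minimal sketch in Python. It assumes the boilerplate (navigation, footer) has already been stripped, so only the centerpiece text gets hashed; the normalization step and the choice of SHA-256 are illustrative assumptions, not a description of Google's actual pipeline. Note that an exact hash like this only catches exact (post-normalization) duplicates; catching near duplicates needs similarity-preserving fingerprints (SimHash-style techniques, for example), which is one reason multiple kinds of checksums would be calculated.

```python
import hashlib

def content_checksum(centerpiece_text: str) -> str:
    """Fingerprint the central content of a page.

    Assumes boilerplate (navigation, footer) was already removed,
    so only the "centerpiece" text contributes to the hash.
    """
    # Light normalization so trivial whitespace/case differences
    # don't produce a different fingerprint.
    normalized = " ".join(centerpiece_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

page_a = "Widgets are great. Buy widgets today."
page_b = "  Widgets are GREAT.   Buy widgets today. "  # same content, messier text
page_c = "Gadgets are great. Buy gadgets today."

# Comparing two short hashes is far cheaper than comparing full texts.
print(content_checksum(page_a) == content_checksum(page_b))  # True: dupes
print(content_checksum(page_a) == content_checksum(page_c))  # False: different
```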
So from a practical point of view, it's basically: you take all of the letters in the document and you add them all together, and you come up with this really long number, and then you just compare the numbers instead of comparing the text. So it's kind of a simple way to have a fingerprint for the whole document. Then you just compare the numbers. You don't have to look at the text anymore. What if the document changes? Does it get a new number? Then the number changes. Yeah, and then basically, if the number changes, then the dupe cluster would be, again, different, because the contents of the dupe cluster would be different, because you have a new number in the cluster. So that would just go into another cluster, essentially. One that's relevant to that number.

And then once we have calculated these checksums and we have the dupe cluster, then we have to select one document that we want to show in the search results. Why do we do that? We do that because typically users don't like it when the same content is repeated across many search results. And we do that also because our storage space in the index is not infinite. Basically, why would we want to store duplicates in our index when users don't like it anyway? So we can basically just reduce the index size. But calculating which one should be the canonical, which page should lead the cluster, is actually not that easy, because there are scenarios where even for humans it would be quite hard to tell which page should be the one that is in the search results. So we employ, I think, over 20 signals, we use over 20 signals to decide which page to pick as canonical from a dupe cluster. And most of you can probably guess what these signals would be. Like, one is obviously the content, but it could be also stuff like PageRank, for example, like which page has higher PageRank, because we still use PageRank after all these years. It could be, especially on the same site, which page is on an HTTPS URL, which page is included in a sitemap, or if one page is redirecting to the other page, then that's a very clear signal that the other page should become canonical. The rel canonical attribute. That's also, is it an attribute? A tag? It's not a tag. It's a tag, yeah. No, it's not. It's a link, the link tag. It's a link tag with a relation attribute, canonical. Right, I'm confused, but anyway. So the link rel canonical tag is quite a strong signal again, because people, or someone, specified that that other page should be the canonical.

And then once we have compared all these signals for all page pairs, then we end up with the actual canonical, right? And then each of these signals that we use has its own weight. And we use some machine learning voodoo to calculate the weights for these signals. But for example, to give you an idea, a 301 redirect, or any sort of redirect actually, should have a much higher weight when it comes to canonicalization than whether the page is on an HTTP URL or HTTPS. Why? Well, because eventually the user would see the redirect target. So it doesn't make sense to include the redirect source in the search results. So do we get that wrong sometimes? Or like, why do we need machine learning? Like, we clearly just write down these weights once and then it's perfect, right? So that's a very good question. And a few years ago, I worked on canonicalization, because I was trying to introduce hreflang into the calculation as a signal.
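As a toy illustration of the two steps just described, clustering pages by checksum and then electing a leader, here is a hedged Python sketch. The signal names, values, and weights below are invented for the example; the episode only says that there are more than 20 signals and that their weights are tuned with machine learning, not what the signals or weights actually are.

```python
from collections import defaultdict

# Invented signals and weights, purely for illustration; the real system
# uses 20+ signals with machine-learned weights.
WEIGHTS = {
    "is_redirect_target": 5.0,  # redirects are described as a very strong signal
    "https": 1.0,
    "in_sitemap": 0.5,
    "pagerank": 2.0,
}

def build_dupe_clusters(pages):
    """Group pages whose content checksums match into dupe clusters."""
    clusters = defaultdict(list)
    for page in pages:
        clusters[page["checksum"]].append(page)
    return list(clusters.values())

def pick_canonical(cluster):
    """Pick the cluster leader: the page with the highest weighted score."""
    def score(page):
        return sum(weight * page["signals"].get(name, 0.0)
                   for name, weight in WEIGHTS.items())
    return max(cluster, key=score)

pages = [
    {"url": "http://example.com/a", "checksum": "abc",
     "signals": {"pagerank": 0.4, "in_sitemap": 1.0}},
    {"url": "https://example.com/a", "checksum": "abc",
     "signals": {"pagerank": 0.4, "https": 1.0, "is_redirect_target": 1.0}},
]
for cluster in build_dupe_clusters(pages):
    print(pick_canonical(cluster)["url"])  # https://example.com/a wins
```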
And it was a nightmare to fine-tune the weights manually, because even if you change a weight by 0.1, and I don't think it even has a unit, it can throw off some other number, and then suddenly pages whose URL is shorter, for example, might show up, or be more likely to show up, in the search results, which is kind of silly, because why would you look at that? Like, who cares about the URL length? So it was an absolute nightmare to find the right weight when you were introducing, for example, a new signal. And then you can also see bugs. I know that, for example, John escalates quite a few dupe issues, basically based on what he picks up on Twitter or the forums or whatever. And then sometimes he escalates an actual bug, where the dupes team says that, why are you laughing, John? You shouldn't laugh. This is about you. I'm putting you on the spot. You should appreciate this. But anyway, so then he escalates a potential bug, and it's confirmed that it's a bug, and it's related to a weight. Let's say that we use, I don't know, the sitemap signal, and the weight of the sitemap signal is too high. And then let's say that the dupes team says, OK, let's reduce that signal a tiny bit. But then when they reduce that signal a tiny bit, then some other signal becomes more powerful. But you can't actually control which signal, because there are like 20 of them. And then you tweak that other signal that suddenly became more powerful, or heavier, and then that throws off yet another signal, and then you tweak that one, and basically it's a never-ending game, essentially. So it's whack-a-mole. So if you feed all these signals to a machine learning algorithm, plus all the desired outcomes, then you can train it to set these weights for you, and then use those weights that were calculated or suggested by the machine learning algorithm.

Are those weights also like a ranking factor? Like you mentioned, like, is it in a sitemap file? Would we say, well, if it's in a sitemap file, it'll rank better? Or is canonicalization kind of independent of ranking? So canonicalization is completely independent of ranking, but the page that we choose as canonical, that will end up in the search result pages, and that will be ranked, but not based on these signals. Right.

I mean, that was a lot of stuff, and I wonder if we should just write this up somewhere. That's probably not a bad idea. So if we wanted to write that up, what would we need to do? It sounds like Gary has all of the inside information, and he's willing to share it off the record. What would it take to get it on the record? Well, since it's Gary, I think he could write it down himself, maybe. First draft, yeah, yeah. No. Because I've seen him write some stuff, and I think it, you know, first draft, pretty good. But to be fair, I find it super hard to write the first draft. How do you get a draft started? Because basically I stare at an empty Google Doc and it stares back into my soul. Yeah, you and I have had conversations about this. I know. The blank page is just like the death of all writing. And one thing that can really help is talking about it with someone else and then kind of roughly banging out a skeleton of what you want to say. So maybe start with headings. Like, I want to talk about this, then this, then this, and then maybe some notes under each one. And I guess, before even that, maybe deciding what the point of the doc would be.
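Circling back to the weight-tuning whack-a-mole described above: the alternative to nudging weights by hand is to fit them from labeled examples, which updates every weight at once instead of one at a time. Here is a deliberately tiny sketch using a perceptron-style update on invented data; the actual training setup is not described in the episode, so everything here, the signals, the labels, and the update rule, is an assumption for illustration only.

```python
# Toy sketch: learn canonicalization signal weights from labeled page pairs
# instead of hand-tuning them. Data, features, and update rule are invented.

SIGNALS = ["is_redirect_target", "https", "in_sitemap", "url_length_penalty"]

# Each example: (signals of page A, signals of page B,
# label = 1 if raters say A should be canonical, else 0).
training_pairs = [
    ({"is_redirect_target": 1, "https": 1}, {"in_sitemap": 1}, 1),
    ({"url_length_penalty": 1}, {"https": 1}, 0),
]

def score(signals, weights):
    return sum(weights[s] * signals.get(s, 0.0) for s in SIGNALS)

weights = {s: 0.0 for s in SIGNALS}
learning_rate = 0.1

for _ in range(100):  # a few passes over the data
    for sig_a, sig_b, a_is_canonical in training_pairs:
        predicted = 1 if score(sig_a, weights) > score(sig_b, weights) else 0
        error = a_is_canonical - predicted  # -1, 0, or +1
        # Nudge every weight at once in the direction that fixes the error,
        # rather than hand-tweaking one signal and unbalancing the rest.
        for s in SIGNALS:
            weights[s] += learning_rate * error * (sig_a.get(s, 0.0) - sig_b.get(s, 0.0))

print(weights)  # the redirect and HTTPS signals end up weighted above the rest
```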
So for Gary, like for this one that he was just talking about, would it be something that people will read to then do something else about it? Is there any action that they need to take? Or is it explaining a concept so that they can understand how something works? So first kind of deciding what's the main point, and then that will determine the structure. Okay, so basically Gary would come up with this idea that we should document this, and then create a rough draft and send it to you. And then you just copy and paste it into the documentation and hit publish? No, I think there's a little bit more back and forth. Probably we would talk about it first. And then I would say, Gary, can you please write something down? And then he would write it down. And then I would look at it and make comments, probably in a Google Doc. And then we would queue it up to go into another review process for DevSite, because we have to check it in to our content management system. And then we would do a final review, stage it, make sure it looks nice. And then we would publish it. And depending on what it is, we might want to promote it a little bit as well. Maybe on Twitter or in a blog post, depending on if it's related to a feature or if it's just explaining a search concept.

Okay, cool. So basically you'd work on the docs draft with Gary first, and then I guess he would turn it into an HTML page, or how does that work with the developer documentation? Yeah, either me or Gary. But since Gary knows what to do, I think he would do it. Okay. But yes, he would write in HTML, because that's how our content management system works. Okay, cool. And then he would basically check it in like any source code file or... Yes. Okay, cool. Exciting stuff. And then the publishing part, I guess, is like just clicking a button, or is there something fancy? Yes. Just clicking a button. No, it's not fancy. You just click a button. It used to be a little bit more complicated, where there were two steps, where you had to manually publish, but the team that maintains the site that we publish on fixed that, so now it's automatic. So all you do is submit to the code base, and then it pushes it, like, within a second. So you just click a button, submit, and then it's live. So cool.

So basically you would just publish anything that we would bring you, and nobody would check it? No. No. No. That's not true. That's a little... I would not just publish anything. Could we do like a guest blog in the developer documentation, a guest blog post? Well, I don't know how interesting that would be to people. But maybe if people want to know about it, then maybe. I think that that's also a good point, and that's something that I often push back on: people who want to just write something. Like, what is the point, and why? And will this help anyone? Or are you just putting this out there just to put it out there? Because if there's a pile of information and some of it is not useful, that's not helping anyone. Oh, okay. So, well, I mean, obviously it should be useful. That kind of makes sense. Yes. Step one, make sure it's useful or interesting.

Okay. Do you do anything afterwards? Like, once it's live, is it basically just live forever? Or do you go back and review that from time to time? Do you ever get feedback on these things? How does that tend to work? Yeah. So we have a feedback mechanism, well, multiple feedback places where you can report things.
So sometimes it's on Twitter, which we don't like, because it's hard to track what's going on. There's a button in the documentation where you can click Send Feedback and then write to us about what is wrong or what you found confusing. Mostly it's negative stuff, but you could write in some positive stuff if you wanted to, if it helped you. Like, we like to know that, but mostly it's like, typo, or, I don't like this. And it goes into this queue for us to look at, and then we will revisit and fix it. Oh, cool. It's really helpful if it is specific. So there's an option where you can select, like, this part of the page is confusing. So when people give us more context, it's easier for us to know how we can improve this. Okay, so it's not like this giant black box where people go and they complain and nobody actually reads it. It actually goes to people, and they try to read it and figure out what to do. Yeah, and the people are me and Gary. We are looking at everything. A lot of it is like weird stuff, or not very helpful. Oh yeah.

I had an idea that maybe we could have a video where we just read the bad feedback, like those celebrity mean tweets videos, because some of it is so aggressive, or just like, this is bad, I don't like it. And when you get feedback like that, it's like, well, what should I do differently if it's just bad? Like, what about it? I need specifics to improve it. Okay. So I was thinking about that video, and I'm in, actually. Awesome. I kind of think, as the lead of the team, we should be careful about me doing something like that. Too much public shaming. Come on. I mean, we shouldn't be encouraging people to leave even worse feedback. I think getting more actionable feedback is always useful, but it sometimes feels like, especially with a big company like Google, if there's a feedback link, then it's easy to assume, oh, nobody actually reads it. But it sounds like the feedback that comes in for the documentation, there are actual people behind it. So if you want something changed, or if you're frustrated because, I don't know, you're trying to implement something and it's just not working, then it sounds like people should just leave more clear feedback.

Yeah. I completely understand why people don't. I mean, it's work to write in a long story about why you had trouble with something. And it's much easier just to say, bad, one star. Like, we don't like it. And if I did get a big pile of that for one document, then I would look into it more. I would probably have to speak to some people, like run a survey or something for that doc, to find out: okay, well, more specifically, I'm seeing a lot of one-star reviews for this doc. Why is it doing so poorly? What can we do to fix it? And then more specifics can kind of help that along. That sounds pretty cool. Yeah. Okay. And then also the forum. Sometimes, sorry, I meant to mention that one. So there's Twitter and the feedback funnel in the docs, and then the forum. And so sometimes people will write in there, and we'll get bugs surfaced from that. If something's wrong, like maybe incorrect or out of date or something like that, then we'll fix it. Cool.

What about non-English content? Is that something where you also process that? Or does that just go to translators directly? We also process that. The feedback tool that we have, it will translate what they're saying.
A lot of the feedback there is like, oh, you used this word when you should have used this other word in this language. And it will translate that into English, which doesn't really make any sense, because sometimes that auto-translate thing just makes it sound like the same word. But that's a signal for us to look into it deeper, to report back like, hey, are we sure that this is the right word? And then talk to our localization experts about what we should do about that. So cool.

Okay. So I guess if you want to complain about the documentation on Twitter, you should go to Gary and he will accept your feedback, but better than that is just to give actionable feedback in the documentation directly, so that... Wait, what? No. I don't know how I feel about this. No. Okay, fine, fine. Then maybe don't complain about it on Twitter to Gary. Yeah, complain to John. Why me? To me, yeah. I'm too nice, I'm too nice. Nobody complains to me. Right, yeah. Sure. Okay, sometimes, sometimes, sometimes. Okay, we should move on. Cool, okay. I think that helps a lot with the documentation, because that's something where it sometimes comes across as if it just magically appears and nobody really knows where it's coming from.

And for the documentation, do you do all of the Search documentation, or is it just a part? Just a part, yeah. So when you say Search documentation, there are a couple of different angles there. There's the documentation that's just for everybody who's trying to use Search, so like web search kind of thing, like how do I search for a job or a business near me? That's a different team. There's also the Search Console documentation; there's a tech writer, Josh, who's on the Search Console team, who documents stuff about how to use Search Console. And then I work on the developers.google.com Search documentation, which is a little bit more technical, I guess. Okay, cool. We should talk a bit about how the process works with getting new features documented and all of that, but maybe we can do that in one of the next episodes. I hope we didn't scare you too much with this podcast. Cool, okay.

I think maybe we should take a break here. It has been fun doing this episode, having more people join in and more viewpoints, and yeah, I hope you as a listener found this useful and entertaining and a bit insightful. And if you like these, then of course let us know on Twitter or wherever you can give feedback for podcasts in general. And hopefully we'll see you again, or wait, wait, I always mess this part up. We'll hear you. No, you will hear us again in one of the future episodes. And of course, to hear us again on a future episode, don't forget to subscribe. See you then. Bye everyone. Bye. Goodbye. O zi bună! ("Have a good day!")