So my name is Alan and I'm here to talk about web deduplication. Probably I should explain what web deduplication is first. What we do is identify and cluster duplicate web pages, anything that looks the same, basically. Then we take these clusters and pick representative URLs that actually get put into the index and served to users at search time. In the process we generate a mapping from dupes to these representative URLs, which we usually call canonicals, and we forward the signals some of the time, so that these deduplicated pages don't just lose their signals; you get to keep them.

So why do we do this? Well, the first and most intuitive answer is that search users don't want to see the same result 20 times, the same page over and over again. That's not a very good search experience. The second reason is that once you have removed all these pages, you get a bunch of space back in the index, so you can serve more unique results and start handling long-tail queries. It's also advantageous to webmasters, because you retain signals for your site when you redesign it: when you move pages around, the signals get forwarded from the old location to the new location. And the last thing we get out of this is what we call alternate names. This is really two different things at the same time. These days it's used mostly for localization, but the older use is if you wanted to, say, rebrand your site. Let's say Larry decides that he's tired of Google and wants to fold it into Alphabet, and redirects google.com to the Alphabet page. Well, we can still serve a search for Google with google.com, because we'll know that Google is now an alternate name for Alphabet.

I'm going to talk about three things specifically: the signals we use to cluster pages; a little about localization, because localization tends to get caught in the grill here; and how we select representative URLs.

So, the signals that we use. There are a bunch of them, but the three that are most prominent, and most at your disposal as well, are redirects, the actual content of the page, and the rel=canonical tags that you send us. There are a few others we use, mostly variants on URL normalization, but those are a little different.

Redirects are the most trustworthy signal we get. This is one of the reasons why, when you redesign your sites, we always suggest redirecting the old pages to the new pages: it makes it trivial for us to identify that yes, this page is now over here, and we can continue to forward the signals as appropriate.

Now, in terms of content: to the surprise of no one, we take checksums of the content of your page, and we make various efforts to ignore the boilerplate in them. What might surprise you is what sometimes gets caught in the grill here: what we call soft error pages, where someone has served us a 200 and then some very fancy, elaborate-looking error page that says, sorry, this page is not here. We do have machine learning models that try to catch these, but webmasters are very, very creative and constantly surprise us with new ways of saying, basically, this page is not here. This is one of the reasons we prefer getting an actual HTTP error: then, instead of just dupe-eliminating your page, we can do error handling, which is a little bit different.
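Before moving on, here is roughly what that checksum idea looks like in practice. This is a minimal sketch under my own assumptions, not Google's actual pipeline; in particular, strip_boilerplate is a deliberately crude stand-in for far more involved content extraction.

```python
import hashlib
import re

def strip_boilerplate(html: str) -> str:
    """Crude boilerplate removal: drop scripts, styles, nav/header/footer,
    then all remaining tags, and collapse whitespace. A real pipeline does
    much more sophisticated content extraction than this."""
    text = re.sub(r"<(script|style|nav|header|footer)[^>]*>.*?</\1>", " ",
                  html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)  # strip the remaining tags
    return re.sub(r"\s+", " ", text).strip()

def content_checksum(html: str) -> str:
    """Checksum of the page's main content; pages with identical checksums
    would land in the same duplicate cluster."""
    return hashlib.sha256(strip_boilerplate(html).encode("utf-8")).hexdigest()

# Two pages that differ only in boilerplate hash identically:
a = "<html><body><nav>EN menu</nav><p>Same article.</p></body></html>"
b = "<html><body><nav>FR menu</nav><p>Same  article.</p></body></html>"
print(content_checksum(a) == content_checksum(b))  # True
```

Note that the two example pages differ only in their navigation yet hash identically; that is also exactly how pages that localize only the boilerplate collapse into one cluster, which comes up again below.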
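In the same hedged spirit, here is a toy soft-error detector. The real systems are machine-learned; this one is just a keyword heuristic, and the URL and phrase list are made up for illustration.

```python
import re
from urllib.request import urlopen

# Phrases that smell like an error page. Webmasters are creative, so any
# fixed list like this is necessarily incomplete; the real models learn.
ERROR_PHRASES = re.compile(
    r"page not found|no longer available|does not exist|sorry, this page",
    re.IGNORECASE)

def looks_like_soft_error(url: str) -> bool:
    """Flag a likely soft error page: an HTTP 200 whose body reads like
    'this page is not here' and should have been a real error code."""
    resp = urlopen(url)  # raises on a real 4xx/5xx, which is the good case
    body = resp.read().decode("utf-8", errors="replace")
    return resp.status == 200 and bool(ERROR_PHRASES.search(body))

print(looks_like_soft_error("https://example.com/maybe-gone"))
```

As the keyword list suggests, no fixed heuristic keeps up with creative webmasters, which is why a real error code from the server is so much better.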
One of the things that often gets caught here is a site going down for maintenance: it puts up a "this page is down for maintenance" page, and suddenly half the site is gone, because we crawled it while you were in maintenance mode. So please serve a 503 instead of a 200.

Now, finally, the rel=canonical annotations that you use: these get fed directly into clustering. We have a fair amount of validation in front of them, because sometimes people make mistakes. My favorite was when we opened up one rel=canonical to discover something to the effect of {{rel_canonical_target}}, squiggly braces and all, across the entire site. But this also happens in less obvious ways, like your entire site being rel=canonical to /. So please check your scripts.

Now, on localization. Most of the time when you think localization, you say, okay, here's the French version of my page, here's the German version, and they have nothing to do with web deduplication, because the content is completely dissimilar. Where deduplication does get involved is when, for example, you have the same language for different countries, or someone decides to get clever with a bunch of geo-redirecting, and to us it all looks like the same page, so everything gets jammed into one cluster. The same goes for the many cases we see where people localize only the boilerplate: we throw out the boilerplate, and then it looks like the same page to us. In these situations we're basically hoping that your hreflang annotations will tell us what to do, because to us it looks like you've sent us the same page.

Okay, so, picking representative URLs. Obviously, if we jam a whole bunch of your pages into one cluster, we have to figure out which one actually goes into the index. We have a machine-learned system that has pairs of pages compete on a set of signals we've chosen. The main thing we try to do here is avoid hijacking; that's rule number one, and everything else falls behind it. For that, it's actually very useful that we get escalations via WTA through the forums; these are a great source of reports for us.

Once we get past hijacking, our second concern is basically the user experience: is this really a good page to send the user to? Something like a slow meta refresh is a bad experience. Security as well: if the page has an expired certificate, that's a bad experience. One thing people fall into here is that we actually care quite a bit about the dependencies of your secure pages. If your secure page has insecure dependencies, we're not sure it's going to work properly; it might have a broken script, it might not render. So check your dependencies.

And then, finally, once we get past all that, there's the stuff you have really direct control over. You can just tell us with a rel=canonical target: here is the URL I think you should make the canonical of this cluster. Great. Or you can use redirects, particularly 301s; in this context they're a signal. And we use sitemaps to some extent, because these are pages you'd prefer us to crawl, so there's probably a good reason for us to use them.
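Since "check your scripts" is the actionable advice in all of this, here is the sort of spot-check you could run over your own URLs. It's a hypothetical helper I'm sketching, not a tool Google provides: it flags unexpanded template braces in a rel=canonical, deep pages canonicalized to the homepage, and the ambiguous pattern that comes up again in the recap below, a 301 whose target is rel=canonical right back at the source.

```python
import re
import http.client
from urllib.parse import urlparse, urljoin
from urllib.request import urlopen

# Crude pattern for <link rel="canonical" href="...">. Assuming attribute
# order like this is fine for a spot-check, not for real HTML parsing.
CANONICAL_RE = re.compile(
    r'<link[^>]*rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']',
    re.IGNORECASE)

def first_hop(url):
    """Fetch a single hop without following redirects; return
    (status, Location header or None)."""
    parts = urlparse(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc)
    conn.request("GET", parts.path or "/")
    resp = conn.getresponse()
    return resp.status, resp.getheader("Location")

def check_canonical_signals(url):
    """Flag common rel=canonical mistakes for one URL."""
    problems = []
    status, location = first_hop(url)
    if status in (301, 302, 307, 308) and location:
        # The page redirects; make sure the target doesn't point back.
        target = urljoin(url, location)
        body = urlopen(target).read().decode("utf-8", errors="replace")
        for href in CANONICAL_RE.findall(body):
            if urljoin(target, href) == url:
                problems.append(f"{url} redirects to {target}, which is "
                                f"rel=canonical back to {url}")
        return problems
    body = urlopen(url).read().decode("utf-8", errors="replace")
    for href in CANONICAL_RE.findall(body):
        if "{{" in href or "}}" in href:
            problems.append(f"unexpanded template variable: {href}")
        elif (urlparse(urljoin(url, href)).path in ("", "/")
              and urlparse(url).path not in ("", "/")):
            problems.append(f"deep page canonicalized to the homepage: {href}")
    return problems

for problem in check_canonical_signals("https://example.com/some/deep/page"):
    print(problem)
```

Run over a sample of URLs from your sitemap, a check like this catches the broken-template and everything-points-to-/ cases before we ever crawl them.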
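And on the secure-dependencies point, the same kind of one-off check works. Assuming a blunt attribute regex is acceptable for a spot-check, this lists the plain-http subresources that an https page pulls in; example.com stands in for your own site.

```python
import re
from urllib.request import urlopen

# src/href attributes that load over plain HTTP. A regex is a blunt tool
# for HTML, but it is enough to surface obvious mixed content.
INSECURE_DEP_RE = re.compile(
    r'(?:src|href)=["\'](http://[^"\']+)["\']', re.IGNORECASE)

def insecure_dependencies(https_url: str) -> list[str]:
    """List http:// URLs referenced from an https:// page."""
    body = urlopen(https_url).read().decode("utf-8", errors="replace")
    return INSECURE_DEP_RE.findall(body)

for dep in insecure_dependencies("https://example.com/"):
    print("insecure dependency:", dep)
```

This over-reports, since ordinary anchor links get caught along with scripts and stylesheets, so treat the output as leads rather than verdicts.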
Okay, so, finally, to recap some suggestions from what I just said. Use redirects to clue us in when you redesign your site. Send us meaningful HTTP result codes. Check your rel=canonical links; sometimes there are broken scripts, that kind of fun stuff. Use hreflang to help us localize. Please keep reporting those hijacking cases to the forums. Secure the dependencies on your secure pages. And finally, try to keep your canonical signals unambiguous. Sometimes we see a webmaster serve a 301 with a rel=canonical pointing the other way, and then we don't know what to do; the webmaster has told us both are good. If you keep your signals unambiguous, you'll get what you want; if they're ambiguous, the system is probably going to go off and find something else. All right, that's it for me. Thank you.