I'm feeling great today, mostly because of the performance-enhancing drugs, so this should be fun. Last night about 10 o'clock — I live on the East Coast, so I'm dealing with that — I brought my family up this time to finally see Seattle, see the sights, and meet the Mozzers who don't know enough about me yet to put in a bad word. My kids are kicking around being assholes, as kids who are three, five, and seven do, and I go in there and I'm like, look, Daddy has got to talk to like a thousand people in the morning; you have got to go to sleep. And my oldest looks at me and says, "A thousand people? That's a lot. What if they don't like you?" I was like, "Sweetie, Daddy's a black hat. He can always buy reviews. It'll be fine."

It's been a great time here in Seattle, and I'm really looking forward to talking with you today. I have the unenviable task of discussing data — and not even the cool parts of data. I know Brittany is going to talk later about machine learning and the awesome stuff you can do there, and Will is going to talk about data visualization, but I'm going to focus on the data itself. So there's probably 5% of you in here who are pretty excited about this talk. I wouldn't tell anyone — it's not good for your marriage or your relationships; keep that on the down-low — because it's about the least cool thing you could really be interested in. The other 95% of you will get it, you'll understand it, but it's going to be a lot to digest really, really quickly.

First, a little bit about myself. I am Principal Search Scientist at Moz. My primary responsibility has kind of morphed into data quality. When I first joined, I worked on Keyword Explorer; we really wanted to improve predictions for actual search traffic, and we realized there was a big quality gap there. Then, slowly but surely, the research I did for that made its way into other tools in our product line. I also get to do a lot of research and development and proof-of-concept work, which is a lot of fun. But — and it's hard to believe — my favorite task at Moz is specifically to ruin Dr. Pete's day. I want to drive this home, because it's absolutely true. I am Principal Search Scientist at Moz, and I have never heard of anyone with "Principal" in their title outside of a school. When I was talking with Adam, my boss at Moz, trying to figure out what to call me, I knew Dr. Pete had this incredible reputation across the industry as being the science guy — the Bill Nye the Science Guy of SEO — and he's a doctor too, which is a little intimidating. So I said, well, what's his title? He's a Marketing Scientist. Then I went to a thesaurus, found as many words as I could that sounded better or more important than "marketing scientist," and came up with Principal Search Scientist. So now, whenever we're put side by side, it appears as if there's some sort of hierarchy of scientists at Moz and I've positioned myself above Dr. Pete. And he's already given his talk, so he can't respond, which works great for me.

Given the amount of stuff I've got to get through pretty quickly, I've tried to narrow the talk down to one primary goal, and it's this: to cultivate an appreciation for data quality. One of the hardest things our industry deals with is that our foundational data — and by foundational data I mean the raw stuff, not the pretty metrics like PA and DA you might deal with, but the data at the very bottom, the stuff collected by the tools we use and then turned into things that are valuable — that data is really messy. It is filled with holes, it is filled with biases, and it is often misleading. So today I want to talk through a bunch of issues with data and get you into a position to appreciate quality data. I want you to be the wine snob of data: appreciate data the way snobs appreciate wine, or the way Cyrus appreciates his hats — just really love quality data.

And the reason is that bad information puts us into a unique situation. I wish I had reworded this slide, because it's almost a perfect Yoda saying — it should have been "Asymmetry, bad information creates" — but instead it reads "Bad information creates asymmetry." What that means is that when two different companies — you and your competitor — get the same bad data, but one of you figures out how to fix or improve it, it creates an information asymmetry: one of you is a little more informed than the other, and that asymmetry creates advantage. Let me explain that really quickly. Imagine that you and a competing firm have identical knowledge of how to implement the tactics and strategies you'll hear about all throughout MozCon. Let's say you've
somehow managed to hire sets of identical twins — equally smart, read all the same books and blog posts, same experience, and so on — but one of you knows that Google Analytics has a particular bug that causes organic traffic to be mislabeled (we'll talk about that later). Well, it means that when you're trying to figure out how much money to invest in one of those tactics or strategies, you'll know to look at the correct labeling and realize you need to invest a little more than your competitor does. You'll make better decisions based on that quality.

Now, how do we cultivate this appreciation for data quality? I can only see one way. The first part is just a big downhill slope of pointing out how horrible all of the data we rely upon is. So for the first fifteen minutes or so of this presentation I want to disabuse you of any pretense that the data you rely on daily is good. It's not. That's not to say it isn't usable or valuable, but it has serious issues. And then, from the shambles of data left on the ground, we'll try to resurrect at least one example of how to use multiple data sources to produce a really quality metric — not just bemoaning the data, but actually pulling things together and being smart about it.

So let's kick off: let me count the ways your data sucks. We're going to start with what everybody uses, Google Analytics — unless you like to pay for shitty data, which is fine too; most of the other tools out there have similar gotchas that you're just not going to notice. I'm going to run through these really quickly, so please don't try to take notes — the slides will be up later. The point here isn't really to inform you about specific issues; it's to make your stomach hurt a little, to make you feel uneasy about where you stand with the data you rely on, and then we can talk about how to undo those knots in your stomach.

So, Google Analytics. Number one: Google Analytics often counts real search engine crawlers as visitors. For example, there's an international search engine named Sogou; now that they've decided to start crawling JavaScript, they're going to start counting in your analytics. Number two: Google mislabels well-known search engines. Take Ask.com — if the traffic comes from a subdomain, Google doesn't count it as organic, which means that if you're doing a great job picking up search results from across the web — not just from Google or Bing or the major engines — it's missing out. In fact, at last count I think Google only looked at maybe 30 or 40 different search engines in determining your organic traffic, and DuckDuckGo wasn't one of them.

What about Google Analytics coverage? I'm not sure if Tom has spoken yet, but he wrote a great blog post on this: depending on your customers, how they've installed Google Analytics, and what browsers your audience uses, you're going to get different rates of impressions, visits, and data. That's a huge issue, because most of the time when a customer comes to you, you're looking at years of data — you can't go back in time and fix their implementation, much less change the audience that visited. Of course Jono, who spoke yesterday, as soon as I mentioned this to him, pointed out that this isn't Google's problem, and I was like, thanks a lot. It still doesn't matter, because you're stuck: you can't change this foundational data; there's no going back in time and no fixing it.

But of course it's not just Google Analytics; we use dozens of different tools. Google Search Console — I mean, this one, holy shit, it does not get any worse than this. If my daughters ever used Google Search Console impression or position data in a project or report, I would change my position on corporal punishment. This is just really horrible data. Let me give a quick
example. Imagine you're running in the Olympics. You run a hundred different races, and — let's say it's actually you and me — you're probably going to lose every single one. But by the last race it's pretty much been settled who the best is, so everybody goes home. You stick around, though; you run the last race as the only participant, so of course you take number one. You walk up, you stand on the podium, they put the medal on you, it's great, and you send the picture home to Mom and don't tell her the truth. But what happens then? That's when Google shows up. Google shows up with the camera, takes a picture of the podium, and that's what they use to determine your average position. It's what I call a podium-first count: all the races you lost so miserably that you never made it to the podium don't matter.

Let's give it better context. Imagine you're really pissed off at your company because you're not ranking in the top 100 for any of your key terms, so you send out an email that night to all of your employees: what are we doing, this is horrible, click here and you'll see we don't rank for anything. Of course they're going to see personalized search results, so they go and look, and that day everybody sees you in position seven or eight, because they're getting hyper-personalized results. Three days later, Google Search Console is going to tell you your average rank is about six. That's because they only count when an impression is made: if you rank 1,000th and nobody ever sees it, it doesn't count against you. It's like participation awards, except your participation doesn't count against you — you only get counted when you win. This is about as garbage a metric as could possibly be.

But it's not the only one; there are tons of issues with Google Search Console. In an experiment run with a couple of friends of mine, it turns out Google filters out a ton of long-tail queries — a ton of really great long-tail stuff — which means that even though people are searching for those things, and it doesn't matter whether they're logged in or in a different browser, they're just not going to show up. You're just going to miss the data. There are other issues where Google reports things as not indexed: JR Oakes over at Adapt Partners did a great study where he found that up to 80% — 80%! — of pages supposedly not indexed by Google, according to Google Search Console, were in fact in the index. And what about links? If you use the sample links data from Google Search Console, you're going to find it has a lot of dead links in it. In fact it does worse than Moz, Majestic, and Ahrefs, which is nuts — Google is a far faster crawler than anything else on the internet — but the time delay alone between when they collect the data and when they present it to you causes this problem.

But if we're talking about links, let's not just harp on Google's link data. There's link data you pay for — surely if you're paying all this money to Moz or Majestic or Ahrefs, you're getting great, perfect foundational data? Not a chance. We'll start with this one: research I did just a couple of months ago on the mobile-first index determined that roughly 12% of pages on the internet deliver different links to a mobile bot than to a desktop one. Well, none of the major indexes — not Moz, not Majestic, not Ahrefs — and none of the minor indexes, like SEMrush or WebMeUp, crawl with a mobile bot. That means that now that mobile-first is rolling out, different links are going to be in Google's index than the links in Moz's, Majestic's, or Ahrefs'. And it's not like this is a new problem — we've always had an issue with inconsistent robots.txt. Imagine this part of the room blocks Majestic, this part blocks Ahrefs, this part blocks Moz, but everybody lets in Google. What that ends up meaning is that we all
have different link indexes, and if you're not using all of them, you're missing certain links. And then there's the worst of these, the things we like to call black holes: a major hosting company recently decided to block every non-search bot and didn't notify their customers other than in a blog post. We're starting talks with them now to see if we can fix this kind of issue, but if it hadn't come to our attention from a customer of ours, we'd be in the same position as everybody else in the link index industry — we just wouldn't know.

And then there's spam. As I mentioned earlier, I cut my teeth doing black-hat SEO, and a decade or so ago I thought spam was great because it made me money. The reality is that it just fills up link indexes. You're not going to be able to read this on the screen — this is what happens when you try to fit 500 domains of a single spam link network onto one page. That might not seem like a big deal, but there are a lot of pages of this; it just goes on and on and on. We call this the Badminton link network: still live, still active, 16,000-plus domains. It's a Wikipedia scraper with infinite crawl depth, which means every time you visit a page it randomly grabs something from Wikipedia and links to it, creating hundreds of millions of external links. Because of that, when you look at your foundational data — link counts, for example — it's going to be blown up if a link index hasn't done a good job of getting rid of this network. And this is just one network.

Now, we've talked about Google Analytics, Search Console, and links. Let's take a deep breath — surely we've gotten rank tracking right? It's been what, 15 or 20 years since the first rank tracker was written; it can't be that hard, we're just searching Google. Unfortunately, this is such a moving target that we have teams at Moz dedicated just to looking at what's changed and how to fix it. Here's one that almost nobody ever talks about, but it's a huge, endemic problem across the industry: rankings data changes based on the size of the SERP. If you request the top 50 search results, the featured snippet at the top can come from anywhere — position 40, 50, 60, it doesn't matter. What about the frequency of SERP features? What about whether there's a Top Stories block, a featured snippet, or a knowledge card? All of that can change. Luckily, Moz does what's called a double fetch: we get the true first page and the top 50 and mesh them together. But almost no one in the industry does this, which means that if you're getting rankings data based on anything other than the true first page, you're getting biased data.

And then there's this one, which is just annoying as hell: Google has decided that everything should be geographically related in some way or another. If you tell Google you're just generically in the United States, it puts you smack dab in Independence, Kansas, and you'll start getting results at the bottom of your page accordingly — you'll search for "restaurant," trying to find something about restaurants nationwide, and at the bottom it's going to show you the McDonald's in Independence, Kansas.

What about Keyword Planner? We've got keyword data. This one is really close to me, because it's what I've worked with most over the last several years, and we know it's just straight-up fucked — pardon my French. They group keywords differently in Keyword Planner than they do in SERPs. If you're a football fan and you search for "Texas A&M," you're going to get a certain search result; if you misspell it or put in an extra space, sometimes Google will correct it and say "showing results for," and sometimes it won't. But that's not the same as what they do in Keyword Planner itself. In Keyword Planner they do different groupings, and then they group keywords together, which is itself a problem.
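A quick aside on the rank-tracking fix from a moment ago: the double fetch — grab the true first page and a separate top-50 result set, then mesh them — might be sketched roughly like this. This is my simplification of the idea, not Moz's actual implementation:

```python
# Rough sketch of "double fetch" merging: treat the true first page as
# authoritative for its URLs, and let the separate top-50 fetch fill in
# the tail without duplicating anything already on page one.
def merge_serps(true_first_page: list, top_50: list) -> list:
    merged = list(true_first_page)
    seen = set(true_first_page)
    for url in top_50:
        if url not in seen:  # skip URLs already on the true first page
            merged.append(url)
            seen.add(url)
    return merged
```

The design choice is that the first page wins every conflict, since that's the page real searchers actually see; the top-50 fetch only extends the list.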
If you look up "SEO" and "search engine optimization," Google says, oh, they've got the same volume — but the reality is that's the two volumes combined. So if you go to a tool that relies on Google Keyword Planner volume and ask, okay, what are all the keywords I rank for, chances are you're combining volumes that have already been combined, and you end up with a really huge problem we call disambiguation.

What about click data? We're using click data more and more now. CTR curves — you've all seen these, and Dr. Pete is going to talk about this in more depth — but on the first slide you'll see a standard click-through-rate curve, where position one gets X percent of the clicks, position two gets Y, and so on and so forth. We know that changes radically based on SERP features: the second curve is what happens when you have sitelinks, where the number-one position gets almost 90% of the clicks. And it's not just how people click down the curve; it's also the total clicks. This is total click-through rate: if a hundred people landed on a SERP, what percentage would click on organic at all? With no features, it's about 80% — not bad — but if there's a knowledge card on the page, it drops to 25%.

What about personalization? Personalization is almost the hardest to pin down. We don't really have a good percentage metric for it yet, but we've built an internal one, and you'll see that if there's a local pack on the page, there's almost double the personalization you'd expect compared to no features at all, or a knowledge card. All of which means the data you've been looking at — telling you how many clicks you're going to get or how much money you're going to make, with those simple formulas — is ruined, because the reality is far different from these most basic numbers.

But we have a solution, right? You've probably heard about clickstream data — I'm sure a lot of you have read blog posts on Moz, and now Ahrefs and SEMrush are using clickstream data to improve their models. Nope, not really that good either. It turns out the most important, or at least most popular, clickstream data source has 0% Mac and 0% iOS traffic. That means we're fixing our bad data with more bad data. Luckily, we know this bias, so we can solve some problems with it. But what all this amounts to is a giant data train wreck — a huge, massive problem. Everything we rely upon, all of this foundational data, is filled with holes and problems and bugs. It's just crap.

So what do we do? We've got about five minutes left, and we're going to try to solve one problem. Just one. If you're at an agency or in-house, you've probably been asked at some point: how many clicks are we going to get if we rank position X for this term? So we're going to look at what happens if you rank number three for "ROI." What foundational data are we starting with? We need to know the click-through rate at position three — that's easy, right? We solved that like 20 years ago when we started doing eye tracking and all that kind of stuff. And we need to know the search volume — even easier: log into Google and it'll tell us. Great. This is what we used to do, what everybody would do: take a standard click-through-rate curve, circle your position on the graph, go to Keyword Planner and circle your volume, multiply the two together, and there you go. Great, perfect, grand, all done.

But we know this is wrong — completely wrong; we just went over this. First, click-through rate depends on search features: we already talked about how that curve changes based on which features are present. So instead of using one click-through curve, we need to look at a huge array of potential click-through curves — not just the curve for a single SERP feature, but curves for whole combinations of features.
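For concreteness, that old approach — the one we just tore apart — was literally this simple. The curve values here are illustrative placeholders, not real measurements:

```python
# The naive estimate: one fixed CTR curve times Keyword Planner volume.
# Curve values are illustrative, not real click-through measurements.
STANDARD_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def naive_clicks(volume: int, position: int) -> float:
    """Estimated clicks, ignoring SERP features, personalization, and grouping."""
    return volume * STANDARD_CTR.get(position, 0.01)
```

Multiply two numbers, get an answer that looks precise — and is wrong for all the reasons above.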
Given search features A, B, and C — say the SERP has ads, sitelinks, and a knowledge panel — what is that click-through curve? It turns out there are a lot of those combinations — more than the people in this room, by far. So how do we solve it? Luckily, Dr. Pete is actually kind of a smart dude, so we worked together on a project, pulled together a huge amount of data, did some magic finagling with machine learning, and ended up with a way to predict click-through curves for novel arrangements of features, which is pretty awesome.

But that's only one of the issues. We can come up with that specific click-through curve, but we also know the total organic click-through rate matters — it's not just the shape of the curve, but how big the pie is in the first place. That means we've got to do the whole damn thing again, just for total organic click-through rate. We looked at that graph a second ago; now we need to figure out what happens when there's a local pack and a knowledge card, or a local pack and sitelinks. And we know it depends on personalization too, so there's a third part — it gets even harder, even deeper — and we have to do the whole damn thing yet again just for personalization, so we can once more predict personalization rates for novel arrangements of features. It's pretty amazing to see. For example, if there's a movie box on the SERP — because some movie has recently come out — your chance of getting a click, even if you have the best site and rank number one, is less than one percent. The reason is the showtimes: once people see a showtime, they click on it. We call the metric host diversity, and it tells us just how few and far between those organic clicks are.

Before we put that aside, imagine what this means for a site like weather.com. Pretty much every weather-related search has a knowledge card — that giant thing right there in the middle: sunshine, rain, rain, rain, rain, rain; Seattle, rain (actually it's been kind of nice here). It means that nearly every SERP they rank for only gets about a 25% click-through rate to organic, so they have to quarter their original estimates if they want to be accurate.

Finally, volume depends on disambiguation. We've figured out the click part, but now we've got to figure out what percentage of the volume is actually dedicated to this keyword, as opposed to the 30 keywords grouped together with it. Right before I joined Moz, when this disambiguation problem first showed up, I did an analysis of a leading keyword provider: 47% of their keywords were grouped once Google put in this algorithm. That meant that, on average, every time you pulled two keywords related to one another, their volumes had already been combined — an enormous, enormous boost in estimated search volume. It was just all wrong. Like I said, we use click data to solve this problem at Moz: a simple formula that takes real click data from our clickstream provider and runs it through.

Ultimately, after all of that work, we come up with a new formula — and I'm not asking you to memorize it; this is just for general understanding — but instead of a simple curve times volume equals traffic, we've got complex machine-learned functions for every little bit of the process. It's not just multiplying two numbers together anymore; it's huge amounts of data, all with the simple goal of telling you one number: 3,218. That's how many clicks we think you'll get ranking number three for "ROI." All that work for just one number — and that's what I mean by appreciating data quality: so much effort going into giving you the right answer. Meanwhile, if you take the number out of Keyword Planner and multiply it by a normal click-through curve, you're going to get more than three times that.
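A toy version of that final pipeline might look like this. Every constant below is a made-up placeholder standing in for one of the machine-learned functions described above — none of these numbers are real:

```python
# Toy sketch of the click-estimation pipeline: four corrections instead of
# one multiplication. Every constant is a hypothetical stand-in for a
# machine-learned function; the shape of the computation is the point.

def estimate_clicks(grouped_volume: int, position: int, serp_features: set) -> float:
    # 1. Disambiguation: share of the grouped Keyword Planner volume that
    #    actually belongs to this exact keyword (hypothetical share that
    #    would come from clickstream click data).
    true_volume = grouped_volume * 0.55

    # 2. Total organic CTR: how much of the SERP's traffic goes to organic
    #    at all, given the features present (80% bare SERP vs. 25% with a
    #    knowledge card, per the talk).
    organic_share = 0.25 if "knowledge_card" in serp_features else 0.80

    # 3. Feature-specific CTR curve: share of organic clicks captured at
    #    this position for this feature combination (hypothetical curve).
    position_ctr = {1: 0.30, 2: 0.15, 3: 0.10}.get(position, 0.01)

    # 4. Personalization damping (hypothetical factor).
    personalization = 0.90 if "local_pack" in serp_features else 1.00

    return true_volume * organic_share * position_ctr * personalization
```

In the real system each hard-coded constant would be a whole modeled function; the sketch only shows why the naive curve-times-volume answer comes out several times too high.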
That's because Google groups together the keywords: "ROI," "R.O.I.," "return on investment," all the misspellings of "return on investment," all the similar acronyms Google thinks are related. It's going to ruin your estimates, your strategies, and your targets — and your competitors are going to know better; at least one of them will.

So what are the takeaways? Number one: examine. Examine all of your data sources closely and carefully; at the end of the day you have a responsibility to yourself, your company, and your customers. Number two: inform. Let your clients know. That gives you the flexibility and leeway you need, so that when you run a campaign you can say, look, there's a margin of error here that we just can't beat — and that margin of error is the slack you need to be creative. And finally: demand. When you see something wrong at Moz, tell me — send us an email — so that we demand better data from all of our providers. If I leave you with anything, it's this: demand less sucky data. You're paying for it; you deserve it. Have a great MozCon. Thanks.