Listen to users. I'm Jeff Bowen, and I work at Automattic. We run WordPress.com and a slew of other things around the web and internet, and I'm on the data team. So data, that kind of works with the talk I'm giving. My title is actually Data Wrangler, and people ask me, what is a Data Wrangler? You probably can't see that slide very well for the sun. What better way to talk about a silly title than a series of memes? So I'll do that. What my friends think I do, I don't know that it'll help. I'll just try not to look down at my notes too much. A little better. So no, I don't plug patch cables in and out or work with servers directly too much anymore. What society thinks I do: sit in a basement and put out fires. What my mom thinks I do, I see some people have kids. What I think I do: just walk around the Matrix, not even seeing code. But what I actually do is try to figure out how all this data works for my application. So in order to describe what I do for a living, we need to talk about what data actually are. What are data? Data are facts and statistics collected together for reference or analysis. Let's take a minute and look at what those words actually mean. Facts is pretty self-explanatory. It's almost hard to come up with a definition for what a fact is, right? It's a truism: something that, at some point in time, meant the thing that it says and says what it means. It's hard to go any deeper than that without splitting hairs. Statistics are really just a collection of these facts, put through functions to look at numbers and trends, and we'll get to some of that later. And reference and analysis is really what I consider the data wrangling part of it: you collect this data, and then you have to make something meaningful out of it. These data are really all around us. There's all sorts of data.
There are site profiles, which are really kind of documents. Other documents too: PDF files, Word documents, Keynote files. This Keynote presentation is a series of data. The products in people's carts when they left your store, the bytes in and out on your bandwidth bill. And I like to think of data as a bit more of an abstract term. I'm a programmer, so everything kind of comes back to code: data are the sets of input and output for some function, and, even more mind-blowingly, the function itself is data. We can talk about that later if you want to come by the Happiness Bar. OK, I know what you're probably thinking: OK, smarty-pants, how about data science? You've probably heard the buzzword data science. Is it a buzzword? Here's a quote that I came across: "Data science could disappear, to be eclipsed by the next buzzword. But it may stick around, just like another incongruous joining of two words, computer science, did." And if you look really closely, you probably can't see it, that quote's from 2013, so it's already a couple years old. It's becoming a little less of a buzzword; we're forming actual foundations to do some really cool stuff with this glut of data that everybody's got. And hand in hand with that, data scientists get these fancy merit badges. You can go down the list, and if you go to a data conference, there's going to be a talk on each and every one of these things. Business intelligence is what most people are talking about when they talk about data science: looking at the process of your business, collecting data, deciding what to do next, and moving on. And you can go down the list: big data, unstructured data, NoSQL, IPython. All of this stuff is way beyond the scope of this talk, so moving on. Here's a quote from Nate Silver, and I added some emphasis on a particular word he chose.
Nate Silver, when asked in an interview, seems to think that data scientist is a glorified term for a statistician. And there's some truth to that. There's a lot of statistics involved, and we'll get into some of that a little later. But I kind of came up with my own definition of what data science is some time ago; I'll attribute this to myself because it just happened in my head. I was thinking that data science really is just science. Think back to fifth grade science: you've got your little notebook, and you're taking notes meticulously every step of the way. You're trying to reproduce an outcome and build a body of data to come to some conclusion. That was taught as the scientific method. So, like I said, whenever I'm trying to think about a data problem, I think about what the prior research is that I can build on. I take notes every step of the way and keep my notebook so I can go back and look at it later. And then I look for repeatable, verifiable results. When I was writing this talk, I was thinking about what the actual names are for these things that I do in my process. So first, you have a question that you ask. Every time you have a data problem to solve, it starts with a question. What are you asking your data? What are you trying to do? How many, or what percentage, of your users, or a cohort of your users, do this? Or do this and then that? Some sort of frame for the collection and the analysis of what you're trying to do. Something that's neat about using a data-driven approach is that often these questions, and the process of getting to the answers, will just lead to more questions. A lot of things in life are like that. But whenever you get to the answer of one of those questions, like how many people had something in their cart and then left your site, you can say, OK, well, how does that product compare to other products?
And how should I be targeting this? And if I was targeting it like this, how is it different? A lot of times these questions just keep coming up; it's job security, right? So after the question, it's often helpful to come up with a hypothesis. What do you think the data are going to tell you? What do you think the outcome is going to be? Why do you think that? Really think hard about why you think that, and then challenge every single assumption that you have. If there's something that you're basing your conclusion on, you really want your hypothesis to rest on solid ground. And if you're not absolutely sure that the previous work you're building on is true for every case, run a test on it. Challenge everything. After your hypothesis, of course, comes your procedure. This is where your actual process comes in: you're formalizing a plan to attempt to reject your hypothesis. That might seem a little counter-intuitive, because you're really trying to come to a conclusion, and your hypothesis is what you think it's going to be. But in science it's really easy to fall into what's called confirmation bias, where you think the outcome is going to be this, you see a model that supports it, and you run with it; it's all over the news that that sort of thing happens all the time. So instead, you try to disprove your hypothesis. And if you ever do disprove it, the hypothesis is obviously not right: form a new one, move on, iterate. Again, I'm a programmer, so usually the procedure comes down to: what do the data look like that I'm trying to get out of this function, and what do the data look like that are going into this function? Putting a shape to it, putting a size to it, and working it out as an actual process. Not necessarily in code, but as a step-by-step algorithm for how I'm going to look at it. We'll talk about this a little later. And then there's the data.
I mean, this is almost anti-climactic after you have a good plan. You just execute it, right? Step by step, and the data come rolling in. And that's all there is to it: you don't massage it, you don't clean it up at this point, you don't try to make it fit into anything. You just collect it. After that's done, you have a good set of data, and you try to form conclusions. It's almost an art form; it's hard to say, step by step, here's how you form a conclusion. But that's why you do all this other work. You have the prior research, you challenge your assumptions, and then at some point you can say: OK, the data support this, or the data do not support this, so we should or shouldn't do this. It's important that your results have statistical significance. That's a bit beyond the scope of this talk, but you should really look into your sample size, how many points you have. And there's a slew of sites that will do significance testing on your actual test to see if your conclusion is statistically valid. There's one called isvalid.org that a couple of guys I know worked on. You put in some variables and it will tell you, with a certain level of confidence, whether your conclusion is supported by the data. And in a conclusion, it's very helpful to come up with something actionable. When you're doing applied science, and this is applied science rather than pure science, it's because you have some business need, or someone's asking you for this data, or something's waiting on it, or you want to know what to do next. So whenever you're coming up with a conclusion, either try to come up with the next test to run, or say this performs better so we should move to it, that sort of thing. Always have some sort of action in your conclusion step. And then, just like anything else in the kind of software I work on, not all software is like this, but it's very iterative.
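As a rough illustration of what those significance-testing calculators are doing under the hood, here's a minimal two-proportion z-test sketch. This is my own simplified version for illustration, not isvalid.org's actual code, and the function name is made up:

```javascript
// Two-proportion z-test sketch: given conversions and totals for variants
// A and B, return the z statistic. Roughly, |z| > 1.96 corresponds to 95%
// confidence that the difference between the variants isn't just noise.
function zTest(convA, totalA, convB, totalB) {
  const pA = convA / totalA;
  const pB = convB / totalB;
  // Pooled conversion rate under the null hypothesis of "no difference"
  const pPool = (convA + convB) / (totalA + totalB);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / totalA + 1 / totalB));
  return (pB - pA) / se;
}

// Example: 200 of 1000 visitors converted on A, 260 of 1000 on B
const z = zTest(200, 1000, 260, 1000);
console.log(z.toFixed(2), Math.abs(z) > 1.96 ? "significant" : "not significant");
```

Real calculators also worry about how long the test ran and whether you peeked at the results early, so treat this as the gist, not the whole story.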
So you want to go through these processes, learn, plan, test, measure, and do it again and again and again. It can lead to some pretty deep understanding when you get five or six iterations deep into looking at a particular process, or at how someone's using your site. At this point, you can ask yourself some questions. Does any of the process need improving or tuning? Was it too slow? Did it take two weeks to get this data, and can we speed that up? Can anything be further automated? What future testing can be done? Is there any work that needs to follow; like I was just saying, is there anything actionable that needs to come out? Are there any more metrics that need to be tracked so I can come to a clearer conclusion, or was it indecisive? Do I need to start over from scratch? And with great power comes great responsibility. We've just gone through a process for running experiments on real live people sitting on the other side of the browser, OK? In an academic setting, any time you do an experiment on a live subject, there are formalized processes for making sure that ethical guidelines are followed, that animals are treated humanely, that humans aren't being subjected to ridiculous stressors in psychological experiments, that sort of thing. In data collection on the web, there's not really a hard and fast rule that you have to do this. So every time you run something like this, deeply consider what it's going to do to the people on the other side. Are you intentionally crippling your website to see whether a feature was bad, or whether it gets used? What if someone was relying on that feature? Facebook was just in the news, I don't know, eight months ago or so, about how they were intentionally trying to depress their users.
Now, I don't know if I buy that they were intentionally trying to depress their users, but I do think they were using sentiment analysis and trying to discern the actual mindset of users while they're using the site, because it's very useful information. They can drive the kinds of posts you see if they can figure out the frame of mind you're in when you log on to Facebook that day. So these are all things you need to keep in your head whenever you're creating an experiment that involves real people. And it's not just ethics; there's also legality at play. For example, if your website is viewed in the European Union, there are specific rules on what you can keep long term in your logs about what people do, the data you can collect and keep. You can't keep personally identifiable information. We'll talk a bit more about that later, but it's something you have to keep in mind. If you are collecting data, most sites will have a clearly marked policy on the kind of data they're collecting, what they're using it for, and what third parties are involved. All of that's really important; you don't want to get sued. You need to think about how you're storing it and how long you'll retain it. If somebody comes in and buys your web dev shop, do they have the rights to those data? Do they get lost? Are they supposed to be shredded? How does that all work? Having a plan for this sort of thing is important. And I already mentioned this, but how could the actionable insights your conclusions provide affect your users, past, present, and future? It's interesting to think that there is someone who has been using your site, and you're going to change something on them if you apply some sort of testing to them. Keeping that in mind is important. OK, back to what's actually going on whenever you're tracking data.
Most often you're doing what's called event tracking. At some discrete point in time and space, if you want to get technical, someone is clicking something on a computer or on their phone and interacting with your site, and that is an action. You're recording that, and there's a really rich set of information involved with every single action, every event in the universe. Event tracking is the practice of taking that information, distilling it down, and recording it as efficiently as possible. There are a lot of off-the-shelf tools and services to kickstart your data collection, and these are all that a lot of people need. If you don't need to do any custom, really fine-grained analytics, you can install Kissmetrics and get pretty detailed information on what all of your users are doing on your site. Same with Google Analytics, or Mixpanel on mobile. And Jetpack is at the top of the list. It's not really in the same category as a lot of these other things, but it does offer some high-level analytics for who's using your sites: your traffic, how many unique users, where they're from, which countries, that sort of thing. The APIs are all open, so feel free to download those data and work on them any way you'd like. Jetpack's open source; if you can see a way to fix it, let us know and we will fix it, or submit a pull request. Kissmetrics is the off-the-shelf offering that I've had the most experience with. We've used it to track individual usages of features: if someone did this, then did that, that sort of thing, to be able to offer reports. We'll talk a little bit about that later. I know I keep saying that. Let me stop. Does anybody have any questions before we move on? Yeah, so for Google Analytics, you'd still have to think about it for your regular server logs, if you maintain those. If you don't maintain those, then you don't need to think about it.
But with Google Analytics, with Kissmetrics, all of those things are really obscured from you. The way it works is that a web browser comes onto a web property running Google Analytics, and it runs some sort of process that assigns a unique ID for the browser, for the session, and it tries to tie that to a person as much as possible. It gets a little creepy that they're trying to tie this ID to a person, but that's how it works. So at that point, you don't need to think about it; it's all anonymized for you. You get aggregate stuff, and you get demographics for free out of that, but you don't get to say that this was this person on this computer that did this thing. They hide that from you on purpose. And most of the time, there's no reason to need to know that sort of thing unless you're doing logged-in user tracking. Good question. So these services all offer a lot of value, and we'll talk a little more about Google Analytics in a bit. Each of these would really merit its own WordCamp talk, so at this point I'm going to talk more about how they work behind the scenes. And that's pixel logs. Pixel logs are the ultimate source of truth for the bundle of events they contain. OK, so we talked about facts at the very beginning. Look at this fact; does this have a pointer? You probably can't see it anyway. There's an IP address at the left; I've just got 1.2.3.4 and some other information. There's a timestamp, a path, and then the query arguments really get to the meat of what the actual event is. This is just an example of what a pixel log would look like. Does anybody not know what pixel tracking is on the web? OK, somebody doesn't know, so: a pixel is a one-by-one image. It can be a zero-by-zero image, but traditionally it's one by one.
It's the little smiley face at the bottom of WordPress, if you've ever seen a teeny tiny picture of a smiley face. That's what it was originally intended for. It's put into the webpage so that when a browser loads the page, it pulls this GIF image from the server. Just like any other image, the browser pulls down the image and leaves a trace in the log. The intent is to tack all of this other metadata onto that request, so that in the log it looks like this: it's got an event name and then a segment. That can be "user commented on a blog post," but they did it on a Saturday, or they did it from a mobile app, or from a country that you're interested in, something like that. Basically, as deeply as you want to track, you can segment all the way down. This is just one line that's wrapped here on the slide, and pixel logs are usually lines after lines after lines of these things. On WordPress.com, with Jetpack stats, we get pages and pages and pages of these; I think they're two-and-a-half-megabyte files, and we measure them in the number of those files per second that we process. So it's a lot of data. If you think about this line of information as a fact, as an event, it's a record of some real interaction at an exact point in your user's story. A user comes onto your website, they do a thing, you think it's really important to record it, and you did so. So in the record, there's an identifier. In this case, the IP address and the user agent act as an identifier. That's the who: who did this thing. It has an event name; in this case I called it "fictional event name," plus the segment. That's the what: what happened, what they did. It has a timestamp, which is the when. And it often has a where.
Sometimes the where will be a country code, or even a locality code, or if you want to get really precise, you can have GPS coordinates in your log; those are pretty easy to get with free tools. But in this case, we've got an IP address. If this were a real IP address, it would be pretty resolvable to a set of GPS coordinates. So there's your location. You've got who, what, when, and where. What does that leave? Why. Why, and also how. We'll get back to that later. I keep saying that. Good thing I have a lot of time, right? So this is a front-end talk, but it's important to know, and useful to consider, that there are both front-end and back-end tracking mechanisms at play. What you usually think about with Google Analytics or Kissmetrics is front-end tracking. You have the GIF that gets loaded from the browser and never touches PHP, right? That's just your browser requesting an image. Then Google Analytics has a library that you load, and you feed events into it in JavaScript, and those get queued and sent off to the server; that's all in the front end. Doing front-end tracking in WordPress is kind of easy if you think about it: you enqueue your JavaScript library, whether it's Google Analytics or whatever, and you get some stuff for free when you do that. Like I was saying before, you get demographics, you get raw usage numbers, and they even know a little bit about WordPress, so they can see where some conversion stuff happens and you can get some reports for free. But for anything custom, you have to define the events themselves. If someone's going to click on step three of your checkout wizard and you want to know that, you have to tell Google Analytics that that's what you're interested in. There are GUI tools to do that, where you can go in and say, if this button with this ID gets clicked, then record an event, something like that.
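To make the front-end mechanics concrete, here's a hand-rolled sketch of the pixel approach described above: build a URL carrying the event name and segment as query arguments, then request it as an image so the hit lands in the server's access log. The endpoint, argument names, and button ID are all made up for illustration; services like Google Analytics wrap this same idea in their own libraries.

```javascript
// Build a tracking-pixel URL whose query arguments carry the event data.
// (Hypothetical endpoint and argument names -- not any real service.)
function buildPixelUrl(event, segment) {
  const args = new URLSearchParams({
    e: event,              // the "what": event name
    s: segment,            // the segment, e.g. "mobile" or "saturday"
    t: String(Date.now()), // the "when": client-side timestamp
  });
  return "https://stats.example.com/pixel.gif?" + args.toString();
}

// Fire the event by requesting the pixel, just like any other image.
function trackEvent(event, segment) {
  new Image().src = buildPixelUrl(event, segment);
}

// Wiring it to a checkout button, the way a custom GA event gets defined:
if (typeof document !== "undefined") {
  const btn = document.querySelector("#checkout-step-3");
  if (btn) btn.addEventListener("click", () => trackEvent("checkout_step", "step-3"));
}
```

The server never needs any special handling; the "who" (IP address, user agent) and the path arrive with the request for free, and the log line ends up looking like the example on the slide.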
Like I said, I'm kind of a programmer, definitely a programmer, so I shy away from that sort of thing. I like my stuff to be in code, in version control, and all of that. And there's another step if you're not defining it in the GUI web tool: you have to go in and actually set an event handler on the button click in JavaScript, and then use that to enqueue and send the events to the server. So that's a lot of steps. You have to know PHP and a little JavaScript to do it, and it's a little fragile if you think about it. To illustrate what I mean, think about when someone comments on a blog post. When are you going to fire the event, to register that the event happened as a factual thing that's true forever and ever? Are you going to do it when they click the button to post? When the browser gets the response back from the server that the comment was successful? When Akismet clears it of spam? There's a lot of logic you have to think about. That's an elementary example, but for anything more complex, there's a lot of logic to think about around when you're going to fire this event. So, to the rescue: back-end tracking. This slide is PHP code, and it's mainly to illustrate, you don't need to copy it down, that WordPress has a bunch of handy-dandy actions that correspond to when someone actually does something on your site. In the case of posting a comment, you have the comment_post and edit_comment actions, which get handled for you automatically by the WordPress software; all you have to do is hook a function into them and do your recording. You don't have to do any JavaScript; nothing happens on the front end. You don't have to worry about the logic of, well, what if they clicked post and it didn't go through, or they closed the browser before it happened, that sort of thing.
All of those things are considerations you'd have to take on if you were doing it on the front end, and here you don't have to deal with any of that. All the logic is handled for you, especially by comment_post, because if you look at the code, comment_post actually fires after the spam checking happens. So you can flag a comment as spam right from there; you don't have to wait and check back or anything like that. It happens right in the action. If you're new to WordPress actions, write down this Google search: "WordPress action reference." It goes to a Codex page, one of my favorite Codex pages, because it lists them all out in the order of a typical request. Actions are something that's really cool about WordPress, and if you don't know them, pick them up if you're into coding at all. All right, so the board is set, the pieces are moving. We have the who, what, when, where, but we need the how and the why. If you consider the types of events that you track, you can build a rich understanding of what's going on on your site based on what's important to your business, what's important to what you're tracking right now, to the tests you're running, to new products you're bringing on, new processes you're trying, new authors, all of this stuff. You can build some really cool things just based on what you decide to track. So think about what is really important to your business right now, and think about that iterative process: what are the data you can collect now that you can analyze tomorrow, without collecting everything? You don't want to collect everything. What's important? What's really important to have on hand? So, getting to the how and the why. Let's get to the how. This is really just an overview; again, you don't have to copy down anything that's going on here.
This slide is just showing how you have server logs on the left, from various servers across various data centers, and they go through some sort of process: they get extracted, transformed, and loaded into some other data store so you can query them. Querying a log file directly is possible, and there are tools, there have been tools around to do that sort of thing, but it's a lot of grunt work and a lot of esoteric, low-level knowledge at this point. This isn't the only architecture; there are a number of variations on it for how data flows through a system. It's really just designed to show how the data flows, hence the moniker: the data pipeline. It goes in one side and comes out the other in a different format. And the actions that a user takes, this is the process where you attribute them. In WordPress we have the concept of a user ID, so if you had some way in your log file to attribute an action to a user, like their user ID or their login name, this is where you would do it, and then you could query that Jeff Bowen did X, Y, and Z on your website. And now, down to the elusive why. Why is kind of where the money is. Again, it takes some intuition to be able to say why the data look like they do, why your usage patterns are like they are, why things slack off in December and come back in January. None of those things are really clear cut, but there are some tools to help you along your way. Here's a really awesome formula. You should definitely write this down; it's going to be on the quiz. This is the formula for the r value. In statistics, the r value is what you use for correlation. It describes a relationship between two data sets; it's a description of how dependent two variables are.
In other words: as the magnitude of value X changes, is there also a measurable change in series Y? There, blow that formula away. This is a much easier way to look at and reason about correlation. These graphs are what you call scatter plots. On the left, you have a perfect positive correlation, and it's a linear correlation, which means it just goes in a line. You can see the dots are all close to a line through your data; they form a pattern. Linear is not the only pattern you're looking for; there are also sigmoidal patterns through your data that look like an S, depending on what you're looking at. So on the left, those have a perfect correlation, everything's on a line; you can definitely say that these data sets are correlated with a high factor. Then they get looser and looser related as you go towards the right; the dots get further apart. If you look at the bottom, you've got 0.8, then 0.3, and as this tends towards zero, the dots just fill the space everywhere; you can't discern any sort of pattern. I think it's interesting to look at scatter plots like this because humans are really ingrained to find patterns in data. You can look at the one on the left and intuitively reason that those are on a line without a line actually going through it. But it's important to know that correlation does not imply causation. This one is from the Spurious Correlations website; there are a bunch of these, and my favorites are the ones with Nicolas Cage. Per capita consumption of mozzarella cheese correlates pretty strongly with civil engineering doctorates awarded. If you look, you could draw, not really a straight line, but they kind of form a line with one another. And it's really meaningless. They're two distinct, discrete data sets that have nothing to do with one another.
There's some other process at play there, and when you're looking at data sets, it's your job not to get in the practice of saying that this thing causes that other thing, or that you know this is related to that, unless you're very sure. It's traditionally very, very difficult to prove causality with data. There's some recent work called the additive noise model. It basically says: you have two data sets, and there's a concept of statistical noise; statistical noise is like an outlier, something that doesn't fit the model. The theory people are writing about now is that if you manipulate the noise on one side, you can see an artifact of that on the dependent variable. What that means is that if you fiddle with things here and you see things change over there, there's a causal relationship in that direction. And vice versa: if you fiddle with the statistical noise on this side and you don't see something on the other side, you can tell that the arrow goes the other way. That's very cutting-edge, and it's not really proven yet. So unless you are a deep, deep statistician, you want to shy away from saying that something is the cause of something, or that you absolutely know this is why this is happening. You can just say that the data support it, these are the actions I think we should take, and these are the steps I'm going to take to validate that that was the right step to take. Any questions on correlation before I move on to trends? OK, trends are another way to look at your data, and like I was just saying, trends are a way to validate some of your conclusions. If things continue on a trend, or you change a trend, you can use that as supporting evidence that the conclusion, the causality, the relationship you're trying to prove, is more likely to be correct.
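For the curious, the r value behind those scatter plots can be computed with a short function. This is the standard Pearson formula as a sketch of my own, not code from any particular library:

```javascript
// Pearson's r: +1 is a perfect positive linear correlation, -1 a perfect
// negative one, and values near 0 mean no discernible linear relationship.
function pearsonR(xs, ys) {
  const n = xs.length;
  const mean = a => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    cov += dx * dy; // how the two series move together
    vx += dx * dx;  // spread of the x series
    vy += dy * dy;  // spread of the y series
  }
  return cov / Math.sqrt(vx * vy);
}

pearsonR([1, 2, 3, 4], [2, 4, 6, 8]); // points on a line: r is exactly 1
```

And, as the mozzarella-and-doctorates chart shows, a high r only ever tells you the two series move together, never which one, if either, drives the other.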
So if you look at this data set, this is a feature launch on WordPress.com. You probably can't see it all the way at the back, but it starts out with little ticks here, that's internal testing; then it ramps up to a portion of views for a beta test, and you can see it goes up; and then it opens up to everybody. This is what you call time series data. If you look at the bottom axis, everything is a date. This is unique users of this feature for each particular day in the time series, plotted on an X-Y graph. The green line and dots represent the measurements for the given day. The blue dashed line you can see running underneath is what you call a trend line, and this one is a trend estimation run off a process called a seven day moving average. It's kind of rudimentary as trends go, but you can see it pegs the tendency of this data. These ups and downs that you see, the oscillations, correspond to weekends. You can see they dip down on Saturday and Sunday, interestingly a little more on Sunday, then up on Monday and back down; but the trend line smooths that out. You get a better idea of what you're seeing. Average is a funky word to use with statistics, but you can tell what the expected usage is over a period of time, and it even lets you form a little bit of a predictive model: if you look at the right side, you can guess what's going to happen if nothing changes, that it's probably eventually going to level out. All of that is guesswork, but it's a rudimentary predictive model. And there are all sorts of trends; you can get really, really deep into the statistics of trends. Like I said, this is a seven day model; if you were to change it to a three day model, you would start to see the dips in the average.
It's a little bit straighter because it's spread out over seven days, and the way that works is you take the value for each of seven days, add them up, and average them; you get the arithmetic mean. Then you do the same thing for the next day, over and over, and the oldest day falls out each time, so it smooths over time. Yes, exactly. The question was whether I used the seven-day model because I noticed the drops in the weekly pattern. Exactly. Whenever I was building this, I played with the number of days; I wanted it to be smooth while still fitting a realistic view of the data. If I were to spread this out over a month, it wouldn't really mean anything, right? It would just be a straight line. This gives us an idea of what our data has tended to do in the past week. Does that make sense? There's a pretty common metric people use in analytics called 2D7: how many times in the past seven days has a user logged into your service? That's a pretty good engagement metric. This is a very simple funnel, and if you haven't heard the word funnel before, you can think about an actual funnel you use to pour liquid into a jar or something like that: big on one side and small on the other. It's a little different, though, because a traditional funnel you hold up and everything eventually comes out. The idea of a funnel in analytics is that a lot comes in one side and only a smaller amount comes out the other. So this is a series of events; it's visualizing the percentage of a population that did one thing and then did the next thing. Visited your store page, put something in your cart, and then the next step in the funnel could be proceeded to checkout, and so on, until they actually converted into a purchase.
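The store funnel just described might be tallied like this; the step names and counts are made up for illustration.

```python
# A minimal funnel sketch: for each successive step, count how many
# users from the starting population completed it, and report what
# percentage of the top of the funnel is still left at each stage.

steps = [
    ("visited store page", 1000),
    ("added item to cart", 400),
    ("proceeded to checkout", 220),
    ("completed purchase", 150),
]

top = steps[0][1]   # population entering the funnel
for name, count in steps:
    pct_of_top = 100.0 * count / top
    print(f"{name}: {count} users ({pct_of_top:.1f}% of top of funnel)")
```

With these invented numbers, 15% of the people who visited the store page ended up purchasing; comparing that bottom-of-funnel percentage between variation A and variation B is the comparison the talk describes next.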
So you would have a funnel that models that, and if you wanted to get really deep, you could put funnels on individual tests: how many people in your test population visited the test version of the site, then the test version of the store, and so on and so forth. This is really interesting for keeping track of the performance of your various features over time. I use a store as the example a lot because I think it might be relevant to more people here, but it's really useful for any sort of feature tracking, like a new feature we launched on WordPress.com. So if you wanted to see whether variation A or variation B was more effective, and you find something that lets more people through your funnel, that's probably the direction you want to go. But you want to be very limited in your scope; you don't want one funnel for your whole entire site for one particular change. You want to target it to the test at hand. So if you're changing the string on a button, or you're changing the layout in a very particular way, and you're testing that very particular change rather than launching a whole new site, you want to target the funnel to that particular item so you can say with confidence that this version outperforms that other version. Does anybody have any questions about funnels in general? One of the last statistical tools I'll talk about is cohort analysis. A cohort is just a group of some users from your population. This is kind of hard to see up here, but the key says that each of these colors corresponds to a month of retention: month zero retention, month one retention, month two retention, on up. The users in each of these groups started at the same time as each other, but each group is a different group. Yeah, I don't think that's right on the bottom.
So there are actual dates at the bottom, but you would say first month, second month, third month based on the date each cohort started. I would say this group started in this week, that group started in that week, the other group started in the other week, and how did they last over time? It really lets you focus your tests on what a user is seeing on your site at a particular time, because you don't want to test a feature against users who came on when your layout was significantly different than it was two weeks ago. A test of this button text isn't going to be valid if the button is also over here, hidden behind a couple layers of UI. So you really want to make sure that your cohorts are seeing as similar an experience as possible. That'll help you focus in on changes at a given period of time. And sampling: this isn't really that big of a concept. It's pretty easy to think about. If you have a bunch of data coming in and it's impractical to record all of it, you need to come up with some sort of plan for recording the portion of it that you need. You want to be fair, you want to be random, as random as a computer will let you be, but still be able to come to some conclusions. I mentioned statistical significance earlier: you want to make sure that your sample size is big enough that you're going to be able to make some claims from the results. And you want to make sure that you're sampling the same way over and over. Going back to fifth-grade science class, I remember that in science experiments you're supposed to do things three times. It's not really a hard and fast rule in science, but you are supposed to make sure that your results are reproducible, and part of that is having a sampling procedure that is also reproducible, and reproducibly random, each time.
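One way to make a sampling procedure reproducible is to bucket users by their ID rather than rolling dice; this sketch uses the last digit of a numeric user ID, with the percentages and IDs purely illustrative.

```python
# A minimal sketch of reproducible sampling by user ID: users whose ID
# ends in 0 (id % 10 == 0) form a stable ~10% sample, because last
# digits are roughly uniform across sequentially assigned IDs. The same
# user always lands in the same bucket, so the sample is reproducible.

def in_ten_percent_sample(user_id):
    return user_id % 10 == 0

def in_fifty_percent_sample(user_id):
    return user_id % 10 < 5   # last digit 0 through 4: half of the users

sample = [uid for uid in range(1, 101) if in_ten_percent_sample(uid)]
print(len(sample))   # 10 of the first 100 user IDs
```

The caveat is that this is only as fair as the ID assignment: if IDs correlate with signup date (they usually do), a single-digit bucket still mixes old and new users evenly, but you should not reuse the same digit for every experiment or the same 10% of users sees every test.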
So something I like to do, again going back to WordPress user IDs, is what's called a modulus: just take the user IDs that end in zero, or something like that. The user IDs that end in zero through four would be half of your users; the user IDs that end in zero would be 10% of your users. So you don't have to do any extreme math; you can just say, does this user ID end in zero? Okay, they're a member of this sample. And this is one of my last slides, kind of a tangent onto the ethics topic. Whenever you're listening to your users and trying to discern what they're telling you, you're doing an experiment, and you're often doing it to try to drive engagement, to push one behavior over another, right? And here's where the ethical questions come in: you have to think about what it's going to take to push the user from the left side, where the triggers fail, to the upper right side, where the triggers succeed. This is a behavior model from BJ Fogg, a college professor, I think at Stanford; it's at behaviormodel.org. What it basically says is that the probability of any behavior happening on your site is the product of motivation, ability, and a trigger: B = MAT. M is the motivation: what's getting them to click that button? Why would they do it? Are they looking to buy that product on your site? Are they looking to create an account? Are they just responding to a comment, or clicking like? What's their motivation? A is their ability: how easy is it for them to do this thing? Does it cost them money? Does it cost them effort, cognitive or otherwise? If the button's right in front of them and they click it, that's really easy; they have the ability to do that. Typing a key on the keyboard is pretty easy, but if you hide the action five layers deep in your UI, that lowers their ability to do it. And then triggers.
A trigger is just that, at some point in time, you put the action in front of them. You have to have all three of those things for the behavior to occur: the user has to have the motivation, they have to be able to do it, and then you have to prompt them to do it; the button has to be there. And this is a graph of what that looks like: on the left side the action doesn't happen, on the right side it tends to. So it's interesting to think about pushing your users over this line. And don't be evil. That's a pretty excellent slogan, right? I'll just end with a quote here that I found on Airbnb's blog: "A good data scientist is therefore able to get in the mind of people who use our product and understand their needs." And I think, all the statistics aside, like I said, humans are really good at looking at patterns. So put the stuff together, make a visualization, look at it, see what makes sense, and just keep experimenting. Thank you. Yes. Am I doing qualitative research? Excellent question, I was hoping somebody would ask that. Qualitative research is really, really helpful. Surveys are awesome: if you're creating a new feature and you want to get some user feedback, surveys are great. User testing videos are super awesome; there are sites where you can just submit your alpha product and watch somebody react to it. That stuff is invaluable. And I think qualitative information is really good at helping you build on a process; it's useful in validating assumptions, like that users are going to like something, or that an interaction is going to be intuitive. Maybe people have a hard time because a dropdown is in the way of your text or something like that, and you just never thought about it; you never looked at the screen size they're using. Qualitative research is great. Did that answer your question? Cool. Thanks everybody.