Hi everybody. This is Fabrizio Salzano, presenting for Product School. Let me start by introducing myself quickly. I'm clearly an Italian professional, as you can tell from my accent, a computer scientist by background, and passionate about e-commerce and A/B testing. I also recently became a proud daddy too.

Let me skip quickly to my experience. I've been a travel professional for more than 16 years, currently working at Booking.com. Prior to that, I was part of the Expedia company. I've run more than 500 A/B tests, and that's why today we're going to talk about why you can't always A/B test. Or, as I say in the line, a quick guide to A/B testing and more.

Now, let me start by saying that this will be a very practical presentation for those looking for an intro to A/B testing, and its possible alternatives if A/B testing is really not on the table.

Everything starts with a definition, or a common-sense-based definition, as I say, of A/B testing. A/B testing is, at its core, a quick way to check in with your users to see what they really think about a feature or a change that you came up with.

Why A/B testing? There are at least three reasons I can think of. The first one is that nobody can predict the future, not even your boss. The second is that it is a good way to democratize learning and decisions. What I mean here is that giving exposure to how users react to your change enables everyone, regardless of their pay grade, to make a decision based on the data. And the third is that it also enables you to have progressive learning, given the nature of A/B testing and its small-steps approach.

Let me tell you what the building blocks of A/B testing are. There are at least five main building blocks: the hypothesis, the KPIs, the type of test, the setup, and the runtime or sample size. Let's go through them one by one, starting from the hypothesis.

The hypothesis is composed of four elements. What do you want to change? You can say those are the features that you are testing.
Who is exposed to the change? Your users, or a user segment more specifically. What is the effect on the user that you expect, or the behavioral change? And last, but not least, how do you measure the impact of the change? Typically with KPIs.

Let me give you an example. Let's say that you are adding a shopping cart on every page. Your users are all the returning customers. The behavioral change will be for them to find with more ease the visited and bookmarked properties. The KPIs that you probably want to test are an increase in bookers, multi-destination bookings, and cart interactions. This is one way a hypothesis actually looks.

Second is KPIs. You can characterize KPIs in three buckets. Primary is the one that will actually help you in making a decision. Second is supporting. Usually those are behavioral KPIs: how the users are interacting with your product. And last, but not least, health metrics. Usually those are the metrics related to the system.

Now, here is the first tip. It is important, in my opinion, to define all of the above for every experiment that you run: primary, secondary, and health metrics. And try to agree on them, and keep that agreement and this consistency within your organization. So do not keep changing your KPIs every time unless it's strictly necessary. This will make your strategy much, much stronger.

Let me give you an example. Again, the case of the cart. The primary, the business metric, will be net bookers. The secondary or supporting metric is the number of users accessing the cart, so that's a behavioral metric. And the health metrics in this case are, let's say, system metrics: what is the uptime, or the page load.

Let's go on to the third block, which is the typology of tests. This is a bit of a technical topic, but what is really important for you to know is that there are three types of tests. A superiority test, which means that you expect your change to be better than what your customers are currently exposed to.
A non-inferiority test, which means that your change will not be worse than what you're offering to the customers. Sometimes you also define by how much it will not be worse, so what is the acceptable cost. And, for your knowledge, the third typology is the equivalence test. An equivalence test is when you say it's not better, but it's not worse. It's really not as frequent to run this type of test.

Now, the tip of this section is the following. Do choose a non-inferiority test for those changes that need to happen. So think in terms of changes with strategic value. In technology, there is a lot of re-platforming. If the re-platforming needs to happen, there is no point in hoping for a superiority test. A non-inferiority test will do. Otherwise, always go for a superiority test. The reason being that I can't imagine why you shouldn't offer a better experience, a better product, every time, over time, to your users.

Number four, test setup. This is the bulk of the test. Here you can define how many variations of your change you want to offer to the customers, which cohort of users you're tracking, et cetera, et cetera.

A couple of tips here. First of all, beware of false positives. A false positive means that you see an impact where there actually is none. So your metrics turn out to be conclusive in one direction, the positive direction, although it's false. It's not true. Now, this is much more common if you keep running consecutive tests against the same base, not changing the base, and keep testing small variations of whatever your feature is. So it's much better to pack all the variations into one experiment and run it, if you can, if you have the power, as one single experiment. If you don't have the power to run this kind of experiment with many variations, then choose one variation and stick to it.

The second tip is to design the test for decision making, and not only based on what you can do.
So basically, what this means is that, based on the data that your test will produce, you should be able to really know what to do next.

Now, last but not least is runtime and sample size. It's important to really calculate how many users will need to be exposed to your change in order to detect a significant impact on your main KPIs. The tip here is to always, always, always calculate the minimum sample size. It is not true that more data is always better. And there are plenty of online calculators to do that.

Now, there are some instances where you actually cannot A/B test. Or let me put this differently: there are some instances where you should not A/B test. The first case is when you are introducing too much friction to your users. Think of B2B users. In our case, we work in the accommodation business. In high season, usually in summertime, our partners are very busy with operational efforts. Distracting them or disrupting them with A/B testing is usually not a good idea.

Another reason not to A/B test is when you test for the sake of testing. So you don't have a clear hypothesis, or you have no idea how to use the test results. This goes back to designing tests for decision making.

And the last is for moral reasons, right? You should not be A/B testing a dark pattern. Think of the situation of hiding valuable information from your users. But I would say you should never apply this kind of change in general, not only in A/B testing.

Now, what if you actually cannot A/B test? I can think of a few reasons. The most obvious: you lack the testing platform, or the test itself is too expensive. Sometimes testing some features can be really expensive for a business, so they cannot afford it. The second most common is really when you don't have enough users, so a small sample size.
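To make the sample-size arithmetic concrete, here is a minimal sketch of the kind of calculation an online calculator performs. It assumes a two-sided superiority test on a conversion rate, a normal approximation, and two equally sized groups. The function names, the 5% baseline, the 10% relative lift, and the traffic numbers are illustrative assumptions, not figures from the talk.

```python
import math
from statistics import NormalDist


def min_sample_size_per_variant(baseline_rate, relative_lift,
                                alpha=0.05, power=0.80):
    """Approximate users needed in each variant to detect the given lift.

    Uses the standard normal-approximation formula for comparing two
    proportions with a two-sided test (hypothetical helper).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def runtime_days(n_per_variant, daily_users, variants=2):
    """Days needed to expose enough users, assuming steady daily traffic."""
    return math.ceil(n_per_variant * variants / daily_users)


if __name__ == "__main__":
    n = min_sample_size_per_variant(baseline_rate=0.05, relative_lift=0.10)
    print("users per variant:", n)
    # Same test, two hypothetical traffic levels: the smaller site waits far longer.
    print("days at 10,000 users/day:", runtime_days(n, 10_000))
    print("days at 500 users/day:", runtime_days(n, 500))
```

Under these assumptions, the same test that finishes in about a week on a high-traffic site stretches into months on a low-traffic one.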
The implication of having a small sample size is that your test might take too long, months maybe, to reach conclusions, or maybe never. The third: you don't have the product yet, so you have nothing really to offer. And last, though it is very uncommon, is that the change you have in mind is really not a good fit for the testing methodology.

Now, what are you going to do if you fall into any of these buckets, or have other reasons that push you not to use A/B testing? Let me introduce you to the pyramid of evidence. What the pyramid of evidence refers to is the fact that, based on the type of methodology that you choose to produce evidence, you can actually stack them up, and each layer will be qualitatively better than the layer before. So, for instance, we know that a randomized test, or A/B testing, is usually superior to an expert opinion, but this is kind of intuitive.

Now, another takeaway from the pyramid of evidence is that not only is there an increasing quality with each methodology, but there is also usually, in my experience, an increase in cost. If you work in the tech industry, gathering an expert opinion usually costs nothing. Running a randomized test usually costs money, or at least the investment in the A/B testing infrastructure and, you know, the time of the developers.

Let me then walk you through each level to give you some alternatives in case you don't have A/B testing. Let's not discuss meta reviews, because a meta review is actually reviewing a set of randomized tests. Very few companies can afford to do that. It means having a lot of A/B tests of the same typology around the same topic. So I will not comment on that.

So let's start from randomized tests. So, A/B testing, what we talked about. This is actually where Booking.com sits, and what it is actually a champion of. Here are a few things to keep in mind when you think about A/B testing, or, you could say, best practices.
It is a good idea to always test a single change at a time. The reason is that if you test multiple things at a time, you don't really know what the contribution of each change is to the KPIs that you're measuring. The second, again good practice, is to focus on fast delivery. This means rolling out the test quickly and keeping the test small, so that you can learn fast and then iterate on the learnings from the test. And this also introduces the third tip: iterate continuously. Keep iterating to keep improving your products, but beware of false positives. We mentioned what a false positive is, and they're way more common than you think.

Now, slightly lower in quality than the randomized test, there is cohort research. What is cohort research? Cohort research means that we are investigating how different cohorts, which you can think of as different groups of users, multiple groups of users with similar attributes, behave with regard to a specific feature. As an example, think of two cohorts, maybe users who like oranges and users who like something else. You isolate those two cohorts and you analyze how they behave with regard to buying fruit as a feature.

Now, the word research usually implies that this is an observational study. You don't really experiment on the users, you're just observing how they behave. And it typically comes without randomization, so you're not really randomizing how the cohort is composed. You're actually following a specific cohort of users that has specific attributes, so it's actually quite biased. So, again, there is no intervention or experimentation, just observation, and no randomization. You can identify what is called correlation between the user and the behavior, but not really causation.

Now, what does this mean in practice? How can you run cohort research? Usually, in an e-commerce business, by running user research interviews.
So you identify your bucket of users and then you ask them questions about a specific feature or behavior that they are showing.

Now, one step below that, there is case research. Again, the word research means it's an observational study. Here you can think of an individual user, or a group of individuals, that submits a specific report. These are also called case reports. Again, it is a little bit lower in quality than cohort research. First of all, in volume: it's not groups of users but single users. It has the same attributes: observation, no randomization. An example of case research, again in e-commerce or in many businesses today, comes from customer service tickets or guest reviews. By analyzing them, you are actually researching through those cases to mine some learnings.

One step below, and this I think is as obvious as it gets, is the expert opinion. Now, expert opinions: don't dismiss them too quickly. Of course, you have to keep in mind that they can be heavily biased, because we went from randomized tests, to groups of users in cohort research, to single reports in case research, to really a single user when asking for an expert opinion. It is a fire starter, not the end. So it is good for triggering questions, but I wouldn't say it is the end destination. An example, again spelling out the obvious: you could interview or engage with a business or craft expert who is promoting an idea. I do think that they have their place and their time, if you find the right individuals.

Now, you could argue that what the pyramid of evidence is saying is that the higher you climb, the better the view. Is that the case? Then everybody should do meta reviews, if you want, or randomized tests. Why would you ever go to the cohort, case, or expert level? There are at least a couple of reasons, which we also mentioned during this talk. The first one is that A/B testing can be quite expensive. Again, as we said, the platform itself or the test.
The second reason is that A/B testing is not always advisable, nor technically possible. And this goes back to what we were saying about when not to A/B test, or where you are not able to A/B test.

Now, what is the pyramid of evidence really telling us? It is telling us that every approach has downsides and limitations. How you deal with these downsides and limitations is up to you. What this means is that you can choose the right approach based on your context: what is the level of technology and the level of quality that you aim for and can afford?

The second thing, which is less obvious, is that ideally you should walk the entirety of the pyramid of evidence, bottom to top. Think of combining every layer, again bottom up, to build knowledge. So you can think of progressively investing in a higher level of quality or, you know, technology investment, starting from an expert opinion. This could then give you a lead into maybe starting some case research. The case research could potentially lead you to identifying cohorts of users that show the behavior that you're looking for. And the insights from cohort research could lead you to design an A/B test that gives you the ultimate form of validation.

Now, one last remark before closing this talk, and you might have heard this left and right: correlation doesn't imply causation. Correlation means the two events happen at the same time. Causation means that one event is causing the other. I always love to use this picture. We have this cat sitting on this roof, which is bent exactly in the spot where the cat is. Clearly the cat didn't produce the bend in the pillar. The point is that there is correlation, the cat and the bend are showing at the same time, but there is no causation, so the cat didn't cause the bend. Why I'm saying this is because of the alternatives to A/B testing that I talked to you about.
There is correlation, so there is an indication of two events happening at the same time, but there is no causation: nothing says that one event is causing the other. So be mindful, in every non-randomized methodology that you apply, not to make the mistake of confusing correlation with causation.

And that's all, folks. Thank you very much for your time.