Hello everyone, and welcome to another episode of Code Emporium. Today we're making another video on Bayesian testing. I illustrated the main ideas in my last video, where we compared hypothesis testing with Bayesian testing and talked about why Bayesian testing is potentially more interpretable and easier to read, and how its results map more directly onto probabilities as we intuitively understand them. I'm going to repeat certain concepts from that video so that everything flows into the current topic: Bayesian testing for continuous variables.

So let's get started. First, we define our experiment. We're working with a data set, and I wanted to make sure there's at least a story around it, because there's always context to data. Say you're running an e-commerce company and you want to test some changes you made to a page: an algorithmic change, a small button change, or some visual changes. The control group is the half of your users who get the old page, the current version; the treatment group is the other half, who get the new page with the changes. For now, the primary metric we'll track to decide whether to accept or reject these changes is purchase conversion.

Purchase conversion is technically a binary metric for a single user: you either convert or you don't. But aggregated over a group of users, it becomes a value between zero and one: the number of converted users divided by the number of exposed users, which is what we're seeing right here.
Exposed users are users who saw the page during the experiment window; the fraction of them who actually converted gives us a number between zero and one, and that's purchase conversion. We're using that metric to decide whether or not to move forward with the changes. This first notebook uses Bayesian analysis to say: if the purchase conversion of control is this much and of treatment is that much, there might be a difference, and we can quantify how confident we are in that difference. In the next notebook, which I'll get to soon, we add a metric that isn't just binary: instead of purchase conversion, something like GMV, i.e. pricing information. How would we use Bayesian analysis to assess the difference between control and treatment in that situation? That's what we're covering in this video, so let's get to some good old-fashioned code.

In this second section we load the A/B test data, which I've imported from a CSV, and you can see a few columns and sample rows. Every row is an exposure: we have a user ID, the timestamp at which the user was shown the page, the group ("treatment" if they got the new page, "control" if they got the old one), and a converted flag that starts at zero and flips to one if the user makes a purchase within seven days of their first exposure. Hopefully the data is straightforward to understand. The data set contains three weeks of data and almost 300,000 users split between treatment and control, so we have quite a generous data set here.
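As a quick sketch of that aggregation — using a hypothetical miniature of this schema, since the actual CSV isn't shown here — group-level purchase conversion is just the mean of the per-user converted flags:

```python
import pandas as pd

# Hypothetical miniature of the exposure data: one row per exposed user.
df = pd.DataFrame({
    "user_id":   [1, 2, 3, 4, 5, 6],
    "group":     ["control", "treatment", "control", "treatment", "control", "treatment"],
    "converted": [0, 1, 0, 0, 1, 1],
})

# Purchase conversion = converted users / exposed users, per group.
conversion = df.groupby("group")["converted"].mean()
print(conversion)  # control ≈ 0.33, treatment ≈ 0.67 for this toy data
```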
Now, I noticed that some users were double exposed: they received both the new page and the old page, probably due to some technical glitch in the systems. I assessed that only about 1.34% of users were exposed to both, so I just decided to remove them; it won't affect the analysis much, and we now have a clean data set. I've also added a week column, which simulates how an experiment is actually monitored: you don't build a dashboard and then only look at it once, at the end of four weeks. You check it at different intervals during the experiment. So later on I'll take only the exposures that happened in the first week and see what the results looked like at that point, then look again with two weeks of data, then with three, and so on, so we can watch how these numbers change over time. That's it for the setup. The next section starts by illustrating the frequentist approach and comparing it to the Bayesian approach.
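Here's a sketch of that cleanup step, assuming hypothetical column names (`user_id`, `group`, `timestamp`) since the notebook itself isn't reproduced here:

```python
import pandas as pd

# Hypothetical exposure log; user 1 was double exposed (appears in both groups).
df = pd.DataFrame({
    "user_id":   [1, 1, 2, 3],
    "group":     ["control", "treatment", "control", "treatment"],
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-09", "2021-01-16"]),
})

# Drop every user who appears in more than one group.
groups_per_user = df.groupby("user_id")["group"].nunique()
double_exposed = groups_per_user[groups_per_user > 1].index
df = df[~df["user_id"].isin(double_exposed)].copy()

# Week number relative to the experiment start, for interim snapshots.
start = df["timestamp"].min()
df["week"] = (df["timestamp"] - start).dt.days // 7 + 1
```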
With the frequentist approach, we state hypotheses. Since purchase conversion is a binary metric, we'd probably use something like the chi-square test, where the null hypothesis is that conversion is independent of group assignment. You know how hypothesis tests work: at the end you get a test statistic and a p-value, and at a 0.05 significance level you either reject or fail to reject the null depending on whether the p-value falls below that threshold. But there's a bit of a problem here. In this case the chi-square statistic is 1.128 and the p-value is 0.288, well above the 0.05 level. What that technically says is that, if the null were true, there's a 28.8% probability that a chi-square value more extreme than 1.128 would have occurred by chance. Now, maybe the chi-square test just isn't the best test to use here, since it's a fairly weak test in general. But even with a stronger test, the probability is stated in terms of the test statistic itself, and we'd have to map that back to purchase conversion, which is really the metric we care about. Doing that mapping, it gets a little confusing: what is the actual lift in purchase conversion between control and treatment? It's hard to interpret, and that's what I've summarized in this figure. In the frequentist approach, we feed in control and treatment data plus a null hypothesis and get back two scalar values, from which we can make a statement like the one above. With the Bayesian approach, instead of a hypothesis we input a prior: a distribution encoding our knowledge about purchase conversion from before the experiment.
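For reference, a chi-square independence test like the one described looks like this; the 2×2 contingency table below is made up for illustration, not the video's actual counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = (control, treatment), cols = (converted, not converted).
table = np.array([[1600, 148400],
                  [1550, 148450]])

chi2, p, dof, expected = chi2_contingency(table)
# Note: the p-value is a statement about the chi-square statistic,
# not directly about the lift in conversion, which is the interpretability gap.
print(chi2, p)
```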
The output of the Bayesian experiment isn't two scalar values but two posterior distributions, and once you have two distributions you can do whatever you want with them. We can make statements like "we are x% confident that the lift is y", i.e. the probability that the difference between control and treatment is, say, two percent is such-and-such. Because that's stated in terms of the actual metric of interest, it's so much easier to communicate results and understand what they actually mean. That's what makes Bayesian testing so powerful and so important to understand — hence the point of this video.

In the Bayesian approach section, I illustrate each of these components: how we actually go about constructing the prior, the posterior, and so on. First, the prior. The prior is a distribution over purchase conversions, the metric we care about. Note that purchase conversion is a probability, a value between 0 and 1, and a distribution of probabilities is best modeled with a beta distribution. There's an entire resource on understanding the beta distribution, which I'll link below; it's a really good read for understanding what the distribution represents and how you can update it to create a posterior distribution ("posterior" meaning: after you see the data, what distribution do you have now?).

Now, for the prior data: in our current experiment I only want to use the first week of data. In an actual working environment you'd already have this prior information from before the experiment.
Since this is a canned data set, my only way to construct a prior is to use the initial data itself: I'll take the first week of control data and use it to build the prior. What we do is sample from that prior set: draw a thousand users, compute their purchase conversion, and repeat ten thousand times. That gives a list of ten thousand numbers between zero and one, and once you have numbers like that, you can fit a beta distribution to them, which is exactly what we do here. The num_weeks parameter basically says: only look from the beginning of week one to the beginning of week two — technically one week of data — because we want to see what the experiment would have said after one week of running the test. The summary statistics show that after one week, the difference in purchase conversions is 0.009, with control slightly higher than treatment.

Next, I use the prior I constructed to create a posterior distribution: a distribution constructed after seeing the data. Essentially, I update the alpha parameter by adding the number of users who converted, and the beta parameter by adding the number of users who were exposed but did not convert. There's quite a bit of mathematics behind why updating a beta prior this way yields the posterior, but like I said before, I think it's best to defer to the blog post linked here, which does the same kind of thing while modeling batting averages in baseball, which are probabilities just like purchase conversion.
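A sketch of that prior-and-update step, with synthetic first-week data standing in for the real CSV (the base rate of 5% and the exposure counts are invented for illustration):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(42)

# Synthetic "first week of control" conversion flags (5% base rate is made up).
prior_week = rng.binomial(1, 0.05, size=50_000)

# Bootstrap: 10,000 resamples of 1,000 users each -> 10,000 conversion rates.
rates = rng.choice(prior_week, size=(10_000, 1_000)).mean(axis=1)

# Fit a beta distribution to those rates (location fixed at 0, scale at 1).
prior_a, prior_b, _, _ = beta.fit(rates, floc=0, fscale=1)

# Posterior update after observing the experiment week:
# alpha += conversions, beta += exposures that did not convert.
exposed, converted = 100_000, 5_100
post_a = prior_a + converted
post_b = prior_b + (exposed - converted)
```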
The blog post derives the posterior exactly — the math at that link explains why this update rule holds — so give it a read. Now that we've constructed our posterior distributions, we can sample. We have two distributions, so we can do whatever we want: draw a thousand samples from the treatment posterior and a thousand from the control posterior and start comparing them. And lo and behold, comparing them finally gives us the probability that treatment is greater than control: the mean of the indicator of whether a treatment sample exceeds the corresponding control sample reflects exactly that probability. We see there's about a 32.6% chance that treatment beats control, which makes sense.

Now, say we want to run this test for another week and see how these numbers shake out. Change this 2 to a 3, so this line takes weeks two and three as the data. We get new numbers; control is still slightly better off, but when you execute this, the probability that treatment beats control is even lower. Essentially, the distributions are getting narrower and more precise, so it's actually easier to quantify even smaller differences — which also owes to the large amount of data this data set has to offer. I can say the distributions are getting narrower because I can see the variance: in the previous case it was 2.24 × 10⁻⁶, and now it's literally half of that.
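The comparison step might look like this sketch; the alpha/beta values below are invented, whereas the real ones come from the prior-plus-data update described above:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(7)

# Hypothetical posterior parameters for each group (prior + observed data).
control_post   = beta(a=50 + 1_100, b=4_500 + 98_900)
treatment_post = beta(a=50 + 1_080, b=4_500 + 98_920)

# Draw 1,000 conversion rates from each posterior and compare them pairwise.
c = control_post.rvs(size=1_000, random_state=rng)
t = treatment_post.rvs(size=1_000, random_state=rng)

# Fraction of draws where treatment beats control ≈ P(treatment > control).
p_treatment_better = (t > c).mean()
```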
With statements like "the probability that we're seeing a 2% lift is 6.3%" — which is very, very low — you can see how much easier it is to conclude that the treatment isn't really increasing purchase conversion, and that we shouldn't move forward if that's our main interest. So that's a good summary and overview of what we talked about in the last video, and of how to assess things with Bayesian testing when purchase conversion is your primary metric.

But say that for this situation we're not only interested in purchase conversion; we also care about changes in the price of purchases. Instead of comparing purchase conversions, we want to compare a continuous metric like price. How do we do that with Bayesian testing? For that I have another notebook — notebooks are so fun. To understand this, we first want to understand how prices behave in an actual e-commerce marketplace. I grabbed two data sets to illustrate how prices are distributed. For the first one, I plotted the distribution of order prices, and you can see a few things: price is always above zero — that's its nature — and there are a lot of orders at low prices and few orders at high prices, with the count decreasing as the price gets higher. Interestingly, if you take the log of every single price in this data and plot the distribution of the logs, you get a bell curve, i.e. a normal distribution. And it isn't just this data set: I took another one, a Pakistani e-commerce data set, and saw the same pattern.
The skewed distribution of raw prices became a normal distribution after taking logs, which says that in both cases purchase prices can be modeled with a log-normal distribution: the distribution of the log of the prices follows a normal distribution, as is typical for e-commerce stores. So just as I modeled purchase conversion with a beta distribution, I can model prices with a log-normal distribution. That's step one.

Now, I'll be using the very same data set you saw before, but with some magic to incorporate pricing information: it's the same exact data, with a price column added. What I've done is construct, from those two e-commerce data sets, two distributions I can sample from, one for treatment and one for control. If a row's group is control, its price is drawn from the control price sampler; if it's treatment, from the treatment price sampler. So we technically have two different distributions of price data. I've also artificially shifted the treatment prices slightly above control, as you can see in the price change right here, just so we can actually see some effect in the results — just letting you know that's how I did it.

From here, let's look at a little Bayesian math. The goal of Bayesian testing in general is to find a posterior distribution: like we mentioned before, you take your prior, conduct an experiment, update your beliefs based on what you see, and you get another distribution, your posterior. That's modeled as p(θ | data), where θ are the parameters of your distribution and "data" is the experiment you're conducting.
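A quick illustration of that log-normal behavior with synthetic prices (the parameters 3.0 and 0.5 are arbitrary, not fitted to either data set):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic order prices: strictly positive and right-skewed, like real stores.
prices = rng.lognormal(mean=3.0, sigma=0.5, size=100_000)

# The raw prices are skewed: the mean sits well above the median.
raw_skewed = prices.mean() > np.median(prices)

# But their logs look normal: symmetric around 3.0 with spread 0.5.
logs = np.log(prices)
```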
We actually want to model a log-normal distribution, so that we can sample prices from it. And to find a log-normal distribution, it helps to first find a normal distribution. So let's assume p describes a normal distribution; because it's normal, it's represented by two parameters, the mean and the variance, which we write here. So our posterior is p(μ, σ² | data). This has the form of a joint probability p(a, b), and by the product rule (a close relative of Bayes' rule), p(a, b) = p(a | b) · p(b), which is what we represent over here. There's some mathematics involved in simplifying this, but I've linked a book on Bayesian analysis; according to that book, if you go to chapter three and refer to the relevant sections, the first term — the one for σ² — takes the form of an inverse gamma distribution, and the second term — μ given σ² — takes the form of a normal distribution. The tilde here means "is sampled from", and the conditioning means that to compute the mean you first need the σ² you drew in the previous step, so there's a dependency. All of the other values you see, apart from σ², are prior constants: they're subscripted with zero, which indicates they are prior hyperparameters that we pass into these distributions. We can then compute posterior values of those same constants; I've just given the update formulations here, and you can refer to the link I showed before.
Go to page 68, chapter 3, and it will tell you everything — I literally took the screenshot from there — and I don't want to explain so much that we stray from the main thread, so I do encourage you to check it out. Anyway, we're going to code all of this math out. It's essentially a two-step process to get the means and standard deviations that define the normal posterior: first we compute the σ² values by sampling from the inverse gamma distribution, and then we use those σ² values to determine the means, which we sample from a normal distribution. Those are the two steps I wrote here too. Coding that out gives this function: this chunk samples the σ² values from the inverse gamma distribution, and this chunk of code samples from the normal distribution. So we get a set of μ values and a set of σ² values — each a list of ten thousand, or however many I pass in — which together can be used to construct a posterior normal distribution.

So now we have our posterior normal distribution, at least in terms of its parameters μ and σ². But we want to convert it into a log-normal distribution, and for that we can use a simple mapping: if μₓ is the mean of the normal distribution, then exp(μₓ + σ²/2) is the corresponding mean of the log-normal distribution. I literally just code that out in another function right here. I don't bother with the σ² terms of the log-normal — I could code those out too, but I think the book would call that a nuisance parameter: important in the internal workings, but not something I need for understanding and analyzing the results, so I leave it out.
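Here's a minimal sketch of that two-step sampler, assuming the standard conjugate Normal-Inverse-Gamma update; the hyperparameter names `m0`, `k0`, `s_sq0`, `v0` are my own labels for the zero-subscripted prior constants, not necessarily the notebook's:

```python
import numpy as np
from scipy.stats import invgamma, norm

def draw_mus_and_sigmas(data, m0=0.0, k0=1.0, s_sq0=1.0, v0=1.0,
                        n_samples=10_000, seed=0):
    """Sample (mu, sigma^2) pairs from the posterior of a normal model
    under a conjugate Normal-Inverse-Gamma prior."""
    rng = np.random.default_rng(seed)
    n = len(data)
    ybar = data.mean()
    s_sq = data.var(ddof=1)

    # Posterior hyperparameters (conjugate update).
    kN = k0 + n
    mN = (k0 * m0 + n * ybar) / kN
    vN = v0 + n
    vN_times_s_sqN = v0 * s_sq0 + (n - 1) * s_sq + (k0 * n / kN) * (ybar - m0) ** 2

    # Step 1: draw sigma^2 from its inverse gamma marginal.
    sig_sq = invgamma.rvs(vN / 2, scale=vN_times_s_sqN / 2,
                          size=n_samples, random_state=rng)
    # Step 2: draw mu | sigma^2 from a normal (note the dependency on sigma^2).
    mus = norm.rvs(loc=mN, scale=np.sqrt(sig_sq / kN), random_state=rng)
    return mus, sig_sq
```

With plenty of data, the draws concentrate around the sample mean and variance of `data`, regardless of the (weak) prior constants.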
So, in essence: I take my data, which is a list of prices in log-normal form, and take the log of everything to get the normal version. Then I draw the μ and σ² values using the function I showed above, giving a list of μ values and a list of σ² values, and convert each μ (with its σ²) into the mean of the corresponding log-normal distribution. Note that these new values are in actual price units now, sampled via your log-normal model, and once you have values in price units you can construct a distribution of them. Pretty simple, pretty straightforward — I hope that all made sense.

Moving on, let's again compute the results for the week after the prior window. The prior, like before, is just one week of control data, and the experiment data is everything after that. I fit the same beta distribution for computing sell-through — it's literally the same code you saw in the last notebook, a beta distribution over purchase conversion. I also define priors for the prices, fitting a log-normal distribution based on the prior knowledge, i.e. all the information from before the experiment, which in this data set is the first week of data. Then this chunk of code, again from the last notebook, constructs the posterior distribution of purchase conversion specifically — the same parameter update as before. Now we have two distributions and can sample from them; rvs stands for random variates, i.e. sampling values from the distribution.
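The normal-to-log-normal mapping can be sketched like this; the posterior draws below are synthetic stand-ins rather than output of the real sampler:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic posterior draws for the parameters of log(price).
mus     = rng.normal(3.00, 0.01, size=10_000)
sig_sqs = rng.normal(0.25, 0.005, size=10_000)

# If log(price) ~ Normal(mu, sigma^2), then E[price] = exp(mu + sigma^2 / 2);
# each (mu, sigma^2) draw maps to one plausible mean price in currency units.
mean_prices = np.exp(mus + sig_sqs / 2.0)
```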
Sampling gives us a list of a thousand purchase conversions from control and a thousand from treatment, drawn from those two respective posterior distributions. Next, we initialize a couple of priors for price and use those to draw our μ and σ² values, and hence the values for our pricing, i.e. the log-normal part. So we end up with a thousand samples of prices for the treatment group and a thousand for the control group.

But instead of comparing these two price distributions directly, I'm choosing to multiply each price value by a corresponding purchase conversion value, which is why we carried out the purchase conversion step too. We do this because we don't want just price; we want the expected value of GMV, so that we eliminate any bias that occurs just because a product is higher priced. Prices might clearly be higher, but you might be cannibalizing sell-through because of it, and this piece of code takes both aspects into account by multiplying them. The resulting treatment expected-price samples are a list of a thousand expected prices, and likewise for control, and from those we can calculate the lift, which I'm printing out right here. The last line says there's a 90% chance that the lift in expected price — about $6.87 between the treatment group and the control group — is real.
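The combination step might look like this sketch; all posterior parameters below are invented for illustration, not taken from the actual experiment:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000

# Hypothetical posterior samples: conversion rates and mean prices per group.
treat_conv  = rng.beta(110, 9_890, size=n)
ctrl_conv   = rng.beta(100, 9_900, size=n)
treat_price = rng.normal(660.0, 5.0, size=n)
ctrl_price  = rng.normal(600.0, 5.0, size=n)

# Expected GMV per exposed user = P(convert) * expected price.
treat_gmv = treat_conv * treat_price
ctrl_gmv  = ctrl_conv * ctrl_price

lift = treat_gmv - ctrl_gmv
# Fraction of samples with positive lift ≈ P(treatment beats control on GMV).
p_positive = (lift > 0).mean()
```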
Based on this, I'm already quite confident that this difference is significant, which is great, and again that probably owes to the large amount of data we have. Coming back to this section, just so we're clear on the lift computation: we have 1,000 samples, and the readout says, for example, that for 0.2% of them the lift is less than such-and-such a value. We want to find the location where the price lift crosses zero, which is exactly what I'm doing here, and it happens at the 0.10 quantile. That means that for 10% of the samples the price lift is under zero, i.e. control beats treatment, and for the other 90% treatment beats control: the lift is positive in 900 out of 1,000 samples, hence the 90% chance. I hope that all makes sense.

We can then run this for three weeks instead of two, and execute everything again to see whether we get even more confident results. Give me a second — I could probably have just re-run this from the top, but I'll do everything the hardcore way. When this runs, we don't expect too much to change; maybe a slightly higher chance that those two are distinct groups, because with more data we can make more precise arguments and distinctions. And if you look here, there's now a 98% chance of a lift of about $7.66. If that number is acceptable in industry terms — if a $7.66 lift sounds good — then you can probably go forward with the change.
We're now very confident that this lift exists and that it's around this number. So with just one statement, you can see how much more interpretable Bayesian testing becomes. There is definitely a lot of math associated with continuous data like this, but if you go through it a couple of times and try to understand where these numbers come from, it becomes a very powerful tool — I'll tell you that. And that's all I had for today. I'll put this notebook in the same repository as the previous one, so it's all in one place. I hope you enjoyed this video and that you're learning a lot. We just passed 100 videos, so I'm super glad and excited — thank you all so much for your support, and I guess I'll be seeing you in the next one. Bye bye!