In this short video, I want to talk about transforming your data. Where we want to compare means, for instance, we would use a parametric test, but our data needs to conform to certain standards: there are assumptions that must be met for the use of parametric tests. If they aren't met, we can obviously use non-parametric tests instead. Sometimes, though, you still want to transform your data. That is where we change every value of a variable based on some function, and the functions we're going to look at are the logarithm and the square root. You might find that in the literature of your field, people still make use of data transformations instead of non-parametric tests, so in case you need to know how to do these transformations, let's have a look in this video.

So in this short notebook, here we go, you can see it's not going to take us too long to get through this: that's all the code, and some paragraphs of text, there for you. We're going to go through data transformation, transforming your data so that you can use it in parametric tests. Now, I would really advocate for non-parametric tests if your data does not meet the assumptions for the use of a parametric test, but at times data transformation is still used, and I just want to show you how it's done. We're going to generate some data, as you can see there. This notebook is going to be on GitHub; if you download it, you can just click on these links here and they'll take you to the appropriate section of the notebook.

So we're going to generate our own data. As always, we're going to start with descriptive statistics, summarizing the data so that we can understand what's going on, then visualize it, and see what we can learn from those two steps. We're going to look at the assumptions for the use of parametric tests; not all of them, just the ones that pertain to what we're trying to achieve here with data transformation. Then we're going to talk about the non-linear transformations we can do: those are really going to be the log and the square root transformations, although others are available too, such as the arcsine transformation. And then we're going to compare the means.

So let's get cracking. If I click there on Generating data, it takes me to that section of the notebook. What we're going to do is have this random variable (it doesn't really matter what it is), with our participants in the study divided into two groups. In both of those groups we're measuring the exact same variable, because the aim is to compare the means: is there a difference in means between the two groups insofar as that variable is concerned?

So let's generate the data. We're going to simulate 200 participants in each group; whether these participants are human beings or some other organism doesn't really matter. We're going to have these two groups, and I'm going to call them varGroupA and varGroupB, for the variable in group A and the variable in group B. It's the same variable across both. Then we're going to seed the pseudo-random number generator, using 17 as the integer there. We do that so that every time we run this code, we all get the same pseudo-random numbers.
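Since the notebook code itself doesn't appear in the transcript, here is a minimal one-line sketch of that seeding step in Wolfram Language:

```wolfram
SeedRandom[17] (* fix the pseudo-random seed so every run gives the same draws *)
```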
So I'm going to use the RandomReal function. I want to draw, and there's the n there as my second parameter, 200 random real values. From what distribution? From the normal distribution, with a mean of 10 and a standard deviation of 1. That's the bell-shaped distribution, and I'm going to draw 200 values from it at random. Then, for the second group, I'm just going to add some random noise to that. So look very carefully: the data for group A for this variable is normally distributed, or at least it comes from a normal distribution, but we're going to add random noise to it. We take every element of those 200 values from group A, and to each one we add a value drawn from a different distribution this time, the chi-square distribution with five degrees of freedom, and then we just subtract five again so that we get back to roughly the same sort of mean. So let's run that code, and there we go. You'll see I've got semicolons after these lines of code so that we don't see the output of the execution on the screen.

Let's describe this data, and to do that we'll use some summary statistics. I'm going to create this utility function that I call describe, as you can see there, and it takes a single argument v, which is going to be a list. What I want it to do is return a list, and we use the colon-equals sign, the delayed assignment operator, so nothing is executed here; we're just creating a function. It returns this list, inside curly braces there: the mean of the list passed to describe, the standard deviation, the median, and the interquartile range. So that's just a little utility function: every time I pass a list object to describe, I get all of these results.

Let's look at that. I'm going to call the describe function and map it: remember, that slash-at, /@, is shorthand for the Map function. So I'm going to map describe over both of the lists we created, varGroupA and varGroupB. I'm wrapping that, as the first argument, in the TableForm function so that the result is neatly printed to the screen as a table, and I'm even going to add some headings using the TableHeadings parameter. I'm setting the row names to None, but for the column headers I want mean, standard deviation, median, and interquartile range. Let's have a look at that, and there we go: we can see group A and group B there, and the means, 9.9 and 10.2, are not very different; the standard deviations show a bit of a difference; and there are the medians and the interquartile ranges. That gives us some indication of what's going on here, and the next thing we can really do is visualize this data.
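Here is a short reconstruction of what that generation and summary code plausibly looks like in Wolfram Language; the names varGroupA, varGroupB, and describe follow the narration, but the exact code isn't shown in the video, so treat this as a sketch:

```wolfram
varGroupA = RandomReal[NormalDistribution[10, 1], 200];
varGroupB = varGroupA + RandomReal[ChiSquareDistribution[5], 200] - 5;

(* Delayed assignment (:=), so nothing executes until describe is called *)
describe[v_List] := {Mean[v], StandardDeviation[v], Median[v], InterquartileRange[v]}

(* Map describe over both groups and print the results as a labelled table *)
TableForm[describe /@ {varGroupA, varGroupB},
 TableHeadings -> {None, {"Mean", "SD", "Median", "IQR"}}]
```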
So here in figure one, we're going to do a box-and-whisker chart, so that we have a box plot for each of the two groups for the same variable. I'm passing varGroupA and varGroupB, those two lists, as a list, so it's a list of lists there. I want to plot any outliers, should they exist; I've given a plot label, Figure 1, and chart labels so that underneath each of the box plots I have Group A and Group B. Let's run that, and there we go. We can clearly see the distribution here as far as group A is concerned; the summary statistics pop up if you hover over it, and you can see it looks normally distributed. But if we look at group B, the measurement of the same variable in group B seems to have a right tail here, and quite a few outliers. So we already suspect, from the summary statistics but especially from the data visualization, that we are not going to meet the assumptions for the use of a parametric test. That's what we get from data visualization.

Another way to do this, instead of the box-and-whisker chart, is to look at the frequency distribution. So we're going to do a histogram, but a paired histogram, and then a paired smooth histogram, so that we get the idea of a kernel density estimate of the distribution of the values. Let's use the PairedHistogram function. I'm passing varGroupA and varGroupB, and you'll see that here I'm not passing them as a list of lists but separately, plus a plot label and chart labels again. If we run that, we see group A here on the left-hand side; if I hover over it, we can see it's a frequency distribution, how many values appear in each bin. And for group B we can see this long right tail. You have to sort of flip it in your head when it's on the right-hand side: if this axis were horizontal, the tail would go out to the right, so this is really positively skewed. Let's do the smooth histogram, same as before, and you can now really start seeing this right tail for group B, and the sort of normal distribution of the variable in group A.

Just to confirm that, let's have a look at the skewness. I'm again using the shorthand notation, mapping the Skewness function over both of these lists, and you can see almost no skewness for group A, but quite a bit of positive skewness for group B.

Let's do something more formal. So far we only have a visual indication that we are not going to meet the assumptions for the use of parametric tests, and one of the important ones, of course, is normality. The two things we commonly do are a QQ plot, a quantile plot, and a statistical test, to see whether we meet at least the assumption of normality as far as the use of parametric tests is concerned. Let's use the QuantilePlot function for group A, and I'll label it Figure 3A so that it corresponds to the notes I wrote, if you read this notebook afterwards. You can see the points sort of stick to the line for group A, but you can clearly see that they don't for the variable in group B; the quantile plot deviates from the line. If all our values were on that line, it would represent the fact that the sample was taken from a population in which the variable is normally distributed. Let's be a bit more precise, if I can use that term, and do a Shapiro-Wilk test for group A, and we're going to return a test data table so that we get both the test statistic and the p value.
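A sketch of those visualization and normality-check calls, assuming the same reconstructed variable names; only the figure labels actually mentioned in the narration are included:

```wolfram
(* Box plots with outliers shown; hover for summary statistics *)
BoxWhiskerChart[{varGroupA, varGroupB}, "Outliers",
 PlotLabel -> "Figure 1", ChartLabels -> {"Group A", "Group B"}]

(* Back-to-back frequency and kernel-density views of the two groups *)
PairedHistogram[varGroupA, varGroupB]
PairedSmoothHistogram[varGroupA, varGroupB]

Skewness /@ {varGroupA, varGroupB} (* near zero for group A, clearly positive for B *)

(* Quantile plots against a normal reference line *)
QuantilePlot[varGroupA, PlotLabel -> "Figure 3A"]
QuantilePlot[varGroupB]

(* Shapiro-Wilk test, returning test statistic and p value together *)
ShapiroWilkTest[varGroupA, "TestDataTable"]
```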
And you can see there the null hypothesis for this Shapiro-Wilk test: that the samples you took come from a population in which that variable is normally distributed. We see our p value, and with an alpha value of 0.05 we cannot reject the null hypothesis. Let's do the same for group B now, and we see a p value less than an alpha value of 0.05, or even 0.01 if we had chosen a smaller one, so we have to reject the null hypothesis and accept the alternative hypothesis, which states that this data, the samples at least, does not come from a population in which the variable is normally distributed. So we fail one of the major requirements for the use of a parametric test. I think this is a good place for a little break; at least YouTube is going to add one here, and I'll see you afterwards.

Let's carry on and talk about non-linear transformations. It sounds fancy, but you've all seen non-linear transformations, even at school: y equals x squared. Your input is x; you take a real number x, you put it into this function, or mapping, and out pops an element of another set. Remember, if x is an element of the real numbers, the x axis is a real number line, and the y axis is also a real number line. These are two different sets (they happen to be the same kind of set), but you take an element of one set and map it to an element of the other set. So if I say y equals x squared, I put in an x from my x-axis real-number set, and out pops a value on the real numbers of the y axis. That's what a function is. And this one is non-linear; in other words, the curve you draw for it is not a straight line. In single-variable calculus, or even algebra, a linear function gives a straight line as its graph, and y equals x squared is definitely not a straight line if you plot it. So that's a non-linear transformation.

I do have a plot here. Let's use ListPlot, and we can use the Thread function: we're going to thread varGroupB and the log of varGroupB. What the Thread function does, remember, is take my two lists and combine element one from the first list with element one from the second list into a little two-element list, then element two with the corresponding element two, element three with element three, and so on, because that's what you need here for ListPlot. So I'm going to thread those, and I'm also going to have another one where we thread the group B list with its square root, plus some plot legends so that we can tell them apart.

Log10 there, as you can see, is the function that takes the log base 10, and remember what the log base 10 function does: it asks, 10 to the power of what gives me my input value? So log base 10 of 10 would be 1, because 10 to the power 1 is 10. Anyway, you can clearly see that these two are not straight lines; they're both curved, so they're both non-linear transformations. In blue here we have log base 10: for every value we have in varGroupB, we transform it into its log value, and that's what we see on the y axis. And the square root we see there in orange, at the top.
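That plot could be produced along these lines (again a reconstruction, not the notebook's verbatim code):

```wolfram
(* Thread pairs each raw value with its transformed value: {x, f[x]} points *)
ListPlot[{Thread[{varGroupB, Log10[varGroupB]}],
  Thread[{varGroupB, Sqrt[varGroupB]}]},
 PlotLegends -> {"Log10", "Sqrt"}] (* both curves are clearly non-linear *)
```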
So none of those are linear transformations. Let's do the log transformation. Remember, it doesn't really matter whether we take the natural log or the log base 10 or even the log base 2: whatever the base, there's just a constant factor between the two transformed values, so the relationship between them is linear. It doesn't really matter which you decide your transformation to be. Remember that in mathematics textbooks you'll see ln, which is the natural log, with Euler's number e as the base, and a plain log in a textbook is usually log base 10. The two corresponding functions here are Log10 for log base 10, and plain Log, which is log base e. Let's do a list plot where we take the log base 10 of B against the natural log of B, just to show you that it is indeed a straight line. Whether you take log base e or log base 10 really doesn't matter; there's a constant multiple between the two, a linear relationship.

So here in 6A and 6B we're going to show a histogram of the data transformed with both of these. If we look at the log base 10 transformation: remember that data had a right tail, it was quite right-skewed, and it's looking a lot more normally distributed once we do that transformation. If we do the natural log, it's the same sort of thing; there's still a little bit of a tail there. Let's do a QQ plot of this transformed data now. Remember, before, it wasn't on that normal line at all; but once we transform, whether with log base 10 or log base e (as I said, that doesn't really matter), we can see that this data now appears much more normally distributed. Let's do the Shapiro-Wilk test on both; the results are going to be exactly the same, because there's really no difference between those two transformations. You can see the QQ plots look exactly the same, and indeed, no matter which log transformation I do, I get the same p value back, and it's a p value from which we cannot reject the null hypothesis. So now we can say that this transformed data comes from a population in which the variable is normally distributed, at least the log of that variable.

Let's also look at the square root transformation. We take a histogram here, now taking the square root of all of these values, and you see we still have quite significant positive skewness. So in this instance the square root is not really going to work for us. We tend to use the log transformation for continuous variables and the square root more for count data, although you have to look at your own case. What you should really do, of course, is decide on this beforehand. You can't run your final test, get a p value, say, oh, that's not significant, let me try another transformation, maybe it will be; you've got to decide that before your data analysis, and you've got to stick to it.
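A sketch of those transformation checks; Log is Wolfram Language's natural log and Log10 its base-10 log, as described above:

```wolfram
(* The two log transforms differ only by a constant factor: a straight line *)
ListPlot[Thread[{Log10[varGroupB], Log[varGroupB]}]]

Histogram[Log10[varGroupB], PlotLabel -> "Figure 6A"]
Histogram[Log[varGroupB], PlotLabel -> "Figure 6B"]

QuantilePlot[Log[varGroupB]] (* now hugs the normal reference line *)
ShapiroWilkTest[Log[varGroupB], "TestDataTable"] (* same p value for either base *)

Histogram[Sqrt[varGroupB]] (* still clearly right-skewed, so Sqrt won't help here *)
```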
One more thing about both of these transformations: remember that the log function and the square root function are only defined for non-negative values (the log only for strictly positive values), so if your variable contains data point values that are negative, you'll have to shift them first through a linear transformation, in other words just add a constant to them, before you can do these transformations. Just keep that in mind.

So let's then compare the means. Let's have a look at what happens when we take the log of varGroupA, because we'll have to do that too: we can't just take the log of one of the two groups, because then our comparison becomes meaningless. We can't have the values untransformed in one group and transformed in the other; that makes no sense. If we look at that, we see we just about make it: the p value is 0.08 if we round off, so as far as the Shapiro-Wilk test is concerned we still cannot reject the null hypothesis.

Let's take a box-and-whisker chart of the transformed data now, and you can clearly see there for group B that it now looks normally distributed, and still, to a great extent, for group A. We do have that single outlier, so we'd have to check it out as far as the assumptions for the use of parametric tests are concerned, but certainly for a single outlier a t test would be robust enough for it not to have any influence.

Let's look at the description of the transformed data. I'm taking the natural log of both group A and group B now, and we can report on this transformed data; have a look in the literature of your field at how transformed data is commonly reported, and stick to that. What I want to do with this transformed data: I have a mean and I have a standard deviation, so I can draw two density plots, a probability density function, here. I'm just using PDF and a normal distribution with the mean and the standard deviation for group A, and then again for group B, and we plot those two probability densities based on the transformed means and standard deviations. You can clearly see we're not going to find a statistically significant difference between these two.

So let's do a t test, then: a t test on the log of varGroupA and the log of varGroupB, and our null hypothesis is that there is no difference between the two. I want the test data table, and I'm verifying the assumption of normality. If we do that, as we can see there, we see a p value above 0.05, our alpha value, so we cannot reject the null hypothesis that there's no difference between these two groups.

Then, just for curiosity's sake, let's have a look at using the t test on the untransformed data. I'm going to set VerifyTestAssumptions to None; if you set it to All, it's going to give you a bit of a complaint saying that at least one of your data sets does not meet the requirements, so I'm forcing that parameter to None so that we just get a result. We still see a p value of more than 0.05, so in the end this transformation has not changed the outcome, and the t test is robust enough to handle a bit of, shall we say, not upholding all the assumptions for the use of a parametric test. But you can clearly see that there are going to be circumstances in which you get results that look significant when you should not have used a parametric test, or at least should have transformed your data first.

In the end, I just want to circle back all the way to state that data transformation should not be the first thing you do so that you can use a parametric test. Think about the non-parametric tests; do them. I think that is the appropriate thing to do. Here we compared two groups to each other, so the non-parametric test for that would be the Mann-Whitney U test. We have the MannWhitneyTest function there, and we pass it the two groups. Remember, we are now comparing medians, and the null hypothesis is again that there's no difference between the groups. The property we're asking for is again the test data table, and if we return that, we see a p value again above 0.05, so we still cannot reject the null hypothesis.
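The whole comparison could look roughly like this; the plot range of the density plot is my own choice for illustration, and as before the variable names are reconstructed:

```wolfram
ShapiroWilkTest[Log[varGroupA], "TestDataTable"] (* p ~ 0.08: still cannot reject *)

BoxWhiskerChart[{Log[varGroupA], Log[varGroupB]}, "Outliers",
 ChartLabels -> {"Group A", "Group B"}]

(* Describe the transformed data with the earlier utility function *)
TableForm[describe /@ {Log[varGroupA], Log[varGroupB]},
 TableHeadings -> {None, {"Mean", "SD", "Median", "IQR"}}]

(* Density curves built from the transformed means and standard deviations *)
Plot[{PDF[NormalDistribution[Mean[Log[varGroupA]],
     StandardDeviation[Log[varGroupA]]], x],
  PDF[NormalDistribution[Mean[Log[varGroupB]],
     StandardDeviation[Log[varGroupB]]], x]}, {x, 1.8, 2.8}]

(* t test on the transformed data, checking the normality assumption *)
TTest[{Log[varGroupA], Log[varGroupB]}, 0, "TestDataTable",
 VerifyTestAssumptions -> "Normality"]

(* For curiosity: the untransformed data, forcing the assumption checks off *)
TTest[{varGroupA, varGroupB}, 0, "TestDataTable",
 VerifyTestAssumptions -> None]

(* The non-parametric alternative: Mann-Whitney U test on the raw data *)
MannWhitneyTest[{varGroupA, varGroupB}, 0, "TestDataTable"]
```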
So rather stick with that; but in case you have to transform, you have now seen at least two data transformations to get your data into a form in which you can use parametric tests. I hope you found this video informative. If so, remember to subscribe to this channel and leave some comments down below; it's nice to interact with the community of people who watch these videos.