Okay, next we can similarly do the Poisson distribution, but with the Poisson distribution we will also see how to plot multiple figures as subplots within one figure. So here I am creating subplots and specifying 1, 4. That means I am asking Python for four figures in one row: it is like a 1 × 4 matrix, with the four figures placed adjacent to each other. To generate the Poisson samples you just need to invoke the function random.poisson, passing the lambda value and how many samples you want, and it will generate that many samples.
Now I decide the size of my figure, which is given by the figure size mentioned here, and I draw the histogram of this data sample at the first element of my 1 × 4 array. I have declared four figures and I want to put my first figure in the first position, so I pass ax = ax[0] and tell the histplot function to plot the histogram of x. You can then set the labels, the y label as count and the x label as x, and this is the figure that corresponds to it. Similarly, if you want to generate data according to another lambda parameter and see its histogram as the second element in your array, you just set ax = ax[1] when you pass the arguments to histplot. You can generate the third and fourth plots the same way: whichever data you generate, you pass the handle of the corresponding position to histplot. So here we wanted four figures placed adjacent to each other, and that is what we got.
So this is about the basic distribution functions that are available in Python.
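As a rough sketch of the four-subplot idea just described: the lambda values, the sample size and the figure size below are assumptions made up for illustration, and the histogram is drawn with Matplotlib's own hist method on each axis rather than the histplot function mentioned in the lecture, so that the sketch needs only NumPy and Matplotlib.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

np.random.seed(0)          # seed so the samples are reproducible

lambdas = [1, 4, 10, 20]   # assumed rate parameters, not taken from the lecture
n_samples = 1000           # assumed number of samples per plot

# One row, four columns: `axes` is an array holding four Axes objects.
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

samples = []
for i, lam in enumerate(lambdas):
    x = np.random.poisson(lam, n_samples)   # draw n_samples values from Poisson(lam)
    samples.append(x)
    # Integer-aligned bins, one bin per possible count value.
    axes[i].hist(x, bins=np.arange(int(x.max()) + 2))
    axes[i].set_xlabel("x")
    axes[i].set_ylabel("count")
    axes[i].set_title(f"lambda = {lam}")

fig.tight_layout()
plt.close(fig)  # release the figure; use fig.savefig(...) before this to keep it
```

The only thing that changes from one subplot to the next is which element of the axes array receives the histogram, which is the same role the ax argument plays when you call histplot.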
I just talked about the basic normal, exponential, gamma and beta distributions. You can play with other distributions too; these Python libraries are usually very well documented, so just go and play around with them.
Now we will briefly touch upon how to use Python for some of the other tests we did. We will talk about hypothesis testing here. If you recall, in hypothesis testing we wanted to check whether a null hypothesis is true or not; for that we come up with a decision criterion, and based on the decision we accept or reject the null hypothesis. Here I am interested in comparing the performance of vehicles when they use diesel or petrol. For petrol they simply use the word gas here, so we will just say diesel and gas.
To understand this we are going to use a data set taken from the 1985 Ward's Automotive Yearbook. The column names describe the data they hold. For example, the column make gives the manufacturer of the vehicle, the column number of cylinders tells how many cylinders there are in the engine, the column peak RPM gives the maximum revolutions per minute, and so on. We are going to use this data to check whether the performance of diesel and gas vehicles is the same or not. To play with the data we are going to use Pandas and NumPy, for visualization we use Matplotlib, and for the statistics we use the SciPy stats package, and from that the t-test function in particular.
So let us execute this and then read the file. As we discussed yesterday, when I read this file I can see the first five rows, with the headers and the data in them. This info gives me a brief summary of all the columns I have. It looks like there are eight columns here, and there are 205 entries, indexed from 0 to 204. All these columns have some value in every row, which is why it shows 205 non-null values for each of them.
Now suppose you want to check whether there are any missing values. If a value were missing, either that place would have been left empty or somebody would have put a question mark there. Since it is showing 205 non-null values everywhere, maybe there is a question mark wherever data is missing. So what I am doing here is checking each column for question marks. I define this variable column, take it through all the columns, and under each column I check whether any of the values is a question mark. To check that we use the function value_counts and tell it what value we are looking for. This gives us how many times the question mark appears in each column, and that is printed here. Notice that this looks like a few lines of code, but it could be written very compactly in one line, like this. So in the make column there are none with a question mark, in fuel type there are none either, in horsepower there are two with a question mark, and so on.
Now you may want to remove these missing values. To do that, we first identified the columns where values are missing: those are horsepower, peak RPM and price.
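A minimal sketch of this question-mark check, on a tiny made-up table rather than the real automobile file. The column names and values are assumptions for illustration, and the final step of turning the question marks into NaN is only one common choice; the lecture does not spell out what the question marks were replaced with.

```python
import numpy as np
import pandas as pd

# Tiny made-up table standing in for the automobile data (illustrative values only).
df = pd.DataFrame({
    "make": ["audi", "bmw", "mazda", "volvo"],
    "fuel-type": ["gas", "gas", "diesel", "gas"],
    "horsepower": ["102", "?", "64", "?"],
    "price": ["13950", "16430", "?", "22625"],
})

# Loop version: for each column, count how often "?" occurs.
# value_counts() is indexed by value, so .get("?", 0) returns 0 when "?" is absent.
counts = {}
for column in df.columns:
    counts[column] = df[column].value_counts().get("?", 0)
    print(column, counts[column])

# Compact version: compare the whole frame to "?" and sum the matches per column.
missing_counts = (df == "?").sum()

# One possible follow-up (an assumption, not the lecture's stated method):
# mark the placeholders as NaN and convert the affected columns to numbers.
cleaned = df.replace("?", np.nan)
for column in ["horsepower", "price"]:
    cleaned[column] = pd.to_numeric(cleaned[column])
```

Once the columns are numeric, describe() can report the mean and standard deviation for them, skipping the NaN entries.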
So we have put values in place of the question marks, and if you call describe again, every entry is now a value, so we can compute statistics like the mean, standard deviation and so on. Now what we want to do is look into the mileage, specifically the highway mileage, for the gas and the diesel vehicles. So I take all the vehicles which run on gas and list their highway mileage; MPG stands for miles per gallon. Then for the vehicles which run on diesel I again list their highway mileage. When I run this, maybe I am printing too much here, but the first part printed is gas, and it looks like there are about 185 vehicles running on gas. The second part, corresponding to this print, did not show me the length, but when I look at the length for diesel it shows 20. So the vehicles running on gas are about 185 and the ones running on diesel are about 20.
Now I store them separately in new data frames: gas is stored in a data frame DFA and diesel in DFP. Notice that extracting this information is very easy in Python. All you have to do is first find all the rows where the fuel type is gas, and when you pass only those indexes to the data frame which stores all the data, it returns only those rows and not the others; similarly for the diesel case. Now I want to take 20 samples from each of these and use them for my hypothesis test of whether diesel and gas have similar mileage.
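The row selection described here can be sketched as follows. The data frame is a small made-up stand-in, and the column names fuel-type and highway-mpg as well as the variable names are assumptions for illustration.

```python
import pandas as pd

# Small made-up stand-in for the full data frame (illustrative values only).
df = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas", "gas", "diesel"],
    "highway-mpg": [27, 42, 30, 25, 38],
})

# Boolean mask: True for every row whose fuel type is gas.
gas_mask = df["fuel-type"] == "gas"

# Passing the mask to the data frame returns only the matching rows.
df_gas = df[gas_mask]
df_diesel = df[df["fuel-type"] == "diesel"]

# Highway mileage of the gas vehicles, then the size of each group.
print(df_gas["highway-mpg"])
print(len(df_gas), len(df_diesel))  # prints: 3 2
```

The comparison produces one True or False per row, and indexing the data frame with it keeps exactly the True rows, which is all the "passing the indexes" step amounts to.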
To get those 20 samples I am going to use a random sampling function. The random function is seeded first, so that you get the same values every time despite the randomness. I get 20 samples from DFA and another 20 samples from DFP. Now when I look at the summary of DF1, you see that there are 20 elements in it, and we see the mean and the median, and the highway column has similar values populated here as well.
These are just summary statistics; often one wants to see them pictorially, and for that one can use a box plot. To do this we use the boxplot function in the Seaborn library, imported as sns. For the plot, x is simply the string fuel type, y is highway MPG, and the data we give is the concatenation of DF1 and DF2. When I plot this you see five bars in each box plot. The bottom one corresponds to the lowest value, the minimum, which is 22 here. The second one corresponds to the 25th percentile of the data, the middle one is the 50th percentile, which is also the median, the next one is the 75th percentile, and the last one represents the maximum value. We have titled this figure "box plot highway MPG for gas and diesel".
Based on this, if you compare just the means, it gives a sense that diesel has better mileage. Now we want to test this. Our null hypothesis is that the average mileage of both is the same, and the alternative hypothesis is that they are not the same. I want to test it, and I want to use the t-test for it.
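The sampling step and the five numbers that the box plot displays can be sketched like this. The mileage values are synthetic random numbers generated only so the sketch runs on its own, and the variable and column names are assumptions; the real numbers would come from the automobile data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two groups (185 gas, 20 diesel); values are made up.
gas = pd.DataFrame({"fuel-type": "gas",
                    "highway-mpg": rng.integers(20, 45, size=185)})
diesel = pd.DataFrame({"fuel-type": "diesel",
                       "highway-mpg": rng.integers(25, 50, size=20)})

# Draw 20 rows from each group; random_state fixes the seed so reruns agree.
df1 = gas.sample(n=20, random_state=1)
df2 = diesel.sample(n=20, random_state=1)

# Stack the two samples into one frame, as is done before plotting.
combined = pd.concat([df1, df2])

# Per-group summary: min, quartiles and max are what the box plot draws.
summary = combined.groupby("fuel-type")["highway-mpg"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```

With Seaborn installed, passing x="fuel-type", y="highway-mpg" and data=combined to sns.boxplot draws the corresponding picture. One thing to keep in mind is that, by default, the whiskers stop at the most extreme points within 1.5 times the interquartile range and anything beyond is drawn as a separate point, so the end bars coincide with the minimum and maximum only when there are no such outliers.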
If you recall, in hypothesis testing we accept or reject by comparing against a threshold, so we first want to compute the statistic and its p-value. To compute the t-statistic I am going to use ttest_ind, the t-test function from the SciPy stats package. This function takes the two data sets that we want to compare, and the equal variance argument tells it whether to treat the variances of the two samples as the same or not. Here we have set it to false, so it will not assume that the two data sets have the same variance. Recall that the t-test assumes that the data follow a Gaussian distribution.
This function returns two values, stat and p. stat is the value of the statistic we compute, and p is its p-value, that is, the probability under the null hypothesis of seeing a statistic at least as extreme as the one computed. The significance level is set to 0.05. So we compare p against the significance level 0.05: if p is larger, we say the null hypothesis is accepted, and otherwise we say it is rejected. In this example the statistic turns out to be minus 2.229, whereas the p-value is 0.037, and clearly 0.037 is smaller than 0.05. So the null hypothesis is rejected, that is, the performance of the two is not the same. So with the 20 random samples generated, we concluded that the null hypothesis, that both have the same performance, is rejected.
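A minimal sketch of this test, assuming two synthetic samples of 20 mileage values each; the numbers are randomly generated for illustration and will not reproduce the statistic and p-value quoted in the lecture. With equal_var=False, ttest_ind performs Welch's t-test, and the manual formula below is included only to show what the statistic is.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two samples of highway mileage (made-up values).
gas_mpg = rng.normal(loc=30, scale=6, size=20)
diesel_mpg = rng.normal(loc=35, scale=6, size=20)

# equal_var=False: do not assume the two groups share the same variance.
stat, p = ttest_ind(gas_mpg, diesel_mpg, equal_var=False)

# The same statistic by hand: difference of means over its standard error.
se = np.sqrt(gas_mpg.var(ddof=1) / len(gas_mpg)
             + diesel_mpg.var(ddof=1) / len(diesel_mpg))
manual_stat = (gas_mpg.mean() - diesel_mpg.mean()) / se

alpha = 0.05  # significance level
if p > alpha:
    decision = "null hypothesis not rejected"
else:
    decision = "null hypothesis rejected"
print(f"stat = {stat:.3f}, p = {p:.3f}: {decision}")
```

The sign of the statistic only reflects which sample is passed first; the p-value returned by default is two-sided, which matches an alternative hypothesis of "not the same".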
Now you can do this again, but this time, instead of taking 20 samples from each, you can take all the samples available for each and compare them. It is the same process, so I am not going to go into the details. Again, to get a sense of the data you can first draw the box plot, and when I take all the data it again looks like diesel is performing better. Then you can run the test, and at significance level 0.05 the null hypothesis is again rejected.
So you see that, before applying the hypothesis test, the visualization sometimes gives a better idea of how the data looks. That is why data visualization is important, and I want to quickly spend a minute on the various visualization tools. The Seaborn library provides pretty good functions for this. One function we have already used is the histogram. For example, in the Pokémon data from earlier, if I want to see how the attack column looks, I can generate its histogram like this. There are also scatter plots, the box plot we just saw, and violin plots. These are other good visualization techniques, and I encourage all of you to play with them.
Okay, with that we will stop. Python comes with very rich libraries, so you can readily use it to do good statistical analysis. I hope whatever we discussed is going to be helpful to you. Okay.