Welcome to this section of Tutorial 4, Image Statistics in Python. In the previous sections we saw how to carry out graphical analysis of data using 1D, 2D and 3D plots, and how to summarize the information content of a dataset using box plots, histograms and relative frequency histograms; we also experimented with empirical distributions, the PDF and the CDF. Towards the end, we discussed how to plot scatter diagrams in both 2D and 3D, and how to fit a linear regression line and display it along with the R-squared value. In this part of Tutorial 4, we shall analyze how to perform hypothesis testing and how to deal with autocorrelation, particularly for time series data, which is predominant in hydrology and water resources engineering. Just to give you a preview: we have finished parts 1, 2 and 3, and through this section of the tutorial we focus on part 4, where we shall see how to do hypothesis testing, that is, the Student's t-test for the mean and the F-test for the variance, and how to deal with autocorrelation and lag plots. Before I begin hypothesis testing: typically we collect sample data, generate statistics from the sample data, and then use this information to infer population parameters. So before I start, I want to make a clear distinction between population and sample, because in hypothesis testing, just to reiterate, we collect sample data and from it compute statistics. These statistics can be the mean, standard deviation or variance, and using them we try to infer the population parameters. Say we assume a certain value for the population mean; to test the validity of our assumption after collecting sample data, we need to look at the difference between the hypothesized value and the actual value of the sample mean.
In hypothesis testing, we usually have something known as the null hypothesis, denoted by H0: the assumption that we need to test. If our sample results fail to support the null hypothesis, we must conclude that the alternative hypothesis is true, which means that whenever we reject the null hypothesis, we accept the alternative hypothesis, symbolized by H1. On the screen in front of you, what you see is the population mean, variance and standard deviation, and the sample mean, sample variance and sample standard deviation. The whole purpose of hypothesis testing is to make an informed judgment about the difference between a sample statistic and a hypothesized population parameter, which is why this slide is shown. Again, the sample statistic can be the mean, variance or standard deviation, and typically in hypothesis testing we support the decision with a p-value. In this tutorial, we shall also show you how to carry out two-tailed tests for the mean and the variance. The significance level in hypothesis testing indicates the percentage of sample means that lie outside certain limits. With this background, let us try to understand how to perform hypothesis testing in Python. We will discuss two typical cases; case one is the two-sample test for the mean using a pooled t-test. Without going too deep into the statistical theory, let me show you how the commands are written in Python. As I mentioned earlier, in hypothesis testing we always check whether H0, the null hypothesis, or H1, the alternative hypothesis, is acceptable. So the first step is to define the null hypothesis and the alternative hypothesis. Here, as a data analyst, I am going to assume that the observations from each population are normal.
So we have two samples; the null hypothesis says the means of the two samples are equal, and the alternative hypothesis says the means are not equal. The commands I am going to write generate rv1, random variable 1, using stats.norm.rvs. For this I need to import the stats package from SciPy, which is what is written: all the statistical functions are located in the sub-package scipy.stats. That is why we import stats, and then I generate random variable 1 and random variable 2. Here, default_rng is the constructor for the random number generator class; there are several means by which one can construct a random number generator, and one such means is shown here. stats.norm.rvs draws random variates, and stats.ttest_ind is the function to calculate the t-test for the means of two independent samples. It is basically a two-tailed test of the null hypothesis; again, the assumption by default is that the populations have identical variances. I also want to calculate the p-value, which is the largest significance level at which we would accept the null hypothesis. So I am going to type the commands: if the p-value is greater than 0.05, print "accept null hypothesis". Again, when we say we accept the null hypothesis, it simply means that the sample data do not allow us to reject it. For this section of the tutorial, I am assuming that you have a fairly good idea of basic statistics. What I will do now is change the location and see the result. It says "accept null hypothesis", which means rv1 and rv2 have the same mean. In the same code, I can also change the location to, say, 20, or to 10 or 5, and check again; you will then get "reject null hypothesis" as the result. So, what did we do?
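The steps just described might be sketched as follows. This is a minimal sketch, not the lecture's exact code: the seed, sample size and the loc/scale values are illustrative choices of mine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)  # one way to construct a random number generator

# Two normally distributed samples (loc = mean, scale = std; values illustrative)
rv1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rv2 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)

# Two-tailed t-test for the means of two independent samples;
# by default the populations are assumed to have identical variances (pooled t-test)
t_stat, p_value = stats.ttest_ind(rv1, rv2)
print(f"t statistic = {t_stat:.3f}, p-value = {p_value:.3f}")

if p_value > 0.05:
    print("Accept null hypothesis: the two means are equal")
else:
    print("Reject null hypothesis: the two means differ")
```

Changing the `loc` of one sample to 20 (or 10, or 5) and re-running, as in the lecture, shifts its mean and should flip the result to "reject null hypothesis".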
Instead of taking an actual dataset, I created two datasets, two random variables rv1 and rv2, both normally distributed, and then checked whether the means of rv1 and rv2 are the same using a pooled t-test. Similarly, we can also perform a two-tailed test for variances. My first step again is to define the null hypothesis and the alternative hypothesis. The null hypothesis is that the two variances, the variances of rv1 and rv2, are the same; the alternative hypothesis is just the reverse, that the two variances are different. This is an inference about the population variance; remember, I made a distinction between population parameters and sample statistics in the beginning. I am going to use np.var, which computes the variance of the given data; this step calculates the F value. dof stands for degrees of freedom, and len returns the length of a list, a string, a dictionary, etc., so I use it to define the degrees of freedom of rv1 and rv2. As before, I need the p-value. What is the p-value? It is the largest significance level at which we would accept the null hypothesis. Here I am going to use the stats.f.cdf function. I need to print the F statistic and the p-value, so as before I type: if the p-value is greater than 0.05, print "accept null hypothesis", else print "reject null hypothesis". There it is: run it to display the F statistic, the p-value and the final inference, which is "accept null hypothesis". In a similar manner, Python allows us to perform hypothesis testing, one-tailed or two-tailed; two sample codes were shown as part 4 of Tutorial 4. Moving further, let us now look at the fifth part of this tutorial: autocorrelation, partial autocorrelation and lag. In hydrology, we deal with time series data.
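The F-test for variances described above might be sketched like this. Again this is a hedged sketch, not the lecture's exact code; the seed and sample parameters are illustrative, and I fold the one-sided CDF value into a two-tailed p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rv1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rv2 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)

# F statistic: ratio of the two sample variances (ddof=1 gives the sample variance)
f_value = np.var(rv1, ddof=1) / np.var(rv2, ddof=1)

# Degrees of freedom: sample length minus one for each sample
dof1, dof2 = len(rv1) - 1, len(rv2) - 1

# Two-tailed p-value from the F cumulative distribution function
cdf = stats.f.cdf(f_value, dof1, dof2)
p_value = 2 * min(cdf, 1 - cdf)
print(f"F statistic = {f_value:.3f}, p-value = {p_value:.3f}")

if p_value > 0.05:
    print("Accept null hypothesis: the two variances are equal")
else:
    print("Reject null hypothesis: the two variances differ")
```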
Autocorrelation and partial autocorrelation are used heavily in time series analysis and forecasting because they graphically summarize the strength of the relationship between observations in a time series and observations at prior time steps. To define it: autocorrelation refers to the correlation of a signal with itself in space or time. We saw what correlation is in part one of this section, didn't we? Now let us achieve the same in Python. As before, I import numpy as well as matplotlib.pyplot. Let us assume a periodic signal, either sin theta or cos theta; for this example, I am going to use np.cos, and then let us corrupt this signal with noise. So I synthetically create the data and corrupt it with noise using np.random.random; len gives the length, as mentioned before, and the autocorrelation can be plotted using the plt.acorr function. Let us display the x and y labels, and I want the two plots displayed one above the other. The plt.acorr function takes x, the synthetically generated periodic signal, as input; I am going to keep the maximum lag at 50 and set plt.grid to True. Typically, when you have lengthy lines of code and you make a mistake somewhere, say a syntax error or some other error, Python will prompt you so that you can check and correct it before you run the code. By now, I am assuming you are familiar with the plt.grid, plt.xlabel and plt.ylabel commands, because we have been covering them over this series of tutorials. Let me also set the limits of x from, say, 0 to 100 and, as before, label the axes; I have to check and correct the spelling of the label. So x is shown, and the autocorrelation is shown below it. Now, let us try to increase the periodicity.
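Before moving on to change the periodicity, the two-panel figure just built might be sketched as below. This is my reconstruction under stated assumptions: the time grid, noise and figure size are illustrative, and I subtract the mean before calling acorr so the normalized correlation is meaningful.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

# Synthetic periodic signal corrupted with uniform noise
t = np.arange(0, 100, 0.1)
x = np.cos(t) + np.random.random(len(t))

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))

# Top panel: the signal itself
ax1.plot(t, x)
ax1.set_xlabel("t")
ax1.set_ylabel("x")
ax1.set_xlim(0, 100)
ax1.grid(True)

# Bottom panel: autocorrelation up to a maximum lag of 50
lags, c, _, _ = ax2.acorr(x - x.mean(), maxlags=50, normed=True)
ax2.set_xlabel("Lag")
ax2.set_ylabel("Autocorrelation")
ax2.grid(True)

plt.tight_layout()
plt.savefig("acorr_demo.png")
```

Multiplying the argument of np.cos by 4, as done next in the lecture, increases the periodicity and changes the shape of the autocorrelation panel accordingly.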
Let me multiply the variable by 4, just to show you different possibilities. In the plot below, you see lag on the x-axis and autocorrelation on the y-axis, and above it the variable we generated is plotted with x on the y-axis and time on the x-axis. A simple example using a synthetically created dataset, to help us understand what autocorrelation and lag are. Just to reiterate: autocorrelation refers to the correlation strength of a signal with itself, either in space or in time, and it plays a very prominent part in hydrology and water resources because it is used in time series analysis and in forecasting. Moving on, Python also allows us to use certain tools, such as packages in statsmodels, and it will be useful for us to understand how to use these as well. Let us import the functions acf and pacf; ACF stands for autocorrelation function and PACF stands for partial autocorrelation function. We import them from statsmodels.tsa.stattools, and for plotting the same, I am going to import plot_acf and plot_pacf from statsmodels.graphics.tsaplots. The partial autocorrelation function gives us the partial correlation of a stationary time series with its own lagged values. Here I am going to keep nlags, the number of lags, as 5. What I mean by lag: the lag-1 autocorrelation is the correlation between values that are one time period apart. That is the meaning of lag. Again, I keep the lags at 5 for both x and y. Now that we have calculated both the ACF and the PACF for x and y, let us plot and visualize the results. As before, it gives us several options to print the results as well as, you know, attractive ways to represent them graphically and to change the font size or colour of the graphs. Let us try to use a few of these options.
I want to print the titles specifically as "autocorrelation of precipitation up to 5-day lag" and "partial autocorrelation of precipitation up to 5-day lag". For the next two lines there is only a slight variation, so in the interest of saving time, I am going to copy the commands and make changes wherever necessary. Instead of precipitation, it is going to be temperature, because by now we are aware that when we started the tutorial, we started with two variables: precipitation from IMD and temperature from ERA-Interim. A small correction in spelling: x is precipitation and y is temperature. I get to see the autocorrelation function of precipitation up to a 5-day lag; I can see the values printed here, the autocorrelation function and the partial autocorrelation function of precipitation, followed by the same for temperature up to a 5-day lag. Though we have the results in the form of numbers, let us try to visualize how they look. As before, I am going to use the plt.subplots command with 2, 2 (2 rows, 2 columns) and a figure size. Let me use the plot_acf function; if you remember, it was imported from statsmodels.graphics.tsaplots. What I am doing is specifying the variables, the title, and where I want the figure displayed: (0, 0) is the first row, first column, towards the left, and (0, 1) is the first row, second column. As before, because there is only a small difference in the commands, in the interest of saving time let me paste the commands and then make the changes wherever necessary. In the third and fourth lines, only the variable name changes, to y, which stands for temperature. Let me make those changes; I want this figure to appear at (1, 0), and then the last figure, the partial autocorrelation function of temperature, at (1, 1).
Let me use the plt.setp function to display the x label, and finally use the plt.show function. There you see the results: the first figure gives us the autocorrelation function of precipitation; the one towards the right, first row second column, gives us the partial autocorrelation function of precipitation; and similarly we have the figures representing temperature. Simple examples, so that when you work with an actual time series dataset after this tutorial, you will be aware that several functions exist to help you analyze autocorrelations and partial autocorrelations, and to help you understand hypothesis testing. As part of this tutorial, we started with how to visualize a set of data. This data can be synthetically generated, which is what we saw in this tutorial: we generated datasets using the mathematical functions sin theta and cos theta and added a little noise to see how the plots look. We covered the visualization of data using box plots and histograms, and we got a fairly good idea of empirical CDFs and of the libraries and functions that help us plot them. Finally, we went from scatter plots to linear regression, ending with hypothesis testing and with understanding and plotting the autocorrelation for two different time series variables, rainfall and temperature. I hope you found this tutorial informative. I will see you in the next lecture. Thank you.