As part of tutorial 4, we have been looking at different ways to represent data visually and to infer information from it. So far we have seen histograms, box plots, relative histograms and the cumulative distribution function. Now let us look at plotting in 2D using scatter plots. A scatter plot lets us know whether a relationship exists between two variables: visually, we look for patterns that indicate whether or not the variables are related. Take the example in front of you. On the x-axis you have the marks of the examination conducted in the middle of the semester, the mid-semester exam marks, and on the y-axis you have the marks of the exam conducted towards the end of the semester. Say, for example, the university is conducting a survey on whether there is any relationship between a student's marks in the end-of-semester exam and in the mid-semester exam, and say the sample data you see correspond to a particular course that runs in a particular semester. Shown in front of you is a sample scatter plot. Here each pair of data is plotted as a single point, and when we view all the points together we get to visualize the relationship that exists between the two variables. Now, the relationship between the two variables can be linear, that is, it can be represented by a straight line; in the example here one variable increases as the other increases, so we can assume a linear relationship. It is also possible that the relationship between the variables is curvilinear, that is, having the form of a curve. In each case, the direction of the relationship tells me whether it is an inverse relationship or a direct relationship. So, in the example shown here, visually we can guess a linear relationship.
Please also be mindful that it is possible that the two variables are not related at all. With this background, let us see how we can create scatter plots in Python. For scatter plots in 2D let me use plt.scatter with data 1 and data 2 as the arguments; this is the correct syntax. As I mentioned earlier, I can even assign or read the data into shorter names, X and Y. Y already exists, so now let me use plt.scatter with X and Y, and as before I can set the title as well as the X label and Y label. Let me quickly type the title as scatter plot of precipitation versus temperature, and then the X label and Y label. As before I am going to use plt.show, and there you have the scatter plot displayed in front of you. Now, most of the data that we deal with in hydrology are time series. For example, rainfall in millimeters per day that occurred during the Indian summer monsoon months of June, July, August and September over, say, the IIT Bombay campus is going to be a time series. To represent such data we need the X axis to be the time axis and the Y axis to represent the data. Time series analysis is a method by which we determine any patterns in data that have been collected over a period of time. So, just to reiterate, by time series we refer to a group of information that is accumulated at regular intervals of time. In this context let me also mention forecasting, or predicting, which is an essential tool in any decision-making process. Shown here is a 1D plot of a mathematical function, actually two mathematical functions, sin theta and cos theta. Now that we have seen plotting in 1D, plotting in 2D and 3D is also possible. We have already seen the scatter plot in 2D. An image, as I always mention, is nothing but a matrix of numbers, and within a geographical information system, that is GIS, there are two ways in which data can be stored: either as a raster or as a vector.
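A minimal sketch of the plt.scatter workflow described above. The data here are synthetic stand-ins generated with NumPy, not the grid-point records used in the lecture, and the choice of temperature on the x axis, precipitation on the y axis and the units in the labels are all assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt


def scatter_precip_vs_temp(x, y):
    """Draw a 2D scatter plot of two equally long 1D arrays and return the figure."""
    fig = plt.figure()
    plt.scatter(x, y)  # each (x, y) pair becomes one point
    plt.title("Scatter plot of precipitation versus temperature")
    plt.xlabel("Temperature (deg C)")      # axis assignment and units are assumed
    plt.ylabel("Precipitation (mm/day)")
    return fig


# Hypothetical stand-in data; replace with the real series read from file.
rng = np.random.default_rng(0)
X = rng.uniform(20.0, 35.0, size=100)  # synthetic "temperature"
Y = rng.gamma(2.0, 5.0, size=100)      # synthetic "precipitation"

fig = scatter_precip_vs_temp(X, Y)
# In an interactive session, call plt.show() to display the figure.
```

Wrapping the calls in a small function keeps the title and labels together, so the same plot can be redrawn for another pair of variables by passing different arrays.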
Raster images are in the form of individual pixels, and in Python we can directly display raster images. How to open ALOS PALSAR data and Sentinel imagery has already been covered in previous tutorials, and similarly we have dealt with how to display raster images in Python. So here let me create raster data using some mathematical functions and then represent it spatially. Towards your right side we have a 3-dimensional scatter plot, wherein each dot represents information along three axes, that is the x axis, y axis and z axis. We have already seen the 2D scatter plot; this is a 3D scatter plot, so we also have the third axis. Moving on, the normal distribution, as we all know, occupies a very prominent place in statistics, with several mathematicians being involved in its development, including the mathematician and astronomer Gauss, and hence the normal probability distribution is often called the Gaussian distribution. Shortly we shall see how to generate plots like these using a set of random numbers. As I mentioned earlier, we started with sample data, real data of precipitation and temperature for one particular grid point, and we carried out some visualization using box plots, CDFs, histograms, relative histograms and 2D scatter plots. Now we will change and synthetically generate random numbers, and then try to understand how the plots can be created. Let us see how to achieve this using Python. As I mentioned earlier, scipy.stats is a very useful package that you should try to work with. So now we are going to create spatial plots using synthetic data sets. For that I am going to use np.arange and np.sin, plot the values using plt.plot, and of course assign labels to both the x and y axes and, as before, give the title. Let me quickly type the commands and then we will see how the plot looks.
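A short sketch of the 1D sine plot just described, using np.arange, np.sin and plt.plot. The range of x, the step size and the label text are assumed values; the lecture does not state them.

```python
import numpy as np
import matplotlib.pyplot as plt

# Evenly spaced x values; the range 0 to 2*pi and the step 0.1 are assumptions.
x = np.arange(0.0, 2.0 * np.pi, 0.1)
y = np.sin(x)

fig = plt.figure()
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("Plot of sin(x)")
# In an interactive session, call plt.show() to display the figure.
```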
This is a plot of sin theta: on the x axis you have the variable x and on the y axis you have sin theta, a simple 1D plot. Now let us plot two curves in a single diagram. Let us create numbers using sin theta and cos theta, so we are synthetically generating the data, and then have them plotted in the same plot. There is a slight difference in the commands: I am going to use plt.plot with x, y1 and with x, y2. So x is going to remain the same whereas y1 and y2 vary; y1, as we have already seen, is nothing but np.sin(x), and here we have specified y2 as np.cos(x). As before, let me give the titles; once you understand the commands the rest becomes very easy, and it feels repetitive because you are already familiar with them. So here you have two plots in the same diagram, represented with different colors, and the legend is also shown. As I mentioned earlier, we can have similar plots for the time series data that are prominent in hydrology. Now that we have covered a 1D plot, let us come to 2D plots. Again let us generate synthetic data. First let me use np.mgrid and then plt.subplots. I want four figures to be shown, and the figure size is kept small so that everything is visible. I want the first figure to show sin theta times cos theta, so I have written np.sin(x) multiplied by np.cos(y), where x and y are the variables. I want the second figure to show np.sin(10 * x) multiplied by np.cos(10 * y), so I am increasing the frequency of the variables. The first figure is sin theta times cos theta; the second is sin 10 theta times cos 10 theta. Let me increase it further for the third. Remember that instead of theta I have used x and y to differentiate between the two variables. And let the fourth figure show sin of x squared plus y squared. So I have four spatial plots. The figure towards the left is sin x cos y, and the next figure is sin 10x cos 10y.
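The two steps above, two curves on one set of axes and four synthetic fields drawn side by side, can be sketched as follows. Several details are assumptions rather than values from the lecture: the x range and step, the grid extent passed to np.mgrid, the one-row layout, the frequency multipliers 10 and 20, and the use of imshow to draw each array.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_sin_cos():
    """Plot sin(x) and cos(x) against the same x values on one set of axes."""
    x = np.arange(0.0, 2.0 * np.pi, 0.1)  # assumed range and step
    y1 = np.sin(x)
    y2 = np.cos(x)
    fig = plt.figure()
    plt.plot(x, y1, label="sin(x)")
    plt.plot(x, y2, label="cos(x)")  # second call draws on the same axes
    plt.xlabel("x")
    plt.ylabel("value")
    plt.title("sin(x) and cos(x)")
    plt.legend()
    return fig


def plot_spatial_panels():
    """Draw four synthetic 2D fields, one per subplot, and return them."""
    # np.mgrid returns two 2D coordinate arrays; extent and spacing are assumed.
    x, y = np.mgrid[-3:3:0.05, -3:3:0.05]
    fields = {
        "sin(x) cos(y)": np.sin(x) * np.cos(y),
        "sin(10x) cos(10y)": np.sin(10 * x) * np.cos(10 * y),
        "sin(20x) cos(20y)": np.sin(20 * x) * np.cos(20 * y),  # factor 20 is assumed
        "sin(x^2 + y^2)": np.sin(x**2 + y**2),
    }
    fig, axes = plt.subplots(1, 4, figsize=(12, 3))
    for ax, (title, z) in zip(axes, fields.items()):
        ax.imshow(z)  # each array element is drawn as one pixel
        ax.set_title(title)
    return fig, fields


fig_curves = plot_sin_cos()
fig_panels, fields = plot_spatial_panels()
# In an interactive session, call plt.show() to display both figures.
```

The last field depends only on x squared plus y squared, so it is constant along circles around the origin.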
I can change the figure size. Finally, in the fourth figure what you see is a circular pattern, because we have specified sin of x squared plus y squared. I have used plt.matshow. Now assume that your data contain some noise, some errors. So I am going to create noisy data, which is nothing but the original data plus noise, and the noise I am adding in the form of random numbers, np.random.random(y.shape). So I have deliberately added noise to the spatial plot that you see on the screen; let us see how it looks. You see the difference in color, don't you? These are just some simple examples for you to understand the commands better. Now let us see how to plot a 3D scatter plot, a 3-dimensional scatter plot which has three axes: the x axis, y axis and z axis. As before I am going to import numpy as np, and from mpl_toolkits.mplot3d I am going to import Axes3D. Let me create three sets of random numbers; I am going to use the command np.random.randn with 50 random numbers in each. So now I have the three sets of variables ready, and let us plot them as a 3-dimensional scatter diagram. I keep the projection as 3d, then type the command ax.scatter, and I can even assign a color to it. Here I am going to assign the color blue, and I can give the marker as well. Let me quickly give the rest of the commands for setting the labels and title. So there you have the 3D scatter plot. As I mentioned earlier, even though the most commonly used distribution in hydrology is the normal or Gaussian distribution, we do have other distributions that are used in hydrology, like the chi-square distribution, exponential distribution, uniform distribution and so on. So let us create probability distribution functions as well as the empirical CDF.
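A hedged sketch of the two ideas covered above: adding random noise to a synthetic field, and drawing a 3D scatter plot of three sets of 50 random numbers. The grid extent, the random seed, the label text and the circular marker are assumptions; the field is the sin(x² + y²) example from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt
# Importing Axes3D registers the "3d" projection on older Matplotlib versions;
# on recent versions the import is harmless.
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

np.random.seed(0)  # assumed seed, only to make the example repeatable

# --- Noisy spatial field -------------------------------------------------
gx, gy = np.mgrid[-3:3:0.05, -3:3:0.05]  # assumed grid extent and spacing
original = np.sin(gx**2 + gy**2)
# Uniform noise in [0, 1) with the same shape as the field
noisy = original + np.random.random(original.shape)

fig_noise, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 4))
ax_left.imshow(original)
ax_left.set_title("Original")
ax_right.imshow(noisy)
ax_right.set_title("Original + noise")

# --- 3D scatter plot -----------------------------------------------------
x = np.random.randn(50)
y = np.random.randn(50)
z = np.random.randn(50)

fig_3d = plt.figure()
ax = fig_3d.add_subplot(111, projection="3d")
ax.scatter(x, y, z, c="b", marker="o")  # blue circular markers
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")
ax.set_title("3D scatter plot")
# In an interactive session, call plt.show() to display both figures.
```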
First, to plot the empirical CDF, I am going to import pyplot, and from numpy.random import normal, and from statsmodels.distributions.empirical_distribution I want to import ECDF. Say I want to create sample data which is normally distributed to start with. Take it as an example; of course you can have a set of data with any other underlying distribution. To specify a normal distribution I need a location parameter and a scale parameter. Here I am going to keep the location as 50, the scale as 5 and the size as 700. So I have created two data sets which are normally distributed, and I can combine them using hstack. I can directly fit a CDF to this data using ECDF, and to obtain the cumulative probabilities I can type commands such as, say, printing the cumulative probability of values of the x variable less than 20; to print the values I am going to use this specific command. Please take care of the syntax, because if you make any small errors while typing, then when you run the code it is going to prompt you to correct them. These lines are printed because I want to know the specific values where x is less than 20, x is less than 30 and x is less than 70. So here I am using ECDF. Whenever you use a function in Python you need to have the associated libraries installed. As we covered this in tutorial 1, I am not going to repeat it now. 70, 30, 20. And finally the plotting, for which I can directly use pyplot.plot. I want to plot the empirical CDF of both the variables x and y. Let me correct the spelling of size; there you have it. So what I have done is create the CDF using synthetically generated random numbers and plot it here. I can also create normally distributed random variables.
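To make clear what an empirical CDF computes, here is a sketch that uses plain NumPy in place of the statsmodels ECDF class imported in the lecture, so it is a swapped-in implementation rather than the lecture's code. The first sample uses the stated location 50, scale 5 and size 700. The parameters of the second sample are not given in the lecture, so the values below, like the helper name empirical_cdf and the seed, are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)  # assumed seed
sample1 = rng.normal(loc=50, scale=5, size=700)  # values stated in the lecture
sample2 = rng.normal(loc=20, scale=5, size=300)  # hypothetical second sample
sample = np.hstack((sample1, sample2))           # combine into one 1D array


def empirical_cdf(data):
    """Return (cdf, sorted_data); cdf(v) is the fraction of data values <= v."""
    sorted_data = np.sort(data)
    n = sorted_data.size

    def cdf(value):
        # searchsorted with side="right" counts how many samples are <= value
        return np.searchsorted(sorted_data, value, side="right") / n

    return cdf, sorted_data


ecdf, xs = empirical_cdf(sample)

# Cumulative probabilities at the three thresholds used in the lecture
for threshold in (20, 30, 70):
    print("P(x <= %d): %.3f" % (threshold, ecdf(threshold)))

fig = plt.figure()
plt.plot(xs, ecdf(xs))
plt.xlabel("x")
plt.ylabel("Cumulative probability")
plt.title("Empirical CDF")
# In an interactive session, call plt.show() to display the figure.
```

With statsmodels installed, `ECDF(sample)` from `statsmodels.distributions.empirical_distribution` can be called in the same way as `ecdf` here.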
Here I am going to use scipy.stats, and I also use matplotlib. You need to import these only once at the beginning of a program; I am repeating them just for the sake of clarity. What I will do is create normally distributed random variables. I am going to name the first one norm_rv1, random variable 1. I am going to use st.norm with location 0 and scale 3. The second normally distributed random variable I am going to create with the same location, 0, but with a scale of 5. I want to see the spread of the PDFs of three synthetically created data sets, so for the third random variable also I am going to keep the location as 0 but the scale as 7. I want the values to be displayed between minus 20 and plus 20; for that I am going to use np.linspace, and to plot the probability distribution functions I am going to use the variable name directly, that is norm_rv1.pdf with x within brackets. Let us plot and see how the results look. So what are we trying to do here? We have created three random variables which are normally distributed, all with location 0 but with different scale parameters, 3, 5 and 7, and we want to display the values between minus 20 and plus 20. So I am going to use the plt.plot function and assign the labels accordingly. Let me assign the labels quickly; we have options to specify the font size as well. So there you have it, I have the PDFs. Let me plot the legend as well: the plots have location 0 and scales 3, 5 and 7, and the values are plotted between minus 20 and plus 20. We can recreate this plot for different distributions because, as I mentioned earlier, in hydrology we also have other frequently used distributions such as the chi-square, exponential or uniform distribution.
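A minimal sketch of the three normal PDFs described above, using frozen scipy.stats distributions. The locations, scales and the plotting range come from the lecture; the number of points in np.linspace, the label text and the font sizes are assumptions.

```python
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# Three normal random variables with the same location and increasing scale
norm_rv1 = st.norm(loc=0, scale=3)
norm_rv2 = st.norm(loc=0, scale=5)
norm_rv3 = st.norm(loc=0, scale=7)

x = np.linspace(-20, 20, 500)  # 500 points is an assumed resolution

fig = plt.figure()
for rv, scale in ((norm_rv1, 3), (norm_rv2, 5), (norm_rv3, 7)):
    plt.plot(x, rv.pdf(x), label="loc=0, scale=%d" % scale)
plt.xlabel("x", fontsize=12)
plt.ylabel("Probability density", fontsize=12)
plt.title("PDFs of normally distributed random variables", fontsize=12)
plt.legend()
# In an interactive session, call plt.show() to display the figure.
```

A larger scale spreads the same total probability over a wider range, so the peak at x = 0 becomes lower as the scale goes from 3 to 7.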
So what I will do is copy and paste the commands and then change them wherever required. For example, instead of creating a normally distributed random variable I want norm_rv1 to represent a chi-square distribution. Chi-square requires as inputs the degrees of freedom, which I am going to set here as 2, and of course the location and scale parameters. Let the second represent an exponential distribution, so instead of st.norm, if you see, I have changed it to st.expon. And the third random variable, norm_rv3, I want to represent a uniform distribution, which is why instead of st.norm I have kept st.uniform. So there you have it: PDFs of synthetically created random variables that represent three different distributions, uniform, exponential and chi-square. So far we have tried to visualize data using plots like box plots, we have done plotting in 1D, we have created spatial plots using synthetically generated data sets, and we have also seen how to create probability distribution functions as well as cumulative distribution functions. In the next part of the lecture we will try to understand how to work with univariate and multivariate statistics. Thank you.
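The change of distributions described in this closing part can be sketched as follows. Only the 2 degrees of freedom for the chi-square case comes from the lecture; the location and scale values, the variable names, the plotting range and the labels are assumptions chosen for illustration.

```python
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# Frozen distributions; loc and scale values below are assumed.
rv_chi2 = st.chi2(df=2, loc=0, scale=1)   # df=2 as in the lecture
rv_expon = st.expon(loc=0, scale=1)
rv_uniform = st.uniform(loc=0, scale=10)  # uniform on [loc, loc + scale]

x = np.linspace(-20, 20, 500)  # same display range as the normal example

fig = plt.figure()
plt.plot(x, rv_chi2.pdf(x), label="chi-square (df=2)")
plt.plot(x, rv_expon.pdf(x), label="exponential")
plt.plot(x, rv_uniform.pdf(x), label="uniform")
plt.xlabel("x")
plt.ylabel("Probability density")
plt.title("PDFs of three different distributions")
plt.legend()
# In an interactive session, call plt.show() to display the figure.
```

Because each frozen distribution exposes the same pdf method, only the constructor lines change when the plotting code for the normal case is reused.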