Welcome to tutorial 4, image statistics using Python. Whenever we do not have clarity on, say, the underlying physical processes, or we do not have any information regarding the variables, statistics plays a very crucial role. Hence, in this tutorial, a few means of data analysis and visualization are presented.

So let us move forward. See, data is nothing but any number of related observations. This can be repeated survey observations taken using a GPS, or it can be the digital numbers of an image captured at a particular frequency. Before we analyze the data we call it raw data. Why raw? Mainly because it has not been processed by any kind of statistical method. Now, many a time, more data does not necessarily mean more information. So in this tutorial we shall learn about the different ways in which we can summarize and present data, and for this purpose I have created small subsections.

In the first section we shall try the graphical representation of data, as in how we can create one-dimensional, 2D and 3D plots using synthetic data as well as data of variables like temperature and precipitation. Section 2 shall focus on univariate and multivariate statistics, and section 3 shall cover linear regression, followed by section 4, where we will try to understand hypothesis testing. In section 5 we shall introduce the concept of autocorrelation and lag.

So let us begin with part 1, plotting using Python. A few exercises in this tutorial have been conducted using sample data comprising precipitation and temperature. Before we start, let me give you a brief overview of the sample data used. The precipitation data is taken from IMERG, and the units are millimeters per day. IMERG stands for Integrated Multi-satellite Retrievals for GPM, where GPM is the Global Precipitation Measurement mission. This is a multi-satellite precipitation product. We shall be covering more details about this product in our lectures.
So for this particular tutorial I am not going into the details of how the product has been created and so on. Understand that this is a gridded precipitation product, and we have downloaded sample data, what you see here, for one particular grid for these many days. The second variable we have considered is the dew point temperature in Kelvin from ERA-Interim. ERA-Interim is a global atmospheric reanalysis product, and we have used its dew point temperature as the other variable. By dew point I mean the temperature to which air needs to be cooled, at constant pressure, so that we reach 100% relative humidity. Again, as I mentioned earlier, a few exercises focus on these sample data sets and a few others use synthetically generated data sets.

Now, moving further, as I was mentioning in the introductory portion, whenever we arrange data in a usable format it equips decision makers to make intelligent decisions. Nowadays, even if we collect an enormous amount of data, we are able to compress it quickly into, say, tables, graphs or numbers. So as part of this tutorial we will understand a few small ways in which we can summarize information from raw data.

We will start with histograms. A histogram consists of a series of rectangles; a sample histogram is shown here. The width of each rectangle is proportional to the range of values within a class. On the x-axis, the horizontal axis, we have the values of the variable, and on the y-axis we have the frequency in each class. So vertically, the height of each rectangle is proportional to the number of items falling in that particular class. Histograms clarify patterns which are not easily visible from tables, and hence they are considered useful. In this context we shall also see how to create a relative frequency histogram, which uses the relative frequency of the data in each class.
So to summarize, say I have the digital numbers of an image captured in a particular region of the electromagnetic spectrum. Using those digital numbers we can create a histogram wherein the x-axis is divided into classes, that is, certain ranges of digital numbers, and the height represents the frequency: how many times values occur in that particular class interval.

Moving on, we have box plots. What you see on the screen in front of you is a typical box plot; of course, I have omitted the x and y axes. To perform exploratory data analysis, that is EDA, we have several plots by which we get to summarize the data very quickly. There are many more plots, but here, as I mentioned earlier, we shall focus on just a few. The box plot you see here gives a graphical representation of the median, that is the middle horizontal line; the quartiles, that is the top and bottom horizontal lines of the box; and the extremes, represented by the whiskers that extend from the box.

So let us see how to achieve this in Python. As before, let us create a notebook in the documents folder. I already have an NPTEL tutorial folder. I am going to create an empty notebook and name it tutorial 4, with the name of the course as always. Let me give a subtitle so that we are clear: part 1, plotting and analysis. As we shall be dealing with plotting and analyzing data, I am going to import numpy as np. We also need matplotlib, so let me import matplotlib.pyplot as plt, and let me import pandas as pd. Now pandas is a Python library that is used for data manipulation and analysis. The sample data that was shown in one of the earlier slides has been kept as an Excel file. So let us try to read the data from the .xlsx file using the command pd.read_excel. I have named the file data, which is why I write data.xlsx within quotes. Now let us see whether it has been read. So this is the data.
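The loading step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the real data.xlsx is not available here, so a small synthetic table stands in for it, and the column names (in particular the precipitation one) are hypothetical placeholders rather than the names used in the actual file. Only NumPy and pandas are needed for this step.

```python
import numpy as np
import pandas as pd

# In the walkthrough the table is read from an Excel file:
#     data = pd.read_excel("data.xlsx")
# The stand-in below only mimics that layout; every value is synthetic.
rng = np.random.default_rng(seed=0)
n_days = 30

data = pd.DataFrame({
    # hypothetical dates, not the period covered by the real sample
    "time": pd.date_range("2020-01-01", periods=n_days, freq="D"),
    # hypothetical column name; the real precipitation name is longer
    "precipitation (mm/day)": rng.gamma(shape=2.0, scale=3.0, size=n_days),
    # assumed column name for the dew point temperature in Kelvin
    "temperature (K)": rng.normal(loc=290.0, scale=3.0, size=n_days),
})

# Check that the table looks as expected
print(data.head())
print(data.shape)
```

With the real file, only the commented pd.read_excel line would be needed in place of the synthetic DataFrame.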
I have time, and precipitation can be seen. The name is a bit lengthy; I have kept it that way so that you understand the source of the data. And then we have temperature. Now let us check the data set. I do not want to keep typing the variable name every time because, as you saw, the name of the precipitation variable is lengthy. So what I will do is read the temperature variable and assign it to y. The column name is temperature (K), where K is Kelvin. So I index data with temperature (K) inside square brackets and read it into y, which means y is now going to represent the temperature values.

Let us try to plot the histogram using these temperature values, to visualize how a histogram looks for them. For that we are going to use the plt.hist function. So plt.hist of y, and we can also specify bins, that is the number of bins you want, and the width of each bar. You can even specify the color in which you want the histogram to be drawn, so here let me keep it as magenta. plt.gca gives us the current axes, which helps us give a title to the plot. I am going to name it frequency histogram, or histogram, either way. It also helps us give the labels for the horizontal axis and the vertical axis. So here I am going to keep the x label as temperature and the y label as number of data points. Then I am going to use plt.show.

Let us see how the histogram looks. I have the title displayed, and the x label and y label displayed, and you can see I have specified the number of bins as 5 and the width as 0.3. You can play around and use different colors, and I can even change the width and the number of bins. I will revert back. Also, as you have two different variables, just as we used the subplots function earlier to view figures side by side, I can visualize the histograms of both variables side by side.
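A minimal sketch of the histogram step described above, assuming synthetic values in place of the real temperature column. The documented rwidth argument of plt.hist (relative bar width) is used here as a stand-in for the width setting mentioned in the walkthrough; that substitution is an assumption, not necessarily what was typed in the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the temperature values (Kelvin)
rng = np.random.default_rng(seed=0)
y = rng.normal(loc=290.0, scale=3.0, size=200)

# plt.hist returns the counts per bin, the bin edges and the drawn rectangles
counts, bin_edges, patches = plt.hist(y, bins=5, rwidth=0.9, color="magenta")

# plt.gca() returns the current axes, on which title and labels can be set
plt.gca().set(
    title="Frequency histogram",
    xlabel="Temperature (K)",
    ylabel="Number of data points",
)

# In a notebook, plt.show() would display the figure at this point
plt.close()

print(counts)     # number of values falling in each of the 5 classes
print(bin_edges)  # 6 edges delimiting the 5 classes
```

Changing bins or color and re-running is an easy way to see how the choice of classes changes the picture.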
The commands remain the same. I specify the axes ax1 and ax2 and use plt.subplots, as mentioned earlier, with 1, 2, that is 1 row and 2 columns, and a figure size. By now I hope that you are familiar with these lines on how to create subplots and how to visualize them, so I am not going into details. Let me just type the title of the plots; I am going to keep it as subplots of histogram. I can specify the axis here. For the sake of clarity I have kept the data as data with temperature (K) in square brackets; you can even use y, because you have already assigned the temperature data to y. As before, I get to specify the width, bins and color. Similarly I can set the title as well, histogram of temperature, and I can specify the labels as before, x label and y label. As scientists and engineers, we are typically accustomed to dealing with data sets.

So let us now type the name of the second variable. As I mentioned earlier, if you fear that you will make a mistake in typing the lengthy variable name, you can assign it to a shorter name, just like we did with y equal to the temperature column of data. So now I have set the axes and given names to the horizontal as well as the vertical axis of both histograms, and I want them to be plotted side by side. As before, let me use plt.show. I have used two different colors to visually distinguish the histograms of temperature and precipitation. You see the figure is very long; it is not fitting into the window, is it? So let me reduce the figure size. Much better. So now I have the histograms of temperature and precipitation displayed side by side.

Now, as mentioned earlier, let us try to visualize a data set using a box plot. Here I am going to use the command plt.boxplot, and within brackets I am going to specify the variable, either y or data with temperature (K) in square brackets. As before, I get to specify the title of the plot, and I can name the x-axis and y-axis. I have to correct the spelling of the y label.
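The side-by-side histograms and the box plot described above could look roughly like this. It is a sketch under assumptions: both variables are synthetic stand-ins, and the colors, bin counts and figure size are illustrative choices rather than the values used in the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the two variables
rng = np.random.default_rng(seed=1)
temperature = rng.normal(loc=290.0, scale=3.0, size=200)   # Kelvin
precipitation = rng.gamma(shape=2.0, scale=3.0, size=200)  # mm/day

# 1 row, 2 columns: one histogram per axes object
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
fig.suptitle("Subplots of histogram")

ax1.hist(temperature, bins=5, color="magenta")
ax1.set(title="Histogram of temperature",
        xlabel="Temperature (K)", ylabel="Number of data points")

ax2.hist(precipitation, bins=5, color="green")
ax2.set(title="Histogram of precipitation",
        xlabel="Precipitation (mm/day)", ylabel="Number of data points")
plt.close(fig)  # in a notebook, plt.show() would display the figure instead

# Box plot of a single variable
box = plt.boxplot(temperature)
plt.gca().set(title="Box plot of temperature", ylabel="Temperature (K)")

# The returned dictionary exposes the drawn lines, e.g. the median line
median_from_plot = box["medians"][0].get_ydata()[0]
plt.close()

print(median_from_plot, np.median(temperature))
```

Note that with matplotlib's default settings the whiskers reach at most 1.5 times the interquartile range beyond the box, and points further out are drawn individually, so the whisker ends are not always the absolute minimum and maximum.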
So now you see the box plot, which tells you the extremes, the quartiles and the median value. I can even replace the variable with y, because I do not want to keep typing the lengthy variable name every time; it will give us the same plot. So we have tried to cover two different ways of visualizing data.

As I mentioned earlier, we will be covering a few examples of how to summarize data, because every time we summarize the information content of raw data it gives us an idea about the system from which the data has been taken. In this regard the cumulative frequency distribution is useful. In the previous plot, the histogram, we are just counting the number of items within each class or interval. In a cumulative frequency distribution we are trying to understand how many observations lie above or below a certain value, which means a cumulative frequency distribution gives us information that a histogram does not show directly, is it not?

We will also be covering relative frequency here. Relative frequency just means that we divide each class frequency by the total number of observations to obtain the proportion of observations in each class. So instead of frequency, we are trying to look at the information through relative frequency.

Let us quickly continue and see how to proceed in Python. In this context, let me tell you that we have something known as the SciPy library, which contains packages for statistical functions. We can perform analysis on both continuous and discrete random variables, and we also get to work with different distributions. So first let us try to plot the relative frequency of temperature. I am going to use np.histogram, and I have specified the number of bins here as 10. Let us try to print the frequencies; I just want to see their values.
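As a small illustration of the counting step and the relative-frequency definition above, the following sketch uses synthetic temperature values (an assumption, since the sample file is not reproduced here) and divides each class frequency by the total number of observations.

```python
import numpy as np

# Synthetic stand-in for the temperature values (Kelvin)
rng = np.random.default_rng(seed=2)
y = rng.normal(loc=290.0, scale=3.0, size=200)

# np.histogram only counts; it returns the frequencies and the bin edges
frequencies, bin_edges = np.histogram(y, bins=10)
print("frequencies:", frequencies)
print("bin edges:", bin_edges)

# Relative frequency: each class frequency divided by the total count
relative_frequency = frequencies / frequencies.sum()
print("relative frequency:", relative_frequency)
```

Because every observation falls in exactly one of the 10 classes, the relative frequencies add up to 1.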
Now, as I mentioned earlier, we can use relative frequency as well to summarize the information content, which means I am going to divide each frequency by the sum of the frequencies; I am taking the proportions here. So let me plot that as well using plt.plot. As mentioned before, I can give the color red and label it, and I can set the title as relative frequency of temperature, in single quotes. Let me specify the x label and y label as before. I want to display the legend as well, so I will use plt.legend and finally plt.show. So I see the frequencies here, I see the bin values here, and I see the relative frequency plot here, summarizing the information content. I hope you understand the difference between what you saw as a histogram, what you saw as a box plot and what you are seeing here.

As I mentioned before, you can also estimate relative frequency using the SciPy library, which contains packages for statistical functions. So I can write from scipy import stats and use stats.relfreq, where y is nothing but the temperature variable, and then I get the result.

Let us try to plot the cumulative distribution function, that is the CDF, of the variable temperature. I can use np.cumsum, the cumulative sum, and then plot it using plt.plot. As before, I can add labels, set the title, set the y label, and use plt.legend and plt.show. So there you have the CDF of temperature plotted.

So till now we were using certain sample data sets to visualize the information content in data using plots of the histogram, CDF and box plot. We shall be continuing this section with a few other diagrams that help you analyze multiple variables. Thank you.
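As a companion to the last steps of the walkthrough, here is a hedged sketch of the relative-frequency plot, the SciPy cross-check and the empirical CDF. The data are synthetic, the axes-based plotting style is an illustrative choice (the walkthrough uses plt.plot directly), and normalizing the cumulative sum so that it ends at 1 is an assumption about how the CDF was scaled.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic stand-in for the temperature values (Kelvin)
rng = np.random.default_rng(seed=3)
y = rng.normal(loc=290.0, scale=3.0, size=200)

# Relative frequency from NumPy counts
frequencies, bin_edges = np.histogram(y, bins=10)
relative_frequency = frequencies / frequencies.sum()

# SciPy's version; by default its bin limits extend slightly beyond the
# data range, so individual values can differ a little from the NumPy ones
result = stats.relfreq(y, numbins=10)

# Empirical CDF: running sum of the relative frequencies
cdf = np.cumsum(relative_frequency)

bin_centres = 0.5 * (bin_edges[:-1] + bin_edges[1:])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(bin_centres, relative_frequency, color="red", label="Relative frequency")
ax1.set(title="Relative frequency of temperature",
        xlabel="Temperature (K)", ylabel="Relative frequency")
ax1.legend()

# Each CDF value is plotted at the upper edge of its class
ax2.plot(bin_edges[1:], cdf, label="CDF")
ax2.set(title="CDF of temperature",
        xlabel="Temperature (K)", ylabel="Cumulative relative frequency")
ax2.legend()

plt.close(fig)  # in a notebook, plt.show() would display the figure instead

print(result.frequency)
print(cdf)
```

The last CDF value is 1 because all observations lie at or below the upper edge of the final class.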