For the moment I am going to continue straight on with the discussion of pictorial statistics. The idea now is to talk about, if you have random variables and you have made measurements, how I am going to convey them to you, and I realize from previous sessions in this workshop that you have already been exposed to different ways of showing data. I am not going to spend too much time on the basics of visualizing data with things like bar charts and so on; instead I am going to try and show you unusual ways to depict data, and why, in this day and age where we can use computers, we should use some of these unusual ways.

I want you to remember something that I think Professor Karmakar discussed in one of his talks, which is that you must always present your results with a certain amount of numerical significance, and that significance has to be scientifically relevant. Now the reason I deliberately repeat Professor Karmakar is that it is one of the most annoying things in any scientific presentation to see people do shoddy calculations with results presented to 10 decimal places, and when you ask them how they have managed to come up with such accuracy, they say that it is basically the calculator or the computer which has done these calculations and given them 10 digits. Any time you present a number, every single digit in the quantity you are conveying must be significant to you; every single digit must mean something. If you do not have the precision, do not show that digit.

Now this matters because when you are showing numbers you are not simply conveying how you performed the calculations; you are saying something about what, for example, an average value is. So if I tell you that the acceleration due to gravity is 9.8 instead of 9.80, I really mean 9.8, and what I mean by that is that I do not know what is happening at the next decimal place. I do not have the experimental apparatus to tell you with more precision what might happen at the next decimal place. So you have got to limit yourself to only what is significant when you talk numbers, and the moment you start doing this kind of rounding off, or conveying only the right number of significant digits, it is important that you then worry about what happens with subsequent calculations. You have got to worry about the least count of every single measurement that you make in your experiment. If there is one step in there which is a weak link, where you do not have precise measurements, then the fact of the matter is that particular variable, that particular measurement, in turn becomes a controlling factor for you, because you cannot gain more precision as you do more things. You cannot fundamentally gain more precision as you go along, because you are forced to round off your calculations to a certain number of significant places, and simply adding additional digits at the end of a calculation because your calculator produces them does not allow you to actually gain scientific significance in your measurements. It is a very annoying thing when students do it, and I want you to be aware that when you present your numbers you should be cautious and show only what you trust and believe are reproducible values.
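A minimal sketch of this habit in Python, assuming an illustrative helper (round_sig is a made-up name, not a built-in function):

```python
import math

def round_sig(x, sig):
    """Round x to `sig` significant figures (illustrative helper)."""
    if x == 0:
        return 0.0
    # The position of the leading digit decides how many decimals survive
    return round(x, sig - int(math.floor(math.log10(abs(x)))) - 1)

# Reporting g to two significant figures says: I trust 9.8 and nothing beyond
print(round_sig(9.80665, 2))  # 9.8
print(round_sig(9.80665, 3))  # 9.81
```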
Now that also matters because errors propagate. That is why, if you come up with, for example, an intermediate calculation in your experiment which is accurate only to two decimal places or two significant places, you cannot suddenly come up with a final answer after additional steps which is accurate to many more decimal or significant places. Errors propagate, so an error in the evaluation of one entity will affect everything else after it. Watch out for subtraction. Here is a very simple calculation I am showing you to convince you of this. When you look at a subtraction of two numbers, here 4.362 and 4.328, each of these numbers has 4 digits in it which are significant. One easy way to figure out the number of significant digits is to convert the number into scientific notation, so 4.362 becomes 0.4362 × 10¹, and 0.4362 means exactly that: you do not know what follows after the 2. Therefore there are 4 significant places, and similarly 4.328 has 4 significant places. Now look at what happens when you subtract two numbers with 4 significant places each: you end up with an answer, 0.034, which has only two significant places left. Why does this matter? If you take this answer and then multiply it by a number with 6 significant places, it does not matter that the next value has 6 significant places; the product will have only the two significant places that you can keep, because the error in this subtraction now dominates anything else you might do. So you have got to pay attention to these things, asking how errors propagate and therefore to what extent your results are numerically significant. We are not talking about statistical significance yet, we are not talking about scientific significance yet; whatever it is you are calculating, right now we only want you to be able to claim that your numbers are numerically significant. So this is what I wanted to reiterate from Professor Karmakar's discussion.
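Here is a minimal sketch of that subtraction in Python; the numbers 4.362 and 4.328 are the ones from the example above, while the six-figure multiplier is made up for illustration:

```python
a = 4.362   # four significant figures
b = 4.328   # four significant figures
diff = a - b
print(diff)            # prints something like 0.03400000000000014; the
                       # trailing digits are floating-point noise, exactly the
                       # kind of false precision to strip before reporting
print(f"{diff:.3f}")   # 0.034: only two significant figures survive

c = 1.23456            # six significant figures, illustrative
print(f"{diff * c:.10f}")  # extra printed digits do not add real precision;
                           # the product is still only good to two figures
```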
Now this also matters in the sense that when you are trying to convey some of this graphically, you have got to pay attention to the precision of the effect you are trying to convey. So again, very quickly, I am going to flip through some of these, because you have seen these things throughout your schooling and also in basic statistics textbooks: there is a line graph, there is a bar graph, there is a frequency polygon; these are very standard ways in which you can take tables of data and plot them. The frequency histogram is important because we start working on probability distributions and looking at how frequently each observation will occur; here we are looking at heights of trees and how frequently each height is seen. I am not sure if this is clear to all of you, but this is the cumulative frequency plot: you keep adding up the frequencies, instead of showing them as a kind of bar chart as in the previous plot. I am adding all these heights up, and that gives me a y value between 0 and 100 in terms of the percentage of measurements observed.

A good statistical plot is something called the dot plot, and this is a quick graphical way to convey a lot of information. In this plot, focus first on the green box; in fact, focus on the top part of the plot, this entire section here. What is being plotted at the left extreme is the lowest measurement in your data set, and what you are seeing at the right extreme is the highest measurement. So you are giving a person a feel for what is the smallest and what is the largest value in your collection of measurements. Then this vertical line in the middle is telling you where the median value might be, the median being the middle-most of your collection of values. The dot in the middle of the green box is telling you where the arithmetic mean might be, and the green box itself is telling you where the middle 50% of your measurements lie. So think of it: you are breaking your measurements up into the smallest 25%, the next 25% and so on, so you have got 4 such quartiles of 25% each, and what this green box tells you is where the middle 50% of your measurements sit. So in one shot, without actually drawing a curve, without drawing a probability distribution, you are able to convey a feel for how much variation is going on and how the measurements are spread about the middle value, the central value. So the dot plot, or the box and whiskers plot as it is often called, is a very useful way to do this.

There is one good example of this coming from physics. Each of these vertical boxes refers to an experiment attempting to measure the speed of light; this was done about 100 years back by a set of researchers. Again, remember what the box is telling you: the box is telling you where the middle 50% of a set of measurements lies, and the outer bars are telling you how much spread there is around them, from minimum to maximum. What you quickly see in this kind of graph is that 5 different attempts to measure the speed of light ultimately have error bars; that is what the outer thin lines show. The 5 different experiments end up clustered around the true speed, which is the horizontal line shown across the plot, and so they are all consistent, within reason, with what is claimed as the true speed of light. None of them actually contradicts the speed of light as claimed according to the hypothesis. So 5 different experiments can quickly be seen to be in agreement with the true measurement, or the true hypothetical speed of light, and a lot of work, 5 sets of experiments with lots of replicates each, is summarized in one plot. So the box plot is a very useful way to do this.
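A minimal sketch of such a box plot in Python with matplotlib, using made-up replicate data rather than the actual historical speed-of-light measurements:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical replicate measurements from five experiments (km/s)
experiments = [rng.normal(loc=299792, scale=s, size=30) for s in (5, 8, 3, 6, 4)]

fig, ax = plt.subplots()
ax.boxplot(experiments, showmeans=True)  # box = middle 50%, line = median,
                                         # marker = mean, whiskers = spread
ax.axhline(299792.458, linestyle="--")   # accepted speed of light, km/s
ax.set_xlabel("Experiment")
ax.set_ylabel("Measured speed (km/s)")
plt.show()
```

The dashed horizontal line plays the role of the true value that all five sets of boxes should cluster around.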
There is something called the waterfall chart, which basically looks at, for instance, the units of something you have in stock. You have got 217 units in stock, but then you find out that 55 of these are damaged, and instead of plotting this 55 as a bar at the bottom, you subtract it from the top itself, so you come down. Then you find that you have got 37 additional units you can add back to your stock, so you go up. This height keeps going up and down and gives you a feel, like a waterfall, for the overall number of units you have which you can actually sell; this is called a waterfall chart. So you can see there are unusual ways to represent data, and you can also see what people are trying to do, which is to use computer graphics to quickly convey to an audience what is going on. There is the pie chart of course, and I had shown you some time back Florence Nightingale's chart, the rose, where she was trying to convey causes of death. There are a couple of links which you should go through on your own, describing the reasoning behind coming up with different kinds of plots. I strongly recommend that you go through these and get a sense of the history which has driven the creation of some of these graphical, pictorial forms of depicting data. Of course, now with access to software like Scilab you can generate very elaborate graphs of functions.

There is something called a heat map, which allows you to look at a collection, in this case of patients. Up here at the top you have a bunch of patients, and on the y-axis is a collection of genes. What a set of biologists are trying to showcase very quickly is that there is a bunch of patients, grouped together as a cluster at the top left, who seem to have some genes which are defective, and the defective genes are shown as green. The point is that there is a huge amount of information, a large number of patients each with a large number of genes being measured, and yet they want to quickly convey that in patients where cancer, for example, develops, there is a set of genes which are going defective. They do this by colorizing each gene as a function of how unusual the measurement for that gene is in people who have cancer as opposed to people who do not. The point in all of this is that the human eye very quickly spots variations; you quickly realize that something unusual is going on at the top left, and then you go and focus on those patients and those genes and investigate further. So your attention gets drawn to the important parts of the plot.

Here is a periodic table in a form you likely have not seen; the man in the center is Mendeleev, who proposed what you normally see as the rectangular depiction of the periodic table. The point is that you can depict it in different ways, and each way conveys different things.

Here is something called a Chernoff face. This is of use particularly in process control and pattern recognition. The idea here is that when an operator, for example in a nuclear plant or in a refinery, has to keep track of a very complicated process, he is more likely than not paying attention to hundreds of variables, with hundreds of measurements and lots of data streaming in, and he has to come up very quickly with some estimate of whether the process is okay or not okay. So here is the trick: each variable which is important in this process gets mapped onto some feature of the human face. Variable X might be the size of an eyeball, variable Y might be the length of the nose, and variable Z, which talks about how good or how healthy the process is, might be mapped onto the smile. By looking at faces you very quickly get a sense of the magnitude of each variable and whether that variable is in a good range or a bad range, and what you basically want to see as an operator is a smiling face; if you realize it is not smiling, you quickly get a sense that you need to do something about the process before things get worse on you. For operators, particularly online, who are confronted with lots of measurements, there is no way to make sense of hundreds of charts of data at one shot, so instead you map the data onto these faces. The reasoning behind this is very simple: even a baby knows when a face is smiling, without training, so an operator is very quickly going to be able to spot a smiling face or a sad face.
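A minimal sketch of the Chernoff face idea in Python with matplotlib; the choice of which variable maps to which feature is invented here for illustration, since the talk does not fix a particular mapping:

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Arc, Circle

def chernoff_face(ax, eye_size, nose_len, health):
    """Map three process variables (each scaled to 0..1) onto facial features."""
    ax.add_patch(Circle((0.5, 0.5), 0.4, fill=False))             # head outline
    for x in (0.35, 0.65):                                        # two eyes
        ax.add_patch(Circle((x, 0.6), 0.02 + 0.06 * eye_size, fill=False))
    ax.plot([0.5, 0.5], [0.55, 0.55 - 0.15 * nose_len], "k-")     # nose
    if health >= 0.5:   # healthy process: bottom arc of an ellipse = smile
        ax.add_patch(Arc((0.5, 0.42), 0.3, 0.2, theta1=200, theta2=340))
    else:               # unhealthy process: top arc drawn lower = frown
        ax.add_patch(Arc((0.5, 0.30), 0.3, 0.2, theta1=20, theta2=160))
    ax.set_xlim(0, 1); ax.set_ylim(0, 1)
    ax.set_aspect("equal"); ax.axis("off")

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
# Three snapshots of a hypothetical process: only the last one is unhealthy
for ax, (eye, nose, health) in zip(axes, [(0.2, 0.5, 0.9),
                                          (0.8, 0.3, 0.6),
                                          (0.5, 0.9, 0.1)]):
    chernoff_face(ax, eye, nose, health)
plt.show()
```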
So operators are very quickly able to recognize online whether the process is going along well or whether a problem is developing; we are just tapping into these kinds of insights in some of the better pictorial representations of data. Okay, just to provoke some discussion, here is an unusual way to represent various countries in terms of how they fit into Africa. Of course, the actual analysis here superimposes more data onto this, to talk about the population sizes and the wealth associated with the countries. Over here you can see the US as a brown segment on the top left, there is China towards the bottom, and India to the right; all of these fit into Africa. Here is the globe, but drawn with area representing population, and you immediately see India and China occupying a huge area of the globe, while Russia, which you normally see as a huge country, has practically vanished because it has a very small population.

Now, in addition to those two links I gave you before, I want you to visit this particular site and look around at the presentations and the software available. You can also look up this gentleman called Hans Rosling and view his talks on Ted.com. He has set up a very interesting way to visualize a lot of data relating to public health. The idea is that the World Health Organization and, for that matter, different governments have over the years collected lots and lots of data on various indices of poverty, health, education and so on, and the aim is to quickly find out relationships between these variables, which is very hard to do given hundreds of variables and lots of data points. So he has got a tool; I will just give you a screenshot of it, but you can play around with it yourself. It is an animated tool: you can see a slider at the bottom, and by playing around with it you can scroll forward in time and see how things have changed historically. In this plot, for instance, each bubble represents a country and the size of the bubble represents its population. On the y axis is the number of internet users per 100 people, and on the x axis is the income per person, and you are able to watch this as it evolves with time. The whole point is that what would take you a page or so of text to write down and explain can probably be explained in 5 seconds of animation. It turns out Google also now has an equivalent piece of software called the Public Data Explorer, so you can visit this site and again play around with it; the entire data set, for all kinds of public health indices, is also available at this site. So you get an idea of how to convey lots of information quickly, accurately and to the point.
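A minimal sketch of that style of bubble chart in Python with matplotlib; the country names and figures below are entirely made up for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical country-level figures, purely illustrative
countries  = ["A", "B", "C", "D"]
income     = [2_000, 10_000, 35_000, 55_000]  # income per person (x axis)
internet   = [8, 35, 70, 90]                  # internet users per 100 (y axis)
population = [1_300, 200, 60, 330]            # millions; drives bubble area

plt.scatter(income, internet, s=population, alpha=0.5)
for name, x, y in zip(countries, income, internet):
    plt.annotate(name, (x, y))
plt.xscale("log")   # Gapminder-style plots typically use a log income axis
plt.xlabel("Income per person")
plt.ylabel("Internet users per 100 people")
plt.show()
```

Animating such a chart over time, as the Gapminder tool does, amounts to redrawing this scatter for each year of data.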
Now all of this comes at a cost, so I put this in as a word of caution. The whole idea of putting elaborate pictures on a slide is to reduce confusion, and as you add information to a slide you do at first reduce confusion, but as you keep adding more and more information the confusion starts increasing again. So the question is at what point you really want to operate in terms of pictures and plots, and how much data you wish to cram in. Fundamentally, as somebody watching a presentation, you do not want to see very large tables of data, and you do not want to see plots crammed with lots of variables and lots of bars. So the idea is to operate at the point where the confusion is least: present information on a slide where your confusion is at its minimum. At this point I will stop my discussion of pictorial representation of data.

R. Basilius College of Engineering Technology: We have a question with regard to a point that you have raised, which is that different statistics can be created as estimates of a parameter. How do we test whether these statistics are from the same population? Is there any stipulated test we have to adopt to test that these statistics are from the same population?

So I will repeat that question, because it is likely not clear to most centres. The question is about whether different estimates of a parameter all behave similarly and follow a similar distribution. At one point I talked about how it is possible to come up with different estimates of an average height: you can work with a mean, a median, a mode, a (min + max)/2 and those kinds of things. They are all different estimators of some parameter. So the answer to your question is that they will each behave slightly differently, and the reason we preferably work with the arithmetic mean relative to other estimators is that the arithmetic mean converges fastest to what we really want to know. Since we are spending a lot of effort developing experiments to find out mu, the point is that x bar, the arithmetic mean, converges to mu with the least number of replicates, that is, with the smallest value of n, and because it is invariably expensive to do experiments we want to keep n as small as possible. That is why we really want to work with the arithmetic mean and learn about mu from it, rather than use any of the other estimates. But fundamentally, if you have the ability to repeat your measurements, any of these estimators will ultimately get you there. If you can keep measuring and then work out your estimate, you will get there; but if the objective is to get there with the least number of measurements, then you should preferably use the arithmetic mean, unless there is reason to worry about the quality of some of your measurements.

I will give a quick example of that; I talked about it earlier in the context of heights. Suppose you have some measurements which you think might be faulty for some reason or the other: for example, you have some very low measurements and, at the other extreme, some very high measurements, and you think they are not to be trusted. If you compute an arithmetic mean, you end up involving all your measurements in working out the average, and that includes the measurements you do not trust. On the other hand, if you work with the median, the median does not involve these measurements at the extremes, because you are trying to find the middle-most value, and hopefully you have enough trustworthy middle-most values. So the reasons to pick a particular estimator actually come back from the way you are designing your experiment, the measurements you are trying to make, and how much you believe in your measurements in terms of the instrumentation and the procedure followed for collecting your data. But if there is no reason to worry about those, then the recommendation is: stick to the arithmetic mean and work with that as your estimate of the model parameter you are trying to get.
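A minimal sketch in Python of the two claims above, that the arithmetic mean converges to mu fastest for clean data, and that the median holds up better when a few measurements are untrustworthy; the sample sizes, outlier values and seed are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 170.0, 10.0, 25   # hypothetical height data (cm)
trials = 10_000

means, medians = [], []
for _ in range(trials):
    sample = rng.normal(mu, sigma, n)
    means.append(sample.mean())
    medians.append(np.median(sample))

# For clean Gaussian data the mean is the tighter (faster-converging) estimator:
print(np.std(means))    # smaller spread around mu
print(np.std(medians))  # roughly 1.25x larger for Gaussian samples

# Inject a few untrustworthy extreme readings and the median holds up better:
means2, medians2 = [], []
for _ in range(trials):
    sample = rng.normal(mu, sigma, n)
    sample[:2] = [20.0, 400.0]   # two faulty measurements at the extremes
    means2.append(sample.mean())
    medians2.append(np.median(sample))
print(abs(np.mean(means2) - mu))    # the mean is pulled away by the outliers
print(abs(np.mean(medians2) - mu))  # the median stays close to mu
```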
So hopefully that answers your question. Thank you, sir. KMEA college, go ahead if you have a question.

My problem is scheduling oriented, so for that purpose we have to select standardized data from supercomputer centers and the like. We are getting a lot of data; how do we select specific data from it, and what can we make out within three months using the given data?

Okay, so this goes back to what I suggested as a guideline at the beginning of the first talk this morning about statistics. You are asking me which kinds of statistical procedures you should use to cope with the data that you seem to be collecting, and it is not clear to me what the context is in which you are getting the data. But the fact of the matter is that the statistical approach really has nothing to do with the scientific significance of what you are doing. It seems to me that you are asking what procedures can be followed statistically without having a scientific question in the first place, and if you remember, I had warned against that. So first come up with a scientific question which needs an answer, one for which you should spend some effort doing experiments and collecting data, and it must be a question which needs you to look at statistics in the first place. The question comes first, the hypothesis comes first, and if there is a hypothesis then almost automatically there will be a statistical procedure which you can employ to take on that hypothesis. In effect, what you are doing is going the other way around: you are saying you have access to data, at least that is how I interpret what you have said, and you are asking, given lots of data, what can you do with it. There are many things which can be done with data sets in general, but the real question is whether that effort is targeted towards one end goal, and that end goal has to be driven by a scientific question in the first place. That is the best I can do without knowing the context in which you are collecting the data. Okay, thank you.