So, good morning to all of you. It is good to be participating in a workshop where we can discuss various aspects of research methodology. I have been given the task of describing various aspects of statistics in research, and I will try to do that from my perspective of both biology and chemical engineering. But you will quickly realize that many of the concepts I talk about also apply to various other domains of science and the humanities. I am going to start by giving you more of a historical introduction to the topic, and at a later stage I will actually define the theme of the talks that will follow over the next few days. We have two sessions today, and in today's two sessions I hope to cover the elements of statistics. At this point my focus will be on giving you an idea of what is important to understand and learn about. During the course of the rest of the year, and going into a subsequent workshop, you will have time to read on the elements of statistics on your own and then make a connection with the points I will provide as part of my discussions over the next few days.

So, let us jump straight into this with a discussion of some of the historical aspects. I find that this is an interesting way to go about talking about research methodology, because a lot of what we do now in terms of procedure arises from certain historical events. There were needs for development in mathematics at certain points in time, and certain people developed certain techniques. When you now try to make sense of these approaches, you really need to go back into history and understand why things are the way they currently are in terms of what got done way back. There are plenty of good examples of how things were done in the historical development of this field.

It turns out that the word statistics itself relates to governance; in fact, the root of the word statistics is the word state. So there is a connection between states and statistics, and it goes back to the fact that 500 years back governments needed to figure out what was going on with the general population. Two things in particular were on every government's mind: one, where they were going to collect taxes from, and second, where they were going to find people to build armies and go out and beat up other people. Finding money through taxes comes down, if you think about it, to figuring out who exists in your population and who earns what, and therefore which corners of the world you should go towards if you want to collect money. So a lot of the early development of statistics happens in Europe, where you have small countries, in fact city-states, trying to work out the number of people living in their vicinity and therefore the amount of money they could potentially gather from these people to finance, among other things, war. Sometime later, churches, especially in Europe, start maintaining records of births, marriages and deaths. This is the first kind of systematic attempt at record keeping, and of course once you start collecting lots of data you can do things with it, you can analyze it, what we now call data mining.
So, this dates back to the 1500s. Then disease strikes Europe, in particular the plague, and it becomes important to safeguard the king, in particular the king of England, from the plague. So how do you figure out how to safeguard the king? This of course is England, so the king is in London. You have got to find out, in the different suburbs of London, how many deaths are occurring per month as a consequence of plague, and if too many deaths are happening in one locality, then the king definitely does not want to visit that locality for a while. So now you have record keeping, but record keeping with a specific intention, and record keeping being updated very frequently, at a monthly kind of level. The causes and places of death start to be stored, and the movements of the king now reflect local events. In fact, here is one of the first available notices, dating back to 1662, by a gentleman called John Graunt, who is basically collecting data on how many people are dying in a given period. He does this and comes up with what we now call life expectancy: when you are born, how long may you expect to live? And not just when you are born, but also given your current age. He keeps track of the deaths occurring in each age group, and if you look at a table like this, which is what he painstakingly collected, what it immediately sets you up with is the ability to calculate the odds of any person living up to a certain age. So you now start seeing the elements of probability entering into this data analysis. Data is being collected with safeguarding the king in mind, but this data also allows a person to sit down and figure out to what age one may be expected to live. Edmond Halley, of Halley's Comet fame, starts getting involved in these kinds of calculations, and if you think of this life expectancy table, it is immediately of use for calculating life insurance premiums, because the amount of money you pay depends on how likely you are to die in the near future. So you see the elements of data being collected, data being mined, a model being built, and that model being used to conduct business.

In all of this it is about data analysis and data keeping, and separately there is a group of people who are developing the theory associated with probability. When you nowadays study probability, you actually study probability and statistics as one topic, but the fact is that 200 years back statistics was looked at separately: probability was studied by mathematicians, and the connection between the two had not been made. So even though there were famous mathematicians like Bernoulli, Gauss and Laplace, people who are now famous for, among other things, distributions like the Gaussian distribution, the role of the Gaussian distribution in connection with statistics had not been identified. Then, in the 1800s, there is another major development, which is that Francis Galton starts trying to figure out what is behind the inheritance of intelligence. This is a somewhat dangerous idea nowadays, but the idea was that intelligent people would have intelligent children, and so the question was how to breed more intelligent people.
And so you try to work out the genetics; of course, genetics was not fully known then, but they were trying to figure out the rules of inheritance of various features, in particular intelligence. In trying to figure out what correlates with intelligence, Galton starts making a systematic study of a variety of measurements: for example, the life expectancy of a person, the height of a person and so on. One particular problem bothers Galton: when you take tall people and then look at their sons, how come, on average, the sons of tall fathers are not as tall as their fathers? In other words, why can tall people not give rise to even taller children? He makes the observation that though the fathers may be tall, on average (and he is trying to be general here), the children, the sons in particular, of tall fathers tend back towards the average height. In other words, the heights regress back towards the mean, and here in fact is the origin of the word regression. So when we talk of regression, what was originally implied is that the heights regress back towards an average value. We of course now use the word regression in a totally different sense: we talk about how a y variable regresses on an x variable and therefore gives us, for example, a straight line model. But in the original study, about 200 years back, regression had to do with a study of heights. Both regression and correlation theory therefore go back to this analysis.

The next interesting use of statistics also happens in the 1800s, and it dates back to what is known as the Crimean War; interestingly enough, it involves somebody who is not at all inclined towards mathematics. Hopefully you are aware of Florence Nightingale, who was a nurse who took care of wounded soldiers on the battlefield, and she wanted to convey a message back to the politicians in London. These were British soldiers fighting a war in Eastern Europe, lots of Britishers were dying, and she wants to convey to the politicians that they need to do something about the death rates of soldiers. She is the first person known to use a chart to represent deaths and thereby try to influence public policy. What she comes up with is something now called a rose chart, but it looks a bit like a pie chart. If you look at this chart, each sector, each wedge or triangle, reflects a particular month of the year, and she is trying to depict the total number of deaths occurring in each month. The pink towards the centre reflects deaths by injury, the grey, which is the outer part of all these wedges, reflects deaths by disease, and black implies death by any other cause. So what do you immediately learn from this slide? You quickly learn, by looking at this image, that more people are dying because of disease than because of warfare, because of gunshots: the grey area is much larger than the pink area. She is using this as a tool to quickly convey that in specific months of the year a large number of deaths are happening because of disease. And why is disease happening?
Because there is a lack of hygiene, with so many soldiers being forced to stay together in a very small area, in a tent, without proper plumbing. So really the request to London was to do something about the lack of hygiene, and therefore to start sending engineers across to the war front to set up proper plumbing systems. As a separate note of curiosity, one of the first people to come to India and help us out with plumbing was Florence Nightingale. After doing this in the Crimean War, she then proceeds to come to India and helps out, in the 1800s itself, with trying to spread awareness about the need for hygiene and for proper sanitary plumbing. So this is one of the first recorded charts. We do not use this rose petal type structure much now; we are more familiar with the pie chart, where each sector has equal radius. But you will appreciate that, rather than send an elaborate table of data back to a politician in London, the chart immediately conveys a message, and at this point it is important to realize that lots of data can often be better conveyed using pictures. Towards the end of today's session I am going to talk about pictorial representations of data and how they can be used to quickly convey messages.

Continuing with our history, the next big advance is in the early 1900s, and it has to do with a gentleman called Karl Pearson, who develops many procedures to allow scientists to test specific hypotheses. What we nowadays call a hypothesis testing procedure actually dates back to the early 1900s and Karl Pearson, along with other statisticians who developed these procedures at about the same time. Now, there is a very interesting story about a brewery, Guinness, which is famous for making beers of different types. Around the time Pearson is starting to work on hypothesis testing, it turns out that the brewer in this story is in fact a chemist; you do not expect to see mathematicians in a brewery. In those days, if you were trying to make beer, breweries would hire people with microbiology and chemistry backgrounds, because after all there is a fermentation involved using a microbe, and you need to understand the basics of chemistry as you start manipulating both the recovery of your product and its quality. It turns out that there are several problems affecting the brewery, and if you understand a little bit of the beer making process, you will actually appreciate a lot of the statistics that was developed in the 20th century. So let me spend a little time on this one example; we will keep coming back to it over the next few sessions as we talk about actual approaches to hypothesis testing and the need for specific tools in statistics.

What kinds of problems impact somebody trying to make beer? In those days, and we are talking of the early 1900s, beer is made at a plant in England or Ireland, but the British soldiers are all over the world and they all want this particular brand of beer. So what must be done? You must make the beer and ship it out to different corners of the world, and you do that using barrels: you fill up barrels full of beer, put them on ships, and send them to different parts of the world.
So, one important problem immediately arises: if the barrels are going to be on ships, and the ships are going to take different amounts of time to reach different parts of the world, how do you ensure quality control? How do you know that the taste of the beer in a barrel, and for that matter the amount of alcohol in it, remains the same after travelling on a ship for one month as after travelling for five months? That is one problem. The second problem the brewery has is how to fill up the barrel, because somebody who gets a barrel which is half empty is obviously going to be annoyed at having been sold less, while filling a barrel with too much beer is not profitable for the company either. So you have a filling problem: you want the same volume of beer poured into each and every barrel that is going to be sold. So you need some quality control on the taste and strength of the beer, and some quality control on the volumes being packed into barrels.

By the way, the taste of a beer also relates to the grains being fermented away to ethanol. The grains have starch, the starch gets converted into ethanol, and it turns out a lot of the flavour of the beer has to do with which types of grains are used and, for that matter, where the grains are grown. Just as we have different varieties of rice from different parts of a country, you have different qualities, and therefore different tastes, of grain, in this case barley. So if you identify one good strain of barley, the question becomes: how do I mass produce it, and how do I ensure that I get the maximal yield of barley from a given plot? Roughly around this time people also start noticing that if you add nitrogen to the soil as a fertilizer, you improve the amount of barley or wheat grown. So here is an interesting problem that the brewery comes up with: how do I ensure quality control for the barley being produced, or for the wheat that other agriculturists later want to produce? How do you ensure quality control, and how do you maximize the yield? All of these are problems in statistics. How do I know that adding fertilizer improves my yield? It is a simple thing that we take for granted now, but 100 years back the question was: how do you know that adding fertilizer systematically makes a difference? How do you know it causes an improvement in the yield, and how do you know that the increase in yield you see is not because of some other factor? So a lot of what this company did 100 years back, and a lot of the agricultural biology research done 100 years back, strongly influences a lot of the research methodology we now use.
So, with so many problems facing this brewery, the lead brewer at the company, a man called Gosset (the picture on the right of the screen), decides that they need some help from mathematicians. Since there is no mathematics happening inside a brewery, the only way to learn a little mathematics is to actually go to a university, and so you have a very interesting occurrence: somebody trained in chemistry decides that they need help from mathematics and actually takes the effort, and this is somebody who is already middle aged, of starting what is effectively a new basic degree in mathematics by joining Karl Pearson's lab and getting trained in the basic methods of statistics. Remember, Karl Pearson is the man who has developed all these testing procedures, but the issue is this: he is testing his procedures on artificial data, not on practical data of interest to industry or to the scientists around him. So here is a man, Gosset, who needs this mathematics in his process, and he makes the jump, and it is very courageous for somebody not trained in mathematics to make this jump. He spends time at the brewery and quickly realizes that new mathematics must be developed, for example to explain to what extent the volume of beer in a barrel varies from barrel to barrel. If you are trying to work out the variation in volumes, it turns out that the existing theory at that point in time, 100 years back, could not explain the variation. So clearly there is a need for new theory, and who is going to develop it? Nobody shows interest in this man's problem of beer production. The only way out is for the chemist to go and learn the mathematics and develop the new theory himself. There is a moral in this, which is that if you realize that you need a specific skill for your research problem, do not expect that you will find a ready made solution. Sometimes you may need to go out of your way, learn a new skill, and then come back and apply it to solve your original problem.

So, after spending a few years in Pearson's lab and coming up with this new theory, which we will discuss in a subsequent lecture, Gosset comes back, applies the theory to the brewery process, specifically to this problem of filling up a barrel with beer, and realizes that the new theory perfectly explains the empirical observations he has of the amounts of beer in a barrel. Of course, everyone at the brewery is now satisfied, but then a sad thing happens in terms of science, which is that despite developing new theory, Gosset is not allowed by his company managers to publish his research. Why? Because if other breweries find out how to better fill barrels with beer, then the competitive advantage his company has just gained is lost. So the theory is kept to themselves, until a compromise is reached: Gosset publishes his work but does not announce his name and does not announce that he is working at a brewery. He simply publishes the mathematical theory, and in his publications he signs himself only as 'Student', a student of statistics. So all of this good theory of dealing with small samples, of trying to infer results given very few samples, dates back to Gosset, and we will see distributions like Student's t distribution (those of you who have done some statistics might remember the t-test), but unfortunately the scientist's own name was missing in all of this development.
So, personally I find this story very inspirational, particularly because of what I just told you: somebody goes out of the way to map a science problem onto a mathematics problem, realizes they do not know the mathematics, goes and learns it, comes back and applies it to the original science problem, proves that it works, and does it without necessarily trying to gain glory in terms of personal publications.

Now, while Pearson and Gosset are busy trying to figure out how to deal with beer, a chap called Ronald Fisher is investigating other problems in population biology and agriculture. Fisher has a role to play in genetics, but really what I am going to tell you about is the role he plays in agriculture, where, as I told you a little while back, there is this problem of whether one should add fertilizer to a plot of land; fertilizer is not necessarily cheap. So is it worth your while to add fertilizer to a plot of land to get a higher yield, in this case of wheat? Fisher is a person who spent his life developing many advances in statistics. To list some of these by their formal names: analysis of variance, which has to do with how you ensure that one batch of a product is equivalent to another batch. This is very important, for example, if you are talking of batches of vehicles being manufactured, or batches of drugs. How do you make sure, for example, that the amount of insulin packed into a vial made today is going to be equivalent to the amount of insulin packed into a vial made tomorrow? That is very important for the end user, who should not see variation in the insulin per vial. So the variance, or variation, in the amount of insulin from batch to batch needs to be analyzed, and that is analysis of variance. Then there is design of experiments: if you have got yourself some process you want to optimize, and you do not fully understand the physics behind what is going on, how do you very quickly do a set of experiments to try and optimize the process and come up with a better operating condition? Fisher is also the person who developed theoretical concepts like maximum likelihood, and pattern recognition tools like discriminant analysis and classification theory. But as far as we are concerned, he is important because of the randomized experiment concept that he developed about 80 years back, around this question of whether fertilizer should be added to plots of land. I will come to this in more detail in a later talk, but in a nutshell, the problem Fisher had to deal with is that if you are going to prove in your research that adding fertilizer makes a difference to productivity, you have got to rule out the impact of any other variable in your analysis. For example, if you are going to say that adding fertilizer to a plot of land is the reason why you are getting more yield, how have you made sure that, for example, the water content of the soil, or the basic fertility of the soil itself, are not the factors influencing the productivity? This leads to the randomized experiment, and in one of the subsequent sessions, as I said, I will discuss this aspect of randomization in great detail.

Coming to India, it turns out that India also has a very interesting history in statistics.
In fact, we now have the Indian Statistical Institute in Calcutta, and its founder, Mahalanobis, turns out to be a very prominent person when it comes to theoretical statistics; many of his discoveries are used particularly in pattern recognition. He was closely associated with Ronald Fisher, and Fisher visited the institute in 1937. There is also C. R. Rao, a very prominent statistician, who in fact received the National Medal of Science, the highest award the United States gives a civilian scientist, for the contributions he has made to the field of statistics. So that is basically my overview of the history of statistics.

Given that I have already brought up a set of issues, let us now get down to what I expect to cover in my talks. Before I list the exact topics, let us start with a typical modelling procedure that you may expect yourself to be following. Any scientist is going to be concerned with some physical model: there is some process you are interested in, you are going to try to come up with some description of it, and then you are going to come up with some mathematical formulation of that physical model. The model that you come up with will have parameters in it, and then you will start making claims about those particular parameters. For example, a physicist studying gravity might claim that the force on a body is proportional to its mass, and that the proportionality constant is g, the acceleration due to gravity. In step 2 the physicist might then go ahead and claim that the acceleration due to gravity should take on a particular value, for example 9.8 metres per second squared. Then you get down to doing experiments, and you start asking the question: are the estimates of g that I calculate equivalent to what is being proposed? In other words, supposing I do some experiments and I find that my acceleration due to gravity is 9.7, I am left with this question: is 9.7 equivalent to 9.8, in which case the theory that I have proposed has to be considered correct? So it is invariably this iterative process: a model is proposed, there are some parameters in the model that you will identify and make claims about, and you have got to go do experiments and figure out whether the observations that you have are consistent with the claims; if the observations are consistent, then you claim that your model is valid.

So where is the role of statistics in this? Statistics requires that any observations you come up with be contrasted with your model. I want to be very clear about this, and I will repeat it many times throughout my sessions: statistics does not set you up with the hypothesis. Statistics has nothing to do with the first point. You have a problem that you are looking at in your domain, whether it is electrical engineering or mechanical; it is for you to come up with some physical model, and it is your business to make a claim about a parameter in the model. Newton might claim 9.8 metres per second squared, but that is just a claim for a model. Statistics comes in when you do experiments and you want to ask whether the results are realistic and whether they are consistent with the model.
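As a concrete illustration of that last step, here is a minimal sketch in Python of how the comparison between measured values of g and the claimed 9.8 is commonly formalized as a one-sample t-test; the measurement values are made up purely for illustration, and the test itself is discussed properly in a later session.

```python
# Hypothetical example: are repeated measurements of g consistent with 9.8 m/s^2?
# The measurement values below are invented for illustration only.
import numpy as np
from scipy import stats

g_measured = np.array([9.70, 9.83, 9.76, 9.81, 9.68, 9.79])   # made-up trials
t_stat, p_value = stats.ttest_1samp(g_measured, popmean=9.8)

print(f"sample mean = {g_measured.mean():.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A large p-value says the observations are consistent with the claim g = 9.8;
# a very small one says the data and the claimed value are hard to reconcile.
```

Notice that the statistics enters only at this reconciliation step; it has nothing to say about where the claim of 9.8 came from in the first place.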
So, this business of concluding on the validity of a hypothesis is where statistics comes in, and so really, do not confuse the first, creative task of coming up with the model in the first place with the systematic statistical task of reconciling your observations with the physical model. There are two parts to this, and the reason I make a big fuss about it now is that if you pick up any statistics textbook, it will spend a lot of time talking about the collection of data, how to describe it, how to analyze it and how to come to a logical conclusion; it will have nothing on how to come up with the hypothesis in the first place. You have already had some sessions, I think by Professor Karmakar, on how to set up a research problem and how to identify a hypothesis in your specific problem. So where I am going with this is that I am going to continue from the point where you have already identified a hypothesis, you already have a model, and you are about to start an experiment where you are going to collect data. We will spend time talking about the issues involved in the collection of data, how to describe data, how to analyze it, and how to draw logical conclusions from it.

Now, it is really unrealistic to do a whole course in statistics in three or four sessions, so before we go ahead I am going to point out a few textbooks that you can pick up on your own and read later. Remember, all these books explain how to use statistical methods, but they do not spend time on why you use these methods or when to use them, and therefore in my set of lectures I am going to spend time on the why and the when, and hopefully that is of use to you as you try to formulate your own research problems. The reference books that I would recommend at this point, and quite frankly I do not recommend any of them very strongly because there are plenty of good books out there, are some of the books that we prescribe at IIT for our data analysis course. In fact, we have realized that in addition to physics, chemistry and mathematics, the incoming undergraduate students, the BTechs, must now do a data analysis and interpretation course as a compulsory course in the first year itself; that is how strongly we feel that statistics and data interpretation are important to the new engineer. The books we have prescribed for that course are listed here, but the point is that there are plenty of equivalent books out there. Pick up one of them, go through it, and at some point revisit some of these lectures and handouts, which will remain on our website, and try to make a connection between how to use the methods in these books and why and when you are using them, which is what I am going to spend time on.

So why are we using statistical approaches? Because invariably you have a hypothesis which needs some analysis, and the hypothesis, it will turn out, is about the value a model parameter takes, about the relationship between variables, or about the form of the model itself. For example, if you look at the first point, the value taken by a model parameter, the hypothesis could be about whether the acceleration due to gravity is 9.8 or not. Alternately, the hypothesis could be about whether two variables in your process are related to each other. So is there a connection between x and y? How do you prove that x is related to y?
And finally, if you already know that x is related to y, what is the form of the model you are using to describe y as a function of x? Is it a straight line model? Is it a quadratic model? What kind of shape, what kind of curve, what kind of function will you use to describe the relationship between the variables?

Before we get into the intricacies of statistics, one final set of words of caution. Not using statistics in research can be very dangerous: we are typically guilty of doing experiments once or twice and then making big claims about theories based on the few observations that we have, invariably because we are ignorant of how to use statistics in our methodology. But it is also possible for statistics, without a proper question, to be pointless or even dangerous. I will give you one example of where statistics can be dangerous, and later on, when I talk about pictorial statistics, I will give you further examples.

It turns out that there have been studies on the IQ of children, in particular of children from rural areas versus urban areas. If you know the IQ scoring system, an IQ of around 100 or 120 is supposed to be a good IQ, and 150 is genius, or at least you can label yourself a genius if you have an IQ of 150. With 120 as the average, the question somebody might want to ask is: do rural children, for instance, have an IQ of less than 120 on average? That would imply that they have fewer opportunities, do not have access to a good educational system, and in general do not have awareness of events around them. Here is how the study for this particular hypothesis is done. The hypothesis is: do rural children have an IQ of less than 120, where one can assume that 120 is the average IQ of an urban child. So: are rural children worse off than urban children? That statement is being statistically tested by asking whether the IQ of a rural child is, on average, less than 120. There is a model, there is a model parameter, the average IQ, and there are experiments to be done: going out into rural areas, carrying out IQ tests on children, collecting all the IQ scores and compiling an average. This test is done, and the average comes back as 118. So here is the question now: is a rural child worse off than an urban child, given that 118 seems to be the average IQ and that is less than 120? How do you prove this? It turns out that if I want to show that 118 is significantly less than 120, there are ways of doing it in statistics. There is a statistical procedure which will allow you to show systematically that, even allowing for measurement errors in the IQ test, this two-mark difference in a large enough sample set is sufficient to conclude that 118 really is below 120. The net result is that whoever is carrying out this test has the ability, if they do the test properly, to say that 118 is different from 120, is less than 120, in which case the rural child is worse off. Now, this kind of study will obviously impact spending, because if you prove that a rural child is worse off than an urban child, then that implies the government should do something about it and spend money, which of course is a good thing, for money to go into rural sectors. But there is one small problem with this, which is that when the exam for the IQ test is set up, it works with marks that are rounded off in units of 5.
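To make that mechanism concrete, here is a small illustrative sketch in Python (all numbers invented) of how a fixed two-point gap between a sample average near 118 and the claimed 120 becomes "statistically significant" purely by increasing the sample size, even though the instrument only resolves scores in steps of 5.

```python
# Hypothetical illustration: the same 2-point gap (true mean 118 vs claimed 120)
# becomes "significant" once n is large, even though scores come in steps of 5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (25, 100, 1000):
    scores = rng.normal(loc=118, scale=15, size=n)   # simulated IQ scores (made up)
    t, p = stats.ttest_1samp(scores, popmean=120, alternative="less")
    print(f"n = {n:4d}: sample mean = {scores.mean():6.1f}, one-sided p = {p:.4f}")
# The p-value shrinks as n grows, so the procedure will happily declare 118 < 120;
# whether that conclusion means anything is a question statistics cannot answer.
```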
So the net result is that 115 had no business being differentiated from 120, and definitely, therefore, 118 had no business being differentiated from 120. The way the test is carried out, there is no real conceptual difference between a score of 118 and a score of 120, which means you should not even have bothered to ask whether 118 is less than 120. But once the question got asked, and once you ask a statistician to evaluate all the data, the statistician does his work and comes back with the conclusion that 118 is indeed less than 120, and therefore one group is worse off than the other. What this example tells us is that you had no business setting up that hypothesis in the first place. But if you do set up a hypothesis, there is a procedure to take on that hypothesis, you can blindly follow that procedure, and you will come to a conclusion. So there is this question of whether the hypothesis, and the subsequent use of a conclusion, are meaningful or not. And remember, the statistician cannot really comment on why you are trying to set up a hypothesis in the first place; the statistician does not know the physics of it or the sociology of it, the statistician just knows how to crunch numbers. The net result is that if you ask the wrong question of a statistician, you can end up with a meaningless result, which you then take forward into the subsequent research that you do.

With that set of cautions on what I have said so far, I now want to summarize the important themes that I will try to put forward in the set of six talks that I have been assigned in this workshop. I am going to start with random variables, and I am going to spend a lot of time on them. It turns out most people do not understand the concept of a random variable, and we abuse this in how we carry out our experiments. So we will talk about random variables and how to figure out their values. Then we will talk about how to test a hypothesis the formal way; we will spend a lot of time asking not just how to test a hypothesis, but also how the procedure for a hypothesis test can be abused. I am going to spend time talking about how variables are related to each other, in other words how they are associated with each other, and, since in science we spend a lot of time asking what is causing what, causality: given that I know some variables are related to each other, can I go one step further and make comments about what is causing what? Then, finally, we will spend time talking about building and testing models: given a set of variables, how do I figure out which of these variables are relevant to me? We will limit ourselves to simple linear models, and I will spend some time talking about flaws in the procedure we normally employ in fitting something as straightforward as a straight line. Towards the end of my sessions I am going to provide an overview of all the various concepts that I have talked about, and we will end up describing how various problems arise with published research. I am not necessarily talking about research that you and I may do; this involves problems with research coming out of places like MIT and Stanford, the top universities. Everyone is guilty of ignoring some element of proper, systematic experiment design.
And we will spend time talking about how we can make up for such issues. So with that I am going to get into a discussion of the first major theme, which is random variables and their estimation. Now, some of this is going to be new to most of you, so if you have difficulty following the discussion, either in terms of not following the math or in terms of having specific queries about the philosophy of the approach itself, feel free to put up these questions on Moodle or on the chat and I will get back to them as we reach the end of the session. For those of you who are not comfortable with math, and I understand there are plenty of participants from the humanities, the point is that statistics need not be math intensive, and I am deliberately trying to avoid a discussion of the math. So really pay attention to the methodology involved, not to the math involved, and you should be fine.

Our discussion of random variables literally starts with asking what a random experiment is, because a variable involved in a random experiment will end up being called a random variable. So here is a very funny notion: there is something called a random experiment. By the way, and I showed this in the previous slide, there is also something called a randomized experiment, and these are two different things. The randomized experiment I talk about later, when I talk about how to devise good experiments. A random experiment is something you see, for example, when you toss a coin. So why is something like a coin toss thought of as a random experiment? The reason for me to start with all of this is that it will turn out that practically every observation you come up with in your research is likely the outcome of a random experiment. You might not think of it that way; you might think that you are in full charge of your research problem, that you know what you are doing, that you know how to control all the various factors in your experiment. But even if you think you are in full charge of your experiment, it will turn out that what you are doing is a random experiment. So before you think of your own research problems, first think of the coin toss.

Why is the coin toss interesting as an example of a random experiment? What are its features? Where do we use a coin toss? Think of a cricket match. Before the match starts, a coin is tossed to decide who starts off, for example, batting. Why do we like a coin toss as a way to start off a cricket game? Because we think, basically, that it is fair, fair to both teams. We think it is fair, but we also think that the outcome of that coin toss is not known to us in advance; there is no bias in there. Each team has an equal chance going into the coin toss, and therefore each team has an equal chance of winning the toss at the outset. If I write that down in a more formal way, there are three things that are actually important about the coin toss. The first is that you knew all the outcomes of a given coin toss in advance: you knew that when the coin was tossed it would come up heads or tails. That is important to you: the outcomes are known in advance. So when you do an experiment, the outcomes must be known to you in advance, and be careful here, I am not saying that the outcome of a specific trial is known to you in advance.
All I am saying is that when you carry out an experiment, for example measuring the pH in a particular reaction, the pH is a value between 0 and 14; you know that the outcome of the experiment of measuring pH is going to be a value between 0 and 14. So your outcomes are known in advance; the set of values that you can expect to see is known to you in advance, just as with the coin toss. And, very importantly, the actual event occurring on the next trial is not known to you. That is why we like the coin toss: we know it is heads or tails, but we do not know what is going to happen on the next toss, which is what gives it that element of fairness. But then there is also this aspect that the coin toss, if repeated again and again, is an identically repeatable experiment. If the coin stopped being fair after one toss and suddenly became biased, giving you more heads than tails, then the outcome of that initial toss is no longer relevant to you, because what is going on with that coin is changing with time. That would be a useless way to start off a cricket match.

Therefore, when you look at your own experiment, think again of measuring a pH, or let us go back to the acceleration due to gravity. If you are going to measure the acceleration due to gravity, then the value taken by pH or by the acceleration due to gravity must be known to you in advance in terms of a range. You must know the set of possible values, no surprises; you cannot do the experiment and suddenly see a new kind of value come up, because you cannot, in effect, change the rules of the game halfway through an experiment. So the outcomes must be known to you in advance: you know that pH is between 0 and 14, you know that the acceleration due to gravity is greater than 0 and less than infinity. The actual outcome of the next experiment you do will not be known to you in advance: you will do it, and you will get a specific value for pH or a specific value for the acceleration due to gravity. But then it becomes important for you to be able to go back and repeat the experiment under identical conditions; if the conditions you are working under keep changing, that creates problems.

Now, this impacts a lot of science, wherever experiments are difficult to conduct. The coin toss, of course, is very easy to repeat: you can keep tossing a coin. But if I ask you to carry out an experiment to study global warming, this is not an experiment you can do easily. In fact, it is an experiment that nature has been carrying out for years as we change the CO2 levels in the atmosphere, and there is no possibility of repeating it at all. Once you have got global warming and your climate is destroyed, it is not as if you can reset everything and start the experiment all over again. So that is an experiment you cannot repeat. You might have some theory as to what causes global warming; you might know the set of outcomes, for example the temperature of the atmosphere as a function of CO2 levels. But these are not experiments you can repeat on a global scale under identical conditions, because you will never go back to that original climate condition.
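To make the contrast concrete, here is a minimal simulation sketch (a fair coin is assumed; the code is purely illustrative) of an experiment that does satisfy all three features: the outcomes are known in advance, the next outcome is unknown, and the whole experiment can be repeated under identical conditions as often as we like.

```python
# Illustrative sketch of a repeatable random experiment: 100 tosses of a fair coin.
import random

def count_heads(n_tosses=100):
    """One run of the experiment: toss a fair coin n_tosses times and count heads."""
    return sum(random.choice("HT") == "H" for _ in range(n_tosses))

# The identical experiment repeated five times: the possible outcomes (0 to 100)
# are known in advance, but each run gives a different, unpredictable value.
print([count_heads() for _ in range(5)])   # e.g. [48, 53, 47, 51, 55]
```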
So, talking about random experiments in the context of climate research is totally different from talking about them in the context of measuring pH in a reaction, or of carrying out a physics experiment to measure the acceleration due to gravity. For the moment we are going to restrict ourselves to experiments which we know we can reset and repeat under identical conditions; that is important. Once we have got ourselves an experiment that can be repeated under identical conditions, we characterize it as a random experiment. So what is random about this? What is random about it is that the outcome of the next trial is not known to us. That does not mean we have no control over what is going on; it simply means that we have a model, we have a hypothesis, we have a set of values that we expect will show up, but the precise outcome of one precise trial will remain unknown to us until we do it. And when we do it, we will be in a position where we get asked: can you repeat your experiment? And if you repeat your experiment, you will get asked: do all your values average out to something sensible? For example, are all your measurements of the acceleration due to gravity converging towards this value of 9.8 metres per second squared? So it turns out that a variable involved in a random experiment automatically gets called a random variable; the experiment is the more important thing, the variable follows.

Now there is something important to us as we start talking statistics or data analysis, which is that the moment I talk about variables, I must have some numbers associated with them, because in mathematics, if I am going to compute things like an average, I need numbers. I cannot tell you the average of a head and a tail, for instance; I cannot tell you the average of an A grade and a B grade in a course, there is no such thing. But I can tell you the average of marks in an exam, and why is that? Because marks are numbers and grades are not. So one of the key things we need whenever we do our research is this: if you have got some phenomenon which is described with things like grades A, B, C, D, that does not help us much with further analysis, so we have quickly got to find a way to map these grades to numbers. Once we have got numbers, we can ask questions like: is one numerical value more likely to occur than another numerical value? So practically all our research boils down to figuring out where the variables are coming from in a random experiment, how we map them to numbers, and how we then make comments about the numbers that we see.

So let us very quickly look at the coin tosses. If we come up with a set of coin tosses, heads, heads, tails and heads, we are not yet talking of a random variable. The coin toss is a random experiment, but the outcomes listed as heads and tails are not a random variable. Why? Because I do not have numbers there. I need a mapping to numbers, and it takes some effort on your part to say, for example, if you get heads then let x equal 1. Or, if you are looking at two coin tosses in a row, then you have four possibilities: heads and heads, tails and tails, heads and tails, and tails and heads. Those are not numbers, but you can then define a random variable x which says: count the number of heads you have when you toss the coin twice.
In that case, if you get a head and a tail in one attempt at the experiment, your random variable x takes the value 1, and similarly, if you get two heads when you toss twice, x takes the value 2. Now, for the most part in physics our variables already come with numbers: if you talk about force, mass or acceleration, these already have real numbers attached, and we do not have to worry. It is only in problems where you have labels or categories or grades associated with your phenomenon of interest that you need to worry about how to convert from these grades to numbers. So remember at this point that though I have a random variable x which may take the values 0, 1 or 2, it does not mean that I have no control over what is going on; in fact, the question now comes up of how this random variable helps me learn more about what is going on in terms of the average value in my process and the variability in my process. And the moment you have got yourself a random variable, anything you derive from it is also a random variable: if x is a random variable, then any function of x is also a random variable, because it too depends on what the outcome will be for that next trial, that next set of coin tosses. This sounds obvious, but it is important, because in our research we invariably stop paying attention to whether terms in our physical model are variables or constants. It is important to recognize that if one variable is derived from another variable, and that first variable is a random variable whose value varies with the trial you are about to do, then the second variable cannot be a constant; it also depends on the outcome of a particular trial or experiment.

Now, this is something which every teacher in an engineering lab has probably seen. What happens in most engineering labs is that there are several groups which will do the same experiment. Let us take an example from electrical engineering: you use different resistors, you apply different potential drops across the resistors, and you are required to compute the current through each resistor, a simple, straightforward application of Ohm's law. The experiment involves using different resistors, applying different potential drops, and then asking what current flows through the resistor: solve for i in v = iR. You will find that every batch of students who does this experiment more or less comes up with the same value. In fact, the tendency with most students is to try and match the value they get in their experiment with the value the previous batch got; everyone ends up with practically the same value because everyone is looking at everyone else's report. And why is that? You are all trying to make sure that you are consistent with what you all think is the true theoretical value. So if you are solving for i, and the i in Ohm's law is supposed to be 5 amps, and somebody gets 4.9, the tendency is to erase the 4.9, write 5 in its place, and then submit that lab report.
Now why is that relevant to us? It is relevant because, on the one hand, it is of course cheating, since you are not reporting your actual observations; but second, you are not even acknowledging the fact that different people doing the same experiment can end up getting different results without necessarily contradicting the basic theory. That is far easier to see if you think of the coin toss. If I ask you to toss a fair coin 100 times, what do you expect to get? It is a fair coin, it is being tossed 100 times, so you will all answer that the number of heads, if that is the variable of interest, so x is the number of heads, should be 50. If you actually do the experiment, the question is: will you see 50 heads? And if you take all the people in your auditorium and get each of them to do this experiment, are we saying that each of them will see exactly 50 heads out of 100 tosses? Of course the answer is no; it is highly unrealistic that all of you will see the same value. In fact, the probability that all of you will see exactly 50 heads in 100 tosses is practically 0. Even if I take one person to carry out this experiment per centre, and you toss this coin 100 times, the odds are higher that you will see some value in the range 30, 31, 32, all the way up to 49, or on the other side 51, 52, all the way up to 80, than that you will see exactly 50. Yet the tendency is to report back and say the experiment was done, the coin was tossed 100 times, and the answer we got was 50.

So what is the point of this? The point is that we tend to bias ourselves in how we report our results. When we all report the answer back as 50, we do not acknowledge that we have seen different measurements, and fundamentally we are not acknowledging that the process underlying the coin toss experiment is a random experiment. Remember what the random experiment said: it said that you could expect a range of values; you could get any number of heads between 0 and 100 when you toss the coin 100 times. It also said that you do not know what you will get as the actual outcome. So you can take a fair coin, where one would expect to get 50 heads out of 100 tosses, and it is perfectly fine to see a variety of outcomes near 50; but theoretically it is also possible to get 5 heads out of 100 tosses by sheer chance. So if you have 168 centres in our workshop carrying out this experiment independently, at least one centre may be expected to get a highly abnormal number of heads in 100 tosses, and there is nothing wrong with calling that an experiment with a fair coin; that is reasonable. It is reasonable, once in a while, to see an extreme observation or measurement, and it is consistent with the theory you are proposing. So the question fundamentally comes up: at what point is your measurement reasonable, and at what point is it so unrealistic that you think you have got yourself a biased coin, and therefore a biased experiment? It is important to recognize that we are working with random variables and not with constants. I do not want, for example, in a coin toss experiment, to hear back that we are each getting a value of exactly 50, because that 50 implies a constant value. It is not a constant value; it is a range of values that I expect to see as the experiments get done, and when undergraduates do experiments in labs, it is a range of values that I expect to see as different batches carry out the labs, and not the same value, because if it is the same value every time, something very suspicious is going on.
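Here is a small check of those claims, a sketch using the binomial distribution (a fair coin and 168 independent centres are assumed, as in the example above):

```python
# How likely is exactly 50 heads in 100 tosses of a fair coin, and how likely is it
# that many independent centres all report exactly 50?
from scipy.stats import binom

p_exactly_50 = binom.pmf(50, 100, 0.5)
print(f"P(exactly 50 heads in 100 tosses)  = {p_exactly_50:.3f}")        # about 0.08

print(f"P(all 168 centres get exactly 50)  = {p_exactly_50 ** 168:.1e}") # essentially 0

# A single centre is far more likely to land somewhere near 50 than exactly on it:
p_40_to_60 = binom.cdf(60, 100, 0.5) - binom.cdf(39, 100, 0.5)
print(f"P(between 40 and 60 heads)         = {p_40_to_60:.3f}")          # about 0.96
```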
So now, in terms of random variables, if there is a range of values that I can expect to see for any measurement I make in any experiment, and not one fixed value, then I am talking in the language of probability. I have got to start asking: what is the probability that I will see, for example, 50 heads in 100 tosses, and what is the probability that I will see 30 heads in 100 tosses? At this point both are perfectly reasonable outcomes when I toss a coin 100 times; of course, 50 heads out of 100 intuitively occurs more frequently and is more probable than 30 heads out of 100, but exactly how much more probable is something I need to work out. That requires that I describe a probability distribution, which is basically a statement of, for each outcome that you expect, the probability with which that outcome will happen.

So what is probability itself? How do I define it? Probability is the likelihood of something happening; that is one way to describe it. But if you are going to talk of it in terms of actual experimental procedure, it is the frequency with which you see an outcome you like or expect, relative to the number of trials that you carried out. If you think of the coin tosses, with 100 tosses as the number of trials, with what frequency are you seeing heads over all 100 tosses? That is a probability. So the moment we talk about measurements not necessarily being constant, about measurements varying on us, we need the probability distribution; and the moment we talk about a probability distribution, we are stuck with asking questions like: what is the value on average (for 100 tosses we say 50), and how much variation can I expect to see around 50, that is, what is the variance? So the moment we realize that we are expressing things as random variables, we need three other things immediately: we need a probability distribution, we need a mean value, and we need to describe the variation, or variance.

I will just quickly put up two probability distributions that are commonly used; again, if you pick up a book on statistics you will be able to go through these in some detail. The binomial distribution, for example, is what describes the likelihood of seeing k heads in n tosses. So when we are talking about 100 tosses, here is the function which will tell us this probability, a value between 0 and 1. The small p on the right hand side is the proportion of heads that one might expect to see; since we are talking of a fair coin, this p is 0.5, because we expect to see heads with probability 0.5 on a single toss. Given the probability of heads as 0.5 on a single toss, what is being asked on the left hand side is the probability of getting k heads in n tosses. That is the binomial distribution. If you look at a sketch of this binomial distribution, it is a discrete distribution; it is discrete because k can take on only integer values, you cannot have a fractional number of heads. If the proportion p of heads is 0.1, and I am tossing the coin 10 times, so my n is 10, and I am looking at the number of heads, then I should not expect to see a large number of heads, because this coin is biased towards giving me tails; that is what p equal to 0.1 says, and that is what the shape of this distribution is telling you: it is not symmetric around 5, it is in fact leaning to the left, towards giving more tails than heads.
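For reference, the function being referred to on the slide is presumably the standard binomial probability mass function,

$$P(X = k) = \binom{n}{k}\, p^{k} (1-p)^{n-k}, \qquad k = 0, 1, \dots, n,$$

where n is the number of tosses, p is the probability of heads on a single toss (0.5 for a fair coin), and P(X = k) is the probability of seeing exactly k heads in the n tosses.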
In fact, if I work instead with a fair coin, so p equals 0.5, you will see that the distribution becomes symmetric around the value 5. And if I go to the other extreme, with a coin that is biased towards giving me heads 90% of the time, then the set of black dots on the plot tells you that you have got yourself a distribution skewed towards giving you more heads than tails. This is the binomial distribution; we use it quite often to describe processes where we have to work with integers. But really the most important distribution for us in research is the Gaussian distribution, also known as the normal distribution. It describes the variation of continuous measurements. This is a distribution with two parameters in it: a parameter mu, which is the mean, and a parameter sigma squared, which reflects the variance in the measurements. So what kinds of things would follow a Gaussian distribution? For example, the distribution of heights of people. If you look at the set of people in your room, get their heights, come up with a histogram of the heights, and ask what the probability is of seeing somebody of a certain height, that distribution will follow a kind of bell-shaped curve around the average height. That is a Gaussian distribution. The Gaussian is a symmetric distribution: if you take the red curve here, the mean corresponds to the centre point of the distribution, and the variance sigma squared basically tells you how fat each curve is. If the variance is small, as for example with the tall curve in the centre here, it is basically telling you that you have very little variation in your data, and you have a high probability of seeing values close to the mean, which is the centre point of that curve. There is also something called the standard normal distribution, and this comes about in statistics as something important for us, because if I want to compare the distribution of the heights of people with the distribution of the weights of people, it turns out I am going to be looking at different curves. If I look at this particular plot, you will see that I have an x axis on which I am plotting, for example, heights of people, and the heights of people may follow, let us say, the red curve. So what proportion of people are shorter than 3 feet? You will have to say: here is the distribution of heights following the red curve, here is the average height, and for the sake of argument let us say the average height of people is 3 feet; then the proportion of people shorter than 3 feet is the area under the curve to the left of that particular location, and that is 0.5, or 50%, of the people. But then you could be asking further questions: now knowing the heights of people, but also having measured the weights of people, which is the green curve, what is the correlation or connection between heights and weights? Are heights and weights of people related? And now you can start asking more advanced questions, like whether the group of people with heights less than 3 feet is related to the group of people with weights less than, let us say, 50 kilos. How do I make such a comparison? Essentially I am comparing an area under the red curve with an area under the green curve, and that gets a little awkward: first of all I am comparing areas between different curves, and of course heights and weights have different units.
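A small sketch of these ideas in Python, using purely invented numbers (a mean height of 3 feet and an assumed standard deviation of 0.5 feet, neither of which comes from the lecture): the Gaussian density gives the height of the bell curve, and the cumulative function gives the area to the left of a point.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Height of the Gaussian (normal) curve with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):
    """P(X <= x): the area under the Gaussian curve to the left of x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Illustrative numbers only: heights with mean 3 feet and standard deviation 0.5 feet.
print(normal_pdf(3.0, mu=3.0, sigma=0.5))   # the peak of the bell curve sits at the mean
print(normal_cdf(3.0, mu=3.0, sigma=0.5))   # 0.5 -> half the people are shorter than the mean
print(normal_cdf(2.0, mu=3.0, sigma=0.5))   # about 0.023 -> very few people are shorter than 2 feet
```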
Remember, plotting these on the same axis is itself problematic. It turns out that one convenient way to deal with this is to standardize. Standardization involves taking a measurement and asking to what extent an observation, in this case xi, deviates from its average value, the mean, relative to its standard deviation. I realize I have not yet defined the standard deviation, but the standard deviation is a measure of how much deviation you see on average from the mean. So the question fundamentally being asked as you standardize the measurement is: relative to the average value, and given the variation that you see in general with your data, how unusual is this one observation xi that you have? It turns out that if you take any variable and standardize it, it will consequently not have units: if you look at the units of z, remember that xi, mu and sigma all carry the same units, so the units cancel. The net result is that it is easier to work with transformed variables because they do not have units, and that allows for easy comparison of things like heights and weights. Given that in our model building we are looking at a variety of variables which all have their own units, and because you want to make sensible comments about all of these variables, you have got to figure out how to quickly compute probabilities associated with the variables, and these cannot be a function of the units. So why do we need these parameters in our distribution? If I go back to the Gaussian distribution, the Gaussian distribution has two parameters, the mean mu and the variance sigma squared. The mean tells me where the average measurement is likely to fall, and the sigma squared tells me how much variation I can expect to see, and therefore how much spread exists in my data set. Why is that useful? Because when we do our research and collect our measurements, we are going to end up with a huge matrix of data, and you really do not want to be walking around with a large matrix of data. What you want to do is figure out whether there is some curve which explains the spread of data that you see; it is far easier to walk around with a function describing this curve than with the entire matrix of data. So it boils down to asking what shape of function describes the data you have seen, and then how to represent that function in terms of a measure of average, or centrality, which is the mean, and how to also reflect the spread you have seen in your data, given that you are now representing the data with an equation or a function. So at this point we are working with random variables. We are working with these because we know that every single measurement we make has some uncertainty associated with it. The moment we have a random variable, we know that we have got to describe the variation we can expect to see in the measurements; that requires us to talk of probability distributions. The moment we talk about probability distributions, we need to be in a position to say what the average measurement we can hope to see is, and how much spread might exist in the measurements. By the way, there are other ways to describe the shapes of curves. One can ask whether the curve is leaning to the left or leaning to the right, and we saw both of those with the binomial; that is the skew. One can also ask whether a curve has been flattened, which is a higher-order description of the shape of a curve.
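As a rough illustration of standardization, here is a short Python sketch; the means and standard deviations for height and weight are invented for the example, not values from the lecture.

```python
def standardize(x, mu, sigma):
    """z = (x - mu) / sigma: how many standard deviations x lies from the mean (unitless)."""
    return (x - mu) / sigma

# Invented numbers for illustration: a person 2.5 feet tall weighing 40 kg,
# against assumed population means and standard deviations.
z_height = standardize(2.5, mu=3.0, sigma=0.5)    # height in feet -> pure number
z_weight = standardize(40.0, mu=50.0, sigma=8.0)  # weight in kg   -> pure number

# Both z values are unitless, so they can be compared directly,
# even though feet and kilograms cannot.
print(z_height, z_weight)   # -1.0  -1.25
```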
But basically, in one crude, simplified sense, what we are doing when we collect our data is asking what kind of shape of function describes the spread of data that we see, and knowing this function, where we are heading is asking, when we go about getting individual experiments done with individual measurements, whether that individual measurement of a variable is normal or a very unusual event. So if I happen to know that the coin toss follows a binomial distribution with a fair coin, where I expect 50 heads out of 100 tosses, then I need to know the shape of that distribution, and given the shape of that distribution for each and every outcome of coin tosses, I need to be prepared to ask the question: when I next do an experiment and end up observing 20 heads in 100 tosses, is that 20 a normal event or is it too unusual, particularly relative to this value of 50 that we have claimed is the average? Effectively, where this goes is that you start asking about somebody making a claim about some model parameter. If somebody is, for example, claiming that the acceleration due to gravity is 9.8 metres per second squared, and you do a set of experiments, come up with your own measurements of g, and set up a distribution for values of g, the question that will get asked when an individual later on goes and does their own experiment and sees a value of 9.5 instead of 9.8 is: how unusual is 9.5? Is 9.5 a reasonable enough estimate of 9.8, given all the measurement errors you have, such that what you are seeing is not contradictory to your original theory? Or, the alternative conclusion you can arrive at is that 9.5 is so extreme relative to 9.8, and relative to all the values that others have seen, that the claim that the acceleration due to gravity is 9.8 is wrong. So it turns out that if you want to make any informed comment about any parameter in a model, given that you have some uncertainty in your measurements, you are going to have to work out whether the observations you are seeing are reasonable or not, and that ultimately sets us up for the hypothesis test. So we will break for now.
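As a preview of where this is heading, here is a hedged Python sketch of the kind of calculation involved, not the formal hypothesis test itself (which the lecture has yet to define). It assumes, purely for illustration, that repeated measurements of g scatter around the claimed value as a Gaussian with a spread of 0.2 m/s², and asks how often one would see a value at least as far from 9.8 as the one observed.

```python
from math import erf, sqrt

def two_sided_tail_probability(observed, claimed_mean, sigma):
    """Probability of a value at least as far from the claimed mean as the one observed,
    assuming measurements scatter around the claim as a Gaussian with spread sigma."""
    z = abs(observed - claimed_mean) / sigma
    return 1 - erf(z / sqrt(2))

# Claimed g = 9.8 m/s^2; sigma = 0.2 is an assumed measurement scatter, not a lecture value.
print(two_sided_tail_probability(9.5, 9.8, 0.2))   # about 0.13: not especially unusual
print(two_sided_tail_probability(8.5, 9.8, 0.2))   # vanishingly small: hard to reconcile with 9.8
```

Under these assumed numbers, 9.5 is the kind of value an honest experiment could easily produce, whereas 8.5 would call the original claim into question, which is exactly the judgment the hypothesis test formalizes.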