Good morning, and welcome back for the third and final day of the session. This morning we have Dr. Santosh Noronha, Professor of Chemical Engineering at IIT Bombay. About six years ago, when we revamped our BTech program, there was a committee, Professor Biswas's committee, which did that, and the Biswas model has essentially been implemented. In the core program, which sits essentially in the first year and spills over into the first semester of the second year, we added a pair of courses: a theory or main course and a lab course. The theory course is known by its acronym, IC102, but it is Data Analysis and Interpretation. Based on that, today we have his lecture on understanding and describing research results and conducting experiments. Before he starts, I think he will spend a few minutes on himself and his links to this. Thank you.

So, welcome to all of you for what could be a session in math and how it applies to research methodology. There is a very long pair of titles for my talks, but quite frankly, what we really need is this line at the bottom here. Hopefully you can make out that we are talking statistics and research methodology. What I am going to do is basically point out interconnections between various modes of research, things that have been described to you in the first two days, particularly on the first day when you saw those videos of Professor Kallmarker, and we will point out how some of these concepts actually emerge from a data analysis perspective. If there is a need to come up with rigorous insights into some physical process, you are going to have to look at data, and invariably you are going to have to come up with some description of the process in terms of statistical significance. So we will try to distinguish between various concepts of significance here, in a scientific sense and in a statistical sense, and point out how these invariably get abused through wrong research design. This is not a course in statistics as such. It is more an occasion to go back to a few basic concepts in statistics and talk of how they interrelate with good research design. I am not sure how many of you have done a course in statistics. So, are you all familiar with statistics? OK, that is a small number.

From what was just told to you, you will recognize that we feel, at least at IIT, that data analysis and interpretation is a key course. In fact, our entering BTech undergraduates are now required to do this as a course on par with physics, chemistry and mathematics, and for that matter we are about to introduce a core course in biology. And the fact of the matter is that data analysis and interpretation is required in all these fields, because at the end of the day, when you are talking about new research areas, whether it is nanomaterials or biology or even systems, the modeling of large systems, you have got to come down to trying to understand what is going on in terms of the data that is available to us, and how we then come up with conclusions in a rigorous, systematic way. As we go along, one of the things I am going to do is to try to distinguish some of the classical ways of doing research. What did we do 50 years back, and what is it that we are up to now? The way the structure of DNA was identified 50 years back is totally different from the way biology, for instance, is done now, where whole genomes of organisms are sequenced in a week.
So, when you had the bird flu outbreak in the Far East, within a week the genome was sequenced, a large amount of data was analyzed, and a vaccine was developed. And this inherently comes down to understanding data at large scales and manipulating data at large scales. So we will try and understand all of these aspects. While we initially partitioned these lectures into two areas, understanding and describing research and conducting experiments, I am actually going to move back and forth between the two topics. So a generic title might be statistics and research methodology, not statistics in research methodology: we will talk of the interconnection between the two.

I am in the Department of Chemical Engineering, but my research areas have to do with bioprocessing. So you might see a bio example or two as we go along today, but what I say applies to all areas. Bio is at the forefront of statistics, and the reason is very simple. You want a drug; a pharma company comes up with a drug. You have got to prove that this new active molecule is a drug, that it is not the same as a sugar pill. And how do you do that? You do clinical trials. You take so many people, you give half of them a sugar pill, you give the drug to the rest, and then you prove that the drug actually seems to work on that bunch more than the sugar pill does on the other bunch. So there is already a hint of a research design for you: you need two groups, you give one the drug, you give the rest something else. We will discuss why we need to give the rest something else, and why we couldn't have just left them alone. But the point is that if you want something to be marketable as a drug, you have got to do trials, and there is statistics. In fact, pharma companies employ more statisticians than any other domain.

But then you need statistics in engineering too. You need statistics in standardizing your production processes. The Japanese are famous for this, for cutting down errors in production and improving quality: quality control. But these days we are also looking at interesting domains like trying to figure things out from data, let us say in a nuclear power plant. Can we, from all the sensors in the reactor, figure out when a catastrophe is about to happen? How do you know, in other words, that your process is starting to go offline or develop a fault? That has got to be done in a statistical sense, because your data is not going to be nicely trending in some fashion. It is like your stock market index, right? It is fluctuating back and forth. So how do you go from knowing that it is just noise fluctuating back and forth to knowing that your reactor is about to experience a meltdown and that you had better get out of there? How do you do this? It is statistics. So what kind of research methodology would you employ in that kind of a domain?

So we have got different scenarios. And in a nutshell, you will recognize that these days we have got lots and lots of data. Just look at the net: lots of data, and you want patterns in that data determined fast, and that guides what you do next. In other words, a lot of our research methodology at the current time is driven by the fact that we have got huge amounts of data and we have got to do something with it. And that again boils down to statistics at some point. So let us go through what statistics means, if we are going to spend so much time today talking about it. Obviously, it involves collecting data, describing it, analyzing it. That is the key part.
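As an aside on that reactor-monitoring question: the standard statistical device for separating routine noise from a genuine fault is a control chart. You estimate the normal fluctuation band from healthy data and raise an alarm only when readings escape it. Here is a minimal sketch of the idea in Python; all the numbers (the sensor mean, the noise level, the drift rate) are invented purely for illustration, and a real plant would estimate its limits from historical operating data.

    # Shewhart-style 3-sigma control chart: noise vs. genuine drift
    import numpy as np

    rng = np.random.default_rng(0)

    # 200 readings of a sensor under normal operation: mean 100, noise sd 2
    normal = rng.normal(loc=100.0, scale=2.0, size=200)

    # Control limits estimated from the normal-operation data
    mu, sigma = normal.mean(), normal.std(ddof=1)
    upper, lower = mu + 3 * sigma, mu - 3 * sigma

    # New readings where a slow fault adds a drift of 0.1 units per sample
    drifting = 100.0 + 0.1 * np.arange(100) + rng.normal(0.0, 2.0, size=100)

    # Flag the first reading that escapes the 3-sigma band
    alarms = np.where((drifting > upper) | (drifting < lower))[0]
    print(f"limits: [{lower:.1f}, {upper:.1f}]")
    print("first alarm at sample:", alarms[0] if alarms.size else "none")

The 3-sigma band is just the conventional choice: wide enough that pure noise rarely triggers a false alarm, narrow enough that a sustained drift eventually must.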
You don't just collect data and keep it; you then try to derive logical conclusions. And we are all familiar with some very standard domains for statistics: the census or a market survey requires all of this. You will realize that with the census, we are quite often guilty of collecting data and doing nothing with it; we miss out on the analysis and conclusions. But a similar need occurs if you are trying to identify genetic aspects of a disease, for example defects leading to cancer. There I am going to have to figure out what is going on in terms of the sequence of a gene, not just in you, but in several people, and then ask: is there a subgroup of people who are prone to a particular type of disease? How do I prove that? And how do I prove that it is not just a sampling issue, that you happen to have a crazy gene and it has nothing to do with cancer?

But then go back to something very simple. All of us have at some point or the other done a chemistry lab experiment. We have done some titration; we have done some validation of a model, for instance that the absorbance of a species is proportional to its concentration. You typically go to a colorimeter or a spectrophotometer and you ask: what happens if I load different amounts of some reagent into a tube? How much colour develops, right? So all of us are used to the business of getting a few data points and fitting a straight line. We will discuss that problem in some detail, talk about how it relates to research methodology, and then look at how many different ways we tend to mess up that simple task of fitting a straight line.

And as I pointed out, there is also this aspect of looking at reactor operation, in the extreme case where you do not understand the physics of what is going on. You don't have a model to tell you what is going on, but you have still built a reactor. One good example here is the operation of a refinery. Look at the petrochemical refinery that Reliance has in Jamnagar. It is a huge facility; there are so many plants operating at the same time. What does Reliance want? Reliance wants to make profits, right? So you have got to optimize the operation, not just of one reactor or one unit operation; you want to optimize the performance of the entire plant. And for this you have got access to various measurements in the different units across the whole campus. So how do you collect all these measurements, and how do you now figure out how to optimally run this plant? Your problem is that you do not have good physical descriptions of the operation of each part, and therefore you have no ability to predict how the whole system will function. So now, how do I do a few isolated experiments? Reliance doesn't have the patience for exhaustive, systematic experimentation. So how do I do a handful of experiments and, from there, figure out the best optimal operating condition for the plant? That is design of experiments, a domain which is extremely important, particularly to engineers.

In a sense, then, the outline for what we are going to cover is: what is the role of statistics in devising a research method? We are going to touch upon various aspects of this as we go along. Then we start by asking: how do we test a hypothesis? What is a hypothesis? How do we set it up? How do we represent and visualize data?
How did we represent and visualize data 50 years back, and what can we do about it now that we have got computers and multimedia and fancy ways to convey information? How do we figure out relationships between variables? Because if you are trying to investigate some research problem, at the heart of it you have got to figure out which variables are relevant to your problem: what is it you are going to control, what is it you are going to measure? Which variables are relevant, given that there could be a hundred-odd entities of interest? And then we will try to understand how, at the current time, published research uses or does not use statistics. We have got some surprising insights there, and that we will do towards the end of our session today.

Of course, when we are looking at this narrowly in the context of engineering, we are not always interested in hypothesis testing. Engineers rarely get to frame the hypothesis themselves. We are not really trying to discover that new drug; somebody has given us a drug, and typically a chemical engineer, for instance, would be asked: can you figure out how to scale up production of the drug? So it is not like you are discovering it, or trying to prove that it is a drug; that is already done for you, right? That burden of proof is not so much on me. I just need to know how to standardize a process, how to optimize a process, how to scale up a process. However, we do increasingly come under pressure to come up with new materials. Can I come up with a nano-catalyst of some sort to carry out a reaction faster, given my insights into catalysis and so on? There we have to employ hypothesis testing.

Then, invariably, we are asked to build models: what is called regression. You are all familiar with fitting that straight line to our calibration curve data in chemistry. But also classification. For example, when I talked of that nuclear plant, there is an operator who is staring at, let us say, literally 100 measurements. And this is a practical problem, a problem which BARC, for instance, has brought to us. The operators sit and watch 100 measurements, and there is no way your brain makes sense of 100 measurements. So can one build a statistical model which takes as inputs those 100 sets of measurements and comes up with a final classification: is the reactor normal, or is there an abnormal condition? That is what people ultimately want; that is what the operator wants to know. Of course, if it is trending towards becoming abnormal, you are going to go do something about it, right? So that is classification: can we come up with some kind of a model which tells us whether a behavior is normal or abnormal?

But that is exactly what your diagnostic kit does for you. You want to know whether you have got a disease or you don't have a disease. So now, how do I come up with a statistical design for that diagnostic kit? In other words, how do I control the sensitivity of the kit so that it maximizes my ability to correctly diagnose you? If the kit is too sensitive, I am unnecessarily going to call you, let us say, HIV positive, in which case you can imagine the consequences, right? You are going to have to start a round of unnecessary medication. On the other hand, if I didn't tweak it right in the other direction, I fail to detect those who are truly HIV positive. You can see the disaster associated with that.
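Those two failure modes, false positives and false negatives, pull against each other through a single decision threshold, and the tension can be made concrete in a few lines of code. This is a toy sketch with invented assay numbers, not any real kit's chemistry:

    # Toy illustration of the kit-tuning tradeoff: moving the decision
    # threshold trades false positives against false negatives.
    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical assay readings: healthy people centred at 1.0,
    # truly positive people centred at 2.0, with overlapping noise.
    healthy = rng.normal(1.0, 0.5, size=1000)
    positive = rng.normal(2.0, 0.5, size=1000)

    for threshold in (1.0, 1.5, 2.0):
        false_pos = np.mean(healthy >= threshold)   # healthy flagged positive
        false_neg = np.mean(positive < threshold)   # true positives missed
        print(f"threshold {threshold:.1f}: "
              f"false-positive rate {false_pos:.2f}, "
              f"false-negative rate {false_neg:.2f}")

A low threshold catches nearly every true positive but alarms on many healthy people; a high threshold does the reverse. Tuning the kit means choosing a point on that curve.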
So when I come up with a diagnostic kit, I want to tell you at what point I should tweak the chemistry of my kit so that it has optimal prediction performance, optimal classification performance. So: regression and classification. In regression, you are fitting a model to your data. In classification, you are also fitting a model to data, but you are additionally asking the question: how do I best separate two or more groups? And then finally there is this notion of design of experiments, which engineers in particular are interested in, which is: what is the minimum set of experiments that I need to do? Here we assume that experimentation is expensive or difficult, so you can't be doing infinite experiments. If I can do only a limited number of experiments, how do I quickly figure out what is going on? And I don't have a physical model to tell me what combinations of variable values to try; my physics is incomplete. Without knowing that, what can I do to gain some insight? And of course, once I gain some insight, that gives me a hint as to what is going on with the physics, okay? So these are three major domains where engineers need some kind of statistical thinking: hypothesis testing, regression and classification, and design of experiments. We are going to start off by asking why we employ statistical methods in the first place, and then we will get into understanding, in particular, the role of random variables.

So start with this notion that you need to test a hypothesis. There is something going on, and we wish to evaluate a hypothesis. At least in terms of modern research design, for the last 50 years this has been the paradigm. We set up a question — that is a hypothesis — and we say: let us design an experiment to prove or disprove that hypothesis. And you can't immediately solve big grand questions in one shot. So you come up with a narrow, tiny question, and you do a well-defined experiment to answer that narrow question. Based on the outcome of that experiment, a new experiment is defined for us, and we keep going bit by bit, and we try to get one large theory solved bit by bit.

Now, there are a few rules, a few insights, that we owe to the abuse of statistics over the years. Really, statistics, the way we are going to present it as a course, or as a tool for a researcher, is about the use of an algorithmic approach to addressing problems. It starts with somebody asking a question. Some chap says: this new molecule I have discovered in my lab is a fancy drug; it is going to be the next blockbuster drug. To test that, I am going to have to do some kind of experimentation, that clinical trial that I talked about, and then I am going to have to collect all the numbers and somehow come up with a conclusion: yes, it is a good drug, or no, there is nothing special about it. But note: if there is no question to be asked, or the question that is being asked is a silly question, you can still do the statistics. And our literature, and for that matter policy making, is full of such examples.

I can give you one example to convince you of that. There have been studies, for instance, on children from rural areas or slum areas versus children from well-to-do neighborhoods. And people have attempted to figure out whether the IQ, the intelligence quotient, of children from rural areas is different from that of children in urban areas.
From that video you saw a couple of days ago, you will recognize that there is no real relationship between IQ and intelligence. Do you remember the point that Richard Feynman had a reasonably average IQ? Yet he is thought of as one of the best researchers ever, right? So now, how does one go about trying to figure out if the IQ here is different from the IQ there? We collect a bunch of rural kids, give them a test, get the numbers, work out an average; similarly we do the same for the other group. Then we ask the question: is this average less than that average? Now suppose I want to prove that 118 is different from 120 on average. Actually, both 118 and 120 are very good IQ scores; Richard Feynman had what, do you recall? 120-something. In this analysis, it actually turned out that 118 was being compared against 120. And I can prove that 118 is different from 120, statistically. But that 118 was a function of the group of people analyzed. If there had been a different bunch of kids, then maybe it would have been 119. It is a function of the samples at my disposal, right? However, if I sample enough — and this again your intuition hopefully will tell you — I should be in a position to say that 118 is different from 120. And the more I sample, the more confident I should get about my interpretation. And this is what was done. They kept sampling to the point where 118 turned out to be provably different from 120. Therefore rural kids are worse off than urban kids.

That is an example where the statistics has been done right. So what is the problem there? Can any of you think of a problem with this? The problem, basically, is that the IQ test is fundamentally incapable of distinguishing between 118 and 120. In other words, the questions being asked have scores assigned to them, and there is no real difference between 118 and 120 if you think of the associated round-off. So there is no business trying to distinguish 118 from 120 in the first place. But assuming you decide to go down that road of trying to distinguish 118 from 120, there is a statistical procedure for you. So statistics as a research methodology has been employed all right, but it is actually being abused by the researchers. And this, of course, was not done just for fun. You can see that this has a policy outcome. This decides where you are going to pump money into — I am not saying you shouldn't pump money into slums, but you get the point. And you get the point that if you want to carry out trials, and for that matter if you want to claim that toothpaste X is better than toothpaste Y, there is a statistical test that one can use to prove that X whitens teeth better than Y.

So one of the things you have got to appreciate here is that, as far as data analysis in your research method is concerned, we are going to assume that the right questions have already been asked by the scientist and that the right method has been employed to collect data. The statistician then gets into the picture, does some number crunching, and decides: is this equivalent to that? The statistician cannot counter the effect of a badly designed experiment; you cannot improve on that. That is the moral here. And of course, this point is more or less obvious: if you are ignorant of how to go about using statistics, then, given the fact that we have limited ability to do experiments, you really have no business drawing grand insights from your experiments.
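To see the abuse concretely: a two-sample t-test will happily declare a 2-point IQ gap "significant" once the sample is large enough, even though the instrument cannot resolve such a gap. A small simulation along those lines, using synthetic scores (the actual study's data are not shown here, and the group sizes are invented):

    # With enough sampling, a scientifically meaningless 2-point gap
    # becomes "statistically significant".
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    for n in (20, 200, 20000):                    # children per group
        rural = rng.normal(118, 15, size=n)       # IQ scales have sd ~ 15
        urban = rng.normal(120, 15, size=n)
        t, p = stats.ttest_ind(rural, urban)
        print(f"n = {n:>6}: p-value = {p:.4f}")

    # p shrinks as n grows: the test eventually "proves" 118 != 120,
    # even though the test instrument cannot resolve a 2-point difference.

The statistics is done correctly in every run; what is broken is the question, which is exactly the point being made above.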
It is not clear to whom the statement can originally be attributed, but the point is: there are lies, damned lies, and statistics. This refers to the use of data analysis, invariably by politicians, to try to prove or disprove a policy point. So there are shades of lies, and we of course are going to now study a variety of these techniques. In the data analysis and interpretation course, one of the things we do is move away from simply saying, here is an algorithm. It is not like learning programming. We try to spend some time teaching people how to actually involve this kind of analysis in designing better experiments. And so we expose them to a variety of techniques, without going into the techniques in detail, because you will always find books, or for that matter software, which will do all the number crunching for you. That is not what you really need to gain. What you need to gain is the ability to figure out when to use a tool or technique and why you are using it, and you can't lose sight of that.

So remember, and this is a key point: the statistician is not necessarily bothered with the hypothesis itself. That is coming from, let us say, somebody with a physics background. Somebody is trying to claim that the acceleration due to gravity is 9.8 metres per second squared. That is a claim. From that point on, the statistician jumps in and says: let us collect data, let us find the average measured acceleration due to gravity, and find out if it is close to this claimed value of 9.8. So the typical procedure a scientist will follow is to consider a physical model, for example a model which says that the acceleration due to gravity should be 9.8. Now, the moment you say there is a model, you can visualize all kinds of fancy algebraic equations with parameters and variables in them. And you have got to be cautious here: I have just said parameters and variables in one sentence. So what is a parameter and what is a variable? This is, in fact, a major source of confusion for most people.

If I talk of acceleration due to gravity, F equals mg, right? So if I am out to figure out what g is, what should g be? Is it a parameter or a variable? It should be a parameter in your model. It is a population value, and we have now got to start talking about populations and samples. Whoever measures g, they are all trying to measure the same value of g. They might see differences as a consequence of different experimental apparatus or different errors that they have committed, but they are inherently trying to find that same universal constant, right? Similarly, if I want to find the average height of a human being, there is only one average height of a human being. But to figure that out, obviously I am not going to measure everybody's height in the world. I will have to find myself a sample. So based on the attendance here today, I will find some average human height, and based on the attendance tomorrow, it will be a different average height, because my samples are different. But we are actually attempting to estimate the same parameter value. So there is the notion of a parameter, which is what sits in the model, and there is the notion of a random variable, and our estimate of the parameter is a random variable, because my estimated average here could vary from day to day, okay?
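That parameter-versus-estimate distinction is easy to demonstrate. In the sketch below, the "true" mean height is a fixed parameter of a hypothetical population (both numbers are invented); each day's classroom gives a different estimate of it:

    # The parameter never changes; the estimate does.
    import numpy as np

    rng = np.random.default_rng(3)
    true_mean, true_sd = 165.0, 8.0   # hypothetical population parameters (cm)

    for day in range(5):
        sample = rng.normal(true_mean, true_sd, size=40)  # today's classroom
        print(f"day {day}: sample mean = {sample.mean():.2f} cm")

    # Five different estimates of the same fixed parameter: the estimator
    # is a random variable even though the parameter is a constant.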
So this is an important point: you can't do an experiment once and then pretend that you have found a fundamental constant. What you have found is an estimate of that fundamental constant. And why are we guilty of this? Because if we have some kind of a bias that the acceleration due to gravity should be 9.8, and we do the experiment once and get a value of 9.8, we shut down the experimental apparatus and go home, right? That is what we tend to do. And that is wrong, in the sense that if you were to repeat the experiment, would you see 9.8 again? What you have seen is an estimate; what you are working with is a random variable; what you are trying to learn about is a parameter. So the idea is to figure out what the theoretical values of parameters are — what is that average human height? — and to do that, we are going to have to perform experiments, collect measurements, and come up with an estimate. And then ask: is that estimate close enough to the proposed theoretical value of the parameter?

Of course, we have got new ways of doing some of this. If the physics is unknown, for example if you are looking at estimating stock market trends, the physics of that is not well known. We do not know how to start with fundamental differential equations and model these things precisely; there is no perfect theory for that. And yet we want to figure out trends, and yet we want to estimate some parameters. So statistics still comes into the picture: you try to generate some kind of a crude model, a probabilistic model, and then try to estimate trends. Invariably these are not perfect models, but they actually give you enough insight that you can go back and rig up some basic physics to try and explain what you think might be happening. So for the engineers, there is quality control, and I have talked about manufacturing requiring quality control. There is the detection of an abnormal event, for example in that nuclear power plant. And there is optimal process design, which, for example, Reliance wants for its entire refinery.

And all of this goes back to beer, 100 years back. It is staggering to think that some of modern statistics came about because a brewery in the United Kingdom essentially wanted to figure out two things. One, how to improve the quality of the taste. And second, how to fill barrels with beer consistently. In those days they used to fill up barrels with beer and ship them to different corners of the world, because, especially with the British going out all over the world, everyone wanted their beer, and so beer had to be shipped from England. One of the key things they wanted was to make sure that the barrels were filled to pretty much the same extent. You can't have one barrel half empty and another filled right up to the brim; they wanted to fill them more or less to the same level. So the question came about: how much variation might you see in the heights of beer in barrels? How do you control that, how do you minimize it? So that was one question. And then there was the need to come up with better taste. How do you come up with better taste? It turns out you use a few herbs, for example a herb called hops, in the making of beer. And if you are going to add this herb to beer, you have got to ask: how does the herb grown in this corner of the country taste relative to the herb grown in that corner of the country?
So that is a hypothesis test: is this better than that? These were the first questions, all in an attempt to make better-quality beer and also to have a precise volume of beer per barrel. And here is another key insight into research methodology. The moment management says, let us improve quality and let us figure out packing volumes, you look around and ask, who can do this? It turns out that a brewery typically has chemists, and no mathematicians. In fact, about 100 years back, statistics was not a well-developed field. For that matter, it was a field looked down upon by mathematicians: it was not considered nice to play around with numbers; you were supposed to think of equations and algebra and that kind of stuff. So arithmetic and statistics were not that big 100 years back. And yet here was a very applied problem which required analysis, and they realized there was a need for a solution.

Now, here is something remarkable to me. A chemist decides that the only way this problem is going to be solved is if he crosses over and essentially trains himself in statistics. In those days there were no programs in statistics, no workshops in statistics. So what do you do? You go become an apprentice. You join somebody's lab, you hang around, and you learn. So he joined the lab of somebody called Ronald Fisher, a very prominent figure in modern-day statistics. And he learned, and he learned enough that he contributed new theory in statistics. And this is a guy who hated math, which is why he had gone into chemistry. The point of it is: when there is a problem driving you, you will pick up new skills. There is no fear here. You realize it is relevant to you, you go out and you learn it, and you do it. It may take you a while, but there is a very clear focus on why it needs to be done. If you can't do it yourself, go find somebody who will help you do it, but get it done.

The irony is that after all of that great insight in terms of practical statistics — and it is really practical, because it applies straight to the industry, to the brewery — the brewery, after all of that beautiful theory, ends up telling this guy: please don't publish your work. And why is that? Because if you publish your work, then the neighboring brewery will figure out how to improve its quality. So for a long period of our statistical history, here is a chap who came up with some of the best theory and went without getting credit for it. He published under the pseudonym 'Student', so all his publications are as Student. This is the man called William Gosset.

This business of coming up with better-quality herbs also led people to the idea of randomization. Nowadays we take for granted how to randomize an experiment. But here is a simple insight. You want good-quality herbs, and once they had figured out a particular herb of appropriate quality, the next question came up: how do I improve production? One of the ideas at that point — and again, 100 years back, agricultural technology was booming — was: let us add fertilizer. So you add fertilizer. What do you expect? You expect a higher yield, whether it is a herb or a grain. So now, how do you prove that adding fertilizer gives you a higher yield? What would you do? I ask you to prove that adding fertilizer gives you a higher yield. How do you go about it? How do you design an experiment? I just want some ideas popping out of you.
So you are given one huge plot of land, and you have the ability to break it up into subplots; that is your control. So what are the kinds of things you must now take into consideration if you want to prove that adding fertilizer helps? And of course, adding fertilizer is not cheap for a company, so there must be a reason to do it. If you want to prove statistically, rigorously, that you are going to improve your yield, what has to go into that thought process?

There are several concepts here. One: let us say you don't have the ability to quantitate the soil type. There is no previous data; this is the first time it has popped into somebody's head that they should add fertilizer to soil. So, again, let us assume the soil is uniformly the same throughout. Now, the suggestion is that we compare plots of land with fertilizer against plots of land without fertilizer. Why is that important? Because we want to prove that it is the adding of fertilizer that makes the difference. In other words, we are hoping that the one difference between these two sets of plots is the presence versus the absence of fertilizer. Hopefully no other factor will influence your analysis, and therefore, if you get a higher yield here, that must have been because of that one additional component.

What about adding different amounts? That is a slightly more complicated thing, because now not only am I trying to prove that adding fertilizer makes a difference, I am also trying to work out a model of how increasing the amount of fertilizer causes an increase in yield. That is more complicated: find out which parameters actually matter? We don't know that. In other words, it is a black box. This is what I was saying: there is no physical model. The agriculturist does not know what controls this, and back then people did not even understand the biochemistry of how carbon and nitrogen are taken up. And no, let us say there is just one fertilizer. It gets complicated the moment you have different types of fertilizers, because they might each do different things. So we want to keep it simple: first prove that there is a fertilizer which, in general, increases yield. Once you do that, then of course you do what you are suggesting, which is figure out which is the most appropriate fertilizer, right? But you don't want to jump in and directly tackle the harder question first. It is the same with crops: you don't want to work out things specific to different types of crops first. In general, figure out, regardless of crop, whether adding fertilizer gives you a better yield, okay? So when you talk about different crops or different fertilizers, you are asking: can we randomize and generalize this effect?

But there is also one subtle thing. Where on my big plot of land should I add my fertilizer, and where should I not? Does it make a difference? This goes back to what we were saying about soil quality: could there be a problem? There is no guarantee of uniformity. You just walk around and you will find rocky soil, you will find fertile soil; there is no control over it. Sorry? The quantity of fertilizer may differ? No: if there are subplots, you will take care to add the right quantity to each subplot, so that we are not cheating by adding more in some places.
Let us say that we are fair and we give equal numbers of subplots to both arms; you have control over this. So you have finally divided your plot into subplots. Now there is a key point: you have to randomly distribute the fertilizer to the different subplots. Why is that important? Because of the soil — but you have got to be a bit more precise in how you formulate this. The soil quality may differ from one subplot to another, and you have no control over this. Maybe there are rocks over there, not much fertile soil available for you. So how do you know that the yield is a function of the addition of fertilizer, as opposed to the inherent quality of that particular subplot? And it may happen that both the quality of the soil and the quantity of fertilizer differ — now we are getting complicated; that is for a later discussion. Right now, my concern in designing this experiment, where I wish to prove that adding fertilizer makes a difference, is that I have to make sure that no other factor influences my analysis. And what kind of factor? The inherent quality of the soil in different subplots. I have no way of quantifying this. I have no measurement; I don't have an assay to go take soil samples and evaluate fertility. The only assay here is the gross yield itself, and that is the experiment itself.

So the whole point is: if I am going to prove that adding fertilizer makes a difference, it has to be over and above the influence of any other variable. Which means I have got to control for those other variables, particularly because I am not in a position to measure them. I don't know what the fertility might be; I can't control for it; therefore I can't pick and choose the plots where I am going to add fertilizer. So I randomize, and in randomizing I am hoping that, among the subplots where I intend to add fertilizer, I am going to get some subplots which are inherently fertile but also some subplots which are not. That way, the effect of the inherent fertility of the soil averages out.

In other words, we are now starting to talk about there being latent or hidden variables. The fertility of the soil here is a hidden variable, one which we never thought of and may never think of. Therefore, if I want to go ahead and say that what I wish to investigate is relevant or significant, I have got to make sure that no other variable is potentially also causing an effect. And we are not talking of small plots here, because this was a brewery interested in large-scale production of both grain and herbs. So that is the inherent question we are trying to answer: is there a need for the addition of fertilizer if I wish to increase my yield? That is what prompted this whole line of thought in the first place. To answer it, what we figured out is that we need to control for any other variable which might have some direct or indirect role. And our problem is that we can't sit here, work out a model which accounts for each and every hypothetical variable, and then figure out what set point to give those variables. We can't do that. So the next best thing, if we wish to do narrow, well-defined experiments, is to try and control for whatever influence those other variables might have.
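A toy simulation makes the randomization argument vivid. Below, a hidden fertility gradient runs across the field, and we compare assigning fertilizer to one contiguous half of the field against assigning it to subplots chosen at random. All effect sizes are invented for illustration:

    # Hidden fertility gradient vs. randomized treatment assignment
    import numpy as np

    rng = np.random.default_rng(4)

    n_plots = 100
    fertility = np.linspace(0, 10, n_plots)   # hidden variable: a gradient
    fert_effect = 2.0                          # true gain from fertilizer

    def trial(treated_idx):
        yields = fertility + rng.normal(0, 1, n_plots)
        yields[treated_idx] += fert_effect
        control_idx = np.setdiff1d(np.arange(n_plots), treated_idx)
        return yields[treated_idx].mean() - yields[control_idx].mean()

    # Bad design: fertilize the (inherently more fertile) second half
    biased = trial(np.arange(50, 100))
    # Good design: fertilize 50 subplots chosen at random
    randomized = trial(rng.choice(n_plots, size=50, replace=False))

    print(f"apparent effect, contiguous assignment: {biased:.2f}")
    print(f"apparent effect, randomized assignment: {randomized:.2f}")

The true effect is 2.0. The contiguous design confounds it with the fertility gradient and reports something far larger; the randomized design roughly recovers it, even though fertility was never measured. That is exactly the nullifying-of-hidden-variables argument made above.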
This is actually something we do — sorry. Yes, there could be other ways, but my point is that you have got to randomize your behavior when it comes to the performance of an experiment. That is a critical thing. Let us come back to talking of heights. Say an alien lands here from Mars, wants to work out the average height of a human being, and walks in here. Is there a problem in working out the average height of a human being from this room? What are the methods to check the normality of the data? We will come to that a little later; we will have to figure out whether we have actually designed the study appropriately or not. Heights differ from region to region, right? Maybe — but no, I was looking at something even more basic. You are all adults here. If that alien had instead ended up in a kindergarten, the average height being reported back to Mars would be totally different, right? So what is the key thing? You have got to sample appropriately, because you do not want the location of your sampling to become an inherent factor, a hidden factor. It should not influence your analysis.

So when, for example, an exit poll is held after an election, you can't just go to one polling booth where you know one party is predominantly preferred, ask the people exiting that booth who they voted for, and then extrapolate and say that that party is going to win. You can't do that; you have to randomize. It is the same thing with any telephone survey, right? Somebody calls you up and asks: do you like this bar of soap, or do you prefer another soap? When you are carrying out these surveys, you can't simply call one neighborhood and ask the people in that one neighborhood what their preference is. You have got to randomize if you are looking for an overview.

So, inherently, we are now escalating this into a collection of concepts, but I will bring it back to this basic point: we need to figure out how to randomize experiments, and it is something we have actually not been doing. There are more variables than we care for, and many of these we know nothing about as we start our modeling process, as we start our research into a particular topic. And if you have no control over half the variables in your model, you have still got to somehow ensure that they don't impact the rest of your analysis. The way to go about that is to randomize. That is a major insight, and in fact it is a key part of modern statistics. Here, even though you don't know whether a factor affects your reading or not, you are still assuming that it does affect your reading? We have to, yes. In other words, whatever factors you can think of, you must try to correct for. And as somebody said, the time of the analysis could be important: if you have not thought of time, your study could still be biased. That is what you are trying to imply; the point is made. So there is no perfect experiment that way; we can always come up with a new variable you have not thought of. But for now, just accept that we need to randomize as best we can, to try and gain some control over what goes on.
Which brings up this question of what a random experiment really is, because it is actually a poorly understood idea. I will come up with a formal definition and then we will look at what it implies in terms of actual design. When we say that an observed data point is the outcome of a random experiment, the easiest way for me to convey this to you is to think of a coin toss. In fact, the coin toss is used to start off games precisely because we think of it as a random event. The cricket match starts with a coin toss, right? So what are the factors we associate with a coin toss which make it important for us? Why is it that no team captain challenges the use of a coin to start off a cricket match?

So, it is assumed that it is a fair coin. Then what else? Yes, that is the definition of a fair coin: two possible outcomes, each equiprobable. That is fine. And yes, if there were six teams playing, we would have rolled a die; the point you are making is that the outcome is not biased. That is fine. But those are the basics; even if it were a very complicated coin, you would still use the toss. So what else about the coin toss? Is it a reproducible event? Are the outcomes reproducible? No. We expect it not to be reproducible; we expect not to be in a position to predict the next outcome precisely. So what is important about the coin toss? It has been pointed out that we know there are two outcomes. In other words, we know the set of outcomes — no surprises there; we shouldn't suddenly get a new result halfway through the experiment. We know the set of outcomes; then what else is important? We do not know the outcome of the next toss — and that is exactly why it is appealing for us to toss the coin. What else? For a large number of tosses, the frequency of each outcome approaches the probability of that outcome. Yes, that works out: it will come back to 0.5, assuming the coin is fair. So the outcome is not known in advance. More importantly, each toss is conceptually the same as the previous toss. It is not like your backyard cricket game, where the guy with the bat suddenly changes the rules halfway through because he thinks he is not getting enough runs. We are not doing that. So it should be repeatable: the experiment — not the event, the experiment — should be the same under identical conditions.

So these are three key points, and these are three key things you want any time you go into a lab and try to get yourself an experimental measurement. Think of the number of cases where this is flouted. Are you in a position to repeat your experiment? If not, then why are you in the business of claiming that you have found some profound insight having done the experiment once? Is the experiment reproducible? Do you know the set of outcomes you are investigating? It is not a question of challenging anyone; you simply don't have sufficient statistical grounds to claim a trend. You should not be trying to claim big things with limited data. That is inherently the point. So the random experiment is important, and the coin toss is a very simple, elegant way of thinking about it. No, that is not the point.
The point is, when you think of the fertilizer experiment, for example, don't think of discrete outcomes as in heads or tails; there is a range of values that your yield could take. But that experiment is inherently reproducible, in that you can go from subplot to subplot. The conditions — well, it is assumed that if you have accounted for every condition you can, and you are randomizing over those variables you cannot control, then it is a reproducible experiment.

So the moment you talk of random experiments, it boils down to this idea of a random variable. There was a random variable associated with your coin toss. What was the random variable? We think of heads and tails, but heads and tails do not make a variable. Yes — the number of heads is a variable. Right. One of the key things you have got to appreciate is that in probability theory, and therefore in statistics, we want numbers; we want variables which have numerical values. We can't work with labels like heads and tails; you need to map them onto numbers. We can say things like: if I get heads, I will say x is 0, and if I get tails, I will say x is 1. And then if I ask you what you expect from a coin toss, you will say that on average x should be a half, because it is halfway between 0 and 1, and 0 and 1 occur with equal probability. There is no average of heads and tails; I can't work with labels. In other words, I need this mapping to numerical variables. Now, that is obviously the kind of thing we do intuitively with the coin toss, but it is also the sort of thing you need to do with whatever phenomenon you are investigating. What is the random variable that you are looking at, and what is its range of outcomes? If the outcomes turn out to be labels, like grades A, B, C, D, then map them onto numbers and then work out, for example, an average grade. Don't simply work with the original labels, because you can't talk of averages and expectations unless you have mapped your values onto a real number scale. So a sequence of coin tosses by itself is not a random variable; it must have a numerical mapping. If you are looking at a pair of coin tosses, you have got four possible outcomes, and you can now ask: what is the number of heads that turns up in one experiment? That x becomes a variable for you. Heads and tails by themselves never become a variable for you.

One other point we need to address: the moment we say we have got ourselves a random experiment with some random variables in there, it doesn't mean we know nothing about the outcome. The randomness lies in the fact that you are unable to predict the outcome of the next trial, but you do know what range of values to expect. If you roll a die, you know that it is going to come up 1, 2, 3, 4, 5 or 6, one of them; you just do not know precisely which one will show. That is the key thing. If you are doing that titration of a reagent, you do not know precisely how much of the acid you are going to have to add to the flask, but you roughly know the range. So your range of interest is defined for you, right? The precise value it is going to turn out to be is not something you know. Now, this is important, because we do not tend to treat our own measurements as outcomes of random variables.
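Here is a small sketch of the mapping just described: two coin tosses, four equally likely outcomes, and X = number of heads as the numeric random variable. Nothing here is assumed beyond the fair coin:

    # Two tosses of a fair coin: map label outcomes to a numeric variable
    from itertools import product

    outcomes = list(product("HT", repeat=2))      # HH, HT, TH, TT
    x_values = [pair.count("H") for pair in outcomes]

    # Distribution of X, and its expected value
    for k in (0, 1, 2):
        print(f"P(X = {k}) = {x_values.count(k)}/4")
    print("E[X] =", sum(x_values) / len(x_values))   # 1.0 head on average

Note that the expectation, 1.0 head, is a perfectly meaningful number even though no single toss pair can ever produce "1.0 head": that is what working with the numeric mapping buys you.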
Go back to that undergraduate chemistry lab where you are doing a titration experiment, and theory tells you that you need 5 mL of an alkali to neutralize the acid. Whatever the normality of the acid, if in theory it takes 5 mL, you will find everybody stretching the numbers to try and get 5 mL, and it is not unusual to have a whole bunch of students submit identical reports, all getting exactly 5 mL, right? The fundamental point is that you should not expect everybody to get the same numbers, and there is nothing wrong with getting different numbers, because you are working with a random variable. If I do the same experiment 100 times, I don't expect to get 5 mL each time, and nothing is wrong with that.

In fact, it is easier to see this with the coin toss again. Suppose one experiment is now 100 tosses of a supposedly fair coin, not one toss. How many heads do you expect? 50. You do the experiment: will you see exactly 50? And if you don't, does that mean it is not a fair coin? It is a fair coin; you just happened to see one outcome, out of a range of outcomes, with your fair coin, and there is nothing wrong with that. But that is not the way we behave in science. With the coin toss, even the moment we get 20 heads out of 100, we will still say it could have been an extremely rare event that happened to us with a fair coin — we were just that unlucky. It could indeed have been an outcome with a fair coin. Whereas the moment we do this in science and see an extreme result, there is a tendency to fudge it back towards the expected average.

Now, it turns out your undergraduate chemistry student is not the only one guilty of this. Somebody as famous as Gregor Mendel — remember Gregor Mendel, heredity — did this kind of fudging. When he says that 1 in 4 pea plants will turn out to be short as you cross different types of pea plants, that 1 in 4 is 25 out of 100. He did not get exactly 25 out of 100; he probably got 23 out of 100 one year, and the next year he probably got 27 out of 100. To give him credit, he realized it was tending towards 25 out of 100, and therefore 1 in 4; he saw the trend. But in no way will the experiment give you precisely 25 out of 100 every time — people have gone back to his old notebooks and realized this. So he spotted the trend, but the point is, if you think about it in terms of statistics, it is actually highly unlikely that you will see precisely 25 out of 100. It is like the coin with 100 tosses: it is highly unlikely, if you think about it, that you will see precisely 50, because cumulatively there is a large probability that you will see 45, 46, 47, 48, 49, or on the other side 51, 52, 53. Cumulatively those have a high probability of happening; exactly 50 happening is not that probable. And yet we try to push ourselves towards particular outcomes.

So where are we now? We have ourselves a model with parameters in it, like that acceleration due to gravity. Parameters are not expected to change; the average human height is not expected to change. But because we are unable to do the perfect experiment, because we sample measurements — because I sample your heights, for instance — I have to acknowledge that whatever estimate I come up with is variable, because it will vary, for example, from this classroom to the next, as a function of who is sitting in there, correct?
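The binomial distribution backs up that claim about the 100 tosses directly. A short computation (using scipy; the fair coin is the only assumption):

    # Exactly 50 heads is the single most likely count, yet still improbable
    from scipy import stats

    n, p = 100, 0.5
    print(f"P(exactly 50 heads) = {stats.binom.pmf(50, n, p):.4f}")   # ~0.080
    print(f"P(45 to 55 heads)   = "
          f"{stats.binom.cdf(55, n, p) - stats.binom.cdf(44, n, p):.4f}")  # ~0.73

    # The same arithmetic says an exact 25-out-of-100 Mendelian ratio
    # should rarely appear in any one season, even when the theory is right.

So roughly 92 runs out of 100 will *not* give exactly 50 heads, and a perfectly fair coin routinely lands anywhere in the mid-40s to mid-50s. That is precisely why identical 5 mL titration reports, or ratios that are always exactly 1 in 4, are a sign of fudging rather than of good experimentation.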
So our measurements are a function of our sample sets; we are working with random variables. We can't pretend that one attempt at an experiment is it, because there is variation. And if there is variation, the question comes up: what is the average value we can come up with, and what is the range within which our measurements might lie? Going back to that coin toss experiment: what is the range of head counts for which you will still consider the coin a fair coin? We are tossing it 100 times; you all said to expect 50. But suppose you get 20 — will you still go ahead and call it a fair coin? You can see that at some point, at some extreme value, we are going to start leaning towards saying it is actually a result seen with a biased coin, not a fair coin. In other words, we need an interval, an interval within which we will say it is a fair coin. So inherently we are heading towards the definition of a hypothesis test. I am going to speed up a little and just point out that the moment you start talking of random variables, you have got to ask what the probability is of having seen the particular outcome that you saw. You can't dodge this; you are back to talking probability. Statistics depends on probability theory. The moment you say you are expecting so many outcomes: what is the probability of each outcome, and therefore how likely or unlikely is what you saw?

Ideally, you would have infinite sampling: to get the average human height, you would measure everybody's height in the world. But you can't do that, which takes us back to this notion of working with a sample. But appreciate this: we want to ask what average value we see, and here is a subtle thing which again escapes most people as they do research. When I work out my average — how will I work out the average human height? Just think about this. What should I use as my measure of the average human height? If I collect your individual heights, what should I do with them? Okay, so that is the arithmetic mean. Is that the only way for me to come up with an average human height for the sample here? You could have come up with other measures — the median, the mode, all those things we studied in school, right? But take this further. There are different ways, you are all claiming, that one can work out an average. Here is another way: I will take the shortest guy in the room and the tallest guy in the room, and average their heights. Is that an estimate of the average? It can be. Of course, you can immediately, intuitively start feeling uneasy about that. Why? Because there might be a large variation in it as a function of who walks into the room and changes the minimum and maximum. But it is still an estimate of an average. For that matter, I could just pull out one individual at random and use that person as representative of the average. So there are different ways of looking at averages, right?

So the average that we work out is itself a random variable, because it is a function of the samples in here, and the samples change on us. One can therefore ask: what is the average of the average? And the average of the average is what is called the expectation, the expected value. I want the expected human height. So the Martian who comes down from Mars wants to figure out the average human height. What is the expected height?
And to figure that out, the Martian starts doing experiments: take groups of people, work out the arithmetic mean of each group. If I work out enough arithmetic means, the average of those arithmetic means should itself tend towards the average human height. Do you see this? Right? So we are now talking about deriving something from the raw measurements — an average — and then using that in the analysis. In other words, don't work with an individual measurement; don't take one sample and pretend it is the average human height. Instead, start taking arithmetic means, and then average those arithmetic means; that does a better job of giving you the average human height. And there is a reason for that: it turns out that the arithmetic mean gives you a better estimate of the average height as a function of sample size. So when you said a little earlier that the more we measure, the more confident we get — that is what this is. As I increase my number of measurements, as I do more replicates, I get more confident about what is going on, as opposed to making one attempt once.

So, a statistic. By the way, the word 'statistic' is related to the word 'state'; I don't know if you noticed this connection. And the word 'state' comes into all of this data analysis because, 200 or 300 years back, kings, particularly the kings of England, wanted to figure out how many people lived in the kingdom. Why? Because if you know how many people live in your kingdom, you know how much tax you can collect from them. So it all had to do with the collection of money, and for that you need to start collecting data and keeping tabs on people. Ironically, then, statistics has its roots in money making. And later on, interestingly enough, statistics had to do with understanding where disease would break out. The king of England did not want to walk into certain neighborhoods if he knew there was going to be disease there, diseases like the plague. So you want to keep tabs on where diseases are occurring and at what rates they are occurring in different neighborhoods. There is a lot of plague happening here: don't go there; he goes somewhere else for his holidays.

Now, since the point has been made about gambling, instead of picking on gambling as such, let us ask: what is in it for the casino where you gamble? Or, for that matter, what is in it for the state government which runs a lottery? Why do they run a lottery? Is it just because they can print tickets and sell them? No. Let us be more precise in how we answer this. In terms of profit making, what is in it for the state government, or for that matter the casino? They collect the money and they have to pay some of it out, okay. Actually, the more rigorous way of formulating this is in terms of expectations. When you as an individual go in there, you are not going to repeat the experiment — and an experiment here is putting your money down and gambling — an infinite number of times. You have limited resources, right? So you will walk out at some point. And of course, as you get into the experiment, you have some expectation of what your profit is, right?
But if you think of how the game has been rigged up — and most people who go to gamble do not appreciate this — the casino, for example, is not interested in whether it has to give out a million rupees to one particular individual or not. It doesn't care about individuals. It cares in an expectation sense: on average, how much money is coming in, and relative to that, how much money will it have to pay out? It wants a net profit. How the money being paid out is spread — whether it is one individual getting a million or everyone getting a small amount — the casino couldn't care less. The odds of the entire game get rigged so that the casino makes a net profit. That is an expectation. So you see, the casino owners are in it for the expectation, because they are in it for the long term, whereas you as an individual have very limited scope to replicate the experiment. And that is the whole point. The casino is not focused on an individual trial; it is not focused on you as an individual. The casino has to make a profit, by definition, regardless of which individual is playing the game. On average, the returns to the players must be less than what the players are sinking into the game, by definition. So it is an argument made on expectations — and our problem is that we get caught up with actual individual trials, not with the average.
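A minimal expected-value calculation shows the kind of arithmetic the casino argument rests on. The game below is entirely invented: pay 10 rupees to play, win 300 rupees with probability 1 in 50, win nothing otherwise.

    # Expected value of a rigged game: negative for the player by design
    stake = 10.0
    payout, p_win = 300.0, 1 / 50

    # Player's expected net return per game
    player_ev = p_win * payout - stake
    print(f"player expectation per game : {player_ev:+.2f} rupees")
    print(f"casino expectation per game : {-player_ev:+.2f} rupees")

    # Over many games the casino's total converges to its expectation:
    # 10,000 plays earn it about 40,000 rupees, regardless of which
    # individual happens to win the occasional big payout.
    print(f"casino expectation, 10,000 plays: {-player_ev * 10_000:,.0f} rupees")

The individual player sees wild swings (mostly losses of 10, occasionally a win of 290); the casino, playing the long run, sees only the steady +4 per game. That asymmetry between one trial and the expectation is exactly the distinction the lecture closes on.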