If you look at notation, notation in statistics turns out to be important because you want to keep track of the parameters you really want to learn about, the means and variances. But invariably we are working with samples. So where you want to learn about a mu you instead find yourself an x bar, and where you want to learn about a sigma squared you compute an s squared, and so on.

This also matters to us when we do regression. Go back to that chemistry lab where you are doing the absorbance versus concentration experiment: as you increase the concentration of reactant in a tube, you want to figure out its absorbance, how much light it absorbs. There is, of course, a theory which interconnects absorbance and concentration, I do not know if you recall it: the Beer-Lambert law. The Beer-Lambert law basically says that absorbance is proportional to concentration if you have a dilute solution of particulates. In our language, y, which is absorbance, is proportional to x, which is concentration. So in the model y = alpha x + beta, beta should be 0, there is no offset; as I increase x, y increases proportionately, and alpha is some inherent constant determined by which species we have put in our test tube. That is what the Beer-Lambert law says.

Now, if I give you a reagent, let us say a protein whose color you are trying to estimate, then whether you measure it or somebody else measures it, you should all get the same value of alpha, because it is inherent to the molecule you are evaluating. It cannot change from individual to individual. It goes back to the fact that alpha is a parameter in a model: it is that inherent constant regardless of experiment, regardless of trial. But that is not what works out when you do the experiment. Will you all get the same value of alpha? You will not. So, is there a problem? There is no problem. It is like that coin toss: we are expecting 50 heads out of 100, but somebody sees 49, somebody else sees 51; does that mean it is not a fair coin? You can have variation around the true inherent constant. So that alpha, which should have been one precise value that everyone computed, is not going to be seen by everyone; you are going to get a range of values for alpha. More precisely, what we end up estimating is a x + b, where a is an estimate of alpha, b is an estimate of beta, and hopefully a is close enough to whatever anybody is claiming as the theoretical value. For example, if I am evaluating acceleration due to gravity and trying to estimate the value of g, you have to appreciate that you will never get exactly 9.81; you will get some estimate. That is why I put a hat on top of g: we derive from our experiment an estimate somewhere close to 9.8, and that is okay; it is fine that we are not precisely there. Think again of the coin toss: I would be extremely surprised, and very nervous, if you got 8 heads out of 100, because that is actually reflective of a bias in how you performed the experiment. I am not trying to use the word manipulate; I am simply saying bias.

So if you think about it, all these things that we observe are estimates of the true fundamental parameters that the model, the physics, suggests. We sample, therefore, to draw conclusions about an entire population.
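To make that concrete, here is a minimal sketch in Python (NumPy); the "true" alpha of 0.52 and the noise level are made up for illustration. Every rerun with a different seed gives a slightly different estimate a, which is exactly the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" Beer-Lambert constant for our species (invented for illustration)
alpha_true = 0.52

# Concentrations we set in the lab, and noisy absorbance readings
x = np.linspace(0.1, 1.0, 10)                  # concentration
y = alpha_true * x + rng.normal(0, 0.02, 10)   # absorbance with measurement noise

# Least-squares fit of y = a*x + b; a and b are ESTIMATES of alpha and beta
a_hat, b_hat = np.polyfit(x, y, 1)
print(f"a (alpha-hat) = {a_hat:.3f}   (true alpha = {alpha_true})")
print(f"b (beta-hat)  = {b_hat:.3f}   (theory says beta = 0)")

# Rerun with a different seed and a_hat changes: every lab gets a
# slightly different estimate of the same inherent constant.
```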
The population is described by probability distributions; samples give us measurements following a distribution, and we assume the samples are randomly chosen. All of this goes back to that coin toss. Why did we like the coin toss to start games? Because you know the set of outcomes, but you cannot predict a precise outcome.

So, should I be looking at individual measurements, or should I be looking at the arithmetic mean, the x bar? A fair amount of time in data analysis goes into convincing people: do not look at individual measurements, never look at individuals; replicate, and look at averages. And why is there an advantage in looking at averages? The advantage is because of the variance associated with an average. If I look at you and say that your height will be chosen to represent the average human height, and then I go and look at somebody else separately, I do not learn anything new; I am not able to converge, because there is so much variation associated with the measurement of any one person's height. I am not able to zoom in on the true average height. But the moment I start working with averages, the beauty is that, with that n in the denominator of Var(x bar) = sigma squared / n, as I increase my number of samples my variance drops. And what does a dropping variance mean? We are converging to a mean value. So if I look at individual measurements I do not learn new things; if I look at averages I start learning new things, and I get more and more comfortable about the true underlying value, which is at the heart of why we intuitively think of replicating results.

Let me show this to you in context; in a statistics discussion you cannot avoid plots like this. Inherently, what the plot is saying is: when you are looking at only one measurement, and this is supposed to be the true value you are trying to get, say that 9.8, then there is a very large range of outcomes you might see, associated with a wide curve. But the moment I start doing replicates, my distribution gets narrower and narrower, and if I am now looking at a window around my mean, that window is shrinking. Think of that window as some kind of error bar, the error bar I am going to use to represent the uncertainty in my measurement: it is shrinking. Obviously, as n goes to infinity, this actually becomes a delta function, a peak absolutely confined. This is one of the key theorems in statistics: with more and more samples, as I increase the number of measurements and replicates, I learn more.

Now, if I come back to this notion of a hypothesis test and make a connection to the humanities, to philosophy, it goes back to a man who has dominated the field over the last 50 to 80 years: Karl Popper. His theory of the mode of research is called critical rationalism. There is a lot you can learn about this; you can, for instance, start with Wikipedia and follow the associated links. He basically says it is very hard to prove something true, and this point was made to you on the first day: in any research methodology, we feel it is very hard to prove something true.
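As a quick check on that shrinking-variance claim, one can simulate it directly; a minimal sketch in Python (NumPy), where the true value 9.8 and the spread 0.5 are illustrative numbers, not data:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 9.8, 0.5   # true value and measurement spread (illustrative)

for n in [1, 10, 100, 1000]:
    # 5000 repetitions of "average n measurements"
    means = rng.normal(mu, sigma, size=(5_000, n)).mean(axis=1)
    print(f"n = {n:4d}: spread of the average = {means.std():.4f}   "
          f"(theory: sigma/sqrt(n) = {sigma / np.sqrt(n):.4f})")
```

The printed spread falls like 1/sqrt(n): the window around the mean shrinks exactly as the plot suggests.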
And in fact, rather than proving something true, it is much easier for us to set it up as a test of failure: we try to prove something false, because it is much easier to show a counterexample and prove something false than to prove something true. This affects us tremendously in the way we go about doing research. In fact, when you apply for funds from a government agency, they will probably want you to write the question so that there are narrow experiments you can do to prove or, more likely, disprove something.

If you want an example, let us go back to that scientist who has come up with a new drug candidate in the lab. You think it is a new drug for cancer, you think it is a great drug. How do you prove it is a great drug? You have to go do trials, test it on people, on patients, and then prove it. Now, how many samples must you do to prove that it is a great drug? A very large number. So what is the easy way out? Pharma companies, by the way, spend about 1 billion dollars on a new drug, and I told you they hire statisticians: if they can cut down on the number of experiments, at 1 billion dollars per drug in R&D, you can see the impact of doing good data analysis. So what they do is, instead of asking how to prove that the molecule is a great new drug, something seemingly silly: you prove that the molecule is not the same as a sugar pill. Do you see that? What you should have been doing is proving that your molecule functions as a nice, active chemical agent. Instead, we set up the hypothesis that the molecule we have just invented is as good as a sugar pill (a sugar pill, by the way, is called a placebo), and then we try to prove that it is not the same as the sugar pill, thereby indirectly implying that it is better than a sugar pill. But you see, the entire spotlight is on proving or disproving equivalence to a sugar pill; the spotlight is not on proving that this molecule is a fancy drug.

It is a neat trick, but it is a trick which has impacted us even in the legal system. Think about it: how does a legal system work? Innocent until proven guilty. That is a hypothesis test. Why should we go with innocence unless proven guilty, why not directly prove guilt? Because you can see how that trial would go: you would try to find some wrong or the other that somebody has done, and you would be able to nail anybody. Everyone would finally be shown guilty of something or other; you would pull out something from their childhood and say they are guilty. So it actually impacts how we have gone about it: the basic notion is that you are innocent, and the burden of proof is put on the prosecution to prove that you are guilty. It is a related argument: it is easier to prove a falsehood than to prove the truth directly, and that is how our science is done. We focus on setting up small statements which we can prove false. Of course, if I prove that my molecule is not the same as a sugar pill, it does not tell me how much more active it is. For example, is my new antibiotic better than an existing antibiotic? I do not learn that.
In fact, that has to be taken up as the next question. This is a very fuzzy philosophical area, as I said, but it inherently reflects the fact that it takes a lot of examples or experiments to try to prove a truth, whereas all you have to do is find one counterexample. We say that all crows are black: I just have to go and find one white crow. So it is easier for us to overturn a statement than to keep proving that all crows are black, because what would it take for me to do that? I would have to go sample every single crow, and only then conclude that all crows are black.

But then how are you going to proceed with an experimental design? What are you going to do about it next? For the hypothesis that crows are black, there are two approaches, and one of them, going and identifying every single crow, is obviously not doable. Now we are getting into something even more fuzzy in terms of philosophy, so we will stay away from that; I just wanted to point out that this idea, that it is easier to tackle a falsehood, actually dominates how we do statistics, because we take small questions which we try to prove or disprove. These days we design experiments around falsifiable hypotheses. The other approach would have been inductive reasoning: you keep seeing more and more evidence, and you induce from it what you think the behavior is. If I keep seeing a drug work on a set of people, the belief is that it will continue working on more people. This, for example, is our traditional Ayurvedic style of treatment. The reason it is not too popular in science these days is that it requires you to keep testing for more and more crows being black; we need more and more experiments, and it seems to be a lot more work.

The procedure, then, to summarize: to test a hypothesis pertaining to a population parameter, we set up a model, we identify the parameters in the model, and we now need to devise a test. And it goes back to this idea of a range. If I want to prove that my drug is better than a sugar pill, I do not put the spotlight on the drug; instead I put the spotlight on the drug being the same as a sugar pill, and then I try to prove that that statement is false. If the drug is the same as the sugar pill, the difference in performance between the drug and the sugar pill is zero; we will assume that the zero point maps onto the drug being equal to the sugar pill. Then it is like that coin toss, and in fact it is easier to give you the coin toss analogy here: we expected 50 heads, so at what point will you say the coin is not fair? When you see an extreme result, either fewer than, let us say, 20 heads or more than 80 heads. So there has got to be a range, a range within which any measurement that I see, or any average for that matter (we said we would work with averages), is treated as reflective of the true value. In other words, a value inside this interval is taken as equivalent to the true value, and the reason it is not exactly the true value is my imprecision in sampling and the inherent randomness in how I am sampling.
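That range can be computed explicitly for the coin. A minimal sketch in Python (SciPy's binomial distribution); the 5% significance level is my choice for illustration:

```python
from scipy.stats import binom

n, p = 100, 0.5   # 100 tosses of a hypothetically fair coin

# How improbable are the lecture's thresholds under a fair coin?
p_extreme = binom.cdf(20, n, p) + (1 - binom.cdf(79, n, p))
print(f"P(heads <= 20 or >= 80 | fair coin) = {p_extreme:.2e}")

# The range within which we fail to reject "the coin is fair"
# at a 5% significance level
lo, hi = binom.ppf(0.025, n, p), binom.ppf(0.975, n, p)
print(f"95% range: {int(lo)} to {int(hi)} heads")
```

Anything inside that printed range (roughly 40 to 60 heads) is treated as consistent with a fair coin; 20 or 80 heads sits far outside it.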
So I have got to acknowledge that I have imprecision, and that sets up for me a range. That range is where my null hypothesis (and what was my null hypothesis? drug equals sugar pill) is "accepted". Actually, we do not like saying accepted; remember what we were trying to do, we were trying to prove a statement false. The fact that I get a measurement in this range does not prove that the drug is a sugar pill. It means, and again notice the trickery here, that I have failed to show that the drug is different from a sugar pill. I do not have enough evidence to say that my drug is different from a sugar pill, because my measurements for the drug seem to be close enough to the measurements for the sugar pill. In other words, we are being a little grudging: we do not want to accept that the drug is a sugar pill. And why do we not want that? Because we are a pharma company trying to make a profit selling a drug; we really want the drug to turn out to be different.

So you can see how the whole research approach is being tweaked around. We do not want to tackle the question directly, so we frame a falsehood, hoping that our measurements will lie outside the range, somewhere far away, so that we can say the drug is tremendously different from a sugar pill. That is what we are hoping for. And when it does not happen, when we end up inside the range, rather than say the drug did not work out, we will simply say we did not have enough evidence to disprove the null hypothesis. We are not accepting the null hypothesis; we have just not been able to disprove it. (And an extreme value, had we seen one, could have been on either side.)

You can see the trickery in this, and all of it is because it is possible for us to do a few experiments and then very quickly figure out where we stand relative to a claim. There is also a subtle thing which happened here: we said the drug equals a sugar pill. If instead I had said, let us try to prove my drug is better than a sugar pill, then I would have been stuck with the follow-up question: how much better? If I am saying that the drug is better than the sugar pill, and here is where my sugar pill sits, I would have had to ask where I now expect my drug to show up: here, or here, or here? Define what you mean by better. And we do not want that, because that is a tricky business; even if we are one percent better than a sugar pill, we are going to advertise tomorrow and say we have found a super-duper drug which will do the job just fine. So the whole analysis procedure gets tweaked around because you want convenience in experimentation and you want to avoid large numbers of samples in your experimenting.

This is a point I made a little earlier; remember the IQ example, rural versus urban. In all of this, and in what follows, I want you to appreciate a major insight into statistics: statistics cannot counter a badly designed scientific question. You want to answer whether a drug is better than a sugar pill? There is a way for that, a statistical algorithm for that. But should you have been comparing that particular drug with that sugar pill? I do not know; that is for a chemist or a pharmacist to explain, why that molecule must be studied. The statistician knows none of that.
So if you now try to summarize what we have just described as a procedure, this is the typical methodology for involving statistics. We start by stating a hypothesis; actually, we state two hypotheses. We state one hypothesis as an equality, drug equals sugar pill. But why did we even do that? Because we wanted to prove that the drug was good. In other words, our real belief, what we want to succeed, is the alternate hypothesis: that the drug is better than the sugar pill. But because we do not want that tackled up front, we focus on an equality, and that equality is something we can hope to shoot down easily.

So: state the hypotheses; identify the random variable involved and the relevant parameter to be tested. The moment we say we are going to study a drug, what is the measurement? If it is a BP drug, for example, I will have to start measuring people's BPs, for those who have taken the drug and those who have been taking the sugar pill. That is going to be my variable; identify the measurement which is going to be tested, and clearly state the hypotheses. H0 is, and here is the trick, what the pharma company does not want to win; the pharma company wants H1 to win, but we aim to shoot down H0.

Then identify how H0 is expected to describe variation. In other words, our measurements are not going to be precise; they have error in them, and we can expect a range of results. If you are going to make a claim about truth or falsehood, that claim is relative to our ability to correctly do an experiment. How much error might we have? What is the uncertainty in our measurements? Intuitively, if our measurements are very precise, we should really be able to distinguish a drug from a sugar pill. But if our BP measuring device, you know, those devices they wrap around your arm, itself has a huge least count, do you really have the ability to distinguish a drug from a sugar pill? It is a function of your measurement errors. So we need to understand the distribution that applies, and in particular the variance associated with your measurements.

We then identify a significance level. When I drew the distribution for you and said here is an interval, there was a certain area outside it. Do you want that area outside to be very small? In that case you are setting the goal posts even further apart, and it becomes very hard for you to shoot down H0, because anything inside the goal posts means H0 is fine, drug equals sugar pill. You would want a narrow set of goal posts, because then it is easier for you to shoot down H0. So decide what that area outside is; identify the threshold which we will use to decide the test. The relevant distribution has to be identified; we will get into the procedural details of carrying out these steps in the actual workshop. Are we looking at one side of the distribution or at both sides? That is what I mean by one-sided or two-sided. And then compute the test statistic from the sample data.

Just one subtle thing here: the first appearance of our measurements, of our experiments, in the entire approach is in step 4. You do not do the experiment, collect the data, and then go back and ask where the goal posts should be.
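Here is what those steps can look like end to end; a minimal sketch in Python (NumPy and SciPy), with entirely synthetic BP-drop numbers standing in for a trial, and Welch's two-sample t-test standing in for whichever test the workshop actually prescribes:

```python
import numpy as np
from scipy.stats import ttest_ind

# Steps 1-3 come BEFORE any data: state H0 (drug = placebo in mean BP drop)
# and H1 (drug != placebo), pick the measurement (BP drop), fix the
# significance level.
alpha = 0.05

# Step 4: only now do the measurements enter (synthetic BP drops, mmHg)
rng = np.random.default_rng(2)
drug    = rng.normal(8.0, 5.0, 30)   # hypothetical arm on the drug
placebo = rng.normal(5.0, 5.0, 30)   # hypothetical arm on the sugar pill

t_stat, p_value = ttest_ind(drug, placebo, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the drug is distinguishable from the sugar pill.")
else:
    print("Fail to reject H0: not enough evidence to tell them apart.")
```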
If I want to prove that my drug is better than a sugar pill, I should not take my measurements first and then figure out where to keep the goal posts, because I am biased; remember, I am a pharma company here, trying to sell my drug. I want the drug to turn out to be better than a sugar pill, so I will make sure I set the goal posts such that the sugar pill looks far inferior to the drug. You cannot do that; that is cheating, obviously. The goal posts have to be defined first, and then you ask where your measurements ended up: are we inside or outside? If you are outside, we say you have something funny or extreme going on. These are subtle points. This procedure is written out, and I describe it in the data analysis course in detail. Ask yourself: what happens if you interchange the steps? Where is the possibility for cheating in here? What are the issues with the procedure steps, and how does one compromise the procedure by doing things out of sequence?

Now, about the two hypotheses: they are typically negations of each other, but that is also a kind of problematic thing. For example, what is the complement of drug equals sugar pill? Look at the pharma company: if it was a drug for BP, what did you expect? That the change in BP should be on one side, that it should control your BP better than the sugar pill does. In other words, we expect the BP measurements to lie on one side, not randomly on both sides of what the sugar pill does. So the opposite of drug equals sugar pill could have been drug not equal to sugar pill, or drug better than sugar pill, or drug worse than sugar pill; these are all possibilities, and they are not all direct negations. It comes down, again, to precisely what you have in mind, what you are expecting.

For example, somebody is trying to sell you a set of light bulbs and claims that the light bulbs last 500 hours, and you decide to do a test: you are going to buy light bulbs and figure out the average lifespan of a bulb. Now, what do you think? Do you think your light bulbs will on average last 500 hours, more, or less? Where do you expect your measurements to be? Less, because you expect this fellow to have already inflated the estimate of 500. You do not trust that 500; that is why you are doing the test in the first place, that is why you are investing your money in buying light bulbs. At the end of it all, you do not expect measurements to lie much above 500, because if you end up working out an average of 600 hours, the light bulb seller is going to be very grateful to you for proving that his bulbs are even better than claimed. If anything, you expect him to have inflated the figure; that is what we do in advertising. An inflated claim implies that you expect measurements at the lower end, so we want our attention focused on the lower end.

Finally, there is a systematic way in which you report a test result. There is a concept called the p-value, which is critical to reporting these results, and you also want to talk about how powerful the test is, power being a numerical measure attached to a test. These are concepts that one describes in detail in a hypothesis testing course.

Before you break for tea, I just want you to appreciate that life is not as simple as what I just put up. The reason is that probability, as we have talked about it, has been in the context of a repeatable experiment. We talked about coin tosses, and you have no problem tossing a coin 100 times; your thumb will hurt, but you can toss it 100 times, and you can see yourself doing that experiment.
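To pin down that one-sided light-bulb test: a minimal sketch in Python (the alternative keyword needs SciPy 1.6 or later; the lifetimes are synthetic, drawn around a made-up true mean of 480 hours):

```python
import numpy as np
from scipy.stats import ttest_1samp

# Claim: bulbs last 500 hours. We suspect inflation, so H1 is one-sided:
# mean lifetime < 500.
rng = np.random.default_rng(3)
lifetimes = rng.normal(480, 40, 25)   # hypothetical lifetimes of 25 bought bulbs

# alternative='less' puts all the attention on the lower tail
t_stat, p_value = ttest_1samp(lifetimes, 500, alternative='less')
print(f"mean = {lifetimes.mean():.0f} h, t = {t_stat:.2f}, p = {p_value:.4f}")
```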
The coin toss is repeatable. But suppose I asked you to prove that increased CO2 levels will cause the polar ice cap to melt, that increasing CO2 levels will cause the ice at the North Pole to melt. Can you repeat that experiment? No. So you have a fundamental problem, and yet we still want to test such hypotheses; obviously, in climate science these are very important hypotheses. So now it comes down to probability based not so much on the frequency of an occurrence. When we talked about the coin toss, it was the frequency of heads over all possible tosses, a frequency-based definition. Now we have got to talk about belief-based definitions, because, with that crow example, we have a certain belief that crows are black, and that belief system changes based on which crow we see. If I see more black crows, my belief that crows are black will only increase. On the other hand, if I see a white crow, boom, my belief system changes tremendously.

So there is this notion of belief, and that takes us back to Bayes' theorem; hopefully some of you recall it. You would have been taught to write the probability of A given B in terms of the probability of B given A: P(A|B) = P(B|A) P(A) / P(B). But we have now got to talk about the probability of a model. The model could be "crows are black". What is the probability of crows being black, given that I have just seen a new crow which is black? That has got to be written in terms of the probability I gave the model before seeing the data. Here is how it gets interpreted: before I did any experiment, I had a belief in crows being black; then I went and got myself some data, I did some experiments, I got some crows and looked at their color; and having seen the data, I have got to update my belief. The term on the left-hand side is the update: you had a previous belief, you saw some data, and now you have an updated picture of what is going on.

So there is this Bayesian approach, as opposed to the frequency-based approach, to interpreting probabilities and therefore beliefs. And research is a lot like that, because we do not think we have such a precise model that we can do infinite sampling and then work out the parameters in the model. It is almost like a belief. There is a certain belief in Newtonian physics, and your belief in Newtonian physics is going to be shaken the moment you see one example which cannot be explained using Newtonian physics; then you need Einstein and his theory of relativity. It is a belief system. Given, for example, two competing theories, you are going to hold beliefs in both, and our tendency is to go with the simpler theory which explains all that we have seen. So the belief that is shared between two competing theories gets modified based on the data we have just seen: the data tends to say that one theory ought to gain more weightage, more credence, and we shift our belief towards that theory compared to the other.

In the Bayesian world, it is not like saying there is only one theory. Think of that hypothesis test we had, sugar pill equals drug: that was one narrow question, and that is that; disproving it leaves you a gray area. What are you going to do about it? It turns out your drug is different from a sugar pill; what does that mean? You have to do something more to explain what it means.
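Mechanically, the updating looks like this; a minimal sketch in Python with two made-up models of crow color (H1: every crow is black, H2: only 90% are) and equal prior belief:

```python
# Two competing "theories" and the chance each assigns to a black sighting
p_black = {"H1": 1.0, "H2": 0.9}     # illustrative numbers
belief  = {"H1": 0.5, "H2": 0.5}     # equal prior beliefs

sightings = ["black"] * 10 + ["white"]   # ten black crows, then one white crow

for crow in sightings:
    # Bayes' rule: multiply each belief by the likelihood of the sighting...
    for h in belief:
        likelihood = p_black[h] if crow == "black" else 1 - p_black[h]
        belief[h] *= likelihood
    # ...then renormalise so the beliefs sum to 1
    total = sum(belief.values())
    belief = {h: b / total for h, b in belief.items()}
    print(f"saw {crow:5s} -> P(H1) = {belief['H1']:.3f}, P(H2) = {belief['H2']:.3f}")
```

Each black crow nudges belief towards H1; the single white crow sends P(H1) to zero in one step. Boom.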
So in the Bayesian world, you end up saying there are multiple worlds, multiple theories possible, and now it is about how much you believe in each of them. That average human height could have been 3 feet, could have been 5 feet, could have been 7 feet, and as that alien sees more and more human beings, the belief in the average being 7 feet drops, the belief in 3 feet drops, and the belief in 5 feet goes up. You have multiple models, multiple theories, and we are now juggling the extents of belief in each of them. That is the Bayesian world. It is important to us because we work in incremental updates: as we do experiments, we are adding or removing belief in a particular model or system, and there is a strong conditional-probability-based explanation of how we go about that. And that, of course, impacts the hypothesis test: if I tell you that I want to simultaneously entertain crows being black and crows not being black, and just juggle the extents of belief in these two models based on the available data, that is a different way of going about it than simply saying crows are black and then testing against a set of measurements, because in the latter case I acknowledge only one model at a time, whereas otherwise I can hold beliefs in multiple models.

We will pause here for tea, and when we get back we will start discussing ways of representing data and how one can be more efficient in conveying information through data presentation. After that, we start talking about the key distinction between cause and effect, which is also a key aspect of research. What I will do is give you a flavor of three additional topics, and we will have to leave the discussion of the actual topics for the main workshop later in the month.

Now, in the next topic, I want to point out that it is important to convey this information precisely to others. This is usually a discussion of descriptive statistics, but I am going to forget the "descriptive" part, and we are just going to look at statistics using pictures. I just want you to be aware that, over the years, this has taken on a different dimension in terms of how fast you can convey information. Of course, do not forget the point brought out in one of the videos you saw on the first day: you must talk about numerical significance. It is very annoying to see students take some measurement, let us say 1.0, divide it by another measurement, 3.0, and report that 1.0 divided by 3.0 is 0.333333, just because the calculator gives them ten digits. You have got to talk in terms of the number of significant digits, and invariably the least count of your measurements defines the number of significant digits. You have then got to carry through all subsequent calculations with the same number of digits. You cannot suddenly get something precise to many more digits just because your calculator shows them; your calculator is not aware of errors and how errors propagate. This is the key aspect which governs those intervals of uncertainty, which is what we were talking about just before the break with the hypothesis test. It is something particularly annoying when students do it in labs, and it is something we will discuss in the main workshop in some detail.
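That loss of precision can be simulated directly: treat each reading as known only to its least count and let the hidden digits vary. A minimal sketch in Python (the half-least-count of 0.05 is an assumption for illustration):

```python
import numpy as np

# 1.0 and 3.0 are readings with a least count of 0.1: the "true" values
# lie anywhere in [0.95, 1.05) and [2.95, 3.05). Simulate the hidden digits.
rng = np.random.default_rng(4)
a = rng.uniform(0.95, 1.05, 100_000)
b = rng.uniform(2.95, 3.05, 100_000)

ratio = a / b
print(f"a/b lies anywhere in [{ratio.min():.4f}, {ratio.max():.4f}]")
# Roughly [0.311, 0.356]: only the first digit or two survive, so
# reporting 0.333333 claims precision the measurements never carried.
```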
But what I instead want to focus on is that we are all used to certain ways of depicting data: the bar graph, the line chart, the frequency polygon, and the habit of maintaining data in tabular form. If you think about it, most of this has to do with the fact that, given tons of data, you cannot really walk around with all of it.

(Coming back to significant digits for a moment: how many significant digits are there in 1.0? Two, because the 1 was important to you and the 0 was important to you. In other words, you do not trust the follow-up digit: you cannot claim 1.05, because you do not know about the 5. And if you have got two significant digits going in, you have two significant digits in your answer, 0.33, and that is it. Accuracy also gets lost as you do a series of calculations. For example, in subtraction: take 0.33 and subtract 0.31. The answer is 0.02, but only the 2 matters to you, so you are down to one significant digit; you have lost significance. You can sit and work this out; in other words, you can simulate what might happen at the next decimal place and work out the range of answers you would get, as in the sketch above. So a fairly important point is that as you go through calculations, you tend to lose significance. And if you have one measurement which is inaccurate entering a formula along with many others, it does not matter that the other things were measured to high precision: the inaccurate one is going to impact the accuracy of your final result, and that has to be taken into account. It is invariably ignored, because calculators report a set number of digits. What about exactly 1 kg divided into 3? There, if you are implying that your 1 is known to infinite precision, and the 3 is an integer, also infinitely precise, then 1 kg precisely split into 3 poses no problem; but a measured 1.0 divided by 3 has its problems.)

So, coming back: the whole point is that given tons of data, we have got to find quick ways of compressing it and carrying it around; we cannot carry matrices of data. You want plots, and you want to learn things from plots. When you are talking about a distribution of heights: what is the average, what is the range, is it a skewed distribution, are we a room full of tall people or of short people? All of that information is easily conveyed using some kind of curve or plot, and invariably we want to abstract these plots, these functions, out of the whole matrix of data. That is where the notion of descriptive statistics starts, and these are the classical ways of doing it. Here is a waterfall chart: you start with 217, something subtracts out 55, then you add back up; you can see the levels going up and down, and you can read off the final 199. These are just different tricks for conveying information about additions and decreases. There is the stem-and-leaf plot, which your statistics textbooks give as a way to represent numbers: that is a 70.0, or a 69.0; do you see that the common digit is abstracted out and written to the left, in the margin? If I have got 51.0, 51.3 and so on, then rather than repeat the 5 every time, I pull out the 5, and the display starts looking like a stem with leaves.
But these were things invented way back, when all we had were typewriters to convey information in plots. We can do much better. Nowadays, thanks to Scilab (hopefully all of you appreciated the workshop on Scilab), there are ways to handle lots of data and come up with elegant plots that convey trends. Here is something out of biology, where for different patients you are looking at the level of production of a protein, and the idea is to figure out whether there is some defect in protein production in a patient; it is called a heat map. There is the pie chart; the first pie chart was, ironically, by a nurse. You remember Florence Nightingale? She was trying to make a simple point, again to the politicians in England: she wanted to show that more soldiers were dying in the war zone because of lack of hygiene than because of enemy bullets. She just looked at mortality, how many were dying per month, and found a way of conveying it back home; a simple, elegant idea.

The same information can be drawn in different ways. Here is an unusual way to think of the sizes of continents: here is Africa with different countries fitted inside it; you see China, you see the US, you see India plugged in there. It gives you an immediate sense of the comparison. Here is the periodic table in a form you have not seen. The idea is to provoke; the idea is to communicate and inform, and you can start provoking with different representations. These are called Chernoff faces, used in pattern recognition. Remember I told you about that nuclear plant operator trying to stare at 100 different sensors at a time? He is getting readouts, signals fluctuating back and forth, and he wants to make sense of them fast. It turns out one quick way is to take each measurement and assign it to some part of a face: the shape of the eye, the shape of the eyelid, whether the mouth is smiling or not. If the process is in good shape, hopefully you get a face somewhere over here, with a small smile. Why is this important? Because you map each variable to some facial feature, and it is easy for us to recognize a facial expression and recognize that things are fine. Very intuitively, without calculations, without formulas, one look at this image and you learn a lot about your process. It is just a trick. Similarly, in this next plot, each of the lines you see is actually a different variable, and the length of the line depicts the magnitude of the variable. Here is the world map drawn according to population: you see how China and India dominate and the other countries do not. Again, ways to be provocative.

That said, you really ought to go and look at this site, gapminder.org, or go to ted.com and look up a chap called Hans Rosling, who essentially founded Gapminder. He is basically a health scientist, and he has been talking about the tremendous changes in various third-world countries over the last 30 or 40 years, in terms of how health and mortality have improved. Beautiful insights. He has essentially a tool which looks like this, except it is animated; so that is gapminder.org, or ted.com, and the name is Hans Rosling. There is lots of data which the World Health Organization has collected over the years, and his tool essentially runs an animation, a video of what happens over the years.
So his bubbles move around: you scroll back to 1960, you start the animation, and you realize what has gone on in the world; the history of the world, boom, in 10 seconds. The amount of information he conveys with this is staggering. Basically, he points out how countries like India and China start at this end, very little money, very poor health (health being measured in terms of life expectancy), and how we are all moving towards Japan and the US. We are not doing as badly as we think we are; we are moving well. And you learn some surprising insights: China, for instance, controlled its population first, before becoming rich; boom, they went up there, improved their life expectancy, and then started becoming rich. He has tools like this which allow you to look at various parameters. This is a function of the United Nations putting data out there, census-derived data. You can play around with it, and it now impacts policy in a big way. Google has a similar tool, the Public Data Explorer, which is openly available for you to play around with.

The point of all this is that there are now much more sophisticated ways of making data felt in a presentation, as opposed to the standard cartoons you see in your statistics books. You do not really need the bar plot or the frequency polygon anymore, if you can help it. There is also this standard cartoon that I put up: we need information to bring down confusion, but if you have too much information, you can actually confuse people. You have to tailor your presentations of the data that you have. The things you can do: you can plot this data, you can learn how to deal with data in spreadsheets, and of course you can also learn how to do it in Scilab; we will do some of that in the course.

Now, in what little time I have left, I want to bring up one important, serious topic, which has to do with cause and effect. Just as with the hypothesis test before the break, I want to bring out that we tend to confuse the notions of cause and effect given data. You can look up this particular phrase on Google and you will get a set of presentations created by Judea Pearl. Judea Pearl is actually more famous for his son, Daniel Pearl, whom some of you might remember as the American journalist killed in Pakistan. He is one of the leading researchers in causality, and that too from a computer science perspective; unfortunately, after his son was killed, he gave up much of his research. There are lots of interesting presentations he gives about the history of this kind of research methodology, and since we are talking about methodology, it is important that you get a sense of it.

You will appreciate that what you really want is to make statements about causality: something is causing something. This drug causes that symptom to go away. Smoking causes cancer. You want causality. But the catch is that whatever experimentation you do only proves correlations between variables. You take a bunch of people, you find that they have been smoking and that they have cancer; there is now a relationship between X and Y. That is a correlation. What it does not tell you is that smoking causes cancer. That is the theory we would like to believe, the theory which we inherently think is true. But it is a subtlety that I want to point out to you.
When you write y = ax + b, you are saying that y is related to x. But think of all the years in high school where you were taught how to plot: you were told to put x on the x axis, the horizontal axis, and y on the vertical axis, right? The interpretation being that you are in charge of x: you control x, and then you see what happens to y. In other words, you have got an independent variable and a dependent variable; something depends on something else. So x is the cause and y is the effect, right? But there is a subtle problem: if you are insisting that y = ax + b, I should be able to rearrange that algebraic expression and write x as a function of y, x = (y - b)/a. But then what is cause and what is effect?

The key thing here is that in mathematics we do not, in an immediate sense, have a notation to deal with causality. We have an equality sign, and the equality sign only tells us there is a relationship between y and x; it does not tell us that y is caused by x, and it does not tell us that x is caused by y. It simply says that if you change x, y changes, and vice versa. Whereas in causality we want an arrow: we want to say x causes y. That is the subtle thing. So when we do research, go out into our process and make ten measurements, we quickly realize that some things seem to be changing along with some other variables, and we are tempted to say there is a relationship between these variables. And there is one claim we are quick to jump to, which is that variable X caused variable Y. But how do you know that? The extent of smoking seems to be associated with the incidence of cancer in people; but think of the interpretation. Does smoking cause cancer, or do people with cancer smoke? And why is the latter not feasible as a theory? We have more preference for one particular direction of the arrow than the other, but there is no proof; you are unable to prove that direction until you do an intervention experiment, and I will discuss that a little later.

By the way, here is a standard question I ask students at interviews: you were asked to plot y versus x; can you interchange the axes and plot x on the vertical axis and y on the horizontal axis? Yes; in thermodynamics, you have to do it. But it is a skill. It is in fact a failing of our schooling system that you are taught to visualize things in one strict format. You tend to visualize a process one way, for example a process saturating like this; could it have been drawn saturating the other way? It is one of those issues: you have got to figure out how to visualize relationships between variables without getting locked into a bias in one form or the other.

It turns out many people have struggled with this. Galileo, of course, started shaking things up, and it turns out many of the laws we have developed since are empirical laws. The first person who really starts talking about causality is Hume, who looks at flame and heat and says: where there is a flame, there is heat. He is trying to imply that you can learn these rules; that is the first instance where you get the sense that one can look at data and learn a rule, a rule which says something is causing something else. Of course, you have got problems: the crowing rooster and the sunrise. What is causing what? You can see the philosophical tangles you get into.
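A quick illustration of how the symmetric equality hides an asymmetric fitting choice; a minimal Python sketch with synthetic data (true slope of 2 and unit noise, both made up). Regressing y on x and x on y give different lines, and neither direction says anything about arrows:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 200)
y = 2 * x + rng.normal(0, 1, 200)   # noisy linear relationship

a_yx, b_yx = np.polyfit(x, y, 1)    # regress y on x
a_xy, b_xy = np.polyfit(y, x, 1)    # regress x on y

print(f"y on x: slope = {a_yx:.2f}")
print(f"x on y: slope = {a_xy:.2f}, implied y-on-x slope = {1 / a_xy:.2f}")
# The two fits disagree once there is noise (about 2.0 versus 2.5 here):
# the equation y = ax + b is symmetric, but the fitting direction is a
# modelling choice, and neither fit proves causation.
```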
With the rooster and the sunrise, all you know is that the two happen at the same time; but what is causing what? Similarly, when you are given an expression like F = ma, you realize that given any two quantities, you can solve for the third. But does that mean force is caused by acceleration, or is acceleration caused by force? If you think about how we are taught physics, acceleration is actually caused by something propelling an object, by a force being applied. It is a subtle thing; does the equation tell you that? It does not. For that matter, can you rearrange it to say that F/a causes mass? So causality is a big headache. Then another major statistician, Karl Pearson, comes along in the early 1900s and talks about correlation, not causation: keep focusing on relationships, prove that they are pairwise relationships, show that x and y are related by some linear model, y proportional to x, but do not go that extra step and claim causation, because you do not have proof.

It then turns out that if you want to prove x causes y, if you want to prove that smoking causes cancer, you have got to do some additional experiments. Here is a simple example. Take these two systems of relationships: the first says y = 2x and z = y + 1; the second says x = y/2 and y = z - 1. I am just playing around with the algebra here; if you plug the equations into one another, you can take it for granted that the two systems are identical, the same expressions rearranged. But if I now want to draw arrows, how do I go about it? In the top model, to go from x to y I multiply by 2, so y = 2x, and then adding 1 gives me z = 1 + y: the arrows run from x to y to z. In the bottom model, z is what drives everything, and I have to interpret it the other way around: z to y to x. As I do my experiments, I will see values of x, of y, of z, and both models are consistent with the same observed combinations. So how do I know which model is true?

You cannot know until you intervene and do something drastic. Suppose you come in and say: regardless of whether y came from x or not, let us somehow fix y; set y = 0, force y to be 0. What should happen? According to the upper model, if y is 0, then z is 1 all the time, regardless of what happens with x. But according to the lower model, if y is 0, then x is 0, regardless of what happens with z. So now do you see? If I intervene, set y = 0, and then observe x and z, I can figure out which way my arrows are going. That is the basic idea of an intervention experiment. Invariably, when you do experiments and learn that many variables are related to each other, what next? How do you figure out what is causing what? You have to go and fix some variables, and as soon as you fix some variables, you ask what else changes. That gives you an idea of which way things are headed.

Just to get a sense of how controversial this can be, I have put up a couple of examples, and we will end with these.
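Before the examples, here is the intervention idea in runnable form; a minimal Python sketch of the two models above (the distributions chosen for the driving variables are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)

def model_1(n, do_y=None):
    # x -> y -> z :  y = 2x, z = y + 1
    x = rng.normal(0, 1, n)
    y = 2 * x if do_y is None else np.full(n, do_y)
    z = y + 1
    return x, y, z

def model_2(n, do_y=None):
    # z -> y -> x :  y = z - 1, x = y / 2
    z = rng.normal(1, 2, n)
    y = z - 1 if do_y is None else np.full(n, do_y)
    x = y / 2
    return x, y, z

# Left alone, both models satisfy the same equations. Now intervene,
# forcing y = 0, and they disagree about what moves:
x1, _, z1 = model_1(5, do_y=0.0)
x2, _, z2 = model_2(5, do_y=0.0)
print("model 1 under do(y=0): z =", z1[:3], " but x still varies:", np.round(x1[:3], 2))
print("model 2 under do(y=0): x =", x2[:3], " but z still varies:", np.round(z2[:3], 2))
```

In model 1 the intervention pins z at 1 while x keeps varying; in model 2 it pins x at 0 while z keeps varying. Observing which happens tells you which way the arrows point.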
Go back to a major publication in the Proceedings of the National Academy of Sciences in the US, an article which came out a couple of years back on warming increasing the risk of civil war. This is not really about causality directly, but about the fact that interpretations can be false, or could potentially be false, and therefore dangerous. Here is a statement which says warming increases the risk of civil war in Africa. If you go through the paper, they are basically saying that in years likely to be very warm in Africa, the US and the UN should not pump money in as aid, because people are going to fight anyway and waste the resources. That is the impact of the statement; you can see there is big money at stake for countries in Africa. How did they come up with this? They simply looked at the average temperature on the continent versus the frequency of wars; there is a correlation, a relationship between temperature and wars, and from there they jump to causality: warming causes wars. Now, there could be an elaborate mechanism: in hot years you run out of water, and when you run out of water you are more likely to beat up your neighbor, and wars happen. There could be a realistic physical reason why this happens, but the experiment to prove that linkage has not been done. And it did impact humanitarian funding to Africa. So this can be controversial; of course, there has been a lot of follow-up on this in the policy literature debating such things.

Here is another one which created a lot of fuss, and it has to do with cricket. You remember Javed Miandad and the famous six at Sharjah? Here is a fellow who published, in a medical journal in England, the claim that after that six in 1986, Pakistan started winning against India, relative to the years before. What you are looking at is the net cumulative wins that Pakistan had. They are trying to imply that that six was a defining moment in cricket, after which, at least up to the early 2000s, Pakistan had a winning streak against India. That is also a flawed argument, and you can immediately imagine: cricket, India, Pakistan, what more do you need? A whole fight broke out over this. And what do you think the flaw in all of it was? It turns out Pakistan was simply doing better in cricket over that period against every country; they were not winning particularly against India, they were winning against everyone. It was simply a good stretch for Pakistani cricket. But you can see how the interpretation gets tweaked: you are claiming a causality, that the six had an impact on India-Pakistan cricketing relations, according to this theory. You are tweaking statistics for your own policy purposes.

A couple more fun examples to wrap this up. I am going to give you a plot where, once again, x is on the x axis (why not put y there, right?). Let us say x is the number of crimes per city per year, in different cities, so each point is a city; and y is the number of temples in that city. You can immediately see how dangerous this analysis can get: do criminals go to temples, or does going to temples cause crime? That is what happens if you insist on inferring a causality here.
So at best you can hope for a correlation, and even that correlation is intuitively troubling to us, right? Why are we looking at a correlation between this pair of variables at all? What is the resolution, what is the explanation for this plot? Because this is how the data will look for different cities, and nothing is wrong with it statistically: one can prove that this line is indeed a good, proper straight line with a non-zero slope. The explanation is population. What they should actually have been measuring is the number of crimes per head of population, and temples likewise: a larger population has more crimes, and more temples to deal with that larger population. The inherent problem here is that a third variable, which could have explained everything better, was ignored in the analysis. That is again a practical problem in research: are you looking at the right variables? Had you included population as a third variable, the relationship would have worked out as x being related to population and y being related to population, with no problem of interpretation. Without it, you can see how troubling your interpretation gets, even though this is statistically sound as a regression; you can do tests to prove that x is related to y. And that is what happens in social policy-making: you collect hundreds of variables (that Gapminder link I gave you has hundreds of columns of data available), you look for relationships, and you will get a relationship like this. Who is to say that the relationship is invalid? Here I am going to an extreme and giving you a pair of variables chosen to provoke you, but in general, how do you know?

The final point is whether each measurement is relevant to you. This is another standard problem in research: we tend to throw out measurements we do not care for. We tend to think a point was something bad we did in our experimentation, so let us quickly hide that measurement before somebody sees it and pretend we never did that experiment. That is very standard behavior, but that point, for instance, could have been a very influential one, one which actually leads you to believe it is no longer a linear model, that something non-linear is happening. So how do you interpret your data? You can see there is a systematic need for diagnosing the importance of each and every measurement in your analysis, and then evaluating it in the context of a relevant model.

I will stop here with this; the last part of my talk would have been how all of these issues impact the actual publication of results in the literature.
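As a closing sketch, the lurking-variable story above is easy to simulate; a minimal Python example in which population drives both counts (the rates, 30 temples and 500 crimes per million people, are invented):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical cities: population drives BOTH counts; temples and crimes
# have no direct link to each other.
population = rng.uniform(0.1, 10.0, 200)    # millions of people
temples = rng.poisson(30 * population)      # count scales with population
crimes  = rng.poisson(500 * population)     # count scales with population

# The raw counts look strongly related...
print(f"corr(temples, crimes)         = {np.corrcoef(temples, crimes)[0, 1]:.2f}")

# ...but normalise by the ignored third variable and it evaporates
print(f"corr(temples/pop, crimes/pop) = "
      f"{np.corrcoef(temples / population, crimes / population)[0, 1]:.2f}")
```

The first correlation comes out near 1 and would pass any regression test; the per-capita correlation is near 0, which is the whole point about looking at the right variables.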