pooling that is used, type of this, type of that, all categorical data. Or for instance, Sama Kapi said, we want to understand what explains why our service request processes are delayed. A bank said, we want to understand why our servers are crashing. Starki said, there are several serials we were talking about; we want to understand what impacts the TVR. Tell me what I can do to influence the TVR. That's a topic in itself, but let's not get into that. We are currently working with a bank that says, we want to know who's going to leave before they actually tell us, whether they've made their decision or not. And all of these have just one characteristic in common, which is that we are trying to answer a certain pattern of question: what explains something? And the bulk of the data is categorical. In 2012, we didn't have a system for this. We didn't have a way to automate it. We created one, and this is called openness. This talk is basically about this. There is a library, which I will share with you towards the end of the talk. But the focus of this session is on explaining what this technique is and how it works. All right, go ahead. I'll talk about a couple of things. First thing, this is a simple talk. Not going to talk about big data. Not going to talk about neural networks. Not going to talk about machine learning. Not going to talk about AI. The most complicated thing that I will be covering is the t-test. And you don't need to know much about the t-test to be able to get this. But as I've said, there's a difference between simplicity and ease. I'm focusing on simple things. Ease is different altogether. Also, I'm not going to try and make this entertaining. The focus is on teaching, because that's very, very important. Right, people mistake entertainment for education. We hear an interesting talk and say, oh, that was really cool. What did you learn out of it?
I don't know, but it was really cool. I'm guilty of that. I've got a number of examples that are very down to earth. And we'll be using spreadsheets. I'm tempted to say Excel, but that's not the purpose. It's not to say Excel is great. It is great. The point is that spreadsheets are great. You may have heard of Fred Brooks, who talks about how, after the advent of high-level programming languages, there hasn't been something in programming that dramatically transformed productivity. There are a few exceptions. Databases are one such exception. Spreadsheets are another such exception. And I truly believe that spreadsheets are one of the greatest gifts of software. They take the ability to process data away from an esoteric small group that has access to programming and give it to the wider world, and you could argue that more analysis is being done in Excel than in pretty much anything else. But any spreadsheet will do; all we're going to do is play around with it. What I'm going to do is actually take up an exercise. We will answer one simple question using this data set: does the use of calculators actually help students score more in mathematics? So let's open the data set. I'll make it a bit bigger, hoping that we can see it in the right size. Is this good enough for the people at the back? So the data set is fairly straightforward, one row for each student. And I've taken a small sample of the data set, on the students in the state of Maharashtra. So we have about eight of these students that have been sampled in Maharashtra. We have a student ID, which is a long number. We have a district code. We have the gender, we have the age, and so on. And the information about calculators is stored in a column called "Use calculator", which is sometimes empty, or it can have a yes or a no.
So in order to explore this, what I'm going to do is create a pivot table. I'll use a shortcut; don't worry about how I got there. And I'll say, let me group by the use of calculators. So we have three possibilities: yes, no, or blank. So let's see, what is the average mathematics mark of a student? But I also want the count. I took the sum by mistake, so let's fix that. And let's represent all of these counts as percentages, so I'll have each as a percentage of all the students. Let's get rid of that. And I also find it easier to color code it so that we can see where the high and low values are. So what this tells me is, yes, using calculators gives a higher score in mathematics than not using calculators. And interestingly, the people who left this blank, and there are 24% of them, not indicating whether they're using calculators or not, are scoring even less. Why have they left it blank? I have some theories on it. Maybe they don't even understand the question, which means that they have bigger problems than the use of calculators. They have problems with the language. But that's just a theory. However, what I can say is that, at least from this, the use of calculators seems to go with a higher score. Having said that, correlation is not causation. It's just a hint. There are two ways to take that statement. The first is to focus on the first part, which is to say correlation is not causation; therefore, the fact that I have found a correlation is meaningless and it's therefore not actionable. However, that's throwing the baby out with the bathwater. On the other hand, you can look at it and say, here's a hint. There's something going on between calculators and mathematics. I need more analysis to know what, but at least there is a relationship in the first place. It at least merits investigation.
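The pivot-table step described above (group by calculator use, then show the average maths mark and each group's share of students) can be sketched outside a spreadsheet as well. This is only an illustration: the function name `pivot_mean_and_share` and the sample rows are made up for the example and are not the survey data shown in the talk.

```python
from collections import defaultdict

# Hypothetical rows: (use_calculator, maths_percent). Values are invented
# for illustration; they are NOT the survey data from the talk.
rows = [
    ("Yes", 40.0), ("Yes", 30.0), ("No", 28.0), ("No", 32.0),
    ("", 20.0), ("", 30.0), ("Yes", 35.0), ("No", 30.0),
]


def pivot_mean_and_share(rows):
    """Group rows by category; return {category: (mean_marks, share_of_rows)}."""
    groups = defaultdict(list)
    for category, marks in rows:
        # An empty answer becomes its own "(blank)" group, as in the pivot table.
        groups[category or "(blank)"].append(marks)
    total = len(rows)
    return {
        category: (sum(values) / len(values), len(values) / total)
        for category, values in groups.items()
    }


summary = pivot_mean_and_share(rows)
for category, (mean_marks, share) in sorted(summary.items()):
    print(f"{category:8s} mean={mean_marks:5.1f}  share={share:5.1%}")
```

Keeping the blank answers as a separate group, rather than dropping them, is what lets the "left it blank" students show up as their own row.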
Maybe those that score high on mathematics tend to like calculators and use them. Possibly. So it could be in reverse. But at the very least, use of calculators and scoring high in mathematics do seem to have an association. And we can check this association for many other things. So for example, can I answer the question: do computers help in mathematics? Let's quickly do that. So instead of "Use calculator", I'll get rid of that and say "Use computer". And the short answer is yes, using computers also helps. Not using computers doesn't help. And obviously, not answering the question is no good for the student. But you'll notice that the difference between these is not so large. But before we go into the difference between these two, let's ask the question: is this meaningful, or is this just plain random? When I was a kid, my name started with A. And therefore I got called in class for many things in alphabetical order and ended up being one of the first. Which puts me at a certain disadvantage. I've got to be a little more prepared because I could be called on right at the beginning. But then people also say, look, that's an advantage, because then you're getting trained to be more prepared for impromptu things, and that's going to be an advantage. Now this is a classic question, right? So does it actually make a difference? Does the first letter of the child's name make a difference to their marks or not? I'll do a quick poll. How many people think the first letter of the child's name makes a difference to the marks? How many people think it does not? Okay, "it does not" is the majority. So until last year, I was consistently telling people that no, it makes no difference whatsoever. And then I went back this year and did a rehash of this for specific states. It turns out that in Karnataka it does make a difference. Those whose names start with the letter K do score higher, and it is statistically significant. But this isn't a talk about numerology. Let's not get into that.
The thing, though, is we do have the ability to differentiate between pure randomness and what is actually significant, right? Using the t-test. For those of you who don't know the t-test, it's a magical thing that basically tells you whether two groups have the same average or not. So let's apply the t-test. I want to check whether the use of calculators is actually making a difference or not. Now how do I want to do that? I want to see if the population as a whole is different from the sample that uses calculators in a statistically significant way. In other words, is the average of this one higher than the other? So how do I do that in Excel? I'm going to filter that set of people who use calculators. That's this set. And I'm going to take their maths marks, which is a column somewhere towards the end, okay? That's their maths percentage, and I'll put it in a new column here. So what I have is data for approximately 3,000 students who have used calculators, and each row represents one of those students. Now I don't have maths marks for everybody, but that's okay. The other population that I want to take is basically every single student. So let's take that. And again, I don't have maths marks for everybody, but that's okay. Now I will use the Data Analysis ToolPak, which is an optional item that you can install in Excel itself. There are a whole bunch of t-tests. There is paired, for example, blah, blah, blah, there is two-sample assuming equal variances, and there is two-sample assuming unequal variances. Now here, when you get stuck, the simple rule of thumb to follow is: try them all, one by one. It doesn't really matter. The good part about large data is that the whole premise that statistics is based on goes for a toss. Statistics is based on the assumption that it's very difficult to get data. Therefore, we have to make the best use of it, which means that we take a sample, because working with data is difficult.
Getting the data for the sample in the first place is very difficult. With data analysis, what we usually have is the entire information. And we have systems that can process billions of data points, even on your mobile, let alone your desktop. So why worry about the sample? The difference between equal variances and unequal variances for this particular data set is in the 16th decimal place. We have bigger problems than the 16th decimal place. So let's just do the test with equal variances, for all I care. And this says, tell me the column which has the first data set and the column that has the second data set. So for the first data set and the second data set, I will pick this one and then those. And okay, it tells me a bunch of things. Now, the important thing to know about statistics is that usually almost everything is ignorable. Most of the stuff that you're given is not really relevant. The one thing that matters is what the difference is, what the result of the test is. In this particular case, the test tells me about the first group, which happens to be the calculator group, and the second group, which is everyone. The difference is 33 versus 31. There's a small difference in the averages. Now, the other important thing is to look at the p-value, and there are two p-values. Let me start with the two-tail one and then I'll tell you what the difference is. So this one says that there is a 0.00014 p-value. Now, what the t-test does is tell you whether one particular sample or population is different from another sample or population. The base hypothesis is that they are the same. What this is saying is that, if they really were the same, there's only a 0.00014 chance of seeing a difference like this. Basically, they're different. And statisticians will word this as "the null hypothesis has been rejected." What that basically means is that your original statement, that they are the same, is not true. In other words, they are different.
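The equal-variance test described above can be sketched in a few lines. This is a minimal illustration, not the spreadsheet tool's implementation: the function name `pooled_t_test` and the sample marks are hypothetical, and the p-values use a normal approximation to the t distribution, which is an assumption that is only reasonable for large samples such as the thousands of students in the talk.

```python
import math
from statistics import NormalDist, mean, variance


def pooled_t_test(group_a, group_b):
    """Two-sample t-test assuming equal variances.

    Returns (t, p_one_tail, p_two_tail). ASSUMPTION: p-values come from a
    normal approximation to the t distribution (fine for large samples only).
    """
    na, nb = len(group_a), len(group_b)
    # Pooled variance: the two sample variances weighted by degrees of freedom.
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    p_one_tail = 1 - NormalDist().cdf(abs(t))
    return t, p_one_tail, 2 * p_one_tail


# Hypothetical marks, invented for illustration (NOT the survey data).
# As in the talk, the first group is compared against everyone.
calculator_users = [35.0, 31.0, 38.0, 29.0, 36.0, 33.0]
everyone = [35.0, 31.0, 38.0, 29.0, 36.0, 33.0, 27.0, 30.0, 25.0, 32.0]

t, p_one, p_two = pooled_t_test(calculator_users, everyone)
print(f"t={t:.3f}  one-tail p={p_one:.4f}  two-tail p={p_two:.4f}")
```

The two p-values differ only by a factor of two, since the two-tail value is simply twice the one-tail value.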
So in other words, it's saying that if these two groups actually were getting the same marks, there's less than a 1% chance we would see a gap like this. In other words, it's very unlikely that the use of the calculator makes no difference. That's pretty decent. Now, when I make that statement, some people may wonder, should we take a one-tailed test or a two-tailed test? Then look, if it makes a difference, then the conclusion is questionable in the first place. One will always be half of the other. And a difference of a factor of two is nothing. We are talking about orders of magnitude. If you get a conclusion that's probably 1% wrong, probably worry about it. If you have a conclusion that's probably 0.1% wrong, then say, OK, I've got a decent conclusion. Now, between 1 and 0.1, there's an order of magnitude difference. So you really shouldn't be worrying about a factor of two. But of course, the t-test suffers from the classic XKCD problem, right? So they do a test on whether jelly beans cause acne, and specifically whether the color of the jelly bean makes a difference. So does the purple jelly bean have an influence? Does the brown jelly bean have an influence? Does the pink jelly bean have an influence? And all of this is done with a p-value cutoff of 0.05, which means: just make sure that there's no more than a 5% chance that you're wrong. So they test all of these, and somewhere along the line, one of these, the green jelly bean, actually has a p-value of less than 0.05. So OK, maybe that causes acne. And of course, the conclusion gets totally distorted: green jelly beans cause acne. Obviously the non-statisticians among you didn't get this joke. But what it really meant was, if you apply something like a t-test, not on one thing or on some, but on dozens and dozens of things, sometimes you will get things wrong simply because you've chosen a certain cutoff. You said 1% is OK.
Now what that means is that if you apply that "1% is OK" 100 times, there's a pretty good chance that one of them will actually be a wrong conclusion. So in this particular technique, we are applying the t-test on a large variety of categories. We're checking for calculator, we're checking for computer, we're checking for x, we're checking for y. So the cutoff that we need to take for the t-test is a lot stricter than otherwise. The last thing is, which one helps more? We saw the result earlier. For calculators, there's about a 3.5 percentage point difference. Whereas for the use of computers, there's only a 0.2% difference. Let's assume that this is statistically significant; we don't know whether it is. What this means is that, more than whether this was significant or not, what matters is that the use of calculators probably has a stronger correlation with maths marks than computers. So this gives me the ability to say that calculators are the stronger influence on marks, at least maths marks, than computers. And this is the basis of the technique. What that leads to is a result that's on grammar.com slash NAS, and you can happily explore it. It says, here are all the factors that make a difference, and if I look at the maths marks, what makes the largest difference is the father's education and the father's occupation, understandably. And computer use. Maharashtra was the data set that we saw. Nationally, there's a huge difference, and there's a big difference between states, which we'll come to in a bit. What helps total marks? Father's education and mother's occupation is a standard pattern. Parents help more than anything else. And we figured out a bunch of other things as well based on this, like what else makes a difference. Turns out that watching TV once a week is roughly the sweet spot. But not for mathematics. Take students who watch TV even once a week, and particularly those who watch excessively.
They just tank in their mathematics marks. Father's education makes a huge difference. But we find that in patriarchal societies like Punjab, the mother has a larger influence. Let me show you what it's like in a matriarchal society; let's take West Bengal. So in West Bengal the influence is primarily the father, even though it's a matriarchal state. And in places like Punjab the highest influencer is primarily the mother, which is kind of counterintuitive in one way. But maybe it is because they are staying at home and they are the primary caregivers for the children, perhaps. Cricket: when we applied it, we found that when you're batting first, what really makes a difference to whether you win or not is which team you're playing against. But if you're bowling first, what makes a larger difference is who's umpiring. And we have a list of the umpires who are very strict and those who are generous. "May I Come In Madam?": turns out that when you want to pick costumes for Sanjana, it makes no difference what costume you pick; they're all equally good. But she is a disaster in jeans when it comes to TVR ratings. I haven't seen the serial; I don't know what she looks like in jeans. But on the other hand, night song sequences work really well for her. See, the thing is, all of this is based on a relatively simple technique with automation behind it, and a certain amount of data cleansing is what makes up the bulk of the problem. That's the repository: Grammar.com. It is extremely well documented.
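The overall technique described in this talk (run the t-test for every category of every factor against everyone, use a stricter cutoff because many tests are run, then rank what survives by the size of the difference) can be sketched as follows. This is a rough illustration under stated assumptions and not the library's actual code: the names `explain` and `p_two_tail` are hypothetical, the stricter cutoff is implemented as a Bonferroni correction (the talk only says the cutoff must be stricter), the p-value uses a normal approximation that suits large samples only, and the demo data is generated purely for the example.

```python
import math
from statistics import NormalDist, mean, variance


def p_two_tail(group, everyone):
    """Two-tail p-value, equal-variance test, normal approximation (assumption)."""
    na, nb = len(group), len(everyone)
    pooled = ((na - 1) * variance(group) + (nb - 1) * variance(everyone)) / (na + nb - 2)
    if pooled == 0:
        return 1.0  # no variation at all, so nothing to distinguish
    t = (mean(group) - mean(everyone)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return 2 * (1 - NormalDist().cdf(abs(t)))


def explain(rows, target, factors, alpha=0.01):
    """Return (factor, value, difference, p) tuples, largest |difference| first.

    rows: list of dicts; target: numeric column; factors: categorical columns.
    """
    everyone = [row[target] for row in rows]
    overall = mean(everyone)
    candidates = []
    for factor in factors:
        by_value = {}
        for row in rows:
            by_value.setdefault(row[factor], []).append(row[target])
        for value, marks in by_value.items():
            if len(marks) >= 2:  # variance needs at least two points
                candidates.append((factor, value, marks))
    # Bonferroni correction: divide the cutoff by the number of tests run.
    cutoff = alpha / max(len(candidates), 1)
    results = []
    for factor, value, marks in candidates:
        p = p_two_tail(marks, everyone)
        if p < cutoff:
            results.append((factor, value, mean(marks) - overall, p))
    results.sort(key=lambda item: abs(item[2]), reverse=True)
    return results


# Hypothetical data: calculator use shifts marks by 10, computer use does not.
rows = [
    {
        "calculator": "Yes" if i % 2 == 0 else "No",
        "computer": "Yes" if (i // 2) % 2 == 0 else "No",
        "maths": (40.0 if i % 2 == 0 else 30.0) + (i % 5) - 2,
    }
    for i in range(200)
]

results = explain(rows, "maths", ["calculator", "computer"])
for factor, value, difference, p in results:
    print(f"{factor}={value}: {difference:+.1f} vs everyone (p={p:.2g})")
```

In this made-up data only the calculator groups pass the stricter cutoff, so the computer factor drops out of the ranking entirely.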