So I'm Rajesh Kasturirangan, so that's me, and that's Anand over here. Welcome, everybody, to this Thinking with Data course. It's a first for NIAS in several ways. If you've never been to the National Institute of Advanced Studies, well, you are now at the National Institute of Advanced Studies. We don't do anything here that is elementary or intermediate; we only do advanced studies here, right? So if you don't want to do advanced studies, I suggest you get up and leave right away. But if you are gung-ho about advanced studying, then this is the place to be. Our course on Thinking with Data is a first for NIAS in several ways. First of all, I think it's the first time we are doing something that's open to the public, not as a one-time event but as an entire course. So those of you who have registered for the course, well, you're part of this experiment, and thanks for participating. Those of you who are from NIAS, I hope that having so many people brings a new dynamic to the way courses are run. The second thing we are doing: Anand here, who works for a company called HasGeek, is going to be videotaping these sessions, and we're going to put them online. There are many people across the country who are interested in this course, so there will be participation in the course from people who are not in the room right now. So that's the second thing. What I am going to do today, primarily, and we'll get started with that in a minute, is lay out the framework of the course as a whole. Many of you have probably not been in school for a while. So how many of you have not taken a course in the last, let's say, three years? A course, mind you, meaning you have not sat in a room every week for, say, ten weeks? Oh yes, so that's most of the people here. As a result, you know, at NIAS that's what we do: we are an academic institution, so we run courses, which means there are a few cardinal rules that we always follow.
One of which everybody in this room followed today, because we said we would start at 10.30 and everybody was here by 10.30. It's a course that starts at 10.30 every Monday. So let me start with the logistics. It will not be in this room from next week onward; it will be in the main lecture hall in that building, which is actually a much better room for instruction, but unfortunately today it was double-booked, so we couldn't get it. So that's where it's going to be. It will be two hours of lecture-cum-discussion. So if you are not used to sitting still for two hours, take your Ritalin or whatever drug you need and make sure that you can sit in one spot for two hours. The primary way the course is going to be organized is lectures and discussions here, and projects offline and online. The projects are really the main thing in the course; the course is really an excuse for doing some interesting project work. And we have a very diverse range of backgrounds here. Many of you write code for a living, and many of you write words for a living. Hopefully the people who write code will talk to the people who write words and do something interesting in combination. And some of you probably even draw pictures for a living. So pictures, videos, words, broadly speaking. Every one of you here is going to be assigned to a project. We still haven't decided on the final projects, partly because we didn't know how many people were going to show up; roughly speaking there are, let's say, 30 people in this room, and some of you might be coming in and out. So we have a number of live project ideas to choose from. And the way we want to do it is to mix people of different backgrounds, so that people who have interesting questions get combined with people who have interesting ways of addressing those questions. So the projects are the real test of learning.
It will happen as much online as in this room, because once you leave the room you can of course go back and check the lectures. We will potentially be hosting this class in another location, at the Centre for Internet and Society at the other end of Bangalore, for those who cannot make it during the day. I hope all of you who have signed up are going to come here on Monday mornings, but if you know people who are interested and just cannot come on a Monday morning, we are hoping that it will be hosted on a weekend or on a weekday evening at the Centre for Internet and Society. So that's the other thing, which means there will be people who want to participate in the course but are not coming here, and so some of the online discussions will be important. So the first thing you need to do, if you haven't done so already, is not to register for the course but to register on the site. Because once the project groups get decided and you're all assigned to a project group, we want that group to be documenting its progress, and the only way we can do that is by doing it online. So please go to the website and register; there's a register link right there on the front page, so you should be able to find it. We will use those registrations to drive the projects. Hopefully by next week you should have a project, and every week from then on we hope that you will submit, as a group, not as an individual, something. As for prerequisites, Anand is going to talk about this later as well. We don't expect too much in the way of prerequisite knowledge, but we do want to warn you that some exposure to Python and HTML will be desirable if you want to get the maximum out of the course. Now, this is not a course for credit, except for the NIAS people who are taking it for credit; everybody else is doing it for fun or some other noble cause.
So really you don't need to do anything, but if you want to get maximum benefit out of the course you do need to know enough Python and HTML to do the assignments. Ideally you are part of a group where there are other people who know some of this stuff, so they will be able to help you out. But we are also going to organize a crash course next weekend at Grammar in Indiranagar; Anand will mention that again. So if you are interested in that crash course, please sign up. I especially urge the NIAS people, since I am almost certain none of you has ever written a line of code in your life; if that is the case, please do sign up for the crash course. The groups will be mixed so that you complement each other's expertise. And let's put it this way: 500 bucks for an intensive workshop in Python and HTML is a deal you are not going to get again. So the perks of this course are going to astonish you as the course goes on. Okay, so what are our goals in doing this course? It's really an experiment in several things. The first, at least from my perspective, is that I think of this as a core course. For people who are not from NIAS, and also the first-year NIAS students: you know that we have two foundation courses at NIAS that everybody is supposed to take. I think of this course as a kind of foundation course, meaning these are skills that everybody should have, and unfortunately, because they are new, or because they are not packaged in the same way elsewhere, you are unlikely to get these skills as a single package. Extracting data, analyzing it, visualizing it, and thinking as you do all of those, is a standard skill. It's like learning how to solve quadratic equations. And I'm sure there are people here who don't know how to solve quadratic equations either, so please add quadratic equations to your list of things you should know.
But in this course we will essentially be teaching you thinking skills that will hopefully be useful whatever you do. And the mark of success is whether you can execute some interesting projects. I'll come to what kinds of projects would be interesting. When I say interesting, I don't mean interesting just to a classroom. I mean projects that, if you do them well, will either be publishable if you're an academic, or will have some value to you as a business person if you're in industry, or, if you're in a civil society group, will allow you to run your campaigns in ways that you cannot otherwise. So interesting is defined by the world outside, not by this seminar room here. The projects combined with the skills are really a way to bootstrap a community. Now, Anand here has been organizing data meetups for a while. Working with big data is now something that lots and lots of people are interested in, but not too many people know how to do it. And I would venture that knowing how to think with data is an even rarer skill, partly because, as I will say in a bit, the nerds and the geeks don't talk to each other. What do I mean by that? In case you want to understand this sophisticated terminology: geeks are people who write code for a living and nerds are people who write words for a living. This is my very, very simple way of distinguishing the two. And broadly, these two don't talk to each other. My guess is that most of you who came from the geek world, until you signed up for this course and came to NIAS, probably did not even know what NIAS was, whereas in the nerd world that wouldn't be uncommon knowledge. And the other way around: we actually don't know what you guys do in Nianagar and Jayanagar and all those places. So the idea is that the people who write code will find interesting problems to work on with the people who write words, and vice versa. This really is where the proof of the pudding is going to be.
So let me start with examples of the kind of thing I mean by thinking with data; they share certain characteristics. Let me start with an election campaign. Suppose you are a politician, or a pollster, or a campaign manager, and you want to run a good election campaign. Incidentally, if you are working for an NGO that wants to run a campaign, the same kind of ideas apply. How would you do that, and how would you use data to do it? To give an example: my friend Ashwin ran for the Graduates' constituency in Bangalore. Now, Bangalore has 1.1 million graduates, so the constituency in principle had 1.1 million people in it. It turns out that only 23,000 of those voted; look at that percentage. And the winner won by 400 votes. So 400 out of 1.1 million was the deciding margin, which is about 0.04 percent, slightly less than that actually. A very minuscule percentage of people decided the winner. And incidentally, it turns out, in fact I think there is a lawsuit coming, that the person who came second can convincingly argue about the votes that were disqualified because somebody, instead of just ticking the right candidate's number, ticked the name and then put a circle around it, or put a circle around the whole candidate's name. If you take these, what you might call completely certain but disqualified votes, people whose intention was absolutely clear but who just did something that got the ballot disqualified under the rules, they number something like 750, and most of them went to the person who came second. So it actually turns out that the difference between the winner and the loser can be covered by just these disqualified votes. And if you remember, there was a US election, one which has increasingly decided the way the world has run for the last 10 years, that was also decided in the same way.
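To make the Bangalore arithmetic concrete, here is a quick sketch in Python. The figures are the ones quoted above in the lecture (1.1 million electors, 23,000 votes cast, a 400-vote margin) and are used as stated, not independently verified:

```python
# Figures as quoted in the lecture -- illustrative, not verified.
electorate = 1_100_000   # registered graduate voters in Bangalore
votes_cast = 23_000      # ballots actually cast
margin     = 400         # winner's lead over the runner-up

turnout_pct = 100 * votes_cast / electorate
margin_of_electorate_pct = 100 * margin / electorate
margin_of_votes_pct = 100 * margin / votes_cast

print(f"turnout: {turnout_pct:.1f}% of the electorate")
print(f"margin:  {margin_of_electorate_pct:.3f}% of the electorate")
print(f"margin:  {margin_of_votes_pct:.1f}% of the votes actually cast")
```

Even among the votes actually cast, the margin is under two percent, which is exactly why a handful of disqualified ballots can cover the difference.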
A margin of some 500 votes in Florida was the difference between Bush and Gore. What this tells you is that if you want to run a good campaign, in a winner-takes-all kind of system where the winner gets everything and the loser gets nothing, and there's such a small difference between the winner and the loser, you need to run a very, very sophisticated campaign so that you get every single vote that you can get. And I think the Obama-Romney election, for example, will be decided in exactly that way. In fact, people are already saying that there are about 400,000 voters in a few districts, in perhaps Ohio, Florida, Colorado and a couple of other states, who are going to be the difference between Obama winning and Romney winning. Again, out of a possible electorate of about 160 million people. So you can't just look at the data and ask how to run a campaign to make your candidate win; without prior hypotheses about how to understand the data, you're not going to be able to pick out those 400,000 people, or those 400 people, who are the difference between winning and losing. So how does one run a campaign in a way that reaches exactly the people you need to reach? I'm sure it's going to happen, sooner or later, that you'll see Facebook ads saying vote for so-and-so. And if they're smart, what you see when they say vote for so-and-so will be different from what the next guy sees. That kind of smart analytics is not too far off, but if you don't know how to do it, or you think that it's magical, then you're going to suffer. So if nothing else, we need to know how this works so that we are not being propagandized. Now, the second thing I want to talk about is not quite from the geek world, but it's at least something a lot of you might easily relate to: the microbiome. So how many people in this room have heard the term microbiome, and haven't heard it from me?
So the microbiome is the bacteria that live in your gut; it's as simple as that. And it turns out that there are 10 times as many bacteria in your gut as there are cells in your entire body. So in that sense, you are, justifiably, just the house in which those bacteria live. As far as they're concerned, you are not a person but this entity which houses them. Now, if you have 10 times as many bacteria in your gut as you have cells, you can imagine that they interact in very, very interesting ways with you. It also turns out that each of us has a unique microbiomic signature. Every single person has a unique microbiome, which is amazing; you should read the piece in The Economist, in a recent issue, if you haven't already. Even identical twins may have differentiated microbiomes, and that might come down to differences in, let's say, nutrient intake. So it may turn out that people who are thin are different from people who are not, not because of their genes as such, but because of how their genes interact with their microbiome. This kind of personalized medicine, of actually analyzing the microbiome, is going to become huge. The Human Microbiome Project released its data about a month ago, and it would be fantastic to do a large-scale microbiome project in India, for example. You can imagine the kind of stuff that lives in our guts; the microbiome diversity of India, if anything, will rival the human diversity of India. And you can imagine the kind of interventions in public health that you could do in a country like India if you understood this microbiome well. That can only happen if you pair people who know how to diagnose diseases and carry out medical interventions with people who know how to analyze data. Ideally it would be the same person, but right now your average doctor, who writes that scribble on a piece of paper, wouldn't know a standard deviation from a variance.
Correct me if I'm wrong, but typically they wouldn't. So this kind of skill, I'm saying, should be standard; it should be something that all of us know, and once we do, we'll be able to do our jobs better. So addressing these questions requires what I call thinking with data. It's not just data, it's not just analysis, it's thinking with data. What do I mean by that? There's a kind of bottom-up way of understanding data, and by bottom-up I mean: collect the data, run some statistics on it, show some significance, and then call it quits. This is what a lot of us do, even in the academic world. Say we are collecting data on, let's make something up, say tuberculosis, and we correlate it with caste, to take two things. Now, you may collect a lot of statistics about who has TB in India and what caste they belong to, but typically the kind of analysis that happens is of a bottom-up kind, which is to say, correlations between these two categories, TB and caste. Of course there will be something significant there; you can be pretty much sure that lower-caste people in India are more likely to get TB than upper-caste people. But unless you talk to a sociologist or an anthropologist, you won't really understand why. I mean, we have some intuitions as to why upper-caste people are less likely to get TB, but to do a genuine public health analysis you would need to do something more than what the data in front of you tells you. You need to embed that data in a broader understanding of how society works, and that is not what comes from machine learning. At some point in the future there might be a machine that crunches political science and anthropology and sociology and just figures all of this out, but until then, it's people who really understand how the world works, experientially, who will have to talk to the people who know how machine learning works. So the data is actually much more interesting if you can tell a story.
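The bottom-up workflow just described, collect, correlate, show significance, stop, fits in a few lines of code. The numbers below are invented purely for illustration (a hypothetical deprivation index paired with hypothetical TB rates), and the point is what the code cannot do: the coefficient says nothing about why the association exists.

```python
# A minimal sketch of the "bottom-up" workflow. The numbers are
# invented, illustrative data only -- NOT real public-health figures.
from statistics import mean, stdev

deprivation = [1, 2, 3, 4, 5, 6, 7, 8]                   # hypothetical deprivation index
tb_rate     = [2.1, 2.4, 3.0, 3.2, 4.1, 4.0, 5.2, 5.5]   # hypothetical TB cases per 1,000

def pearson_r(xs, ys):
    """Plain sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

r = pearson_r(deprivation, tb_rate)
print(f"r = {r:.2f}")
# A strong correlation -- but r says nothing about WHY: water access,
# housing, healthcare. That explanation has to come from outside the data.
```

This is exactly the "call it quits" point; everything after it, the story, is the part machine learning does not yet supply.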
So again, if you take caste and TB, you can tell some very interesting stories about the relationship between caste discrimination and public health. One very obvious kind of story is about the access to water that somebody from an upper-caste background has versus somebody from a lower-caste background. Now, that kind of story, about how access to clean water determines public health, is pretty clear, but you would need to plot carefully, across India, what kind of water people have access to and what the caste distribution of that access is, and that would give you a much better story about how TB and caste are related. This is why some data points are more relevant than others, and we need to figure out where that relevance comes from. Where do you learn what the more valuable data is? You can't get it from the data itself right now; there are learning algorithms that are beginning to extract that from the data itself, but for now, you need to be able to say why some data is more relevant than other data. You need to be able to explain the significance of a finding. You need to be able to convince me that this particular data point is actually far more important than some other data point. Now let me give you an example that we all sort of take for granted. You know that the earth goes around the sun rather than the sun going around the earth. Yet every single day you get one data point which suggests that the sun goes around the earth: you wake up, and it looks like the sun is moving and the earth is still. So every single day we are collecting data that says the sun goes around the earth, and yet at least our current science says it's the other way around; it's the earth that goes around the sun. So how do you ever show that?
What is the reason you believe that the earth-goes-around-the-sun data, which was painstakingly collected, by Copernicus and others, by looking at the epicycles and the retrograde motion of the planets, is more relevant than your everyday experience? It suggests that there is some theory in the background that tells you this data is more important than that data. And that, you could argue, is what Sherlock Holmes taught us. There's a very famous quote in one of the Sherlock Holmes stories, I don't know which one it was, maybe somebody here knows: when you have eliminated the impossible, whatever remains, however improbable, must be the truth. Anand is probably online, so he can Google it. That, really, is what I would call the data detective's method: you need to know what kind of data is relevant and what isn't. Incidentally, if you remember the beginning of Sherlock Holmes, Watson makes this list of things that Holmes is good at. Holmes doesn't care whether the earth goes around the sun or the sun goes around the earth; he has no interest in broader culture, but he is extremely good at identifying where each soil fragment comes from. That kind of real eye for detail, knowing which things matter to your particular form of expertise, is, I think, one of the great skills of a good data scientist. So, do people actually think with data? This is where my work as a cognitive scientist becomes interesting. We would all like to believe that we are rational, smart, data-driven, analytic people. But actually, people are not. A very interesting experiment by Lera Boroditsky, who is at Stanford, showed the following. Suppose I take a group of people who are given a story, a description, which has either some statistics in it, or a metaphorical description, or both. An example would be: crime in L.A. grew by 17% last year, that's the statistical fact, versus: there was a crime wave in L.A. last year.
It turns out, and they did a very interesting control experiment, that people who are given both the data and the metaphor perform much more like the people who are given only the metaphor than like people who are given only the data, except that they all say they used the statistics. So they think they are thinking with statistics, but they are actually thinking using the metaphor. That goes back to the previous point about telling the story: metaphors help you tell a story much better than raw numbers do. And therefore, even if you want to be data-driven, what you do with the data, and how you represent it, visually or numerically or in whatever other modality, will actually matter far more to the end user of that data than the raw numbers will. My own bet, therefore, would be on diagrams. If somebody draws a diagram, two axes and a curve showing how crime is going, say an S-shaped curve, that would be far more convincing; the numbers on it would not be convincing, but the shape of the curve would be far more convincing than the numbers alone. So what I'm saying is that thinking with data in a way that engages all the sensory modalities we have, visualization of course being the most important, but other modalities too, is actually not a bad idea. There's a sense in which stories told with data have as much to learn from storytelling of other kinds as from statistical analysis. [A participant:] As you're talking about this, my mind is going towards Kahneman and the use of heuristics in decision making; is this all tied to that? Yes: if you're thinking with data, you're using data as a tool for your thought rather than as the end. Unless you're a statistician, data is something that you use for other purposes rather than something you just crunch.
You should be able to synthesize your findings and deliver the story. In that sense, thinking with data is not about artificial intelligence, and it's not really about machine learning, as much as it is about how humans can do the things they are already doing, but with data, in a better way. For example, if you want to persuade somebody, if you have a business that wants to say buy my product over the next guy's, you are actually communicating a story. So how do you do that better with data? The same goes if you're a public official, or a scientist, any of those things. If you look at diagrams in science journals, I often have a very hard time understanding the figures because they're so opaque. I would say that a huge revolution in science could happen just by forcing everybody to read Edward Tufte's books three times. All of this rests on an ideal model, one which has actually been severely questioned by Kahneman, Tversky and others, the model of the rational agent. A rational agent is somebody who thinks with data by extracting as much information as possible from the data and communicating it to the world. So the idea is: there is data, you can extract information, and you can convey what you extract as succinctly as possible, and the rational agent is someone who maximizes both of those, who maximizes both the extraction and the communication, with minimal resources. Broadly speaking, any kind of information designer would try to design systems that help you extract as much as possible and communicate as much of that extracted information in as efficient a manner as possible. Now, people may not be perfectly rational; again, this Lera Boroditsky work that I mentioned, and the work by Kahneman and Tversky on heuristics and biases, have shown that we are actually not rational beings, but nevertheless we can aspire to be.
Rationality is a desirable goal even if we are not actually rational, and you could argue that, because data collected accurately, without error, doesn't lie, you would be able to predict the world better than the guy who is making up stories that are not grounded in the data. Just because other people are irrational doesn't mean that we should be. So, for example (when I say RA, I mean rational agent), suppose you are a government investigator. Here is something that would be a very, very important public health intervention in India. We all know that there are all kinds of fake drugs in India; you want to know where those fake drugs come from. Or, not in the fake-drug domain: we know that in India doctors often over-prescribe antibiotics. So if you are finding that there are new TB bacilli that are more and more resistant to drugs, you want to find out where the over-prescription of antibiotics is coming from, so that you can do something about it. How do you use data to figure out where in India the source of these spurious drugs, or of the antibiotic over-prescription, is? Now, similarly, at the other end, imagine a kirana store. Just walk up and down the Mathikere road over here and you will see lots and lots of small stores. They all have to make a living, and their living depends on having in stock the things that sell, and as few as possible of the things that don't, except that of course you don't want to be completely out of stock of things that are only rarely sold. So how do you manage your inventory, depending on the data you are getting from the world about interest in those products?
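The kirana-store idea can be sketched as a toy reorder-point rule driven by nothing but the store's own sales log. Every number here is hypothetical (the sales figures, the two-day supplier lead time, the safety buffer), and real inventory policies are much richer; this is just the shape of the calculation:

```python
# Toy reorder-point rule for one product in a kirana store.
# All numbers are hypothetical, purely for illustration.
daily_sales = [4, 6, 5, 3, 7, 5, 4]   # units sold on each of the last 7 days
lead_time_days = 2                    # assumed days for the supplier to deliver
safety_stock = 3                      # assumed buffer against a bad estimate

avg_daily_demand = sum(daily_sales) / len(daily_sales)
reorder_point = avg_daily_demand * lead_time_days + safety_stock

current_stock = 10
if current_stock <= reorder_point:
    print(f"reorder now: stock {current_stock} <= reorder point {reorder_point:.1f}")
else:
    print("stock is fine for now")
```

The point is that even this crude rule uses only data the store owner already has, which is exactly the impoverished-information setting described above.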
Say you are a stationery store: you have books of various thicknesses, some hardbound, some paperback. On the one hand, you want people to buy the most expensive hardback book you have, but most people don't want that. So how do you market, how do you structure your store? Somebody comes in, they see your store: how do you lay it out so that people buy the most expensive thing in your stock, and yet also get what they want if they know exactly what they want? Now, this is a kirana store; you might think they don't want to spend much time thinking about layout design or data collection, but if anything, their business model would benefit far more from doing so than a large chain like Food World or Reliance TimeOut, where they have people on the payroll doing exactly that. So how do you make data so transparent that a kirana store owner can also use it to maximize their revenue? That's the small business version. And note how imperfect your knowledge of the world is, and how little time you might have to make sense of it. Take again that kirana store owner: he is only getting input from the customers in front of him. Unlike Reliance TimeOut, he doesn't have access to data from across the country, or even from his neighbours telling him what's selling and what's not. So you have to make decisions on the basis of extremely impoverished information, and you want to do it in such a way that you can react to the world immediately rather than many years in the future; in the long run, as they say, we are all dead. So you have to make the invisible visible. This is what mathematics is supposed to do: mathematics, numbers, quantification make the invisible visible. So you want to be able to measure the world, analyze it and visualize it, all while being a rational agent, and the end product of the visualization should be something that makes
transparent what was completely inaccessible before. So, for example, it may turn out that if you take three or four products that are in stock in a kirana store, there is a clear pattern in the rate at which they are being consumed, and therefore the rate at which you need to order them. So, thinking with data: if you take these three things as the goals, you have to think. I am, again, coming at this as a theorist, where the thinking comes first and the data comes last, and I might want to flip that a little bit, but we will play good cop and bad cop. If you think first, you extract data on the basis of the hypotheses that you derive, and when you combine the thinking and the hypotheses you have a reasonable chance of success. Incidentally, this is what happens in what's called Bayesian reasoning: you have a hypothesis, say that the sun goes around the earth, and another hypothesis that says the earth goes around the sun; then you go test the world in some way, and you combine the outcome of that test with the hypotheses you already have, and say, ah, it's the earth that goes around the sun. So that's what people think an ideal researcher should be doing. So, what is thinking?
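Before going on, the Bayesian reasoning just described can be written out in a few lines. The priors and likelihoods below are invented purely for illustration; what matters is the shape of the update, posterior proportional to prior times likelihood:

```python
# Bayes' rule on the lecture's two-hypothesis example.
# All probabilities here are made up for illustration.
priors = {"earth_goes_round_sun": 0.5, "sun_goes_round_earth": 0.5}

# P(observing retrograde motion of the planets | hypothesis) -- assumed values:
likelihood = {"earth_goes_round_sun": 0.9, "sun_goes_round_earth": 0.1}

# posterior is proportional to prior * likelihood, then normalised
unnormalised = {h: priors[h] * likelihood[h] for h in priors}
total = sum(unnormalised.values())
posterior = {h: p / total for h, p in unnormalised.items()}

for h, p in posterior.items():
    print(f"P({h} | data) = {p:.2f}")
```

Starting from even odds, one observation that fits heliocentrism far better than geocentrism shifts the belief strongly; that combining of a prior hypothesis with the outcome of a test is the whole mechanism.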
Now, there are many, many definitions, and if you are a professional student of concept formation and reasoning and so on, you have many different ideas of what it is to think. But for us, it's just the capacity to form beliefs, to reason about those beliefs once you have formed them, and to test those beliefs against the world. That, broadly speaking, is what I consider thinking: something that starts with a belief and ends with a hypothesis that you can test against the world. Okay, so suppose you leave your house in the morning. This is the monsoon season; you leave your house and you see water on the pavement in front of your house. Was it rain, or a water main that burst, or the gardener spraying water all over? Which one of those is it? Thinking is about forming one of these beliefs. One belief would be: it rained last night. Another belief would be: a water main burst last night. A third hypothesis would be: the gardener was watering the plants ten minutes ago. These beliefs are the first part: we have the capacity to form beliefs; then you reason about those beliefs; and then, finally, you test those beliefs against the world. So how did that happen? Now, it's August; we all know it's been raining this year, though it wasn't raining until recently. So you see water in front of your house and you form the belief that it must have rained last night; you even, potentially, heard some thunder. But you did not form this belief on the basis of explicit reasoning. Most of us, when we leave the house with an umbrella, are not explicitly reasoning in our heads, saying, I saw lightning and therefore I must leave with an umbrella. No, it's subconscious: you hear thunder, you pick up the umbrella. So this first step is not explicitly rational, and in order to do data science, we need first to convert these subconscious intuitions into explicit beliefs that we can reason with. How would you reason
explicitly? The first kind of check in your reasoning system would be: is it plausible to assume that it rained last night? Incidentally, this is what Indian logical systems do. The famous example in Indian philosophical logic is: you see some smoke and you conclude that there is fire; where there is smoke, there is fire. How do you do that? You say that wherever you have seen smoke there was fire, as for example in the chulha, and therefore in this case, because there is smoke, there must be fire. That's an example of reasoning with data. There is a middle term, which is the data: whenever I cooked chapatis or whatever in my house and there was smoke, it was caused by fire. Thinking with data is that, souped up; that is really what we are talking about. So we are saying: it's August, it's cloudy, there's water on the ground, and therefore it must have rained. Now, if you are a rational agent, you again have to test that belief. For example, there might be water in front of your house but no water in front of any other house. You have a tacit understanding that if the rain fell from the sky, it would not have scattered rainfall so non-uniformly that there is water in front of your house and not in front of the next one. Think about it this way: tacitly, there is an understanding of the probability distribution of water on the ground given that there has been rainfall. And there is no smell of rain (you all know there is a distinct smell of rain when it first hits the ground), so you conclude that maybe rain is not the correct belief: even though it is August and it could be rain, it actually is not rain. Then you go around and you notice water gurgling next to a drain. No other drain is gurgling, and therefore you conclude that it must be your drain that is clogged, and that is what led to the water in front of your house. Most of us are really not that rational, but this is how it should actually go. In fact, even this need not be true: because it's the rainy season and there is a lot of water, it's much more likely now that a drain gets clogged from excessive water than at other times of the year. So the real cause could be a combination of both rainfall and blockage, which is to say that this particular drain got blocked because of excessive pressure on that particular system, while other drains were able to handle the rainfall that came about. Incidentally, this is how medical reasoning goes. If you are trying to diagnose what disease somebody has, this is exactly how you would go about it. Suppose you want to find out why somebody has a lot of pain in their left arm: is it due to a heart condition, a pinched nerve, or a muscle strain somewhere else in the body? All of these are plausible hypotheses. In fact the real world is always like this: conditions don't come to you cleanly labelled, because we are complex systems, both our bodies and the world outside. Symptoms are clustered in ways that make it hard to find the actual cause of the state of affairs in the world, and therefore one should be very careful about jumping to conclusions. That's where big data actually helps a lot, because what big data lets you do is repeatedly test your hypotheses against the world, as opposed to doing it once. If all of this sounds familiar, it's nothing but the scientific method. That's what science is supposed to be: a scientist is somebody who forms hypotheses, goes and tests those hypotheses, and then tells you whether each hypothesis is valid or not. So what I'm trying to tell everybody in this room, and everybody else, is that it will actually benefit you a great
deal if you just understand how the scientific method works. All this thinking with data is nothing more than the scientific method, just done more carefully and thoroughly than most scientists normally do. The scientific method, in some sense, was created with small data in mind. If you look at how many data points Copernicus had before he concluded that the earth goes around the sun, my guess is that it would not pass a current journal's statistical significance test. That doesn't mean it wasn't a fantastic piece of science; it just tells you that our standards for how much data is actually relevant have changed. So can we scale the small-data mentality of science? From my perspective, what is really fascinating about the big data business is that you now have the capacity, if nothing else, to test your hypotheses against much larger data sets than you ever could before, and that's as true for a businessman or a public official or anybody else as it is for a scientist. What we need are techniques not just to cut the data but to understand it. Another way of putting it is: what's the science in data science? Is data science just a fancy name for statistics? Well, that's actually not such a bad thing. Statistics is a fascinating subject, and it arose precisely to quantify and understand large data sets. Look at insurance: mathematicians discovered the normal distribution, and that drives insurance policy. You could argue that what's happening in the big data world is that we are finding out the old normal distribution, the bell curve so to speak, doesn't work anymore; most real-world data sets don't fit a bell curve, and therefore if you want to reason accurately about large data sets you need other kinds of statistics than normal statistics. But broadly speaking there are two kinds of statistics. One is what's called frequentist: you count
the number of times something happens. You toss a coin 100 times and find that it came up heads 47 times and tails 53 times; you keep making more and more observations, and as the number of observations grows larger, the coin more or less settles to 49 heads and 51 tails. So you conclude it's a biased coin, with a probability of heads of 0.49 and a probability of tails of 0.51; it's not an ideal coin. That's how one derives the distribution. A Bayesian doesn't do it that way. They don't get their distributions from the world; they start with an assumption. For example, you can assume the coin is fair, so you start with the assumption that it's 50-50; you make 100 observations and again find 47 and 53, and then you recalibrate on that basis: what is the most likely distribution that could have generated 47 heads and 53 tails? You keep doing that again and again until you settle on something. Ideally, in the long run, the two should converge: if the coin is indeed biased, your beliefs about how the coin works should converge to how the coin actually works. But in the real world you never get close to a large number of observations; you have to make your decisions now, and that's the hard problem. Big data actually shifts the debate away from large-n decision making towards very rapid decision making. So if you want to do science with statistics, all kinds of sciences with statistics, then deriving your science from statistical rather than deterministic assumptions becomes a very important kind of science to do. In practice the social science disciplines are like that, and so is biology. Unlike physics, we are not based on deterministic assumptions. We don't have strict rules that say people only go from Delhi to Mysore; there will be some distribution of people who go from Delhi to Mysore, and you don't throw away your theory that Mysore is
say the center of yoga in India because some people go to learn yoga in Indore rather than Mysore. So the kind of science we do is often driven by statistics rather than deterministic assumptions, but the more we do and the more statistics we collect, I think there are different kinds of science we can do now. To give you a brief idea: Newton invented the calculus to solve this problem. How do you explain the fact that the earth goes around the sun in such a way (these are Kepler's laws, incidentally) that in a given period of time, say one month, independent of where the earth is or where some other planet is, it always sweeps out the same area in the same period of time? Whether it is January to February or November to December, it is still the same area: this is what Kepler discovered in his laws. Newton was trying to figure out how to explain that; he of course invented the law of gravitation, he invented calculus, and he put the two together and showed how one can derive Kepler's laws. We are essentially trying to do something similar here. We have huge data sets and we want to explain quantitative as well as qualitative behavior on the basis of those data sets, but we don't quite have the calculus that helps you convert the data into insights. We are still groping in the dark as to what that calculus is, and the reason, partly, is that thinking with big data isn't like thinking with small data sets. If you take very large data sets that are dynamic, say Google's servers processing huge amounts of data every second, or Airtel processing I don't know how many millions of calls per second, it is very different from the Copernican model of data collection, where you stare at the same thing every year, and you can go back and it's always going to be pretty much the same.
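The equal-area law mentioned above can be stated compactly. A standard textbook sketch (this derivation is not from the lecture) runs through conservation of angular momentum:

```latex
% Angular momentum of a planet of mass m at position \vec{r}:
%   \vec{L} = m\,\vec{r} \times \dot{\vec{r}}.
% Gravity is a central force, \vec{F} parallel to \vec{r}, so the torque vanishes:
\frac{d\vec{L}}{dt} = \vec{r} \times \vec{F} = \vec{0}
% Hence L = m r^2 \dot{\theta} is constant, and the area swept out per unit time is
\frac{dA}{dt} = \tfrac{1}{2}\, r^2 \dot{\theta} = \frac{|\vec{L}|}{2m} = \text{constant},
% i.e. equal areas in equal times, wherever the planet is in its orbit.
```

This is exactly the pattern described above: a law (gravitation) plus a calculus (differentiation of the swept area) together yield Kepler's observed regularity.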
Where the moon was on January 27th of 2001 is a lot like where the moon is going to be on January 27th of 2002, but in these very large dynamic data sets that kind of regularity is simply not there. So if you think that data science is about extracting the patterns that underlie the data generation, we need a much more sophisticated idea of what patterns are. What I mean is this: take a flashlight. This is the famous Birbal story. Remember, Birbal was asked by Akbar to find the ten stupidest people in Agra and come back the next day. So Birbal is walking around, and under a lamp somewhere a man is frantically looking for something. Birbal asks, "What are you looking for?" "Oh, my wife gave me this very valuable gold ring, and I lost it and I can't find it anymore." So Birbal helps him for about an hour, and finally says, "I can't find it. Where did you lose it?" And the man says, "Oh, over there." "Then why are we looking for it over here?" "Well, at least there's some light here, so there's a chance we might find it; over there it's impossible." A lot of data science is like that: you're shining flashlights in the dark and hoping you'll hit something interesting. But that's not the way our minds and brains work. When you think about it, every single time you open your eyes in daylight you see the world; it doesn't appear to you as flashlights shining in the dark. In fact we take it so much for granted that the world is what you get when you open your eyes. Just imagine what your sensory experience of the world would be if it were like sporadic flashlights shining here and there. So our brains and minds are the original big data thinkers. Every single second you're processing billions, even trillions, of photons hitting your retina, then going through your primary visual areas, then being processed further, and so on, all the way
till you get to the higher cognitive areas that say, "this tiger is chasing me," or something like that, and it all has to happen just like that. So our brains are actually geared to process data quickly, efficiently, and hopefully accurately. That kind of dynamic, continuous data, I think, is the future of where big data is going. In a sense that's what excites me, because it tells me that the object of study of my science, which is the mind and the brain, seems to dovetail really well with the evolution of technology as we see it right now. Big data technology will hopefully help us understand how our brains and minds work, and the other way around: a better understanding of how our minds and brains work will also help us design better data analysis techniques. This is what I think of as the AI challenge of the 21st century. Way back in the small data days, people wanted to build AI computers or robots that would think like human beings; in the famous Turing test, the idea is that if you can design a computer that is indistinguishable from a person, it is as intelligent as we are. Imagine what the big data version of the Turing test is going to be: if you can build a computer that processes visual information in a way that is indistinguishable from our visual system, you've cracked what it is to understand how human vision works. And vice versa: if you understand more of how human vision works, you might be able to build better image processing or other kinds of visualization technology. This combination of the two, I think, is already, and will even more be in the future, where the intersection of science and technology lies. So now I'm going to start summarizing today's session. The idea in this class, as I said, is to dress up the scientific method in a way that is compatible with large data sets; that is, how to think with data. Most
scientists actually don't think about the scientific method; they just do the experiments. But what I'm saying is that with big data becoming more and more common, you have to think about the method as much as the domain of study. What do I mean by that? If you're a biologist and you want to study the microbiome, it's not enough to know what species of bacteria are there one by one, because you will never understand the hundred billion bacteria in your gut by studying them one by one. There is something called metagenomics, where the idea is that you sample a little bit of bacteria from somebody's gut, or from pond water or somewhere else, and through very rapid testing you simultaneously identify all the bacterial species in that sample. Instead of doing genome sequencing one organism at a time, you do it simultaneously for the entire sample. That's actually where a lot of this is going to go, and to do it well you need to understand data analytics as much as how microorganisms work. That kind of heavily analytic data processing will, I think, be very crucial in pretty much any kind of inquiry, and it will be just as important, I hope and think, for people who are studying conflict, or caste, or Sanskrit texts, or any of those things. OK, so to summarize: we will have to build something like an additional sense organ. Think of a data analysis machine as giving you an additional sense organ; it's like your sixth sense. We are not anywhere close to that, but it will at least help you. Suppose that were your grand challenge: well, this course will give you the class-one version of it; you still have a very long way to go. This course is therefore about developing the cognitive tools, the thinking tools, to help us understand data. We will develop these ideas in a structured way. From next week on (next week will be Anand), every week we will have a specific topic that we're going to engage
with. The syllabus will be posted online; most of it is already there on the course page, along with the projects and the structure of the course, and hopefully you'll get something interesting out of it. But remember again: people don't think with data. We are actually not data-driven thinkers; we are story-driven thinkers, or metaphor-driven thinkers. So one of the challenges of visualization, communication, and analysis is: how do you make data presentable to people in such a way that they understand it? And that is really going to be our challenge.