Let me start with what Rajesh talked about: the Sherlock Holmes quote, which is from The Sign of Four. It says: when you have eliminated the impossible, whatever remains, however improbable, must be the truth. Isaac Asimov, in one of his Black Widowers stories, had a very interesting counter-viewpoint to this. It's a story about a scientist who says, look, I've got a case: a lady who seems to be able to predict the future. And it's nothing vague — she's able to predict things in detail, down to specifics. The more catastrophic the events are, the further ahead she's able to predict. She's able to predict earthquakes and things like that. And the Black Widowers are a bunch of hardcore rationalists, who question him on what he checked and so on. And it seems impossible to escape the conclusion that this lady has precognition. The story is called The Obvious Factor. The scientist says: I'm left in a position where I must accept what Sherlock Holmes allegedly said — when the impossible has been eliminated, then whatever remains, however improbable, is the truth. In this case, since every other explanation is impossible, precognition must be the truth. Don't you agree? And these guys were stunned into silence, until one of them noticed that Henry, the waiter, was grinning, and asked him to explain. And Henry says: to say that when the impossible has been eliminated, whatever remains, however improbable, is the truth, is to make the assumption, usually unjustified, that everything that is to be considered has been considered. Let us say we consider ten factors. Nine are really impossible. Is the tenth, however improbable, therefore true? What if there is an eleventh factor? Or a twelfth? Or a thirteenth, that we are not considering?
In this particular story, it turns out that the factor they had not considered was that the guy, Eldridge, had been lying all through — he'd just been cooking up a story, making things up as he went along. Now, a lot of what we have been exposed to in analysis is based on the assumption of limited information. You are given a set of information; you perform a set of analyses on it. And therefore we have grown used to making some implicit assumptions about the kind of information we are given. For example, any puzzle you solve, or any exam question you answer, has two characteristics. One, you are guaranteed that there is a solution. In fact, there's an even stronger guarantee: somebody believes you to be capable of solving the problem, given what you've been taught in the recent past. And secondly, you are given, in that question, all the information that you need. That's a really powerful constraint. In fact, there are a number of exam problems that can be solved just using these, and I'll give you an example. In the absence of these guarantees, the same problems might have been a lot tougher to solve, or perhaps even impossible. But with these constraints coming in, things become very easy. One of my friends said, look, the SAT is great. Suppose you had a question: who was the 34th president of the United States? I don't know who it is, but let's assume that the choices are Mickey Mouse, Donald Duck, FDR, and none of the above. There's a good chance that it's FDR, who may or may not have been the 34th president — I don't know. But it's very easy to see that 'none of the above' is very unlikely, and Mickey Mouse and Donald Duck are certainly not presidents of the United States. It's easy to eliminate. And that's the additional set of powerful constraints we have in objective-type exams: you know that there is a solution.
That there is only one solution, and it is among the choices on offer. So you don't really need to actually find the answer to the question — you just need to make sure you've crossed off the wrong answers, and whatever remains, however improbable, must be the truth. No need to worry that there are more possible answers than the ones listed. Incidentally, even this trick for objective-type exams doesn't always work. Here's an example in which a student consistently marked the answer as C, which is normally a decent technique — you'd expect to get about 25%. Except that the professor who graded it had this to say in his review: "Every year I attempt to boost my students' final grades by giving them this simple exam, which consists of 100 true-or-false questions from only three chapters of the material. For the past 20 years that I've taught Intro to Communications 101, I have never once seen someone score below a 65 on this exam. Your score of 0 is the first in history, and it single-handedly brought the entire class average down by 8 points. There were two possible answers: A and B. You chose C for all 100 questions. You were expected to get at least a quarter of the answers right just by luck. It's as if you didn't even look at a single question. Unfortunately, this brings your final grade in this class to an F. See you next year. If all else fails, go with B from now on. B is the new C." This isn't quite how things work in real life. Or rather, I wouldn't say it doesn't work — it works in a number of cases. But there are other ways of looking at data, other ways of analyzing, especially if the problem is open-ended. The thing I'd like to add to, or complement, what Harjit said, is data-led analysis. Sometimes, just by looking at the data, you can get new hypotheses.
So, the 11th factor, the 12th factor that Henry is talking about — can those come out of the data itself? To take an example, we were working with a small restaurant who said, here's our point-of-sale data; let's see what we can do with it. We tossed a bunch of standard analyses that we run on a number of other data sets at this one, and tried to see how the sales of every single item correlate with the sales of every other item. Now, it turns out that there is one product on their menu which is almost like poison, at least from a revenue perspective: when people buy this product, the sales of almost every other product go down. This is not true for any other product — whenever people buy more of something, they tend to buy more of most other things, except for this one product. And that poison is mineral water. Now, in hindsight, that's somewhat intuitively obvious: it fills your stomach. Another example was starters. Starters tend to reduce the sale of the main course — again understandable, you tend to fill up your stomach. Desserts, on the other hand, are positively correlated with every single product. Fantastic result. Now, this is not something where we started with a hypothesis: "can we find out which product reduces the sale of other products, so that it shouldn't be sold?" It's just a data set, and a kind of analysis you've done lots of times. You toss it at the data set — did we get a result? In this case, yes. Did we fail to get results with other kinds of analysis? Absolutely. For every one interesting result you get, you're probably going to have to do ten pieces of analysis which are not that interesting. That's where computing power comes in. You can toss enough hypotheses, enough kinds of analysis, at a data set, and expect to get results much faster than you otherwise would.
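The restaurant analysis described above is essentially a pairwise correlation of item sales. Here is a minimal sketch of that computation; the numbers are made-up toy data, not the restaurant's, purely to show the shape of the analysis.

```python
# Toy point-of-sale data, invented for illustration: each row is a bill,
# each column the quantity of an item on that bill.
import pandas as pd

bills = pd.DataFrame({
    "mineral_water": [1, 0, 0, 1, 0, 1, 0, 0],
    "main_course":   [0, 2, 2, 1, 2, 0, 1, 2],
    "dessert":       [0, 1, 1, 0, 1, 0, 1, 1],
})

# Correlate the sales of every item with the sales of every other item.
corr = bills.corr()

# In this toy data, mineral water drags the others down (negative
# correlations), while desserts move together with everything else.
print(corr.round(2))
```

With a real point-of-sale table, the same two lines of pandas scale to hundreds of items, which is the "toss the analysis at the data set" step.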
The traditional approach — a hypothesis, which you come up with by thinking, and then test using data, leading to insight — now has an alternate workflow: you start with the data, do automated analysis, and that can suggest a hypothesis, which could potentially give you insight. We've planned something of this kind in this course. That's not all we'll be doing, but this angle is something we'll be bringing in as well. Because this course is going to be not just about thinking through analysis, but about doing the analysis — and not just doing it manually. It's about programmatic analysis: you're not doing the analysis, you're getting a program to do the analysis for you. Which is not an easy thing. Partly because it requires a different set of skills; partly because it requires a different way of thinking; and almost certainly because, without doubt, doing the analysis manually will take less time the first time than doing it programmatically. The advantage is the second time, the third time and so on: near-zero effort. You apply the same analysis to a different data set — near-zero effort — and the same analysis can give you different kinds of insights. Let me move back a slide and throw a question to you. How do you do your analysis? What tools do you use? [Audience: R.] For those of you who are not familiar, R is a programming language; it sits almost on the borderline between GUI-based tools like Excel and a full programming language like Python. [Audience: Python.] Python is a programming language. [Audience: SPSS. "Think and feel."] Think and feel — which can probably be augmented quite intensively, as we'll see. [Audience: Google Refine.] Google Refine, which, for those of you who are not aware, is another fairly interesting tool. [Audience: we write our own code for the analysis.] Excellent. It's interesting that the answers fall into two buckets.
In one bucket there is the set of programs that you talked about, and in the other bucket you're talking about thinking and feeling — doing the analysis in your mind. To me these are two very different things, in the sense that both are absolutely essential. You have to think about how to structure the analysis, and in some cases you may be able to complete the analysis in your head. That's what a lot of people have been doing over many centuries of history. And equally, we have a lot of tools which you can use to translate that analysis into something you can delegate to a machine that's far more powerful at computation than we are. What we've learned in all these years is great, and we're going to continue that learning; this course will be tied to that to some extent. What we haven't learned as well is how to use computers to do the kind of thinking that we need to do. SAS or SPSS — excellent, fantastic tools. Writing your own program — even more fantastic. Getting machines to do the kinds of thinking that can be done on data — we'll get there someday. So, like I said, this course involves programmatic analysis. There's very little manual analysis that we'll be doing as part of this course, and the programs will mostly be written in Python. One reason we chose Python is that it's simple to learn. MIT recently shifted its introductory course over from Lisp to Python, partly because it's fairly straightforward and simple to learn for a person with no programming background — and that applies here as well. The other reason is that it's fairly good for analysis; some people here are already using Python for analysis. It's got a number of libraries. So if you want to calculate the average, you don't have to write your own code — it's written. If you want to calculate something more sophisticated, let's say the kurtosis, you don't need to write that yourself. The code is written. And it's fairly well documented.
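To make the "code is already written" point concrete — a minimal sketch using the standard library and pandas (any statistics library would do; the data is an arbitrary example):

```python
# The average and the kurtosis, both using code that is already written.
import statistics
import pandas as pd

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # no hand-written loop needed
kurt = pd.Series(data).kurtosis()  # sample excess kurtosis, also pre-written

print(mean, kurt)
```

The point is not these two functions specifically, but that almost every standard statistic is one library call away.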
There are enough people you can reach out to for help in a forum; you can go out and get that information. So that's part of the reason for Python. The other part of the reason is that Python is headed in a good direction — so whatever you learn will stay useful. Let me just flip this question around. If you had to write a program today, what language would you choose? Just curious to hear from the teams here. What would your language of choice be, if it were not going to be Python? [Audience: Visual Basic.] Visual Basic — why? [Audience: I work in Visual Basic.] You work in Visual Basic, which is a fine reason. The other reason I can think of for using Visual Basic is that it works well with Excel. So if the bulk of your data is in Excel — and a lot of people have Excel — or if you want to share your results in Excel, it's a great language to write in. And if you're familiar with Visual Basic, it's as good as any other language. [Audience: Java, because it works well with Hadoop and that infrastructure.] Java does work better with Hadoop — much better than Python does. Hadoop is a way of doing analysis on multiple computers, as opposed to a single computer. So if you had a problem that was bigger than what one machine can handle, you're better off using this thing called Hadoop, and Java does a better job with Hadoop than Python does. That's another reason. [Audience: Clojure, because it runs on the JVM. Some of the C libraries Python uses, like NumPy, don't run on the JVM implementation of Python. Clojure contains a lot of statistical tools, it runs on the JVM, it's coming up as a new language, and it makes programming almost as easy as Python.] That's Clojure — spelled like "closure", except with a J instead of an S. For those of you who have heard the term Java and are wondering, there are two parts to Java.
There is the Java programming language, and there is the Java Virtual Machine. The Java Virtual Machine is something like an operating system, in that it can run on pretty much any kind of machine — Windows or anything else. Clojure is a language that compiles to the Java Virtual Machine and can therefore run anywhere the JVM runs, including with Hadoop, which is a very good part of it. But it does not require you to use the Java programming language, and it's considered by a number of people to be a more powerful programming language, with a lot of support growing around it. So that gives you an overview — a feel for the kinds of tools and programming languages that exist, that one can play with. We are just going to stick to Python, but don't worry: there are many other alternatives, some of which you may not have heard of, but they exist, and part of what this group can do, for those of you who are not aware of that side of the community, is put you in touch. You could play around with some of these things, potentially build interesting stuff, and either use their help or learn directly from them. Now, I want to gauge your comfort level. You might be in a state where you say, I'm not too comfortable with programming — I've never done any programming, or I did it years ago — but today, if you asked me to sit and write, let's say, a sample program that can compute the standard deviation of a set of numbers, I couldn't comfortably do that. Or you could say, I can program, but not all that well in Python. Or you could say, I am good at Python. Just think about where you stand, keep it in your mind, and actually, before I go ahead, let me do a poll. Let's do the poll now: how many of you would say you're good at Python?
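For reference, the sample task just mentioned — the standard deviation of a set of numbers — looks like this in Python. The input list is an arbitrary example; a from-scratch version is shown alongside the pre-written library call.

```python
# A short program that computes the standard deviation of a set of numbers.
import math
import statistics

def stdev(numbers):
    """Population standard deviation, written from scratch."""
    n = len(numbers)
    mean = sum(numbers) / n
    variance = sum((x - mean) ** 2 for x in numbers) / n
    return math.sqrt(variance)

print(stdev([2, 4, 4, 4, 5, 5, 7, 9]))             # 2.0

# Or, since the code is already written, use the standard library:
print(statistics.pstdev([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```

If you can write something like the first version comfortably, you're past the "not too comfortable" bucket.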
That's five hands. How many of you would say you can program, but not that well in Python? Nine. And how many of you would say you're not too comfortable with programming — and "not too comfortable" is a mild word, "I have no clue about programming" included? Five. So let's go through what you should probably do to get the most out of this course. The five of you who are good at Python: since we're starting off with scraping next class, you might want to try out the lxml library if you haven't already, but otherwise you're fine. For those of you who can program, but not that well in Python, I'd just suggest reading up on Python. Here are a couple of links — these slides will be available online, so you can refer to them there, copy the links now, or just remember to do a search for these. The first link is Google's Python Class, a self-paced introductory course on Python which will not take you more than a few hours to run through, and it will give you a sense of how to create basic data structures, read files and process things. The second is a book called Dive Into Python. Mark Pilgrim published this book online; he had a huge fight with the publisher, who said, you can't just give the book away for free and still have it published — but he fought that battle, and the book is available in the public domain. And then he completely vanished from the online space, so every single online presence of his has been wiped out. But because he put the book in the public domain, there are enough copies and enough translations — what used to be at diveintopython.org is now at diveintopython.net, and there are a number of other sites with translations. It's an excellent book. It will take you a few days at best to complete, but you don't need to complete it — just go through the first five or six chapters and you should be good. So that's for those of you who can program. For those of you who are not too comfortable programming, or have
not programmed so far, I would strongly suggest taking an introduction-to-programming class in Python. For those of you who will be watching this online, or are not going to be based out of Bangalore, here are a couple of links. You can take a course from Udacity — they have CS101, an introduction to computer science that's taught in Python. Coursera has a course that's a basic introduction to programming; it will teach you all the concepts of programming from scratch. These take a number of weeks, so you probably want to start right now. For those of you who are in Bangalore, we are running a course at TechnoTurf, a technical training institute in Indiranagar, near the Metro on CMH Road. The classroom sessions will be done on Friday and Saturday — and if we need a third session, Sunday. The course will be run by the chief here — chief, will you please stand up? The material will be made available online at some point, but this is as much a workshop as a classroom session: you come in, ask questions, sit and work through the course. It's something we are opening up only to the people in this room; if you are interested in taking it, please see me after this class or send an email. The other thing I have to work out is the timing — Friday would be pretty much a full-day session. How many of you would like Friday — to sit and work on Friday? How many of you are comfortable with Saturday? Anyone? And how many of you have a constraint on Saturday morning — who can't make it
on Saturday morning? In that case you'd probably end up having to do three days, because we won't be able to cover everything in, say, six to eight hours. Who is okay with Sunday — Sunday full day? Sunday full day might be a possibility, if you don't mind spoiling your weekend. We'll go with whatever maximizes attendance. One more thing about those sessions, in terms of what we'll go into: the next session here is getting the data, so as a quick note on logistics — you will need access to the internet for the hands-on part. Internet will be provided at the venue, and since you need it, that part will have to be at the end of the day. All of the courses I mentioned earlier will tell you where to download Python from. So: the next session is going to be on getting the data. There are two parts to getting the data — either you don't have it, or it's not in a format that you can use. Those are the two things we will be covering, and when I say covering, we will talk about how to do this in an automated fashion. Let's say you have a text file in a standard report format — say, district-wise rainfall data. You have the name of a district, then as many spaces as required to pad it to a fixed width, and then the number of centimetres or millimetres of rainfall for that particular district — and so on, row after row, to the end. This is in fact the actual format of the data that we will see in a number of cases. Now, how does one process that into usable data? Realistically, you have a file in which spaces separate everything; it looks very neat when you open it, everything sorted into columns. How would you process it? I hear a number of you say Excel. Excel is quick, really quick: you just open the file, you say Text to Columns, and you can put everything into
columns. You have good control over whether the columns should be numbers or some other format; you can do a lot of stuff with it. If you haven't used Text to Columns in Excel, please take a look at it. The problem with that approach is if you had one file per district — some 600-plus files — it's going to take a lot of time. So one of the things we will be talking about in the next session, starting with the basics of file handling, is how you would go about converting this sort of file, programmatically, into a file that one can open in Excel. The other part is that downloading those 600-odd files by hand is also going to take a lot of time — so how do you automate that? That's the second part. Now, incidentally, this is called crawling, and it's no different from what Google does to every single website. It doesn't take much by way of computing resources; how fast you can do it depends on your bandwidth and your computer's speed. So if you're fetching 600 files, and each one takes a second, that's going to take you about 10 minutes — happy to wait for 10 minutes. If you want 60,000 files, then you might want to either split it across multiple machines or let it run for a thousand minutes. That's still not such a big deal — it's less than a day; you can leave it running overnight and get it done. So depending on how urgently you need it, you have a trade-off: your time, both to write the program and to run the program, versus how much money you have — either in terms of the ability to hire people to do the coding for you, or to run the machines that you need. Most of the problems I have dealt with over many years have not required more than one computer, and have not run for more than a week. The longest process I've run was a week of downloading Twitter data, because I wanted my own week's worth of data. The second largest was a matrimonial site, from which I downloaded every single profile's data — purely for analysis purposes — and that took me a
day and a half. For the rest of the sessions, you can go through the contents online, and we'll explain in subsequent sessions what they will be covering. So, like I said, getting the data is the topic of the next session. If you have a sense of what you need from a business perspective, just let us know. Any questions? I can take those now. A couple of quick announcements. If you want to attend that course next week, independent of the timing, please tell Anand that you want to do so. If you haven't registered for this course — I saw that a couple of people came in late — or you haven't actually paid up, please do so as well. And apart from that, again, let me reiterate: register on the course site so that we can use the forums and other functions effectively. You will be hearing from us about the projects that you'll be participating in by the next session, or at most by the middle of next week — and that's going to be the fun part. Obviously, once you know how to scrape, you should be able to get your hands dirty with those data sets. Thank you — I'll be here.
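As a postscript to the "getting the data" discussion: the two steps described — automating the download of many report files, and parsing the fixed-width district format — could be sketched roughly as below. The URL pattern, the sample lines, and the numbers are all made up for illustration; the real reports would differ.

```python
# A rough sketch of the two "getting the data" steps: crawling many
# report files, then parsing the fixed-width district format.
# The URL pattern and the sample lines below are hypothetical.
import urllib.request

def fetch(url):
    """Download one file; a real crawl would add retries and delays."""
    return urllib.request.urlopen(url).read().decode("utf-8")

def parse_report(text):
    """Parse lines of 'district name, padded with spaces, then a number'."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # rsplit with maxsplit=1 keeps multi-word district names intact.
        district, rainfall = line.rsplit(None, 1)
        records.append((district, float(rainfall)))
    return records

# The crawl loop: one URL per report. With ~1 second per file,
# 600 files is about 10 minutes of waiting.
# for i in range(600):
#     text = fetch("http://example.com/rainfall/{}.txt".format(i))
#     ...

sample = "Bangalore        970\nMysore           798\nMandya           691\n"
print(parse_report(sample))
```

The crawl loop is left commented out since it points at a hypothetical site; the parsing half runs as-is on the sample text.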