I hope everyone attended yesterday's talk on R. We have done some basics now: you all got familiar with R and spent some time building, say, three or four models. This talk is again focused on machine learning, but the idea is that yesterday we covered the breadth of machine learning and delved into multiple models; this time we will go for depth, so that we come at it from both perspectives. Second, yesterday we focused mainly on classification models. I will try to see if, apart from classification, we can cover some unsupervised material today; it is important to have that world view as well. Third, why is this focused on Python? There is nothing to compare between Python and R. Both are great tools, and I happen to use R and Python every day in my work. It is just to give you a wider set of options: if you find R a bit uncomfortable to use, here is another tool, Python, so see if that is helpful. The idea is not to focus on tools. We will focus on the concepts, and tools are just the means to get there.

With this thought, let me begin. By the way, people on this side, is this visible to you, or maybe not? A small introduction: my name is Garshan, and I work at a startup called Socorati, which is based in Pune. We are an advertising-domain startup, so we keep getting a lot of clickstream and impression-level data, and my job there is to make sense of this data. We did introductions yesterday, so I won't spend time on that. Let's get on.

If you think about the whole agenda of the workshops and the conference, the idea is something like this. In the machine learning or analytics world, if you look at the big picture, there are these different parts: there are the mathematical models, the theory; then there are the tools; then there is the business knowledge. It is not enough to have knowledge of only one part of it. It's more like an orchestra,
you need to be a jack of all trades there. One is the science part of machine learning: when you say that the library is fitting a logistic regression model, what exactly is the library doing? Knowing that is important; that is the science part. Second is the process: if you are given a business problem with a lot of data thrown at you, how do you begin, and how do you get to a usable solution out of it? Then comes the engineering part: this data that we are talking about could be billions of data points, it could be coming from multiple sources, so how do you store that? That is more the engineering aspect of machine learning or analytics. And the last is art, black magic. The art of machine learning is about how you come up with a better model of the problem, of the real world that you are trying to model; how do you come up with a better data model for that? The conference talks will be mostly focused on engineering; the end talk, don't miss out on that. There was yesterday's talk on the art of machine learning, and I think we covered a lot of the science part.

So today we are going to focus on the process of machine learning. Now, process sounds like a very bureaucratic word, and as a community we would hate process. So what is this process of machine learning? We are going to take a real-world, challenging problem, not a toy data set. We will spend some time on the science and the tools and use some intuition. I am taking a problem which will be very simple for everyone to understand, and then we will see if we can create a usable solution out of it. That is the process of machine learning.

So what is this process that we are speaking about? Imagine that you are working for an organization and they throw a data set at you, saying: we did a telemarketing exercise six months back, here are a bunch of data points from that exercise, and that marketing
campaign was somewhat successful, but can you tell us something more? What should we do better next time? Can you do something using machine learning to tell us a better way to run our marketing campaign? A very simple problem, applicable to all domains.

Now if, as a data scientist, someone throws this kind of question at you, you will have a flurry of questions in your mind. Where do I start? That is the biggest problem. What kind of models am I supposed to use? The data that is being provided to me, is it too small, is it too large? How do I even know what is useful, or what is a sufficient size for the data set? Process is a somewhat scientific way of understanding and answering these questions and getting to a usable solution. These practices are mostly derived from the experience of the community. I do not know if you have heard of non-open-source tools like SAS and Stata, which are also used in the analytics community; they tend to have some notion of this process, called CRISP-DM, and there is a bunch of terminology associated with that. But the basic idea is that you need a well-defined process for starting from a data set and getting to a solution. That is what we are going to focus on today. Science is covered, and art and engineering, I hope, the conference talks are going to cover.

When we go through a typical machine learning or analytics process, these are the steps that we invariably go through. First, we start off by defining what it is that we are going to achieve: defining the objective. Then we need to get data; we need to get a lot of data. The next part is called exploratory analysis, where we try to get familiar with the data and understand what is being thrown at us. Then we come to the meaty part, which is the modeling of that data; here all the science of machine learning comes into the picture. Then we need to evaluate the model. It is not enough to throw data into a library, build some model and just get on with
it; we need to see whether the model that we have built is actually good enough. That is the evaluation part. And what I have observed from my experience is that after evaluation you always have to go back. It is an iterative process. It is highly unlikely, in fact it is impossible, that you get to a great working solution in the first attempt. We need to keep iterating: tweak our model, tweak our parameters, come up with more features, go back, and after, say, three or four iterations, hopefully we come to a usable solution. The next part is that we need to keep validating. If you build a model today, because the world outside is changing so fast, you need to keep evaluating your model on newer data, and the moment you feel it is not doing the work it was supposed to do, you need to go back to the iteration again.

In this talk we will cover objective setting very briefly; mostly I will focus on exploration, modeling, evaluation and iteration. Getting data is not covered here, because it is too diverse a problem: it really depends on the application domain in which you are working and on how your organization stores its data. Nonetheless, I have another talk at the conference tomorrow afternoon, where I will spend some time on how you get the data into shape, so if some of you are interested, do join for that. Let's leave that for tomorrow.

Okay, so now let's start with the hands-on. Today I am not going to spend too much time on discussion; we had a lot of it yesterday, very interesting. We need this particular data set called marketing campaign data. How many of you don't have it? Just raise your hand. Right, so a couple of you don't have this data set. One option is the GitHub repo: those who are familiar with GitHub can fork this repo. This presentation is an HTML file which is also shared in that repository, along with the data set. Okay, those who are not
familiar with GitHub, you can just google for "bank marketing data set UCI machine learning". UCI has this open repository of machine learning data sets, and I have taken one very interesting data set from it. I will give a couple of minutes for those who want to download the data.

Yes, so this is the UCI machine learning repository. If you have Git installed, you can fork this repository; it has the presentation HTML file. I urge those who are familiar with GitHub to do this, as it will save you the time of typing the code: you can just copy-paste the code into the Python console. And keep this link handy even after the workshop. If you want to go back to some material, or if you are not able to follow along and miss some commands, the whole presentation is there with all the code, so you can keep going back to it. If you find errors, or if you find better insights, do point them out and I will incorporate them.

Alright, so what is this data set about? It comes from a paper published by these Portuguese fellows, I guess they are professors at some university, and it is data from a marketing campaign of a bank. They tried to sell a deposit, a fixed deposit product, to their customers through telemarketing. They massaged the gathered data a bit and open-sourced it. It's a very common business problem, so this can be fitted to a lot of the marketing business problems that you face. Can we move on?
You have all the data? Alright, so let's begin with the hands-on. Whenever we get data, the first thing we need to do is load the data, whatever tool we are using; it could be R, Python or some other fancy thing out there. What I am doing here: NumPy is a package for n-dimensional arrays in Python, and pandas is a package which gives you data frame functionality. You must have worked with data frames yesterday in R; pandas lets you use data frames within Python. So I am importing these two packages and giving them an alias so that they are easy to refer to. Then I am setting up the path to my bank file. You could have this somewhere in your downloads location or on your desktop; just set up the correct path. This line is important.

What you do is fire up your Python console. If you are on Windows, open your Python, there will be some .exe; if you are on Linux, go to the terminal and type python.

Yes? How do you know the current working directory in Python? I guess there will be some command, I am not sure. No, basically what I am saying is, just set the path of your file here in this line. To know the current directory we would have to import a couple more packages in Python and then set that; this is an easier way to do it, I guess.

So there is this file called bank ... so there is an HTML file called presentation.html; you can open it in your own browser. Is that so? Not sure, just try running this and see if you are not able to import that at all. I think it should be fine; I don't think we are using date anywhere. I am using 2.7. Yes, just open it in your browser. So there is a theme folder in the ...

So what I am going to do is show you the code, give you some time for typing it out or copy-pasting it, and then I will run the same code in my Python console.
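On the question about finding the current working directory: a minimal sketch using only Python's standard `os` module, assuming Python 3. The file name `your-data-file.csv` is a placeholder and not the actual name of the data file; replace it with whatever you downloaded.

```python
import os

# The directory the Python console was started from.
# Relative file paths are resolved against this directory.
cwd = os.getcwd()
print("Current working directory:", cwd)

# Placeholder file name (an assumption for illustration only).
data_file = "your-data-file.csv"

# Build a full path from the working directory and the file name.
bank_path = os.path.join(cwd, data_file)
print("Looking for:", bank_path)

# Check whether the file is really there before trying to read it.
print("Found:", os.path.exists(bank_path))
```

If this prints `Found: False`, either move the file into the printed directory or set the full path to the file by hand, as described in the talk.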
So you must have downloaded the file somewhere; just copy the path to that file and paste it into this line. If you are on Windows, the top bar will let you copy that. I am just setting up the path to the file, and then I am reading the file. It is a CSV file delimited by semicolons, and I am setting this header argument, which says that the file has its headers in the 0th row. If you don't have headers, you set this to None; if you have some other delimiter, you set this delimiter string. So I will just run this part now. Yes.
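The loading step described here can be sketched roughly as follows. This is not the code from the presentation, only an illustration under stated assumptions: the three-column sample below is made up so that the snippet runs on its own, and `bank_path` in the comment is a hypothetical variable holding the path to your downloaded file. In current pandas, a file without a header row is read with `header=None`.

```python
import io

import numpy as np   # n-dimensional arrays (imported as in the talk)
import pandas as pd  # data frames

# Made-up stand-in for the downloaded file, so this sketch is self-contained.
# The real data set has more columns and rows.
sample = io.StringIO(
    'age;job;y\n'
    '30;"admin.";"no"\n'
    '45;"technician";"yes"\n'
)

# sep=";"  : fields are separated by semicolons, not commas
# header=0 : the first row (row 0) holds the column names
bank = pd.read_csv(sample, sep=";", header=0)

# With the real file you would pass its path instead, for example:
# bank = pd.read_csv(bank_path, sep=";", header=0)

print(bank.shape)   # (rows, columns)
print(bank.head())  # first few rows, to check the parse looks right
```

Printing the shape and the first few rows straight after loading is a cheap check that the delimiter and header settings matched the file.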