Okay, so a bit of introduction about me. My name is Harshad. I work as a senior data scientist at Socrity, a digital advertising startup based out of Pune, about a four-year-old company. I don't claim to be any expert on machine learning; I'm just trying to understand it and apply it. I've been with Socrity for the last one and a half to two years. Before that I spent some time in Mumbai for my MTech, and I was with Vodafone Group, where we had a small analytics team that I used to work in.

Right, so can we have a quick round of introductions? Where is everyone working, and in what domain? A couple of lines each. "My name is Meenit. I work with a company called Coldplay." "I run a data visualization company called Spiky." "My name is Mages Wally, and I'm working with Affinity, on open source." "I'm Gashan Nair. I work at VRR Connect as a senior R&D engineer; I'll talk about myself more later in the session." "Prashan, from Chitlan Systems, working in Python." "I'm working with a contextual advertising company, Meere." "My name is Chitlan; I work on the front and back end of the web." "Tony Simon." "Arish, an internet price-comparison site." "Sanjit, working as a journalist with IndiaSpend; I write articles on data." "I'm Anjuli; I've also been working at IndiaSpend." "I'm Karat, a second-year intern, working at SwissNet on project data." "I'm also an intern." "I'm working on business intelligence and BI software." Okay, great. Let's get started.

So what are we going to do in today's session? We have about two and a half hours. What I intend to do is start with a 10,000-feet view of this field, before we actually get our hands dirty and go into specifics. So the first 15 minutes will be an informal, big-picture session on machine learning, what I've gleaned over the last four years from working in this field and from peers. The idea is to not miss the forest for the trees: once we get into actually working with R and Python, we tend to get lost in remembering the syntax and remembering the APIs. So the first 15 minutes I'll spend giving a broad overall view, and then we'll switch gears and get into specifics. That's the basic plan.

So let's start: what is machine learning? Any ideas? "We basically feed data to make it learn, so that it can be run again with a new set of data, and it can use patterns from the old set of data to identify or do something with the new data." Exactly; very precise. Engineering students, what's your idea? Hot field, lots of jobs, right? "The ability of a computer to perform certain operations without being explicitly programmed to do so."
So basically, training the computer to recognize those patterns that we cannot explicitly program, that we cannot tractably write down ourselves. Right, that's good. Now, if you go on social networks and the internet, there's a lot of terminology and a lot of opinions about what machine learning is, a lot of academic debates. On Quora, Stack Overflow and elsewhere, you'll find people writing long posts about analytics versus machine learning versus data mining versus big data and all the other industry buzzwords. I'm not claiming that everything is the same, but most of these fields have a lot of overlap, and if you think about it, a practitioner shouldn't be too worried about the definitions and where exactly the lines are. The lines are all blurred; we need to pick up the good concepts from each field. So if you ask me, it's something along the lines of what many of you mentioned: teaching machines to take decisions with the help of data. That's the practical man's, or woman's, definition of machine learning.

Now, a bit of history. What is this? Anyone? This is Halley's Comet. So what is it doing in a machine learning workshop? This Edmond Halley was a very intelligent man. Back in 1693, he got hold of a city's death register, a record where an entry was made every time somebody died. What he did was take that data and analyze the ages of the people who were dying: is there a correlation? Intuitively, we all know that as you grow older, your probability of dying increases. And why is this useful? Any guesses? Exactly: all the insurance folks selling annuities, their bread and butter depends on this, on actuarial tables. If you go to an insurance company to take out a policy, the first question they'll ask is your age, and their premium slabs are defined on age. So even though Halley is famous for the comet, he laid the foundation for actuarial science, for using this kind of data to aid decisions.

I tend to give this example in a lot of workshops, and this is the kind of reaction I typically get: "Man, that's too ancient, and this is not machine learning; you're talking about something that happened in the 17th century." So, a bit of more relevant history. As I mentioned, the insurance and banking industries were, to a large extent, pioneers in using models for decision making. If you've heard of the FICO score, or in India we now have CIBIL credit scores: credit scores have a long history. By the 1960s and 70s, they had already started using credit scores to assess a person's creditworthiness in the US. Because these industries deal directly with money, they were pioneers in getting mathematical models and statistics to aid their decisions. So that's one pathway. The other is artificial intelligence and other fancy ideas: we've all heard about checkers-playing machines, then chess-playing machines. That's the other path people have ventured down.
So machine learning sits at the intersection of ideas from both these fields. If you want a one-liner for what you're actually trying to do when you say you're working on a machine learning project, it's this: we have some information about the world, say a matrix X where we've stored that information, and there is some outcome we're interested in, y. It could be the outcome itself, like how much money we can earn selling goods on a website, or some function of the outcome, like the probability that something will happen. And the whole project is to find the f in y = f(X). Irrespective of the models we use, the languages, the libraries, even the domains we work in, the bottom line is the same: we are trying to find that f. That's our endeavor.

Moving on; I'll spend just five minutes on this. There are two fairly distinct cultures from which machine learning has originated. One is the statistics culture: you take a particular model and say that this X and this y are related by some kind of model, you start by assuming the form of that f, and then you try to fit that model to your data, followed by lots of diagnostics. The buzzwords in that field: hypothesis testing, goodness of fit, residuals, Bayesian analysis, regression models, survival analysis. The other is the artificial intelligence culture. These folks say: I don't care what model we assume; treat the world as a black box, my focus is on predictions. But here's the crux with the statistical approach: if there are, say, five candidate models and you try out only four, maybe the fifth model was the best one; you never know. Nonetheless, that theory is fundamental and we can't ignore it. The AI culture focuses on getting good predictions out, while ensuring that the model you're fitting is not overfitting or underfitting the data; we'll come to these terms. Neural networks, tree-based models: these originated historically in the AI community. And people get very religious about which culture is better. There is a famous professor called Leo Breiman who wrote an essay about exactly this, "Statistical Modeling: The Two Cultures." If you happen to have some time over the weekend, do read it; it's a fantastic paper. But we are not going to get into all these wars. So then there is the computational aspect.
It's no use having a very sound theoretical model if you can't actually take, say, 500 million data points and fit your model to them, or use it for predictions. So the computational aspects, application and scale, are also important. And business knowledge. I don't know how many folks here have attended the Coursera and other machine learning courses; I have too, and the important thing I find missing in those courses is that they never talk about the problem you're trying to model. It's more or less a black box: I've got this data from here, there's this fancy algorithm, feed one to the other, and you get some kind of output. Business knowledge is very important; it should be the starting point. So my basic point is that everything is important: you get feature selection and model evaluation from the ML community, the basic theory from the stats community, and Hadoop, Storm and friends help us actually take this knowledge into production, as it's called. And business knowledge is the starting point. I'll try my best to touch upon each of these topics as best I can.

Okay, so now we're switching gears. The process of any machine learning project, irrespective of its domain, typically goes through the same phases. There are a lot of terms for this in the industry; a lot of commercial software vendors have come up with their own terminology. There's something called CRISP-DM, and the SAS software tends to call it SEMMA. But in plain English, what you do is: you fix your objective. What exactly are you trying to do in this machine learning project? You get your data, and the data should be useful for your objective; it's not good enough to just pull some data out of a database. Exploration is very important: getting familiar with the data, making friends with the data. Just feeding the data into some kind of black box and getting garbage out, that's not the intention. A lot of online courses barely touch this part, but personally I believe this is the 80% part. Ask any machine learning practitioner here and they'll say that 80% of their time is actually spent in this region, and it's only 20% of the time that we're actually thinking about some kind of fancy model. Then we fit our models, evaluate them, apply them. We also try to validate our models: is that model actually earning you a dollar? If it's not, you go around the loop again. So let's touch on each of these one by one and get our hands dirty.

The objective: in very brief, there's not much to say about it, but from my experience, this is the bottom line. The objective you define for your machine learning project shouldn't be stated in terms of the algorithm. Don't ever start by saying "I want to build a linear regression model with some F-score" or "a classification model." That's not the way. The objective should always be focused on the business outcome you're trying to model: predicting the probability of customer churn, grouping textual data, recommending items. Those are the hot topics. Always focus on the business outcome you're after.
I'll just stop there. Sourcing data I'll speak about at the end; what I want to cover there is how you process data at scale, so we'll reserve some part of the talk for that. Explore, then: the first phase of any machine learning project. We'll now go to R and get our hands dirty. I hope everyone has R installed.

R actually originates from the S language, which was built at Bell Labs a long time back, somewhere in the 1970s, by John Chambers and colleagues. S lives on as the commercial product S-PLUS, and two professors at the University of Auckland in New Zealand wrote R as an open-source implementation of S. A lot of the base of R is written in Fortran, C, and R itself. It's a good language, but it's quirky; we'll explore that. How do you start R? You go to your command line and invoke the R program. Some folks might be using RStudio, the nicer, GUI-looking interface. When you start R, it prints a bunch of information about the R version and such.

When we talk about any programming language, and R is a programming language, we start with the basics: the data types and the data structures offered in that language. So let's try our first command. Just type typeof(1), and it comes back and prints "double". R is a dynamically typed language: you don't specify the data type ahead of time, and it infers what kind of data it is. For 1 I would ideally have expected an integer, but it's a double: by default, a number goes to double. Let's try something else, just to warm up. There's a character data type, and all the usual suspects we meet in any programming language. There are logical data types too; I always tend to get confused between the all-caps and the camel-case versions: True is Python, and all-caps TRUE is R.

Then there is an interesting data type called factor. Just type this as I'm doing. This weird-looking, single-character function c() is actually the concatenation operator: it takes atomic values and puts them into a collection, an atomic vector. But what I'm doing here is calling another function, as.factor(), on that vector, and what it prints is interesting: it says levels "a" and "b". Now try another thing: remove that as.factor() call. When I just use c(), it's a plain vector of characters, but if I convert it into a factor, it prints this levels information. This is the statistical baggage of R: in the statistical community, factor analysis was a very popular thing some years back. The idea is: if I have some experiment with a particular variable, and I have, say, 10 million data points, but across those 10 million points this variable takes only around 4 to 10 distinct values, those kinds of variables are typically modeled as factors in R. "Does this factor influence my outcome?" That's how you remember it.
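If you're following along, here is that stretch reconstructed as commands you can type:

```r
# Basic types: R is dynamically typed and infers the type
typeof(1)              # "double": plain numbers default to double, not integer
typeof(1L)             # "integer": the L suffix forces an integer
typeof("a")            # "character"
typeof(TRUE)           # "logical": all caps in R (Python spells it True)

# c() concatenates atomic values into a vector; as.factor() adds level info
x <- c("a", "b", "a")
x                      # a plain character vector
as.factor(x)           # same values, now printed with Levels: a b
```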
Remember this part, because when we go to Python, you'll notice the difference between how R handles these variables with a small number of levels and what you have to do in Python for them.

With basic data types out of the way, we'll go to data structures. This c() function is actually creating a vector, and I guess we can check that, as a sort of assertion. Vectors are atomic collections: every value in the collection is of the same type. If I say typeof(c("a", "b")), it shows "character"; yes, it's a vector of characters. So typeof() tells you what kind of data values are inside the container, which is slightly weird; I guess there's another function called class() which you can also try.

So vectors are atomic. Then you have something called list. How many folks have worked in Common Lisp, or some Lisp, during their graduation? The language with all the parentheses. R has a lot of roots in Lisp; a lot of R's internals are inspired by it, and list is supported as a data type. So what is a list? A list is a heterogeneous collection. We introduced the syntax for vectors; now take a vector, a plain atomic value like an integer or a double, and just put them together into one collection; let's see if this succeeds. There is no enforced limitation that everything has to be of the same type or the same dimensions. A list is "throw the kitchen sink in": a container for all types of heterogeneous data, just put together. In fact, we'll see examples of this: the results of the machine learning or statistical models we fit in R are often returned as a list, with vectors, data frames, text data, everything put into that one list. So lists are very important; remember them.

And then we come to the data frame. The data frame is the workhorse when you're working with R. A data frame is closely equivalent to your MySQL or relational tables. The typical format: a number of rows, each row representing some data point, and columns representing attributes. If our rows represent customers, our columns might represent the number of times that customer has bought something from our website, the amount of time he has spent on a particular page, and so on and so forth. The data frame is the data structure you'll mostly be working with. Even in Python: Python has this library called pandas, and pandas also mimics this behavior, because data frames are everywhere. If you want to work with R or Python, it's very important to get very comfortable working with data frames.

So let's spend some time on data frames. R comes with a library of built-in datasets, and because we don't have time to actually download some real-world data, let's stick to these example datasets bundled with R. If you load this library, with library(datasets), it loads all those datasets into memory. And how do you know what datasets, functions, or anything else is covered in a given package? (These are called packages in R; you can think of them as libraries.)
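The same ground as a runnable sketch:

```r
v <- c("a", "b")                 # atomic vector: all elements share one type
is.vector(v)                     # TRUE
typeof(v)                        # "character"
class(v)                         # "character"

# A list is a heterogeneous container: mix vectors, scalars, anything
l <- list(c(1, 2, 3), "hello", TRUE)
l

# A data frame: rows are observations, columns are attributes, like a SQL table
df <- data.frame(name = c("a", "b"), visits = c(10, 3))
df

library(datasets)                # load R's bundled example datasets
```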
There is this command, library(help = "datasets"), where you specify the name of the package you're interested in. It prints information on the package datasets: some history, the version, and then, if you notice, this index, which lists the datasets that are available. Is everybody up to speed? Very simple. All these datasets are available now; if you want to actually understand what the hell is in one of them, a cool feature of R is that the documentation is built into the interpreter. You put a question mark: cars is a dataset available in this library, and ?cars prints a bunch of information about it. It says the data give the speed of cars and the distances taken to stop, and so on: it's a data frame with 50 observations on two variables, just the way we spoke about it.

So let's print cars. It prints too much; I just want to peek at what's in this data frame, so I'll use the function head(). A lot of these exploratory functions are mimics of similar tools on Unix: head and tail. head(cars) shows there are two columns, called speed and dist, and it prints the first six lines of the data frame.

So we've explored the data frame; now let's access its elements. How do you access a particular row? You can treat this as a two-dimensional data structure, similar to what we do in the case of matrices. Simple. There's also a function in R called seq(): given endpoints, it creates a sequence, and I think we can use it to access a set of rows of a dataset. Subsetting data frames is something we're going to do all the time. We've spoken about accessing particular rows: you can give the row number, or a sequence of row numbers. Now, what about columns? There are two variables here, speed and dist. You use this dollar operator on the data frame, and it prints that column. If you think about it, that single column is going to be an atomic vector, values all of the same type; typeof(cars$dist) gives double. Besides the dollar, there are a bunch of other, frankly idiosyncratic, ways of accessing a particular column. That is one downside of R: there's always more than one way to do it, and if you happen to browse the source code of a lot of R packages, there's always this impedance mismatch. If you look at Python instead, because its libraries are curated, the source code is very simple, with clean APIs. That's not so with R, but it also gives you a lot of flexibility. So now we're all familiar with subsetting by rows and accessing individual columns.
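Putting those pieces together on the cars data:

```r
library(help = "datasets")   # package info, plus an index of its datasets
?cars                        # built-in docs: 50 observations, 2 variables

head(cars)                   # first six rows: columns speed and dist
cars[1, ]                    # first row, matrix-style two-dimensional indexing
cars[seq(1, 10, by = 2), ]   # rows 1, 3, 5, 7, 9 via seq()
cars$speed                   # one column, returned as an atomic vector
typeof(cars$dist)            # "double"
```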
Now, what if you have more columns in a data frame? How do you access those? Let's pick another dataset this time, one with slightly more variables: mtcars is another famous dataset, so let's go ahead and explore it. It describes a number of cars and gives some attributes about them: the miles per gallon, the number of cylinders, the displacement, and so on. Now, if I want to select multiple columns of a data frame, I create a vector of the names of those columns. Rows go in the first index, columns in the second; it's like a matrix. Whatever you do in the second position subsets the columns, and whatever you do in the first position operates on the rows. So I've selected a bunch of these variables here. Does this work? Yes: we've printed a few rows, selecting particular columns.

Now, here it gets interesting. Instead of giving the names, another way is to give the column numbers. A lot of programmers here will say that's sloppy, because the number of columns can change anytime. Not advisable, but yes, you have this flexibility. Another cool way to do subsetting is to put a minus in front: it prints all columns except, say, one and two. Because if you have a thousand attributes in the dataset and you just want to eliminate a few, you don't need to write down those other 998 names. And by the way, I thought this should also work if you give a vector of names; nope, looks like it doesn't. I'll have to explore a bit.

So now, to make it clear: accessing rows, accessing columns, by name, by index, by negative index. We can also find some metadata, because if someone gives you data pulled out of some SQL database, you'll want some basic information about that dataset. dim() prints the dimensions: there are 32 rows in this dataset and 11 columns. There are row names given, too; row names are different from the index, they're labels you can give to the rows, and in this dataset they're the car models, Mercedes and all. And then there are column names. That should be it for the basics: head and tail on a dataset, find out the names, find out the dimensions. That's how you start to get familiar with it.

Okay, so exploring using plot. plot() is a basic function that creates a sort of scatter plot; it just plots all the values. Instead of that, if you want to plot one variable against another, you can give just those: now it has plotted the number of cylinders in a car against the miles per gallon, the mileage of that car. Another function: if you have a bunch of such variables and you want to see the interactions between them, which is what we typically want to do (say the total amount spent by a customer as one variable, and the city he comes from, or the device he's using, as others), there is a function called pairs(). If you want to find out what pairs does, you can explore the help; it gives you all the arguments and such. Let's try this. Ah: "only one column in the argument to pairs". I'll just explain this in a moment.
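The subsetting and metadata calls from this stretch, in one place:

```r
head(mtcars[, c("mpg", "cyl", "hp")])  # select columns by a vector of names
head(mtcars[, c(1, 2)])                # by position: sloppy, columns can move
head(mtcars[, -c(1, 2)])               # negative index: everything EXCEPT 1 and 2
                                       # (negative indexing by name does not work)

dim(mtcars)                            # 32 rows, 11 columns
rownames(mtcars)                       # car models: "Mazda RX4", "Merc 280", ...
colnames(mtcars)                       # "mpg", "cyl", "disp", "hp", ...

plot(mtcars$mpg, mtcars$cyl)           # one variable against another
```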
So, this is the formula syntax of R. What we're saying is: the number of cylinders in a car, please plot it as a function of the miles per gallon and the horsepower. Yes, this is the syntax; we just have to remember it. The tilde says: whatever is on the left-hand side, plot it as a function of the variables on the right, from this dataset. And now it plots this nice grid of graphics. How do we read it? This axis is the number of cylinders, plotted against the miles per gallon and the horsepower; it is basically plotting the interaction of every variable with every other variable. There are three variables here, so it's a three-by-three matrix of panels. This is very useful: if somebody happens to give you a big dataset, the pairs() function is generally the first step. You look for some kind of correlation, whether you can spot some kind of linear trend in the dataset. That's what we do.

Now we come to the most important concept in R, called broadcasting, or vectorization. It is very, very important to remember this concept, because it is applied everywhere in R (and Python has caught up with the same idea). R is an interpreted language, so you typically try to avoid writing for loops: if you write a for loop, the interpreter has to take that text, parse it, and convert it to some kind of syntax tree in the back end, and that process is going to be slow. So instead, looping over a container, looping over a bunch of things, is handed down to the library. Let's see this in action; it's a much better way to work.

We have this mtcars dataset with a variable called mpg. Now, say I want all rows where the miles per gallon, the mileage of the car, is more than 10. What's happening here is that the comparison is a binary operator, but instead of giving it a single value, we've given a vector on one side: it broadcasts the operation to all elements in that container. And every car here is above 10, so that's not useful; let's take 20. Now, how is this useful? You can use the result to subset your data frame: since the first index works on rows, I can put the logical vector there, and it just works, because the operation is broadcast across. This is very useful; in exploring any problem, we subset a data frame by some variable all the time, so always stick to these kinds of operations. Very helpful, very fast.

Another use of vectorization in R is applying some operation to an entire column. Say somebody tells you: for this column, I want the square, or the square root. The same concept applies. Again, you don't write a for loop that goes through i from 1 to the dimension; you apply the function to the entire thing at once.
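The pairs call and the broadcasting examples, sketched (the formula method of pairs usually takes a one-sided formula):

```r
# Scatter-plot matrix: every listed variable against every other
pairs(~ cyl + mpg + hp, data = mtcars)

# Broadcasting: the comparison applies to the whole column at once
mtcars$mpg > 20                 # a logical vector, one entry per row
mtcars[mtcars$mpg > 20, ]       # rows where mileage exceeds 20, no for loop
sqrt(mtcars$disp)               # a function applied to an entire column
```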
Now suppose you want to create a new variable; I'll just call it high_mpg, say. This weird-looking arrow, <-, is the assignment operator in R. Imagine this case: somebody tells you to create an indicator, a flag kind of variable, for all cars which have their mileage above 25. Let's write it out and then discuss it. We create a new variable in the same data frame with the dollar operator, and on the right-hand side we write an ifelse condition. Nowhere have we mentioned the dimensions of the dataset. Now if you do a head() on it, the new variable is there. This speeds up a lot of things. So: very important concept, vectorization. Don't write for loops in R; they will be very slow.

Now let's switch gears and go a bit faster. When you start on a machine learning problem, you want to actually understand what the data looks like. In this mtcars dataset there's the miles per gallon variable; typically, we'd want to summarize it. If you have to give a report to your team about how this data looks, what the average is, that's the first thing anyone asks. The mean() function prints the average. And because R originated in the statistical community, it has this nice function called summary(): if you run it on a particular column of a data frame, it tells you the minimum, the first quartile (the 25th percentile of the data), the median, the mean, the third quartile, and the max. If you want to go deeper, not only the first and third quartiles but more of them: quantile(). Is everybody aware of percentiles? Yes? So here I'm going to use that seq() function I mentioned again. What I'm saying is: take this variable and give me the quantiles; which quantiles? These: start from 0, up to 1, at a step of 0.1. If you do this, it prints all the deciles of the data.

This is very helpful, because a lot of the real-world data we tend to get is very skewed. Typically, we look at the median of the data and the mean: if they are very different, then the midpoint of the data and the average of the data are not the same, which means the data is lopsided on one side of the median or the other.

So what else? The functional roots of R: R is a functional programming language. Now, I have these multiple variables and I want to print a summary of each one of them. So what we do is take this dataset and apply the summary function to all columns; the function itself is passed as an argument. Anybody who has worked in a functional language, and folks in Python, will recognize this style. This is again very helpful: you can dump this kind of summary into an Excel file if you want to send it to someone; it has printed that summary for every variable.

Anything remaining? Histograms: for actually looking at the shape of the data, there is a function called hist(), which plots the histogram. If you don't specify the number of bins, it internally uses some algorithm to plot it in a sensible way, but you can specify the number of breaks yourself, and it has a bunch of other arguments there.
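As a sketch (the flag-column name here is my own choice):

```r
# Indicator variable for cars with mileage above 25: no loop, no dimensions
mtcars$high_mpg <- ifelse(mtcars$mpg > 25, 1, 0)
head(mtcars)

mean(mtcars$mpg)                                  # the plain average
summary(mtcars$mpg)                               # min, quartiles, median, mean, max
quantile(mtcars$mpg, probs = seq(0, 1, by = 0.1)) # every decile

sapply(mtcars, summary)       # one way to apply summary() to every column
hist(mtcars$mpg, breaks = 10) # histogram; breaks overrides the default binning
```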
If you want to actually understand the statistical distribution of a variable, a histogram is probably not the best way; R offers something called density plots. If you call the density() function, it returns a result, and then you plot that result. You get something like this: not quite a bell curve, there's this dent here. If you keep doing this, you can apply it to every column and look at each of these plots, and you can also export these graphs to a PDF and create a report out of it. Very flexible. Any questions up to now? All right, shall we move forward?

So we've done some exploration and we know how to summarize our dataset. Before moving on, I want to cover one more thing: detecting outliers in data, something you'll want to do all the time. Suppose I suspect that my miles-per-gallon variable is not evenly distributed: there are some points lying at either end that are outliers from the rest of the data. For example, take the annual income of all the people in this room and then add Bill Gates to it. What happens? You have an outlier. So how do you spot outliers in your data? Because a lot of statistical models are very, very sensitive to the presence of outliers; we'll talk about that when we get to linear regression. What you do is use the same quantile() function, but rather than printing the whole range, you focus on the end part, the 90th to the 100th percentile, and you tighten the step size. It prints the 90th, 91st, 92nd, all these percentiles. What we typically look at is whether the value at the 99th percentile and the value at the 100th percentile are vastly different. In this case, they're pretty much the same, 33.43 versus 33.9. But on a real-world dataset, if you see a case where the 98th or 99th percentile and the max of that variable are very different, you clip it. That's what practitioners typically do under the term outlier analysis: you eliminate outliers from your dataset, because you don't want them to influence your model.
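The density plot and the tail check, as a sketch:

```r
plot(density(mtcars$mpg))    # smoothed shape of the distribution

# Zoom in on the right tail: if the 99th percentile and the max are far
# apart, the top values are outliers and are typically clipped
quantile(mtcars$mpg, probs = seq(0.9, 1, by = 0.01))
```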
Okay, so now let's go to models. What are the basic types of models? Most of you will be aware, I'm guessing, but to quickly summarize: there is something called supervised learning. You have a dataset and an outcome available for it; you feed those outcomes to your model to figure out the relationship between your X and your y, your information and your outcome. Then there is unsupervised learning: you're not looking for a particular outcome; what you're asking is, "I have this dataset, what can you tell me about it?" It's a more exploratory kind of analysis. Then there is semi-supervised: you have the outcome available for only a very small part of your data. An example: you've actually surveyed 100 participants and got some outcomes, and you have, say, 100K customers with some information available for each; you want that small labeled set to guide how you group the other customers. We won't get time to cover this. And then there is reinforcement learning: there's a kind of feedback loop, and the fitted model keeps on evolving.

So let's start with our first supervised model; we'll try to fit a linear regression. All of you must be familiar with linear regression: there is data, and we're trying to fit a straight line to it. We have this cars dataset with the speed of a vehicle and its stopping distance. If you apply the brakes in a speeding vehicle, you'd expect the stopping distance to be higher, and what we're asking is: can we actually capture that relationship? I want to model dist, the stopping distance, against speed. Let's save the result; we don't even know at this point what that result is going to look like, so let R worry about it. lm, for linear models, is the function typically used to do this linear regression fitting. What I'm saying here is: model the stopping distance as a function of speed, from the data cars. Very simple. If you want to explore what lm is, you're free to do that: the first argument is a formula, then the data; if you don't want to use all the data points, you can subset; and if you have a lot of confidence in certain parts of your data, you can specify weights.

Someone asks: isn't it easier to do everything in RStudio? So, RStudio is more like packaging; it helps you explore things in multiple windows, but the fundamental algorithms don't change. They're the same ones written in FORTRAN in the 1970s; only the visualization differs. In my company we don't use Windows; that's the only reason I'm not using RStudio, nothing else.

So this lm function has created its result, and we've saved it in this variable; you're free to give it any name. Now call the summary() function on it. Folks from a programming background might relate this to overloading of functions: there's the summary you apply to data frames, and here's the result of a linear regression, and you still apply summary. If someone is interested in how this dispatch actually happens, let's talk about it after the session. What it has printed is: I'm fitting this formula on this dataset, and the residuals, the errors, look like this. Now, how do you use this information? Residuals are the errors between your predictions and the actual values once the model has been fit. If their median is very different from zero, something is wrong: on average, you expect to overestimate on some data points and underestimate on others, so the middle of the errors should come out near zero. That's what you should be looking at on that line. And it says coefficients, where it has done something interesting: there is an intercept, because we are fitting y = a + b·x, where a is the intercept. The intercept has a negative coefficient, and speed has a positive coefficient of 3.93.
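The fit we just ran, as a sketch:

```r
linreg_res <- lm(dist ~ speed, data = cars)  # stopping distance as a function of speed
summary(linreg_res)   # residuals, coefficients (intercept and speed), R-squared
```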
So it is saying that the stopping distance of a car equals minus 17 plus 3.93 times the speed. But this is wrong: if the vehicle is not moving, the stopping distance should be zero. So let's capture that intuition, something you know from business knowledge, or common sense: I want my model to not use an intercept. You add a "0 +" term to the formula, meaning ignore the intercept; or I guess there would be some way to specify it with a Boolean, intercept = FALSE; check that. Now we've eliminated the intercept, and it fits a straight-line model that says the stopping distance is 2.9 times the speed. I'm not sure about the units.

Then it prints these star marks, which come from hypothesis testing. It's asking: is this variable significant; in the linear regression you've fit, is this variable worth including in the model? This happens at various significance levels, and I won't go into the theory; if you're interested, talk to me after the session. But even at a very low significance level, this variable still shows up, so it's worth keeping: there is indeed a relationship between the stopping distance and the speed of the car. That's the outcome. Then it prints the residual standard error, which, roughly, is the total size of the errors of the fitted model. Then there's this R-squared: a number between 0 and 1 that tells you how good your model fit is. I mentioned the statistical culture, which tends to focus on goodness-of-fit tests; this says it's 0.89 on a scale of 0 to 1, a very, very good fit. So speed indeed explains, indeed influences, the stopping distance.

You can also call plot() on the result and get a series of diagnostic plots. The first plot is fitted values versus errors: if my fitted stopping distance was 40, what was the error I made there? What you want to look for in this plot is any systematic anomaly; on average, you expect it to look like a random scatter. Here we are seeing some trend. This dataset is very small, but if you have this kind of trend in the residuals, it indicates that your model is probably not doing very well at the high end of the values. If a vehicle is an F1 car going 300 miles per hour, it might have some fancy braking mechanism, so the model might be systematically off at that end. That's what you look for in this plot: do the residuals look random, or is there a trend? If there is a trend, then there's some relationship that is not getting captured by the model. Something similar here, called the normal Q-Q plot: even I'm not on top of the exact theory behind it, but you'd expect the points to lie on a straight line, and if they deviate from the line at one end, there's a problem; again, this model is probably not doing very well at the higher end of the spectrum in predicting the stopping distance. And there are a bunch of other plots; let's not worry about them.
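Sketched out:

```r
# Drop the intercept: a stationary car should have zero stopping distance
linreg_res0 <- lm(dist ~ 0 + speed, data = cars)
summary(linreg_res0)   # one coefficient: dist is roughly 2.9 * speed

plot(linreg_res0)      # diagnostics: residuals vs fitted, normal Q-Q, and so on
```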
Okay, so we've fit our first supervised learning model, a linear regression, and understood the output. Typically, in the real world, you'd take some cut of your production data, explore it, clean it, and fit this kind of model. But then you want to apply it somewhere: it's not good enough to just know that the relationship exists; if a new observation arrives, you want to be able to predict the outcome for it. R provides predict methods for this, and again, these methods are overloaded for most models: for results of class lm, there is a predict.lm method. So, we saved our result in linreg_res, or whatever you named it, and we say: predict, with this fitted model, for these observations. I've just created a small data frame inline here, but in the real world you'd have your incoming observations as a data frame, and you hand it over: take this data frame, take this model, and give me the predictions. So now, if a vehicle is going at 3 units per hour, or whatever the unit is, the stopping distance is going to be 8.27 units, yes?

Question: it was a good fit, right? It was a good fit, but it had the problem that the errors are systematic. So we would dump this model, try out other models, and then predict. Question: is there a way to cure the data of that problem, heteroscedasticity? Personally, I've encountered that in one particular project, and in those datasets, linear regression is simply not a good choice, because constant error variance is a fundamental assumption of linear regression; you'd have to change your model. I have a couple of examples from my work where we had to scrap the linear regression model and switch to something else. We can talk about it later. Sure.
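The predict step from a moment ago, sketched (the speeds here are made-up inputs):

```r
new_obs <- data.frame(speed = c(3, 10, 15))   # stand-in for incoming observations
predict(linreg_res0, newdata = new_obs)       # predicted stopping distances
```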
Question: we have to look at other coefficients also; you just used distance and speed. In real examples, say telecom, we have a large number of variables describing the customer journey: recharges, timings and everything. How do you fit that? And you said we train the model and then test it: we take a set of data, hold out something around 30 percent, train on the rest, look at that coefficient, the 0.89 you mentioned, maybe drop some variables to push 0.89 toward 0.95, and then use it on the test data. How do you actually implement this? I've used it in RStudio, but how here?

All right, so what you're talking about is: if you have, say, 100 features, 100 attributes of a customer, in a real-world dataset not all of them are going to be significant. It's a very valid question. One way is manual: if you had 100 such attributes, the summary would print this coefficient table for each one of them and indicate with star marks which ones are significant. You can go through the iterations by hand: out of the 100, pick only the ones with three stars, the very significant ones, refit the model, and see whether the fit goes to, say, 0.95. That's one process. The other is that there are packages which do feature selection for you, because it's tedious to manually try out different combinations of variables and fit them. There are packages in R for this; I don't happen to have their names on me right now, but personally, I use Python for feature selection. Python has a very nice API where you say: I have this data frame with 100 variables and I don't know which are good; take it, divide it 50-50, train your model on this half, use that half for testing, and tell me which variables to select. That entire pipeline of feature selection is available as a very stable and intuitive API in Python. I'll try to cover that example as we go ahead.

Question: does it compare the p-values and tell you which variables to keep? Actually, the Python side doesn't use p-values; it validates on held-out data, using absolute errors or the like. The Python community comes mostly from ML; they don't go in for the stats jargon, is what I've observed. But it's a very simple API; I'll try to demo it in some time.

Question: is what we just discussed the same as principal component analysis? Not really; here we have a linear regression with some variables, and we're trying to fit a model that predicts the outcome. But it's a very good question: dropping coefficients via SVD or PCA is probably one of the best ways to avoid the manual, iterative process of testing various coefficient subsets. For those not familiar, SVD and PCA are ways of clubbing your variables together: if you have a bunch of variables that are similar, they take combinations of those and create one composite variable. SVD is very good, but where it suffers is interpretability: if I fit that model and take it to an MBA, I can't tell him that this axis is actually a linear combination of 50 variables. So it's a judgment call. In a lot of projects it's the best way to do it, but not if you want to interpret the results or find causation; if you just want a super sexy model that gets 99% accuracy, then yes, PCA or SVD.

Okay. I promised a demo on a real-world dataset, and I'll do that, but let's go through one more model first, hands-on. Then we'll take a small break and I'll show you the demo, where I'll try to address some of these concerns, like how you select variables when you have many of them.
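The speaker defers the R package names, but for completeness, one base-R way to do the train/test-and-drop-variables loop described above is stepwise selection with step(). This is my own sketch, not something shown in the talk:

```r
set.seed(42)                                   # reproducible split
idx   <- sample(nrow(mtcars), 0.5 * nrow(mtcars))
train <- mtcars[idx,  1:11]                    # 50-50 split, original columns only
test  <- mtcars[-idx, 1:11]

full <- lm(mpg ~ ., data = train)              # start with every variable
slim <- step(full, trace = 0)                  # drop variables stepwise, by AIC
summary(slim)                                  # the surviving variables
mean((predict(slim, test) - test$mpg)^2)       # error on the held-out half
```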
So, let's go to unsupervised learning: clustering. We had datasets with some outcome and some information; now imagine you just have some data and you want to group it. That is what clustering tries to do. Folks who have attended Coursera or other online courses might have learned the k-means algorithm; k-means is one clustering algorithm. Hierarchical clustering instead tries to create a hierarchy of groups: all these customers are of two types, and within this type there are three more types. If you want to do some kind of market segmentation, it's a fantastic model. So we'll do a hands-on with hierarchical clustering, and I'll again give you a real-world demo for it.

Okay. I think we have a dataset loaded called eurodist. What is eurodist? It gives the distances between European cities, and that's exactly the shape of a clustering problem. You have data points representing something (it could be cities, customers, products sold on your website) and a notion of distance measuring similarity between them. That's the idea of clustering: use that similarity measure to group things together.

So let's explore this dataset. What I've done is convert eurodist, which comes in a somewhat unusual class ("dist"), into a plain matrix called temp. Try finding the dimensions of this matrix: it's a 21-by-21 matrix, because it gives the distances between 21 cities. These are the cities covered. Why don't we just print some of it? I did a head(): it's really simple data, saying, for instance, that the distance between Athens and Gibraltar is 4485, I guess in miles. In this dataset, we want to group these data points.

R gives a function called hclust, for horizontal clustering; sorry, hierarchical clustering. The first argument is the actual distance data we just saw, and you also give it a method argument that tells it how to group; I'll talk about that. I'm storing the results in an object called hclust_res; feel free to give it any other, probably shorter, name. If you don't specify the method argument, it uses an internal default. Okay, it has fit; let's see if we can plot it. It looks something like this: Athens and Rome together; Gibraltar, Lisbon and Madrid all together; and, consistent with geography, Munich and Vienna together. Not a bad fit.

Now, what hierarchical clustering does is this: it starts by imagining each point as a separate cluster, everybody an island; then it merges the closest clusters, over and over. The metric it uses to decide which clusters count as "closest" is exactly that method argument we specify.
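Compactly:

```r
temp <- as.matrix(eurodist)    # eurodist is a "dist" object; view it as a matrix
dim(temp)                      # 21 21: pairwise distances between 21 cities
temp["Athens", "Gibraltar"]    # 4485, the number read off earlier

hclust_res <- hclust(eurodist, method = "complete")  # "complete" is the default
plot(hclust_res)               # dendrogram of the 21 cities
```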
For that method, it can take the median of each cluster and merge the ones whose medians are closest, or use one of several other statistical linkage criteria. It clubs the closest clusters together, and the process evolves to a point where everything belongs to a single cluster. Then, depending on your application, you choose how deep to go: for some cases you'll say, hey, this level of grouping is good enough for me; for others you'll say, no, this finer level is where I want to operate.

At this point, I'll show you a real-world demo. This is something we have: people, cookies, people like you, browsing stuff. Grouping people together could be one application. We do a lot of advertising for e-commerce and other companies; they might have a lot of inventory, and I'd want to club items together, and do it in an automated way. That's the exercise, and intuitively you'd reach for hierarchical clustering. So what I have here are these p_ files. I'll just open one of them. What is it? It's nothing but the text from the HTML page of an online e-commerce player in India: they had some hair-care kind of page, and a scraper has dumped its text. This is what you typically encounter in the real world; no more simple, curated datasets. Let's open one more: there was another page about mats, and this is its text. If a human being were doing the grouping, you'd expect that things like lamps and curios, home-decor kinds of items, are similar; bleaches, cleaners and facial kits look similar; shampoos, hair care and hair colors go together.

I don't think you'd have this data, so instead of a hands-on we'll do a demo; I'll just run you through the code very quickly. There is a package called tm, which does text mining. What I've done here is read those files and convert them into a corpus; a corpus is nothing but a collection of documents. Then tm_map, "map" as in map-reduce: take this function and apply it to every document. I don't want numbers, since a number by itself doesn't carry much meaning here. Then some regex-style cleaning to remove punctuation, which wouldn't help much in this case (though if you're doing sentiment analysis, you may want to keep punctuation). Then stop words: most text-mining libraries come with built-in stop-word lists, English, Russian and so on. Words like "and" and "a" don't tell us much, so we remove them. And I'm doing one more thing which I'll skip over for now: essentially, I'm converting that corpus into a term-document matrix. What that is, I'm going to explain in detail in the next five minutes. We want to capture distance: we were capturing distance between cities; here we want to capture distance between URLs. How do you do that? Very interesting topic; we'll come to it. But in some sense this captures the distances, converts them into a distance matrix, and the same hclust function we just used is fit to that distance matrix and then plotted.
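Roughly, the pipeline looks like this; a sketch, assuming the scraped pages sit as plain-text files in a directory (the path here is a stand-in):

```r
library(tm)

corp <- VCorpus(DirSource("pages/"))                 # one document per scraped file
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)                  # bare numbers carry little meaning
corp <- tm_map(corp, removePunctuation)              # keep punctuation for sentiment work
corp <- tm_map(corp, removeWords, stopwords("english"))  # drop "and", "a", "the", ...

tdm <- TermDocumentMatrix(corp)   # rows = terms, columns = documents
m   <- as.matrix(tdm)
d   <- dist(t(m))                 # distances between documents (the columns)
plot(hclust(d))                   # dendrogram: similar pages club together
```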
Let's go. This looks very opaque, I know, but we'll explore the theory in a bit. Now, look at what it has done: facial, cleaning and skin-care kind of stuff is grouped together, and even within that, facial and skin care look close, while the cleaning items get added later. So this is basically telling you at what level those URLs, those documents, get clubbed. Curios and wall stickers get clubbed here; lamps does get attached to them, but at a much, much later stage. Duvet and cushion-cover URLs get clubbed. Shampoo, hair care, hair colour: great. It's almost what a human being would have done, and it's not much more than about 15 to 20 lines of code. Very practical, very useful. And documents could be anything: documents could be tweets, emails, stuff posted by your customers on your Facebook page.

Yes? Sure. Oh, you want to see how the data is structured? Okay, so there is a function called str which prints the structure of an object. It just says that it's a corpus with 15 documents. And how is the raw data coming in, the data we're running these algorithms on? The raw data is very simple: these p_ files are nothing but text files grabbed by some scraper. It just hits a URL, gets the text content and dumps it to a file. So it looks something like this: just words and some other information. There'd be whitespace, punctuation, et cetera, and in the first phase of that code we remove all of it. So this is unstructured data? Kind of unstructured, yes; unformatted, exactly.

The next question: you've plotted the graph and related, say, the shampoo pages, but what if the data is completely unstructured? In insurance, if an accident occurs, an agent writes a description; that's unstructured data. I have a set of PDFs with all this data, and I have to structure it similar to this and then do a lot of analysis on those documents. You've used text mining with tm and removed punctuation and everything; how do I structure the data first and then do the analysis?

We happen to have that as the next thing. In the demo we just did, what was the most important idea? Focus on this part: you may have unstructured data, and it could be images, tweets, posts on a social network, but the structure that any of these algorithms need is a coordinate axis system and data points in that space. The structuring you're talking about is: how do I take this big hairy ball and transform it into that final form? That is the most important concept here. You can treat words as axes. If you have a bag of words, that becomes an n-dimensional space, where each axis is one word.
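To make that concrete, a minimal sketch with made-up word counts: each document is a vector of counts along word axes, and the cosine of the angle between two such vectors, which comes up next, measures how similar they are:

```r
# Toy bag-of-words with three axes (words); the counts are invented for illustration.
doc_a <- c(shampoo = 4, hair = 6, lamp = 0)   # a hair-care page
doc_b <- c(shampoo = 3, hair = 5, lamp = 1)   # another hair-care page
doc_c <- c(shampoo = 0, hair = 0, lamp = 7)   # a home-decor page

# Cosine of the angle between two document vectors:
# near 1 = very similar, 0 = orthogonal (nothing in common).
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(doc_a, doc_b)  # ~0.98: same kind of page
cosine(doc_a, doc_c)  # 0: completely different pages
```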
You have to be careful about which words you choose, because once you do, any document becomes just a point in that space. All the companies doing sentiment mining, document clustering and whatnot: the vector space model is the fancy term, but this is the basic idea. Each axis is a word, so if I have 100 words, I have a 100-dimensional space, and any document is just a point in that space. Once you get to that point, you have all the tools available, because you can calculate distances. We used to wonder in school who the hell uses cosines in the real world; I always had this question, why am I learning trigonometry? We learned how to find the cosine of the angle between two vectors. Now, if you have this document and that document, you have two vectors, and the cosine of the angle between them is very important. If the cosine is 0.98, and those documents represent something posted by two people on your company's social network, those two people are similar: sell them the same product. But if the vectors are at 90 degrees, the cosine is zero and they are orthogonal: completely different sets of customers, completely different sets of emails. That is how you do it.

In this exercise I used only single words; tokenization is done only at spaces. You can use bigrams, you can use trigrams, you can use positional bigrams; there is a lot of fancy stuff there. Whatever tokens you get, each token becomes one of your axes, and then anything you are representing, your tweets, your documents, your emails, is just a point in that space. And if these documents represent stuff posted by people on your social networking site, and you know that these people have spent on average a thousand bucks and those people five hundred, you can fit a classifier between them. Next time anyone posts something similar, you know: let's go after them. This is a very, very powerful idea. I hope I have answered your question.

On tokenization, the comment was: yes, we have done tokenization on tweets as well, a great many tweets from Twitter users, so a similar case. Exactly, tokenization works that way. But still, the questioner says, I am essentially structuring the data by hand; I have a case study where I have to convert unstructured data into structured data and then do the analysis, and I am struggling with that part. Let's take it up after the session; very interesting, and quite a lot of companies are working in this space. Just remember this concept: anything you encounter as unstructured data, if you can fit it to this model, you are good to go. Can we move on? Do you want a five-minute break? Yeah, let's take a five-minute break.

One comment from the break: the text was all from an e-commerce site, so most of the words come from the same context. For more general text you would need to determine the context, probably with some kind of look-up.