pregnancy-related products, and the father calls to complain; a few days later he calls back to check if everything is fine, and sheepishly admits that his teenage daughter is actually pregnant. So that's how powerful data can be — if you mine it properly and look for the right signals, it can tell you a lot. Machine learning is getting into almost every company you can think of. Take any particular company: every division can do analytics — marketing, human resources, supply chain, you name it. Any industry you think of — I can't think of one which doesn't use machine learning. It's becoming all-pervasive. So I would like you to give some examples of where you think machine learning has a part to play, just to get you thinking about where it can be applied. You can start from somewhere here — pick a division you understand and think of the problems it faces. I work for Symantec, in the security business. Threat detection itself is fairly clear, but doing it in real time — with the number of data points where you could actually identify a threat — needs some quite complex machine learning, which is not really possible today. So threat identification, security threat identification, that's one problem. Education: educational data mining is becoming a very big thing. That's what my company is about — we work with educational data about students: what do they do? We can predict what their style of thinking is, what kind of students they are, and how they can learn things themselves. I have actually worked on a project where we took a lot of data from the sensors of a helicopter and predicted the age of a part — when that part will need a repair, how it will age. Those kinds of predictions are being used. So you get a sense of the different kinds of things. Human resources: can they predict attrition? Can we create a model which identifies that someone is going to resign three months in advance, so we can start managing him better? That's one thing human resources can do. Marketing: we all get emails and SMS from real estate companies, car dealers; someone will call and offer you a loan. Most of these things, at least in developed economies, use machine learning to identify people who would be willing, or more likely, to buy a particular product, and then the campaigns are targeted at those people. So it helps in designing your advertisements and your marketing campaigns — those are places where it's highly relevant. When you type on Google and some ads come up on the right, that's machine learning at work. Supply chain: what's the shortest path to reach a particular customer, where to set up warehouses — these are examples of how you could use it in supply chain. Legal — that sounds very weird; what kind of stuff can be done in legal? One of the places where it's coming up a lot is text mining: take past, historical information about how cases have gone, and then try to predict, if such a case comes, what the judgment should be.
So it helps the lawyers and the judges provide better results based on historical information. That's one thing. Health care — what kind of stuff can be done there? You can look into the data and tell whether someone has cancer or not. Yes — automatic identification. With all this imaging data, there's been a lot of work on automatic diagnosis from MRI scans, even before it goes to a specialist who writes the report. Can the machine replicate what the specialist does? Fraud — predicting whether insurance claims are fraudulent. Yes, definitely: if a new claim comes in, would it be a fraud? That's something all companies do. Any bank which issues a credit card has a fraud algorithm which decides, based on the sequence of transactions, whether fraud is happening or not. That's very widespread. Gene sequencing — trying to find what the optimal gene sequences could be. Energy — energy is becoming huge these days. A lot of companies are trying to forecast what the energy requirements will be and how to set up infrastructure for that; it's being done widely in a few countries. Sports — what can be done in sports? Fixing matches better, someone says. Germany took advantage of big data processing in coming up with their strategy, and they even found patterns in the data about when goals tend to be scored. Cricket analytics too — which ball to bowl to Raina, something like that. Sports analytics is very lucrative in the US. Moneyball — we have all heard about the book; watch the movie if you haven't. That is also about using a lot of analytics to identify undervalued players and put together a winning team cheaply. So these are different examples of what can be done. One common thing in all these places is that you need a lot of data first, right? Whether it's the retail story, the medical stuff, legal, or software threats — you need to know historically what happened. You need past data, and you need to know the outcome. When you have all that information, you can use it to build predictive models. Now, there are many ways to categorize problems, but two things you want to differentiate are classification versus regression. Can someone tell me the difference? Classification is for discrete values. Yes — classification is when you want to bucket a record into a particular category. For example, fraud: you want to know whether a transaction is fraud or not; there is no third option. That's what a classification algorithm does. Regression is continuous — for example, load forecasting. You want to forecast the energy demand for a particular region, or forecast what the sales will be, how much the sales will be. Those are things which are continuous in nature.
It can take any real value, while classification is very definite. Someone mentioned computer vision — whether it's a cat or a dog is very definite; it's either this or that, nothing else. That's when classification comes in. Now, what is machine learning? There are formal definitions, but all it means is that when you have data, you want to learn what the system does — and that captures what we'll be doing today. So, how many of you have used R before? And machine learning — prediction, classification — how many of you have done that, in any language? Let me describe the process at a very high level; this formalizes what we talked about before. We have a lot of data, and this is the process we will implicitly follow whenever we use any of these algorithms. We assume the data is generated by some function. We don't know what that function is. Whether it's fraud or energy forecasting, whatever we are looking at, we assume it comes from a particular distribution — some process generates the data, and we don't know what it is. The aim of a machine learning algorithm is to find an approximation so that we end up somewhere near that target distribution. All we have is the observed data — nothing else; we don't know the target distribution. And we have a lot of algorithms at our disposal — we'll talk about some of them: logistic regression, linear regression, support vector machines, neural networks. There are different kinds of algorithms in place. The aim is to fit a model to the data so as to minimize the error — to ensure that we have a model which is a reasonable approximation of the unknown target distribution, and then use it in practice. That's all we're going to do: we don't know the target, but we have the data, so we do some learning. This is what we will do with every algorithm. Of course, there is an infinite number of models we could select from, and we will select one such model as our final prediction model. Which model should we select, and how? Those are things we'll talk about today. Before we get into that, let's talk about the different learning paradigms. The major categories are supervised learning, unsupervised learning, reinforcement learning and online learning. Supervised learning is when we already know what has happened. For example, fraud: for past transactions we already know whether a fraud was committed or not. Another example is someone defaulting. You have a credit card, you keep making a lot of purchases, and at some point you stop paying. The company really wants to know: is it an aberration, or are you really not going to pay? That's risk mitigation — they want to know whether it's fraud or a genuine default. Credit cards are one thing, but home loans — that's what triggered, or was one of the major reasons for, the downturn: people not paying back their home loans.
Can you predict the default behaviour for various customers? It can even be organizations, not only individual customers. So supervised learning is when you have the final labels: you have historical information, you know whether something happened or not, and that outcome is the label. You take the training data, create features out of it, feed it into an algorithm along with the labels, the algorithm works to minimize the error, and you come up with the final model. Then new data comes in. Say you're trying to predict whether someone will default, and you have a model in place. A new set of data arrives, you create the same set of features, feed them into the model, and you get an output which tells you whether those people will default or not. That's a very simplistic representation of supervised learning. And the predicted output is always one of the labels — you know what the business problem is, right? If you want default yes or no, there are just two options; if it's cats or dogs or something else, you have three options. When we talk about classification, the set of outputs is finite. Compare this with a regression problem: you have all the energy usage information for a particular city and you want to forecast the energy requirement for the next three months. You create a model, and the output there is a real number. Whatever the model, the output type is set by the problem: in classification you already know the class labels you want; in regression the output is a continuous value. Unsupervised learning is slightly different, in the sense that you don't know the class labels — you don't use class labels at all. You just have the features, just the observed information. Let's see: suppose you want to predict whether someone will default on a credit card loan. What could the features be? Demographic information. Income. Credit score. Past historical transactional information. Past payment history. All of that goes into one person's feature vector, and with a supervised algorithm you also know whether that person defaulted. With an unsupervised algorithm you only have the features. The aim of unsupervised learning is to come up with a better representation. One reason it's needed is that there might be too many features: you don't want to work with all of them, you want some subset, or the data simply cannot fit into memory. Then you want an algorithm which can represent the data in a lower dimension. That's one reason to use unsupervised learning. Principal component analysis is one way of doing dimensionality reduction — we'll talk about it later. It's a way of reducing, say, a thousand variables down to only a hundred.
Then you would use an algorithm like that — that's unsupervised learning; we'll talk about a couple of those methods. Reinforcement learning is when all the data is not available to you right away. You have some data and some outputs, you work with what you have, you predict, new outputs come in, and then you modify the model. It's used a lot in games. Online learning is when you predict something and immediately get the outcome; you use that to retrain the model, predict again, get the outcome again. A classic example is your browser behaviour: you do something, you immediately click, then you do something else. That's a place where online learning is used. Supervised and unsupervised learning are both batch — they are not really real time, so keep that in mind. You have all the data, you optimize based on the historical data you have, and you come up with the output. Before we get started on the hands-on, let's talk a bit about pre-processing. We'll introduce one or two algorithms, then go to the implementation, then come back and do more algorithms. When we have data, the first thing we do is pre-process it. If we don't, it can cause a lot of problems — it could be wrong data, it could have outliers — and these things mess up the machine learning algorithm. So we need at least a bit of pre-processing. One of the first things is a basic sanity inspection of the data: you look at all the features, and for each one you check whether there are outliers, and if there are, how you want to handle them. Do they make business sense? For example, if age is one of the features and someone has an age of 865 — is that true? Does it make any logical or business sense? You have to evaluate the information. As I said earlier, you should have the big picture in mind: look at what's happening around the data and then get started. The next thing is missing data. This is something we see a lot in practice: you have a very important feature, but the data is missing for many records. What kind of things can you do to replace missing data? You can substitute it with an average — that's a very common practice; we call it data imputation, imputing missing values with the mean of the column. Anything else? For the records where the data is available, find similar records and take their mean. Right — you can use the other columns and build a model for this particular column: build a model for the missing records and fill them in that way. But you can see the trade-off: the mean is a very simplistic approach and very easy to do, while building a dedicated model for the missing values is complicated and really difficult. If you think that feature is very important, then you go ahead and do it.
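To make the mean-imputation idea concrete, here is a minimal R sketch. It assumes a data frame called input_data with a numeric column named monthly_income; both names are illustrative, not necessarily what the workshop file uses.

    # simple mean imputation: fill missing values with the column average
    col_mean <- mean(input_data$monthly_income, na.rm = TRUE)
    input_data$monthly_income[is.na(input_data$monthly_income)] <- col_mean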
Outliers — what do you do about outliers? If it is multivariate data, you build a plot and look for them. But how exactly do you look for outliers? The first thing you do is see what the distribution of each feature looks like; then you know whether it has outliers or not. If a particular feature has outliers, one of the most common things practitioners do is cap the data: you cap it at the 95th or 99th percentile on one side and at the 1st or 5th percentile on the other — you cap at both ends. For example, someone has an income of minus 200,000 dollars. That's definitely an outlier; it's not missing data, but we know it's erroneous. How do you handle it? You cap it at zero, or you look at the distribution, find the 5th percentile, and cap it there. That's something that's commonly done. The next thing is something that's effectively mandatory for most algorithms — if you don't do it explicitly, the algorithm won't do it for you: standardize or normalize the data. When you normalize, all it means is that each column ends up with mean zero and unit variance; that's what you aim for when you normalize. Standardization is similar, just that you don't necessarily scale to unit variance — you can divide by the range instead. The reason we do this is to put all columns on an equal, comparable footing. Income will be in the hundreds of thousands of dollars; age will be less than 100. When you build a model, one particular column can easily dominate the others just because it has large numeric values. To account for that, you make sure everything is within the same range. This is essential: if you don't do it, the output gets messed up. The algorithm will still run, but the results won't be reliable.
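As a rough sketch of the capping and scaling just described — again with an illustrative data frame and column name, and assuming you would leave the 0/1 target column out of the scaling in practice:

    # winsorize an income-like column at the 1st and 99th percentiles
    caps <- quantile(input_data$monthly_income, probs = c(0.01, 0.99), na.rm = TRUE)
    input_data$monthly_income <- pmin(pmax(input_data$monthly_income, caps[1]), caps[2])

    # normalize every numeric column to mean 0 and unit variance
    # (on real data, exclude the target column before doing this)
    num_cols <- sapply(input_data, is.numeric)
    input_data[num_cols] <- scale(input_data[num_cols])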
Another thing we do is find a better representation when the data doesn't fit into memory; in those cases we use principal component analysis, and after that comes feature creation and transformation — we'll talk a bit about both. Principal component analysis is an unsupervised kind of algorithm: you take a lot of features and try to bring them down to a lower dimension. Say you have 100,000 columns — 100,000 columns with a million records will not fit on your laptop. So you come up with an algorithm which can reduce it to the top 100 or 200 variables, and then use those for your models. A question: with this data reduction, how can you ensure that the accuracy of prediction is not compromised? That really comes down to which model you select. I build one model with the actual features and another with the principal components — which of the two should I select? We'll talk about it; it's the same question as "should I use logistic regression or a support vector machine" — two competing models, which one do I use? There's no direct answer. Is there a loss of information? People argue about this. When you use principal components, there is a loss of information — you're not using all your features. The idea is that the components you keep should capture the maximum variance available in the data. If they do, you can go ahead and use them: you're capturing most of the variation — not all of it, most of it; that's the right word. You can never tell in advance, and that's why we use model evaluation techniques, which we'll cover. To repeat the question for everyone: instead of using all the features, we transform the variables and use a subset, so is the quality of the prediction compromised? A good analogy is JPEG compression: you compress the data but keep most of what matters. Isn't this used more for visualization than for actually doing predictions? It's used for both. In computer vision and similar areas, until deep learning came into the picture, principal component analysis was one of the first steps people did, precisely because there was more data and more features than would fit into memory. When you have more features than you can handle, you come up with a smaller set so you can work with it. Ultimately you need a model, and if your data cannot fit into memory, you have to find something that deals with that — this is a place where PCA definitely helps. And again, we'll talk about model validation, where it doesn't matter what your features are; what matters is how the model performs on actual, unseen data. Whether it's principal components or the raw features, if it works well on real data, that model is a close representation of the unknown target, which you never really know. Next, feature creation and transformation. Sometimes a feature on its own is not very significant — it may not tell you much about the target variable. In those cases you might want to transform it: take the log, square it, cube it, multiply features together. Two features by themselves may not explain why something happened, but their product might. You can think of everyday examples: I behave one way, my wife another, but once we're married, taken together you can predict how our purchase behaviour will be. A concrete example from facial recognition: some of the features used are the distance between the eyes, the distance between an eye and the nose, and so on. The raw distances can't really be used for comparison because of scaling, so instead you take the ratios — for instance, the ratio of the distance between the eyes to the distance between the eye and the nose — and use those as features. That has turned out to be significant.
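A small sketch of both ideas in R: principal components for dimensionality reduction, and a hand-made ratio feature. prcomp is base R; the column selection assumes the missing values have already been imputed or dropped, and the face-measurement names at the end are purely hypothetical.

    # PCA on the numeric features
    feats <- na.omit(input_data[, sapply(input_data, is.numeric)])
    pca <- prcomp(feats, center = TRUE, scale. = TRUE)
    summary(pca)              # proportion of variance explained by each component
    reduced <- pca$x[, 1:5]   # keep the first few components as the new features

    # a derived ratio feature, in the spirit of the face-recognition example
    # (eye_distance and eye_nose_distance are hypothetical column names)
    # face_data$eye_ratio <- face_data$eye_distance / face_data$eye_nose_distance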
So how does transforming a feature — say replacing it with its log — actually help? If I have salary as a feature... I'm not saying salary is necessarily a feature you take the log of. The way you decide is to plot, and to look at the correlation between the two variables. If the target is correlated with the feature as is, you use the feature as is; if it's more correlated with the log of it, you use the log. You don't know in advance — there's no hard and fast rule. Before you start, you make a plot and look at it. Once you plot it, you get a sense of the behaviour: if the plot looks exponential, or is all over the place, you'll know what kind of transformation would help. Some systems are inherently exponential — the response grows or decays over years — so if you are predicting something of that nature, you may want to take a log. Let me give you another example. All of you use credit cards, right? When you get a new credit card, the chance of you cancelling it is highest in the month or two after you get it. Someone would have got your signature, you'd have received the card, then you come to know it has a yearly charge, you don't want to spend on that, so you call and cancel. It almost always happens in the first two or three months. But if someone has been using a card for a long time, the chance of that person cancelling is very low — they've kept it for so many years, they're not going to cancel. So the curve looks like that, and you would know it. Another example is the probability of dying: in infancy the probability is relatively high, then it drops through the teens and middle age, and as you grow older it rises again. These are common shapes of distributions which companies use to model this kind of thing. Those are cases where you know from your business what kind of behaviour you're looking for, and you can think about transforming your features to fit that kind of distribution. One more point of view on transformation: in a sense, we only really know how to do linear algebra, so whenever we have non-linear relations, we use a transformation to get back to something linear. For instance, if the relation between y and x is quadratic, you turn it into a linear relation between y and x squared, because we don't know how to do anything except linear models. That's trivializing it, but that's essentially what it is. I agree with that, but the reason we do it is that linear models are far more tractable than non-linear ones. We're not covering the mathematics here, but doing something in a linear space is always easier — it's tractable, it can be solved in finite time, unlike many non-linear problems. As for non-linear methods, I'll talk about that in a couple of slides.
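One way to check this on data is simply to compare correlations with and without the transform. This is only a sketch — it uses the credit data frame we load later in the hands-on, with illustrative column names, and log1p rather than log so that zero incomes don't blow up.

    # is the target more related to the feature, or to its log?
    cor(input_data$default, input_data$monthly_income, use = "complete.obs")
    cor(input_data$default, log1p(input_data$monthly_income), use = "complete.obs")
    plot(input_data$monthly_income, input_data$default)   # eyeball the shape of the relationship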
This kind of transformation is very common — for example, in support vector machines we transform a multi-dimensional feature space into a higher-dimensional space using kernels to get a separating hyperplane. That can help. (Loosely speaking — it's not exactly a Gaussian space; we won't get into that.) The key concept is that the "linear" part refers to the coefficients, not to your features. The features themselves can be combinations of other features — it's about how you generate them. Once you've decided how a feature is generated, then when the test set comes, you apply the exact same transformation before feeding it into the algorithm. The algorithm fits a linear coefficient for each feature — it finds a real value for each feature variable; that's what linear regression does. When you do something genuinely non-linear, you're saying a feature should enter with some power to be estimated, and that's a lot harder. You need a lot of calculus. I don't know how many of you remember calculus, but if something has to be optimized, you need to know whether it's a minimum or a maximum, the first derivative has to be zero, and you can get stuck at local minima. Things take a long time. That's one reason we don't get into non-linear models that easily — there are very good linear approximations. And yes, that's where the concept of local minima comes in; remind me when we get to that part and I'll explain what happens. One more thing, which should really have come earlier: you want to visualize and summarize your data before you start modelling, so you know exactly where things stand. You should have an intuitive understanding of what your features are, what the distributions look like, what kinds of values make sense. Again, it ties back to what I said: get the big picture first, and keep it in mind when you start the modelling process. The first algorithm we will look at is linear regression. Any questions so far? All of you have R and RStudio installed on your systems? Anyone who doesn't? If you have R, you can work; anyone who can't install it can pair up with someone who has it. Okay, questions? Going back to what you said — you said normalize the data; what does that mean exactly? When I say normalize the data, the way you do it is this: x minus mu over sigma. You take each value in the feature vector, subtract the mean of the column, and divide by the standard deviation of the column. That's normalizing. When you standardize, there are other options: instead of dividing by the standard deviation, you divide by the range — you take the value, subtract the mean, and divide by the range of the column. That's called standardizing.
So you can either standardize or normalize. The preference would generally be to normalize, but there are cases where standardizing gives equally good results — particularly with certain solvers in place, dividing by the range makes sense. On installation: you need to install both — first install R, then RStudio; don't install RStudio before R. If you have R working, that's good enough. We didn't talk about kernels in detail. Kernels are another way to transform your data into a higher-dimensional space: instead of explicitly computing the full higher-dimensional representation, you use a kernel function to do it. It's predominantly used for support vector machines, but you can use it for transformations; we'll look at it during the hands-on. Any other questions? Okay — linear regression. You have a lot of data points all over the place, and you try to fit a line. This is something we would have done in school or engineering college: y equals mx plus c. The equations look like this: these are your features, and the betas are what you're going to estimate. That's what the algorithm does — it finds the beta for each of your x's. Ultimately you end up with an equation, and once you have it, when new data comes in you substitute the feature values and you get your predicted output. It's the simplest of algorithms — I'm sure many of you have even tried it in Excel: plot some points, right-click, add a trend line; what it fits is basically a linear regression. The way it's fitted is the least squares approach. For each row of your training data, you know the actual output, and the model predicts something. You take the difference and square it — errors can be negative or positive, and squaring puts them all on the same footing — and you minimize the sum of those squared errors. That's minimizing least squares; it's the objective function for linear regression.
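In R the whole least-squares fit is one call to lm. A minimal sketch, with illustrative column names and the train/test data frames we will build later in the hands-on:

    # fit: income explained by age and debt ratio (names are placeholders)
    fit <- lm(monthly_income ~ age + debt_ratio, data = train)
    summary(fit)                            # the estimated betas
    pred <- predict(fit, newdata = test)    # plug new feature values into the fitted equation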
Now, for classification, what's the problem with this? Say we want to predict whether someone defaulted or not. You can still fit it: you have all the features, and your y value is the default flag — 1 if someone defaulted, 0 if not; that's how you code it. You fit the model, but you can see where this goes: the output is not bounded by 0 and 1. Linear regression doesn't give you a natural bound — depending on your features, the value can range from minus infinity to plus infinity. So you have to define some arbitrary threshold: if the output crosses it, call it a default; if not, not a default. And now you also understand why we want to standardize or normalize the data. Say one of my features is salary, and in my training data the maximum salary is only $20,000. Suddenly someone comes along with $2 million and your y value goes through the roof, which means your prediction is not robust. That's another reason we standardize — so the model isn't thrown off by one feature's scale. So the problem with linear regression for classification is that you have to define your own thresholds; there's no automatic way of constraining the values. But it is one of the simplest algorithms, it's very fast, it's been around for a long, long time, and almost invariably it's the first algorithm we all start with. (A short announcement: the Wi-Fi has been temporarily set up; please use it only if absolutely necessary, as there is a slight bandwidth problem — it's the same network name and password mentioned previously. Thank you.) Logistic regression uses the logistic function, which is e to the x over 1 plus e to the x — the same as 1 over 1 plus e to the minus x; divide numerator and denominator by e to the x and you get the other form. The logistic function is always bounded between 0 and 1, so it's very natural to use for classification: you know your outputs will be between 0 and 1, and then it's easy to define a threshold — say 0.4: anything above 0.4 is a default, anything below is not. You take the same linear model we just discussed, feed it through the logistic function, fit it again, and estimate the betas — that gives you your logistic regression equation. It's one of the most widely used classification algorithms. For credit scores, for example in the US, until a few years back almost everyone used logistic regression; it was the de facto algorithm for credit scoring. So, two algorithms. What we'll do now — I have more to cover, but let's do a bit of hands-on. We'll do some data summarization and train a logistic regression model, then come back and continue. What do people use for credit scoring now, since you said "until a few years back"? We'll come to that — these days people also run things like random forests and regularized logistic regression, which we'll talk about later.
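For reference before the hands-on, a logistic regression fit in R looks roughly like this — a sketch assuming the train/test split we are about to build, a 0/1 column called default, and missing values already imputed:

    # logistic regression: glm with the binomial family
    model <- glm(default ~ ., data = train, family = binomial)
    probs <- predict(model, newdata = test, type = "response")   # probabilities in (0, 1)
    pred  <- ifelse(probs > 0.5, 1, 0)                           # threshold to get class labels
    table(predicted = pred, actual = test$default)               # quick confusion matrix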
So this is RStudio. It's like when you have Java, you use Eclipse — I don't know if people still use Eclipse, but previously they did — it's something like that: an IDE for R, basically. It's very good and it's the most widely used IDE for R. You have a console where you can type commands and the output comes immediately; the environment pane shows whatever variables are stored; and the history keeps all the commands you've run. The first thing to do is check what version you have — it should be version 3 or above; I don't have the very latest one myself. The next thing is to see what your current working directory is — you might want to create a directory for this session. The way to set it is setwd, with the path of wherever you want it. Now I read in the data. There's a command for that; this arrow is the assignment operator in R, and it's the standard way of doing it. We read the CSV — and we also want to know how much time it takes, so we wrap it in system.time, which reports how long the call takes. How many of you don't have the data? A few people — we'll have a break soon, so just follow along for now and copy it from a pen drive during the break. Once the data is loaded, you can go to the environment tab: the data has a bunch of columns, which are the features. The aim of today's session is to predict default — whether someone defaults on a credit card. We have a list of features, and default is coded as 1 or 0: 1 is default, 0 is not default. Instead of clicking around, the other way is to do a head of input data, which prints the first few rows with all the columns — fine when you have few columns, not very readable when you have a lot. The first thing we want to see is the column names: names of input data gives the list. Default is what we want to predict; that's the target we work with. Then we have a list of columns: number of unsecured lines, age, income, debt ratio, number of dependents, the category of the state, and so on.
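The loading-and-first-look steps, roughly as typed in the session; the directory path and file name below are placeholders for wherever you copied the data:

    setwd("path/to/your/working/directory")                  # set the working directory
    system.time(input_data <- read.csv("credit_data.csv"))   # read the CSV and time the call
    head(input_data)        # first few rows, all columns
    names(input_data)       # column names; 'default' is the target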
Now, it's not always true that the data gets loaded in the format we need, so let's check. These columns are all numeric, and we have one categorical variable; all the other variables are numeric. When you have data, it's generally a mix of real variables, integers and categorical variables. Categorical variables themselves can be of two types: it could be, say, a grade, where we know grade A is better than grade B, which is better than grade C — or the categories could be totally separate, with no ordering or comparison between them. That's the difference between ordinal variables and nominal variables. So now we have the data, and the first thing we're going to do is see what each column is. Before that — hi everyone, I'm from the organizing team; here are some pen drives with the data file plus instructions on how to import it, so please pass them around and copy the contents. Okay. We saw the names of all the columns; now we want to see whether the types they were imported as make sense. The way to see the types is with the apply family of functions — apply takes the data set, whether the function should be applied row-wise or column-wise, and the function itself; lapply simply applies a function to every column. Here we have default as integer; numeric is basically real; and factor is the same as categorical. Now, how many of you know SQL? Anyone who doesn't? Everyone knows it. Suppose I want to know the unique values present in the state category — in SQL you would select distinct state category from input data. In R you do it like this. That's both the problem and the nice thing about R, whichever way you look at it: there are multiple ways to do the same thing. When you learn R it can be frustrating, because there are too many ways — for example, to access that one state category column there are almost a dozen ways. This is one: you know the data set and the column name, and the way to select a column by name is with the dollar sign. Some of you might have worked with matrices before — there it would be my matrix with a row and column index; it's exactly the same idea. And unique is the command that gives the distinct values.
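A small sketch of those two checks; state_category stands in for whatever the categorical column is actually called:

    lapply(input_data, class)            # class of every column, returned as a list
    sapply(input_data, class)            # same information, as a plain named vector
    unique(input_data$state_category)    # like SELECT DISTINCT state_category in SQL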
Now we are going to use set.seed. Can anyone tell me what set.seed does? It randomizes — more precisely, it makes the randomization reproducible; that's what it does. Java has something similar. What is it randomizing? We'll see in a second: when I set a seed of five, I always get the same randomization. The reason we do this is that we're going to create something called the train list. I'm going to take 80% of my data as the training set: we'll build our models on 80% of the data, then evaluate on the remaining 20% and see how the accuracy is. sample is a function which draws random values — here I'm telling it to sample from between 1 and 150,000. How many numbers should it generate? 80% of the rows of input data: input data has 150,000 rows, so we need 120,000 numbers. It generates them, and you can look at the result in the environment. By the way, the way to remove an object is rm — I've removed mine, but you keep yours so we can compare. If I run it again, is it the same set of numbers? Yes, it will be the same 120,000 numbers it generated before. Can you explain once more what we are training on — what's the objective? Sure. We want to predict whether a particular customer will default or not. This is credit card information: what is the revolving utilization, i.e. how much of his credit has he used; what's his age; the number of times he was 30 to 59 days past due; his debt ratio; his monthly income; number of dependents; the category of the state he's in. It looks like real data. We are trying to predict whether someone will default: you're given this information, you create a model, and when new data comes you want to predict whether those people will default. So what we're doing is taking 80% of the data, building a model on it — basically a logistic regression — and then predicting on the remaining 20% to see how accurate the model is. The train list right now is just a vector of 120,000 row numbers; the data itself has 11 or 12 columns, and those are your features. So, to remove something from the environment you use rm. Now I create the train data set. Indexing is always row comma column: the rows I want are whatever is in train list, and that gives me train. What will my test set be? Basically input data minus train list — whatever is not in train list goes into test. You can see train is created with 120,000 records and test with 30,000. Someone found that running nrow on train list gives NULL — that's because if you check the class of train list, it's an integer vector, and rows and columns only make sense for matrices and data frames. Data frame and matrix are two important structures in R, and both are like tables in SQL. A matrix is homogeneous — all the entries are the same type, say all numeric. A data frame is like a SQL table: it can have characters, text, one column can be a date — that's the difference; a matrix is all one type. What was the idea behind set.seed again? Reproducible random numbers — specifically here, for the sample. I'm sampling 120,000 numbers out of 150,000; I can clear everything and run it again, and even if I delete my train list, running it again will still create the same list. We need that to ensure everyone gets the same split.
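Putting the split together, a reproducible sketch (the 80% figure and the data frame name are as described in the session):

    set.seed(5)                                              # same seed, same sample every run
    train_list <- sample(1:nrow(input_data),
                         floor(0.8 * nrow(input_data)))      # 80% of the row numbers
    train <- input_data[train_list, ]                        # rows whose indices are in train_list
    test  <- input_data[-train_list, ]                       # everything else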
So these numbers are not coming from the CSV file — they have nothing to do with it? Nothing to do with the CSV file; this is all our doing. So train list now contains 120,000 random row numbers. Correct. And when we look at the head, does it show the row numbers that were picked? Correct — when you do head of train list, it shows the first six row numbers that were picked. We'll do a bit of visualization and then take a break. I'm going to move quickly here because these are fairly straightforward. This is the dimension of my training data set: 120,000 rows by 12 columns. You can also get just the row count or the column count. You can check the class of train list — it's an integer vector, as I said — or the class of an individual column of train. Summary, for every column, tells you the minimum, the first quartile, the mean, median and maximum. Interestingly, it also tells you that number of dependents has some nulls, and there are nulls in monthly income — a couple of variables have missing values: monthly income and number of dependents. Then I ran a loop. This is how you write a for loop — very similar to other languages — from one to twelve, and inside it I print the column name, the class and the summary. With the earlier approach it all came out together; now each column is printed separately: here's the name, it's an integer, these are the values. Any questions on this? It's the same as what we did earlier with lapply, which gives everything in one go; in the loop I'm just printing column name, class and summary together for each column. And in train we have all the rows corresponding to the numbers in the train list — yes, you can open train here, see the row numbers, and go back and check against the input data whether, say, row 42459 carries the same information. Now, when I try to take the maximum of monthly income, it comes out as NA — for age it works, it gives something like 109 — and that's because monthly income has missing values, so you have to tell it to ignore the NAs. So that's summarizing the information. If you want to change a column name, you can rename it — but to change it in train you have to change it in place as well. And of course you can create tables and see the counts: out of 120,000 rows, each of the four categories has roughly 30,000. Then you can do plots — a bar plot, for instance: counts is basically a table, so you create a table of the column you want and pass it to barplot.
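A sketch of that summarizing loop and the table/bar-plot step, with illustrative column names:

    for (i in 1:ncol(train)) {
      print(names(train)[i])       # column name
      print(class(train[[i]]))     # its type
      print(summary(train[[i]]))   # min, quartiles, mean, max, and count of NAs
    }
    max(train$age)                           # fine: age has no missing values
    max(train$monthly_income)                # NA, because the column has missing values
    max(train$monthly_income, na.rm = TRUE)  # ignore the NAs
    counts <- table(train$state_category)    # counts per category
    barplot(counts)                          # bar plot of those counts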
Another thing you can do is a histogram — say, a histogram of age. It gives a good picture of the distribution. You can also just do a plot, which gives a scatter plot, and now we can see that there are outliers. People were asking how to identify outliers — this is how: most of the points are clustered, and a few stand out, so we make sure we handle those. You can also subset the data when you plot: plot only the rows where a particular value is less than 50,000. We know the column had outliers, so I'm cutting it off at 50,000, and now it looks reasonable — compare it with the previous plot and you can see the difference. What's that number inside the bracket? It's the column index — remember, indexing is always row comma column: I want all the rows where the value is less than 50,000, and then that column number. The row part can be a number or an expression — it just has to evaluate to a set of rows. Good point: what happens when I put a condition in there? It evaluates to a logical vector, true or false for each record; where it's true, the record is kept, otherwise it isn't. What about the axes and the bins? The function sets them automatically, but if you type a question mark before the function name you get the help page, where you can look at the syntax and set your own breaks and frequencies. Okay, we'll take a break now; when we come back, we'll do some more, and I'll go over all these concepts quickly. Now let's talk about creating models. There's something very important that people should know, called the bias-variance trade-off. Here is some data I generated: it's a sine curve, and I distorted some of the values — added some noise. So we know the source is a sine curve, but let's see what happens when we try to fit it with a complicated algorithm. I can write an algorithm which connects all the points — something like a spline; you can fit a polynomial regression which passes through every point. But you can see the problem, right? Look at what happens when you extend it: we know the true curve goes up, but the fitted model goes down there. This is what's called low bias, high variance: it fits the training data perfectly, but it won't work well when new data comes in. A simplistic model would be something like a straight line. The catch is that it has more training error — we call that higher bias — but it will have lower variance on your actual test data. Which one should we select? There's no fixed answer; it depends on the big picture, on what your application demands.
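A tiny simulation of exactly this, in R: noisy sine-curve data, one deliberately simple fit and one deliberately flexible fit. The degree-15 polynomial is just an illustrative stand-in for the spline shown on the slides.

    set.seed(1)
    x <- seq(0, 2 * pi, length.out = 25)
    y <- sin(x) + rnorm(length(x), sd = 0.2)   # the "unknown" target plus noise
    simple   <- lm(y ~ x)                      # high bias, low variance
    flexible <- lm(y ~ poly(x, 15))            # low training error, but it chases the noise
    plot(x, y)
    lines(x, predict(simple))                  # straight line: misses the curve
    lines(x, predict(flexible), lty = 2)       # wiggly curve: fits every point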
It's usually better to use a simpler model — there's a concept called Occam's razor that says something similar. Won't the simpler model have more training error? Yes — hold on, that's my next slide. What I'm showing here are just two competing models: with absolutely low bias you can fit everything, as I did here, but you get the idea that you can trade low bias against high bias, and it's generally better to go for something simpler than something more complex. A small practical example: this is how my data looks — take any sine-curve-like application — and when I run a machine learning algorithm I have two options for fitting it. Which one should I select? Keep that question in mind; it's a core machine learning concept. In machine learning, here is how the word bias is used. Bias is, when you have no input, how much you are offset by — in linear regression, that's the intercept, the bias term. Even with respect to my input I'm still biased, but I can optimize to a point where the bias is lower or higher. The aim is to get to a level where the model works well. Now, the curve I showed here is non-linear, but again, I could transform into a higher dimension and do a linear separation — it's not guaranteed to work, but even this non-linear model can be approached with a linear model on transformed features. So what exactly are low bias and high bias? The way to understand it: high bias means error in your model with respect to the training data — that's right; bias is your error on the training data. And your variance concerns your unseen data, which you don't know. So how do you find that out? We'll talk about it — there's something called cross-validation that gives an estimate of what your variance is going to be. Bias and variance are inversely related, and the aim is to reach a trade-off where your model works reasonably well on unseen data. That's the big picture; how we do that is what we'll look at. Are we the ones setting the bias? No — it comes automatically from the model; bias is a property of the model. Variance concerns unseen data, and we don't know how it will turn out. Look at this data and the two competing models: I happen to know I generated it from a sine curve, so the curve should go up next — but in reality, when you're predicting default or anything else, you don't know what the output will do, so you can't know the direction. That's why it's generally better to go with a simpler model than a complicated one, and that's why we call it the bias-variance trade-off. Do you want low bias and high variance, or high bias and low variance? It should be somewhere in between. What would be the extreme high-bias, low-variance model? Variance means the output doesn't vary as your input changes — so the extreme case is a flat straight line: whatever my input, my output is the same. That's high bias, low variance. There's no variance in my output, but it hasn't really learned anything, right?
Still, that flat line is relatively better than nothing, and the simple curve is better than the overfit one. So when do we stop? That's what I'll talk about now: overfitting. As someone just said, the spline model is entirely overfit; it has memorized every training point. If you go back to the diagram I showed first, the machine learning thought process, there is some unknown process that generates the data, and that process has inherent noise. We can never model the noise. Overfitting is exactly that: the model ends up fitting the noise, so we lose the true underlying behaviour. It generally happens when you fit a very, very complex model, and you usually won't notice it when you build the model; you find out when you deploy it in production, where it can cause a lot of problems. How do you avoid overfitting? Two common techniques: regularization and cross-validation. You don't pick one of them; the way it's done in practice is to use both, and each of them helps you balance the bias-variance trade-off. Regularization first. This is the only mathematical equation in the entire session, so bear with it. In linear and logistic regression, what we saw is that we minimize the usual least-squares error: you have an input, you know the actual output, you predict something, and you try to minimize the difference. Regularization adds a constraint on the weights on top of that: you can add a penalty on the squares of the weights, or a penalty on their absolute values. Those are the two options. When you use only the squared penalty it's called ridge regression; when you use only the absolute-value penalty it's called lasso. These are the common types of regularized regression, and the same idea works for both linear and logistic models. One nice property of lasso is that, because of the way the penalty works, coefficients can become exactly zero, so it does feature selection as well; ridge keeps all the variables and just shrinks the coefficients, constraining how large they can get. Ridge is easier to optimize; lasso is slightly harder. What people have been doing in the last few years is to use both penalties together, with weights that sum to one, and that's called elastic net. Those are the three common regularizations. Before we talk about cross-validation, you should understand the setup. We have the observed data, and we don't know what the variance on unseen data is going to be. So you split your data into two sets, a training set and a validation set. You build the model on the training set and you use the validation set to see how it performs; that's treated as a pseudo estimate of the variance, your stand-in for performance on unseen data. You pick whichever model gives the lowest error on the validation set, and then you have a separate test set that you run the chosen model on at the end. The problem with this approach is that you can have a bad day: you can happen to pick a validation set that completely misleads you, with all the awkward data points landing in it, and you end up choosing the wrong model. How do you overcome that?
One way is to use five or ten folds instead of a single split. You build the model on the training folds and validate on the held-out fold, and you repeat that for each fold. Each time, you fit a model, predict on the held-out set, and since you know the actuals, you get an error estimate for each of the five or ten folds. That's called 5-fold, or in general k-fold, cross-validation. You run the model ten times on ten different partitions, you know the error on each of them, and you take the average; that's your cross-validation error. You can do this for different models: run an SVM, run logistic regression, and pick the model with the lowest cross-validation error. Asha is going to talk about this in detail tomorrow. That's one way of doing model validation. So you've built all the models and each gives an output; which one should you select? These are the metrics you look at: precision, recall, sensitivity, specificity. They tell you how good you are at detecting the true positives and how good the test is at avoiding false alarms. Take fraud detection: is it more important to catch every possible fraud, erring on the safe side, or to flag something only when you're very sure? That choice matters a lot. Here's an everyday example. You go to a grocery store with a ten-dollar coupon, they scan it, and it turns out it has expired. It's usually better for the store to honour it anyway, erring on the generous side, than to refuse it, because there's a good chance you'll be a repeat customer. So it always depends on your application: you decide on these metrics and you decide whether recall is more important or precision is more important, and that tells you which way to go. Asha will cover a lot more about this tomorrow, so you'll hear it again. All of this is also related to Type I and Type II errors, if you've taken a statistics course. Question: so depending on which measurement we choose, can we really change the results? Yes, you can. How? Good question. This is your objective function. You can define your own objective function: instead of saying I'm going to minimize least squares, you can define the metric you care about and write an algorithm that minimizes that. That's what happens in healthcare, in legal applications and so on: when the traditional loss doesn't fit, you pick your own metric and write an algorithm that minimizes it. It's very common. And even if you minimize your chosen metric perfectly, that's not necessarily a good thing either; you can still fool yourself. That's exactly why you do cross-validation. Could the cross-validation data itself be a bad sequence? You do ten-fold cross-validation, so it's not one unlucky set of points; it's ten different partitions. With 100,000 records, you have about 10,000 records in each fold.
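To make the regularization and the k-fold idea concrete, here is a minimal sketch with the glmnet package, which comes up again later in the session. The data layout is an assumption (binary target in the first column, numeric features after it), and alpha is the mixing parameter: 1 gives lasso, 0 gives ridge, anything in between gives an elastic net.

library(glmnet)

x <- as.matrix(train[, -1])   # numeric feature columns (assumed)
y <- train[, 1]               # binary target, assumed to be in column 1

# 10-fold cross-validated lasso (alpha = 1); alpha = 0 would be ridge,
# alpha = 0.5 an elastic net. type.measure = "auc" scores each fold by
# area under the curve, and the best lambda is picked from those scores.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    nfolds = 10, type.measure = "auc")

cv_fit$lambda.min   # the lambda chosen by cross-validation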
Not every one of those 10,000-record folds is going to be messed up, assuming you have reasonably good data, say at least 80 percent of it. If only 20 percent of your data is good, then come on, that's a data problem, not an algorithmic challenge. So you're asking: if I just fit a black-box model, where does the big picture come in? What if I have absolutely no idea about the underlying system? Good point, and it's something I didn't cover: for certain classes of problems, certain algorithms are known to work better, and that's something you should look at when you choose. The bias-variance considerations we just talked about are part of how you make that choice. So far we've talked about logistic regression and linear regression. Decision trees are a different paradigm. Say you want to decide whether to play cricket or not: you look at the outlook, the humidity, whether it's windy, all the weather conditions, and you walk down the tree to a decision. It's a very easy way of doing it and very easy to represent; if someone asks you how the decision is made, you can just show them the tree. It used to be extremely popular. The biggest problem is this: I'm splitting first on outlook, so outlook is the very first variable in the hierarchy. If that variable is wrong, or messed up, or missing in my test data, the whole model becomes unstable. Stability becomes a big issue. This isn't as big a problem in linear or logistic regression, because all the columns sit in the same equation; here the model is hierarchical, you go down this branch and then that one, so an error near the top cascades. How do you overcome that? People have of course worked out how, and the answer is bagging and boosting. Boosting first. Boosting helps to reduce bias. It takes a set of weak learners, models that individually don't tell you much, and combines them into one strong classifier. In practice you use a lot of decision trees and combine them into a single strong classifier, and that combination produces the output. A very similar approach is bootstrap aggregation. Say we have 100,000 records. Bootstrapping means sampling with replacement: I draw, say, 80,000 samples at a time, and because I'm sampling with replacement I can pick the same record more than once, and I can draw another 80,000, and another. On each bootstrap sample I build a model, a decision tree, and I build hundreds of decision trees like that. Each tree gives some classification or probability, and then you take a vote: whichever class comes up most often across the trees is your output. That's model aggregation: one tree can say A and the others can say B, and the majority wins. That's called bagging.
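To make the mechanics of bagging concrete, here is a minimal sketch with the rpart package. A data frame train with a binary factor target called default, and a test frame with the same feature columns, are assumptions made for illustration.

library(rpart)

bagged_predict <- function(train, test, n_trees = 100) {
  votes <- sapply(1:n_trees, function(i) {
    # bootstrap sample: same size as train, drawn with replacement
    boot <- train[sample(nrow(train), replace = TRUE), ]
    tree <- rpart(default ~ ., data = boot, method = "class")
    as.character(predict(tree, newdata = test, type = "class"))
  })
  # majority vote across all the trees, row by row
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# preds <- bagged_predict(train, test, n_trees = 200)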
Bagging is said to reduce variance. A very closely related technique is random forest, which uses bagging; the only thing it adds on top is random selection of the variables considered at each split. So in one decision tree, outlook can be the first split; in another tree, windy can be the first split; and so on down the trees. That way you get different trees and different outputs, and when you build thousands of trees and average over them, you get a better answer. Question: and the choice of features differs because each tree sees different data? Yes, each tree sees a bagged data set, a random bootstrap sample, so different trees see different data. And because you're averaging over many of them, the variance gets reduced; that's the whole idea. Question: does it use boosting? No, there's no boosting here. So what is boosting then? Both of these belong to what are called ensemble techniques. They've been around for many years, but they became very popular around the Netflix competition, which was won by a combination of these techniques. To your question: boosting builds weak learners and combines them into a strong classifier; in bagging, no new kind of classifier is being formed, it's the same classifier fitted on different samples and then aggregated. Boosting can also effectively combine variables. We'll see it when we get to the code: you can set what the interaction depth should be, so the classifier can pick up combinations like outlook together with humidity greater than 70; you can get to that level of detail. In bagging, each tree just splits on the variables separately and you aggregate the results. In a random forest, you also don't have all the variables in the picture at every split: the model randomly selects a subset, say 70 percent of the variables, and then finds the best split among that subset. The reason this performs better is that across many trees the noisy features get washed out, because most of the trees are still splitting on genuinely informative variables, so you end up with a better model than you'd otherwise get. In many, many applications today this is considered state of the art; in industry, for traditional marketing-type problems, random forest is considered state of the art. For speech and vision, deep learning is considered state of the art. So those are the two techniques currently performing better than every other known technique. I don't know how many of you know Kaggle; it's an online platform that hosts machine learning competitions. In the last two years, most of the competitions have invariably been won either by deep learning or by random forest. Two extremely powerful approaches. Question: do more than half of the winners use R? R and Python are pretty much the two things people use, and Python is very good too. For example, R runs everything in memory, and that's always a limitation; Python lets you serialize objects and it handles large amounts of data a bit better. At the scale Kaggle competitions run, you'd want a cluster anyway, and then it doesn't really matter as much because you'll have enough RAM. If you do it on AWS it works, but you still pay money for your cluster, right?
If you have your own cluster, then yes, it doesn't really matter. I personally feel the random forest implementation in Python gives me better results than the one in R, and I don't know why, even though the R package is based on the original authors' code. Not sure; anyway, it's a practical observation. This is the last algorithm we'll look at, and it's the simplest of all: k-nearest neighbours. Forget all the other machinery. I have a point, I ask its k nearest neighbours what their classes are, and whatever the majority of them say, I repeat. It's as simple as that, the easiest possible way of classifying: you just look at the neighbourhood and take a vote. The catch is that you can end up in a place where all your neighbours are bad or mislabeled data, and then your prediction is wrong. That's a real problem, but it's the easiest, simplest method and it often gives reasonable results. The number of neighbours is up to you: ask one neighbour, two, five, ten, whatever you like. You can also define how distance is measured, which is really what defines a neighbour; it could be ordinary Euclidean distance or whatever suits your problem. Earlier we talked about why transformation is needed; here's the picture. This data is not linearly separable: you cannot run a linear model in this space. I can see that the inner group is different from the outer ring, because I can draw a circle and separate them, but any straight line is going to make a lot of misclassifications. When I transform the data into a squared dimension, though, I can separate the classes cleanly. This is done using kernels. The theory of how kernels do it isn't easy to explain in passing; it's a session on its own, built on convex optimization. Support vector machines use the same machinery. You have data that is linearly separable, and there are many lines that separate it; which one should you select? The support vector machine says: select the line with the maximum margin between the two classes, and you have a parameter to specify how much margin violation or error you're willing to allow. That's how you select the line, or the plane, between the points. There is the ideal separating plane, and two parallel planes that just touch the nearest points on either side; that distance is called the margin, and the points those planes pass through are the support vectors. It's all solved with optimization. The problem with support vector machines is that when you have a lot of data and a lot of features, the optimization takes forever; it's not a great idea in that regime because it insists on solving that optimization exactly. The advantage is that kernels can implicitly transform the data into a very high, even infinite, dimensional space and find very accurate boundaries. It used to be the method of choice until a few years back, when data sets weren't that large: if the data fits into memory, you can usually find a good model with the right kernel function. Just two minutes more and then we'll get to the code. A few things you should keep in mind: don't torture the data until it confesses. If you fit enough different models, one of them will look good purely by chance. Always use your brain, and don't lose the big picture. You can overcomplicate things and build overly complex models, and you can talk yourself into believing that a particular model is better, or that the data must behave a certain way.
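For reference, here is a minimal sketch of those last two classifiers, k-nearest neighbours and a support vector machine, using the class and e1071 packages. The objects train_x, train_y and test_x are illustrative: numeric feature matrices plus a factor of class labels.

library(class)    # knn
library(e1071)    # svm

# k-nearest neighbours: each test point is classified by a vote among
# its k nearest training points.
pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# SVM with a radial (RBF) kernel; cost controls how much margin violation
# is tolerated, the trade-off mentioned above.
fit_svm  <- svm(x = train_x, y = train_y, kernel = "radial", cost = 1)
pred_svm <- predict(fit_svm, test_x)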
Another thing: you can have your own biases, and the data itself may not all be correct. The model specification could be wrong. And when you do cross-validation, you might end up actually using the validation data while building the model; that's data snooping, and then your model still won't be fine. There are many classic examples of how data snooping leads to very bad out-of-sample performance. Do you know the story behind this slide? Do you want to hear it? It's about selection, about how you collect the data; when we talk about cross-validation, one partition of the data can be really messed up, and this is an example of exactly that. Truman and Dewey were contesting the US presidential election, in 1948. Look at the date on this newspaper. What do people do around an election? Exit polls and telephone surveys: after the vote you just call people and ask. So they called people, took a survey, and the next day the paper ran the famous headline, Dewey Defeats Truman. Then the votes were counted, Truman had won quite comfortably, and he famously held up that very newspaper. Now, the polling was built on perfectly sound statistics; statistical inference hasn't changed much, and even in the 1940s it was done carefully. All the inference was right. So what was the mistake? It was a biased sample. Why? Because they surveyed only people who had telephones. In those days the telephone was not widespread, so the people who had phones were a particular, relatively well-off minority, and they leaned heavily toward Dewey. The sample didn't reflect the right proportions of the electorate, and the result was completely off. The lesson is that when you have data, you have to use the right sampling technique. Now, the sample data we typically work with looks like this, and it's very common in practice: in fraud, or almost anything you take, the positives are very, very rare. The number of people defaulting is always small. The number of people committing fraud is tiny compared to the number of people doing transactions. The number of people who click on an email advertisement compared to the number of emails sent is on a completely different scale, something like a hundred thousand clicks out of a billion emails, a vanishingly small fraction. And that's exactly the number companies like Facebook and Google work on: if you can increase it by even a tiny fraction of a percent, it translates into billions of dollars. So the aim is to move that small proportion: how many people click, how well you detect fraud, and so on. Here's what that does to your metrics. Say you have 100 transactions: 96 people did legitimate transactions and 4 committed fraud. I fit a model that says all 100 are legitimate. What is the accuracy of my model? 96 percent. Without doing anything at all, I have a 96 percent accurate model. In web traffic it would be 99 point something. So it's very easy to claim you've fit a very good model, but we know how wrong that can be.
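A tiny sketch of that accuracy trap, mirroring the 96-out-of-100 example above:

# 96 legitimate transactions (0) and 4 frauds (1)
actual    <- c(rep(0, 96), rep(1, 4))
predicted <- rep(0, 100)   # a "model" that calls everything legitimate

mean(predicted == actual)                              # accuracy: 0.96
sum(predicted == 1 & actual == 1) / sum(actual == 1)   # recall on fraud: 0

The model is 96 percent accurate and completely useless, which is exactly why accuracy alone is the wrong metric on skewed data.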
That's why how you sample is so important. We use something called stratification on the output: you stratify your target variable so that when you create all these cross-validation folds, the class distribution stays the same in each of them. That's also why, when you use a stratified rather than a purely random sample, the distribution in each partition matches the distribution of the target; if it doesn't, things can go badly wrong. It can easily happen otherwise: I have only four positives, I do ten-fold cross-validation, and all four land in one fold. Nine of the models will say everything is fine, the tenth will say something different, and with model averaging the nine win, so you still end up with the wrong model. Always keep this in mind; problems can creep in at any of these steps. Question: if you have that kind of very skewed distribution, do you just not sample? No, you do something usually called over-sampling: instead of taking all your data as it is, you keep all of the rare class, take repeated samples of the rest, and run the whole thing many times. That's related to the idea of bootstrapping: instead of running the model once, you run it a thousand times and look at the average accuracy. So that's one approach, and the other is the stratification I mentioned: when you create the models, you make sure every fold and every validation set has the same class proportion, which is presumably how it's going to be in real life. Those are the main things we do; there are others too. Another consideration is black box versus white box models; that slide should read white box. A white box model is easy to explain. Sometimes you need to present the model to your VP, and something simpler helps; sometimes explainability isn't essential at all. To identify whether someone is a terrorist, for instance, it doesn't really matter what technique you use as long as it identifies them. Neural networks are a black box model: you really don't know what's happening inside. I used to work with banks, and adherence to regulations matters a great deal there; regulators don't want a black box sitting inside the decision process, so that constrains what you can use. Okay, let's get back toward the hands-on part. Question: gradient boosting machines, that's boosting, right? Yes, gradient boosting machines are a boosting algorithm. They combine many weak classifiers into a strong classifier, and they do it over many iterations. Good point, and to be clear, I've just been talking about the broad class of algorithms; within boosting there are many specific algorithms. AdaBoost was the first boosting algorithm, and gradient boosting machines are what's considered the state of the art in boosting today. They use many decision trees: they build weak models and then merge them in a way that keeps the result relatively simple while driving the error down. That's what it does; we can talk more after. As for the older tree variants, they're mostly dated, so I'd suggest you stick to the standard ones. CART is Classification And Regression Trees; it's essentially the decision tree we discussed. The variants all differ in how you pick the split: CART and CHAID differ in how you select it. Remember the cricket example?
Which variable gets the first split, that's the part that varies between them. Decision trees on their own are very simple, so they come with their own biases, but once you move to random forests it doesn't matter much. Question: is all of this only for classification? These are classification algorithms, yes, but the target can be anything, dogs versus cats, whatever; as long as you have features, it can classify. It may not be state of the art for every domain, but it works. These are pretty much the major classification algorithms in use. The only state-of-the-art thing I haven't covered is neural networks, but for all practical purposes this set will take you a long way. Do people still use SVMs? Yes, people do use SVMs; they have nice mathematical properties and they work well when your data size is not large. With our 120,000 records you'll see it when we actually run the code: the SVM takes more time than any of the other methods, because it genuinely tries to optimize and find the exact maximum-margin separator, whereas the other methods just fit and move on. Okay, now let's actually try it. This is how you write a function. The first thing we're going to do is cap the outliers. We saw that income had outliers, so we cap at the 1st and 99th percentile. You write a function: quantile gives you the percentiles, and I take the 1st and the 99th. If a value is less than the 1st percentile, I set it to the 1st percentile value; if it's more than the 99th percentile, I set it to the 99th percentile value. We also have missing values, and we'll handle those after this; first the outliers, then the missing values. So I've just written a function and applied it. Take age as the example: the maximum age is 109 and the 1st percentile is 24, so we're keeping all the ages between 24 and 87. I replace the column in train using this function; all it does is apply those two rules. And whenever you do something to train, you don't just do it there; you have to do exactly the same thing to the test data. Question: so you're actually clipping the data for age? Yes, that's what capping means. We have outliers like 109, and I'm setting everything above 87 to 87. It's fairly arbitrary, I admit; it depends on your application, and I'm just showing the mechanics with age as an example. It may not be the right choice for your problem. Question: we're capping train at 87; will the cap for test be recalculated on test in the same way? Good point. There are two different approaches people take. The way it really should be done is that the value you used to cap the training data is the same value you use for the test data. For the sake of convenience I'm recomputing it here, and that's sloppy; you're right, strictly it should be the same 87. Question: then maybe we should do the capping before we split into train and test? No, we shouldn't do that, because then it becomes data snooping. That's what I said earlier: you don't look at what's in the validation data. You handle the train and validation sets separately.
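Here is a small sketch of that capping step, written the strict way: the cut-offs come from the training data only and the same values are reused on the test data. The column name is illustrative.

# 1st and 99th percentile caps, computed on the training data only
caps <- quantile(train$age, probs = c(0.01, 0.99), na.rm = TRUE)

cap_outliers <- function(x, lower, upper) {
  x[x < lower] <- lower   # floor at the 1st percentile value
  x[x > upper] <- upper   # cap at the 99th percentile value
  x
}

train$age <- cap_outliers(train$age, caps[1], caps[2])
test$age  <- cap_outliers(test$age,  caps[1], caps[2])   # same cut-offs as train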
When you get your training data, you compute the capping levels on it, you cap it, and you apply the same cut-offs to the test data. I've done the sloppy version here, which works for now, but when you do it for real you should always reuse the training values. To recap what the function does: age runs from 0 to 109, and I don't want 0 or 109 in there. Instead I use the 1st and 99th percentiles, which come to 24 and 87. If a value is less than 24, I set it to 24; if it's more than 87, I set it to 87; anything in between stays as it is. And you run the same code on the test set. Question: so you're doing this to remove the outliers? Yes, this is to cap the outliers: I'm not removing rows, I'm capping values, so anything above the 99th percentile is set to the 99th percentile value. How did you decide the cut-off? If you remember, I showed a plot of income earlier; the plot is here. You could see where the values sat, and I knew 50,000 was wrong. You do the same thing here: plot everything first, look at the distribution, and see where the outliers are. In this case I took the 1st and 99th percentiles, 24 to 87. For credit card default that seemed reasonable to me; below 24 and above 87 there isn't much I need to distinguish. You apply the big picture, your business logic, to decide where to cap. Now, if you remember the summary from earlier, two columns have nulls: monthly income and number of dependents, which are columns 6 and 11. So I'm going to create a vector of the missing columns, 6 and 11, and run the apply command over them. What apply does is take a data set and, for every row or column, apply the same function. The function we're defining inline replaces a null with the mean value of the column. We have two columns with nulls, monthly income and number of dependents, so I do it for both and put the values back. Question: number of dependents has only about 13 distinct values; does the mean make sense there? Good point; you can experiment. For the sake of the example I'm using the mean, but again it depends on your application. And yes, you see the issue: number of dependents is an integer, and after imputing the mean we have a fraction. So what do you do? You round it, and that's what I actually do here. The other thing we have to handle is a categorical variable. We cannot use categorical variables directly, so we have to convert them into numerical values.
We call that one-hot encoding; there are various terms for it, and in the library we're using there's a function for creating indicator variables. So what is an indicator matrix? Let me show you; it's right at the top of the history, so let's pull it up. The head of that column in train is basically C, C, D, B, C, D. We create a set of indicator variables from it: when the value is C, the C column is 1 and all the other columns are 0; when it's B, the B column is 1 and the others are 0; same for D. So you're replacing that one categorical column with four columns. Then you put train back together: cbind concatenates by columns, so you take train, remove the original categorical column, and bind on the indicator columns in its place. There's a reason why, if there are n categories, you should use only n minus 1 indicator columns: if you keep all n they're linearly dependent, since they always add up to one, so you use only n minus 1 of them, and any n minus 1 will do. Now the training data is done. I'm sorry I'm rushing, but I'm going to do the same thing for the test set, replacing its values the same way. People need time to copy the code? I'll share it afterwards, so for now you can just watch what I'm doing. So we apply the same data transformation to test, and now the data is ready; we can start with the models. The first model will be logistic regression, and glmnet is the package. This particular one is lasso, which is also called L1 regularization. I tell it the family is binomial, because the target is only zero or one, so this is binomial, in other words logistic, regression. This is my training matrix; my target is the first column and all the other columns are features; and I ask for tenfold cross-validation, which is the cross-validation input. What am I trying to optimize? I'm maximizing the area under the curve; I think Hershey will talk about AUC tomorrow. So we're basically trying to find the model with the maximum area under the curve, with the lasso penalty. It takes a while; it runs for about a minute. There's a way we can speed this up. All of us have multi-core machines, and each core can run one fold: since we asked for ten partitions, once the data set is in RAM, each cross-validation fold can run on one of the cores. That's one way to increase the performance of this particular model. It's definitely taking more than a minute. Question: do you have to specify the number of folds? Yes, you specify it; I'm doing tenfold cross-validation, and cv.glmnet is the function. Does it standardize the variables? Yes, it standardizes by default, and there's an option to turn that off if you don't want it. And standardization does matter for the results: you can have income in thousands of dollars and age in tens of years, and they need to be brought to a comparable scale. Is it printing anything while it runs? What kind of logs? No, it doesn't print progress as it goes; if the model hits an error, it will tell you after it finishes. That's a bit of a challenge. I don't know if you can even see the screen from there.
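Going back to the encoding step for a moment: one common way to build those indicator columns in base R is model.matrix. The session used a library function for it, so treat this as an equivalent sketch; the factor name grade is illustrative, and dropping one column gives the n minus 1 indicators mentioned above.

# Suppose grade is a factor with levels A, B, C, D, E.
# model.matrix builds the dummy columns; one level is absorbed into the
# intercept, which we drop, leaving n - 1 indicator columns.
dummies <- model.matrix(~ grade, data = train)[, -1]

# Replace the categorical column with the indicator columns.
train <- cbind(train[, setdiff(names(train), "grade")], dummies)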
It's probably not visible from the back either. Okay. So now that the model has run, what do we check? You want to find the cross-validation results, the average cross-validation error. How does that work? Let me give a brief picture. This is a regularized logistic regression, a lasso, so there's a lambda, and we didn't talk about how lambda gets selected. Lambda is selected using cross-validation: the routine takes a value, say 0.1, runs the model, finds the cross-validation error, and keeps doing that with various values of lambda. Ultimately it picks the lambda that gives the lowest cross-validation error. Here the cross-validation measure we chose is area under the curve, so it picks the lambda with the maximum AUC. The best cross-validated AUC came out around 73 percent, so the model predicts with reasonable accuracy, and we can expect the out-of-sample performance to be somewhere at or below that 73 percent, because that's what the cross-validation estimate tells us. Are you following? Question: is that the maximum AUC for this model? Yes; with this model, that's about the best we can achieve, and the cross-validation score gives a sense of what the performance on the held-out test set is going to be, which means our test performance will likely be a bit under 73 percent. Any questions? How do we increase the accuracy? You fit different models and try different feature transformations; that's how you go about it. I did lasso here; you can do ridge regression, you can do elastic net, you can go to random forest, to support vector machines, all sorts of things, and compare. You asked earlier about the parallel option, so let's do that. You load this package, doMC; I already have it installed. My machine has eight cores, so I'm going to register seven cores for processing, and then run the same model again. It's exactly the same code as before; I just added a couple of options and set parallel equal to TRUE, so each cross-validation fold runs on one of the registered cores. The earlier run took a good while; this one took only a few seconds. Then I run the same commands as before to look at the result. Question: does everything after that run in parallel? Once you register the cores, anything that can be parallelized will use them, as long as the algorithm has an option like parallel equal to TRUE. Not all algorithms can be parallelized; even here, it's only the cross-validation folds that run in parallel. It depends on the algorithm. And this isn't even getting into big data; this is just your laptop. Big data is a whole different volume with its own complications. But this is one way to start thinking about it: you look at an algorithm and ask how it can be parallelized so you can take advantage of the hardware. And the AUC is still about 73 percent, so we got the same answer much faster, which shows the advantage.
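A minimal sketch of that parallel setup. doMC is the actual package; the core count and the glmnet arguments mirror what was just described, with x and y as before.

library(doMC)
library(glmnet)

registerDoMC(cores = 7)   # register 7 of the machine's 8 cores

# Same cross-validated lasso as before; parallel = TRUE lets each
# cross-validation fold run on its own registered core.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    nfolds = 10, type.measure = "auc", parallel = TRUE)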
With the couple of changes I made, the iterations and the thresholds, we're still at about 73 percent. As homework, you can play with the settings, change the alpha or the lambda, and see how the model improves. Now I want to predict on the test data set. With predict there are two things I can ask for: the probabilities or the class; here I ask for the class. Then we compute the metrics we talked about; this is the code for precision. And look: all the predictions are zero. This model is pretty bad in that sense; it's telling us nobody defaults, so it never flags anyone. Question from the side: I'm trying to install doMC and it says package doMC is not available for R version 3.1. Which R version are you on? I'm on 3.1.0, so that's the version issue I mentioned at the beginning; we'll sort it out separately. Now a quick look at decision trees. The package is called rpart. The code is very straightforward: once you know the algorithm, it's almost always the same pattern. You install the package, you load it, you call the function, you give it the data set, and you tell it what you want to predict or optimize. That pattern is constant throughout; those are the easy parts. Same for random forest: instead of rpart, you call randomForest and give it the same things. So here's the classification tree, the decision tree. It used just two variables, made just two splits, and still got a reasonable error. CP is the complexity parameter: a new branch is created only if it improves the performance, that is, only if it reduces the error. After adding two splits it couldn't find another split that reduced the error, so it stopped; that's what the plot is showing. The printout tells you exactly how it went about it, and this is one of the reasons people like decision trees: the output is very easy to read. It gives you the variable importance, which variable the first node considered, the error percentages, the confusion matrix, and the splits it considered. Then it moves on to the next level, does the same thing, and when it gets to the third split it can't find anything better, so it stops. You do the same thing for random forest. Again, you give the random forest model your training features, your target, and how many trees you want; remember, it's fitting many trees and averaging over all of them. Question: the features it picks at each split are random? Yes, that part is randomized. At each split it takes a random subset of the features and finds the best split among those, and that happens across all the trees; you can tune how many features get considered at each level. Here's the error rate: as the number of trees grew, the error kept dropping. Predicting from it is the same pattern as always, and I'll share the code for all of this; I'll also put a small sketch of these two calls right below. There's one more thing I didn't talk about yet.
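As promised, a minimal sketch of those two calls. rpart and randomForest are the actual packages; the formula, the data frames, the parameter values and the factor target called default are all illustrative assumptions.

library(rpart)
library(randomForest)

# Decision tree: cp is the complexity parameter, so a new split is kept
# only if it improves the fit by at least that much.
tree_fit <- rpart(default ~ ., data = train, method = "class", cp = 0.01)
printcp(tree_fit)   # the splits considered and the error at each stage

# Random forest: many trees on bootstrap samples, with a random subset of
# features tried at each split, and the predictions voted on.
rf_fit <- randomForest(default ~ ., data = train, ntree = 500)
pred   <- predict(rf_fit, newdata = test, type = "response")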
Here's what I mean. I'm going to take cross products: I multiply each feature with every other feature. We have 14 features in train, so I take all the pairwise products, 14 choose 2 of them, and add them to train; that's what this does. If you look at the resulting data set, it's the same 120,000 rows but now with about 92 columns: I take the first feature and multiply it with all the others, then the second with all the others, and so on. We talked earlier about adding additional features; now we've actually added them. Three of the new columns turn out to be all zeros, so I remove those; when I share the code you can look at how. Now we run principal components on it. Again, it's very easy: call the function and look at what it gives you. The code is always the same pattern, very similar each time; you just need to know what it's doing. I wanted to show principal components because of this output. We had 92 columns, I removed the three that were all zero, so 89 columns, and the output tells you that the first principal component explains about 15 percent of the variance in the data, the second about 7 percent, and so on down the list. Generally you keep components until you've explained something like 99 percent of the variance, which here might be around 50 columns, and you use those as your features. So you run principal components on your data and feed the first 50 or so components into your logistic regression or your random forest, whichever model you're using. Question: is this only for classification? You can use the components as input to any model: I can take transformations, I can create thousands of features, and then keep the components that explain most of the variance and use those as the inputs to my model. Question: how far do you take it, do you also divide features by each other and so on? You can; it's endless. But remember what I said: you're complicating your model, you're increasing the model complexity, and when you do that the variance goes up. Does it actually give better out-of-sample performance? Can you get a better cross-validation score? If you can, then the model is genuinely getting better. We don't know beforehand, so we keep iterating until it stops improving. Okay, we're out of time, so I have to stop. Any questions? I'll share the code; I know a lot of you got lost somewhere along the way and are a little frustrated about what was happening, I can see it on everyone's faces, so I'll share the code right now and you can go through it. Question about big data: see, R runs entirely in memory, and big data is precisely data that doesn't fit into memory. With that much data you need to figure out how to parallelize the various operations, it has its own set of algorithms, and it can't be run on a desktop; you can run it only on a cluster or something like AWS. There's a company called Revolution Analytics that specializes in building those kinds of algorithms for R. So what do I use R for? For all of these things; we just predicted default, right?
What kinds of tasks do I use it for? To build models, mostly. It has very good graphics support; there's a library called ggplot2 which is excellent for data analysis and exploratory plots. You can build prediction models with it. It's used very widely in academia, so whenever a new technique appears, an R implementation is usually there first. I use it extensively, though not exclusively; Python is very good as well. What kinds of data sets do I use it on? Anything that fits into memory works. How popular is R on Hadoop? People are starting to use it, but there are other options now; Apache Spark, for example, is a newer paradigm, and it's showing better results than the older map-reduce style of processing. These are all computing platforms, though. Let me tell you the biggest challenge for machine learning on big data: if you look at all these algorithms, they need all the data available to do the optimization, and we said as much earlier. The thing with big data is that you cannot have all the data together in one place, so you need algorithms that can work in a distributed setup, and that has been a real challenge; it's something a lot of companies have been working on. Question: if it doesn't fit into memory, is R still the only thing you use? No. For some processing tasks Python is better, and between R and Python I think you cover most of what you need; we can take that up later. Any other questions or comments? On the classification techniques: can they handle more than two classes? Yes, all of them have options for that; support vector machines, logistic regression, they all work in a multinomial setting, so you can do multi-class problems. It works, and it works very well; that's why these methods are used so extensively in industry. Okay, I think we can close here.