So, we have Arjun here, hello, and he is going to talk about how he hand-coded core machine learning algorithms from scratch in Python, what he learned and why you should do it too. Right.

So, welcome everyone, welcome to the first talk of the day. My name is Arjun and I will be talking about how I hand-coded the core machine learning algorithms from scratch in Python, what I learned and why you should do it too. So, let's get started. I am an independent software developer and hacker from New Delhi, India. I was a Google Summer of Code student in 2016 with the jQuery Foundation. I also worked at a high-growth startup in Bangalore before moving back to Delhi and pursuing my other interests in computer science and music. My current interests include Python, machine learning, full-stack development, lower-level kernel stuff, Rust, and Lisp. You can find me on the social links on the right.

So, the premise of the talk, or let's start with the why. Machine learning, just like any other field of science, has certain basic concepts and algorithms that act as the fundamental building blocks of everything wonderful that you see around you, from your YouTube recommendations to your phone's camera. It's everywhere. Unfortunately, these algorithms have been handed to us as wonderfully packaged, optimized, and easy-to-use abstractions, think scikit-learn and friends, and we essentially use them as black boxes. You import a classifier from your favorite library, throw your training data at it, and boom, you have a predictive model. You can start making predictions, and believe it or not, that's a magnificent thing. It's a massive win for practicality, pragmatism and performance. But to me, it seemed a little too much like magic. So I decided to pop open the hood and see what happens underneath. This talk gives you a crisp take on that.

Okay, so, demystifying machine learning. What is machine learning? I'll do you one better: who is machine learning? I'll do you one even better: why is machine learning? Machine learning is a very overloaded term these days and there's an entire spectrum of beliefs about what it is. One extreme of the spectrum thinks it's magic: you just give it data and you get predictions. The other extreme says it's just statistics, man, glorified statistics. Well, yes and no; it's somewhere in the middle. The fundamentals of ML are still true to core statistics, but in practice, as of today, it has grown into something much bigger. In its simplest form, when you zoom out and tune out all the noise, machine learning is about extracting knowledge from data by building intelligent agents, often self-learning ones, in order to make predictions. ML is a wide field; we'll just be talking about the supervised learning part of it in this talk. Supervised learning in a nutshell: we have labeled data, we give it to a machine learning algorithm, which gives us a predictive model. We give new data to the predictive model and we can make predictions. Simple, right?

So, chapter two: the algorithms. We'll be covering the perceptron classifier, the Adaline (adaptive linear neuron) classifier, and the ever-so-popular logistic regression classifier. We will cover the math behind them, how they work, and how you can write your own classifiers. So let's get started.
So you can't do data science without data, right? The data set we'll be using is the Iris flower data set. It's very popular. It has 150 samples of the Iris flower, from three different species: Virginica, Setosa and Versicolor. We have 50 samples of each species, with four measurements per sample. This is what the data looks like. And when we write x₁⁽¹⁵⁰⁾ (sample index 150 on top, feature index 1 at the bottom), we mean the first feature, the first column, of the 150th sample.

Now, when scientists began with the field of artificial intelligence, they started with the neuron. As you know, the neuron is the basic unit of the human brain and the nervous system. The way a biological neuron works is that it receives signals at one end through extremities called dendrites, and those signals propagate through the cell body. If the collective strength of the signals is greater than a threshold, an output is generated on the other side, along the axon, and the signal propagates onward. Scientists modeled this behavior as a logic gate with binary output, such that if the input exceeded a certain threshold the output was 1, else it was 0. This was done by two gentlemen named Warren McCulloch and Walter Pitts, and it came to be known as the MCP, or McCulloch-Pitts, model. Shortly after, a gentleman named Frank Rosenblatt proposed the perceptron algorithm based on the MCP model. The perceptron was the first algorithmically described ML algorithm: Rosenblatt proposed that the algorithm would automatically learn the weight coefficients which, multiplied with the input samples, decide whether the MCP neuron fires or not. And this can be used for classification: if it fires, you classify the sample as class A, else as class B, which is often numerically encoded as 1 or minus 1.

We'll be talking about vectors a lot in this talk, so let's take a minute to properly understand what we're talking about. There are basically two major perspectives on vectors: the physics perspective and the computer science perspective. The physics perspective defines vectors as arrows in space. These arrows have a magnitude and a direction, and as long as those two stay the same, you can move the arrow anywhere in space and it's still the same vector. An arrow in a 2D plane is a 2D vector and an arrow in 3D space is a 3D vector. The CS perspective defines vectors as, get ready for it, ordered lists of numbers. That's it. In CS, a vector is just a fancy name for an ordered list of numbers, and what makes a vector 2D is that the length of the list is 2. So if you want to model a house as a vector, you can do so if you only care about, say, the house number, the price and the square footage. And if you change the order of these measurements, it's no longer the same vector, because the order matters. You can add two vectors, and you can multiply a vector by a scalar; that scales the vector, which is why it's called a scalar.
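Just to make the ordered-list idea concrete, here's a tiny sketch in numpy (the house values are made up purely for illustration):

```python
import numpy as np

# A house modeled as an ordered list of numbers: [house number, price, square footage].
house = np.array([42, 350_000, 1_200])

# Changing the order gives a different vector, because the order matters.
reordered = np.array([1_200, 42, 350_000])
print(np.array_equal(house, reordered))   # False

# You can add two vectors element by element...
another_house = np.array([7, 410_000, 1_500])
print(house + another_house)              # [    49 760000   2700]

# ...and multiply a vector by a scalar, which scales every component.
print(2 * house)                          # [    84 700000   2400]
```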
Now, how the perceptron works, without getting too deep into the math: first you initialize a weight vector, a random ordered list of weight values, whose length equals the number of columns in your data set. Then, for each training sample (mind you, for each training sample) we do the following; this is the schematic of the perceptron model, and we'll handle the w₀ term in a minute. For each data sample we calculate the net input, the linear combination of the sample and the weight vector, which goes to a decision function: if the net input is greater than a threshold, we predict 1, else minus 1. If we make an error, the error is used to correct the weights so that the next prediction is better. And this is done for every sample.

So the terms we'll be using: the input vector, an ordered list of your data; the weight vector, an ordered list of weights, which start out as random numbers; and the net input, the linear combination of the weight vector and the input vector, which we can write as z = w₁x₁ + w₂x₂ + … + wₘxₘ, where w is the weight vector and x is the input vector. Vectors are just ordered lists, mind you, don't get intimidated.

A quick refresher on matrix multiplication: when you multiply matrices, the inner dimensions have to match. If one matrix is n × m, the other has to be m × p, and the result is an n × p matrix.

Now, in the perceptron, the decision function is a variant of the unit step function. As you can see from the definition in the yellow circle, if the net input z is greater than or equal to some threshold θ, we predict 1, else minus 1. For simplicity, we bring θ over to the other side of the inequality, which now becomes z - θ ≥ 0. This -θ term is also known as the bias unit, and we denote it by w₀. So our net input becomes z = w₀x₀ + w₁x₁ + w₂x₂ + … + wₘxₘ, where x₀ is 1 and w₀ is -θ. We call it weight zero, and it has to be treated separately. The length of our weight vector is now 1 plus the number of columns in the data, because we added the bias unit, and so the model also learns the threshold.

The perceptron learning rule states that you update each weight as wⱼ := wⱼ + Δwⱼ, where the weight update is Δwⱼ = η (y - ŷ) xⱼ: the learning rate η times the target minus the prediction, times the feature you're calculating the weight for. As you can see from the pink circle, if we predict correctly, say the class was 1 and we predicted 1, the weight update is 0; we do not update the weights when we make correct predictions. If we make a wrong prediction, the weights are pushed towards the side of the correct class. For example, if we predicted minus 1 and the true class was 1, the net input did not cross the threshold, which means we need to increase the weights so that next time it crosses the threshold and we predict 1. Also, the size of the weight update is proportional to the misclassification: the bigger the error, the bigger the adjustment.

So let's get started. This is the Python code to import the perceptron from the scikit-learn library, and this is how most of us write it. We import the Perceptron classifier and give it three arguments: the number of iterations, the learning rate eta, and the random state. Then on the instance we call the fit method with our training data, which fits the model to the data. Then we can use the predict method of the instance to make predictions, and we need some accuracy measure to test how the model performs.
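The scikit-learn version on the slide is along these lines; a minimal sketch, assuming the Iris data and a standard train/test split (the parameter values here are illustrative, not necessarily the ones on the slide):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

# Load the Iris data: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# The three arguments mentioned in the talk: iterations, learning rate eta, random state.
clf = Perceptron(max_iter=40, eta0=0.1, random_state=1)
clf.fit(X_train, y_train)             # fit the model to the training data

y_pred = clf.predict(X_test)          # make predictions on new data
print("Accuracy:", accuracy_score(y_test, y_pred))
```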
So from this we learn that we need a Perceptron class that has three attributes, the learning rate eta, the number of iterations and a random state, and an instance of this class must have a fit method, a predict method and some way to measure how well it did. So let's get started. With all this knowledge, we know everything we need to write our own perceptron classifier. We'll be using numpy, because all of these operations will be vectorized.

We define a class Perceptron, give it the three parameters we just saw, eta, the number of iterations and the random state, with some default values, and assign them to self attributes of the same names. Then we have the net_input method, which is the linear combination of the data and the weights plus the bias unit; this linear combination becomes the net input. And we have the predict method, which says that if the net input is greater than or equal to 0, predict 1, else minus 1. The fit method is where the real work happens. We give it our data set and our training labels. We generate a weight vector of small random numbers, and the random state is used to seed the random number generator so it's reproducible. We maintain an errors array to see how the model performed. Then we loop over the data set for n iterations, and during each iteration we loop over each sample and update the weights according to the perceptron learning rule: eta times target minus prediction, times the value of the sample. We update the bias unit separately, we record how many errors we made per iteration, and fit returns the instance. (A sketch of this class is shown a little further below.)

Now, this is the code to train a perceptron model and plot how it performed: we import the data, pass it to the perceptron, and use a helper function to plot the result. Just to visualize, this is what our data looked like; for simplicity we took only two species of flowers and only two features. This is how the official scikit-learn model performed, and this is how the model we wrote performed. As you can see, it's pretty close, right? So the logic must be similar. And this is the error graph: as you can see, after the sixth iteration there are no errors, which means the model converged, which means it found a decision boundary that separates the classes.

The drawback of the perceptron is that it's a very simple model. It works well for linearly separable data but not so much for non-linear data, because the decision boundary is linear, and if the classes are not linearly separable, at any point in time at least one sample is misclassified, which means the weights will never stop updating. That's why we need a fixed number of iterations, to stop the model after that many passes even if we still have misclassifications.
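Putting the pieces together, here's a minimal sketch of such a perceptron class, assuming numpy and class labels encoded as 1 and minus 1 (names like n_iter and w_ are my own choices, not necessarily the ones on the slides):

```python
import numpy as np

class Perceptron:
    """Perceptron classifier: a minimal sketch of the class described above."""

    def __init__(self, eta=0.01, n_iter=10, random_state=1):
        self.eta = eta                    # learning rate
        self.n_iter = n_iter              # number of passes over the data
        self.random_state = random_state  # seed for reproducible random weights

    def net_input(self, X):
        # Linear combination of inputs and weights, plus the bias unit w0.
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        # Unit step decision function: 1 if the net input is >= 0, else -1.
        return np.where(self.net_input(X) >= 0.0, 1, -1)

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.errors_ = []

        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):                 # one sample at a time
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi               # perceptron learning rule
                self.w_[0] += update                     # bias unit, handled separately
                errors += int(update != 0.0)
            self.errors_.append(errors)                  # misclassifications per pass
        return self
```

On the two-species, two-feature Iris subset described above, something like Perceptron(eta=0.1, n_iter=10).fit(X, y) should stop making errors after a handful of passes, since that subset is linearly separable.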
So based on this, we can move on to the adaptive linear neuron, or Adaline, model. The Adaline model is pretty similar to the perceptron, except that it illustrates the key concept of defining and minimizing a continuous cost function, which lays the foundational groundwork for understanding more sophisticated methods like logistic regression and support vector machines. It's pretty similar to the perceptron, as you can see, but we have this new term called the activation function, and it is the output of the activation function, not the output of the threshold function, that is now used to update the weights. This is how it compares to the perceptron; you can visualize the two side by side. We have the activation function, whose output is used to correct the weights, and we have the threshold function, which makes the prediction based on the output of the activation function.

So the difference between Adaline and the perceptron is that in Adaline the weights are updated using the activation function. And what is this activation function? In the case of Adaline it's just the identity function, φ(z) = z: whatever you give it, it gives back. Basically it does nothing for now, but we need the concept here. The perceptron compares the true class labels against the predicted class labels, which are binary outputs, 1 or minus 1. Adaline, on the other hand, compares the true class label with the output of the activation function, which is a real value, not necessarily an integer, because it's just forwarding the net input, the linear combination of weights and inputs. Adaline uses cost function minimization and gradient descent, another very overloaded term that we'll be exploring. In the perceptron the weights are updated after each sample in each iteration; you can imagine this would get tiresome if you had a data set of a million items. Adaline instead updates the weights collectively once per iteration, not after each sample, which is why this is also known as batch gradient descent. There's a variant called stochastic gradient descent, which updates after each sample and is often used for online learning, but we'll be talking about batch gradient descent.

Now, what is cost function minimization? One of the central features of a machine learning algorithm, specifically in supervised ML, is a well-defined objective function that has to be optimized while the algorithm learns. Basically, the cost function is how the algorithm knows how to learn. For example, when you're travelling you could optimize for time, in which case you would take whatever gets you to your destination fastest, no matter the cost; or you could optimize for cost, in which case you'd probably take cheaper means of transport that take more time. So the cost function defines what we optimize while learning. The cost function for Adaline is the sum of squared errors, J(w) = ½ Σ (y - φ(z))², where y is the true class label and φ(z) is the activation output, which is just your net input (the ½ is there purely to make the differentiation cleaner).

And why do we choose this function and this function only, why not any other? Because it has the following benefits. First, it's continuous, which makes it differentiable. Second, it's convex: it's a sum of squares, and you remember the graph of x², right, it's a parabola. Both of these properties let us use a very powerful optimization technique called gradient descent. Gradient descent can be thought of as walking downhill until you reach a local or global minimum. Gradient, by the way, is just a fancy word for slope. And remember how you minimize a function: you differentiate it, which gives you the slope, and where the slope is zero, or very close to zero, that's your minimum. So we differentiate the cost function, we find the gradient, and we take steps in the direction opposite to the gradient. Since the cost function is convex, this will always bring us towards the minimum. In each pass we take a step in the opposite direction of the gradient, and the size of the step is determined by the learning rate eta that we choose, which makes the learning rate important.
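To make the walking-downhill picture concrete, here's a tiny sketch, not from the talk, that minimizes the convex function f(w) = w² with plain gradient descent; the derivative of w² is 2w, and each step moves opposite to it:

```python
# Minimal gradient descent sketch on f(w) = w**2, which is convex with its minimum at w = 0.
# The starting point and learning rate are arbitrary, chosen for illustration.
def gradient(w):
    return 2 * w            # slope of f(w) = w**2 at the point w

w = 5.0                     # start somewhere away from the minimum
eta = 0.1                   # learning rate: controls the step size
for _ in range(50):
    w -= eta * gradient(w)  # step in the direction opposite to the gradient

print(w)                    # ends up very close to 0, the global minimum
```

With eta too large the updates overshoot and the value bounces away from the minimum; with eta too small it still converges, just painfully slowly, which is exactly what the Adaline experiments coming up will show.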
Now, the Adaline learning rule is pretty similar to the perceptron learning rule: update the weights as w := w + Δw, but now Δw is defined as a step in the direction opposite to the gradient of the cost function, with the step size proportional to the learning rate eta. When we partially differentiate the cost function with respect to each weight, long story short, it comes down to the expression in the pink circle at the bottom, Δwⱼ = η Σ (y - φ(z)) xⱼ, which is just the true class label minus the activation output, times the sample's feature value, summed over all samples. And that is pretty easy to code, right?

Now we can write an Adaline classifier. The __init__ (dunder init) method is pretty similar to the perceptron classifier's: we take eta, the number of iterations and the random state, and assign them to self attributes of the same names. Then we have net_input, which calculates the linear combination plus the bias unit; of course, don't forget the bias unit. The activation method is now just an identity function: you can see, we pass in the net input and it simply returns it. Then we have the predict method, which is also similar: the output of the activation function has to be greater than or equal to zero for the neuron to classify the sample as 1, else we classify it as minus 1.

The fit method is where the changes happen. We start in a similar fashion: we seed the random number generator and initialize a weight vector, plus one entry for the bias unit. Then we maintain a cost array. We loop over the entire data set; notice that we don't loop over each sample here. We calculate the net input from the whole data matrix, then the output, which is the activation function's output, then the errors, which are the true class labels y minus the activation outputs. These are numpy arrays, by the way, so all of this is vectorized. We update the weights based on the Adaline learning rule for all samples at once, which mathematically is the transposed feature matrix dotted with the error vector, and we update the bias with the sum of the errors, because it's for the entire data set and not one sample. Finally we record how this iteration performed and append the cost to the array.

So this is what our data set looked like. Now, just to show you how the learning rate affects things, we'll train the Adaline model with two learning rates, one that is too big and one that is too small: 0.1 and 0.0001. For the one that was too big, you can see the cost keeps increasing: because the step size is too large, we keep overshooting the minimum, and since the weight correction is proportional to the error, the cost grows. On the right-hand side, with the learning rate that was too small, the cost is actually going down, but very slowly; it would need many more iterations. This is how the model with the larger learning rate performed: it did run, but it did not minimize our cost function. And this is how the one with the too-small learning rate performed; as you can see, it would have converged given more iterations.
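A minimal sketch of such an Adaline classifier, again assuming numpy and class labels encoded as 1 and minus 1 (the name AdalineGD and the attribute names are my own choices):

```python
import numpy as np

class AdalineGD:
    """Adaptive linear neuron with batch gradient descent, a sketch of the class above."""

    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]     # linear combination + bias unit

    def activation(self, z):
        return z                                       # identity function: phi(z) = z

    def predict(self, X):
        return np.where(self.activation(self.net_input(X)) >= 0.0, 1, -1)

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.cost_ = []

        for _ in range(self.n_iter):                   # one batch update per pass, no per-sample loop
            output = self.activation(self.net_input(X))
            errors = y - output                        # real-valued errors, a numpy array
            self.w_[1:] += self.eta * X.T.dot(errors)  # gradient step for all weights at once
            self.w_[0] += self.eta * errors.sum()      # bias update: sum of the errors
            self.cost_.append((errors ** 2).sum() / 2.0)   # sum-of-squared-errors cost
        return self
```

Training two copies of this with eta=0.1 and eta=0.0001 should reproduce the behaviour just described: the large step size overshoots and the cost grows, while the tiny one creeps downward very slowly.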
Now we get to the ever-so-popular logistic regression model. You can see that everything builds on what came before, and logistic regression also builds on the concepts we have learned so far. Saying it's very popular would be an understatement, right? In spite of the name, it's not a regression model; it's a classification model. Why is it so popular? One, it's easy to understand and implement. Two, it deals with the probability of a sample belonging to a certain class given certain features, not with binary yeses and noes. When your weather channel says there's an 80% chance of rain, they're probably using logistic regression, among other things. It's pretty similar to Adaline, but with a different activation function and a different cost function.

Now, the terms you need to be familiar with to understand logistic regression. First is the odds ratio, the odds in favour of a particular event, which can be written as P / (1 - P), where P is the probability of the positive event. By positive we mean the event we're looking for, not necessarily something positive in context, like the presence of a disease. The logit function is just the natural log of the odds ratio, the log-odds. What it does is take a probability, which lies between zero and one, and spread it over the entire real number range, minus infinity to infinity. This is what the graph looks like: the input is between zero and one and the output covers the whole real line. We actually want the opposite of this: we want a real-valued input, which would be our net input, the linear combination, and an output between zero and one, which we could treat as the probability that a sample belongs to the class. The inverse of the logit function is the sigmoid function, defined, as you can see by the green dot, as φ(z) = 1 / (1 + e^(-z)), where z is again the linear combination of the input vector and the weight vector. As you can see, the input domain is minus infinity to infinity and the range is zero to one: it takes a real input, squashes it between zero and one, and crosses 0.5 at z = 0. So now, instead of the identity function, this is our activation function. This is the schematic of Adaline versus logistic regression: the activation function is now sigmoidal, and its output is fed to the threshold function. You can use that output as a probability instead of a binary yes or no, but if you need a binary yes or no, that's available too.

Now, the cost function of logistic regression, long story short, is the one at the bottom. We want to maximize the likelihood that a sample belongs to a particular class. In practice it's easier to maximize the log of that likelihood, the log-likelihood, and maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood. And we know how to minimize, right? Gradient descent, gradient descent, gradient descent. So the cost function is J(w) = -Σ [ y log(ŷ) + (1 - y) log(1 - ŷ) ], where y is the true class label and ŷ is the probabilistic output of the activation function. Just to give you an idea: when y is zero, the first term vanishes, and when y is one, the second term vanishes. This function was chosen for many reasons, one of them being that when we differentiate it, the gradient comes out looking just like Adaline's.
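Here's a minimal sketch of the from-scratch logistic regression classifier we're about to walk through, assuming numpy and class labels encoded as 0 and 1 (the class name and attributes are my own choices):

```python
import numpy as np

class LogisticRegressionGD:
    """Logistic regression with batch gradient descent, a sketch of the classifier below."""

    def __init__(self, eta=0.05, n_iter=100, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]     # linear combination + bias unit

    def activation(self, z):
        # Sigmoid activation; clipping keeps exp() from overflowing for large |z|.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

    def predict(self, X):
        # Probability >= 0.5 means class 1, otherwise class 0.
        return np.where(self.activation(self.net_input(X)) >= 0.5, 1, 0)

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.cost_ = []

        for _ in range(self.n_iter):                   # batch updates, same shape as Adaline
            output = self.activation(self.net_input(X))
            errors = y - output
            self.w_[1:] += self.eta * X.T.dot(errors)  # gradient of the log-likelihood
            self.w_[0] += self.eta * errors.sum()
            # Negative log-likelihood cost instead of the sum of squared errors.
            cost = -y.dot(np.log(output)) - (1 - y).dot(np.log(1 - output))
            self.cost_.append(cost)
        return self
```

Usage mirrors the earlier classes: something like LogisticRegressionGD(eta=0.05, n_iter=1000).fit(X_train, y_train).predict(X_test).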
So the logistic regression classifier is pretty similar. We start with the __init__ method, with the same three attributes. Then we have the activation function, which is now the sigmoid, as you can see. Then we have the net input, which is the linear combination of the weights and the data plus the bias unit. Then we have the predict method, which now says that if the probability is greater than or equal to 0.5, we predict 1, else we predict 0 (or minus 1, depending on how you encode the classes). The fit method is pretty similar. We seed a random number generator, we make a weight vector of random weights, plus one for the bias unit, and we maintain a cost array. Then we loop over the entire data set, calculate the net input, calculate the output from the activation function, and calculate the error vector. These are also numpy arrays, so all of this is vectorized. We update the weights according to gradient descent, which looks the same: X transposed dotted with the errors; that is part of why this cost function was chosen. We update the bias unit with the sum of the errors, because it's for the entire data set and not one sample, and we record the cost. You can see the cost function is different here: it's not the sum of squared errors like Adaline, it's the negative log-likelihood, and we append those costs per iteration.

And this is how you train the logistic regression model imported from scikit-learn. You instantiate the classifier; the C parameter here, by the way, relates to a concept you would also meet in support vector machines, which we will not be talking about, but the logic we studied is enough to follow what's happening. You pass your data to the fit method, use the predict method to make predictions, and then calculate the accuracy score. This is how the official logistic regression model performed, and this is how the one we wrote performed. Identical.

So, to summarize: you learned the very core, fundamental machine learning algorithms, you learned the math behind them, you understood how they work and how you can actually write your own. You learned the perceptron, Adaline, and logistic regression, and powerful techniques like gradient descent. You learned the necessary math, you implemented them in Python, and you learned how the data flows through a classifier.

So, the learnings, or why you should do this too. First, you understand that machine learning is much less magic than the media makes it out to be; it's backed by hard science. Second, data is hard; data is the really hard part. You can structure your data in any number of ways, and if you know how the data is going to be used by your classifier, you can structure it that much better. And your model, by the way, is only as accurate as the data you feed it. So even though black boxes exist to save you time, it's all right to sometimes look inside them. People often give the example of driving a car without knowing how the engine works, because they don't need to, which is fine. But think about the day your car won't start or breaks down in the middle of your commute. If you know how the engine works, you can probably figure out, or at least narrow down, what is wrong and get it fixed that much quicker, which in my humble opinion is always a better position to be in. Libraries like scikit-learn have done a tremendous job of giving us highly optimized tools, but when you really understand what's going on, then on the day your model does not work, you can debug it that much better. And that, in my opinion, makes us better engineers. So thank you for listening. Now, acknowledgements.
I couldn't have done this without the help and knowledge of these wonderful people, the PyCon organizers, Abhishek, and my reviewer, Mr. Ashok Govind Rajan. Thank you. And yeah, question and answer; we'll just take two questions.

Hello. Hi. I had this question. You said that for the algorithm to work, the cost function needs to be convex. If it is not convex, if it is a non-convex region, then how will we deal with that?

So, we chose the function that we want to optimize, right? We chose the sum of squared errors; that's our cost function. We decide what the cost function is. It may not be convex, but then you probably won't be able to use gradient descent. We wanted to use gradient descent, therefore we chose a convex function. The cost function does not come bundled with the algorithm; the people designing the algorithm decided on that cost function. Thank you.

Suppose I have only a small amount of data. How can you train the model, and what will happen with the predictions?

Less data. So, the quality of your predictions depends on your data, right? If you don't have enough data, the model will probably have high variance, so it won't generalize well to new data. The solution is to either live with it or get better data. With little data, the quality of the model is proportional to the quality of the data. You can optimize your hyperparameters, but they can only do so much for you. The quality of the data is directly proportional to the accuracy of your prediction model. So my answer would be: get better data, and if you can't get better data, work on your hyperparameters.

All right, everyone, we've run out of time. So thank you for being here. Meet me outside if you have any other questions, right?