Okay. So, module three: we're going to introduce you now to neural networks, or artificial neural networks. Welcome to the Canadian Bioinformatics Workshops module on machine learning. The workshop is offered under the Creative Commons Attribution-ShareAlike license, so people can share and share alike according to that license. Today we're going to be talking about neural networks, or artificial neural networks, and we're going to give you a little bit of an introduction. In terms of timing, we've got about an hour and a half scheduled for this one; it might be a little long, so I'll try and move quickly. Neural networks are essentially derived from our understanding of biology and neurons, and so I'm going to try and relate that information to the architecture behind neural networks, and then explain how they evolved and how they were developed. The concepts with neural network algorithms include this idea of one-hot encoding, which we introduced earlier, back propagation, and the idea of hidden layers and the need to have these to perform discrimination and decision-making. We're going to look at the Python code for an artificial neural network for Iris classification. In the previous lecture, we looked at Iris classification using decision trees; we'll do the same thing with artificial neural networks, and we're going to compare the two, ANNs and decision trees.
So an artificial neural network, or ANN, is essentially a way of simulating the functional activity of the brain with simulated or synthetic neurons, which are those circles, and axons, which are those lines. As most of you would know, neurons are connected to each other through axons, and this connectivity is actually the reason why we are able to think and rationalize. These are artificial neurons, which is why we call them artificial neural networks. They were modeled after the concepts originally seen in the human brain and also analysis of other mammalian brains. They can be used for both regression and classification. They have been used in supervised learning, and they can also be used in unsupervised classification. They were really described in detail in 1986. As I said, they mimic the way the brain works: layers of neurons are connected, and because of that connectivity, and I'll explain this a little later, it's possible to do a variety of logic operations, which are essential for doing pattern analysis. To simulate a brain, we don't actually, you know, create chips and put wires from chip to chip. What the computer does is convert the neuron locations and connections to basically tables or arrays or matrices. So all those connections that we saw in that diagram with nodes and edges become numbers and positions in an array. This keeps it sort of mathematical or virtual, and that makes it obviously a lot easier to code. On the left side we're showing you a real brain with the typical neurons and electrical signals being sent out from the neuron down the axons. On the right is sort of the synthetic brain, where we've got neural connections, and below is the synthetic model of nodes and edges to mimic neurons and axons.
These are the hand-drawn diagrams from anatomists, but if you have, you know, good histology slides you could do the same thing. This is showing the neural cortex in the human brain, along with the wiring diagrams, the connections between neurons. People have noticed in general that there are multiple layers in the brain, where there are dense collections of neurons, then it's lighter, then more dense collections, then lighter again. That layering is something that's been known for probably more than 100 years, and it's six layers that they generally see, in the part of the brain where most of the logic and reasoning is done. So this was known, you know, more than 100 years ago, and people were also aware that much of the thinking in the brain was done through links and connections between neurons and axons. The way a neuron works in the brain is that there are little branching projections called dendrites that collect signals from other related or nearby neurons. The cell body with the nucleus, which is sort of the node, if you want, for a neuron, integrates the signals, and then does the appropriate chemistry, calcium release and electrical potential generation, to send the electrical signals down the axon on to other dendrites, which will then connect up to another neuron. So this idea of having a node, which is the cell, and then edges or links, which are either axons or dendrites, is something we obviously know about from neurochemistry. This accumulation of signals coming from other neurons to a central neuron essentially involves integration of signals. If those signals reach a threshold, then the neuron will fire and send its own signals to related or nearby neurons. And so that idea of integration, signaling, firing, and sending information out can really be described as a matrix and a set of interacting nodes.
So the top is sort of a schematic diagram of neurons, and the bottom is essentially a more mathematical node structure of the same thing. In many respects, these interactions can be described as Boolean interactions: things that are on and off, ones or zeros. And this, too, was part of the reasoning behind neural networks. In terms of being able to recognize images, this is one of the great strengths of neural nets, and it's also one of the great strengths of the human brain. You can take a whole bunch of observations, and here I've got sort of corrupted versions of the number eight in digital form. Most of us, if we looked at those, would realize that they all look like the number eight. And so we'd be able to fill in the missing data, we'd be able to impute or interpret, and say, yeah, these look like slightly corrupted versions, with some information loss, of the number eight. So in terms of image recognition, if we were to do that with an artificial neural network, we would have our training set, and this is sort of the input and output part of the concept. Then we would have an input layer, a hidden layer, and an output layer, so we have essentially kind of a two-layer network, if you want. What this neural network is supposed to be able to do is take these corrupted images and spit out the corrected number eight. Each of those layers, the input layer, the hidden layer, the output layer, is represented by nodes, the circles, and the connections are essentially weights. They're considered the equivalent of our axons or dendrites, but they have a certain amount of weight indicating the strength of the signal connecting one node to another, or one neuron to another. So the connections from nodes to other nodes essentially create a table, you know, a to b, or one to two, two to three, two to one, and those numeric locations can be described as an array or a matrix.
Now, conventional neural networks typically have one to maybe three layers, and with deep learning the number of layers is much larger, 10 or more. It was only with the development of more advanced computers and faster CPUs that deep neural networks became possible. Before the last 10 years or so, computer power wasn't quite enough, and there were other subtleties in terms of the mathematics of neural nets that didn't allow deep neural nets to be developed. Deep learning is a really hot area, as I said, and certainly the deeper the neural nets, the more complex the patterns and rules that can be learned. This again just mimics the fact that the human brain has at least six layers, probably more, and the human brain is capable of some pretty amazing things. Now, even though neural nets were described formally in 1986 by Rumelhart and McClelland, the idea of neural nets, mathematical modeling of neural nets, has been around since the 1940s, with contributions from people working at McGill and the Montreal Neurological Institute who were interested in trying to explain or model how neurons work. One of the earliest is the threshold logic model: the idea that you've got inputs, maybe coming from two different nearby neurons or dendrites, coming into the neuron, the "neurode", which is the body, and then an output with an axon. So w1·x1 + w2·x2 = y: weighted inputs lead to an output. With this neurode you could do things like ANDing or ORing, logical multiplication or logical addition. So 0 × 0 = 0, 0 × 1 = 0, 1 × 1 = 1: that's ANDing. And OR is 0 + 0 = 0, 1 + 0 = 1, and 1 + 1 = 2, but because the maximum you can get with binary is one, it also equals 1.
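To make that concrete, here's a minimal Python sketch of a threshold-logic unit in the McCulloch-Pitts style; the weights and thresholds below are just one choice that happens to produce AND and OR, not anything prescribed by the slides:

```python
# A threshold logic unit: it "fires" (outputs 1) when the weighted
# sum of its inputs reaches the threshold, and stays off otherwise.
def tlu(x1, x2, w1, w2, threshold):
    return 1 if w1 * x1 + w2 * x2 >= threshold else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND: both inputs must be on (weights 1 and 1, threshold 2)
AND = [tlu(a, b, 1, 1, 2) for a, b in inputs]
# OR: either input is enough (weights 1 and 1, threshold 1)
OR = [tlu(a, b, 1, 1, 1) for a, b in inputs]

print(AND)  # [0, 0, 0, 1]
print(OR)   # [0, 1, 1, 1]
```

The same unit gives two different logic gates purely by changing the threshold, which is the point of the threshold-logic idea.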
McCulloch and Pitts' idea was picked up by Donald Hebb, who was at McGill. He made the observation that if an axon from one cell is close enough to another cell and repeatedly takes part in exciting it, there's a metabolic change that enhances the activity, so that the firing onto cell B, that star-looking cell, increases or enhances its signal. The fact that the weightings could change, that the strength of the signal could change, and this is where this idea of w1 and w2 comes in, that connections could enhance or diminish signals, was something that was important both for the mathematics of ANNs and also for the development of the perceptron model. Frank Rosenblatt developed what's called the perceptron, and this is sort of an extension: it still uses the idea of weighted inputs that are summed together in the neurode, but in addition there's a function that's applied to generate the output. What's shown here is a sigmoidal function, and so depending on whether the level coming out is above or below the threshold, it will lead to a different signal. So the use of a thresholding function, an activation function, was important for the development of, I guess, the precursor to the neural net called the perceptron. It is a mathematical model of a neuron: it takes input values, takes weights, adds them together, and then passes them through this activation function, this sigmoidal function. So in the McCulloch-Pitts-style diagram, you've got x1 and w1, x2 and w2, and so on, all going into the neurode. They're summed together, but then they're altered by this activation function, which is some kind of sigmoidal or softmax-type function, and that gives you the output. So you get the weighted sum of the inputs, and the outputs.
And then in this case, if we had a step function, it would be: the output equals one if the weighted sum is greater than zero, and if the weighted sum is less than or equal to zero, the output is zero. So you get one or zero. The weighted sum can technically be any number, between minus 1000 and plus 1000, say, so the step function is a way of normalizing things: in this case a maximum of one and a minimum of zero. It can also be adjusted, and it doesn't have to be a step function; it could be a sigmoidal or softmax function. So the perceptron conceptually was very close to the neural net. And it was demonstrated that what you could do with this is take data, which was the inputs, and it could essentially learn: by taking the input, looking at the output, and comparing the output to what was expected, the difference between the predicted and the observed, which would be your error. Then taking that difference between observed and predicted output and using that error to modify the weights on the inputs would actually get the perceptron to predict more and more closely, so that the error was consistently reduced. It was essentially a cyclic process: inputs produce an output, the output is compared to the truth or the gold standard, and an error is calculated. The error is then used to modify the weighting on the inputs, you run it through again, and hopefully it's better. You adjust based on the error, modify, and repeat and repeat until the minimum error is achieved. And now the perceptron has learned certain phenomena or certain features. The error function is the difference between the expected and observed; the adjustment is also scaled according to a learning rate, and essentially by the slope of the error function. So this is essentially the 1950s version of a neuron or an artificial neural network. This is describing, in essence, the gradient function.
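That cyclic learn-from-error loop can be sketched in a few lines of Python. This is an illustrative toy, not the code from the workshop: the target function (logical AND), the learning rate, and the epoch count are my own choices.

```python
# A tiny perceptron with a step activation, trained with the classic
# perceptron rule: weights move by (learning rate) x (error) x (input).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])      # targets for logical AND

w = np.zeros(2)                 # weights start at zero
b = 0.0                         # bias
alpha = 0.1                     # learning rate

for epoch in range(20):         # repeat until the error goes away
    for xi, target in zip(X, y):
        output = 1 if xi @ w + b > 0 else 0   # step activation
        error = target - output               # expected minus observed
        w += alpha * error * xi               # adjust weights by the error
        b += alpha * error

preds = [1 if xi @ w + b > 0 else 0 for xi in X]
print(preds)  # [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop settles on a correct set of weights.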
This may involve an activation function that could be a step function, sigmoidal, or linear, and you can calculate the derivative of that. We have, in this case, alpha, which is the learning rate; it can also be written as eta, which looks like the letter n but is the Greek letter η. The target output is the gold standard; the actual output comes from the perceptron. All of these things, the learning rate, the difference, the neuron's activation function, and the weight of the input, are used to calculate how you're going to change your weights. So there's a mathematical structure to the weight calculation, and it's able to change the weights so that the perceptron gets closer and closer to the correct answer. What the perceptron is able to do is essentially perform logical operations like AND and OR. Those AND and OR operations are useful for what are called linearly separable problems: putting things into two categories on either side of a straight line. In that regard, a perceptron is just a linear combination of inputs. The slide shows how an AND function works and how an OR function looks. ANDing is the equivalent of logical multiplication, and it shows how, say, with inputs of zero and one versus an input of one and one, you would be able to separate the (1,1) point from the (0,1), (1,0), and (0,0) points. Or, with OR, you could separate (0,0) from (0,1), (1,0), and (1,1). So it's able to draw, if you want, a hyperplane; it's able to separate things in a reasonable way. AND and OR are critical for any kind of logic; most of you know a little bit about computers, and Boolean logic is based on AND and OR. There's obviously a threshold here too, and we can also have a bias, which is marked with the b, and that can also make adjustments. But this, as I say, is sort of like the step function: either one or zero.
And that way we can do AND or OR. Now, if you want to be able to do something a little more complicated, something that's not just a linear combination, something that's not linearly separable, if you want to be able to measure inequalities or logical differences, you have to use a logic function called exclusive or, or XOR. So: is zero different from zero? No. Is zero different from one? Yes. Is one different from zero? Yes. Is one different from one? No. The XOR essentially allows you to make distinctions. It also allows you to separate two or more classes, whereas with AND and OR you can only separate one simple, linearly separable group. This is illustrated here, where the white circles, (0,1) and (1,0), have to be separated from the black circles, (1,1) and (0,0), and no single straight line can do it; essentially we have to draw two lines to separate the classes. This piece of logic, which was realized I think in the 1960s, was absolutely critical for being able to do classification. So how do you go from the perceptron to some weighting function that allows you to discriminate? This is where exclusive or actually requires not one layer, but two layers. We've seen the example of the perceptron, where you just have inputs going into the one neurode: that's one layer. The issue with exclusive or was identified by Marvin Minsky in 1969, and in 1975 Werbos came up with the concepts of using hidden layers and back propagation. The idea was largely forgotten and then resurrected about 15 years later by Rumelhart and McClelland, who developed artificial neural networks. Artificial neural networks continued for about 15 to 20 years, and then the idea of recurrent neural nets and deep neural nets started emerging in about late 2012, 2013, and it's just taken off.
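To sketch why two layers are enough for XOR, here's a tiny network with hand-picked (not learned) weights: one hidden threshold unit acts as OR, the other as NAND, and the output unit ANDs them together. The particular weights and thresholds are just one choice that works.

```python
# XOR from two layers of step units: no single line separates the XOR
# classes, but two hidden units plus an output unit can.
def step(z):
    return 1 if z >= 0 else 0

def xor(x1, x2):
    h_or = step(x1 + x2 - 0.5)        # fires if either input is on
    h_nand = step(1.5 - x1 - x2)      # fires unless both inputs are on
    return step(h_or + h_nand - 1.5)  # fires only if both hidden units fire

results = [xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(results)  # [0, 1, 1, 0], the XOR truth table
```

The two hidden units correspond to the two separating lines mentioned above; the output unit just checks that a point falls between them.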
So it's taken 80 years, roughly, for neural nets to evolve to the point where they're almost ubiquitous. The original name for a neural network was actually the multi-layer perceptron, because it was giving credit to the perceptron that had been developed by Rosenblatt; it's using not just one layer but two or three or dozens of layers. At a minimum, a neural net has to have three layers: an input layer, a hidden layer, and an output layer. Those are the neural nets that we're going to be using today, and that I'll describe. Because of these hidden layers, unlike the conventional perceptron, it's able to do AND, OR, and exclusive or, and that means it's useful for classification. Also, in order to train and modify the layers, rather than just the gradient descent optimization, it uses this back propagation, which is sort of an extension of the perceptron gradient calculation. And as I said, neural nets were really rediscovered; they'd been around for a good 30 years before. I'm going to show a video that I found very useful. It's about three or four minutes, just describing neural networks and how they work. So let's see if this works. What's a neural network? A neural network is a type of program that learns how to do things, instead of being hand-programmed to do things. It's inspired by the neural networks in your brain. Your brain contains around 100 billion cells called neurons. These neurons have dendrites, which they use to receive signals from other neurons, and axons, which extend outward to send their own signals to other neurons. When a neuron receives the right mix of signals at its dendrites, it then sends out its own signal on its axon. A neural network in a computer is similar. Here's a very simple one, one that I've written actual code for. It has an input layer and an output layer. There's also a layer in between called the hidden layer. These circles represent the neurons found in your brain.
Sometimes they're called neurons here too, but most often they're just called units. To keep the number of units small to fit in this video, I've trained it for something that doesn't need a lot of inputs or outputs. I've trained this one to count. It can count from zero to seven in binary. I could have made it count in decimal, but that would have been boring. If I set the input units to all zeros, then I've taught it to set the outputs like this. 001. That's the number one in binary. If I give it 001, it outputs 010. That's the number two in binary. It does that all the way up until you give it a seven in binary, which is 111. If you give it seven, then it outputs zero. But the fun thing to do is to give it zero to start with. And from then on, you simply take whatever it outputs and feed it back into its inputs. It just keeps on spitting out zero to seven and then keeps repeating. It counts. I've actually embedded the trained neural network into a web page, which I'll talk more about later. Here I give it a two and it spits out a three. And here I tell it to start counting from 000 on its own. How do we teach the neural network? When we give it some inputs, some magic has to happen with all the stuff in here that makes it spit out what we want at the outputs. Each connection here has a number associated with it called a weight. There's also something else called a bias unit that has some value. And that's connected to the hidden units and output units and its connections also have weights. So to get the neural network to produce outputs, we'll take the input values, do math with all these numbers. And if all those numbers are just right, it'll spit out the outputs we want. And all those numbers will at the same time be such that it will spit out the correct outputs for all possible inputs. So training a neural network involves adjusting all those weights, such that given an input after doing all the math, it spits out the correct output. 
To train it to do that, we use something called the back propagation algorithm. It's called that because first we go one way through the network, from input to output, and then propagate back from output to input, adjusting things. We start by making a training set. The training set consists of the inputs and what we expect as output for each input. So if we give it input 000, we want something close to 001 at the output. For 001, we want something close to 010, and so on. We'd normally give it the first set of inputs from the training set, 000. But that would give us very boring numbers to look at, mostly zeros. So let's assume we've already done some and are on the seventh input, 110. At the start of training, all the weights are just random numbers. I won't go through all the math in detail, but we take the value from the first input unit, a one in this case, multiply it by the first weight going to this hidden unit, and store that in the hidden unit. We do the same for all the other input units that are connected to that hidden unit and add their results in too. We also multiply the value of the bias unit by its weight and add that to the hidden unit's value too. We then adjust that using an activation function that, among other things, adjusts it to be between 0 and 1. We go across and do the same for all the hidden units. Those values for the hidden units are now the inputs for doing the same with all these weights and this bias unit and its weights. And once we've adjusted the results using the activation function, we finally have the outputs. The first time we do all this, those outputs are likely nonsense. Now we need to go back through the network and adjust all those weights such that we'll get a better result the next time. But we have to do it gently, because we'll also be adjusting those same weights to work with other inputs too. Remember, our training set included both the possible inputs and the expected outputs.
Using the expected output for the input we just used, we can calculate the errors for the outputs. In other words, just how far off were those outputs? We then use that error to go back through the network, adjusting the weights by very small amounts so that we'll get a smaller error the next time we give it that input. And that's why we call it the back propagation algorithm: we propagate back through the network, adjusting weights. We repeat that whole process for all the inputs and outputs in the training set. And once we've done that for the whole training set, we then repeat it a few thousand times, until the error we're getting is very small, small enough that we consider it trained. For this neural network, I had to go through the training set around 680,000 times to get the RMS error to under 0.0005. Okay. I've used this video, What's a neural network?, a few times, probably because I think it's really nicely explained. Even though it's a simple example, it really helps, I think, in understanding the concepts. For most of you, this is all you really need to understand in terms of the concepts, but we are going to dive into the math a little bit so you understand how the algorithm works. So, this is showing the sort of standard architecture that was illustrated in that video by Rimstar: an input layer, a hidden layer, and an output layer. The number of inputs can be four, the hidden layer can have five nodes, and the output can be one. You can have many other architectures, where the input layer has 100 or 200 units, the hidden layer has 100 units, and the output layer has three or four or ten units. So architecture varies with neural networks. They are called feed-forward networks because the input moves in one direction, towards an output. Back propagation is this idea that you send signals back to modify the weightings and the strengths, but there is a directionality to neural networks.
The concept of a hidden layer is that layer between the input and the output. You can have one hidden layer, two hidden layers, three or more. These obviously help modify the input; they're critical for being able to calculate exclusive or, to be able to do differentiation of classes or groups, and also for regression. The forward propagation step is the step in the learning where the input is passed through the network; that's where things are multiplied. This is the perceptron idea, where you take all the weights, take the strengths, and then calculate an output that's been passed through some thresholding or activation function. The back propagation part is the way that we correct the error between the output and the observed or expected output. That's where you make adjustments to those nodes, those circles, and modify their weights, in terms of the weight matrices and the biases. You also saw the term epoch. That's essentially the number of times that you carry out the training and back propagation: for a training set you'll repeat it many times, and it's not unusual for a neural network to be trained over minutes, hours, or even days, where hundreds or thousands or hundreds of thousands of epochs are calculated. So each iteration of training over the set is an epoch. In many cases, what people have found is that batch learning is better for neural networks. It's sort of learning a little at a time. When you first learned about addition, you kind of learned how to add one plus one, and one plus zero, and one plus two. In the next batch you started learning, you know, what's six plus eight, and in the next one, what's 20 plus 21. The numbers get larger or more complicated, and in this way you break up your training into batches. That process is essentially more efficient for learning than dealing with one large single batch.
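The epoch and mini-batch bookkeeping can be sketched like this; the dataset, the sizes, and all variable names here are purely illustrative (an Iris-sized random array), and the actual forward/backward pass is deliberately elided:

```python
# Hypothetical sketch of epochs and mini-batches: one epoch is a full
# pass over the data, and the weights get updated once per mini-batch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))        # e.g. 120 samples, 4 features
y = rng.integers(0, 3, size=120)     # 3 classes

n_epochs, batch_size = 50, 16
n_updates = 0
for epoch in range(n_epochs):
    order = rng.permutation(len(X))  # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        X_batch, y_batch = X[idx], y[idx]
        # forward pass, error calculation, and back propagation
        # would go here, one weight update per mini-batch
        n_updates += 1

print(n_updates)  # 50 epochs x 8 mini-batches = 400 updates
```

The reshuffle each epoch is a common convention so the network doesn't see the batches in the same order every pass.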
So these mini-batches are one of the tricks that people have found for speeding up the learning process, sort of taking things one small bite at a time. This is a model of forward propagation, which is very much like the perceptron. We start with some input; in this case we've got two hidden layers and an output. We're taking the weighted sums, so the w's are the weightings. We have functions f1, f2; each function is applied to a weighted sum, so the first one is a function of the weights times the x's, the next ones are functions of the y's, and we have the different output elements, the y's. The input to each function is a linear combination of the weights, the w's, and either the inputs, the x's, or the outputs of the previous layer. So this is the method of forward propagation: you see it goes from the left side to the right side, and that's the forward propagation step. What you're also seeing are indices on the weights, like w11 and w12. These are connections between the different layers, so they have indices: w12, or w24, or w35. Those represent positions in a table or a matrix, and so we call the weighting a weighting table or weight matrix, and the arrows are the connections. After we've calculated our output, as shown in that video, usually we're not very close to the expected or known output. So at this stage we calculate the error, which is like what was done in the perceptron, and then we start making small changes. The function we use here is usually a cost function, which again is a function that we've chosen: it might be related to a sigmoidal function, which is related to the activation function, or it might be another type of cost function, such as cross entropy. There are any number of cost functions that are used in neural networks.
Anyway, that error, these deltas or derivatives, are then multiplied by those weight numbers, the w12, w24, w35 and so on. So we can see that instead of going left to right, we're going right to left, and that's the back propagation step. Those errors are propagated through both the functions and the weights with those deltas. There is typically a derivative that's taken, and derivatives are how we do gradient descent optimization. So if there's some cross entropy evaluation or some sigmoidal function we're using, we take the derivative, and that allows us to determine the error. When we're taking those derivatives, we can determine whether we're on a positive slope or a negative slope, which direction the weights should be moved, whether they should become more positive or more negative. So this derivative is critical for deciding whether to add or subtract values from the weights as we go through the back propagation. With those derivatives of the cost function, which let's say for now is a cross entropy, we multiply the derivative of that cost function; there's a little eta there, which is the learning rate; and then there are the deltas, which are, if you want, the error terms; and these are also multiplied by the output value, y1, y2, or y3. So in this case we're able to modify the new weights with this combination of learning rate, errors, and derivatives with respect to the weighted sum of the inputs. What's written on the left is the delta: it involves the derivative of the cost function. The df term is the derivative of the activation function, which is usually the sigmoidal one; the cost function for this one is some kind of cross entropy; and y_i is the output of each neuron. This is essentially similar to that perceptron formula, the delta rule, but with some modifications. Alpha in the perceptron rule is the same as eta here.
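As a small numeric illustration of that delta-rule style update, here's one weight change for a single sigmoidal neuron. To keep it self-contained I've used a squared-error cost rather than cross entropy, and all the numbers are arbitrary illustrative values, not ones from the slides.

```python
# One gradient-descent weight update for a single sigmoidal neuron:
# delta = (derivative of cost w.r.t. output) x (derivative of activation)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.5                      # learning rate (the eta in the slides)
x = np.array([0.0, 1.0])       # inputs
w = np.array([0.4, -0.2])      # current weights
b = 0.0                        # bias
target = 1.0                   # desired output

y_out = sigmoid(w @ x + b)     # forward pass
dE_dy = y_out - target         # derivative of the squared-error cost
dy_dz = y_out * (1 - y_out)    # derivative of the sigmoid activation
delta = dE_dy * dy_dz
w_new = w - eta * delta * x    # step the weights down the error gradient
b_new = b - eta * delta
print(y_out, sigmoid(w_new @ x + b_new))
```

After the update, the neuron's output is measurably closer to the target, which is exactly what repeating this over many epochs exploits.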
The derivative of the activation function here is similar to the derivative of the activation function in the perceptron rule; where they differ is the derivative of the cost function, which is used in the back propagation step. So we've moved through layers one and two, and this is again sort of the animation of going backwards through the propagation, how the different weightings are applied to those arrows, essentially, and how things are modified. In each case you can still see there's an eta, which is the learning rate, the delta, which is the error term, and the derivative of the activation function. We take the relevant derivatives, dE/df or df/de, and multiply either by y1, y2, y3, or by x, depending on where you are in the layers. So it's a relatively complicated bit of math, but it's essentially the way that all of these weightings are adjusted; they're modestly changed. Those changes, as we propagate through the weight matrix and through the neural net, are essentially changing the steepness of the activation function: higher weights push it toward a step function, very steep, and lower weights push it toward a shallow function. That's largely what we're doing: we're just changing how steep, how vertical or non-vertical, that function is for the output projection. Having biases is essentially a way of shifting this sigmoidal curve to the right or left. If we're trying to get an output of, say, 0.5 for an input of 5, we can apply a bias of minus 5. This essentially allows us to scale to different numbers: even though our sigmoidal function just ranges between zero and one, the addition of the bias allows us to handle inputs that aren't centered at zero. And then the learning rate: in the perceptron that's alpha; in the nomenclature of artificial neural networks it's eta, which looks like the letter n.
Again, it varies between zero and one, and you can adjust your learning rate: if it's too small, things go down in incrementally very slow steps, and if the learning rate is too rapid, it just kind of bounces around erratically before it can get to the minimum. Ideally, with the concept of gradient descent, you typically move in big steps when the slope is steep and little steps when it's shallow. The middle curve is the example of how you can adjust the learning rate so that you won't overshoot your minimum in terms of your cost function. So if we're looking at the math for a neural net, we can think of something where we've got some input vector, say two values, zero and one. And we have a weight matrix; this one is a two by three array. So a one by two vector multiplied by a two by three array generates a one by three array. This is an intermediate calculation. That intermediate calculation can then be multiplied by a three by one array, which produces an output of just a single number, in this case 0.13. So we have our weight matrices, in this case weight matrix one and weight matrix two; we have our input, which is zero and one; and we have an output, which is 0.13. We can pass that output through our activation function and determine, you know, is it close to zero or is it not, and compare it to the desired output. If it's off, then we go back through the back propagation and modify, using those derivatives of the cost function, the errors, and the derivatives of the activation function. Those either increase or decrease the weights, depending on their strengths and relative importance. Now we've got a new matrix, and we put in the same vector again and ask, are we better? And we find that we've moved from 0.13 to 0.062, so that's actually closer to zero. So we're getting closer.
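The matrix arithmetic just described looks like this in NumPy. The weight values here are arbitrary placeholders, not the ones from the slides, so the output number differs; the point is the shapes: (1×2) times (2×3) gives (1×3), and (1×3) times (3×1) gives a single number.

```python
import numpy as np

# Sketch of the forward matrix arithmetic: input vector -> hidden values
# -> single output. Weight values are arbitrary, not the slide's numbers.
x = np.array([[0.0, 1.0]])              # 1x2 input vector
W1 = np.array([[ 0.2, -0.5,  0.3],
               [ 0.4,  0.1, -0.2]])     # 2x3 weight matrix (layer one)
W2 = np.array([[0.6], [-0.3], [0.5]])   # 3x1 weight matrix (layer two)

hidden = x @ W1       # (1x2) @ (2x3) -> 1x3 intermediate calculation
output = hidden @ W2  # (1x3) @ (3x1) -> a single number
```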
And then we repeat this cycle over and over until we get to a good value. We can then go on to another new input; instead of zero one it's now one one. We multiply through everything again, and we see if this is actually better, and actually it's not quite the output we want. So we do some modifications again, trying to make sure that the network is still able to handle the zero one input as well as the new one one input. We do the comparison. So we try our inputs, we try our outputs, just as illustrated in the video, and eventually we converge to these two weight matrices, and these are the weights that allow us to take an input and predict an output consistently. So all the effort of forward propagation and back propagation is really a lot of just simple matrix algebra or dot products being performed. And this is another example, illustrating the same thing that was shown in that video, where again we've got an input layer, a hidden layer, and an output layer. The input layer has two values, the hidden layer three values, and the output layer one output. In terms of the weight matrices, there's a two by three array and a three by one array. The two by three array has numbers like minus 0.3, minus 0.44, minus 0.43, and so on. The three by one array has values of minus 0.26, 1.11, and so on. And then the output: we have an input of two ones, and the output here is 0.73. And then we calculate what the error is between that and the desired output, which, I guess, maybe was 0.19 or something. So we measure the error, and the network goes through and modifies the weights through back propagation, followed by another forward calculation.
And we can see how it evolves: the output starts at 0.73 with a large error, and the error keeps shrinking through the iterations; by iteration 40 the error is getting smaller and smaller. Finally the error more or less converges, and we end up with an error of about 0.11. So this particular neural network converged after 59 iterations, but you can see from this that there are changes throughout: the weightings keep on changing. Look at the middle and the right: the values keep changing, some of them growing, some of them not changing very much. And by the time they've converged, most of them have stopped changing, maybe only in the second decimal place. So this is the result of the back propagation changing those weights ever so slightly. Okay, so that's kind of the background, and now we'll go through the same sort of process where we're trying to classify the iris data using neural networks. It's the same six steps: define your problem, construct your data set, transform your data set, choose and train a model, test and validate the model, and then use it to start making predictions. So we've got our data set, the iris flower data set that we had previously used, with versicolor, virginica, and setosa. We're going to take four features, which are the petal length and width and the sepal length and width. We've got 150 samples: 105 for the training set and 45 for the test set. We're going to have an input layer, a hidden layer, and an output layer. We've seen the same data format when we did this for the decision tree, same structure. So now you can find or write the program. We could write it from scratch, or, since it's already been written, you can go to the machine learning site, module three, and choose the Python code. That's the one we're looking at. There is also R code.
If you're more comfortable with that, you can choose it. Once you've chosen it, note that iris ANN, for artificial neural network, is distinct from iris DT, which is the decision tree. The algorithm for the neural network is more complicated than a decision tree. There are similarities: we read the data, we check the data, we create our training and testing sets, so there's the 70/30 split. But then things start getting a little different. We have to do one hot encoding; this is converting things so that they're readable, converted into sort of an array of binary values. We have to normalize the data; we do what's called an L2 normalization of the petal lengths and widths and sepal lengths and widths. We then have to encode the labels and do the one hot encoding. We have to define our activation functions; in this case we're using, I think, sigmoid for layer one and softmax for layer two. We have to initialize our weights and biases; these are kind of random numbers. And we have to determine how many batches we're going to train; we're not going to just train one batch, we're going to have several batches. Then we do the forward propagation from our randomized weights and biases, we calculate our errors from the neural net, and then we do the back propagation to adjust them. Then we update our weights and biases and repeat this over and over again until the errors are minimized. So there are, you know, about 15 different elements to a neural net. As before with the decision tree, we import numpy and pandas. We're also pulling in seaborn and matplotlib to help with some data visualization a little later, but this is an example of some of the libraries that we use just to make things a little easier for our computations. So, like the decision tree, we read our data; it's a comma separated file. We use the same pandas structure for reading the data, and it reads it in as matrices.
We have the same data checking; we're just making sure that there's no missing data. So it's the same little piece of code that we used in the decision tree that we're also using in this one. It's just good coding practice for verifying the data set. Now, after we've constructed and verified, we're going to transform the data set. The first thing is to separate the training set, which is about two thirds, so 105 flowers, from the test set, which is about one third, or roughly 30%, 45 flowers. So this is just the split where we've got two thirds for training and one third for testing. Then we have to convert our flower data set into a set of categorical, or rather binary, numeric values. So we're taking what would have been the list of the different flowers, whether it's setosa, versicolor, or virginica, and converting it into a format where we produce a set of setosa, versicolor, and virginica headers, and then we indicate with ones and zeros whether each is true or not. In this regard we can encode the flower status as a one zero zero, or a zero zero one, or a zero one zero. This three digit code is a way of identifying which flower it is. So we're not going to use "se", "ve", or "vi"; as I say, we're just converting it to ones and zeros, in this case an array of three numbers. So we create our one hot encoding function, it's called to_one_hot, and it just does what I was showing in that image: it creates positions corresponding to the flower name. As I said, in neural networks you have to normalize, and in decision trees you don't. So this data transformation is to try and help with the scaling. We've got petal widths and sepal widths, numbers ranging from about 0.5 to about 10, and it's not exactly linear. So what we're trying to do is normalize things.
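A minimal version of that one hot encoding idea can be written as below. This is a sketch of the concept, not the workshop's exact `to_one_hot` function: each species index (0, 1, or 2) becomes a three-element binary vector with a one in the position corresponding to the flower.

```python
import numpy as np

# Each integer label becomes a row of zeros with a 1 in its class position.
def to_one_hot(labels, n_classes=3):
    one_hot = np.zeros((len(labels), n_classes))
    one_hot[np.arange(len(labels)), labels] = 1
    return one_hot

labels = np.array([0, 2, 1])  # e.g. setosa, virginica, versicolor
encoded = to_one_hot(labels)  # rows: [1,0,0], [0,0,1], [0,1,0]
```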
It's called feature scaling; L2 normalization is the same sort of thing as calculating the distance between points: x squared plus y squared plus z squared, take the square root, that sort of thing. The point is that the scaling helps to bring down values that are maybe a little too high or a little too low; it makes the distribution, I guess we'll say, a little more normal. So we can see L2 is the L2 norm, and we're calling this function normalize. We're then going to take the data out from our data set and perform that normalization on the four by one hundred and fifty iris values. We're also going to do the label encoding. So we're now converting species to 0, 1, and 2, and then we can also convert those through one hot encoding to zero zero one, zero one zero, or one zero zero. So this is the one hot encoding where we convert the 0, 1, 2 into zero zero one, zero one zero, one zero zero. There's also an element of feature selection, which is mostly for things that you can do in your lab, so we're not going to really dive into that, but it just allows you to determine which features to use. So we've now taken our data set and done some L2 normalization to transform it and get it normalized. This is a common mistake for a lot of people doing neural nets and other machine learning: when they're dealing with numeric data, they don't normalize. As I say, with decision trees you don't have to, but for just about everything else you do. The one hot encoding is critical for neural nets and other types of network models. So we've seen how we've one hot encoded things. Now, as we build the neural net, we have to define the activation function. The sigmoidal function is very commonly used.
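The L2 normalization just described can be sketched as follows. This is an illustrative version, assuming per-sample normalization (each flower's four measurements divided by that sample's Euclidean length, sqrt of the sum of squares); whether the workshop code normalizes per sample or per column is a detail of the actual script.

```python
import numpy as np

# Per-sample L2 normalization: divide each row by its Euclidean length,
# so every sample ends up with unit length.
def normalize(X):
    norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
    return X / norms

X = np.array([[5.1, 3.5, 1.4, 0.2],   # example iris measurements
              [6.3, 2.9, 5.6, 1.8]])
X_norm = normalize(X)  # each row now has L2 norm 1
```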
It's a threshold that sort of mimics a step function, but it's not a perfect step function; it's differentiable, which helps because we have to calculate derivatives, especially for the gradient descent optimization used to minimize the error between the predicted and actual values. So this code defines the sigmoidal function and also the derivative of the sigmoidal function. The sigmoidal function is used for layer one; we're going to use a softmax function for layer two. This is just some math to remind people about the difference between a sigmoidal function, which is one over one plus e to the minus x, and the softmax function, which has a similar style, I guess, but is built from a set of exponential functions. The derivative of a sigmoidal function is essentially sigmoid times one minus sigmoid. So that's one where you actually don't have to calculate the derivative numerically; it's known mathematically, so you can just put in the actual function. With the second layer of this neural net we use the softmax function. This slide describes the softmax function and how it behaves. It's more useful than a sigmoidal function if you want outputs that sum exactly to one. So that's really useful for getting an output that will sum to one. So we've got our functions defined: sigmoid for layer one, softmax for layer two, where layer two is trying to get an output that ranges from zero to one. We're going to initialize the weights and biases in our weight matrices; these are typically random numbers between zero and one. You can see how we've called the random functions: in some cases it's a uniform random set, and there's also a Gaussian random number that we're using for the biases, since biases can be positive or negative. We're going to determine a number of batches; we know how many flowers we've got.
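The three functions just described can be written compactly. This is a generic sketch, not the workshop's exact code; the max-subtraction in softmax is a standard numerical-stability trick and may or may not appear in the original script.

```python
import numpy as np

# Sigmoid for layer one; its derivative has the closed form
# sigmoid(x) * (1 - sigmoid(x)), so no numerical differentiation is needed.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Softmax for layer two: exponentials normalized so the outputs sum to 1.
def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
probs = softmax(z)  # three values that sum exactly to one
```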
We're just trying to come up with a nice even number: with 105 flowers we make up five batches of 21 each. This is just doing the math to make sure we don't end up with batches of unequal sizes or a fraction. Once we've broken the training set up into five batches of 21 flowers (five times 21 is 105, which is the 70% number), we're going to go through this process of forward propagation, error determination, back propagation, update weights and biases, and repeat: forward propagation, error determination, back propagation, update weights. We'll do this for batch one, batch two, all the way to batch five, and that's one epoch. And then we repeat that for hundreds or thousands of epochs. We have a learning rate, which, remember, is eta; the batch size, which, let's just say, gives us five batches; and the number of epochs, which we can choose, or we can define it so that when the gradient or change gets minimal the algorithm stops. So the inputs are the batch size, learning rate, and epochs, and the outputs are the trained weights, the new biases, and the training error measurements. So this is the meat of the neural network. It's the forward propagation step, which is what you saw with the perceptron, what we saw in the video, and what I've explained before: we take a batch of things, flowers in this case, figure out the size, and essentially do this first layer propagation, calculating the matrix or dot product between the input vector and the weight matrix. So it's a vector times matrix multiplication. And if you're not familiar with linear algebra, yes, it might be time to study it a little bit. But again, this is just simply multiplying a vector times a weight matrix. We do this first for the first layer, then we do it with the second layer, and then we scale them by the sigmoid for layer one or the softmax for layer two.
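The forward propagation step for one batch can be sketched like this. It's a hedged illustration, not the workshop script: the weights and the batch are random stand-ins, and the shapes follow the iris setup (4 features, 3 hidden units, 3 output classes, batches of 21).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W1, b1 = rng.random((4, 3)), rng.normal(size=3)  # layer-one weights/biases
W2, b2 = rng.random((3, 3)), rng.normal(size=3)  # layer-two weights/biases

batch = rng.random((21, 4))    # one batch of 21 flowers, 4 features each
a1 = sigmoid(batch @ W1 + b1)  # first layer: matrix product, then sigmoid
a2 = softmax(a1 @ W2 + b2)     # second layer: matrix product, then softmax
```

Each row of `a2` is a set of three values summing to one, one per flower class.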
So that's the forward propagation; now we're going to compare our output with the known output and calculate an error. This is where we're looking at the batch labels, each of the labels, and saying, we predicted setosa, but it's actually virginica, so we're wrong. In that case we're not using the words setosa and virginica; we're using codes like zero one zero as a way of assessing that. So we measure our outputs and compare them, and we're going to calculate the derivative of the cost function. In fact, for this particular setup, those derivatives end up being just simple differences. And then we're going to start modifying the weights based on this error that we calculated. As we go from layer two to layer one, that's back propagation. So we're going to start again, multiplying the delta by the derivative function. And then there's the weight delta and there's the bias delta; those are all being accumulated. And we go from layer two to layer one, and from layer one to layer zero, as we propagate the errors through this weight matrix. So we're going to start by saying, at least in terms of our notation, dcost/da_h, that's the delta. The da_h/dz_h is equivalent to dy/de, that's the error function that's shown, so it's df/de or dy/de, and a_h is equivalent to y. So this is just mapping the different letters or symbols onto the ones that we were using in the previous model. But again, we're just modifying, in a directed, intelligent way, how the weights should change to produce the best performance. And we're continuing as we go from layer two to layer one to layer zero. Now for layer one we've got the layer one error, the layer one weights delta, and the layer one biases delta, and those are all determined from the previous layer.
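The "simple differences" point above can be made concrete. This is a sketch under the standard softmax-plus-cross-entropy assumption, with random stand-in values rather than the workshop's data: the layer-two delta is just predicted minus true, and the layer-one delta multiplies that error back through the layer-two weights and the sigmoid derivative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
W2 = rng.random((3, 3))                       # layer-two weights (stand-in)

z1 = rng.normal(size=(21, 3))                 # layer-one pre-activations
a1 = sigmoid(z1)                              # layer-one outputs
a2 = rng.random((21, 3))                      # layer-two predictions
a2 = a2 / a2.sum(axis=1, keepdims=True)       # make rows sum to one
targets = np.eye(3)[rng.integers(0, 3, 21)]   # one-hot labels (random)

delta2 = a2 - targets                         # layer-two error: a difference
delta1 = (delta2 @ W2.T) * (a1 * (1 - a1))    # back through W2 and sigmoid

dW2 = a1.T @ delta2                           # layer-two weight delta
db2 = delta2.sum(axis=0)                      # layer-two bias delta
```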
Again, we're getting this da_h, the dcost/da_h; those are all being used in the multiplication. At this stage we've done the back propagation and we can update all of our weights. We multiply by the learning rate, which is eta, or in this case we're calling it LR; so LR equals eta, that Greek letter that looks like an n. This is just illustrating how the weight is modified for both weight matrix zero and weight matrix one. We can update the biases as well in similar form, and again we multiply by the learning rate. And then all of these are flattened from essentially a two dimensional array into a one dimensional vector. So for the entire process, essentially taking those batches and going through a whole bunch of epochs for the set, we see the same things: first layer propagation, second layer propagation, layer two error calculation, layer two derivative calculations, layer one derivative calculations, and then the weight and bias updating. And we repeat that over and over again for hundreds or thousands of epochs until things converge. So it's this mix of forward propagation, back propagation, and weight adjustments. And what you see is that in this case we're looking at a thousand training epochs. We start off with an error of maybe about 0.45, and then it starts falling and falling. This is the error, the difference between the output and the true label. In this case these are the labels for the flowers, whether you've got setosa or versicolor or virginica. And what this shows is that with the training, the error gets pretty small, almost minuscule. So, unlike the decision tree, which is maybe about 90 lines long, the neural net is quite a long program, about 185 coding lines, plus 30 lines of comments. It's relatively fast, partly because the data set is so small; we're just dealing with 150 flowers and a data set of four different measurements. The program is configured so that it trains every time you run it.
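The whole loop just described can be compressed into a small sketch. This is an illustration, not the workshop's 185-line program: the inputs are random, the labels follow a made-up learnable rule, full-batch (rather than five-batch) updates are used, and the hyperparameters are arbitrary. The point is the rhythm of forward propagation, error, back propagation, and learning-rate-scaled updates, repeated over epochs so the error falls.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
X = rng.random((105, 4))                # stand-in training inputs
Y = np.eye(3)[X[:, :3].argmax(axis=1)]  # made-up learnable labeling
W1, W2 = rng.random((4, 3)), rng.random((3, 3))
lr = 0.5                                # learning rate (eta), illustrative

errors = []
for epoch in range(300):
    a1 = sigmoid(X @ W1)                          # forward, layer one
    a2 = softmax(a1 @ W2)                         # forward, layer two
    errors.append(float(np.abs(a2 - Y).mean()))   # track the training error
    delta2 = a2 - Y                               # back-propagation deltas
    delta1 = (delta2 @ W2.T) * (a1 * (1 - a1))
    W2 -= lr * (a1.T @ delta2) / len(X)           # eta-scaled weight updates
    W1 -= lr * (X.T @ delta1) / len(X)
```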
So that adds a little bit of time, but because it's such a simple program, you'll hardly notice it. So, the algorithm, in all its gory detail, shows you all of the derivatives, the cost function calculations, and the choices we made in using both sigmoidal and softmax functions to make sure the outputs sum to one. Some of these are details that most people would rather not know and typically don't have to know for a neural net. As we'll show you tomorrow, a lot of those gory details are handled through fairly simple function calls. But these are the guts of what a neural net is. It's a fair bit of math, non-trivial I'll say, a mix of derivatives, partial derivatives, and matrix multiplication. And if all of this sounds foreign to you, don't worry; it's foreign to most of us. The intent is just to show you how it works. For the most part, when you're trying to do machine learning, you can call these functions and they'll perform the necessary tricks to do the adjusted weighting and learning and back propagation. So once we've trained on the initial set, we can validate on our test data set. We take our trained parameters, run our trained function, and just test to see how things are doing. This is the forward propagation on the test set and how things are propagated to produce the final output. So here are the results of the training and testing. You'll recall that with the decision tree we had a perfect performance on the training set. Here we don't quite have a perfect performance; we don't have the diagonal of one, one, one. We have a slight error in distinguishing virginica and versicolor. Then we run the testing data set; the training set is 105 flowers, the testing set is 45, and we get a performance of one, one, and then 0.95.
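The per-class performance above comes from a confusion matrix. Here's a minimal version of that calculation; the labels and predictions are made-up stand-ins, with one versicolor flower misclassified as virginica to mirror the kind of error described.

```python
import numpy as np

# Rows are true classes, columns are predicted classes; diagonal entries
# are correct predictions.
def confusion_matrix(true, pred, n_classes=3):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true, pred):
        cm[t, p] += 1
    return cm

true = np.array([0, 0, 1, 1, 2, 2])  # setosa, versicolor, virginica (made up)
pred = np.array([0, 0, 1, 2, 2, 2])  # one versicolor mistaken for virginica
cm = confusion_matrix(true, pred)
```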
Overall, the performance for prediction is roughly 97-98% correct, both in training and testing, and the performance between the training and testing sets is also quite comparable. So that tells us that we've done sufficient training, that it's not overtrained, it's not overfitting, and so we can be comfortable that the model is robust. I compared the DT script, the decision tree, with the neural net, and you can see that they're about the same; arguably the neural net is a little better on the iris data when it comes to seeing new data. And that's maybe not entirely unexpected, given that neural nets are a little more sophisticated than decision trees, though that's probably more just a statistical noise artifact, so I would say both are equivalent. And that just underlines the power of decision trees, or random forests. But neural nets are good for plenty of other things that decision trees don't do well. This is just a comparison between the Python version and the R version, and you can see that the R version is a little longer and a little slower, but that's typical for R. So, what we've shown is sort of the guts, all the guts, of a pure Python program to predict iris classes using an artificial neural network. It is fairly generic code, so we could have actually applied it to other types of problems, just like we did with the decision tree. And essentially, as we go to the lab now, we're just going to go through some examples. I'm going to just dive right in because I know we've got limited time here. I just want to show and make sure that you can call up the iris ANN; it's the same way that you called up the programs for module two. If you weren't able to get module two working, I'd invite you to try module two again; I think we've hopefully fixed those. But it's the same way with iris ANN. You can look at the code again; you can try and see if you can interpret a little bit more of the logic if you want.
You can talk with the teaching assistants about the specifics of the code. Like the iris DT one, you still have to upload the data, the data CSV. You can click on the little folder and click upload, or you can download the data to your computer and choose it. Same running protocol: you go to Run All. This will do the training, run the test, and also produce a confusion matrix, and you can see if it matches what we got in the slides. And once you've got this going, you can start playing around. You can change the number of epochs, you can change the batch size, you can change the learning rate, and you can see how it performs. You can also optimize the epoch numbers and see which give you the best confusion matrix, looking for the minimum number of epochs or the combination of parameters that gives you the best accuracy.