All right, so we're starting into module three, same usual things. What we're now gonna do is switch from decision trees to neural networks. And this is the granddaddy of original machine learning. It's the basis for deep neural nets. It's the basis for convolutional neural nets. But we're just gonna learn the simple artificial neural net or ANN, because again, we only have two days to work on this. So we have about an hour and 20 minutes or an hour and a half to try and cover this one. It also is a lecture and a lab. So we're gonna give you a background of why neural nets emerged, the concept of neural nets. We're gonna talk about some of the concepts that are actually used in neural net algorithms. We're gonna talk about one-hot encoding. We're gonna talk about forward propagation, or feed-forward networks, and hidden layers. We're gonna go through the Python code for an artificial neural net and we're gonna apply it to iris classification. So we used a decision tree to do iris classification. Now we're gonna do an artificial neural net for iris classification. And we're gonna compare the two. And as I've mentioned before, usually what people do in machine learning is they're gonna compare a couple of different models and they're evaluating their performance. And the real point is that building the model shouldn't be the challenge, but as I'm showing you, these models do require a lot of work. And historically in the 80s and 90s when we were doing machine learning, it was a real pain. It would take weeks to code a decision tree, weeks to code a neural net, weeks to code a hidden Markov model. And you would just sort of tire out and you wouldn't try as many models as you wanted. That's changed, as you'll learn tomorrow. And so now it's not unusual for people to try five, six, even 10 different machine learning models to see which one performs the best. Okay, so we're talking about artificial neural nets. So this has been brought up before.
They try and simulate both the function and activity of the brain. Could be the human brain, or just a brain in general, but it's the brain. And you've got inputs, outputs, and what we call neurons. So the connections, you're seeing the circles here, these are nodes, and they basically represent a neuron, and the lines from those essentially are the axons or neurites that are connecting one neuron to another. So you can see it's formed as layers. This one has six layers and it's designed to simulate the neurons in a brain. Now, just like decision trees, you can use neural nets to do classification, biomarker grouping, species classification of the flowers or microbes or viruses, or you can do regression, which is fitting things to numeric data and modeling it. We sort of gave an example of how you could actually use machine learning to figure out that the distance fallen equals one half a t squared, where a is the acceleration of gravity and t is the time. So if you had information on how far something had fallen and what the time was, and you had those points, you could use machine learning to figure out what the gravitational acceleration is, you know, 9.81 meters per second per second. So you can do neural nets for supervised learning. You can also do it for unsupervised classification; as I say, the vast majority is for supervised. Neural nets were first described by Geoffrey Hinton, really, and others in 1986. So he's, as I said, the godfather of machine learning. I said it's also something that mimics the way the brain works. Layers of neurons are connected to each other to perform pattern analysis and logic operations. This is Boolean logic. So things like AND, OR, NOR, and XOR, the exclusive or. What you're doing, at least in the computer world, is you're not actually drawing out neurons and not actually, you know, wiring up connections.
You're using tables, tables that represent the nodes, and you're using tables that represent the connections between the nodes; we call these matrices in math and physics. So you're basically converting the network into arrays or tables or matrices. And that's how you can represent it nicely in a computer and perform the operations. So on the left side, we're showing a brain, and you can see the neurons inside the brain, if you had a super-powerful microscope, and you can see the electrical pulses that are emanating out from the neuron along the neurites and dendrites and axons. And the right side is a digital brain showing all those connections and looking sort of on the inside of a layer or set of artificial neurons. So that's sort of a schematic view of the difference between a real brain and a neural brain. But this is actual data from examples, drawings done I think in the 1800s, where people had stained human brain tissue and were drawing out all the neuronal connections. And if you look particularly at the one on the left, there are basically six major layers in the cerebral cortex. You can see a dense layer, less dense, another dense, less dense, more dense, less dense. Those are the six layers of how a brain is organized. And then on top of those, you can then draw those connections in another layer. But this is how neurons are connected to each other. There's interneurons, pyramidal neurons, there are axons that go from the gray matter to the white matter. But this is really the inspiration for neural nets and actually for deep neural nets, because this layering of six layers is sort of what we thought would be the ideal set for a deep neural net. So biologically, if you look at what a neuron is like, it's a cell, a cell body, but it has these dendrites or neurites that stick out of it and then it also has other longer ones called axons. The dendrites collect the electrical signal.
There's an integration component inside the cell that decides whether it should fire or not. And if it fires, it'll pass the electric signal down the axon to another cell or to an effector cell. So these things are all connected, but it allows you to take information from one thing and to output it to another thing. This is how we think. This is how you're processing the data that I'm giving you. And literally there are billions of these in your brain, all firing, all working in unison. So they accumulate signals from other neurons, they integrate the input signal, and then if a threshold is reached, they fire, which propagates signals to other neurons. People have realized since the 1940s that you can actually schematize this as a mathematical network of nodes and edges, just like the decision trees, or as a matrix with values at certain points, or tables. The interactions between these nodes can perform things like addition and subtraction, they can do differentiation, they can do Boolean logic. This is something our brain does naturally. We are able to distinguish differences, we can make decisions: AND, OR, IF. So all those functions are possible through a collection of neurons, or in some cases just one or two neurons. So with imaging, we have the ability to take partial data, and on the left I'm showing digital numbers, eights where some of the LEDs have been turned off. I think most of you could look at these partially damaged eights, observe them, and you'd be able to figure out that that's actually a digital eight, and you'd sort of fill in things or understand that. Some of them are maybe a little harder than others, but this is what we can do for image recognition and our brains routinely do this.
With the neural net, we can take the same things, but what we do is we have to train it, because obviously we're adults and so we've had years of training and so our brains are able to recognize a lot of things, but a neural net is like an infant brain. It's never been exposed to anything, so it has to be trained and it has to sort of have the complications and the depth of multiple connections, which is what our neural net has. It typically has an input layer, a hidden layer and an output layer, but you can have many other layers. And what we do is we would train it on these examples of digital eights and tell it that this is actually what it's supposed to see, which is the digital eight. After a while, it's trained, it's learned, and so then we could show it completely new things that are also corrupted versions of eight and it should be able to tell us that yeah, that's number eight. So a conventional neural net, we can see, has an input layer, a hidden layer, which is marked in pink-orange, and then an output, which is sort of the gray layer. So it really just has typically one hidden layer. The deep neural net has, in this case, four or five hidden layers, or six or seven, and that mimics what we see in the brain, but it also seems to allow more complex rules and patterns to be learned. And historically, the reason why we didn't go into deep learning was the computers just weren't there. We didn't have the computer power. To teach or train a deep neural net, you have to use a very powerful computer for days or weeks sometimes to get it to train, versus a simple conventional neural net might only take you a few seconds to a few minutes to train. So there's a complexity, and with the complexity of training, the time to train, there's this benefit that it seems to be able to do more complex assessments, classifications, regressions. So neural networks really didn't begin in 1986.
The concept started showing up in 1943 with McCulloch and Pitts, and then other models emerged in '49, '58, '69, '75. So I'll go through that a little bit because I think it helps for people to understand. So the McCulloch-Pitts one was really the, I'd say the main concept. It was both important and was 90% of the way there towards what a neural net or an artificial neural net is. They called their units neurons rather than nodes, but each neuron is represented as a node. And they imagined you could have multiple inputs, like a neuron, and you'd have a single output. And they wanted to show how this could potentially work. You'd have different weights, some strong inputs, some weak inputs, that's the weight w1, and then the output is y. Input is x, and you can do AND. So zero AND zero is zero. Zero AND one is also zero. One AND zero is also zero. One AND one is an output of one. You can do OR. Zero OR zero is zero. Zero OR one is one. One OR zero is one. One OR one is one. That's Boolean logic and this is what they wanted it to be able to do. And it was sort of shown with that McCulloch-Pitts model. Hebb, who was a Canadian researcher at McGill, did a lot more biological work, and he noticed that there are these things called thresholds. You can send a signal in from another axon to a neuron, and if that signal isn't strong enough, that neuron won't trigger an action potential. That's shown at the top. But if you keep on hitting it and say, do this, do this, do this, just like you repetitively train or practice, then eventually it starts realizing, oh, I should maybe produce a response. It primes that neuron. And so the next time you give it a stimulus, down in C, it actually responds. So what it's really showing is that in B, this is the learning process. C is once you've learned: now it can take an input and produce an output.
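As a concrete sketch of the McCulloch-Pitts idea, here's a tiny threshold unit in Python. The function names and threshold values are my own illustrative choices, not from the original paper:

```python
# A minimal McCulloch-Pitts-style threshold unit: binary inputs,
# a single output, fire only if the input sum reaches the threshold.
def mp_neuron(inputs, threshold):
    return 1 if sum(inputs) >= threshold else 0

# AND fires only when both inputs are 1, so it needs a threshold of 2.
def AND(x1, x2):
    return mp_neuron([x1, x2], threshold=2)

# OR fires when at least one input is 1, so a threshold of 1 is enough.
def OR(x1, x2):
    return mp_neuron([x1, x2], threshold=1)
```

The whole Boolean truth table for AND and OR falls out of just picking the right threshold, which is the point McCulloch and Pitts were making.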
So this is critical, which sort of highlighted part of the concept of learning, part of the concept of activation, the concept of threshold connectivity, and how and why people have been copying and trying to copy more biology as these neural nets have gotten more sophisticated. So in the 50s, Rosenblatt developed the perceptron, which went a little further. It could have more than one input or two inputs. It could have n inputs. He talked about summing, a threshold, and then he used Hebb's observation about having this threshold, which uses a sigmoidal function to say, did you reach it? And if you did, okay, you've learned it, send an output. So this is mathematically another important development. And the perceptron technically was the precursor to the neural net. So it's a mathematical model, and it was designed to model neurons. It takes input values, weighs and adds them, and then passes them through an activation function to determine whether it's fire or don't fire. So here's your inputs from four different neurites or other axons that are coming into this neuron. They sum together, sum with different weights, weight zero, weight one, weight two; they're not all even. You sum them up, and then if you're above a threshold in this activation function, then you fire. If you're not above the threshold, then you don't fire. And so there's this threshold, and this is a step function, but the step function could be a sigmoidal function, could be any number of mathematical functions. So in the perceptron model, they said it used zeros and ones. So if the sum of those weighted inputs was more than zero, you would fire. If it was zero or below zero, it wouldn't fire. And so this is the logic. And again, this is just simply saying what we've observed about how neurons work. We're using a step function. So just if or else, that's a step function. So how do you make a perceptron learn?
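The summing-and-threshold step just described can be sketched in a few lines of Python; the particular weights and bias below are illustrative, not from the slides:

```python
# Perceptron forward pass: weighted sum of inputs plus a bias, then a
# step activation; fire (1) if the sum is above zero, else don't (0).
def perceptron(x, w, b):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else 0

# With weights (1, 1) and bias -1.5, only the input (1, 1) pushes the
# sum above zero, so this particular perceptron computes AND.
```

Same threshold idea as before, but now the inputs are weighted, which is the piece the learning rule will adjust.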
So what you can do is give the perceptron a training set, you know, a bunch of inputs with weights, and you produce a function and you get an output. And then you say, is the output right? And if it's right, you don't make a change. But if it's wrong, then you modify things. And this is where you take the output error and take the input and correct them. So the error between the actual output and the predicted output is used to change the weights of the connections. In other words, it's like changing the signal intensity of these neurons as they feed into the other layer of neurons. And what you do is you just repeat it over and over and over again. Computers are good at repeating it, but it modifies the weights until you finally get the output you want. And then you could take another input and see if that works. And if it doesn't, then you tweak things again and you take another input and see if it works, and maybe it does, and then you don't have to train anymore, and you do this over and over again. And eventually you take your test data and see, does it also work? And eventually, if it's doing the job, then you've got a valid model. So the adjustment that's applied to the weights in the perceptron model was something called gradient descent, which is an old optimization method going back to Cauchy and others. So weights change according to the slope of the error function. Now the error function might be linear, it could be very complex. So this is where derivatives come in. You multiply the errors of the predicted minus expected by the inputs, which are given as x in this input vector, and then there's a learning rate. And so the weight is adjusted by the learning rate times the error times those inputs, or the collection of those inputs. So it's a simple function, but it's just trying to differentiate output versus input. How much do we change? How rapidly?
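That repeat-and-correct loop, the perceptron learning (delta) rule, can be sketched like this. The learning rate, epoch count, and random starting range are all illustrative choices:

```python
import random

# Train a single perceptron: start with random weights, and on every
# wrong prediction nudge each weight by learning_rate * error * input.
def train_perceptron(data, lr=0.1, epochs=50):
    n = len(data[0][0])
    w = [random.uniform(-1, 1) for _ in range(n)]
    b = random.uniform(-1, 1)
    for _ in range(epochs):
        for x, target in data:
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if s > 0 else 0
            err = target - pred            # 0 if right, +1 or -1 if wrong
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b
```

For a linearly separable target like AND, this loop is guaranteed to converge; that guarantee is exactly what breaks down later for XOR.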
This evolved a little further into step functions, sigmoidal functions, sort of gradient semi-step functions. So there's a constant, that's the learning rate, that's alpha. The neuron activation function, you can see there's a derivative, g prime. The input to the activation is the weighted sum of the inputs. So it could be whatever that is. And again, that function could be sigmoidal, step, or whatever. The output is y and x_i is the input from that adjacent neuron. So this is a more sophisticated version of how you would modify it, where we're using maybe a non-step, sigmoidal or linear function in terms of how things are being activated. So this is where the math gets a little more complicated, because we're trying to mimic what biology is like, and sigmoidal functions seem to be pretty common in biological neurons. But you can see that it's comparing predicted output against expected output, and adjusting weights based on that threshold value. I have a question. Yeah. The other activation functions you were showing, if you would just go back briefly, the top two, am I correct in understanding this as, on the right side of the vertical axis, you can have a partial output rather than either zero or one? Or could you help explain what these are showing? Yeah, so if you're on the Y axis, the point, if you're above the point on that Y axis, that Y intercept, if your sum is greater than that Y intercept value, then you get technically the full value. And if you're below the Y intercept, then you don't get the full value or no firing happens. Okay, so what is the difference between these three functions? One difference is the two above are differentiable. So you can calculate a gradient more precisely. The step function is not differentiable. And the step function has to be done as a logic calculation rather than a derivative one. And when you wanna do things like gradient descent or any other kind of optimization function, you want to have a differentiable function. Right. Okay.
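To make that difference concrete, here's the step function next to the sigmoid and its derivative, a sketch of why differentiability matters for gradient methods:

```python
import math

def step(s):
    # Not differentiable at 0; handled as an if/else logic rule.
    return 1.0 if s > 0 else 0.0

def sigmoid(s):
    # Smooth and differentiable everywhere, so gradients are well-defined.
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_deriv(s):
    # The sigmoid's derivative has a convenient closed form: y * (1 - y).
    y = sigmoid(s)
    return y * (1.0 - y)
```

The sigmoid behaves like a softened step: near 0.5 in the middle, close to 0 or 1 at the extremes, but with a usable slope everywhere in between.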
Okay, so with those types of things, the delta rule and the perceptron architecture, they showed in the 50s you could classify linearly separable problems. So things that could be classified through a straight line into two separate categories. And that's because it decides on a linear combination of inputs. You know, the sum w0 plus w1 x1 plus w2 x2 and so on; it's linear. So what's shown here is sort of this Boolean graph. You know, (0,0), (0,1), (1,0), (1,1), these are the four points sort of identified on a graph, and you can, you know, plot out an AND function, a line or hyperplane separating the AND result from all the zeros. And then you can also draw another plane for OR and separate those. So, you know, that's good. It can do really simple Boolean logic. So you get AND and OR. And this is just showing that the net result is, if the sum is greater than the threshold, you get one; if it's below or negative, you get a zero output. And again, that's just this thresholding concept, whether it's the sigmoidal function or, in this case, essentially a step function. But what people noticed is that you couldn't get the perceptron to do things that were non-linearly separable, where in fact you have to draw two lines or a circle around data. And this one is called the exclusive OR function. And this has, you know, been a Boolean logic function for a long time. And the exclusive OR is the logic, or it's essentially an inequality measure: zero XOR zero equals zero, zero XOR one equals one, one XOR zero equals one. And then one XOR one, where the inputs are the same, gives you, you know, zero. So it measures inequality, and being able to measure inequality is an important thing when you're trying to classify things, and the perceptron, they realized in the 50s, couldn't do it. So that was kind of where things fell apart.
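One way to see the XOR problem directly is to brute-force search for a single linear threshold unit that reproduces each truth table. A coarse grid of weights and biases (an illustrative choice of range and spacing) finds one for AND but never for XOR:

```python
def step_unit(x, w, b):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Search a grid of (w1, w2, b) values for a single linear unit that
# reproduces a two-input truth table exactly.
def any_line_separates(table, grid):
    for w1 in grid:
        for w2 in grid:
            for b in grid:
                if all(step_unit(x, (w1, w2), b) == y
                       for x, y in table.items()):
                    return True
    return False

grid = [i / 4 for i in range(-8, 9)]   # -2.0 to 2.0 in steps of 0.25
AND_TABLE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

The failure for XOR isn't an artifact of the grid: no real-valued line can put (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.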
And I think people had some idea that if you could make the perceptron a little more complicated, so instead of having just the one layer, it had two layers, what we call the hidden layer, then you could actually solve the exclusive OR problem. So '69 was when the exclusive OR problem was identified by Marvin Minsky and Seymour Papert. Then from '75 through '86, people were wandering in the forest lost, trying to figure out how to solve this. But the concept of back propagation started coming out, and the idea of a multilayer perceptron. And so the first term for neural nets was multilayer perceptron. And so instead of having the two layers, input and output, they have three layers: an input layer, what they call the hidden layer, and then the output layer. And so with that, you could get exclusive ORs, you could get ANDs, you could get ORs, you could do sort of addition. And the way to do that, at least to modify things just like the way that we did with the perceptron delta rule, they had to come up with a new algorithm called back propagation. And so the discovery of back propagation and the ability to use multiple layers basically revived the long dead perceptron. And that led to the development of neural nets in the 1980s. So there's a video that is really nice, I'll play it, I think it's about three minutes, explaining how neural nets work. What's a neural network? A neural network is a type of program that learns how to do things instead of being hand programmed to do things. It's inspired by the neural networks in your brain. Your brain contains around 100 billion cells called neurons. These neurons have dendrites, which they use to receive signals from other neurons, and axons, which extend outward to send their own signals to other neurons. When a neuron receives the right mix of signals on its dendrites, it then sends out its own signal on its axon. The neural network in a computer is similar. Here's a very simple one, one that I've written actual code for.
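To see why the extra layer fixes it, here's a hand-wired two-layer network that computes XOR. The weights are hand-chosen for illustration, not learned:

```python
def step(s):
    return 1 if s > 0 else 0

# One hidden layer is enough: h1 computes OR, h2 computes AND, and the
# output fires when OR is true but AND is not, which is exactly XOR.
def xor_net(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)    # hidden unit: OR
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)    # hidden unit: AND
    return step(1.0 * h1 - 2.0 * h2 - 0.5)  # output: OR and not AND
```

Each unit on its own is still just a linear threshold; it's the composition of two layers that produces the non-linearly-separable function.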
It has an input layer and an output layer. There's also a layer in between called the hidden layer. These circles represent the neurons found in your brain. Sometimes they're called neurons here too, but most often they're just called units. To keep the number of units small to fit in this video, I've trained it for something that doesn't need a lot of inputs or outputs. I've trained this one to count. It can count from zero to seven in binary. I could have made it count in decimal, but that would have been boring. If I set the input units to all zeros, then I've taught it to set the outputs like this, 001. That's the number one in binary. If I give it 001, it outputs 010. That's the number two in binary. It does that all the way up until you give it a seven in binary, which is 111. If you give it seven, then it outputs zero. But the fun thing to do is to give it zero to start with. And from then on, you simply take whatever it outputs and feed it back into its inputs. It just keeps on spitting out zero to seven. And then keeps repeating. It counts. I've actually embedded the trained neural network into a webpage, which I'll talk more about later. Here I give it a two and it spits out a three. And here I tell it to start counting from 000 on its own. How do we teach the neural network? When we give it some inputs, some magic has to happen with all the stuff in here that makes it spit out what we want at the outputs. Each connection here has a number associated with it called a weight. There's also something else called a bias unit that has some value. And that's connected to the hidden units and output units. And its connections also have weights. So to get the neural network to produce outputs, we'll take the input values, do math with all these numbers. And if all those numbers are just right, it'll spit out the outputs we want. And all those numbers will at the same time be such that it will spit out the correct outputs for all possible inputs.
So training a neural network involves adjusting all those weights, such that, given an input, after doing all the math, it spits out the correct output. To train it to do that, we use something called the back propagation algorithm. It's called that because first we go one way through the network from input to output and then propagate back from output to input, adjusting things. We start by making up what's called a training set. The training set consists of the inputs and what we expect as output for each input. So if we give it input 000, we want something close to 001 at the output. For 001, we want something close to 010, and so on. We'd normally give it the first set of inputs from the training set, 000. But that will give us very boring numbers to look at, mostly zeros. So let's assume we've already done some and are on the seventh input, 110. At the start of training, all the weights are just random numbers. I won't go through all the math in detail, but we take the value from the first input unit, a one in this case, multiply it by the first weight going to this hidden unit, and store that in the hidden unit. We do the same for all the other input units that are connected to that hidden unit and add their results in too. We also multiply the value of the bias unit by its weight and add that to the hidden unit's value too. We then adjust that using an activation function that, among other things, adjusts it to be between 0 and 1. We go across and do the same for all the hidden units. Those values for the hidden units are now the inputs for doing the same with all these weights and this bias unit and its weights. And once we've adjusted the results using the activation function, we finally have the outputs. And the first time we do all this, those outputs are likely nonsense. Now we need to go back through the network and adjust all those weights such that we'll get a better result the next time.
But we have to do it gently, because we'll also be adjusting those same weights to work with other inputs too. Remember, our training set included both the possible inputs and the expected outputs. Using the expected output for the input we just used, we can calculate the errors for the outputs. In other words, just how far off were those outputs? We then use that error to go back through the network, adjusting the weights by very small amounts so that we'll get a smaller error the next time we give it that input. And that's why we call it the back propagation algorithm. We propagate back through the network, adjusting weights. We repeat that whole process for all the inputs and outputs in the training set. And once we've done that for the whole training set, we then repeat it a few thousand times until the error we start getting is very small, small enough that we consider it trained. For this neural network, I had to go through the training set around 680,000 times to get the RMS error to under 0.0005. Okay, I think it's actually a really good explanation. I mean, they avoided some of the math, and the math can get really complicated. And we are gonna show some of this math, but I don't want people to get too consumed with it. So artificial neural nets are also called feed-forward neural nets. They're multi-layer perceptrons, and it's been determined that they can basically approximate almost any function; they're universal approximators. So that's pretty powerful. And they're able to distinguish and classify data that's not linearly separable. So drawing two lines or drawing circles around things or drawing hyperplanes around things. That's what they can do. That's really useful for classification. So just like the decision trees, there's some terminology. So within neural nets, whether it's deep neural nets or regular simple neural nets, there's always a hidden layer. And there can be multiple hidden layers.
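The forward-then-backward loop the video walked through can be sketched end to end on a tiny network. Everything here (layer sizes, learning rate, epoch count, seed) is an illustrative choice, not the video's actual code:

```python
import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# A tiny 2-3-1 network trained by backpropagation on the XOR table:
# forward pass, output error, deltas propagated back, small weight nudges.
def train_xor(epochs=5000, lr=1.0, seed=1):
    random.seed(seed)
    H = 3  # hidden units
    W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
    b1 = [random.uniform(-1, 1) for _ in range(H)]
    W2 = [random.uniform(-1, 1) for _ in range(H)]
    b2 = random.uniform(-1, 1)
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

    def forward(x):
        h = [sigmoid(W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]) for i in range(H)]
        y = sigmoid(sum(W2[i] * h[i] for i in range(H)) + b2)
        return h, y

    for _ in range(epochs):
        for x, t in data:
            h, y = forward(x)
            dy = (y - t) * y * (1 - y)                      # output delta
            dh = [dy * W2[i] * h[i] * (1 - h[i]) for i in range(H)]
            for i in range(H):                              # gentle updates
                W2[i] -= lr * dy * h[i]
                for j in range(2):
                    W1[i][j] -= lr * dh[i] * x[j]
                b1[i] -= lr * dh[i]
            b2 -= lr * dy

    return lambda x: forward(x)[1]
```

Note the "gently" part from the video: each pass moves every weight only a little, because the same weights have to serve all four training inputs at once.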
So any layer between the input and the output is called a hidden layer. And they do a lot of transformations. They allow a lot of the function transformations, function approximations. They make associations between the inputs as well. Forward propagation is the first step, where you take input and pass it through the network. And that's where you multiply the weights and evaluate by the activation function to get your output. Back propagation, as you were seeing in the video, that's using that error, just like in the perceptron, between the predicted output and the true output to change the weights. This is why you train on data. And you can see the numbers he was talking about. He was doing 680,000 training iterations just to get it to be able to count from zero to seven in binary. Now, you can do this in just one round of training, but a lot of people will break it into what we call epochs, just like geological epochs, the Pleistocene and the Permian. But this is where you sort of do some training for a while and then start all over again, where maybe you've done an adjustment to the bias or you've introduced another type of training dataset into it. So that's an iteration of training. So I think he just did one iteration. So his epoch was 680,000. He could have done another epoch and another 680,000, and maybe his error might have fallen by another 50%. There's also something called batch learning, where the training set is broken up into batches. And so within a given epoch, if you've only chosen to do one epoch, you can still break up your training set. And if it was, say, 1,000 training examples, you might break it up into three. So one set of 333, another set of 333 and another set. But that one round of training is still called one epoch, but you've done batches of three groups within that epoch. So that's just some terminology. People can choose to do batch learning or not.
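The epoch and batch bookkeeping just described amounts to a couple of loops; the names here are illustrative:

```python
# Split a training set into consecutive batches; 1,000 examples with a
# batch size of 333 gives three batches of 333 plus one leftover of 1.
def make_batches(training_set, batch_size):
    return [training_set[i:i + batch_size]
            for i in range(0, len(training_set), batch_size)]

# One epoch is one full pass over all the batches; multiple epochs just
# repeat that pass, each time calling update_fn to adjust the weights.
def train(update_fn, training_set, n_epochs, batch_size):
    for _ in range(n_epochs):
        for batch in make_batches(training_set, batch_size):
            update_fn(batch)
```

So "epoch" counts full passes over the data, and "batch" is how much of the data each individual weight update sees.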
People can run multiple epochs or not. The video just had no batch learning and had only one epoch. So this is an animated gif here, so you can watch it, but I'm gonna speak along with it. So a neural net is a multi-layer perceptron. So it's not a single layer perceptron, but it propagates the training set forward and it then calculates the error between the predicted values and the actual ones and uses that error to change the weights and the biases (those were the B and W values) in a way that reduces that error. So the forward propagation step is just taking the weighted sum of the inputs and applying the activation function. The output of the activation function is propagated to the next layer. So we've got two inputs, x1 and x2. Their function is calculated, whether it's linear or whatever. The arrows are all matrix elements or table elements or array elements. y is an output. Now y could be a vector, x1 and x2 could be a vector. The intermediate value is this combination of weights and inputs. In this case, it's a linear combination, and so you can see it's a function of w1 x1 plus w2 x2. And we can see in this diagram that things are being moved along, starting with x1, then with x2. Then the weights are propagated from the first layer to this hidden layer and finally to the final output, and an output is generated as the y. So there's two hidden layers in this one, one input layer, one output layer. So it's a four-layer neural net. So you've done your first calculation and usually it generates garbage. So then you have to fix it. So that's the back propagation. So forward propagation, then fixed with back propagation. So the output error is found by subtracting the actual value from the predicted value. And you wanna look not just at the original output, but you also wanna look at what was the error affecting the previous layers. And so you go back one layer, that's to hidden layer two, then back to hidden layer one, and then back to the input layer.
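The forward-propagation step in the animation amounts to a loop like this sketch, where each layer is given as a weight matrix plus a bias vector (the representation is my choice for illustration):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Forward propagation: for each layer, take the weighted sum of the
# previous layer's activations plus the bias, apply the activation
# function, and pass the result on as input to the next layer.
def forward(x, layers):
    a = x
    for W, b in layers:
        a = [sigmoid(sum(w * ai for w, ai in zip(row, a)) + bi)
             for row, bi in zip(W, b)]
    return a
```

One easy sanity check: with all-zero weights and biases, every unit outputs sigmoid(0) = 0.5 regardless of the input.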
The deltas are derivatives, and the cost function could be, it could be a Gini index, it could be information gain. This one's using cross entropy. So in the first slide I was showing you, you saw blue arrows going forward. In this one, we're seeing red arrows going backwards. And we're still getting sums, but here's where we're calculating the difference between the gold standard z and the output y, and then we modify the output and put in these derivatives and change essentially the values in these nodes. So the cost function is the thing that helps us to determine the error. We're trying to minimize the cost function. So this is where derivatives come in. So this is sort of a pseudo gradient descent method. We're looking at the slope of the cost function. And if we're trying to find a local minimum, then you can see this thing rolling around these different minima, where it should settle and where it can find the absolute bottom, which is right there. And, I guess Anazio has a question. Sure. I was just wondering, where does it get the first model from? The first model with the weights, you actually start with just random weights and you fill your table with just random weights. Is it the same regardless of what your input or output is, or? Yeah. Okay. So you just have a random number generator to fill your tables of weights and node values. So the first time you try something, it's gonna generate totally garbage results. So it's, you know, you're starting blindly. You know, you could be smarter about it. You know, maybe people have a good set of weights that sort of works for something, and if you can start in a better place, then the number of iterations, the number of epochs, is greatly reduced. But mostly, because it's very much a black box, no one really has a good idea what the best starting weights are. So that's why you just start with random numbers.
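The rolling-downhill picture is literally what gradient descent does. Here's a minimal sketch on a one-dimensional cost; the cost function and settings are illustrative:

```python
# Gradient descent: repeatedly step opposite the slope of the cost.
def gradient_descent(cost_grad, w0, lr=0.1, steps=200):
    w = w0
    for _ in range(steps):
        w -= lr * cost_grad(w)
    return w

# Example: cost C(w) = (w - 3)^2 has gradient 2 * (w - 3), so descent
# from any starting point settles near the minimum at w = 3.
```

The starting point w0 is the random initialization being discussed: a convex bowl like this one always rolls to the same bottom, but the bumpy multi-minima surfaces of real networks can trap the ball in a local dip, which is why the start matters.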
And then for the subsequent ones, is it trying to find a model based on some mathematical equations, or is it all trial and error? Well, this is where it's using a cost function and a learning rate. Those are things that get fine-tuned, but you choose your cost function. It could be just a difference function, it could be mean absolute error, it could be an L2 cost function — there are different ones that people will use to try and minimize the error in the output from the input. So when you saw that video, he was trying to count from 0, 0, 0 up to 1, 1, 1. But the actual numbers that he would get would not be exactly 0 or 1, 1, 1 — they'd be 0.992, 0.994 and 0.967. Those were the actual numbers. He might have a threshold that says round it up to one. He's never getting exactly zero and never getting exactly one, but they're so close that with a threshold or a rounding function, you could say it's 1, 1, 1 or 0, 1, 0 or 0, 0, 0. Thank you. I guess Janice has a question and then there's one question from Chad, so I'll let Janice go first. Yeah, Janice. Hi — I guess in the video there were three inputs, three hidden nodes and then three outputs, and in some of the other pictures you were showing there's like two inputs, three hidden nodes and one output. I'm wondering about the hidden nodes. Do we determine the number of hidden nodes, or is that generated by the model? Yeah — sometimes the size of the input structure decides how many nodes, how many inputs and how many hidden layers you're gonna need. As I say, usually it's one or two for simple neural nets. Getting more than that, it gets really time-consuming and complicated. The number of nodes in those hidden layers is again partly the data size and the way the data is structured.
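The rounding idea just mentioned — 0.992 is close enough to call a 1 — can be sketched like this (the `to_binary` helper is my own name, not from the lecture):

```python
import numpy as np

def to_binary(outputs, threshold=0.5):
    # The net never returns exactly 0 or 1, so compare each value
    # against a threshold to recover the intended bit pattern.
    return (np.asarray(outputs) >= threshold).astype(int)

raw = [0.992, 0.994, 0.067]
print(to_binary(raw))  # [1 1 0]
```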
So, for these iris flowers, we're only talking about four measurements and three species, so the size of the neural net is pretty small. But if you're looking at — we'll do one later today on amino acid sequences — we have a larger number of input nodes and a much more complicated hidden layer structure, because there's more data and the ways of encoding the data are more complicated. Okay. So the data decides how many nodes get used — it's not like, hmm, I'll just try a model of three or four? Yeah, it's mostly that way. But in the scikit-learn approach, you can kind of just arbitrarily decide that, and there'll be some accommodations that the functions will do for you. But in the way that we're doing it — which is, as I say, doing it the hard way — yeah, we have to think about the data format. Sounds good. Thank you. I believe Movin also had a question. How to know which activation function should be used in a given function? Say that again? How to know which activation function should be used in a given situation? Yeah, it depends. And you'll see some examples where we use two different types of functions in the hidden layers. Some of it's just by experience — people have found things — and some of it's what's differentiable given the complexity of the function. There's a softmax function, there's a sigmoidal function. We actually use both sigmoidal and softmax in the examples here, one for one layer and another for the other layer. And that was partly done through observation and trial and error. I don't know if anyone really knows why; that's just the way you do it. And this is where neural nets are a bit of a black box. Okay, I think we'll carry on because we have a fair bit to cover. So again, this is where there's a lot of math, and we're talking about how we propagate backwards.
These deltas, which we were talking about in the perceptron, are now also in the neural nets. So we're taking the derivative of the cost function — whether it's the L2 norm or mean absolute error or whatever you want, just the difference. So we take the derivative, the delta, and compare that to the input layer. We also multiply it by the corresponding input, and then we have this learning rate. So you're seeing a weight, w14, being modified: the updated weight is the old weight minus the learning rate multiplied by this derivative, multiplied by the activation function and the output from the neuron. So it's technically a partial derivative, and this is where choosing your activation function can be challenging. Again, we're seeing how these propagate. We went from that layer on the far right, and now I'm looking at layers on the far left. It's the same set of functions, the same deltas, but eventually, through the complete back propagation, every one of these things has had these formulas applied to it, so the weights are now adjusted. So at some level, when we're changing these weights, we're effectively changing the steepness of the activation function. So a sigmoidal function, which is shown there, is one over one plus e to the minus z. That's our y function if we're using a sigmoidal one, and as you change the w, you can make it steeper if the weight is high, and if you lower the weight — which is the red curve — it's a shallower function. There's a thing called bias, and there was a bias node in that video. This is sort of shifting the whole curve right or left, so we can apply a bias — in this case, moving it left or right by five units or 2.5 units — to get what we want.
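The update rule described here — new weight equals old weight minus the learning rate times the backpropagated delta times the neuron's output — can be written out directly. The numbers below are hypothetical, just to show the arithmetic:

```python
def update_weight(w_old, learning_rate, delta, neuron_output):
    # Gradient-descent step: the gradient of the cost with respect
    # to this weight is the backpropagated delta multiplied by the
    # output of the neuron feeding into it.
    gradient = delta * neuron_output
    return w_old - learning_rate * gradient

# Hypothetical values for a single weight, e.g. w14 in the diagram.
w14 = update_weight(w_old=0.8, learning_rate=0.1, delta=0.25, neuron_output=0.6)
print(round(w14, 3))  # 0.785
```

A bigger learning rate means a bigger step per update; the bias terms are updated the same way, minus the neuron-output factor.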
So it's again sort of a kludge, but it allows you to get the results, partly because sometimes there's just a drift in the way that the data is coming out of the activation function. The learning rate sort of determines how big a step you take down the gradient and how many steps you wanna take in terms of learning. So you can take baby steps, which is the one on the left; you could take giant leaps, which is the one on the right; or you can take big steps first and then smaller steps as you get closer and closer, which is the one in the middle. So often you will have to adjust the learning rate over time, and this is part of what's called gradient descent, or Newton-Raphson methods, to make sure that things converge and you can find your minimum. So the learning rate is another knob, just like bias, that allows you to get to your sweet spot in the learning process. So mathematically, we might have an input — in this case, just two values. So this is called a vector, or a one-by-two array. Then we have a two-by-three array. You can multiply a one-by-two by a two-by-three; that'll give you a one-by-three array. And then you can multiply a one-by-three array by a three-by-one array to get a single output. So this is where the geometry of your network is partly defined by what your input is, if it's two values, and what your output is, if it's supposed to be a single value. And then what can you have for your weight matrices? In this case, it's a two-by-three and a three-by-one — these connect you through the hidden layers. There's an activation function. So when things come out with a value, we compare the desired output — in this case zero — with the actual one that came out of the neural net, in this case 0.13. So we do a comparison and say, well, it's not exactly zero. So then we go and back propagate using those deltas and differentials and cost functions and make adjustments to those weights.
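The shape arithmetic above — one-by-two times two-by-three gives one-by-three, then times three-by-one gives a single output — is easy to check in NumPy. A sketch; the random matrices are just placeholders:

```python
import numpy as np

x  = np.random.rand(1, 2)   # input: a 1x2 vector (two values)
W0 = np.random.rand(2, 3)   # weight matrix into the hidden layer
W1 = np.random.rand(3, 1)   # weight matrix into the output layer

h = x @ W0                  # (1x2)(2x3) -> 1x3 hidden activations
y = h @ W1                  # (1x3)(3x1) -> 1x1, a single output
print(h.shape, y.shape)     # (1, 3) (1, 1)
```

If the inner dimensions didn't line up, NumPy would raise an error — which is exactly the sense in which the input and output sizes dictate the geometry of the weight matrices.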
So 0.1 now becomes 0.02, 0.3 becomes 0.2, and all the other ones are changed. We cross them out, we now have a new matrix, and then we apply our input again. We go through the calculations, we get an output, and we find out that it's a little lower, closer to the zero output — closer to the desired output. We might have a new input then that we train on. So the input (1, 1) is supposed to give us an output of one. We run it through this matrix and it doesn't give us anywhere close to one. So we have to train again: back propagate, modify, then do the forward propagation, and we keep on doing this tens, hundreds of thousands of times, and eventually the two matrices emerge, and these are our weight matrices. So this is again similar to the video, but showing, if you want, real numbers. This one also has real numbers: for the input (1, 1), this is the output. The desired output, I think, was supposed to be a one, but these are the actual weights. And so here's the hidden layer. These hidden layers are then being used to calculate what the output layer should be. So it starts with two inputs, three hidden nodes, one output, but then we have to have a weight table. So the weight table is 0.18, 1.11, minus 0.26. And again, the geometry of the weights is dictated by the geometry of the input and output and hidden layers. So this is what we're getting on the first pass, and we iterate and we change again. So it was originally 0.73; the output is now 0.5. You can see the weights are changing — the output has dropped to 0.49 by our 30th iteration, down to 0.24 by our 40th iteration, down to 0.05 by our 50th iteration. And at some point we just stop, because it's just not moving — it's gone almost 60 iterations and it's down to almost zero. And you can see that the weights have changed quite a bit.
If we go back, they were all close to 0.1 and 1, but by the end of this, they're minus 20, plus 40 — the weights have changed quite a bit to be able to get this result. So I think there's a question. Oh, is it for me? Okay — the hidden layer, is that another feature that you're finding correlation with, or is it just another random artifact? Well, it's part of — we have these weight matrices or weight collections, but there is also this function, and those are illustrated — which is part of the cost function, and we go back. Here is the function: F1 of E, F2 of E, F3 of E. It's performing — I think it's a mathematical function? That's right. And it could be a simple sum, it could be addition and subtraction, it could be a multiplication, it could be something more complicated, but in this case we can just think of it as a sum. So if on one side you have the X and at the end you have the Y — if we imagine applying this to genetics data, where you have a whole lot of genes as predictors and on the other side you have a disease outcome — these hidden layers are just functions, but can you also have something else in between that does these correlations? Looking for correlations within the neural net? Yeah. Not really. I mean, correlation analysis is the way we do partial least squares or principal component analysis, and that's another way of doing categorization and taking in piles of data. So correlation analysis is part of what's done with, as I say, principal components. Neural nets probably do some kind of correlation, but again, it's a black box — what is it doing? No one knows. But if you're wanting to do separations or categorization where you're looking at correlations, you should use partial least squares discriminant analysis or principal component analysis. Yeah — perhaps some example of where the neural network can be applied would help, especially in genetic studies.
Well, they can be used in many, many different areas. I mean, if you wanted to have a bunch of genes — and we saw this example with the decision tree — we could use the same neural net. So there's a bunch of genes that could be used to predict cancer survival rates, with gene expression or gene levels that were detected in different patients. On the input side, you'd have these gene expressions, and on the output side, you'd have some sort of marker for cancer or other disease — did they survive or not? So the input was the set of genes, and the output was: did they survive five years or not? Again, you have to frame your problem. That's what I'm trying to understand — what kind of problems can use neural network simulation? So: any problem where you're trying to classify things. That might be a biomarker — I wanna identify people with a disease versus people without the disease. Neural nets can be used for that. You can also use neural nets to do general pattern recognition, so they're used for looking at text patterns or sequence patterns that might be promoters or terminators or secondary structure, or for identifying words. Is it only one biomarker? Because with a neural network, I imagine you need a very high-dimensional data set. If you have just one biomarker, then a simple regression would be sufficient. But to use a neural network simulation, what type of data would be ideal — a lot of biomarkers that cause one disease, or a combination? Yeah — so, again, you'd have to have a lot of training data. So if you have, say, 500 patients, and you have gene expression data or SNP data for those 500 patients, and let's just say 250 had good outcomes and 250 didn't — or maybe there are 500 patients and 500 controls, so 500 people are sick and 500 are healthy.
But you could have literally hundreds of SNPs or hundreds of genes or hundreds of proteins in each of those, and you might not know — it might not be obvious to you — which ones are causing the disease and which ones don't seem to. So is it suitable for a data set that's extremely large — like 200,000 or 500,000 participants with 50 million SNPs and 50 or 1,000 biomarkers? Yeah, to some extent you can do that. I mean, it takes a lot of computers to do that. That's what I mean — it's going to be very computationally intensive. Yeah, and this is where it would be smart to do some data reduction or feature selection. It's easy to generate lots of data, and 99% of it's garbage, and if people learn how to filter their data, they can reduce it by huge amounts, and then they don't have to wait for the computer to crunch through and toss out 99% of their data all over again. I think there's a tendency for people to think that machine learning is: let's just dump in all my data and the magic answers will come out. If you're feeding the computer garbage, it will learn garbage. And so this is why feature selection, data reduction, thinking about your data, learning about your data is really, really important. And when people say, I've got 100,000 patients and 50 million SNPs, tell me the answer — I just tell them, go back and start thinking about your data again. You should be able to, through very standard statistical methods, reduce the number of features quite dramatically. So it would be more beneficial to first get summary-level statistics — if you start with that, you greatly reduce the amount of data down to the likely positive results, and then you can work with that.
Yeah — from 50,000 you can probably end up with a few hundred SNPs. Yeah, exactly. You want to just get rid of the garbage data and then focus on the important ones. That's what we call feature selection. But yeah, there's a tendency, because it's so cheap to collect so much data, for people not to think about trying to clean up the data or rationalize it in some way. Yeah, that speaks to the question of whether it's feasible with any neural network simulation. Well, there are examples with large language models where they had, you know, 500 million words and 20 billion parameters. But you had to have a computer, or collections of computers, worth about a hundred million dollars to process that over one solid year. So you can do a lot with deep neural nets, but you also have to have a lot of the money. The resources, yes. The resources to do that. But that was a case where they were dealing with rich data — human language is rich. A lot of things, whether it's DNA sequences or SNPs, are completely uninformative, and we know that. And so that's where some cleaning up helps. Right. I think we're gonna have to move along because we've got a fair bit to cover, so I probably won't be able to take any more questions for the rest of this. So we're gonna try a real example. We're gonna talk about classifying iris flowers. This is a case where we are repeating the same problem — this is what we were doing with the decision tree, so that's nothing new. It's exactly the same. We have our set of 150 flowers: Versicolor, Setosa, and Virginica. I've talked about these same flowers, and we've seen this slide before, about the petal size and the sepal size. So we're gonna take four features and we're gonna predict the category. There are gonna be three layers and three outputs — so four features, three outputs, three species. And then we're gonna have a single hidden layer.
And we're gonna train on a subset of these 150 — so 105 — and then we're gonna test on 45. So we're doing exactly the same classification as we did with the decision tree, the same data set, but we're using a different model. So this is the data set; we've seen this before. Now, we can start programming, but unless you're a programming genius, you won't be able to write this out in the time we have. But this is the algorithm. We read the data just as we did before with the iris one. We check the data, make sure there's nothing missing, just as we did before. We create the training and testing data sets — that's the 70%/30%. What's new this time is we create a one-hot encoding function — to_one_hot is the function name — and we create a normalization function. These are two things we didn't have to do with the decision tree. Decision trees don't need normalization or scaling — that's a nice thing about them — and they don't generally need encoding. So once we've got the normalization function, we normalize the data, and once we've got the one-hot encoding, we do the one-hot encoding. Then we have to define our activation functions. In this case, we actually use two: one is a sigmoidal function and another is called a softmax. Then we create our random weights and our random biases, because we don't know how to start. Then we determine how many batches we're gonna run, because we're gonna do mini batches. Then we do the forward propagation. Then we calculate the errors. Then we perform the back propagation. Then we update the weights and biases, and we repeat and repeat and repeat and repeat. So does that make sense? This is the general algorithm. So we have to import numpy and we have to import pandas. We also have to get seaborn and matplotlib — these are libraries we're bringing in so that we can do some nice plotting. This is the same call to read the data file.
The same set of 150 rows. This is the same data check — check to see that there's no missing data. So there's a verify-data-set function; we've seen this same code before, so nothing really new. We have to do a transformation we didn't have to do with the decision tree. But first we still have to create the training set and the testing set — that was something we did before, that's the 70%. It's not three-fold cross-validation, it's just one fold. But we have roughly one third for testing and two thirds for training. The other transformation is the one-hot encoding. So we have three species — just like it was sort of three colors, blue, green, red — and now we're going to use a 1-0-0 binary encoding. This is a common technique. One-hot encoding is a way of making data readable for the computer, but also structuring it for the input and output architecture of a neural net. So it takes these categorical values — it's called to_one_hot, that's the function name — determines how many columns you need, creates an array, and populates it with ones and zeros. So we've now got a binarized or binary version of our categories. Now, the normalization is something we call feature scaling. We talked about this before; it tries to create things that are scaled properly, so we don't have things that are too large or too small. And so in the L2 normalization, the sum of squares has to add to one. We talked about the other types of standardization — for instance, standardizing so the standard deviation is one. This is because we're dealing with lengths and widths, and some of them range from 0.2 centimeters up to about eight or nine centimeters. That's a fairly large range. So we've created this L2 norm function — it is a normalization function, and this is the basic math for it. So we take our input data and we normalize it, because we've now created this normalization function.
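The two transformations just described — one-hot encoding of the species and L2 normalization of the measurements — can be sketched as below. This is my reconstruction of what functions like the lab's to_one_hot and L2-norm routines might look like, not the actual lab code:

```python
import numpy as np

def to_one_hot(labels):
    # One column per class, with a 1 in the column matching
    # that row's class and 0s everywhere else.
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, labels.max() + 1))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

def l2_norm(X):
    # Scale each feature column so its sum of squares is 1,
    # keeping large measurements from swamping small ones.
    X = np.asarray(X, dtype=float)
    return X / np.sqrt((X ** 2).sum(axis=0))

# Species relabelled 0 = setosa, 1 = versicolor, 2 = virginica.
onehot = to_one_hot([0, 2, 1])
print(onehot)

# Two toy feature columns (e.g. sepal length, petal width).
X = np.array([[5.1, 0.2],
              [4.9, 0.2],
              [6.3, 1.8]])
X_scaled = l2_norm(X)
print((X_scaled ** 2).sum(axis=0))  # each column's squares sum to 1
```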
And so we've normalized all of our dimensions for the sepal and petal lengths and widths. We've also labeled, or relabeled, the species: rather than typing in setosa, virginica, versicolor, we're using 0, 1, 2. So that's another thing with encoding, just like we encoded with the one-hot encoding. So we take the table that we've read and we do the one-hot encoding. We're also doing another thing called flattening, which we'll talk about later. There are other things here that will be used a little later — you can play around with them in the exercise — so I'm not going to really dive into this part here. So we've done our normalization with L2 norm, which makes sure that the sum of squares is one. We've done the one-hot encoding. We've done the relabeling of species type to 0, 1, 2. And already we've chosen that we're going to use an artificial neural net. We could have done the same thing if it was a support vector machine — that would have needed the same approach, I think — and a few others, a convolutional neural net, same thing. But we've chosen a simple artificial neural net. We also have to choose our activation functions, because we've chosen an artificial neural net. So one activation function we're using is a sigmoidal function, and this allows us to do the gradient descent optimization because it's differentiable. It allows us to train things. Here's a bit of math. We're using a sigmoidal function for the first layer, and we're also using a softmax function for the second layer. These are all sort of exponential functions — e to the x, or e to the minus x, or e to the minus z. The derivative of a sigmoidal function, sort of shown here, is the sigmoidal function times one minus the sigmoidal function.
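The sigmoidal activation and its convenient derivative can be written directly from the formulas above — a minimal sketch with my own function names:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigma(z) = sigma(z) * (1 - sigma(z)) — this identity is
    # what makes the gradient cheap to compute during back propagation.
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
```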
There's an explanation also about what the softmax function looks like. Typically, an output layer uses a softmax activation function, and that allows you to get output values that sum to 1, which is very helpful for neural nets. So that's used for layer 2. The input-to-hidden layer uses the sigmoidal function, and then from the hidden layer to the output layer it's the softmax. So I talked about this before, and you guys had asked: what do we do for weights and biases? Usually the idea is to just fill them up with numbers between 0 and 1. Now, by the end of it, they could be well above 1 and well below 0. There's a set of weights for the input and the hidden layer — that's w0 — and w1 is the weights from the hidden layer to the output layer. And then there are the biases for the hidden layer and biases for the output layer. So this function essentially generates all these random numbers, using the random functions in NumPy. And then we're going to do mini batches. So we determine the batch size and training size, and this is the part that calculates the batches — we make sure it's a whole number, so we're not doing half batches or something. And if we've got n batches, we run the process — forward propagation, error determination, back propagation, and update — in each batch, and repeat that for each batch, multiple times. And then we do this for multiple epochs. So we have a training function, and it's supposed to return the outputs, weights, and biases, as well as the training error measurements. The forward propagation is our first step. So we take our data, extracted from the training set, and we propagate it through the layers, like we've shown with some of the other pictures: multiplying by the weight matrices, adding the biases, running through the activation functions.
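The softmax mentioned for the output layer can be sketched like this. Note the max-subtraction is a standard numerical-stability trick of my own adding, not something from the slide:

```python
import numpy as np

def softmax(z):
    # Subtracting the max keeps the exponentials from overflowing;
    # the outputs are all positive and sum to 1, so they can be
    # read as class probabilities for the output layer.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.sum())     # sums to 1 (up to float rounding)
print(probs.argmax())  # 0 — the largest score wins
```

For the iris net, the three softmax outputs line up with the three one-hot-encoded species, and the predicted class is just the position of the largest probability.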
So these are explained in the text, and you can see there's multiplication and addition with weights and biases all the way through these things. Once we've completed the forward pass, we have to determine our error. So we look at Y and we compare it to the batch labels. This part, which is supposed to use partial derivatives, is basically simple differences in this layer, but we're still doing a gradient descent approach. This is the back propagation, as we go from layer two to layer one, and then layer zero, which is the first input. So we've got the layer-two delta and the bias layer. These are, again, calculated through dot products — that's what's being multiplied here. And as we go from layer two to layer one, then we have to go from layer one to layer zero, and I've just numbered these things here. So we're moving back to the next one. Then we have to do the output hidden layer: the weighted sum, the derivative of the activation with respect to that sum, and then the cost function, which is the delta. So these are calculated and moved through as we propagate, and this is still finishing up that calculation — put it all in one slide. Then we update our weights. So the weight matrices — those connections — are updated: we multiply the gradient by the learning rate and subtract it from the current weight values. We also have to update the biases; again, they're modified based on the layer. And then we repeat this process, both through the batches and then through different epochs. We grab the batch, do the forward propagation, error calculation, derivative calculation — for layer two, layer one, layer zero — then do the weight and bias update. And what we're tracking as this goes on is a total error. As we go through different epochs — in this case, it might be a thousand epochs — the error just keeps on dropping, but eventually it flattens out. And you can choose a threshold and say it's just not changing.
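Putting the loop together — forward propagation, error, back propagation, update, repeat over epochs — an end-to-end toy sketch might look like this. This is not the lab's iris program: it's a tiny two-input network trained on an OR pattern, with my own variable names and a plain squared-error cost, just to show the mechanics of the loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training data: two inputs -> one output (an OR pattern).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [1]], dtype=float)

# Random starting weights and biases, as in the lecture.
W0, b0 = rng.uniform(size=(2, 3)), np.zeros(3)
W1, b1 = rng.uniform(size=(3, 1)), np.zeros(1)
lr = 1.0  # learning rate

for epoch in range(10000):
    # Forward propagation through both layers.
    H = sigmoid(X @ W0 + b0)
    Yhat = sigmoid(H @ W1 + b1)

    # Output error, then backpropagated deltas via the chain rule
    # (using sigma' = sigma * (1 - sigma) at each layer).
    delta1 = (Yhat - Y) * Yhat * (1 - Yhat)
    delta0 = (delta1 @ W1.T) * H * (1 - H)

    # Update weights and biases, scaled by the learning rate.
    W1 -= lr * H.T @ delta1
    b1 -= lr * delta1.sum(axis=0)
    W0 -= lr * X.T @ delta0
    b0 -= lr * delta0.sum(axis=0)

# After training, the rounded outputs should match the OR targets.
print(np.round(sigmoid(sigmoid(X @ W0 + b0) @ W1 + b1)).ravel())
```

The raw outputs never hit exactly 0 or 1, just as in the counting video — they settle at values like 0.03 and 0.97 that round to the right answer.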
It's not getting any better — time to stop. So this is a neural network program. It's 215 lines, including comments. Because it's a relatively small data set, it only took seconds to train, and because the training is so fast, it actually trains every time you run it. And there are certain cells you can use to train it with new input data. So we've trained on 105 samples, then we take our test set of 45 and we evaluate those as well. There's a forward propagation for evaluation and so on — a little bit of different code to do the evaluation — but this is the result. In the training set, we should have got 1, 1, 1 along the diagonal; we didn't quite. And then in the testing set, we also would have hoped to get 1, 1, 1, but we didn't quite. But overall, where we just take the diagonal values, the average performance is probably 97, 98%. So that's the difference between the training and testing sets. We can say this has stabilized; it's a valid, well-tested model, something that can go out for use. If we look at the neural net versus the decision tree, you can see that the neural net is slightly better than the decision tree. Again, you want to have the diagonal being 1, 1, 1; they're not perfect. And in terms of statistical difference, I'd say they're basically the same — but if you want to quibble, you can say the neural net is better. We've also written it in R; it takes more lines of code with R, and R is generally a slower programming language. But if you want to run the R version, it should work. We've written a program that runs in Python, trained it on 105, tested it on 45. You can do it for other things — it's fairly generic. It doesn't have to be irises; it could have been genes, it could have been proteins and metabolites.
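The diagonal-counting evaluation described here can be sketched as a small confusion-matrix helper. The counts below are hypothetical, not the actual iris results:

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes=3):
    # Rows = actual species, columns = predicted species;
    # a perfect classifier puts every count on the diagonal.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

# Hypothetical test results: one versicolor (1) called virginica (2).
actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(actual, predicted)
print(cm)

# Accuracy is the diagonal (trace) over the total count.
accuracy = cm.trace() / cm.sum()
print(accuracy)  # 0.833...
```

Normalizing each row to sum to 1 gives the "1, 1, 1 along the diagonal" ideal mentioned above; comparing the neural net's matrix against the decision tree's is how the two models were ranked.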