And today we'd like to talk about Bayesian neural networks. So I assume that you know many of the concepts behind Bayesian analysis, but if you have any questions, do just interrupt me at any point. I may not be able to see the chat, so just speak up and I'm happy to be interrupted at any point. Right, yes. So today we'd like to talk about Bayesian neural networks and to do that, I'll cover three main topics. We'll talk about regressions, starting from standard regressions and then going into how neural networks can be used for regressions. I'll talk about classification and then I'll give you a biased selection of examples of recent studies that I happen to have been involved in, using Bayesian neural networks in biological research. So yes, as said, do interrupt at any point and otherwise I'll get started. I would like to start with a basic model of linear regression, which I'm sure you have already covered in the first part of this Bayesian course. In a linear regression, we have a predictor which we want to relate to another variable, and then we fit a model to be able to use this predictor to make an estimate of these other values. So for example, if we have a diet here, like broccoli intake in pandas, and then body mass of pandas, then we can regress them and try to figure out whether broccoli intake correlates with a change in body mass. We can write this as an equation where we have the body mass equal to a slope times the predictor, so the diet, plus an intercept. We can also show the same model using these nodes and lines here. The lines here represent coefficients, so the slope and intercept in this case, and the nodes here represent the input, that is the diet, and the output, that is the body mass in this case. Our prediction here, Y, will basically be an estimate of panda body size plus or minus a standard deviation, or plus or minus an error, which we can also estimate in the model. So under this model, we have essentially three parameters: the slope, the intercept and the error. If we are to fit this type of model in a likelihood framework or in a Bayesian framework, what would be our likelihood function here? What would we use as a likelihood function? The likelihood quantifies the probability of observing a specific body mass, given my parameters and given my data. What would be our likelihood function in a linear regression? Have you seen this earlier in the course? If you want to speak up. If not, I'll go with this, but then you're going to have to tell me everything about the priors. So a likelihood function that we can use here is based on a normal density. We could say the panda body mass is normally distributed with a mean centered on my prediction Y and a standard deviation equal to the error that we are estimating. So for any given combination of the slope, the intercept and the error, we can calculate the likelihood of our data. The likelihood function will basically be the probability density function of a normal distribution. So the probability of observing the panda body mass that we do observe, given our prediction and sigma, is given by this formula here, okay? And then if I fit this in a Bayesian framework or in a maximum likelihood framework, I will be looking for the slope, intercept and error that, in a maximum likelihood framework, maximize the likelihood of observing this panda body mass.
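Written out, the likelihood just referred to is the density of a normal distribution evaluated at the observed body mass. This is a sketch in my own notation (a for the slope, b for the intercept, sigma for the error), which may differ from the symbols on the slide:

$$
\hat{y}_i = a\,x_i + b, \qquad
P(y_i \mid a, b, \sigma) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)
$$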
If we are fitting this type of model in a Bayesian framework, we are going to have to set up priors on each of the three parameters, okay? So we have three parameters: the slope, the intercept and the error. So we need a prior on the intercept, a prior on the error and a prior on the slope. For the intercept: the intercept represents the body mass of a panda that doesn't eat any broccoli, or where the broccoli doesn't have any effect. So we could imagine this being some sort of mean body mass of a panda. Then what type of prior distribution could we use here for this intercept? A prior is a distribution that should reflect what we know about these parameters, right? So if this intercept here is some sort of mean body mass for a panda, what would be a reasonable prior that we could set on this intercept? There it is in the chat, a normal distribution, from Daniela. Okay, okay. Yes, I'm not seeing the chat I guess, but it could be a normal distribution. It could also be a uniform distribution. It could actually be something informed by the empirical knowledge that we have about panda body size. For example, if we know that pandas are mammals and they are terrestrial mammals, we could define some uniform prior that spans a plausible range of body sizes, right? Body mass. Now, what's a good prior on the error, on the sigma? Okay, the sigma will be the standard deviation that we use in this normal distribution. What's a good prior for sigma? What do we know about sigma a priori, it being a standard deviation? Michelle? Maybe we could also try to set a prior based on the range of variance that is present in different animals for which we have an estimate. We could, yes. So if we have some prior information about, yes, natural variance in this body mass, we could use that information. What if we know nothing about pandas at all because we are botanists and we know nothing about pandas? Can we still say something? Knowing nothing about pandas or animals at all, is there anything that we can say about the sigma parameter? Can it take any range of values? No, I mean, it has to be in some range that is possible biologically. It wouldn't be 2,000 kilos or more, something like this. Yes, that's true. Is there anything more general about this sigma? Remember sigma is this, oops, it's this standard deviation, right, of a normal distribution. What values can sigma take, animals or not? It will be less than the maximum and more than the minimum, right? Yes, even more general, maybe. Is there any value of sigma that doesn't make sense here? Minus one. Any negative value, actually. Sigma is a standard deviation, so it cannot be negative; a negative standard deviation is meaningless. So what we could use here is a prior that will not allow any negative value. For example, a gamma prior or an exponential prior. So any distribution, basically, that assigns zero prior probability to negative values. And the last parameter that needs a prior is the slope. Any idea of what prior we could use here? A slope parameter. Do we have any null expectation on the slope? Did you do regressions in the previous classes? Yes, we did. Okay. I assume there is some idea of what kind of effect you would expect, or what association you would expect. Otherwise, we could set it to zero if we have no idea, or set the mean to zero.
And, I mean, maybe a Gaussian would be too strong and we could go with a t distribution to have a weaker prior and allow wider tails. For instance, yes. But generally, you would use some prior distribution that is centered at zero, which is your null expectation, and a prior that allows both negative and positive values, because you can have a positive or a negative correlation, right? So you could have a normal distribution, a t distribution, a Cauchy distribution — it doesn't really matter, but essentially it will be something that is centered at zero, that is, something that assigns the highest prior probability to a zero effect size, and a distribution that allows both the positive and the negative range. Here, for example, I'm using a normal distribution centered at zero with a standard deviation of one. So this would be a typical setup for a linear regression, where we have some prior on the intercept, some positive-only prior on the error, and then usually a symmetric distribution centered at zero for the slope. So now we have the likelihood, we have the priors, and we can run our MCMC, for example, to estimate the parameters of our model. To do that we'll have a data set, which we will now start calling a training set because we'll be talking about neural networks in a moment. We have this data set where we have a bunch of observations with broccoli intake and panda body mass. And then based on these observations, we're going to optimize a model and find basically the regression line that maximizes the posterior probability. So we'll do this with an MCMC and obtain not only a single estimate of our parameters, but a distribution of estimates. But let's say that this is the line that we estimate with our posterior estimate of the slope, the intercept and the error. So our trained or optimized model will look like this. And it will basically tend to minimize the distance between the line and the observed pandas. This model can now be used as a predictive model. So if we are given the diet of a panda, we can make an estimate of the predicted body mass. So if this panda that we haven't managed to measure eats this much broccoli per day, then we can predict its body mass using our regression line. Okay, so we can use a regression as a predictive model. This is what a linear regression looks like, and this is what a multiple regression looks like. So now we have multiple predictors; for example, we try to correlate body mass with three different quantities: diet, genome — which could be, I don't know, genome size, for example — and range size. The equation of this multiple regression looks like this. So we have a single coefficient for each predictor here, and then we sum everything up and then we have an intercept. Okay, so the predicted body mass is equal to the product between this coefficient and the diet, this coefficient and the genome, this coefficient and range size, and this coefficient times one, which is basically the intercept. We can rewrite this equation as a matrix multiplication. Only these matrices are one-dimensional, so it's a bit silly to do that, but it will help us understand neural networks in a moment. So this is basically the same thing as this one, only we write it as a matrix multiplication, where we multiply item by item, going by columns. So the first predictor with the first column, which has a single item here.
Second predictor by the second column, third predictor by the third column. And then we sum everything up and this is going to be Y, okay? We'll see that these operations come back when we talk about neural networks. So these are two examples of regressions. These regressions are linear by construction, so they can only infer linear relationships between the predictors and the response, right? The body mass. This is fine in many cases. What is nice about these models is that the parameters are clearly interpretable, because they each correspond to a single predictor. But of course, there are some limitations to what a model like this can do, especially since responses in reality are often not linear. And this brings us to neural networks. The first neural network I would like to introduce to you is a neural network used for regression. And we're basically going to walk through the steps of a neural network to understand how this is actually not that different from a linear regression, but still has a lot more power. The operations that are needed in a neural network are actually very similar to the operations that we just did for the linear regressions. Only here I'm adding a bunch of terminology, because these are the terms that are used in machine learning. In a neural network model, we'll have an input, which is basically our predictors. We're going to have weights, which are coefficients, or slopes if you want. Then we have nodes, which are kind of intermediate steps that take you from the input to the output; we'll walk through these steps in a moment. So these nodes are just numbers. There is an activation function that we will talk about in a moment, and then some more weights, a bias node that is the equivalent of an intercept, and then we have the output, which is a prediction. So the terminology is slightly different, but we'll see that the operations are not that different after all. So let's see what this sort of abstract model actually does. And to do that, we are going to walk through the operations that are needed to go from an input to an output, just like we did for the regression model. So the first operation that we have here is this step where we go from a bunch of predictors to a bunch of nodes. These nodes are a numerical representation of the predictors in a different dimension. They are not numbers that we aim or care to interpret directly, okay? It's just some intermediate step that takes us to the output eventually. What is going on at this step is again a matrix multiplication, where we multiply this vector of predictors by a matrix of weights. Each line here is a weight, a parameter that we will eventually estimate. And the operation that we need to do to go from X and W to Z is a matrix multiplication. So to explain this, I'll use real numbers, which hopefully will help. So let's say that these are our starting diet, genome and range size — so we've got three numbers. And these are our current parameter values. So I'm just assigning random numbers to these lines at this point, okay? These are parameters, and we are going to have to estimate these parameters. In the matrix multiplication — so now we need to go from these three nodes to these three nodes, okay, and we go through a matrix multiplication to get there — we first multiply each X value, so each of these predictors, by a W column.
So we take this value and we multiply it by these numbers here and we obtain three numbers that I'm placing here. Then I go to the next value here and I multiply it by the second column, okay? And I get this column here. And then this is the range size, which I multiply by these three weights and obtain this other column. So now I went from three numbers to nine numbers, but I want to end up again at three numbers, right? Which are these three nodes. So the next step here will be sums by rows. So we obtain this two-dimensional matrix and now we take these three numbers here, we sum them up and we obtain the first node, so the first value here. Then we go to the next row, we do the sum and obtain the second value. And the same for the third. So now we've taken these three values here, multiplied them by a matrix — in this case, three by three — and obtained again three values here. Interestingly enough, I know this is a course using R, but if you are interested in doing this in Python, you can do this whole set of operations in a single line, in a single function. So this will do exactly this type of operation. Now, this vector that we obtain here, these three values of Z, are a representation of our input data in an abstract parameter space that we don't have to understand. So we are not really aiming to interpret these values. But what is interesting and important to realize is that each of these values here is already a function of all three features, okay, all three inputs, because they integrate — they sum up effect sizes, basically — from all three inputs, okay? So this also means that this neural network model will be able to account for all possible interactions between inputs. So now we have this set of values. The next step that we need to do is going through this hidden layer, and we go through an activation function. Again, this is the terminology used in machine learning, but activation functions can actually be things that are really simple to compute. For example, one of the most commonly used and one of the easiest to compute is the ReLU function. And this is simply a function that will take any negative value and set it to zero, and any positive value and leave it as is. This ReLU function is one of the simplest that exists and it's still extremely powerful. The whole point of the activation function in the hidden layer is to break the linearity of the response, because we want neural networks to be nonlinear, or potentially nonlinear, okay? So the activation function will typically be something that is nonlinear. In this case, it looks like this: everything that is negative is set to zero and everything that is positive stays where it is. There is a whole range of other activation functions that look, for example, sigmoidal. They are all characterized by being nonlinear. Their point is really to make sure that the neural network can account for nonlinear responses between things. So going through this step is very simple: we just take every value that happens to be negative and turn it into a zero. And then we go to the last step, which looks very much like a multiple regression as we've seen before. So now we have a vector of numbers, we are adding here this bias node that is the intercept, and we do the same operation basically that we did before. We see there is a question. Yeah, I was just wondering how do you choose which activation function to use?
And in your experience, does it matter which one you use or not? I mean, how important is it which one you use? Yeah, so that's a good question. We'll talk a little bit more about selecting the settings of these networks in a bit, so we'll come back to that. Ideally, these activation functions will not make a huge difference. In some cases, especially for small networks, using one activation function or another can give you a better answer. In some cases, it's only a matter of convergence: some activation functions will make the model converge sooner than others. But again, there is no strict general rule. There isn't a single activation function that will do the best job everywhere. So it's part of the fine-tuning that we'll see later how to do. But generally models are not extremely sensitive to this type of choice. Yes, Elis, I think you're muted, I'm afraid. Let's have one question, sorry. I'm wondering whether you have to have bias nodes in every layer, or just at the last one. You can have bias nodes in every layer if you want. It's just about how you build the network, I guess. Yes, it's pretty much like a hyperparameter that you can set. You can choose to have bias nodes everywhere, or you can have one only at the last layer, or you don't have to have one at all, basically. Yeah. Right, so in the last step, basically, we add this bias node that is similar to an intercept in a linear regression. This node will basically be a one that you multiply by its own parameter. We do this matrix multiplication again, we sum by rows and we end up with one value that is our output, and it's our predicted body mass. So you see that going from input to output here implies a lot of operations, but they're very simple operations — multiplications and sums, pretty much. So nothing very expensive to compute going through this network. And this is good because we have a lot of these operations, right? Once we have a predicted body mass, we can calculate the likelihood of our panda given our prediction, right? So if our prediction is here and our true body mass is here, this will be the likelihood based on a normal distribution. If our prediction is here, this will be the likelihood, so the likelihood will be higher. Okay, so this likelihood part will work just the same as in a linear regression. So the likelihood function is still the same: we get a prediction and then we calculate the likelihood of our data, which is the observed body mass of a panda, based on this prediction, okay? So in a Bayesian neural network, we have a likelihood function that is the PDF of a normal distribution, the probability density function of a normal distribution. And then we have the priors on the weights. There is no strict rule, again, on which prior we should use on the weights. The standard choice is to use again some distribution centered around zero, assuming that zero means that there is no effect size. Here, the effect size is a bit harder to interpret, right? Because there isn't just one effect size linked to every predictor. But the typical choice for priors here is a normal distribution or some other distribution centered at zero. So we have a likelihood and we have priors; this means we can compute a posterior. And if we can compute a posterior, we can run any Bayesian algorithm to estimate our parameters. Here the parameters will be all of these weights, right?
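To make the walk-through concrete, here is a minimal sketch of that forward pass and the likelihood in Python/NumPy. This is my own illustration, not the course code; the array shapes, the toy numbers and the fixed sigma of 10 are assumptions. The "single function" mentioned a moment ago for the multiply-and-sum-by-rows step would be the matrix product (`@`, or `np.dot`).

```python
import numpy as np
from scipy.stats import norm

def forward(x, W1, w_out, b):
    """One hidden layer: matrix multiplication, ReLU activation, then an output layer with a bias node."""
    z = x @ W1                 # multiply the predictors by the weight matrix and sum by rows
    h = np.maximum(z, 0.0)     # ReLU: any negative value becomes zero, positive values stay as they are
    return h @ w_out + b       # last step, very much like a multiple regression, plus the bias (intercept)

# toy values: three predictors (diet, genome, range size), three hidden nodes
x = np.array([2.0, 1.5, 0.3])
W1 = np.random.normal(0.0, 1.0, size=(3, 3))   # weights from input to hidden layer
w_out = np.random.normal(0.0, 1.0, size=3)     # weights from hidden layer to output
b = 50.0                                       # bias node (a one) times its parameter

y_pred = forward(x, W1, w_out, b)
# likelihood of an observed body mass (say 90 kg) given the prediction, exactly as in the regression
log_lik = norm.logpdf(90.0, loc=y_pred, scale=10.0)
```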
And we will try to optimize these weights so that the likelihood of our panda body masses will be the highest — or I should say the posterior. So if we have a training set like this, we are going to have a bunch of pandas for which we have the body size and the predictors. This is our training set. And then we can run our neural network. We can use a posterior sampling algorithm; I'm sure you've come across Metropolis-Hastings MCMC or Hamiltonian Monte Carlo algorithms. These are algorithms that iteratively update the parameters. So it will start with some initial parameters: basically we initialize the weights of the network in some way, with random numbers. If you're using MCMC, you will use these as your current parameters, then you will update your parameters using some proposals, right? So you will slightly update the parameter values. You will calculate the acceptance ratio, which will be based on a comparison between the posterior of the new parameter values and the posterior of the previous parameter values. Based on the acceptance ratio, you will accept or reject the new parameter values. If you accept the parameter values, you will continue from that point; if you reject them, you will go back to the previous point, and then update the parameters again, calculate the acceptance ratio, and so on. I'm not going into the details of MCMC because I think you have seen it in previous classes, but of course if you have questions about this, do ask. As we sample parameter values from their posterior distribution, we will collect them, right? So we will produce a collection of posterior values of our weights, the parameters of the neural network. So if this is our model, we will collect posterior samples of all of these parameters and all of these parameters, right? So these are like the posterior distributions of these parameters. All good so far? Okay. If I now have a distribution of weights, or parameters — basically samples from their posterior probability distribution — we can use these weights to make predictions. So if we have broccoli intake, genome and range size for some pandas, we can predict their body size by running these measurements through our optimized network. And in fact, we can do it many times by sampling a different weight here every time. And if we do that, we are going to obtain a posterior distribution of the predicted body size of our pandas. Okay, so we are going to predict panda body mass based on these three inputs, but accounting for complex relationships between these predictors and the output. And I say complex because it's been shown that if a neural network is deep enough and complex enough — so if it's got enough parameters — it can approximate in principle any function. So a neural network, unlike a linear model, will in principle be able to estimate any sort of response given our inputs. And it will be able to account for any possible interaction among our predictors. So this is pretty awesome. BNNs, and neural networks in general, are very powerful. So why don't we use them all the time? Of course, neural networks are very powerful, but there are a number of drawbacks. Neural networks are basically by definition overparameterized models, which means that they have way more parameters than necessary. And this changes a lot of the whole business of model testing: in a likelihood framework, you tend to find the simplest model that best explains your data.
In this case, you're going to have to use different methods, and we'll see some of them. But essentially, there is a risk of overfitting using this model because there are so many parameters. Defining priors in a Bayesian neural network is not trivial because these parameters are not directly interpretable, right? The parameters themselves don't have an interpretable meaning, and that means that it's not as easy as it was for a linear model to set priors that are informed by some a priori knowledge. For example, for the intercept in the linear model, we could say that the body mass of a panda has a certain plausible range. So we were using a priori information to define our prior. Here, for these nodes, that's not so trivial. The parameters themselves are not directly interpretable, and therefore these models are not explicitly designed for hypothesis testing. There are ways to do hypothesis testing using neural networks, but that's not their main goal. Yes? I'm just wondering how crucial it is to have informative priors, because maybe if you train your network for long enough, then you'd be able to reach the weights that are adapted to your question. Yeah, so informative priors are useful sometimes, because in some models, if you have a priori information and want to use it, a prior is a good place to do that in a Bayesian model. Here, for example, if we had a priori knowledge about the body mass of a panda, it would be difficult to specify a prior that reflects that, because the body mass of a panda here does not get its own parameter. I mean, there is this sort of intercept here, but it's multiplied by a bunch of weights that don't really directly relate to any single predictor. So it would be difficult here to set a prior that reflects a priori knowledge. Yes, but if you update them accordingly, at some point you don't really care about the initialization anymore, right? But when I say prior, I don't mean the initial parameter values; I mean the prior distributions that we use to calculate the posterior. So the prior here would be basically what we used for the parameters of the linear model, which I cannot find anymore — too many slides. So for the linear model, we had these priors, right? Prior distributions on our parameters. Here, we could define priors knowing what we are doing, basically, because we know what each parameter means. This is an intercept, which we can imagine to reflect the mean body size of a panda that doesn't eat broccoli or something. This is a standard deviation, so we know what type of prior we want. Here, we want some sort of prior that assigns the highest probability to our null hypothesis, which is no slope, basically, so zero. In the case of a neural network, the parameters are not directly interpretable, so defining a prior on them is not as trivial. So priors here will typically be centered on zero, but if we wanted to add some extra information, it would be more difficult to do that. Okay, got it. More questions? Okay, so these are some drawbacks, basically, of these types of models. One specific concern is the risk of overfitting. And this, in particular in machine learning, is a crucial point, because machine learning, unless it's using Bayesian neural networks, is essentially using a likelihood approach. And in a likelihood approach, the typical way to fit a model in a maximum likelihood context is to find the parameters that maximize the likelihood of your data.
So that's the typical way to fit a model in maximum likelihood. The problem here is that you have so many parameters that if you did that, you would almost certainly be overfitting: you would find parameters that are perfect for your data, perfectly matching your data, but the model would have very poor predictive power because it would be overfitting the data. So what people do in machine learning — non-Bayesian machine learning, but we'll come back to that — is typically to split the data set into a training set and a validation set. Okay, so you just take your data set and split it into two chunks. And you optimize your parameters based on the training set. In this case, you will be minimizing a loss that is basically the inverse of the likelihood. But at the same time as you minimize the loss, or maximize the likelihood, on your training set, you will also monitor the likelihood or the loss on your validation set. So you optimize the parameters on this set, but you also monitor how accurately you do predictions on this validation set. And for a while, the validation and training loss will follow a similar trajectory: they will decrease, or the likelihood will increase. At some point, however, the training loss will continue decreasing, but the validation loss will start increasing, meaning that after a while, the parameters are being optimized toward overfitting the training set. Okay, so the parameters are getting really good at predicting these three pandas, but they start to become worse at predicting these two pandas. So by monitoring the validation loss, you can detect when the model is starting to overfit and then basically stop the fitting process. So instead of maximizing the likelihood, you stop the optimization at some point — basically at the moment where the validation loss starts increasing. So the optimization stops here and these are the weights of your model. This is in a typical machine learning framework, so not in a Bayesian framework. In a Bayesian framework, the risk of overfitting is limited, and it's limited by the fact that you have priors. These priors that you set on the parameters, which are typically centered at zero, have a regularizing effect. I'm not sure if you have seen regularization or Bayesian shrinkage before in the course. Regularization is this effect that priors have of preventing overfitting. Regularization is used a lot in multiple linear regressions as well, by using priors on the coefficients that basically reduce the risk of overfitting. Here, having so many priors on so many parameters will effectively prevent the Bayesian neural network from completely overfitting the data. Validation sets will still be useful in Bayesian neural networks, but they're not strictly necessary here. So you can run your MCMC as long as you want and at some point the model will reach convergence, and that convergence point is likely to be about right, so not really overfitting. Any questions about this? Yeah, can you explain the last part again, about how the priors are used? Yes, so the priors are used in a Bayesian neural network by calculating, for any given value of your weights, their prior probability, right? Just like you would do for any free parameter in any model in a Bayesian context.
So in any Bayesian framework, when you have a model like this, for any given value of B you will calculate its prior probability based on this distribution. For any given value of C you calculate the prior here, and for any given W value you calculate the prior here. The posterior probability will be proportional to the priors — so this probability times this, times this — times the likelihood, that is, the probability of your data, right? So this is the standard Bayesian construction of a model. In a neural network, you basically have priors on the slopes, right, on these weights, and these priors will have a regularizing effect because they will constrain the parameters. So in a BNN what we have is a likelihood function that looks like this, and then we have all of these priors applied to the weights. So the weights cannot just go anywhere just to fit the data; they will be constrained by this prior. And the prior will favor weights that are around zero. Okay, thank you. Yeah, so this type of prior is also used in a multiple regression context, where you have multiple coefficients, and using this type of prior will do regularization in a multiple regression. And similarly it will do regularization in a neural network — regularization meaning it will essentially prevent the model from overfitting. Does that clarify? Yeah, exactly, thank you. Any more questions about this? Now, there are many parametrizations that neural networks can take. We can change the activation functions, as we mentioned earlier — ReLU, sigmoid, there are very many activation functions that you can look up online. They are typically simple things to compute, but they're always non-linear functions. So you can change the activation functions, you can change the number of nodes, right? This will just increase the number of parameters. You can change the number of hidden layers; this is what makes them deep neural networks. The operations here will still be the same as we did before, right? It's just that you do more of them. How do we choose among these different options? Ideally these different options will not have a big effect on the outcome, so if that's the case, you don't have to worry much about it. In some cases you can use a validation set, even in a Bayesian network, to choose between different architectures of the model. For example, you can have a case where, on your training set, you sample likelihoods that are lower for a very simple model and increasingly high for more complex models — so the more parameters, the higher. This is not necessarily the case, but it could be. But if you then monitor the likelihoods sampled through your MCMC on your validation set, maybe you find that this is the distribution that you're sampling. And if that's the case, then this model would actually return the highest validation likelihood, meaning that this other model is maybe overfitting a bit. So in this case, I would choose this model. This is just a rule of thumb; it's not really strict, explicit model testing. Explicit model testing in the sense of calculating marginal likelihoods in a Bayesian framework, or AIC, likelihood ratio tests and these kinds of things in a maximum likelihood framework — this type of model testing doesn't exist for neural networks, because there are just too many parameters that you would have to test.
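As a sketch of how those zero-centered priors enter the calculation: the same forward pass as before, a normal likelihood, and a normal(0, 1) log-prior summed over every weight. Again, this is my own illustration with assumed names, shapes and prior widths, not the course code.

```python
import numpy as np
from scipy.stats import norm

def log_posterior(W1, w_out, b, x_train, y_train, sigma, prior_sd=1.0):
    """Unnormalized log-posterior of a small BNN: log-likelihood of the data
    plus zero-centered log-priors on all weights (the regularizing effect)."""
    h = np.maximum(x_train @ W1, 0.0)                       # hidden layer with ReLU
    y_pred = h @ w_out + b                                  # predicted body masses
    log_lik = norm.logpdf(y_train, loc=y_pred, scale=sigma).sum()
    log_prior = (norm.logpdf(W1, 0.0, prior_sd).sum()       # weights cannot wander off
                 + norm.logpdf(w_out, 0.0, prior_sd).sum()  # just to fit the data: large values
                 + norm.logpdf(b, 0.0, prior_sd))           # get low prior probability
    return log_lik + log_prior
```

An MCMC would then propose small changes to these weights and accept or reject them by comparing this quantity at the new and the current values, exactly as described for the regression earlier.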
But yeah, so once we choose our best model, we can do predictions and obtain predicted body masses for our pandas, given any combination of broccoli intake, genome, and range size. Yeah, so that was all I had for the regression part. Any questions? Do you need a break? Good to go? All right, then let's keep going. The next thing that I wanted to talk about is networks for classification. This time it will be a lot easier because we have already seen the neural network as a model. And classification is just another type of task that a neural network can deal with. Most of the network structure will be the same as we had before. So we could have the same input that we had before, the same nodes here. But our question now is not about predicting the body mass of a panda, it's predicting whether a particular diet, genome, and range size is associated with a panda, a brown bear, or a cave bear. So you can see already from the structure here that the model itself is not very different, except for the last part. The classification task will basically be to find the parameters here that, for any given diet, genome, and range size, will assign the highest probability to the correct species, okay? The first part of the network runs just the same as we have seen for the regression model. The last part of the network, the output layer, is now different because instead of predicting a single value here — before, the body mass of a panda — here we need to predict three values, which are the probabilities associated with each of the three species that our model includes. So our model here has to end up with three nodes, and each of these nodes will become a probability. The first one will be the probability assigned to a panda, meaning that this diet, according to these parameters, will be assigned with that probability to a panda. The second probability will be assigned to a brown bear, the third one to a cave bear. So here, instead of having a single node as we had before, we now have three nodes because we want to end up with three probabilities. And these values here, which will come from propagating the features through the hidden layer and into the output layer — these numbers will have to be transformed into probabilities. But these values can be positive or negative, right? They can take any value. And so to turn a vector of any range of values into a vector of probabilities, we use the softmax function. It is a simple function — again, you can write it in one line in Python, and in R as well — that turns any vector of numbers into a vector of probabilities. The equation itself may look more complicated than it actually is, but what we're doing here is basically exponentiating each of these values, and then we divide each of these exponentiated values by the sum of the exponentiated values. When you exponentiate a value, whether it's negative or positive, you will always get a positive value. So then you divide each positive value by the sum of the three values, and the three values that you obtain sum up to one, basically, okay? You divide them by their sum and you end up with three numbers that sum up to one, so they qualify as probabilities. So the only difference in this network is this one function here, called softmax, that turns a bunch of numbers into a bunch of probabilities. Now, what's the likelihood function in this model?
Before, we were using a normal distribution as a likelihood function — the PDF of a normal distribution. The likelihood function in this model is even simpler: it's basically the probability mass function of a categorical distribution. The likelihood is already given by these probabilities. So under these weights, and this particular broccoli intake, range size and genome, my model gives me a likelihood for this panda of 0.11, so 11% probability. By optimizing my model, I'm going to update these parameters, these weights, and I will favor parameter values that give me a higher likelihood for my panda, right? Which will mean that this broccoli intake, this genome and range size will be associated with a higher probability to the correct species, in this case a panda. Okay, so again, here I will have a training set, a bunch of species for which I have the predictors and the label, so the species. And then the model will be trained to find the best answer in all of these cases. Again, we have the issue of potential overfitting, which is clearly a problem in regular machine learning, less of a problem in a Bayesian neural network. In the case of a classification task, overfitting would mean recognizing exactly each item in your training set, like you do here, but this would result in poor predictive power, meaning that if I give the model a data set that is different from the training set, then the model will misbehave. So as usual, we want our models not to overfit, so that we have better predictive power. This is done using priors as a regularizing function here. Again, we can also use the same approach as before: we can split our data into a training set and a validation set and then look at the validation likelihood to see which of the different models returns a better likelihood for the validation set. And if a more complex model returns a lower likelihood on the validation set, it probably means that it's slightly overfitting. The degree of overfitting here will not be as high as in a regular machine learning context. Now, if we optimize this model for classification, we're going to get a bunch of posterior weights, right? So distributions of values here, which means that for any given diet, genome, and range size, we can now make predictions and predict whether these observations come from an individual of this species, this species or that species. In this case, if the model was trained well, and if the ground truth is brown bear, this diet will be linked with the highest posterior probability to a brown bear. We can get a range of predictions by running the same features, the same predictors, through the neural network across many of these parameter values. So we get a range of predictions and we can collect all of these predictions to obtain posterior probabilities of these features being associated with a panda, a brown bear or a cave bear. If the model is trained well, it will give us the correct answer for most of our data. One of the reasons why Bayesian neural networks are really useful is that neural networks — non-Bayesian ones — tend to be very certain about their predictions, even when they're wrong.
So just looking at these probabilities, which you will also get from a standard neural network, is not really a good way to determine how certain you are about your prediction, because there is a demonstrated tendency of neural networks to be very confident about their answers, even if they're wrong. And this is a problem if you, for example, feed a model with data that doesn't belong to any of these three species, which could happen. In a Bayesian neural network, because we have this regularizing effect and because we can compute posterior probabilities associated with each class by running the model through a range of posterior weights, when you feed the model with unknowns — so with out-of-distribution data, a diet, genome and range size of something that the model has never seen and that is not included in the output — a Bayesian network will typically give you uncertainty in the outcome: like, I don't know which of these three it is. It cannot directly tell you it's something else, because "something else" doesn't show up in this vector here, but it can at least tell you that it doesn't know which one it is. And it will do that by assigning low posterior probabilities to all of your classes. So this is a major advantage of Bayesian networks. We can then evaluate the accuracy of our predictions using confusion matrices, which you may have seen before, and by calculating the accuracy. Basically, we can take our validation or test set and see how many times the model got it right or wrong. This is to get an idea of how good the model is, right? So in this case, out of 37, I got 33 correct, and that's an 89% accuracy. We can look at the confusion matrix, which will basically tell you for every class how many of them you got right. So I had 14 pandas in my dataset; all of them were predicted correctly. I had 10 bears here in my data set and I predicted nine correctly and one wrong, and this one wrong ended up in this class. Okay, so this is what the confusion matrix shows. Here I had 13 cave bears, and 10 were predicted right and three were predicted wrong. Now, it's important to look at this confusion matrix, because high accuracy doesn't always mean that your model is good. This is especially the case when you have an imbalanced dataset. In this dataset I had a lot of pandas and very few bears, and so I have a very high accuracy here, but what the model is predicting is basically that everything is a panda. And even so, you still get a high accuracy just because the test set is biased — it includes so many pandas. So saying that everything is a panda will still give you a 92% accuracy, but maybe the model has not learned anything. So this is something to be careful of. And this is a general issue with supervised learning. Supervised learning means that you train a model based on some observations, and the model will only do as well as the observations allow it to. If you have the chance to experiment with machine learning with toddlers, it's pretty amazing, because toddlers are basically like untrained neural networks that you can try to train. And I did that twice. And you can see that toddlers will mistake things in ways that you would not really imagine beforehand. And this comes from their limited training set, because they have not seen everything yet. So they will identify a big tail as a snake, if this was their training set, or a half moon as a banana. And so this happens to toddlers — then toddlers know better after a while.
Machine learning models also have this type of problem. So they need to be trained properly with proper data sets so that they don't end up giving you answers that don't make any sense. I have a question on a previous slide. Please, yes. And I admit that I have not attended the course, so maybe it just comes from limited knowledge. So one slide before that one. Yes, here. Where you said, okay, if you really have data that is not from any of the three bear species, it will give low posterior probabilities among all the classes. So does this mean that they don't need to add up to one? Oh, they will still add up to one. So here they will be 33% each. Ah, okay. So you call 33% a low probability. Okay. Yeah, I mean, yes — typically you would say I'm confident about the posterior probability if it's greater than 95%, for example. Right, right. So we're just talking about each one, but they do need to add up to one. Yes, they do. Okay. Perfect. Thank you. So it's basically up to you to define what threshold you consider significant. You can do this by simulations or you can use some standard cutoff like 95%. But this would be a case where you basically don't know what the answer is. And that's a good thing, because the ground truth was that the data come from a different species. More questions? Yeah, maybe one question about the training data. So how big should it be? What are the numbers you normally train on? So yeah, that depends on the type of model that you're running. If you have a model that is simple like this one, with a few predictors and a fairly simple outcome, then you don't need very many training samples. The good thing about Bayesian neural networks is that they scale, or rather adapt, well to small data sets, because they have these priors and you get posterior probabilities. So even if you have a small training set that's okay, because the model will basically give you more uncertainty on your prediction. So you don't risk getting very confident and very wrong answers. That's why Bayesian neural networks are particularly useful for small data sets. Meaning, I don't know, a few tens of these observations could be enough to train a model. And it really depends on how many parameters you have and how complex you build the model. Obviously, if your data set is small you won't go for many hidden layers and many nodes; you will tend to go for simpler neural nets. But in a BNN usually you will get a quantification of the uncertainty in your predictions, whether it's a classification task or a regression task, just the same. So even in a regression task, if your training set is small then you will have a wide uncertainty interval around your prediction, right? So that's good. In a regular neural network you would typically need more training data to get decent predictions, because you have fewer ways to quantify the uncertainty that you have. Okay, so if I understand correctly, the main difference is that normal neural networks really try to give you a precise answer, while in a Bayesian one you basically get this distribution as an uncertainty, so it doesn't need as big a sample, right? So yes, that's basically — I mean, it's the difference between maximum likelihood and Bayesian analysis. In a maximum likelihood analysis you get a point estimate, and then there are some ways to get some sort of confidence intervals, but you get a point estimate. That's your prediction.
In regular machine learning you get a single value here, or a single choice, basically, in a classification task. Boom, it's a panda, that's it. In a Bayesian framework you get, by construction, by the algorithms that you're using, a credible interval around your estimates. So if you have fewer data, this will be reflected in wider credible intervals, and this is true for Bayesian neural networks as for any other model, basically. Little data means more uncertainty in a Bayesian model, which is why we use Bayesian models, and particularly so with small data sets. That makes sense. Yeah, okay, thank you. But so these BNNs are basically better for classification, so in analogy it's like supervised machine learning, right? Well, this is a regression task, right? Okay. This is a regression task. Yes, I'll give you an example of a non-supervised BNN in the last few minutes if you... Okay, cool. It's fascinating, thank you. Good. Yeah, I would basically... Or is there any other question? Otherwise I would show you a couple of examples of things that we use BNNs for. Then I'll start, and if you have questions just interrupt me. So one application where we've used a Bayesian neural network is in the context of conservation, or computational conservation biology, you could say. So you probably know that there is a Red List that classifies organisms by their extinction risk. It's compiled by the IUCN, and the Red List basically takes every species of animal or plant and tries to define whether they are endangered or not, and to what level they're at risk of extinction. And this is great if you're interested in mammals and birds, because they're all assessed by the IUCN, but if you are interested in other organisms, this Red List is very incomplete. Okay, so there is very little known about plants, even less about invertebrates, and even less about fungi. So here we used Bayesian neural networks to make predictions of these classes. So this is a classification task, and as predictors we use data coming from GBIF, which is biological occurrence data that spans many more species than the IUCN Red List. And you can take these occurrences and turn them into features that enter a neural network. These occurrence records will approximate things like the range size of a species, how many individuals of the species there are, and so on; you feed them into a network and then get as an output a prediction of their extinction risk. And so you can train the model with the Red List that already exists and then apply it to the other species that are not yet classified but for which occurrence records can be found. For example, we did this on plants in Madagascar, where the IUCN has assessed 4,000 plant species; using this model we were able to add an additional 5,000 species. So we more than doubled the number of species that have at least an approximation of their predicted extinction risk. And in some cases this will be qualitatively different. So for example, for ferns, only 3% of the assessed ones were estimated to be at risk of extinction, whereas with our BNN predictions we get between 38 and 57% that are at risk of extinction.
So this changes the picture qualitatively, you know. And because we fit a BNN we get a credible interval: because we run a Bayesian neural network we don't get a point estimate, we don't get a single classification for every species, but we get a distribution of classifications for each species. So we can tell that between 38 and 57% of all fern species in Madagascar are threatened. There are also cases where we use these Bayesian neural networks in a non-supervised approach. For example, we have a model that looks at fossil data to estimate parameters of speciation and extinction rates. Basically, looking at the fossil record of a clade — for example, mammals — these models estimate how speciation and extinction rates change over time: how fast species diversify and how quickly species go extinct. And these are models that are implemented in a Bayesian framework. But we can also use neural networks here to try and link changes in speciation and extinction rates to particular things — for example, traits like phenotypic traits of the species, temporal events and things like that. So here the likelihood is not that of a classification or that of a regression. The likelihood functions are completely different: they're based on Poisson and birth-death stochastic processes. So it's a completely different context, but we can still use neural networks to make the connection between some predictors and the parameters of this model. The parameters of this model are speciation and extinction rates. And we can make the speciation or the extinction rate a function — but a non-linear function — of a number of things. So the speciation rate could be a function of climate, time, some traits, diet, overlap with humans. And because we don't know a priori what types of interactions or what type of responses these things may have on the speciation and the extinction rate, we use a neural network here. And this is unsupervised because we don't know the ground truth, but we can calculate, for any given lambda here — for any given speciation and extinction rate — the likelihood of our data. And so we run this model to estimate speciation and extinction rates indirectly as a function of all of these predictors through the network. And what we found, for example, for elephants — elephants are an interesting case because they've been around for a long time, they have a very good fossil record, but their diversity dropped very quickly, very recently. Until a few million years ago there were between 20 and 30 elephant species roaming across almost all continents, except Antarctica and Australia. Now there are three species on only two continents. And so if we are to try and find out why they diversified the way they did, we can use these types of models and correlate speciation and extinction rates with specific traits and events for elephants — which in the case of elephants were diet and traits, overlap with humans, and climate. And what we found is that there are non-linear, complex relationships between traits and diet that explain speciation rates, and that humans are basically the one predictor for the extinction of elephants, which kind of didn't survive us too well; our model was able to quantify an 8- to 20-fold increase in extinction rate linked to humans. So this is another example of using a Bayesian network inside an unsupervised model. And I hope this wasn't just confusing, but that was kind of what I had for today.
Thanks for listening. Do you have questions? Yes, I'm not sure I really understood how you find the weights if it's unsupervised. Yes, that's a good question. So our model, even without the network, is using the observed fossil record to estimate some sampling rates, basically, and the speciation and extinction rates. And so for any given speciation and extinction rates and fossil data, we can calculate the likelihood of the fossil data given any combination of preservation rates and speciation and extinction rates, okay? So that's a likelihood. It's unsupervised because we don't know the true speciation and extinction rates. But this is like a regular likelihood approach, where we have the data and we have a likelihood function that tells us what's the probability of observing these data given, at this point in time, for example, this preservation rate and these speciation and extinction rates. And these likelihood and prior functions are based on these Poisson and birth-death processes. So here we use the network to basically modulate how speciation and extinction rates change over time and across species. The network will basically spit out a particular speciation rate for a given species at a given point in time, and given this value, we can compute the likelihood of our data. Does this clarify it a bit? I mean, in most of evolutionary biology, models are unsupervised in a sense, because we never know the ground truth, right? Every time you go back in time a few million years, there is no ground truth anymore. And the same applies to phylogenetic inference, where we try to estimate the phylogenetic relationships connecting species. We estimate them based on likelihoods that we calculate, but we don't know the ground truth, right? So we hope that our model is doing a good job. And it's similar here, but the way to modulate the parameters is given by a neural network. More questions? Yeah. Yeah, I have a couple of questions. They are about the classification case, and they both have to do with how to best interpret, or present to people who may not even be used to Bayesian thinking, what comes out of such a classification task when you don't have a point estimate but you actually have a distribution. So the first question is, do you have any experience with how you would make some kind of plot or table or output that guides people more toward thinking in this framework of "we have probabilities for each class" rather than a hard assignment to a class? This is one of my real-life issues. In the end, there are people who read what I produce, but what they are very interested in, in the end, is: is it this, or is it that, or is it that? How to communicate better that, well, it might be this or it might be that, and this one has higher probability, but the other one is perhaps also not wrong because its probability is not that much lower. I don't know if you have any creative ideas about that. I mean, communicating uncertainty is not always trivial, because, well, many scientists I think know what uncertainties are, but yeah, I agree that it's not always easy to convey this. For the case of classification, you get actual posterior probabilities, right? So in this case, I can tell that I'm, say, 80% — or whatever it is — confident that this diet, genome and range size are associated with the brown bears.
So you actually have a number, and if you want you can use thresholds. What I did in some cases is to set thresholds, for example using a test set, below which you do not accept any classification. If you set aside a bunch of your data as a test set, you can say: okay, if I just take the prediction with the highest probability, whatever that probability is, even if it's only 40%, then I will have a certain prediction error in my test set, but I will make predictions for all of my samples. If instead I say my threshold for accepting a prediction is 90%, for example, then maybe my prediction error drops to 1%, so I'm more confident, but that also means that every time I don't get a 90% posterior probability for a class, I refuse to make a prediction. Meaning that if you have 1,000 data points, you make predictions for maybe only half of them, but for that half you are very confident; for the others you don't do the classification. And that's something you can do with a BNN because it gives you these posterior probabilities. You can just say: any outcome below, whatever, 90% posterior probability, I will not consider.

Yeah, and then I have another question, also about this classification task. What actually happens if you have a setup where, say, you have three classes, a small number of classes like in this example, but two of them are very close, say a brown bear and a black bear, and one of them is very far removed, let's say a mouse? What I would expect is that you will get a lot of cases where it's uncertain between the brown bear and the black bear, and it's usually very certain when it's the mouse. Then if you use a threshold like you just described, it will miss a lot of the bears, because it's very often very uncertain and will just say "we don't know what it is". But actually you do know something: you do know that it's a bear. In a result like that, this information gets lost and the model says "I don't know anything". Do you have any notion of how one could deal with that, so as not to lose the information that it's actually not a mouse?

Yes, so first of all, you can calculate the accuracy per category, right? We've seen the confusion matrix: you can look at each row and compute an accuracy. So here my accuracy for pandas is 100%, the accuracy for brown bears is 90%, and the accuracy for black bears is whatever percent.

From the training data, what do you know?

Yes, well, from your test set, right? So you can tell what the prediction accuracy is for every class. If you have a case as you say, where for example the pandas are very different from the other bears, then for the other bears this part of the confusion matrix will be more fuzzy, right? But what you could also decide to do is to summarize these two classes and say: maybe I have a 99% accuracy in determining that a given item is either of the two bears and not a panda. So maybe the accuracy over these two rows together is 95%, even though within this block I am not sure which bear it is. Then you don't throw those predictions away; you can say, okay, this is at least one of these two classes.

Can you just add the probabilities for that, or is it more complex?
Yeah, you can calculate the accuracies, right? You can say: here I have 14 on one side and 23 on the other side, and you can calculate how accurately I can separate pandas from the other two. So basically it's like computing a confusion matrix with only two classes, where you put both bears in one row. You can aggregate them afterwards and calculate the accuracy.

Yeah, and is it correct or not correct, when you're actually using the model for classification, to add the probabilities? Say you have your classification into three classes and you still want that, but you also want to know how high the probability is that my individual falls into one of two of the three classes. Is it correct to add the probabilities that were spit out for those two classes, or not? I mean, maybe we don't even have a point estimate, so it's a little bit hard to know what that means.

I mean, what I would typically do is just make many predictions, because you can, because you have all of these posterior samples of the weights. You do many predictions, you collect them, and then you see how many times the model categorizes an item as this species, that species, or the other species. Then you have these frequencies and, yes, you can sum them up: 33% of the time it was this one, 33% of the time it was that one, so 66% of the time it was one of these two.

Okay. So yes, that's fine. So you don't do it directly on the probabilities you get here, but on the posterior frequencies of the classifications.

Yeah.

All right, that is very useful, yeah. Thank you.
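As a small illustration of this last point, here is a sketch, assuming Python/NumPy and made-up class probabilities in place of output from a trained BNN, of how one might turn many posterior draws of the classification into per-class frequencies, apply an acceptance threshold, and sum two classes into a "some bear" statement. The class names, the Dirichlet-generated probabilities and the 90% threshold are all placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ["panda", "brown bear", "black bear"]

# pretend these are class probabilities from 100 posterior samples of the
# BNN weights, for one item; in reality each row would come from a forward
# pass with a different posterior draw of the weights
post_probs = rng.dirichlet(alpha=[1, 6, 5], size=100)

# classification frequency: how often each class wins across posterior draws
winners = post_probs.argmax(axis=1)
freq = np.bincount(winners, minlength=len(classes)) / len(winners)
for name, f in zip(classes, freq):
    print(f"{name}: {f:.2f}")

# hard call only if the best class clears a threshold, otherwise abstain
threshold = 0.9
best = freq.argmax()
call = classes[best] if freq[best] >= threshold else "no confident call"
print("single-class call:", call)

# aggregating two classes: the frequency of "some bear" is just the sum
p_bear = freq[classes.index("brown bear")] + freq[classes.index("black bear")]
print(f"frequency of being one of the two bears: {p_bear:.2f}")
```

Because the per-class frequencies come from the same set of posterior draws, summing the two bear frequencies is simply the posterior frequency of the composite event "it is one of the two bears", which is the aggregation discussed above: the single-class call may abstain while the composite call is still highly supported.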