Okay, thanks a lot, and special thanks to the organizers for including our work in the conference program; I'm very happy to be here today. This talk is about enhanced Bayesian neural networks for macroeconomics and finance, and it is co-authored with Niko Hauzenberger and Florian Huber from the University of Salzburg and with Massimiliano Marcellino, whom we heard earlier today.

I would like to start by showing you a few variables that we are usually interested in when it comes to forecasting. These include macroeconomic variables like inflation, industrial production, and the exchange rate, for example, but also financial variables like the stock market index. I suppose most of us share the experience that forecasting these variables can be a very difficult task, especially recently, when uncertainty and risk are high. For example, we had a pretty hard time capturing the surge in inflation after the COVID-19 shock, and we continuously underestimated its persistence. It was also nearly impossible to get close to the true value of the major drop in activity after the COVID-19 pandemic. But it is not so much about getting this drop right; it is also about how we deal with those observations when we forecast the next period. So we have to find a way to deal with non-linearities and irregularities in our data.

The recent trend goes in the direction of using large-dimensional data sets, but this involves challenges as well: overfitting issues, for example, and the question of how to model all the complex dynamics when you use a large-dimensional data set. For the first issue there are solutions in the literature: regularization-based techniques such as shrinkage priors are successfully used to overcome this "curse of dimensionality", as we call it. But in those models the common assumption of linearity often remains. So in this paper we asked ourselves how to model the relationship between a response and a large set of covariates, and how to safeguard against overfitting at the same time.

What we do is use neural networks as a device for learning this relationship between variables. As Hornik et al. showed back in 1989, neural networks can learn an unknown relationship between variables under relatively few assumptions. However, there is a big drawback, and that is model specification. In a neural network you have quite a large number of hyperparameters that you have to choose or tune: the non-linear activation function, and the number of neurons and hidden layers, and it is usually quite hard to choose them. What most practitioners and researchers do is rely on cross-validation, but those cross-validation exercises can be extremely time-consuming and computationally burdensome. So we tried to find a more elegant solution, and we use recent advances in Bayesian statistics and econometrics to determine the structure of our network.
For example, for the number of neurons we apply shrinkage, and the activation function is drawn within our MCMC algorithm, so we do not really have to choose one activation function; we have a set of them and sample among them in the MCMC algorithm. We can also introduce stochastic volatility in our model, and we know, and we also heard today, that this is really useful when it comes to forecasting, especially in turbulent times. Empirically, we show that our approach works well in simulations; we also apply it to a set of prominent macro and finance applications, conduct a forecasting exercise, and then explore the degree of non-linearity in our data sets a bit more deeply.

Okay, so here is our model in a little more detail. It is a general non-linear regression. We have a linear part, which is the x'γ term, and a non-linear part, which is the function f(x); this function f is of unknown form. Then we have the error term, which is assumed to be normally distributed with zero mean and time-varying variance. The main question we ask is how to model, or how to specify, f, and as I already said, we chose to use neural networks. What you see here is a shallow neural network: we use only one hidden layer. In the paper we also extend our approach to a deep version, and I will get to this later, but let's focus on the single-hidden-layer case.

The ingredients of this neural network are a vector of loadings, β; the non-linear function h; a matrix of non-linear coefficients, κ; and a vector of bias terms, or constants, ζ. In the machine learning literature you would call κ and β the weights and ζ the bias terms, as written here, and h is the activation function.

To see how this neural network is estimated, in a simplified way, I added this graph. You start with a large, N-dimensional set of covariates, the x; that is your input. Then you weight your x, meaning you multiply it by the coefficients, add the bias term, and apply the non-linear function h, and you end up at the first hidden layer. The outcomes of this first hidden layer are called the neurons; those are the blue dots. If you were working with deep neural networks, you would just add as many hidden layers as you want, although of course you have some efficiency losses if you add a lot of hidden layers. In this one-hidden-layer case, those are the neurons; you then weight them again and you end up at the target y.
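To make that forward pass concrete, here is a minimal sketch of the model's conditional mean in Python, under my own notation (the paper's exact parameterization may differ in details): the linear part x'γ plus the shallow network Σ β h(κx + ζ).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conditional_mean(x, gamma, beta, kappa, zeta, h=relu):
    """Linear part x'gamma plus the shallow neural network f(x).

    x     : (n_covariates,) input vector
    gamma : (n_covariates,) linear coefficients
    beta  : (n_neurons,)    output-layer loadings ("weights")
    kappa : (n_neurons, n_covariates) hidden-layer coefficients ("weights")
    zeta  : (n_neurons,)    bias terms
    h     : activation function, applied elementwise
    """
    neurons = h(kappa @ x + zeta)          # the hidden layer: the "blue dots"
    return x @ gamma + beta @ neurons      # linear part + nonlinear part

# Toy usage: random draws standing in for one posterior draw of the parameters
rng = np.random.default_rng(0)
x = rng.normal(size=5)
print(conditional_mean(x,
                       gamma=rng.normal(size=5),
                       beta=rng.normal(size=8),
                       kappa=rng.normal(size=(8, 5)),
                       zeta=rng.normal(size=8)))
```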
To get to a Bayesian neural network, we now assume that all our coefficients have distributions: you add a prior on κ, you add a prior on β, and then you use Bayesian statistics to estimate your model. The advantage is that you can now also use all the techniques we know from Bayesian econometrics, like shrinkage. Bayesian neural networks per se are nothing new, I would say; they are quite popular in the deep learning literature, because they also help to safeguard against overfitting. What is new, to the best of our knowledge, is our way of doing the model selection.

As I already mentioned, we use a shrinkage prior to determine the number of neurons. We start with a large set of covariates and also with a large set of neurons, and then we shrink them. We use the multiplicative gamma process prior of Bhattacharya and Dunson here, which is also popular in the factor model literature: what it does is that with an increasing number of factors you get an increasing amount of shrinkage, and the same holds in our model, where with an increasing number of neurons you get more shrinkage. We also shrink the weighting coefficients, the non-linear coefficients κ, and our linear coefficients, the γ; here we use a horseshoe prior.

Then we choose between the activation functions, also by putting a prior on them. We introduce a random variable that we call δ, and then we sample the activation function in our MCMC loop. We chose a set of four activation functions, the most popular ones in the deep learning literature: sigmoid, ReLU, tanh, and leaky ReLU, and we use all four of them in our algorithm. Here I added the equations and a plot so that you can really see where this non-linearity comes from. Because these are really important for introducing non-linearities into our model, I would like to spend a few more minutes on them.

In the paper we introduce a very simple example with inflation and money growth to illustrate their functional forms. We model the relationship between year-on-year inflation and the lagged year-on-year money growth rate, and what we would expect from economic theory is that with a higher rate of money growth you get a higher rate of inflation. We see this when the relationship is modelled linearly: you get a positive, linear relationship in the second plot. When we use a non-linear activation function, the mean estimate also becomes non-linear. You see this in the lower panels: with ReLU, for example, we get a roughly constant relationship between inflation and money growth when money growth is small, but then a rather strong positive relationship once money growth exceeds, say, 5%. And then I would like to draw your attention to the first plot, which we call the convex combination. That is our model, where we draw the activation function within the MCMC loop, and you really see that you get a mixture of the different activation functions: there is a piecewise-linear part in there, but with a steeper slope for values between 5 and 10 percent, for example, and then it gets flatter again. So we really get a mixture of those activation functions.
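For illustration, here is a small sketch (my own code, not the paper's) of the four activation functions and of how a weighted mixture of them produces the "convex combination" shape just described. In the paper the mixing arises from sampling the indicator δ across MCMC draws; here, fixed weights stand in for those posterior frequencies.

```python
import numpy as np

def sigmoid(z):                 return 1.0 / (1.0 + np.exp(-z))
def relu(z):                    return np.maximum(z, 0.0)
def leaky_relu(z, alpha=0.01):  return np.where(z > 0, z, alpha * z)

ACTIVATIONS = [sigmoid, relu, np.tanh, leaky_relu]

def convex_combination(z, weights):
    """Weighted mixture of the four activations; weights sum to one."""
    return sum(w * h(z) for w, h in zip(weights, ACTIVATIONS))

# The mixture is piecewise-linear in places, steeper in others, and flattens
# out for large inputs, much like the shape described for the first plot.
z = np.linspace(-3, 3, 7)
print(convex_combination(z, weights=[0.4, 0.3, 0.2, 0.1]))
```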
Okay, before I move on to our results, I would still like to guide you through our sampler. What we do is: we draw the γs and βs, our coefficients, jointly from a normal distribution with posterior moments taking well-known forms. Then we draw the hyperparameters of the MGP prior by simple Gibbs updating steps. Then we draw the κ, the non-linear coefficients, and we use a Hamiltonian Monte Carlo step here; this is state of the art in the deep learning literature. Then we simulate the activation function, and we do this from a multinomial distribution with the random variable δ that I showed you before. The nice thing here is that our approach is quite flexible, so we have two options: the Bayesian neural network where the drawn activation function is common to all neurons, and a more flexible approach where we even use a different activation function for each neuron. The last step is that we use stochastic volatility in the error term.
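To fix ideas, here is a heavily simplified, runnable skeleton of those five steps. It takes liberties, all of them my assumptions rather than the paper's choices: fixed prior variances in place of the MGP and horseshoe hyperparameter updates, a random-walk Metropolis stand-in for the HMC step on κ, and a constant error variance in place of the stochastic-volatility step. It only shows where each draw slots into the loop.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(z, 0.0)
ACTIVATIONS = [sigmoid, relu, np.tanh, lambda z: np.where(z > 0, z, 0.01 * z)]

rng = np.random.default_rng(1)
T, K, Q = 200, 3, 4                       # observations, covariates, neurons
X = rng.normal(size=(T, K))
y = X @ np.array([0.5, -0.3, 0.0]) + relu(X[:, 0]) + 0.1 * rng.normal(size=T)

kappa = rng.normal(size=(Q, K)); zeta = np.zeros(Q)
delta = 0; sigma2 = 0.1; V0 = 10.0        # fixed prior variance (no shrinkage)

for it in range(500):
    h = ACTIVATIONS[delta]
    Z = np.hstack([X, h(X @ kappa.T + zeta)])        # [linear part | neurons]
    # Step 1: (gamma, beta) jointly from a conjugate normal posterior
    Vpost = np.linalg.inv(Z.T @ Z / sigma2 + np.eye(K + Q) / V0)
    mpost = Vpost @ (Z.T @ y / sigma2)
    coef = rng.multivariate_normal(mpost, Vpost)
    # Step 2 (omitted): Gibbs updates of the MGP / horseshoe hyperparameters
    # Step 3: update kappa -- HMC in the paper, random-walk Metropolis here
    def loglik(kap, hfun):
        resid = y - np.hstack([X, hfun(X @ kap.T + zeta)]) @ coef
        return -0.5 * resid @ resid / sigma2
    prop = kappa + 0.05 * rng.normal(size=kappa.shape)
    if np.log(rng.uniform()) < loglik(prop, h) - loglik(kappa, h):
        kappa = prop
    # Step 4: draw the activation indicator delta from a multinomial whose
    # probabilities are proportional to each activation's likelihood
    ll = np.array([loglik(kappa, hf) for hf in ACTIVATIONS])
    p = np.exp(ll - ll.max()); p /= p.sum()
    delta = rng.choice(len(ACTIVATIONS), p=p)
    # Step 5 (omitted): stochastic-volatility update of sigma2

print("last activation drawn:", delta, "| first coefficients:", coef[:3].round(2))
```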
Okay, just a short slide on our simulation study, due to time constraints. We illustrate our approach for different data-generating processes: we have a linear DGP and a non-linear DGP, a large and a small one, a dense and a sparse model setup, and we also have DGPs with constant and with time-varying volatility. We show that our approach works well for all the different DGPs. We gain especially for the non-linear DGP, which is what we would expect, but we also handle the linear DGP without notable overfitting issues.

Okay, some more details on our empirical applications. We actually have four applications: three of them are macro applications and one is a finance application. The first one is taken from the FRED-MD database, so it is a time series application for US data, and we split it into three parts because we have three target variables: we forecast inflation, industrial production, and employment. The second one is a cross-section: here we estimate the average economic growth rate of different countries, so we have 60 country-specific variables for 90 countries; we split the data into a 50% training set and a 50% holdout, and we repeat this exercise over 100 random samples. The next one is also a macro data set: here we forecast the US-UK exchange rate, using quarterly data. And the last one is the finance application, where we use annual data to forecast the equity premium.

We use the two versions of our Bayesian neural network that I already mentioned: a Bayesian neural network with a common activation function for all neurons, and a second one with a neuron-specific activation function, as we call it. Our benchmark is always a Bayesian linear regression with a shrinkage prior, the horseshoe, and stochastic volatility. We also estimate a Bayesian neural network by backpropagation, which is the state-of-the-art, or at least a quite popular, approach to estimating Bayesian neural networks in the deep learning literature. And the last one is Bayesian additive regression trees (BART), because we also wanted to see whether controlling for non-linearities in a different way would help more in forecasting those variables. Our evaluation is based on the root mean squared error for point forecasting performance and on log predictive likelihoods for density forecasts.

Before I go into more detail, I would like to summarize what we get. We see that our Bayesian neural network offers substantial improvements in density forecasting, especially in turbulent times, and for point forecasting we are highly competitive with all our competitors. So the focus is really on recessionary periods, because that is where we gain the most. For macro A, the FRED-MD database with inflation, industrial production, and employment, we gain the most during the global financial crisis and during COVID, and for macro C and the finance application we also see the largest gains during the global financial crisis. For macro B, the cross-section where we estimate the average economic growth rate of countries, we also get good forecasting performance, and here we additionally construct an illustrative example where we only use outlier observations in the holdout: we choose all the really high and low growth rates for the holdout and then try to forecast those values, and we see that the BNN offers great improvements when we do that.

We also see that this good performance in terms of density forecasts is often linked to a good in-sample fit, so we conclude that we see a form of benign overfitting, which is also documented in the literature: for neural networks it often holds that they fit the data really well but are still good at out-of-sample predictions, and we see that here too. The last point is on the deep BNN. It yields comparable results, but we are not gaining a lot, so we conclude that it is often enough to include rather simple forms of non-linearity; and especially in terms of efficiency it is of course useful if a shallow neural network suffices.

In detail, you can see our density forecasting performance here. I plotted the relative log predictive likelihoods, always relative to our benchmark, the linear regression with shrinkage and stochastic volatility. Basically you see what I already told you: we gain the most during the global financial crisis and the COVID-19 pandemic. This holds especially for industrial production, for example, but also for the exchange rate example. And here, for the cross-section, I always plotted the BNN relative to the linear model, so you get those bars for all the random samples, and you see that on average we outperform the linear model; especially for the outlier example that we constructed, the last bar, we get large gains.

Then we wanted to dig a little deeper into the form of the non-linearities we get in these different data sets. What you see here are the different activation functions: you always see the weight each activation function gets in the different holdouts. Especially for macro A, the FRED-MD database, you get a lot of weight on sigmoid; you get a mixture for employment, but for industrial production, for example, you really get a lot of weight on sigmoid. It sometimes changes in crisis periods, the periods with extreme observations, like industrial production during the COVID pandemic. For the other data sets we get more of a mixture, so you also have some weight on tanh and on the other activation functions.
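As a reading aid for these plots: the "weight" of each activation function can be understood as the posterior frequency of the sampled indicator δ over the kept MCMC draws. A minimal sketch, with hypothetical draws standing in for actual sampler output:

```python
import numpy as np

labels = ["sigmoid", "ReLU", "tanh", "leaky ReLU"]
# Hypothetical retained draws of the activation indicator delta
delta_draws = np.random.default_rng(2).choice(4, size=2000,
                                              p=[0.6, 0.2, 0.15, 0.05])
weights = np.bincount(delta_draws, minlength=4) / delta_draws.size
for lab, w in zip(labels, weights):
    print(f"{lab:>10s}: {w:.2f}")
```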
Then we also had a look at the effective number of neurons, and you see that, especially during tranquil times, you often do not need a lot of neurons: it is often enough to include a small number of neurons and you get pretty good forecasting performance. I think this makes sense, because we would expect the linear model to work well in tranquil times. When it comes to crisis times, we see, especially for inflation, for example, that the number of neurons increases a bit, and this helps us in terms of forecasting.

The last slide I want to show you is on the relationship between in-sample fit and out-of-sample predictability. What you see in these plots is the R² relative to the linear model and the LPL relative to the linear model, and what we ask here is: is there more information in the relationship between x and y than a linear model can extract? We would argue there is if you are in the upper-right corner, because there you get a relative R² above one, meaning the BNN yields a higher in-sample fit than the linear benchmark, and a relative LPL above zero, meaning better out-of-sample density forecasting performance. Again, we see this pattern in recessionary periods: for US industrial production or employment, the COVID periods are in this upper-right corner, and for the UK-US exchange rate you get the global financial crisis there. So, again, we conclude that we see this kind of benign form of overfitting.
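For clarity, the two coordinates of this scatter can be reconstructed roughly as follows; this is my own notation, with random numbers standing in for model output, not the paper's results:

```python
import numpy as np
from scipy.stats import norm

def r_squared(y, fitted):
    """In-sample coefficient of determination."""
    return 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

def log_pred_lik(y_oos, mu, sd):
    """Sum of Gaussian log predictive scores over the holdout."""
    return norm.logpdf(y_oos, loc=mu, scale=sd).sum()

# Synthetic stand-ins: a tighter "BNN" fit vs. a looser "linear" fit
rng = np.random.default_rng(5)
y_in, y_oos = rng.normal(size=100), rng.normal(size=20)
rel_r2 = (r_squared(y_in, y_in + 0.5 * rng.normal(size=100))
          / r_squared(y_in, y_in + 0.7 * rng.normal(size=100)))
rel_lpl = log_pred_lik(y_oos, 0.0, 1.0) - log_pred_lik(y_oos, 0.0, 1.5)
# rel_r2 > 1 and rel_lpl > 0 puts a point in the upper-right corner
print(round(rel_r2, 2), round(rel_lpl, 2))
```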
Okay, to conclude: we developed a non-parametric regression model based on Bayesian neural networks. Our approach is quite flexible and allows us to remain agnostic about the form of the network structure; we really determine it within our MCMC loop, and we use popular techniques from the Bayesian literature to do that. And we show in a broad set of macro and finance applications that we get superior forecasting performance with our approach. So thanks a lot, and I'm looking forward to the discussion.

So, the discussant is Carlos Montes-Galdón.

Okay, so thank you very much. I think that was a great presentation, and I have to say that I had a lot of fun reading the paper, and I'm doing the discussion even if, first disclaimer, I'm not a big expert on neural networks or Bayesian neural networks, so I had a lot of help from ChatGPT. But still, I think I can say something about the paper. Let me start by saying what this paper is doing. What they are doing is building an algorithm to estimate this type of generic model, which can have a lot of explanatory variables with non-linearities, and they also introduce stochastic volatility. To estimate the model, they build this algorithm, which is based on a Bayesian neural network. Then they take their algorithm, apply it to different time series models, and what Karin has shown is that the algorithm performs very well, especially when you look at out-of-sample forecasting.

I have to say, overall, as I said, I really like this paper. I think the algorithm is very well developed; I will not say much about the algorithm itself, because I went over it and it is very well done. I think the applications are very interesting, and so are their results. But, and there is always a "but", otherwise I wouldn't be here, I still have some doubts about the performance shown in the paper compared to a standard Bayesian neural network, and I think this is important, because the question that these papers have to answer is: should I switch from standard Bayesian neural network algorithms to the algorithm the authors are proposing here? And I would say I'm still not fully convinced. Again, I think the algorithm is very good, but I would claim that it is possibly not flexible enough to handle different types of models, and that the comparison they make in terms of out-of-sample performance is not entirely fair. I will try to go over this in the discussion.

For this I'm going to be fast, but for the flow I think I have to give a very short crash course on neural networks. Let me consider a very simple neural network; usually we depict it with this very nice chart. So let's say we have three x variables, which are the explanatory variables, and the neural network has one hidden layer and one output layer. What is the hidden layer? In the hidden layer we have three activation functions, and each activation function depends on a kind of regression that is then activated with a non-linear function; the same happens at the end for the output layer. So you see we have all these ω parameters, and what are called the biases, which are constants inside the activation functions. And what the neural network usually does is just minimize a loss function, typically the mean squared error, you know, the data minus what you get from the output layer, squared; you run a backpropagation algorithm to minimize the MSE and find the estimates for the biases and the weights. That's it. If I don't have any hidden layer, and the function in my output layer is just linear, we are back to a linear regression, nothing more than that.

Now, what happens if I want to do a Bayesian neural network? Because if I just minimize the mean squared error, I'm only focusing on getting a point estimate, but I want some uncertainty around my estimates. And, sorry, okay, so I want uncertainty around my estimates, and also, when we are dealing with time series data, where we don't have large data sets, we want to avoid some type of overfitting. For this I would rather take a Bayesian approach to neural networks. In a standard Bayesian neural network, which I can just run in Google Colab with TensorFlow, what I usually do is: I select a prior over the weights and the biases that I showed you before; then I pick an approximate posterior distribution, which is called a variational posterior and depends on some parameters λ; and then I minimize the Kullback-Leibler divergence between this variational posterior and the true posterior of the model. The thing is that the true posterior is usually not tractable, and therefore we cannot solve for it directly.
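A toy version of that recipe, as a minimal sketch of my own: a deliberately conjugate model, a Gaussian variational posterior q_λ = N(m, s²), and a grid search over λ = (m, s) that maximizes a Monte Carlo estimate of the ELBO, which is equivalent to minimizing the KL divergence to the true posterior. Because the model is conjugate, the exact posterior is available as a check.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, tau = 1.0, 2.0                          # known noise sd, prior sd
y = rng.normal(loc=1.5, scale=sigma, size=50)  # data: y_i ~ N(theta, sigma^2)
eps = rng.normal(size=2000)                    # common random numbers

def elbo(m, s):
    theta = m + s * eps                        # reparameterized draws from q
    loglik = -0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1) / sigma**2
    logprior = -0.5 * theta**2 / tau**2
    logq = -0.5 * eps**2 - np.log(s)
    return np.mean(loglik + logprior - logq)   # additive constants dropped

grid = [(m, s) for m in np.linspace(0.5, 2.5, 41)
               for s in np.linspace(0.05, 0.5, 46)]
m_vi, s_vi = max(grid, key=lambda p: elbo(*p))

# Exact conjugate posterior for comparison: VI should land very close to it
v = 1.0 / (len(y) / sigma**2 + 1.0 / tau**2)
print(f"VI: m={m_vi:.3f}, s={s_vi:.3f} | "
      f"exact: m={v * y.sum() / sigma**2:.3f}, s={v**0.5:.3f}")
```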
Think of it this way: I take a linear regression, and I say, okay, for the betas, the coefficients of the model, I'm going to approximate the posterior distribution by a normal distribution, and I'm going to optimize the parameters of that normal distribution against the true posterior of the model. That's what I am going to do.

Now, in this paper they go a more traditional route. What they do is, again, select a prior over the weights and many other things, and then they try to find the exact posterior distribution; since it is not tractable, they develop this MCMC sampler to find the posterior distribution. It's a complicated algorithm, it requires a lot of steps, and that is why I ask: should I switch to this algorithm, or should I stick with my variational Bayes?

Let me show an example. Okay, so I'm going to simulate highly non-linear data, and I'm going to show you that a standard Bayesian neural network works okay; then I will tell you why I think they are getting this bad performance in the paper. So I simulate highly non-linear data, where the data-generating process has a non-linear mean but also some type of stochastic volatility. The thing is that the volatility here is not exogenous but endogenous: it also depends on the x variables. I posit my variational posterior, and I say that my neural network is going to have two outputs: the first output is the mean of the model, and the other output is the standard deviation of the model. For that I have to minimize, as I mentioned before, the KL divergence, and the KL divergence depends on a cost function; in my case, and this is important, I will come back to it later, the cost function is the log-likelihood, log p(y). I just assume it is a normal distribution with mean given by the first output of the neural network and standard deviation given by the second. And again, this is important: when I simulate the posterior densities, I use both what is called the epistemic uncertainty, the uncertainty that comes from the posterior distribution of the weights, and the aleatoric uncertainty that comes from the residuals, because this is what is comparable to the model they have in the paper.

When I do that, I construct this architecture for the neural network: two hidden layers, three inputs, because I have three inputs in the model, and the two outputs. And you can see that it works very well: the neural network learns the impact of the three variables x1, x2, x3 very well. Here, the red band is kind of the posterior distribution of the mean, and the red lines are what you get when I also account for the aleatoric uncertainty in the model, and you see that the Bayesian neural network works very, very well here. And this takes about five minutes to estimate with the 1,000 data points that I feed it.

Now, the thing is that, as I mentioned before, in the paper they say that the standard BNN performs much worse than the proposed algorithm, and they say that the dismal performance of the standard backpropagation network is driven by too narrow predictive bounds.
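Here is a re-sketch of that two-output setup, under my own assumptions: it is in PyTorch rather than the discussant's TensorFlow/Colab code, and it leaves out the variational weight-uncertainty (epistemic) part for brevity. The point it illustrates is the Gaussian negative log-likelihood loss, which lets the network pick up the x-dependent, aleatoric variance that an MSE loss would ignore.

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Two hidden layers, one head for the mean, one for the std. deviation."""
    def __init__(self, n_in=3, width=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_in, width), nn.ReLU(),
                                  nn.Linear(width, width), nn.ReLU())
        self.mean_head = nn.Linear(width, 1)
        self.sd_head = nn.Linear(width, 1)

    def forward(self, x):
        z = self.body(x)
        mu = self.mean_head(z)
        sd = nn.functional.softplus(self.sd_head(z)) + 1e-6  # keep sd > 0
        return mu, sd

torch.manual_seed(0)
X = torch.randn(1000, 3)
# Synthetic DGP with endogenous volatility: the variance grows with |x1|
y = torch.sin(X[:, :1]) + (0.1 + 0.5 * X[:, :1].abs()) * torch.randn(1000, 1)

net = TwoHeadNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
nll = nn.GaussianNLLLoss()                    # Gaussian NLL, not MSE
for step in range(2000):
    mu, sd = net(X)
    loss = nll(mu, y, sd**2)                  # third argument is the variance
    opt.zero_grad(); loss.backward(); opt.step()
print("final NLL:", float(loss))
```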
But in the appendix of the paper they also show that when they estimate this standard Bayesian neural network, the cost function they use is the mean squared error. Which means, I think, that you are then only focusing on the uncertainty around the mean, and you don't have this aleatoric uncertainty, while in the model they estimate they do have those residuals. So when you construct the density forecasts, of course their model's density forecast will have wider bands. And in fact, if I check one of the tables, just look at the root mean squared error, which focuses only on the point forecast, on the mean: you see there is not so much difference between the backpropagation standard neural network and the algorithm they propose in the paper. And in fact, if I estimate the same model on the data I simulated, but change the cost function from the likelihood with the two neural network outputs, mean and standard deviation, to a cost function which is the MSE, you see that now I can only construct, I think, again, as I said, I'm not really an expert, but I can only construct uncertainty around the mean, the red ones, and I am missing a lot of what is happening with the variance of the data. One of the things you see in the data I generate is that the variance depends on x, and for extreme values of x the variance increases, but this is not captured if I just use a mean squared error loss as my cost function.

Also, one of the big points in the paper, and I think it's very nice, is that you say: okay, I don't want to commit to just one type of activation function; I can have different activation functions and combine them. But this is something I can also do in a standard Bayesian neural network framework. I can have, as they do in the paper, just one layer; so I change the architecture to a single layer, but in that same layer I include different activation functions. Again, it's not easy, but it's a matter of tweaking around in TensorFlow with standard tools. And then I say: I'm just going to use some shrinkage in my priors for the weights and the biases, and for that I use a Laplace distribution. I have to say, I tried to impose the horseshoe prior and still couldn't get it to work, but the Laplace distribution works quite well. And you see, again, that it works very, very well once I account for both types of uncertainty.
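One way to build such a mixed-activation single layer in plain Keras is sketched below; this is my own construction under stated assumptions, not the discussant's exact code. Parallel Dense blocks with different activations are concatenated into one hidden layer, and the l1 kernel regularizer stands in as the MAP counterpart of the Laplace prior on the weights that he mentions.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(3,))
reg = tf.keras.regularizers.l1(1e-3)          # Laplace-prior MAP analogue
acts = ["sigmoid", "relu", "tanh", tf.nn.leaky_relu]
# Four parallel blocks over the same inputs, each with its own activation
parts = [tf.keras.layers.Dense(8, activation=a, kernel_regularizer=reg)(inputs)
         for a in acts]
hidden = tf.keras.layers.concatenate(parts)   # the mixed-activation "layer"
outputs = tf.keras.layers.Dense(1)(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()
```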
So, again, I think it's an excellent paper, I think it's very good. But if I ask myself: do I want to switch from a standard Bayesian neural network approach to this algorithm, which is more complicated, and so on, do I want to do it? I think it's going to depend on the application, and I would still like to see, possibly in the paper, a better comparison that tells me: "So, Carlos, you see, our algorithm also outperforms the standard Bayesian neural network when we do a fair comparison across models."

Another small comment is that I would like to see a bit more about which variables are actually driving the non-linearities in the models. And also, in the simulation section they have, I am still struggling a little to understand why, if the data-generating process is non-linear, the linear model with stochastic volatility performs as well as their algorithm. But these are more minor remarks. I think the important part is to try to understand whether this is really much better than a standard Bayesian neural network using backpropagation and variational Bayes.

So let's collect some more questions and then go back to Karin.

Hi, Karin. I have a couple of questions, or rather requests for clarification, because, as Carlos was stating, I'm not an expert on neural networks either. First, is the Bayesian convolutional neural network a special case of what you are proposing, or is it a different kind of network architecture? Because it seems that this sort of Bayesian convolutional neural network does better relative to the BNN estimated by backpropagation; but, again, I'm not an expert, so that's up to you. And secondly, you reported log predictive likelihoods and root mean squared errors. Have you looked at quantile scores, as Massimiliano was presenting before, or the quantile-weighted CRPS, in order to see what is happening in the tails? Since, as Massimiliano was presenting, BART seems to do well also in the tails, I was wondering whether your model also predicts well with respect to the tails, like BART. Thank you.

So I had a question, because I missed it a bit: how do you do multi-horizon forecasts here, direct or iterative? And, related to that, I was wondering how you deal with stochastic volatility when you do multi-step forecasts. In particular, when I do iterative forecasts I always ask myself whether I should somehow draw the stochastic volatility forward or keep it constant. I'm asking because, as far as I understand your points on the comparison, uncertainty is a key ingredient here in comparing the various methods, so I was wondering whether you did some sensitivity analysis on this element.
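For reference on that question: the quantile score is the pinball loss, and averaging it over a grid of quantile levels (times two) approximates the CRPS, which is what makes it informative about tail behaviour. A minimal sketch in my own notation, with a hypothetical Gaussian predictive distribution:

```python
import numpy as np
from scipy.stats import norm

def quantile_score(y, q, tau):
    """Pinball loss of the tau-quantile forecast q for realized outcome y."""
    return (tau - float(y < q)) * (y - q)

taus = np.linspace(0.05, 0.95, 19)
q_preds = norm.ppf(taus, loc=1.0, scale=0.5)   # hypothetical forecast quantiles
y_real = 1.2
crps_approx = 2.0 * np.mean([quantile_score(y_real, q, t)
                             for q, t in zip(q_preds, taus)])
print(f"approx. CRPS: {crps_approx:.3f}")
```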
Okay, so maybe I give back the floor to Karin.

So, thanks a lot for all the comments and the questions. I will start with the comments, and I think, yes, it's true, and thanks for the suggestion: when we go into the revision process it would probably be useful to extend especially this backpropagation approach towards a fairer comparison, because it's true that we are minimizing the root mean squared error there, and we see that it is competitive. But I think what is nice is that even when we minimize the root mean squared error for point forecasting, our approach is competitive and sometimes even better. So even though the Bayesian neural network by backpropagation really focuses on this part, it is not gaining a lot compared to our approach. But that's a good and fair point as well.

I think what is also nice about our approach is that we can dig a little deeper. For Bayesian neural networks, and neural networks in the deep learning literature in general, it often holds that people are only interested in forecasting: you're only trying to get the best forecast. And sometimes, at least I, ask myself: what is happening in there? What are the non-linearities, what are the neurons, and what is driving this good forecast? With our approach we can at least shed some light on how many neurons, or which form of non-linearity, can be useful, and whether it differs across data sets.

Then, on the question about convolutional neural networks: I think you're right, it goes in the direction of a convolutional neural network, so probably we should also look more in this direction, because we only compared to BART and to the standard version of the BNN, so to speak. We also thought about doing an LSTM, which focuses even more on the time series structure, but since we are also using a cross-section, we tried to find a version of a BNN that works for different data sets, not only time series or only cross-sections. But thank you.

The second question was about the quantile score. We also computed the quantile score, and we see that we gain in the tails too, so it's actually pretty similar to BART.

And then thanks for the question about the horizon, because I think I didn't mention it, good point. We use one step ahead here; we also computed one quarter ahead and one year ahead for the US-UK exchange rate, and we see gains there as well. It is a direct forecasting approach, and for the SV we keep it constant. But that's also a good point: I think we should at least compare how different it is when we iterate forward, also regarding the uncertainty. So thank you, and I hope I didn't forget anything.

Any other last-minute questions? So maybe I actually have one. I might have missed some result, but I had a question on the independent role of stochastic volatility in this framework. Do you really need it? Did you ever check whether it adds some additional gain to this already complicated structure that you have? Because, you know, Massimiliano before was talking about the trade-offs of adding a lot of complication in the parameters and in the stochastic volatility, identifiability problems, so I don't know whether you have some feedback on this.

Yes, I would say we really need it. I mean, I think we did not compute our Bayesian neural network without SV, but at least what we see is that when we compare to the linear version with SV, we really need the SV to be competitive. And also with the Bayesian neural network by backpropagation you see that when we don't use the SV, or at least some kind of volatility part, we are not doing great, especially during crisis times.