Hello everybody, it's really my pleasure to give part of this lecture. The idea is that we will present some research related to value of information for spatially distributed systems. I will give essentially the first half hour of this lecture, which is basically a recap of some general background on spatially distributed systems and a bit of value of information, and then Carl will go on to some deeper topics related to the research that we have done. So I will start with a general introduction on Bayesian inference, the probabilistic way of processing data, and then I will introduce this idea of Gaussian processes, Gaussian random fields, for modeling distributed systems. Then Carl will go on talking specifically about value of information in these fields, how to compute it in an effective way for the purpose, for example, of sensor placement, and then we will end up talking about how to extend this formulation to spatio-temporal processes, in which the domain is not just spatial but also includes time. Do you hear me well? Yes. Great. By the way, if there is any question, feel free to interact with me, because we can interact also during my half hour or after it; I think I can see you, so just raise your hand if there is any question about what I'm talking about.

So by spatially distributed system we mean any kind of physical quantity that changes in space: you have a spatial domain, maybe a region, and then you have a physical quantity that takes different values at different points in this region. One application that we have studied is seismic demand: some parameter like the PGA, or other parameters defining the seismic intensity, varies from place to place even during the same earthquake, so you can model that using these approaches. But also temperature, which Carl is also working on, and for example corrosion: if you think that the damage due to corrosion is something that involves a large system, again you can model it using this approach. The permeability of soil is another classical topic in which you can think that this specific parameter is uncertain, and you can model its value in different places using these approaches. And then quite recently there has been a lot of general work, for example in the machine learning and computer science community, on developing these models for the general purpose of making non-linear, non-parametric inference on functions. You can also relate much of what I'm describing, Gaussian process modeling, to the literature in structural dynamics, where for modeling processes in time you use the same idea of a Gaussian process to account for uncertainty and variability in time.

Let me start with the basic framework for probabilistic inference. The idea is that you have at least a couple of random variables, call them F1 and F2, and suppose for example that they model the temperature in two different locations. Suppose that you have a joint model for these two random variables; that means that you have a function, called the joint distribution, that is always non-negative, is defined on the joint domain of these two random variables, and is normalized to one, and it defines essentially what is the probability, in terms of density, that the temperatures in the two rooms take specific values f1 and f2. And then you can compute, mathematically, by marginalization.
You can compute what are called the marginal distributions, and they define essentially your marginal belief about any one of these random variables. For example p of f1 defines your marginal belief about the variable F1, the temperature in the first room, and it is related to the joint model just by integration. And then you have the conditional probability, which works in the following way. Suppose that you observe a specific value of F2, maybe you observe the temperature in room 2. The question is: what is your updated belief about the temperature in room 1? To do that, essentially you take a section of this joint distribution, and this is what this blue curve does. When you renormalize it, this is just the formula of the conditional distribution, and that gives you your updated belief. And if you want you can do the same if you observe F1: you will have an updated belief about F2.

So this movie is just to show you essentially what happens. When you observe a different value of F1 you will see that you always get a different updated belief about F2, which means that the two random variables are dependent and there is something to learn from one random variable about the other. So the general framework for inference is pretty simple. You start with a joint model; then suppose you observe F2 at a specific value. Now F2, like the temperature in the second room, is not a random variable anymore, and you observe that its value is 0.4, as in the picture reported here. Now the question is what is the corresponding probability on F1, and you have a formula for doing that: you take a section, you renormalize it, and you have this updated belief. This is Bayesian inference in a nutshell.

There is also the very special case in which the two random variables are independent; in this case the joint distribution is just given by the product of the two marginals. Now if you try to perform inference in this case, essentially nothing happens: you have your prior belief, suppose on F1, that is the marginal belief, and when you observe a specific value of F2, for example F2 equal to 0.2, the posterior is the same. Great, can I go on from this point? Yeah. Can you hear me? Yes. Okay, great. So I was telling you, when the two random variables are independent there is nothing to learn from one variable about the other. As you can see from this movie, no matter what value of F1 you observe, your posterior belief about F2 will always be the same, because no matter where you cut this joint distribution, the shape of the conditional distribution will always be the same. So after normalization you always get the same result, so you have learned nothing; this is just a special case. But more generally, when the random variables are dependent, you learn something about one by observing the other.

So the multivariate normal distribution is just one specific case of a joint model. It's probably the most famous model proposed for defining the joint distribution of some random variables. It's given by the formula at the top of the slide, which maybe looks kind of complicated at the beginning, but after a while it's not really so complicated: for example, if you take the log of this density, it's essentially just a quadratic form. This is essentially the extension to dimension higher than one of the classical univariate normal distribution.
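Before going further into the multivariate normal, here is a tiny numerical illustration of the generic take-a-section-and-renormalize recipe described above; the joint density below is an assumed toy example (a bivariate normal with correlation 0.8), discretized on a grid:

```python
import numpy as np

# Toy joint density on a grid for two variables f1, f2 (an assumed bivariate
# normal with correlation 0.8, just to have something to condition on).
f1 = np.linspace(-3, 3, 301)
f2 = np.linspace(-3, 3, 301)
F1, F2 = np.meshgrid(f1, f2, indexing="ij")
rho = 0.8
joint = np.exp(-(F1**2 - 2 * rho * F1 * F2 + F2**2) / (2 * (1 - rho**2)))
df1, df2 = f1[1] - f1[0], f2[1] - f2[0]
joint /= joint.sum() * df1 * df2          # normalize so it integrates to 1

# Marginal belief about f1: integrate the joint over f2.
marg_f1 = joint.sum(axis=1) * df2

# Conditioning: observe f2 = 0.4, take the nearest "section" of the joint
# and renormalize it so that it integrates to 1 again.
j = np.argmin(np.abs(f2 - 0.4))
cond_f1 = joint[:, j] / (joint[:, j].sum() * df1)

print("prior mean of f1:    ", np.sum(f1 * marg_f1) * df1)   # close to 0
print("posterior mean of f1:", np.sum(f1 * cond_f1) * df1)   # close to 0.8 * 0.4
```

The same recipe, take the slice of the joint density at the observed value and renormalize, works for any joint model; the Gaussian case discussed next just makes it available in closed form.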
So while the univariate normal distribution is defined just by a mean parameter and a standard deviation parameter, mu and sigma, the multivariate normal distribution is defined by a mean vector, a vector whose size is the number of random variables you are modeling, and a covariance matrix, a square matrix, n by n if you have n random variables, that defines the covariance between all pairs of random variables. For example, this is a plot of this joint density in the bivariate case, f1 and f2, for a specific choice of parameters; this is how it looks. So the joint distribution is completely defined by these two parameters, mean vector and covariance matrix, and it has a set of very nice properties. For example, the conditional distribution, no matter where you cut this joint distribution, is always normal; when you marginalize this joint density to focus on just one random variable, the corresponding marginal distribution is also normal; and moreover, any linear combination of your random variables is also normal. So you have these very nice properties that make everything very simple.

This is again an example with the bivariate distribution, and as you see the marginal distribution, like p of f1, is normal; it is the result of the marginalization process, which in general is very complicated, but is very easy in the normal model, and the result is a normal distribution. Also if you cut it, for example suppose that you observe that f2 is equal to 0.4, the posterior distribution of f1 stays normal, so it's just a matter of figuring out what the parameters of this updated distribution are. I can maybe show you this movie: if you suppose that you observe different values of f2, or of y in this notation, you will see that the corresponding posterior distribution of f1 stays normal. Moreover, you see that the variance of this dashed posterior distribution is always the same; only the mean changes, depending on the specific observation that you get. This is very nice, because it allows you to perform this conditioning and find the conditional distribution in a very simple way. I also want to show you what happens when the two variables are uncorrelated; in this case they are independent, and see what happens: no matter what value of y you observe, your posterior distribution of x always stays the same, equal to the prior one. So essentially you have learned nothing: the two random variables are independent, and again you have nothing to learn about one random variable from observing the other. But in the general case the random variables are dependent, and so you will learn something about one random variable from the other.

So how can you find out the parameters of this conditional distribution? The formulas are simple formulas of linear algebra and work like this. Suppose you have this joint model for f1 and f2, which need not be scalars, they can be vectors. You have a joint model for them, given by these mean vectors and blocks of the covariance matrix. Then there is a formula, just from linear algebra, that allows you to find the conditional distribution. So if you observe f2, you know that the posterior distribution of f1 will still be normal, and you can find out the corresponding mean vector and covariance matrix.
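Written out explicitly, the standard conditioning formulas for a jointly normal pair (this is what the slide's linear-algebra expression amounts to) are:

$$
\begin{pmatrix} f_1 \\ f_2 \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\;\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right)
\;\;\Rightarrow\;\;
f_1 \mid f_2 = y \;\sim\; \mathcal{N}\!\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y-\mu_2),\;\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)
$$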
By the way, it turns out, as you can see, that the posterior covariance matrix is not a function of the specific value of f2 that you observe. More generally, suppose that you have some random variables and you have a noisy measurement of them, maybe a linear observation: the measurements that the sensor provides are given by a linear combination of your original random variables plus a noise that is normal. If that is the case, for example, you can think that one measurement is just this kind of linear combination: suppose you have three random variables of interest, f1, f2 and f3, and you have a joint model for them, and your first measurement is just the sum of f1 and f3 plus some random noise, while the second measurement is given by one third of f2, plus an offset, plus some random noise. Then you can find exactly what the posterior distribution of f is, given any specific reading of your sensors, y1 and y2. It is sufficient to define the joint distribution of f and y, that is, of your random variables of interest and the sensor readings, and you can do this pretty easily because they are related by a linear combination. Then you can also get pretty easily the marginal distribution of your sensor readings, the marginal distribution of y, and, what is really of interest, the posterior distribution of f given y. That is again given by this formula, derived from what I showed you before, simply a formula from linear algebra in which you combine these matrices with the specific reading that you have, described by this vector y, and you get the mean vector and covariance matrix of the posterior, which is normal. So the message, in a nutshell, is that when you are dealing with this normal model the inference process is very simple, just computations from linear algebra. Let me show you some basic examples of that, to introduce this idea of a Gaussian process.
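As a minimal sketch of this linear-Gaussian updating, following the three-variable, two-sensor example just described (the prior values, the offset, and the noise levels below are assumed for illustration, they are not the numbers on the slide):

```python
import numpy as np

# Assumed illustrative prior for f = (f1, f2, f3): mean vector and a
# covariance with some positive correlation between the components.
mu_f = np.array([20.0, 21.0, 22.0])
Sigma_f = np.array([[4.0, 2.0, 1.0],
                    [2.0, 4.0, 2.0],
                    [1.0, 2.0, 4.0]])

# Linear observation model y = A f + b + e, with e ~ N(0, R):
# y1 = f1 + f3 + noise,  y2 = f2 / 3 + offset + noise (as in the example).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0 / 3.0, 0.0]])
b = np.array([0.0, 5.0])            # assumed offset on the second sensor
R = np.diag([1.0**2, 0.5**2])       # assumed sensor noise variances

# Joint Gaussian of (f, y): everything follows from linear combinations.
mu_y = A @ mu_f + b                 # marginal mean of the readings
Sigma_fy = Sigma_f @ A.T            # cross-covariance Cov(f, y)
Sigma_y = A @ Sigma_f @ A.T + R     # marginal covariance of the readings

# Condition on a specific pair of sensor readings (hypothetical values).
y_obs = np.array([43.5, 12.3])
K = Sigma_fy @ np.linalg.inv(Sigma_y)
mu_post = mu_f + K @ (y_obs - mu_y)
Sigma_post = Sigma_f - K @ Sigma_fy.T

print("posterior mean of f:", mu_post)
print("posterior std of f: ", np.sqrt(np.diag(Sigma_post)))
```

The same few lines cover any number of variables and sensors, as long as the observation model stays linear with Gaussian noise.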
Suppose that you have two rooms, like the room you are in and the room next to it, and there is a temperature: f1 is the temperature in your room and f2 is the temperature in the room next to it, and you define a joint model for these two that is jointly normal. The graph on the right reports the contour lines of this joint distribution; the two random variables are positively correlated, meaning that you tend to think that the two temperatures are pretty similar: if it's hot in your room, you tend to think that the temperature in the other room is also pretty high. The graph on the left reports samples of that. These samples are just points in the f1, f2 domain; here they are reported as lines that essentially start from the temperature in your room, f1, and end at the temperature in the other room, f2. For example, this movie shows you what happens if the two random variables are independent: in this case the two temperatures are really independent, so it can be cool in one room and hot in the other, and the slope of the line connecting the two samples is sometimes flat, meaning the temperatures are the same, sometimes going up, sometimes going down, because the second room can be cooler or hotter than the first one. Then you can have another model in which you assume the temperatures are more similar, and you see that in this case the sample lines are flatter, up to the case in which the temperatures are almost identical, where you have almost flat lines. So the point is that when you see the temperature in one room, depending on your model, you can update your belief about the temperature in the other room.

You can extend the model to three rooms: if you have three rooms, you have three random variables, so now I cannot show you a joint distribution with contour lines anymore, because we are in a three-dimensional space, but I can still show you samples. Again, a reasonable model is one where the temperatures in two rooms that are close to one another are highly correlated, so rho one-two is close to one, and so is rho two-three, but rooms one and three are farther apart and therefore less correlated. Again, these are examples of samples, and I can show you some movies: these are essentially independent samples, and you can go up to the case of a stronger correlation; when you have stronger correlation, the idea is that the temperatures in the three rooms are more and more similar, and you get samples like these. You can also easily extend this to a model with, say, ten rooms: suppose you have ten rooms in a row, and again you have a model that says the temperatures in rooms next to one another are similar, while temperatures in rooms that are far apart are less correlated. You can have something like this, the case in which the correlation decays pretty fast, or another example like this, in which, as you see, the temperature changes quite smoothly: if you compare what happens at the beginning and at the end of the row, the temperature is quite different, but if you look at two rooms very close to one another, the temperature is very similar. This is how you can define the general variation of something like temperature in a spatially distributed field.
You can go up to a continuous domain, in which you can think of having really hundreds, thousands, billions of rooms along one line. In the end you have a continuous domain in which you can point at any specific location, you have a specific temperature at that location, and you can model how all these temperatures are correlated with one another. That is when you really have a Gaussian process. In a Gaussian process the marginal distribution of the temperature at any point is Gaussian, and if you select any set of locations, two locations, three locations and so on, the joint model describing the temperature at all those locations is also jointly Gaussian.

This is what you get when you define a specific structure, and specifically the structure we define can follow this model: the squared exponential covariance function is a specific way of defining how correlated the temperature is, in this case, between two different locations. The model is pretty simple and tells you that if you pick two locations that are very, very close to one another, then, because the temperature is a continuous, smooth field, the correlation is almost one, meaning that by measuring the temperature here I essentially also know the temperature one millimeter away. But if I consider two locations that are pretty far from each other, then the correlation decays, meaning that these two quantities, the temperature here and the temperature there, are less correlated. Specifically this model, the squared exponential, defines a precise mathematical formula for how the correlation changes as a function of the distance. You see that here on the y axis I have the correlation, a number that in general is between minus one and one, but here is between zero and one, because in this model the correlation is always positive, never negative. It says that when delta x, the distance between the two points, is zero, the correlation is one, and if you move the points away from each other the correlation decays following this specific curve; this is why it is called the squared exponential. The rate of decay is related to what is called the correlation length, lambda. Specifically, if you are working on a region where delta x is measured, say, in kilometers, then lambda, the correlation length, is also measured in kilometers, and it tells you how far you have to go for the correlation to decay from one down to about 60 percent. For example, if you look at this blue curve here, which shows how the correlation decays when lambda equals one kilometer, you can check that when delta x equals one kilometer the correlation is about 60 percent. So the idea, to recap, is that for this specific model, if the distance between two points is very short the correlation is close to one, if the distance is very long the correlation goes to zero, and the correlation length lambda tells you how fast it decays, how far you have to go for the correlation to drop to about 60 percent. Given that, if you define many points in a region, or many points along one line, you can define the correlation among all these points. The idea is that if you select a lambda that is very short, like lambda equal to one kilometer, then points that are pretty close to one another, say within one kilometer, are highly correlated, while
points that are maybe ten kilometers apart are essentially independent, because the correlation goes to zero. If you increase the correlation length, if you use an alternative model with a larger correlation length, you are essentially assuming that there is more similarity between points that are far apart: for example, if the correlation length is three kilometers, you are assuming that points three kilometers apart are still highly correlated, still around 60 percent.

These are samples drawn from this continuous process for different values of lambda. When lambda equals, for example, one kilometer, you have a pretty smooth field, in which it takes a long distance for the field to really change. If you use a shorter lambda, you have less and less correlation, so the field is less and less smooth and changes more and more rapidly. I can show you some movies of the same thing: with lambda equal to three kilometers you have a very smooth field, but if you decrease lambda to half a kilometer, then if you just move one kilometer away you get a completely different value. So essentially when you define lambda you are defining how the field changes from one point to another, and that is really crucial for how the inference process works, because if lambda is very large, meaning the field is very smooth, you can learn a lot by measuring the value of the field at one specific location, because in a large surrounding area the field will be similar to your measurement. On the other hand, if lambda is very short, your measurement has a very local effect.

This is shown here. Suppose you have a temperature field and at the beginning you have no idea whether one part of the room or the other is hotter or cooler, so you have this simple prior model, this Gaussian prior everywhere, and you specify a specific correlation. Then, when you measure that at one specific location, in this case location 4, four kilometers along this line, the temperature is 75, by the inference formula of the Gaussian process you are able to update the value of the temperature not just at that location but also in the surrounding area. For example, according to your updated belief, you tend to think that the real temperature has to be close to the measured value of 75 at this location, but not just at this location: also in the surrounding area your uncertainty is very small, because of the correlation, because you think the field has to be smooth and consequently the temperature has to be close to 75 degrees in the surrounding area as well. Then, when you collect more and more information, you can update what you know about the field by accumulating that information. By the way, the white area here defines the 95 percent confidence bounds, so at any point you know what your uncertainty is: it is very small where you have a lot of measurements, and still quite high elsewhere. It is kind of intuitive, and it comes naturally from this Gaussian model.

You can also extend this from one dimension to two dimensions pretty easily. If we are dealing with a region instead of one line, we can do essentially the same: we define this joint model over the region. These are examples; now you have to define two lambdas, more or less, but again if lambda is very large you have a very smooth change of the field across locations, while if lambda is pretty short the
field changes pretty rapidly from one point to another. So essentially the same thing that happens in one dimension can be extended to two dimensions. This is a case with a pretty short lambda, and again these are samples taken from the field; essentially they tell you that you cannot learn much about one location from another location that is far apart when lambda is very short.

Let me now move on to the relationship between all this and decision making. Carl will talk about fields; for now let's just focus on a couple of random variables, two Gaussian random variables, and say they define the demand and the capacity of one specific component. Think of a bridge: D defines the demand on the bridge and C is its capacity, and they are two independent normal distributions, each with its own mean and variance. Then, using the classical methods of reliability analysis, it's pretty easy to compute the probability of failure, which by definition is the probability that the demand exceeds the capacity. In this case it is the probability that, if you sample a value from the blue curve, the demand, and another from the red curve, the capacity, the demand sample will be above the capacity sample. For this specific example this probability is about 0.9 percent, less than 1 percent, because basically you have a high capacity and a low demand. Of course 0.9 percent is still a large value for a single structure, but let's take it as it is.

Now suppose that you can measure, maybe indirectly, the capacity: you install a sensor and the sensor tells you something about the capacity of your bridge. Maybe it's not a perfect measurement, it is a noisy measurement with some Gaussian error, so this measurement is described by a Gaussian likelihood. Suppose, for example, that the measured value of the capacity, the expected capacity given the reading, is about 120 kilonewtons, and you also know the noise: the noise level is 10, in terms of standard deviation. By using Bayes' rule you can combine the Gaussian prior and the Gaussian likelihood and you get a Gaussian posterior; this is nothing else than an application of what I showed you before, namely that when everything is Gaussian the posterior stays Gaussian. In this case the effect of getting this measurement is that you learn that the capacity of the structure is less than expected; it is still uncertain, because the sensor reading was noisy, but it is less than expected. The posterior probability of failure goes up from less than 1 percent to 1.5 percent; this is the effect of shifting the distribution of the capacity to the left.

Now let's see how this is related to decision making; I'm sure Sebastian told you a lot about decision making. This is a very simple example in which there are just two actions you can take, repair or do nothing, and the loss you are going to pay is a function of whether the component is going to fail and of what you are doing, according to this matrix. If you do nothing and the structure is undamaged, because it is not going to fail, you pay nothing; if you do nothing and the structure fails, you pay the high cost, one million dollars, the cost of failure. The alternative is to repair the structure, and if you repair the structure, no matter what its state is, whether it was going to fail or not, you
just pay the cost of repair, 10,000 dollars, which is much less than the cost of failure. In this context the question is: shall you repair or not? You can easily solve this optimization problem: you figure out the expected cost if you do nothing, the expected cost if you repair, and pick the best option. It turns out that the optimal policy for this simple model is very simple: you find out what the probability of failure is and you compare it with a threshold, which is the ratio between the cost of repair and the cost of failure. That turns out to be 1 percent, the ratio between 10,000 dollars and 1,000,000 dollars. If the probability of failure is less than 1 percent you do nothing, you can take the risk, but if it is above 1 percent you do not take the risk and you repair. So you see what happened before: your prior optimal action was to do nothing, and after you learned that the structure is weaker than expected, you repair. That is the effect of taking the data.

Then, and this is just the same example again, to recap, you can compute what the value of getting this information is. To get the value of information you integrate over all the possible observations, and you get the expected cost when you take this observation. This expected cost depends on the precision of your sensor. In this parametric analysis I let the precision of the sensor change, starting from a very noisy sensor on one end. If the sensor is very noisy, essentially you are learning nothing because the sensor is too noisy, so there is essentially no value at all: you see that the expected cost here, at this end, is equal to the prior cost, about 10 k. On the other end you have a perfect sensor: when the precision is infinite, when the error introduced by the sensor is zero, you have a perfect measurement of the capacity, and in this case the expected management cost is much less. The value of information is defined as the prior cost, the roughly 10 k that you see here on the right where you have no sensor, minus the cost that you have when you have a specific sensor. You see that the value of information is pretty high when the sensor is very precise, and it decays with the noise of the sensor: the noisier the sensor, the less value it provides. You start from the value of perfect information about the capacity, which is pretty high, about 3,000 dollars, pretty high but of course not infinite, and you go down to zero when you have a very noisy sensor. Then you can compare that with the cost of the sensor: if you also have a model for the cost of the sensor, you can compare the value of the sensor and its cost, and only in this region, where the value is above the cost, is it worthwhile to install the sensor. Outside this region the sensor is, so to speak, too expensive or too noisy, and it does not really make sense to install it. In principle you can even figure out the best sensor, the one that maximizes the gain, that maximizes the difference between value and cost. This point here is the best sensor.
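A minimal numerical sketch of this preposterior analysis is below. The prior demand and capacity values are assumed for illustration (the transcript does not report the slide's exact numbers), while the repair cost, failure cost, Gaussian capacity measurement, and the 1 percent threshold follow the example:

```python
import numpy as np
from scipy.stats import norm

# Assumed prior model (illustrative values): capacity C and demand D in kN,
# modeled as independent normal random variables.
mu_C, sig_C = 145.0, 15.0
mu_D, sig_D = 100.0, 12.0
c_repair, c_fail = 1.0e4, 1.0e6      # repair and failure costs from the example

def p_fail(mu_c, var_c):
    """P(D > C) for independent normal demand and capacity."""
    beta = (mu_c - mu_D) / np.sqrt(var_c + sig_D**2)
    return norm.cdf(-beta)

def optimal_cost(pf):
    """Expected cost of the best action: repair (fixed) vs do nothing (risk)."""
    return min(c_repair, pf * c_fail)   # repair is optimal when pf > c_repair / c_fail

prior_cost = optimal_cost(p_fail(mu_C, sig_C**2))

def value_of_information(sigma_eps, n_grid=2001):
    """VoI of a capacity measurement y = C + eps, eps ~ N(0, sigma_eps^2)."""
    # Predictive distribution of the reading y, and quadrature weights over it.
    sig_y = np.sqrt(sig_C**2 + sigma_eps**2)
    y = np.linspace(mu_C - 6 * sig_y, mu_C + 6 * sig_y, n_grid)
    w = norm.pdf(y, mu_C, sig_y)
    w /= w.sum()
    # Conjugate normal-normal update of the capacity for each possible reading.
    var_post = sig_C**2 * sigma_eps**2 / (sig_C**2 + sigma_eps**2)
    mu_post = (sigma_eps**2 * mu_C + sig_C**2 * y) / (sig_C**2 + sigma_eps**2)
    post_cost = np.array([optimal_cost(p_fail(m, var_post)) for m in mu_post])
    return prior_cost - np.sum(w * post_cost)

for s in [1.0, 5.0, 10.0, 30.0, 100.0]:
    print(f"sensor noise std {s:6.1f} kN -> VoI ~ {value_of_information(s):8.0f} USD")
```

Because the prior here is assumed, the printed numbers will not match the slide, but the qualitative behavior is the same: the value of information decays with the sensor noise and is bounded above by the value of perfect information.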
So this analysis tells you that you should install a sensor with a specific accuracy: not too accurate, because that is too expensive, and not too noisy, because there is no point in that. I will end by showing you some parametric analyses of what happens when you change some specific inputs of your model. For example, what if the prior uncertainty that we had on the capacity was higher? The question is: if we had a higher uncertainty on the capacity, would the value of information be higher or lower? As you see, this parametric analysis shows that if you increase the uncertainty, if you assume that you know less and less about the prior capacity, then the value of information goes up, because you are learning more and more by installing the sensor. That tells you that if you have a very small prior uncertainty there is really no point in installing a sensor, but if you don't know much, then you should install it. By the way, you can find cases where this trend does not hold, which is also interesting, but overall this is the trend.

The other analysis is: what happens when I change the expected capacity in my model? If I consider a structure that is stronger and stronger, does the value of information go up or down? Here it is kind of interesting: maybe you start with this model and you have this value of information; then you consider another model according to which your prior capacity is higher, and the value of information goes up; but then, if you consider a model with an even higher expected capacity, the value of information goes down. That shows that when your prior belief is that the structure is very, very weak, there is not really any value in installing a sensor, because you know you will get bad news and you will have to repair the structure in any case. At the other extreme, if the structure is very, very strong, again there is no point in installing a sensor, because you already know the structure is so strong that you do not have to repair. In between there is a switching point where essentially you do not know what to do, and that is where the value of information is maximized.

So with that I am done with my part of the lecture. If there is any question you can ask now; then I will pass it over to Carl for his part, and you can also ask questions at the end, but feel free to ask me anything now.

I have a question. For example on slide 30, where you show the graph at the top on the right, just as an example, I don't understand very well the concept of the optimum cost. When you say, for example, that the sensor uncertainty is very low, you have a lower optimum cost, and then if the sensor uncertainty increases, the optimum cost increases. Could you try to explain the concept of this graph? You mean slide 30? Yes, for example on slide 30, the first graph on the right at the top: what is the concept of the optimum cost?
We start with a specific value of the expected capacity, and in this specific setting the prior action would be to repair: suppose the prior action is to repair, which means that the probability of failure in this case, about 5 percent, is too high, you cannot tolerate the risk, so you repair and you are going to pay 10 k, because 10 k is the cost of repair, okay? Okay. Right. So if you install the sensor and the sensor is very, very noisy, you learn nothing and your expected cost will still be 10 k, because the sensor is too noisy to give you any relevant information, right? Okay. On the other hand, if the sensor is very, very precise, there is a chance that you learn that the structure is very strong, maybe that the capacity of the structure is around 140 or something, and in that case there is a chance that you will skip the repair, and because of this the optimal cost, the expected cost, goes down. This is why there is a value of information: if the sensor is very precise, it may influence your behavior, and there is a positive value, right? This is the starting point. Okay, okay.

Now let us consider another case, in which the prior capacity is stronger. As you see in the graph at the top, these are good news, so this case is always better than before, and this is reflected in the fact that the optimal cost cannot be higher. Okay? Yes. It is true that the prior optimal action is still to repair, because the prior probability of failure is still above 1 percent, so if you have no information you have to repair nonetheless, and the expected cost is still 10 k. This is why the two lines, the red one and the green one in the graph at the top right, end up at exactly the same value of 10. Okay, I understand. But there is a higher chance that you will discover that the structure actually does not need repair, and this is why, if you have a precise sensor, the optimal cost goes down and the value of information is higher. Yes.

Then, when you consider still stronger structures, you see that the optimal cost always goes down: if you shift your capacity model to the right, if you consider a stronger and stronger structure, these are always good news and the expected cost always goes down. But this does not mean that the value of information goes up, because consider this last case, the pink one that I show here. Essentially the prior probability of failure is about 0.1 percent, so the prior action is to do nothing: the risk is completely tolerable, and this is why the prior expected cost is a little less than 2 k, just the product of one million dollars and this very small probability of failure. When you install the sensor you could actually discover that no, the structure is weaker and maybe you need to repair, but this is very, very unlikely to happen. So this curve here is very flat, meaning that it is very unlikely that you will change your prior action and decide to repair, and because of this the value of information is very, very small. You can think of a limit case in which the structure is infinitely strong: then the optimal cost goes down to zero, but no matter what information you collect it will stay at zero, so the value of information will be zero. This is the intuitive idea of why the value of information goes up and then down. Yes, I understood.
Thank you very much. Sure. I have another question. Another question, yes? Yeah, it's on slide 17. You are generating trajectories which respect a measurement of your temperature; how do you generate those trajectories? I mean, how do you impose that they respect this value? Good, good, perfect. It can be explained in more detail, but the basic idea is pretty simple. The sampled trajectories are essentially samples from a normal distribution. In dimension one it's pretty simple to generate samples from a normal distribution; in Matlab you can use the randn function. Do you agree that if you have a univariate normal distribution it's pretty simple to generate samples from it? Yeah. Okay, right. Now, if you have a multivariate normal distribution, it turns out that it is also pretty simple to generate samples; this is what is reported in the graph on the right. The multivariate normal distribution is defined by a mean vector and a covariance matrix, and if you have these two you can easily generate samples from it. Okay.

Now, when you process some measurement, essentially, where is the equation? Yeah, this equation, which may look complicated the first time you see it, but after some thinking, not too much, is just a question of linear algebra. What these equations do is change the mean vector and the covariance matrix from their prior values to their posterior values. The prior is what you have before any measurement, the posterior is what you have after the measurement. So in the background there is this updating, given by this formula: this was my prior model and now this is my posterior model; the prior was normal, the posterior is normal. And here what I do is simply represent this posterior model by generating samples from it and plotting them. Is it here? Yeah, it's here.

Can I maybe just add that when you do the conditioning, when you include a measurement, there is uncertainty, so you don't condition on a deterministic observation, right? Right, right. As you see here, you have some residual uncertainty also at the point where you take the measurement, and this is exactly because we include some noise. On one hand you have the formula for a perfect observation, but there is a very similar formula when you allow for some noise affecting your measurement. In the example I showed, and in what we usually do, we allow for some Gaussian noise affecting each measurement. Yeah, but the thing is that the covariance function degenerates if you input exact information, right? Sorry, I didn't catch that, can you repeat? The covariance function degenerates if you input exact information; then you have singular points in the covariance function of the field. I don't get the question perfectly. I think the conditioning works as long as the observations are of a specific form, where they are a linear combination of the variables plus a noise that is Gaussian.
So there are certain types of information that, if you try to incorporate them, the math doesn't work. For example, if you just know that an observed value has to be greater than a certain threshold, that is not an observation of this form, so it cannot be processed with this model alone. There are ways to do it, but it's not just this linear algebra; you have to actually do something like a Monte Carlo simulation. Yeah, but I'm just mentioning that if you don't have the noise term then you have a problem: it degenerates the covariance function, so you have no variance at the point of the observation. Yeah, I think it still works out, but you get a zero and it doesn't work? No, you cannot compute the inverse of the covariance matrix. Okay, can you state the question again? If you have a measurement with exactly zero noise, it introduces a problem: you can't invert the matrix, basically, to do the updating. Right, but if you have a measurement with zero noise, essentially that component is not a random variable anymore, so you just have to use a reduced covariance matrix for everything other than what you measured, right? You can define a covariance matrix for the part of the field that you have not measured, and this is what this formula says: in this formula here at the top, the size of this covariance matrix is smaller than the size of the joint covariance, because you have observed a subset of the random variables, so you define the field only on the remaining random variables that you have not observed. Exactly, but I just wanted to say it because if some of the students want to try this at home, they should remember to include the noise term, otherwise they will get a lot of... No, that's a really good point: there can be numerical issues when the noise is zero and you try to invert the matrix. But the point is really that, when you have a perfect measurement, you should not keep the observed variable inside the covariance matrix. Yeah, I totally agree. Thanks for the clarification. Okay, Carl has a very long part of the lecture still to go, so maybe it's better that he starts covering this last part, right? Okay. Thank you very much. Yeah. Okay. Still working? Okay.

So, that was an introduction to the Gaussian process model, which is a way of modeling random variables that are distributed over space and related to each other, and I'm going to go forward with how that impacts value of information. We just saw an example of value of information for one Gaussian random variable, or actually two Gaussian random variables, but we were only measuring one of them. Now, what if we have a whole field of random variables and we are measuring several of them: how does that affect the value of information? The idea, basically, is that we have a field which defines a set of random variables which affect our system, and these random variables are correlated with each other. We can take observations of one or of many of these random variables. These random variables affect the state of our system through some sort of limit state function, for example. And then of course we take actions to manage our system, and the combination of these states and the actions defines our loss, our consequence function.
And then our total consequence for managing the whole system may be a function of the consequences for managing different parts of the system, or a more complicated interaction of the states and the actions we take on various parts of the system. The end goal, of course, is to assess what the value of any individual measurement is, and then use that value to select which measurement or measurements we should take on the system.

So the first thing I want to do is briefly recap the idea of value of information, and also introduce the notation, because it is a little bit different from what we saw this morning. Basically we have a set of random variables, F, which describe in some way the state of our system. These could be, say, strains on different parts of the system; basically just a set of variables. This was X in the morning. We have a set of actions that we take to manage the system; in the earlier examples you saw a choice of whether or not to repair a component, so that is an action decision. Together these define a loss function. In the morning, I guess, we were more optimistic: we had a benefit, which we were trying to maximize; here I have a loss, which we are trying to minimize, but it is the same idea of the utility of a combination of states and actions, and it encapsulates how much we are paying to manage the system, or how much benefit we are getting; the benefit would just be the negative of the loss.

In the prior condition, what we do is the following: for any action we take, we take the expectation over the possible states of the system and we compute the expected loss we are going to incur for that action, and then we choose the action which minimizes our expected cost. This is the prior analysis: without any additional information, we choose the action that minimizes our expected loss, or maximizes our expected benefit. Now, if we have access to an observation related to the state of the system before we choose our actions, we do the same expectation but using the posterior distribution of F conditional on the observation, which in this case is Y, and again we choose the action which gives us the minimal expected loss, or maximum expected benefit, conditional on this observation. This is the conditional cost, conditioned on that observation. If we want to do the expected analysis, we have to take an expectation over all possible outcomes of the observation, and that gives us the posterior expected loss; the value of information is the difference of these two: how much less, in an expected sense, we expect to pay as a result of making better decisions based on this new information. It has some key properties. You can compare it to the cost of a measurement, so you should only choose to acquire information if the value it will provide is greater than the cost of collecting it. And, as an upper bound, the value of perfect information is the maximum possible value of information, where the value of perfect information is the value I would get if I had perfect information about the state of the system before choosing what actions to take.
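In symbols (using f for the state variables, a for the actions, L for the loss, and y for the observation, as in this part of the lecture), the quantities just described are:

$$
C_0 = \min_a \, \mathbb{E}_f\big[L(f,a)\big], \qquad
C(y) = \min_a \, \mathbb{E}_{f \mid y}\big[L(f,a)\big], \qquad
C_1 = \mathbb{E}_y\big[C(y)\big],
$$

$$
\mathrm{VoI} = C_0 - C_1 \;\ge\; 0, \qquad
\mathrm{VoI} \;\le\; \mathrm{VoPI} = C_0 - \mathbb{E}_f\big[\min_a L(f,a)\big].
$$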
We are now going to extend this to a larger system and see how the complexity is affected by the size of the system. We saw a bit of that in the morning with the decision tree branches: as we have more and more possible states and more and more possible actions, the number of branches grows exponentially. As I said, there are three parts to this analysis. We need to take an expectation over the possible states, and the number of states can grow exponentially with the size of the system. We have to minimize over the possible actions, and the actions grow exponentially as well. And we have to take an expectation over all possible observation outcomes, and if we are taking many different observations on different parts of the system, that grows exponentially too, so overall the decision tree grows very quickly. So if we want to look at larger problems, we need to adopt strategies to avoid this growth; we saw some strategies earlier, and I want to point out one that is related to those, but is maybe a different way of looking at it.

Suppose we can express our loss function in the following way: the cost for managing the entire system, a function of all the variables affecting the system and all the actions we take to manage it, can be expressed as a sum of losses, where each of these losses is a function of some disjoint subset of the variables and a disjoint subset of the actions. If we can do that, we get a benefit which I will go into, but intuitively this is an approach we take in a lot of cases. For example, if we are managing a farm of wind turbines, we can say the cost for managing the whole farm is the sum of the costs of managing each turbine: each turbine is in a different state, we take different actions to manage each turbine, and we can say, at least approximately, that the cost for managing the whole wind farm is the sum of the costs of managing each turbine.

If we plug this into the value of information expression, we can take advantage of the mathematical properties, first of the expectation: the expectation of a sum is the sum of the expectations, so we can move the expectation through the summation. We can also take advantage of the fact that, if we are minimizing a sum and each summand is only a function of a subset of the decision variables, we can minimize the sum by minimizing each term separately. Here we are only choosing from the subset of actions that affect this term, and if we do that for every term we have minimized the sum, because I am assuming that none of the actions that affect one term affect any of the other terms in the summation. Finally we use the linearity of expectation again to move it through the outer expectation. We can of course do the same thing if we are not conditioning on anything: if we are not taking any measurements, we are in the prior case, the measurements just drop out, but the math still applies. And now the value of information can be written basically as a summation of local values of information.
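Sketched out, the decomposition described here is (with f_i and a_i the disjoint subsets of variables and actions entering the i-th local loss):

$$
L(f,a) = \sum_i L_i(f_i, a_i)
\;\;\Rightarrow\;\;
\mathbb{E}_y\Big[\min_a \mathbb{E}_{f\mid y}\big[\textstyle\sum_i L_i(f_i,a_i)\big]\Big]
= \sum_i \mathbb{E}_y\Big[\min_{a_i} \mathbb{E}_{f_i\mid y}\big[L_i(f_i,a_i)\big]\Big],
$$

so that the value of information is a sum of local terms, each of which is still conditioned on the full observation y.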
So instead of evaluating the value of information using the full loss function, we evaluate the value of information on each part of the system using the local loss function, and then sum up to give the value of information at the whole-system level. The important thing is that we are still taking the expectation over all the observations. The consequence is that a measurement on one part of the system, if for example we are using a Gaussian field as we saw in Matteo's presentation, updates our knowledge about the whole field, not just where we are making the observation. So even though the consequences for managing things that happen on different parts of the field might be separate, the information we collect can still be used to update our knowledge of the whole field, and we compute the value of information based on that; we keep the benefit of taking information collected at one point and using it to update our model of the whole system. And because we only have to deal with a subset of variables and a subset of actions at a time, this greatly reduces the computational cost; if we further consider that in typical situations we are not taking measurements of every possible variable in the system, but only of a small subset, we get even further reductions in computation.

Another benefit we can get, specifically when we are dealing with these Gaussian random field models, is the following. One benefit, as Matteo talked about, is that you can update from the prior to the posterior distribution using just a linear algebra equation. A related benefit is that it also becomes a little bit easier to compute the value of information. If we have this Gaussian model of our system, linear observations with Gaussian noise, and our limit state function is a linear combination of the random variables affecting the system, then our limit state variables will also be Gaussian. So, for example, in the kind of example Matteo showed, where you have two states, operational (capacity exceeds demand) or failed (capacity is less than demand), and two possible actions, do nothing or repair, we can look at it graphically: if I repair the system I pay the same penalty regardless of what the failure probability is, whereas if I do nothing, the expected cost I am going to pay increases with the failure probability, because it is the product of the probability of failure and the consequence of failure. To find my optimal action I want to minimize this, so for any given probability of failure I should choose the action which gives me the minimum of those two. Below a threshold (the numbers are different in this case, so the threshold is actually 0.25) we should do nothing; above it we should repair. In the prior case the prior probability of failure is a specific value, which we compute from the reliability index; plugging that in, we can figure out what we should do and what the expected cost of that is, so we can formulate our expected loss as a function of the probability of failure.
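Concretely, with c_R the repair cost and c_F the failure cost, the optimal expected loss as a function of the failure probability p_F is

$$
L^*(p_F) = \min\big(c_R,\; p_F\, c_F\big), \qquad p_F^* = \frac{c_R}{c_F},
$$

with the switch point p_F^* equal to 0.25 in this example and 1 percent in Matteo's earlier one.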
In the posterior case, what we actually have is a distribution over probabilities of failure. The reason is that any measurement we take updates our model, and we compute a new probability of failure; if we have a distribution over the possible outcomes of our measurement, we are going to end up with a distribution over the probabilities of failure after updating. It may look something like this: for some measurements we will find the system to be safer than we thought, so there is a high probability that we end up with a low probability of failure after updating; for some measurements the system will be less safe than we thought, so we will tend to get higher probabilities of failure; and some measurements will be inconclusive, and we will end up with a probability of failure somewhere in the middle. The posterior expected loss, the expected loss conditional on that random observation, is basically the integral of this loss curve weighted by this distribution.

If we then recall, as we learned earlier, that the probability of failure is related to the reliability index, then in the specific case of the Gaussian system we have the very nice property that, conditional on an observation, the reliability index itself has a Gaussian distribution, just because of the way Gaussians work in this setting. Every observation we make updates our probability of failure and therefore updates our reliability index; if we have a distribution over possible observations, which is Gaussian, we get a distribution over probabilities of failure, and if we convert that back to a distribution over reliability indices, it happens to also be Gaussian, which is a nice property. The practical consequence is this: here we have an expectation over observations, and we may have any number of observations, so the integral we do to compute that expectation is with respect to the joint distribution of all the observations. But using this equivalence we can also compute it as an expectation over the reliability index, and the reliability index for a component is just one number. So we may have 10 different observations, which would mean taking an expectation over 10 random variables; the reliability index for a component is just one random variable, so we are taking an expectation over only one random variable, which is much easier to do.

So that is a very brief overview of how to avoid some of the computational difficulties associated with scaling up to large systems, which can be used in addition to some of the things we learned about this morning. This slide just summarizes it: if you have a Gaussian system together with a loss function that decomposes in that way, you can get some good computational benefits.

Okay, so as a simple example, let's say we have a roof which is loaded with a snow load. It's not showing up on the screen, unfortunately... okay, here we go. Let's say the profile of the snow load on the roof looks something like this, which is given by a Gaussian model with this kind of exponential covariance to get the smoothness.
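A minimal sketch of drawing such smooth load profiles, using the squared-exponential covariance from the first half of the lecture (all numerical values below are assumed for illustration, they are not the ones on the slide):

```python
import numpy as np

x = np.linspace(0.0, 20.0, 201)      # positions along the roof [m] (assumed)
mu = np.full_like(x, 0.6)            # assumed mean snow depth [m]
sigma, lam = 0.15, 3.0               # assumed std [m] and correlation length [m]

# Squared-exponential covariance: k(dx) = sigma^2 * exp(-dx^2 / (2 * lam^2)),
# so two points one correlation length apart have correlation exp(-1/2), about 0.61.
dx = x[:, None] - x[None, :]
K = sigma**2 * np.exp(-dx**2 / (2.0 * lam**2))

# Draw five smooth profiles from the prior; the small jitter keeps the
# Cholesky factorization numerically stable (see the earlier Q&A on noise).
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))
rng = np.random.default_rng(0)
profiles = mu[:, None] + L @ rng.standard_normal((len(x), 5))

# Sanity check of the correlation structure at one correlation length.
j = np.argmin(np.abs(x - lam))
print("correlation at separation lambda:", K[0, j] / sigma**2)  # about 0.61
```

Conditioning these profiles on measured snow depths, as in the next step, uses exactly the linear-algebra update sketched earlier for the linear-Gaussian observation model.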
Okay, so just as a simple example, let's say we have a roof loaded with a snow load (it's not showing up on the screen unfortunately... okay, here we go). Let's say the profile of the snow load on the roof looks something like this, modeled as a Gaussian random field with a squared-exponential covariance to get this kind of smoothness.

And let's now say that we can measure the depth of the snow at a given point, use that as a likelihood to update our prior to a posterior, and then draw samples from the posterior distribution, which, as you saw earlier, are now conditional on that observation. If we make multiple measurements of the depth of the snow, our draws from the posterior get closer and closer to the actual profile we are looking for.

So now the question is: where do we want to make those observations? We can only make a limited number of observations; in the simplest case, let's say we can make only one observation, so where should we make it? We want to answer that with the value of information, of course. Let's say that if the loading exceeds a certain capacity threshold we have a local collapse of that section of the roof; for this profile, with this given capacity threshold, we have collapse in those areas, which incurs a cost proportional to the amount of area that has collapsed. We can take an action, for example clearing snow off different areas of the roof, which avoids those negative consequences.

We can do the same sort of thing and compute expected costs. If we choose to do nothing, our expected cost increases as we go from one side to the other; it's not showing up here, but the mean looks something like an increasing line, so there is a higher chance of exceeding the threshold towards this end, hence a higher probability of failure and a higher expected cost, whereas clearing the snow off has the same cost regardless of where we do it. We want to choose the option that minimizes our losses, so in the prior case, with no additional information, we would choose to remove snow from this area but do nothing over there. And of course, if we take an observation, we update our model, which changes the probabilities of failure and changes the optimal action; we take an expectation over the possible outcomes of the measurement, and we can compute the value of information as a function of the point where we take the measurement.

That looks like the following: the peak is here, so the best location to take our measurement would be here. This is a result of several different factors. First, the average is increasing as we move from left to right, so failure is more likely there; but at the same time the prior optimal action already says, okay, failure is more likely here, so we should be conservative and remove snow from this area. Then at this point the consequences of choosing to remove snow or not are equal to each other, so this is the decision point; but the best measurement is not exactly there, because when we take a measurement we also learn a little about what is happening in the vicinity around it. And you can maybe see that the confidence intervals are much narrower here, so we have less uncertainty towards this side. So it's an interplay of all these factors that ends up determining where the value of information is highest if you are making this one measurement.
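As a rough sketch of that calculation, assuming a discretized roof and made-up numbers for the prior mean, kernel, capacity threshold and costs (none of these values come from the lecture), one can condition the Gaussian field on a single hypothetical depth measurement and scan the candidate locations for the one with the highest value of information:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

x = np.linspace(0.0, 10.0, 50)          # positions along the roof [m]
mu = 1.0 + 0.1 * x                       # prior mean snow load (increasing trend)
sigma, ell = 0.5, 2.0                    # prior std and correlation length
threshold = 2.0                          # local capacity
c_fail, c_clear = 4.0, 1.0               # per-point failure / clearing costs
noise = 0.05                             # measurement noise std

def sq_exp(xa, xb):
    return sigma**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / ell**2)

K = sq_exp(x, x)

def expected_loss(p_f):
    # per-point decision: clear snow (fixed cost) or do nothing (risk collapse)
    return np.minimum(p_f * c_fail, c_clear).sum()

p_f_prior = stats.norm.sf((threshold - mu) / np.sqrt(np.diag(K)))
prior_loss = expected_loss(p_f_prior)

# Value of information of one depth measurement at each candidate location,
# estimated by Monte Carlo over the prior-predictive outcome of that measurement
voi = np.zeros_like(x)
for i in range(len(x)):
    k_i = K[:, i]
    v_i = K[i, i] + noise**2
    post_var = np.clip(np.diag(K) - k_i**2 / v_i, 1e-12, None)
    losses = []
    for _ in range(200):
        y = rng.normal(mu[i], np.sqrt(v_i))          # hypothetical observation
        post_mean = mu + k_i * (y - mu[i]) / v_i     # GP conditioning
        p_f = stats.norm.sf((threshold - post_mean) / np.sqrt(post_var))
        losses.append(expected_loss(p_f))
    voi[i] = prior_loss - np.mean(losses)

print("best measurement location:", x[np.argmax(voi)])
```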
Just maybe an interesting thing to note: if you tune the consequences so that failure has twice the consequences of the repair action, we actually get a misclassification problem, where you are essentially trying to predict whether you are above this line or below it. Typically, though, the consequences are very different: the consequences of failing to do anything and having a failure are much higher than the consequences of taking an intervention action when you didn't actually need to.

And of course the result is a function of the actual decision process we are dealing with. In the previous case we had to make a different decision at every point: do we remove snow here or not? Now let's say we have to make one decision for the entire structure: do we remove snow everywhere, or do we not remove any snow? This is a different decision-making problem, so the value of information looks different and the optimal measurement point is different. It is actually a more difficult problem, because now the loss is not the sum of the losses at each point; we are no longer making separate decisions at each point, we are making the same decision across the entire structure, so the decision-making problem is coupled across the whole structure. We no longer have the structure where the loss can be written as a sum of losses over subsets of the actions: there is one action to take, so the loss is coupled across the whole system.

Similarly, if we thought that collapse in one place on the roof would become a progressive collapse and lead to the collapse of the entire roof, then the states are coupled: the state of this part of the roof now depends on the loading on that other part of the roof, so again we have coupling across the entire system. It is a more difficult problem and we cannot apply the simplification I showed earlier of decomposing the value of information, but with a bit more computational effort we can still evaluate it and say: okay, in this case the value of information is highest for a measurement at this point, so we should measure on this side. That is basically because we now want to be very conservative and remove snow from most of the area, but if we happen to observe a low loading value here, where the average is much lower, we might be able to get away with not removing snow in that area, saving a little bit of cost while still avoiding failure.

Okay, so just to recap that section: the value of information depends on many factors, the accuracy of the measurements, the coupling, that is, the degree of correlation between different parts of the field we are measuring, and of course the actions we can take, the decisions we can make, and their consequences.

I am going to go very briefly into what happens when we move to a temporal system: a system where a random field affects the system, but that random field changes over time. In the previous example the snow had one shape, it was one snowfall; but what if snow is melting and more snow is falling? Then the profile of the loading changes over time and we have a temporally evolving system.

Just for clarification, for the first case, before we go into the temporal case: I have a feeling that you have some sort of assumption on the temporal dimension, even though you don't mention it, because if you have a realization exceeding the threshold, the roof has already failed at that point, so you must have some sort of
idea of what is going to happen over time. Yes, so as long as nothing exceeds the threshold there is no failure and everything is fine. Yeah, I would say that in that case, again, it was a little bit of a simplified example, so we can very easily see whether the roof has failed or not; taking a measurement of the depth of the snow where the roof has already collapsed does not make a lot of sense. But extrapolating to a system where it is not always so obvious, the important thing to say is that the actions we take have to be able to actually have an effect on the system.

What you might have in mind is something like the conditional probability of failure becoming too high, where that conditional probability would somehow incorporate time-variant effects, like wind effects, or effects related to the temporal accumulation of heavy load, for instance in timber structures. We just have to add that in this constructed case, because otherwise, if we simply observe that the structure survives, it means there is no probability of failure; so we have to assume, which is of course somewhat artificial, that there are some time-variant effects we do not have control over, and then we can talk about the data. Yeah, I would say there is some time between when we make the observation and the actual collapse: if we observe this, we have, say, 24 hours before the actual collapse occurs, so the system can change in between; there is something happening that is not immediately observable but that we can deduce through a measurement, and that is not included in the model.

And another question, which is actually very interesting; it was a super presentation so far, but I just want to bring in a practical point of view again. You assume this spatial correlation structure with this squared exponential, right? Yes. And this is somehow fixed, isn't it? Yes. So what happens if you have many, many observations? Because in a Bayesian context you would update not only the mean but also the covariance, right? But you keep this covariance fixed. So in a practical context with many, many observations, which unfortunately is not very often the case, you could end up with an assumed spatial correlation that is totally inconsistent with what you have observed, you know what I mean? I would say, I think what John is proposing is that the correlation might itself be modeled. Yes, something I haven't discussed is that there can be uncertainty in that, and of course in the parameters affecting the model, which is a whole higher level of the decision problem. If that were the true correlation length and your model were absolutely correct, you wouldn't have that problem, but it's a very nice practical example: you had an assumption on the spatial correlation of the snow depths, and that's a tough one, right?
Yeah, but you fixed that somehow. Yeah, so the drawback here is that you need a model of the world. You can of course do what we saw in the morning: this is the small-world setting where we know the model, but in the bigger-world setting we don't know the model exactly, and that adds another layer. In what I showed you before there are three layers of complexity, with the states, the actions and the observations; there is maybe a whole other layer, or several layers, beyond that, where we don't know the model, we don't know the parameters of the model, we don't know whether the model itself is correct, and so on and so forth. I don't know if we want to take a break, or I can very quickly skip ahead; I have a few more things which are maybe less important than going through the examples in the afternoon. But we can do a short break. Okay, thank you.

So the first thing I want to touch on is the time-variant problem. If we have a system where the random variables affecting it evolve in time, in principle we can also model this within the Gaussian process framework, just by extending to another dimension. Before, we had one dimension of space, and Matteo showed examples with two dimensions of space; maybe we need two dimensions of space and one dimension of time to describe it. We can use a correlation structure where the correlation length is now a correlation time, so over a certain number of hours the variables are more or less similar to each other. In the same way we can build a model where at each time step certain variables affect the state of the system at that time; we can take observations and take actions at different times, and there are consequences we suffer. One assumption we might make is that the consequences of managing the system over time can be written as a discounted sum of the consequences incurred at each discrete step in time, at each year for example; this is one model we might use.

I won't go into details here, but one important thing to keep in mind is that while in the spatial case we could take observations anywhere we wanted and use them to update our model of the whole system, now we have to be very careful, because we can only make decisions based on information we have collected in the past; we cannot use the outcomes of observations we are going to take in the future to guide our decisions now. So our actions are taken in a sequential way, where the set of actions we take at a certain time is based only on the information available up to that time, and we have to be very careful in computing that. That is also a major source of computational complexity, because we don't just optimize what we are going to do now; we have to consider that in the future we will gather more information and change our decisions, which is an additional layer of complexity I won't go into in detail. Briefly, if we make an assumption similar to the one I made in space, we can get similar computational savings, where we compute the value of information as the sum of these terms.
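Going back to the spatio-temporal model just described, here is a minimal sketch (all parameter values made up for illustration) of a separable covariance that combines a correlation length in space with a correlation time, plus the discounted sum of per-year consequences:

```python
import numpy as np

sigma = 1.0       # marginal standard deviation of the field
ell_x = 5.0       # spatial correlation length [m]
ell_t = 48.0      # temporal correlation length [hours]

def k_spacetime(x1, t1, x2, t2):
    """Covariance between the field at (x1, t1) and (x2, t2)."""
    return (sigma**2
            * np.exp(-0.5 * (x1 - x2)**2 / ell_x**2)
            * np.exp(-0.5 * (t1 - t2)**2 / ell_t**2))

print(k_spacetime(0.0, 0.0, 1.0, 6.0))       # nearby in space and time: high
print(k_spacetime(0.0, 0.0, 0.0, 14 * 24.0)) # two weeks apart: almost zero

# Discounted sum of per-year consequences (gamma is a hypothetical discount factor)
gamma = 0.97
losses_per_year = [2.0, 1.5, 0.0, 3.2]
total_loss = sum(gamma**t * L for t, L in enumerate(losses_per_year))
print(total_loss)
```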
But the very important thing to keep in mind, and it is a very important restriction of this assumption, is that now the actions I take at a certain time can only affect the losses at that time; the actions I take now are not going to affect my losses in the future. In many applications that may be a very poor assumption, because the actions I take now do affect the way the system behaves in the future. So in the temporal case this can be a very restrictive assumption to make in order to get these computational savings, and otherwise you have to go through the full analysis, considering the full history of what you have done in the past as well as what you are doing now, because it affects how the system evolves in the future.

Okay, I'm not quite sure I get it; so what you're saying is that, under this assumption, the actions you take now can only affect your losses right now? Yes. Which is, I mean, you can do something now, but if everything is okay, if your system has not failed, why would you do anything? Yeah, so one application area for this type of thing is a very simple model of response to an extreme event. Imagine a coastal system over many years, with hurricanes happening. In a given year I might predict that there is going to be a hurricane, so I build sandbags to protect my coastal system, and at the end of the hurricane season I take them down; then, depending on whether I built the protection and whether or not a hurricane came that year, I suffer a certain loss, and the next year I make a new decision again. Maybe the probability of a hurricane coming this year is related to the probability that a hurricane came the year before, because of the climate system. So you look ahead? Well, you look ahead in the sense that the system is evolving in time, so there is a benefit to taking a measurement now: if I take certain measurements I learn how the system is evolving, maybe I learn the trend, so I can make better decisions in the future. But this assumption says that the consequence I suffer is only the consequence of what happens in this year and of what I do about it in this year. So you have to take the estimate for the next year, yeah. Again, this is a bit restrictive: it only applies when that assumption is in place, and in many applications you have to explicitly model that an action taken now affects how the system behaves in the future.

Then, very briefly, a second topic is sensor placement and scheduling. By sensor placement I mean that I select certain places in space where I will collect information, either continuously or intermittently with a certain schedule in time. In a structural health monitoring context, this is choosing where to place my strain gauges; then, as the system changes, I keep getting measurements from those same locations. A related problem is the scheduling problem: I have sensors in a few places, but I have to choose when to observe from those sensors. In some cases we have power constraints; if our sensors are deployed remotely and are operating from batteries,
we can't collect and transmit information all the time, we have to be careful with our battery resources, so we can't collect information everywhere all the time, but if we schedule appropriately we can make those decisions. And finally there is the unconstrained problem, where you can collect information at any point in space and at any point in time; that might be the case if you have many mobile inspectors, so on some days you send them out to inspect many components and on other days you send them to only a few locations, and you can send out however many you want. All of these cases can be handled with different cost functions or constraints in your optimization: the objective is to choose the set of measurements that maximizes the value of information of those measurements minus the cost of collecting them. For example, in the placement problem it may be a very small or even negligible cost to collect more information in a place where you already have a sensor, but if you want to collect information in a new place where you don't have a sensor there may be a very high cost associated with installing that new sensor, so you have to trade those off as well.

Another distinction worth mentioning is the difference between online and offline in a sequential problem. Offline means you make a plan of what information you are going to collect, where and when you will take these measurements, and you follow through on that plan. This contrasts with the online case, where the information you gather is used to re-evaluate and re-optimize your plan for collecting information. Whenever possible you want to do this, because when you re-evaluate you base the plan on the latest available information, and at least in an expected sense you get a better plan. Basically the idea is: you make a plan, you collect information at the first time step, you use that information to update your predictions and your models, you re-evaluate your plan for the future, and you continue forward in an iterative manner. Obviously you can only do this when the time between the sequential steps in your plan is greater than the time it takes to re-optimize the plan; if your steps are on the order of milliseconds or seconds it may not be practical to do online sensing, but in civil engineering we typically make decisions on a time scale of months or a year, so it is very often possible to do this online updating.

And then, finally, there is the issue of combinatorial optimization. When I am selecting measurements, what I am often doing is a combinatorial optimization problem: I have a large set of candidates for where I can place sensors, and from those candidates I want to select a certain number of places where I actually take measurements. The only guaranteed way to find an optimal solution to this problem is to look at every possible combination I could make, because the properties of the value of information are such that, and I'll get into that a little bit more later, no shortcut is guaranteed to work.
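A sketch of that selection objective, with `voi_of` and `cost_of` as hypothetical placeholders for the model-specific computations described above; it makes the combinatorial explosion explicit:

```python
from itertools import combinations

def best_measurement_set(candidates, n_select, voi_of, cost_of):
    """Exhaustive search over measurement subsets: the only guaranteed way to
    find the optimum, but the number of subsets explodes with the candidates."""
    best_set, best_net = None, float("-inf")
    for subset in combinations(candidates, n_select):
        net = voi_of(subset) - cost_of(subset)   # net value of information
        if net > best_net:
            best_set, best_net = subset, net
    return best_set, best_net

# With 50 candidate locations and 10 sensors there are about 1e10 subsets to
# evaluate, which is why greedy and heuristic approaches are attractive.
```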
And of course the number of those combinations grows very, very quickly with the number of candidates and the number of sensors I want to place. One way to avoid this, which I have often made use of in my work, is a greedy optimization algorithm. Instead of choosing where to put all the sensors at once, you first choose where to put the first sensor, as I showed in the example with the roof; then, given that you will have a measurement at that location, you choose where to place the second sensor, then the third sensor, and so on. So instead of looking at every possible combination, you restrict your search by fixing one place and looking at all the remaining places, then fixing the second place and looking at all the remaining places. This is much more efficient, but we have to be careful, because there is no guarantee that it will reach an optimal solution.

The reason is that the value of information lacks a property called submodularity. Simply put, submodularity is a diminishing-returns property: if the value of information did have this property, the value you get from two measurements together would always be less than or equal to the sum of the values you would get from each measurement individually, so the more measurements you already have, the less marginal value each new measurement provides. That is not the case here, and I'm going to try to illustrate it with an example.

Let's say we have a system with two components whose states are described by random variables that evolve in time, so over time the state oscillates like this; if it exceeds a certain threshold we have a failure with associated consequences, and as in the previous examples we can take a binary action that avoids that failure. We can also take measurements of these blue lines at each point in time, but our measurements are highly biased, so they look like these orange lines: this orange is a measurement of this blue, and that orange is a measurement of that blue, and there is a large offset between them which we don't know beforehand.

The way we would attack this with the greedy optimization is as follows. We look through all candidates and try to find the one measurement that gives us the highest value; in this case we pick a measurement there, but that measurement gives us almost no value, because the large bias means the uncertainty after that observation is still very high. With a second measurement, though, we can look at the two measurements together and get an idea of the systematic bias, because the measurements have a certain relationship to each other, and we know from the prior model how the variable behaves; based on the difference between the predicted behavior and where those measurements lie, we can estimate the bias, and that estimate of the bias gets better the more measurements we take.
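Going back to the greedy procedure itself, here is a minimal sketch; `voi_of` is again a hypothetical placeholder for the model-specific value-of-information computation, and, as noted above, this carries no optimality guarantee when the objective is not submodular:

```python
def greedy_selection(candidates, n_select, voi_of):
    """Forward greedy selection: add, one at a time, the measurement that
    increases the value of information of the already-selected set the most."""
    selected = []
    for _ in range(n_select):
        remaining = [c for c in candidates if c not in selected]
        best = max(remaining, key=lambda c: voi_of(selected + [c]))
        selected.append(best)
    return selected
```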
So it actually ends up that multiple measurements together provide a very high value of information. As I said a little earlier, this is because we lack submodularity: this one measurement by itself provided very little value, but this measurement combined with that measurement provides a very high value. There are no diminishing returns here; with diminishing returns the first measurement would provide the most value and the second measurement less, but here the combination of the two provides more value than either of the individual measurements.

Now, if we keep going forward, what we eventually see is that because of this the greedy algorithm has gotten stuck: it is only measuring here, because it only has measurements on this component, so it can only figure out the bias for measurements on this component. It still doesn't know the bias for any measurement on the other component, so those measurements have low value and are not selected. It will only select a measurement there once it has exhausted all the measurements we are considering on this component, which in this case is ten; but once it has that first measurement on the other component, it can go back and say, okay, now I can correct the bias of the other measurements using the measurement I already have, and it then picks up a high value of information related to the management of that component. The value of complete, or perfect, information here is roughly what we obtain once we have measured everything we can, and that is the maximum value of information. If I subtract a linear cost, so the cost increases linearly with the number of observations we take, this is the net value of information, and based on the net value of information this is the optimal set of measurements we should use: only these four measurements on this one component. Intuitively you can pretty easily say that that is not correct; what about managing the other component?

There are ways to avoid this. One way is a reverse optimization: this starts with the full set of all measurements and then iteratively removes measurements from the set while keeping the value of information high. You can see that this keeps measurements on both components basically until near the end, where it has to abandon one of the components; after subtracting the cost, the maximum net value of information is now much higher, and it consists of measurements on both components. The main drawback is that we had to work backwards from the set of all possible measurements, and as I mentioned earlier, the set of all possible measurements is typically much, much larger than the set of measurements we actually want to use; we consider many possible options and then choose only a very small subset, so working backwards takes much longer to reach the best solution than working forward.
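A sketch of that reverse (backward) procedure, again with `voi_of` as a hypothetical placeholder; note it starts from the full candidate set, which is what makes it slow when only a few measurements will be kept:

```python
def reverse_selection(candidates, n_select, voi_of):
    """Backward greedy: start from the full set and repeatedly drop the
    measurement whose removal hurts the value of information the least."""
    selected = list(candidates)
    while len(selected) > n_select:
        drop = max(selected,
                   key=lambda c: voi_of([s for s in selected if s != c]))
        selected.remove(drop)
    return selected
```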
If you do want to work forward, we can use a heuristic approach by switching to a different optimization objective, which is not the value of information, so it is not our ideal objective, but which does have this submodular property. With a little bit of a heuristic, we can say that we optimize based on the value of information, but when the growth rate of the value of information as we add more measurements drops below a certain threshold, we switch to this other heuristic to pick the next measurement, then switch back to the value of information, and we can now climb back up to where we were before. So that is one option you might use to avoid these traps, where the greedy algorithm, although very efficient, falls into a sub-optimal solution. It is just something to be aware of: even though this algorithm is efficient, it can run into problems in some cases.

So with that, just to briefly summarize. The value of information is of course critical for supporting decisions about where to collect information, because it directly assesses what the benefit of that information will be in making better decisions and reducing our management costs. Unfortunately the complexity grows very quickly in large systems, so blindly applying the decision tree approach leads to a very large decision tree; we need to take whatever advantages we can to prune the branches of that tree down to something more manageable. The assumptions we make about how the system behaves and what actions we can take to manage it are basically what allows us to do that pruning in a good way; we saw earlier that splitting the decision tree based on the actions and based on the states is one way to avoid that complexity, and that follows from how the system functions and how we are going to manage it.

In this lecture, Matteo talked about Gaussian random fields and how they can be used as a special class of model which allows for very efficient updating, and which also lets us describe, in a principled way, the correlations between random variables in different locations, giving the kind of smooth shapes that are often what we want when describing spatially varying phenomena. I should also mention at this point that it doesn't always have to be normally distributed random variables: we can apply transformations, for example an exponential transformation, so if we start with a Gaussian random field and apply the exponential, what we end up with is a set of lognormal random variables instead of normal ones; we can work in the log space and deal with variables that have a different distribution than just the Gaussian. For other kinds of models the Bayesian inference is not so simple, and we may need more complicated ways to do the Bayesian updating, especially if we have a nonparametric model; I won't go into that, but just be aware of the Markov chain Monte Carlo class of algorithms for addressing Bayesian updating in general.
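A minimal sketch of that exponential transformation (the grid, mean and covariance parameters below are made up): sample a Gaussian field and exponentiate it to obtain a strictly positive, lognormal field.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Gaussian random field on a 1-D grid (squared-exponential kernel)
x = np.linspace(0.0, 10.0, 100)
mu_log, sigma_log, ell = 0.0, 0.3, 2.0
K = sigma_log**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
K += 1e-9 * np.eye(x.size)   # jitter for numerical stability

# Sample the Gaussian field, then exponentiate: each point becomes a lognormal
# (strictly positive) random variable, e.g. for a permeability-like quantity.
gaussian_sample = rng.multivariate_normal(mu_log * np.ones_like(x), K)
lognormal_sample = np.exp(gaussian_sample)

print(lognormal_sample.min() > 0)  # always positive
```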
And of course, as I mentioned earlier in response to the question, one of the things I have glossed over here is uncertainty in the model itself. In this case that would be: okay, we have a correlation length, but do we really know what that correlation length is, or is it itself a random variable? The idea is that we don't necessarily know the correlation length exactly; it may actually be an uncertain variable, and in addition to updating our model given a certain correlation length, we may also need to go back and use our data to update our prediction of what that correlation length actually is. That adds another level of complexity to the task.

So with that I think I'll conclude, and, as I said, Sebastian sent out an example problem that you can work on, and we can answer questions about that.