So welcome back. I think today is the last lecture of this week; we only have two lectures this week and for the next couple of weeks. Any questions on previous lectures before we start with today's lecture? I'll start sharing the screen in the meantime. I see no questions yet. If you have any questions, just put them in the chat; I've discovered that I can now see things flashing even while I'm talking.

In today's lecture we'll do more Gaussian calculations, but we'll build what is probably the first genuinely useful model you may come across. It's called probabilistic PCA, where PCA stands for principal component analysis. I'm not sure whether you are already familiar with PCA. It is one of the most widely used basic data analysis techniques, first proposed at the beginning of the last century by Karl Pearson, and it essentially consists of a way to reduce the dimensionality of high-dimensional data. So what is PCA? It extracts the directions of maximal variation.

OK, so there's already something in the chat: Alex introduced it yesterday. So I might present it in a slightly different light. I presume Alex introduced principal component analysis, which is good, because today we look at the probabilistic analogue. I'm not surprised, because Alex is also very much interested in dimensionality reduction and density estimation techniques.

So what does it mean to extract directions of maximal variation? Often data might appear to be high dimensional, but there might be constraints in the mechanism that generates the data, so that the data does not fill the whole space uniformly but is confined to some subspace. A typical example: you have 2D data, but your data is not a full cloud; it lies approximately along a line. This is a typical scenario where the data is not truly high dimensional but lies approximately along one direction where most of the variation happens. If I project the data onto this direction, I see that it varies over a wide range, while along the orthogonal direction it varies over a very small range, which indicates that the data is approximately one dimensional.

And how would I compute these directions? The trick is very simple. Let my data points be x_i for i = 1 to N. I build a matrix that contains the empirical covariances between the various entries of these vectors, so a matrix whose (l, k) entry is, essentially, the expectation of x^l x^k. Here the upper indices are the components and the lower indices label the data points; l and k go from 1 to d, where d is the dimensionality of the data. How do I estimate this covariance matrix? The empirical estimator is a matrix S obtained by averaging the outer products: S = (1/N) Σ_i x_i x_i^T. Each of my x_i's is in R^d, and to me these are column vectors, so each term x_i x_i^T is a d-column vector times a d-row vector, hence a rank-1 d-by-d matrix. I am averaging N of them, and I get the empirical covariance matrix. This describes the shape of the cloud of data points: here we have our cloud, and it's captured by this empirical covariance matrix.
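As a small aside, here is a minimal sketch in Python/NumPy of that empirical covariance, computed as an average of outer products. The elongated toy cloud, the seed, and the variable names are illustrative assumptions of mine, not anything from the lecture's own data.

```python
import numpy as np

# Minimal sketch (illustrative toy data): empirical covariance of N centred points in R^d.
rng = np.random.default_rng(0)
N, d = 500, 2
X = rng.normal(size=(N, d)) * np.array([3.0, 0.3])   # cloud stretched along the first axis

X = X - X.mean(axis=0)                   # centre the data first (see the chat question below)
S = sum(np.outer(x, x) for x in X) / N   # S[l, k] ~ E[x^l x^k]
# Vectorised equivalent: S = X.T @ X / N
```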
Now, if I find the eigenvectors, that is, directions w in R^d such that S w = λ w, so that applying S to w returns the same vector rescaled, then I'm finding the so-called principal directions of variation. I'm looking at the empirical covariance, which tells me how the various entries in the data co-vary, and the eigenvectors of this matrix give me the directions of independent, or so-called principal, variation.

So principal component analysis, simply as an algorithm: the input is N data points x_i; you compute S, the empirical covariance matrix; and then you compute the eigendecomposition of S and keep the eigenvectors with the largest eigenvalues. There is a parameter in PCA, or a hyperparameter, which is the choice of the number k of principal components, and that is something you decide. If you want to visualize your data, typically k is 2 or 3, or you can take more principal components. The way people normally choose it, and I guess this is how Alex introduced it, is to plot the spectrum of the matrix: at position 1 you have the eigenvector with the highest eigenvalue, at position 2 the second one, and so on. Typically you see that the first one has a fairly high eigenvalue; maybe, if the data is truly three-dimensional, you see three eigenvectors with large eigenvalues, and then a tail of very small eigenvalues, which represent the noise in the data. In that case, if you want a reconstruction that is both concise, in terms of extracting few features, and accurate, you would cut after three.

I saw there were a number of chat messages. The first is an easy one: the bottom right corner should read "principal". Yes, it's spelled principal with "-al", not "-le"; it's an error that even many experts make. Principal directions of variation. Let's change it.

"In this case, are you assuming that the average along each component is 0?" Practically, yes; that's why it's an empirical covariance. You first centre the data. I forgot to say that: typically you first subtract the mean and then do PCA, otherwise you would also capture the mean.

"Can you repeat why we count three eigenvalues?" Yeah, that's a very good question. Here I am plotting the eigenvalues, and, as I said, you can think of an eigenvalue as the variance of the data projected onto the one-dimensional axis spanned by the corresponding eigenvector. If I take this empirical covariance matrix and compute w^T S w, then, if w is an eigenvector normalized to 1, I get its eigenvalue; otherwise I get the eigenvalue rescaled by the norm of the eigenvector. So this is really a variance: if I project all my data, as I did here, onto the axis spanned by the eigenvector, then that calculation gives me the variance of the one-dimensional random variable x·w. So what I can do is take a basis of eigenvectors w_1 to w_d and plot their corresponding eigenvalues λ_1, λ_2, λ_3, and so on up to λ_d. What does this plot tell me? It tells me that there is quite a lot of variation along the first principal direction, also along the second, and also along the third.
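As an aside, here is a hedged Python/NumPy sketch of that recipe; the function name, the descending sort, and the use of np.linalg.eigh are illustrative choices of mine, not prescriptions from the lecture.

```python
import numpy as np

def pca(X, k):
    """Plain PCA sketch: centre the data, form the empirical covariance S,
    and return all eigenvalues (descending) plus the top-k eigenvectors."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues descending
    return eigvals[order], eigvecs[:, order[:k]]

# Choosing k: plot the sorted eigenvalues and cut where they drop sharply.
# The fraction of variance captured by the first k is also a useful summary:
#   lambdas, W = pca(X, k=3)
#   explained = lambdas[:3].sum() / lambdas.sum()
```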
Coming back to the eigenvalue plot: I see a conspicuous drop in the amount of variance when I get to the fourth principal direction. What this tells me is that there is a lot of variation going on in the first three directions and very little in the remaining d minus 3, and that is the reason why I would cut at 3. So in practice, either you project to two dimensions because you want to visualize things and plot them, or you compute all the eigenvalues and eigenvectors. Maybe you start with just the eigenvalues, because it is computationally cheaper; you plot them and you hope to see a drop at some stage. Now, the reality is that many times the drop is not as pronounced as you see here. But if there is real structure in the data, if it is well approximated by a linear subspace that captures most of the variance, you should see a drop, and that is where you stop.

"What if the data points look like a circle, such that the eigenvalues are near each other?" Right. If the data points look like a circle, or a sphere, I think you mean, so they have similar magnitude in all directions, then the data is genuinely high dimensional. If, in my example, I centre the data and I'm left with something that looks like a sphere, then the plot of the two eigenvalues would show two very similar values, and that tells me that I should retain both directions. Incidentally, all the eigenvalues have to be non-negative, because this matrix is positive definite or semi-definite, depending on how big N is with respect to d. And the sum of the eigenvalues you retain, as a fraction of the total, is essentially the fraction of the variance captured by those eigenvectors. So in this situation the first three eigenvectors might account for 80% of the total variance and the discarded ones for 20%. If instead the data is approximately spherical, then whatever directions I throw away, I'm losing potentially valuable information, information that is not irrelevant.

"What if the largest eigenvalue is degenerate and has multiple eigenvectors?" Well, in that case you would certainly include all of the eigenspace associated with the largest eigenvalue; there is no good reason to take only a subset of those eigenvectors.

Good. So this is standard PCA, and I hope this has refreshed a little bit what Alex explained earlier on, or yesterday. What about probabilistic PCA? Probabilistic PCA is a relatively recent development, as opposed to PCA, which has been around for well over 100 years. Probabilistic PCA was introduced only about 20 years ago, in a paper by Mike Tipping and Chris Bishop published in 1999 in the Journal of the Royal Statistical Society, Series B. That paper is, I believe, in the references for the course; it's a very clear paper and I really recommend it. The question they asked was: find a model. PCA does not have a model; PCA takes data, processes it by computing the empirical covariance and taking an eigendecomposition, and returns some directions. Can these directions arise from a generative model? Why would we want such a thing?
Well, I'll give you some examples of why in a second. The picture to keep in mind is still the first one: we have a scenario where we think that the data genuinely lies on a low-dimensional subspace, or approximately on one. The observed data, call it y, lives in a high-dimensional space, but we suspect that it is really distributed on a linear subspace of this high-dimensional space. How could we generate data that approximately lies on a low-dimensional linear subspace? What we can do is take a variable x, which lives in a low-dimensional space, say R^q, and map it into the high-dimensional space: take a function f of x, with f mapping R^q into R^d. If we think that the subspace on which the data lies is linear, because maybe the constraints are linear, then the natural mapping is a linear operator W; and since we think it's only approximately linear, we add an error term: y = W x + ε. This matrix W takes a q-dimensional vector and returns a d-dimensional vector, so it has d rows and q columns; my vectors are always column vectors.

Now, if I specify the form of the noise and the mechanism for generating x, then given some x's and some ε's I will be able to generate y's. The PPCA model assumes, first of all, that you have multiple data points which are all i.i.d.; that ε is a spherical Gaussian with zero mean, because it's an error term, so its covariance matrix is a constant times the identity, σ² I_d; and that the x_i are sampled i.i.d. from a spherical Gaussian in q dimensions.

So this is the first example of generative modelling that we really see. We have data that lives in high dimensions, and we postulate that it is generated by the following stochastic model: we first sample, independently, q-dimensional random vectors from a Gaussian with zero mean and identity covariance; we then use a fixed matrix W, which is a parameter of PPCA, to embed them into d-dimensional space; and then we add a little bit of spherical noise, sampled independently. The y's are the observed variables, and the x's are latent, or unobserved, or hidden. It's our first example of a latent variable model. So this is the PPCA model, and in the rest of the lecture we'll analyze it and show that it is indeed very closely related to PCA, but also that it allows you to do things that PCA cannot do.

I saw there was something in the chat: yes, the W is learned, and we'll see what it means to learn it. So let's see how it is learned. We have a conditional model of y given x, W and σ², which tells us that if I know what x is, then y will be centred around W x, with some spherical Gaussian noise: a Gaussian with mean W x and covariance σ² I_d in d dimensions. But how do I find the parameters W and σ²? The general principle is that you maximize the evidence. We have discussed that the Bayesian way of doing things is to average out the bits of the model that you cannot observe; so that is what we need to compute. Once you have averaged, it gives you an overall measure of the fit to the data with the unobserved parts removed.
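Before we compute that evidence, here is a hedged sketch of the generative process just described, in Python/NumPy; the particular dimensions, seed, and noise level are illustrative assumptions only.

```python
import numpy as np

# PPCA generative model (illustrative sizes):
#   x_i ~ N(0, I_q),  eps_i ~ N(0, sigma^2 I_d),  y_i = W x_i + eps_i
rng = np.random.default_rng(1)
N, q, d, sigma = 1000, 2, 6, 0.1
W = rng.normal(size=(d, q))            # d rows, q columns (a fixed parameter)

X = rng.normal(size=(N, q))            # latent spherical Gaussian in R^q
eps = sigma * rng.normal(size=(N, d))  # independent spherical noise in R^d
Y = X @ W.T + eps                      # observations, close to a q-dimensional subspace
```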
So the fit of the model to the data is quantified by this marginal distribution, which is the integral over x of the conditional distribution times the prior distribution on x. This evidence is still a function of W and σ², and it measures the fit of the class of models, linear models with a spherical Gaussian prior on x, regardless of what the x values are. The optimal value of W, and that's how we learn it, maximizes this evidence with respect to W.

But what is this evidence? Let's find out. I could compute the integral directly. Or, as I was showing in the previous lectures on computations with Gaussians, we can observe that since this is a linear model with a Gaussian prior, a linear Gaussian model, the joint distribution, which I've written as a product of a conditional times a prior, is jointly Gaussian in y and x. So when I marginalize out x, the result is still a Gaussian, and to determine which Gaussian, I only have to compute the first two moments: if I know the first two moments of a Gaussian distribution, I know the distribution.

So what is the mean? The expectation of y under p(y | W, σ²). I have to compute the expectation of y, which is W x + ε. That didn't come out very nicely, so let me rewrite it. The expectation is linear, so it's W times the expectation of x plus the expectation of ε. Now, ε is an error term, so its expectation is 0, and x in my model has a zero-mean prior, so that term is 0 as well. So the marginal distribution has zero mean.

What about the variance, which in this case is also the second moment? We need to compute the expectation of y y^T, which is the variance-covariance matrix. To do that, we again plug in the formula: y y^T = (W x + ε)(W x + ε)^T. Breaking out the terms, I get W x x^T W^T, that's the product of the first two terms, plus twice the expectation of W x ε^T, plus the expectation of ε ε^T. The expectation, which is an integral, is linear, so I can bring it inside and pull W and W^T out, getting W times E[x x^T] times W^T, which is W times the identity in q dimensions times W^T. The cross term is 0 because the error is independent of x and both have zero mean. And the last term is σ² times the identity in d dimensions. So the variance-covariance matrix is W W^T + σ² I_d.

So the marginal distribution of y, which, as I said, quantifies the fit of the model to the data once I average out the latent variables, is a Gaussian, as we knew, with zero mean and covariance W W^T + σ² I_d. Excuse me. Yes, please. "Would you please go back to the previous slide?" Yeah, that's an easy question to answer; there you go. Thank you. Let me check: no questions yet on the chat. So this is the evidence, the marginal distribution for y. It's a very simple calculation, very similar to the basic calculations with Gaussians we were doing yesterday. Yeah? Exactly. Yes? "In the previous slide, what happened to I_q?" Yes, what happened to I_q? Well, I just stopped writing it, because W times the identity times W^T is the same thing as W W^T: inserting an identity in the middle of a product of matrices doesn't make any difference, so I don't write it anymore.
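As a quick numerical sanity check of that marginal, here is a hedged Monte Carlo sketch (all sizes and names are my own illustrative choices): the empirical covariance of samples drawn from the generative model should approach W W^T + σ² I_d.

```python
import numpy as np

# Check: under y = W x + eps, E[y] = 0 and Cov[y] = W W^T + sigma^2 I_d.
rng = np.random.default_rng(2)
N, q, d, sigma = 200_000, 2, 4, 0.5
W = rng.normal(size=(d, q))

Y = rng.normal(size=(N, q)) @ W.T + sigma * rng.normal(size=(N, d))
C_model = W @ W.T + sigma**2 * np.eye(d)
C_mc = Y.T @ Y / N                       # Monte Carlo estimate of Cov[y]
print(np.abs(C_model - C_mc).max())      # small for large N
```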
So what we now need to do: if we have data, a set of observations y_i with i = 1 to N, how do we find W? We need to write the evidence as a function of W, σ², and y_1 to y_N. Now, by assumption each y_i has its own x_i, independently sampled, and its own error term, independently sampled, so even when I marginalize out the x_i's, the y_i's remain independent. So the evidence is a product of Gaussian distributions, one per y_i, each with zero mean and this covariance matrix.

Typically we work in log space, so this curly L is not the actual evidence but the log of the evidence. Writing the log of the evidence, the log of the product becomes a sum over data points of the log of a Gaussian of this type. Each of these contains a log-determinant term, minus one half of the log determinant of W W^T + σ² I_d; the one half is there because the normalization constant involves the square root of the determinant. This is the bit that comes from the normalization constant of the Gaussian, which does depend on W and σ², so you have to remember it; and the vertical lines denote the determinant, because these are matrices. Then we have the term coming from the exponent, which is minus one half of y_i^T (W W^T + σ² I_d)^(-1) y_i.

Now I can use a trick: the sum over i of y_i^T (W W^T + σ² I_d)^(-1) y_i is a scalar, so each term can be viewed as the trace of a product of matrices, a 1-by-d matrix times a d-by-d matrix times a d-by-1 matrix. And, if you remember, the trace of a product of matrices is invariant when I permute the matrices cyclically. So the whole sum is the same as the trace of (W W^T + σ² I_d)^(-1) times the matrix Σ_i y_i y_i^T, and that matrix is our old friend: N times the empirical covariance matrix we were talking about before, because this is centred data, so it's the sum of outer products.

So, in summary, our objective function L contains a term which is the log determinant of W W^T + σ² I_d, and a term which is the trace of (W W^T + σ² I_d)^(-1) times N times the empirical covariance matrix. OK, I see a question flashing up: "and divided by 2". Yes, thank you very much, each term carries a factor of minus one half.

The formal proof would now continue by taking the derivatives with respect to W and setting them to 0. If you ignore the σ² I for a second, as if it weren't there, I just want to give you the flavour of the equation: taking the derivative of the log-determinant term gives something like (W W^T)^(-1) times W, and taking the derivative of the trace term gives something like the empirical covariance matrix times W. Setting the sum to 0, say in the case where W is a single vector, gives, up to a normalization term, exactly an eigenvalue equation: it tells me that if I feed W to the empirical covariance, ignoring σ², it returns W multiplied by a scalar.
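To make the objective concrete, here is a hedged sketch of that log evidence as a function (Python/NumPy, my own naming); the formula is the standard zero-mean Gaussian log-likelihood with covariance C = W W^T + σ² I_d.

```python
import numpy as np

def ppca_log_evidence(Y, W, sigma2):
    """Log marginal likelihood of centred observations Y (an N-by-d array):
    L = -N/2 * ( d*log(2*pi) + log|C| + trace(C^{-1} S) ),  C = W W^T + sigma2*I."""
    N, d = Y.shape
    C = W @ W.T + sigma2 * np.eye(d)
    S = Y.T @ Y / N                                  # empirical covariance
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * N * (d * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C, S)))
```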
Excuse me. Yep. "So in the first line of the objective there is an equation, and on the right-hand side the first term is the logarithm of a determinant, right?" Yep. "So it's a number." Yep. "And the second term is...?" A trace; sorry, it's the trace of the whole thing, so it's also a number. OK, thank you.

So you see, if I ignore σ² I for the time being, then taking the derivative of this function L with respect to W and setting it to 0 returns an eigenvalue equation for W. With σ² it's a little more complicated: you have to work through the SVD, the decomposition of S, but the idea is fundamentally the same. "I have a question." Yes, please. "So is S an operator on matrices? If W is a matrix, then this is a matrix multiplication, since we have considered W as a matrix from the beginning." Well, matrix multiplication is both an operator on vectors and an operator on matrices, so yes. OK, thank you.

So what have we discovered? We have discovered that the maximum likelihood solution of PPCA, probabilistic PCA, is PCA. This nice little model, where we generate data from a low-dimensional spherical Gaussian centred at 0, map it with a parameter W into high dimensions, and add a little bit of noise: the maximum-evidence solution of this latent variable model, where we maximize the marginal likelihood, returns a matrix W whose columns are the principal components, the eigenvectors of the empirical covariance matrix.

There are a couple of questions, both good ones. "Can you explain the first derivative?" Yes, right; that's a calculation question. When I take the derivative of the log, I get 1 over its argument, and then I take the derivative of the argument, W W^T, with respect to W, which gives me W, because it's the derivative of a square. So that term is the derivative of the log-determinant. These are all matrices, so the actual matrix calculus is a little fiddly, but, as I said, you can find the proof with all the t's crossed and the i's dotted in the paper by Tipping and Bishop. "Is the objective function the same as the likelihood function?" The marginal likelihood is the objective function in this case.

So the approach we have followed in deriving probabilistic PCA is called an empirical Bayes approach: you formulate the model in terms of probability distributions and latent variables, but to tune the parameters of the model you marginalize the latent variables and optimize with respect to the parameters. It is also possible to be fully Bayesian instead; you could do what is called Bayesian PCA.

"Excuse me, can you explain again why the maximum likelihood solution of PPCA is PCA?" Yeah: because when I maximize this evidence, or marginal likelihood, I get an eigenvalue equation for the empirical covariance matrix, and that is the definition of PCA. PCA, as I introduced it, is just an algorithm: start with your data points, compute the empirical covariance matrix, and compute its eigenvectors. And it turns out that if you define a probabilistic model like this and tune the parameters by maximizing the evidence, or marginal likelihood, you also arrive at the eigenvalue equation for the empirical covariance matrix, which means that the maximum likelihood solution of PPCA is PCA.
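For completeness, here is a hedged sketch of the closed-form maximum likelihood fit as given in the Tipping and Bishop paper: σ² is estimated as the average of the discarded eigenvalues, and the columns of W are the top eigenvectors scaled by sqrt(λ_j − σ²), up to an arbitrary rotation. The function name and NumPy details are my own illustration.

```python
import numpy as np

def ppca_ml(Y, q):
    """Closed-form ML estimate of (W, sigma^2) in PPCA (sketch, following
    Tipping & Bishop): W_ML = U_q (Lambda_q - sigma2 I)^{1/2}, up to rotation."""
    N, d = Y.shape
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / N
    lambdas, U = np.linalg.eigh(S)
    order = np.argsort(lambdas)[::-1]              # descending eigenvalues
    lambdas, U = lambdas[order], U[:, order]
    sigma2 = lambdas[q:].mean()                    # average of discarded eigenvalues
    W = U[:, :q] * np.sqrt(np.maximum(lambdas[:q] - sigma2, 0.0))
    return W, sigma2
```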
So you don't have to do gradient descent on this likelihood or anything like that: you can set the gradient to zero and find the solution in closed form. Now, it's not really unique, because you can rotate the W: if I insert a rotation R here, the objective function does not change. What it identifies is, essentially, the subspace spanned by the k principal components, the eigenvectors of the empirical covariance matrix with the largest eigenvalues.

"Does what you wrote mean that the latent variables have a special, spherical structure?" Yes: if I were to compute the empirical covariance of the latent variables, I would get the identity. What makes the real data have non-trivial structure is the W. It is W that embeds the axes in high dimensions and stretches some directions so that they vary more. In fact, the eigenvalues of the empirical covariance corresponding to the maximum likelihood solution are given by the norms of the columns of W.

So what would I mean by Bayesian PCA? I'll come to that; OK, one more question before we move on. "Does this result depend on the noise having zero mean?" Well, noise is generally assumed to have zero mean. "Could you arrive at the same solution without assuming Gaussian noise?" In general, no. If the question is whether there exists some other form of noise that would still return PCA as the maximum likelihood solution, I don't know the answer; as an existence question, maybe. But in general, if the noise is not Gaussian, you won't get PCA as the solution. PCA comes about, this eigenvalue equation descends from, the fact that the model is Gaussian, so you could write the likelihood in these terms.

"So is PCA only for Gaussian data as well?" It is true that PCA does not explicitly assume that the data is Gaussian. However, at some level it does assume it, because it only characterizes the data in terms of its first two moments. PCA does not have a generative model, but it extracts second-order statistics, so if the data is very non-Gaussian, PCA will not be very accurate. PPCA, on the other hand, does assume that the data is Gaussian, but you can apply it to any data, and it will do exactly the same thing as PCA: regardless of whether the data is Gaussian or not, the maximum likelihood solution of PPCA is given by the principal components.

Richmond asks: "Can we apply PCA to data with a combination of continuous and discrete variables, and to what extent?" You can, but it may not be a terribly good idea, for the practical reason that the scaling of the continuous variables relative to the discrete variables will matter. Say the data is made up of zeros and ones plus a minimal amount of continuous noise: if the amplitude of that noise is very small, the principal components will be dominated by the discrete variables; if, on the other hand, the continuous variables have a much larger variance than the zero-one variables, they will dominate. But in principle you can do it.

So let's get back, in the last five minutes, to why PPCA is preferable to PCA in some ways.
Well, first of all, let me say that PPCA as we derived it is not, strictly speaking, Bayesian, because W is treated as a parameter. You can place a prior on W as well, but then it's no longer analytically solvable: to compute a posterior distribution over W you would need to resort to approximate techniques.

Now, why is PPCA better than PCA in some sense? The first reason, and it is an important one, is that PPCA is applicable also when you have missing data. So the first big advantage is missing data. Why? Suppose that one data point y_i misses entries three and four. Say y_i is in R^6, and this actually happens quite a lot in practice: some entries in high-dimensional data are missing for whatever reason. So this y_i would be (y_i^1, y_i^2, blank, blank, y_i^5, y_i^6). Well, we still have that y_i, conditioned on W and σ², is marginally distributed according to a Gaussian, and we've just computed that it has zero mean and covariance W W^T + σ² I. So if I have incomplete data, all I need to do is marginalize out the missing components in this expression, this Gaussian, to get the likelihood of y_i. Provided that I don't systematically miss the same components, and because it is a correlated model, I am still able to learn W and σ² by marginalizing out the missing data wherever it arises. And that's a big difference, because in standard PCA, if you have missing data, you either throw away the data point, which is always a shame, or you have to impute it, which means finding some clever method, maybe nearest neighbours, maybe Gaussian conditionals, to fill in the missing data. You can have clever methods, but they never reflect the fact that you don't know this data; they replace it with fake data, which is not the most desirable.

Another thing you can do with PPCA that you cannot do with PCA is generate new data. For example, suppose again that I have a y with some missing entries: I observe, say, the first two components and miss the rest. I can compute the posterior over x conditioned on y; in general, for complete data, this is a relatively simple calculation of the type we have done previously. We get a posterior distribution which tells us that x given y is a normal whose mean involves W^T y and whose covariance involves W^T W + σ² I; anyway, exercise: compute the posterior. Now, if y is incomplete, I can still marginalize out the missing components and get an estimate of the most likely x given the bits of information that I have, and I can use that to complete the data: conditioning on what I have observed, I get a posterior and can read off the distribution of the remaining entries. And I can also say: given a y_i and a y_j, I can generate a full trajectory from y_i to y_j by going through x_i and x_j; I can interpolate in a probabilistic way.
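As a hedged sketch of those two operations (both are standard linear-Gaussian conditioning results rather than formulas read off the slide; names and shapes are my own illustration): the posterior over the latent x given a complete y is N(M^(-1) W^T y, σ² M^(-1)) with M = W^T W + σ² I_q, and missing entries of y can be imputed by conditioning the marginal N(0, W W^T + σ² I_d) on the observed ones.

```python
import numpy as np

def latent_posterior(y, W, sigma2):
    """Posterior over x given a fully observed y (sketch):
    p(x | y) = N( M^{-1} W^T y,  sigma2 * M^{-1} ),  M = W^T W + sigma2 * I_q."""
    q = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(q)
    return np.linalg.solve(M, W.T @ y), sigma2 * np.linalg.inv(M)

def impute_missing(y_obs, obs, mis, W, sigma2):
    """Condition the marginal y ~ N(0, C), C = W W^T + sigma2 * I_d, on the
    observed entries (indices 'obs') to get the mean and covariance of the
    missing entries (indices 'mis')."""
    C = W @ W.T + sigma2 * np.eye(W.shape[0])
    C_oo = C[np.ix_(obs, obs)]
    C_mo = C[np.ix_(mis, obs)]
    mean = C_mo @ np.linalg.solve(C_oo, y_obs)                      # conditional mean
    cov = C[np.ix_(mis, mis)] - C_mo @ np.linalg.solve(C_oo, C_mo.T)
    return mean, cov
```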
Excuse me, I had a question. "When you say that we have incomplete data, for example some y that misses some components: in the second case, when we generate, shouldn't we marginalize over the components that we have and then sample the components that we don't have?" No, because you want to condition on the components that you have; you condition on having observed y^1 and y^2. "So we don't actually marginalize, we just fix some of the components?" Yes, of course. To be clear: we marginalize when the question is how to fit PPCA in the presence of missing data; then we have to marginalize over the missing entries. Once we have fitted PPCA and we have this joint distribution for the y's, if I observe y^1 and y^2, for example, then I compute the conditional of y^3, y^4, y^5, y^6 given the values of y^1 and y^2, and that gives me the distribution over the missing data. So marginalizing is needed in order to fit PPCA when data is missing, and conditioning is needed, once we have learned the low-dimensional structure, to impute the missing data. OK, thanks. You're welcome.

So I think we are already over time; we had a lot of questions, which is great. I'll pass the ball back to Matteo, who may want to tell you what's going to happen next. Thank you very much.