Okay, so good afternoon everybody. My name is Alessandra Menafoglio. I work at MOX, in the Department of Mathematics of Politecnico di Milano. The picture that you see here is the central entrance of the Politecnico di Milano, which is also what I have behind me. Today I am giving an overview of an approach, that of object-oriented data analysis, to deal with data that are density functions. I will try to give you some methodological contributions, but also to keep a focus on the applications, so how to deal with actual case studies.

Now, why should we care about the analysis of complex data? Well, complex data are everywhere, and basically everything we do in our life produces data. We can say that we are living in a digitalized society and in data-rich environments. Even data that may seem simple at first sight might be better represented through exhaustive summaries. So even if our actual observation is a very simple object, it could be that we need to put the data together in order to have something more compact, but exhaustive on the phenomenon. And here comes the presence of PDF data, that is, data that are represented as distributions.

Now, I will work on an example which is, unfortunately, very close to our daily life in the last year: an example on densities of mortality, which were clearly perturbed in 2020 with respect to previous years. Even if the data that we have at our disposal are just counts of deaths, we need in some way to summarize these data in order to provide more compact information and to try to study the dynamics of the phenomenon that we are observing. What we see here are the mortality distributions in the provinces of Italy along the years from 2017 to 2020. It is clear that there is a dynamic here, and what we want to do is to understand which methodologies allow us to study this kind of complex data — to study the data along time, but also along space. In fact, spatial dependence may play a major role in our analysis. Again, on the same example as before, we have here mortality data that are distributed in space, and we cannot just forget about the distribution of the data in space, because this can provide very informative insight on the context that we are studying. For instance, in this case, we might be interested in studying not only the temporal dynamics of these data, but also the spatial structure that arose in 2020 and was not present in previous years.

Other types of data that may be collected in the environmental setting are data related with the hydrological properties of aquifer systems. These are, let's say, classical data in my presentations. I will not focus on them today, but it is just another example of simple data at the beginning that might be summarized by more complex data. What you see in the slide are particle-size data: the raw data at the beginning of the analysis are the dimensions of the particles found in soil samples, and they can be summarized into distributions — particle-size distributions — so exhaustive summaries that may present a spatial dependence.

Now, what is the problem of classical approaches when dealing with complex data? Well, the problem is that most of the time they try to reduce the complexity of the data and to analyze only part of the information that is available.
So this is just a schematic representation of what a classical approach would do: starting from a data set of complex objects, a classical approach would try to reduce it to a simpler setting — for instance just keeping the colour of the curves, so the grouping, or just some non-exhaustive summary — and then would try to make predictions or analyses of these summaries. Clearly, in this process, a lot of information is lost. Instead, the idea of the object-oriented approach is to try to keep the entire information, and so to use as the core of the analysis the entire object, and not only some summary, some feature of the data. The effort is to exploit the entire information content that is within the data.

So, the main idea, the foundational idea behind object-oriented data analysis, is to deal with data as objects. The atom, the building block of the entire analysis, is no longer some collection of features, but the entire object, which is considered as a point within a mathematical space. Now, this is classical in multivariate data analysis, where we deal with vectors in Euclidean spaces. In these spaces our points are vectors, say of dimension two, but in general of dimension higher than two, where it is clear how to define the sum and the product by a constant, and so linear combinations of the data. It is also clear how to measure the norm of a vector — how big a point in this space is, as the length of the vector that points from the origin to that point — as well as how to measure distances, and so similarity between points, and how to measure angles between vectors in the space.

Now, the idea behind object-oriented data analysis, and before that behind functional data analysis, is to deal with complex data by embedding them in the generalization of Euclidean spaces to the infinite-dimensional setting. A convenient generalization is that of a Hilbert space. In the Hilbert space embedding, we imagine our data as points within a mathematical space, and this mathematical space is assumed to have the same kind of geometrical structure as the usual Euclidean space. In particular, we have the concept of sum, we have the concept of product by a constant, and so we can take linear combinations of the data. But we also have the concepts of norm, of distance, and of angle. This allows us to build statistical procedures that generalize those in use in multivariate data analysis. And it turns out that all of this comes from two concepts: the operations that provide us with a vector space, and the inner product that allows us to compute angles, distances and norms, and so gives us the metric of the space.

Now, the key point — similar to compositional data analysis for multivariate data — is to understand which is the correct mathematical space where to embed our data. Just as in the multivariate setting, where compositional data do not live in the entire space but in the simplex, when we deal with probability density functions we have to be concerned with the nature of the data. We have to choose the correct embedding for our data, which may not be the space of square-integrable functions, L2, the typical space where functional data analysis is done.
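To make that last point concrete: once the operations and the inner product are fixed, the whole metric structure follows. This is textbook Hilbert-space material, not specific to this talk, written with a generic difference ⊖ so that the same formulas hold verbatim once the Euclidean operations are replaced by the Bayes-space ones:

\[
\|x\| = \sqrt{\langle x, x\rangle}, \qquad
d(x, y) = \|x \ominus y\|, \qquad
\cos\theta(x, y) = \frac{\langle x, y\rangle}{\|x\|\,\|y\|}.
\]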
So the key idea for doing functional data analysis and object-oriented data analysis of probability density functions is to understand which is an appropriate embedding space for the data. Now, an appropriate embedding space comes from the generalization to the functional setting of ideas that we probably all know from the multivariate setting. Just let me recall this, because I am the first speaker, so I will recall a few concepts that are, I believe, known to everybody.

If we have a discrete PDF of d components, we can represent this PDF as a multivariate composition: a vector with d components, where the components are positive and sum up to a constant, typically set to 1. In the multivariate setting, each element of this composition represents a part of a whole, according to a given partition of the domain. And we are used to saying that the informative part of the composition is not the absolute amount contained in the parts, but the relative amounts, so the log-ratios among the parts. In a three-dimensional setting, so with d equal to 3, the situation is depicted here. The data object in this case is made of three parts. Now, the y-axis is not particularly informative, but what we know is that this three-part composition will not belong to the entire space R^3, but to a simplex, and if we fix the constant to 1, the simplex is the one represented here. What we usually do in compositional data analysis is to represent the data according to a geometry which is well suited to this simplex, and this geometry is the Aitchison geometry.

Now, if we go to PDFs, the reasoning is actually the same, but we have to pass to the continuous case. Continuous PDFs can be seen as functional compositions, so as relative objects of infinite dimension that obey similar constraints: they must be positive and they integrate to a constant. And just as in the multivariate setting, what we actually want to do is to analyze them not in the entire space — which would usually be L2 — but within a geometry that well represents the characteristics of the data. This geometry is the generalization of the Aitchison geometry, called the Bayes Hilbert space geometry, which was introduced in these two works. Basically, it provides operations and an inner product that make the infinite-dimensional simplex a Hilbert space. In this space we are free to take linear combinations of density functions, as well as inner products between density functions.

I will not spend much time on this, because it has already been discussed in previous editions of CoDaWork, but just let me remind you that, similarly to the Aitchison geometry, here the inner product is not defined in terms of the absolute parts of the composition — that is, point evaluations of our distributions — but always in terms of log-ratios among point evaluations, in the same spirit as the Aitchison geometry. There are several reasons why this is a very convenient geometry, and we can find very nice interpretations in mathematical statistics, for example in connection with exponential families; perturbation is here interpreted as a Bayesian update of information, similarly as in the multivariate context. Now, what is interesting to us today is: what is the general strategy for an analysis in the Bayes space?
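For completeness, here is the standard form of these objects as found in the Bayes-space literature, for densities on a bounded domain I of length η with the Lebesgue measure as reference — a recap rather than anything specific to this talk. Perturbation and powering play the role of sum and product by a constant, and the inner product can be written either through double log-ratios or through the centred log-ratio (clr) transform:

\[
(f \oplus g)(t) = \frac{f(t)\,g(t)}{\int_I f(s)\,g(s)\,ds}, \qquad
(\alpha \odot f)(t) = \frac{f(t)^{\alpha}}{\int_I f(s)^{\alpha}\,ds},
\]
\[
\langle f, g\rangle_{B^2}
= \frac{1}{2\eta}\int_I\!\int_I \log\frac{f(t)}{f(s)}\,\log\frac{g(t)}{g(s)}\,dt\,ds
= \int_I \operatorname{clr}(f)(t)\,\operatorname{clr}(g)(t)\,dt,
\]
\[
\text{where } \operatorname{clr}(f)(t) = \log f(t) - \frac{1}{\eta}\int_I \log f(s)\,ds.
\]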
Well, what we can do is to take our PDF data, embed them in this space, and then formulate appropriate methods to deal with the analysis of the data and of their possible spatial dependence, if present. So, let me give you an example of an analysis in a Bayes space, just to take the opportunity to illustrate a number of methodologies that can be applied in a real and, let's say, modern data analysis. I get back to the data that I presented in the first slide, to discuss a possible analysis of this data set which tries to use elements of the theory of Bayes spaces that is available today.

I just recall that our data represent the mortality distributions along the year, from 2017 to 2020. Each curve that you see represented in grey is the distribution of mortality along the year in a province of Italy, and we have 109 provinces. The data here are interpreted as functional compositions and embedded in a Bayes space.

Before going on, let me say why I believe it is meaningful to interpret these data as PDFs, and so as functional compositions. The crucial point is that these are the raw data, the data that we get from our National Institute of Statistics: counts of mortality, that is, death counts on each day of the year in each province. If we were just to look at the data as they are, on an absolute scale, what we would see would just be high mortality in big provinces and relatively lower mortality in other provinces, and it would be very difficult to highlight the dynamics that are interesting for our study of mortality in 2020. For instance, we clearly see here a province, Bergamo, that was severely hit by the COVID pandemic, but in the absolute counts it would be somewhat masked by the counts of other cities like Milan, which were severely hit, but not so much on a relative scale. So what we really want to do is to go to a relative scale, to understand and appreciate the relative impact of mortality in 2020 with respect to the usual mortality, or to the mortality in the other periods, within each province. This relative scale is the one that really allows us to capture the dynamics of the phenomenon.

Now, the first question that we want to answer is how much of this mortality was predictable from previous years, because we clearly see in the yearly dynamics that there are peaks in mortality that we can somehow expect. These peaks are usually related with the winter season, when we have the seasonal diseases, and then there is a peak in the summer season, in the hottest period of the year. I forgot to say that we are focusing on the elderly population, on people older than 70, so this dynamic is more evident here, whereas if we were focusing on the younger population the mortality distribution would be flatter, luckily also in 2020.

Now, what we want to understand is what was anomalous in 2020. Is the peak that we see anomalous with respect to previous years? Well, the answer is clearly yes. But how much? So what we do is to formulate a linear model to describe the yearly dynamics of our mortality distributions. This linear model can be represented in the Bayes space B2 in this way: the mortality density in a year is represented through a linear model with two coefficients, beta 0 and beta 1, and a single regressor, which is the average mortality density over the previous years.
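Written out, the model described verbally above takes, in one plausible rendering (the talk only shows it on a slide, so the exact form of the coefficient term is my reading rather than the authors' formula, with β1 taken as a scalar acting through powering), the form

\[
y_{i,t} \;=\; \beta_0 \,\oplus\, \big(\beta_1 \odot x_{i,t}\big) \,\oplus\, \varepsilon_{i,t},
\qquad
x_{i,t} \;=\; \tfrac{1}{4} \odot \big(y_{i,t-1} \oplus y_{i,t-2} \oplus y_{i,t-3} \oplus y_{i,t-4}\big),
\]

where y_{i,t} is the mortality density of province i in year t, ⊕ and ⊙ are the Bayes-space perturbation and powering, x_{i,t} is the B2 average of the densities of the previous four years, and ε_{i,t} is a B2-valued error term.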
So this average mortality density is the average, over the four years before, of the mortality densities. If I want to see what is the predictable component in 2020, I fit the linear model on that year and I look at what is left out of the linear model. Let's say that this part here is the predictable component, whereas the error of my model is the unpredictable component, and so the anomaly that I can see in the year 2020.

How do we estimate this model? Well, the typical way to estimate this kind of model is to perform a dimensionality reduction through principal component analysis, keeping a relatively high number of principal components, and describing the betas on the basis of the principal components. So here, as I said, this is the unpredictable component in my process, and it is what I can use to study what is left once I try to predict the usual component with my model. And what is left is here: if I apply the model in 2017, based on the previous four years, what is left is relatively flat — there may be some small component, a small peak which is not explained, but basically it is almost flat. What we see, instead, is that in 2020 something has changed, and it has changed a lot. And if I look at the norm of this error — how big this error is, so its norm in B2 — and I make a spatial map, what I get is something like this. The errors appear almost spatially uncorrelated in the previous years, whereas in 2020 the error seems to have a spatial structure. The Italian people that are connected will recognize here the provinces that were hit by the first pandemic wave: Bergamo, Brescia, Lodi and Mantova. This was the very first hotspot found in Italy.

Now the point is: how can I formally quantify the spatial structure, how can I describe in B2 the spatial structure that is present in these residuals from the model? To do so I can use the methods of object-oriented spatial statistics. Object-oriented spatial statistics formulates models from spatial statistics for data in a Hilbert space, and in particular I will refer here to the Bayes Hilbert space B2.

Now the key idea — yes? Sure, Alexander, yes, about five minutes. Okay, I will try to cut this short and come to a conclusion, thank you. So, the key point here is that we have spatial dependence, and we try to measure it by measuring how similar objects that are collected close by are in the Bayes space B2. The first point is to measure the spatial dependence; the second point is to understand how to produce spatial maps from this spatial dependence. To do so we generalize — this is something that I already presented in a previous edition of CoDaWork — the concept of spatial dependence, and the measure of spatial dependence is the variogram, here the generalization in B2 of the classical variogram that those of you who are more familiar with geostatistics will clearly recognize. If we do so, and we plot the variogram in B2, we recognize that a spatial structure indeed came out in 2020 and is clearly visible, whereas in previous years the spatial structure was mostly related with the nugget effect. So in 2020, not only do we see that there is something different in the dynamics, in the shape of the residuals, but we can also appreciate that a spatial dependence is coming out.
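To make this pipeline a bit more tangible, here is a minimal numerical sketch — not the code behind the talk, and with hypothetical function names — of two of the ingredients just described: the B2 norm and distance of residual densities, computed through the clr transform, and a method-of-moments empirical variogram built from pairwise squared B2 distances between residuals at different sites. It assumes each density has already been evaluated on a common grid of day-of-year values.

```python
# Minimal sketch (assumptions: densities strictly positive, evaluated on a common grid t).
import numpy as np


def _trapz(y, x):
    """Trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def clr(f, t):
    """Centred log-ratio transform of a strictly positive density on grid t."""
    logf = np.log(f)
    eta = t[-1] - t[0]                       # length of the domain
    return logf - _trapz(logf, t) / eta      # subtract the mean of log f


def b2_norm(f, t):
    """Bayes-space norm: sqrt of the integral of clr(f)^2."""
    return np.sqrt(_trapz(clr(f, t) ** 2, t))


def b2_dist(f, g, t):
    """Bayes-space distance: L2 distance between the clr transforms."""
    return np.sqrt(_trapz((clr(f, t) - clr(g, t)) ** 2, t))


def empirical_variogram(residuals, coords, t, bin_edges):
    """Method-of-moments variogram estimate for density-valued residuals:
    half the average squared B2 distance between residuals whose sites fall
    in each spatial-lag bin."""
    n = len(residuals)
    lags, sq_dists = [], []
    for i in range(n):
        for j in range(i + 1, n):
            lags.append(np.linalg.norm(coords[i] - coords[j]))
            sq_dists.append(b2_dist(residuals[i], residuals[j], t) ** 2)
    lags, sq_dists = np.asarray(lags), np.asarray(sq_dists)
    which_bin = np.digitize(lags, bin_edges)
    gamma = [0.5 * sq_dists[which_bin == k].mean()
             for k in range(1, len(bin_edges)) if np.any(which_bin == k)]
    return np.asarray(gamma)
```

A parametric variogram model would then be fitted to these binned values, and that is where the nugget effect mentioned above shows up.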
And to represent and interpret the spatial dependence, we may look for the directions of main variability of our data set and try to give spatial maps of the principal components. To put it in a figure, it is like looking, in our infinite-dimensional simplex, for the directions of main variability, and projecting our data on these directions. If we do so, what we get is something like this. This is a representation of the first principal component, and this of the second principal component, where in black we see the mean of the 2020 errors, in red we see how we move if we proceed along the principal component in the direction of high values, and in blue how we move in the direction of low values. We see that high scores along the first principal component highlight high values of a peak — a critical peak in March–April, which was the first wave of the pandemic — whereas low values correspond to places that were not hit by the first wave. For the second principal component, we clearly see that if we move towards positive values of the scores we move towards the second wave, whereas towards negative values we move basically towards a flatter situation. In this way we are able to represent our densities through maps that clearly show the spatial structure of our data. And so, again, we see the regions associated with the first pandemic wave here, and the regions most associated with the second wave of the pandemic, which are not the ones that had already been hit by the first one.

I will close with this, and I will just leave you with the curiosity about how to deal with the phase variability in the data. You see that there might be a shift in the data, and actually this shift might also be captured by using the Bayes space methodology. I don't have time to enter into this, but I will share my slides, so you can find more on this at the end of my slides.

So let me jump to the conclusions. Georeferenced complex data arise very often in real case studies, and distributional data are more common than one may expect, because they are very often obtained as summaries of data that would otherwise be intractable. Bayes spaces are a natural embedding space for these data, and actually we should not think of Bayes spaces as spaces only for PDFs, because they are very rich spaces that can be used to deal with any kind of relative data, and so in particular also with data representing the phase variability. This Bayes space methodology is, let's say, a very broad field of research, and there is still much work to do, so if you are curious about that I will be happy to give you more references and to discuss further developments with you. So thanks a lot. Let me use just 30 seconds to thank all the organizers for this beautiful idea of organizing this online meeting, and for giving me the opportunity to be one of the speakers. Thanks a lot.