So what I would like to talk about today is the data challenge posed by multidisciplinary research, and how we can solve that data problem. As you may have noticed, a lot of research is currently moving towards multidisciplinary research; I think this will be the next generation of research. And as Marie already said, I will focus here on health and well-being. In health and well-being, there are two big questions that researchers usually try to answer. One question relates to diagnosis: who is at risk, for example, for low well-being? The other question relates more to treatment: why is someone at risk? In answering these questions, what we have seen over recent years is that the answers come from different domains, from different disciplines. For example, in accounting for risk factors, one looks not only at lifestyle, but also at environmental exposure or genetic constitution. And the insight is growing that these different factors often jointly come into play. For one person, genetic susceptibility may be the main risk factor, while for someone else the genetic constitution may put him at risk but the environment may protect him from that risk materializing. Also in answering the question about mechanisms, so why it is that someone ends up in a situation of low health, we see that the answers come mainly from the biological and behavioral domains. There is a big rise of bio-behavioral research, where people are really trying to get a feel for the intricacies at play between those two domains. And this is very important for developing multidisciplinary treatment plans, because you want to know who should mainly get lifestyle advice, who should get the right medication, and who should get a combination of both: a change in lifestyle with perhaps some support from medication.
Now, this means that when collecting data, these different views enter. So nowadays you often find big collections of data in which information about the same group of patients is obtained from more than one discipline. Here I use colorectal cancer survival as an example. We not only look at the patients' personality and lifestyle using questionnaire data, which is something psychologists would typically do, but we also look, for example, at their genetic constitution using biomarker data. Because these data represent different disciplinary views, they are called multi-view data. More generally, in statistics and data analysis they are called multi-block data: you have several blocks of data on the same group of respondents, and within each block there is some kind of homogeneity with respect to the information being displayed. These data really pose a challenge. One problem is that you are collecting a mass of information and looking for new types of relations, so there is a discovery component in this type of research: it is difficult to know where to look. Given that so much information is collected, often in an untargeted way (think about genome-wide association studies, where millions of genetic elements are measured simply because they can be measured), it is hard to know where to start looking. So finding the relevant variables through the data analysis method itself is one of the main tasks to be accomplished. Another fundamental problem with this type of data is that the data blocks are very heterogeneous, not only in size: questionnaire data usually involve just a few variables, while the genetic data form a huge block with ten thousand or even millions of variables. This size problem can be solved fairly easily with pre-processing steps.
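One common pre-processing step for this size heterogeneity is block scaling. The sketch below is my own illustration, not the speaker's exact procedure (the function name `block_scale` and the Frobenius-norm choice are my assumptions): each block is standardized column-wise and then divided by its Frobenius norm, so a block with thousands of variables carries the same total weight as a small questionnaire block.

```python
import numpy as np

def block_scale(blocks):
    """Scale each data block so every block contributes equally to
    the total sum of squares, regardless of how many variables it
    contains (illustrative sketch: standardize columns, then divide
    the whole block by its Frobenius norm)."""
    scaled = []
    for X in blocks:
        # column-wise standardization (mean 0, variance 1)
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        # equalize block weight: divide by the block's Frobenius norm
        scaled.append(Z / np.linalg.norm(Z))
    return scaled

# example: a small questionnaire block next to a large biomarker block
rng = np.random.default_rng(0)
questionnaire = rng.normal(size=(100, 5))
biomarkers = rng.normal(size=(100, 10_000))
q, b = block_scale([questionnaire, biomarkers])
# after scaling, both blocks carry the same total sum of squares
print(np.allclose(np.sum(q**2), np.sum(b**2)))  # True
```

Other weighting schemes (for example, dividing by the square root of the number of variables) follow the same pattern; the point is only that block size alone should not determine a block's influence.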
But what we also see is that the correlation within blocks is much stronger than the association patterns between the blocks. So finding these relations between the different views, which is what we are really interested in, is a very difficult task because it is hidden information. In answering these kinds of questions, what people usually do is rely on methods for analyzing a single block of data, either by doing separate analyses on each of the data blocks, which makes it difficult to find the links, or, as is very often done, by simply putting everything together, concatenating the data, and then applying the standard techniques that we have. And this does not work: multi-block methods are really needed to analyze multi-block data. To illustrate this, here I have synthetic data where you have, on the one hand, questionnaire data with fairly strong correlation between all variables, and on the right side biomarker data with two sets of variables that have a weaker, but still notable, correlation. Very interestingly, there is a set of eight variables, three from the questionnaire data and five from the biomarker data, that are correlated, but the correlation is quite weak. If you apply a method meant for concatenated data, for a single data block, so you do not account for the multi-block structure, what you will typically find is either a representation where all correlations are displayed, including the spurious correlations that are in fact not meaningful at all, or, if you try to penalize away the spurious correlations, you will also lose those small between-block correlations that are of interest. So the answer to handling this type of multi-block data, of multi-view data, is to really take the block structure into account.
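To get a feel for this correlation structure, here is a small toy construction of my own (not the speaker's actual synthetic data): each block gets a strong block-specific factor plus a weak joint factor shared across blocks, so within-block correlations come out high while between-block correlations are small but non-zero, exactly the pattern that single-block methods struggle with.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# a weak joint signal linking the two blocks
joint = rng.normal(size=(n, 1))
# questionnaire block: 5 variables sharing a strong block-specific factor
f_q = rng.normal(size=(n, 1))
Q = 0.9 * f_q + 0.2 * joint + 0.3 * rng.normal(size=(n, 5))
# biomarker block: 20 variables with their own strong factor
f_b = rng.normal(size=(n, 1))
B = 0.9 * f_b + 0.2 * joint + 0.3 * rng.normal(size=(n, 20))

# full correlation matrix of the concatenated data
R = np.corrcoef(np.hstack([Q, B]), rowvar=False)
within_q = R[:5, :5][np.triu_indices(5, k=1)].mean()
between = R[:5, 5:].mean()
print(f"mean within-block correlation:  {within_q:.2f}")  # roughly 0.9
print(f"mean between-block correlation: {between:.2f}")   # small, near 0.04
```

With these coefficients, a component method applied to the concatenated matrix is dominated by the two strong within-block factors, while the weak joint signal is exactly the part a multi-block method is designed to recover.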
And this is something on which I have been focusing during the past five years. I developed a principal-components type of model that uses block regularization, so it really tells the data analysis: there is this block structure, and we are interested both in finding the view-specific, block-specific associations and in finding the very small relations between the blocks. Applying this type of method to the synthetic data I just showed you indeed recovers the structure that is in the data. In the top panel I am showing the view-specific mechanisms: there you find that for the biomarker data you indeed have the big group of correlated variables, and for the psychological variables you also find a correlation between all of them. The method also picks up the joint mechanism linking the two data blocks, and that is what is displayed at the bottom. Now, having a quick look under the hood: what is done is telling this principal component analysis that it should penalize coefficients to zero, and in this way you get an automated selection of the relevant variables. But in penalizing to zero we also impose the block structure, so we tell the method that some of these mechanisms are view-specific, meaning it should cancel out the other block completely. Okay, now there are in fact two types of coefficients in principal component analysis, and people are usually not aware that the different formulations of PCA use different types of coefficients; what I call weights and loadings here are usually both denoted in the literature as loadings. As we recently showed, choosing whether you put these zeros on the weights or on the loadings really matters, and if you are really interested in mechanisms, then you should put the zeros on the loadings.
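A minimal sketch of this under-the-hood idea, in my own illustrative alternating-least-squares code (not the published algorithm; `block_sparse_pca`, `masks`, and `lam` are names I made up): loadings are soft-thresholded toward zero for automated variable selection, and a per-component mask cancels out the other block entirely for the view-specific components, while a joint component may load on both blocks.

```python
import numpy as np

def block_sparse_pca(X, masks, lam=0.1, n_iter=200):
    """PCA with block-structured sparse loadings (illustrative sketch).

    masks[k] is a 0/1 vector saying which variables component k may
    load on: view-specific components mask out the other block
    entirely; a joint component may load on all variables."""
    r = len(masks)
    M = np.array(masks, dtype=float).T                 # variables x components
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    T = U[:, :r]                                       # orthonormal scores
    for _ in range(n_iter):
        P = X.T @ T                                    # unconstrained loadings
        P = np.sign(P) * np.maximum(np.abs(P) - lam, 0.0)  # lasso-type shrinkage
        P = P * M                                      # impose the block structure
        U, _, Vt = np.linalg.svd(X @ P, full_matrices=False)
        T = U @ Vt                                     # best orthonormal scores
    return T, P

# toy example: 3 questionnaire variables + 5 biomarker variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X -= X.mean(axis=0)
masks = [
    [1, 1, 1, 0, 0, 0, 0, 0],   # component 1: questionnaire-specific
    [0, 0, 0, 1, 1, 1, 1, 1],   # component 2: biomarker-specific
    [1, 1, 1, 1, 1, 1, 1, 1],   # component 3: joint across both blocks
]
T, P = block_sparse_pca(X, masks)
# component 1 has exactly zero loadings on all biomarker variables
print(np.all(P[3:, 0] == 0))  # True
```

The masks hard-code which mechanisms are view-specific for the sake of illustration; in the actual block-regularized methods that structure is encouraged by the penalty rather than fixed in advance.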
If you are interested in prediction, in answering the who question, then they should be put on the weights. This was published recently; we got the acceptance letter yesterday for a paper written by a data science PhD student. Okay, now to illustrate the type of methods that I developed together with my team. Here I have an example where we are interested in family dynamics, and we have three blocks: questions relating to the mother, answered by the mother; questions answered by the father about his feelings about parenting and his relationship; and questions posed to the child. What is interesting to see is that you get an easy interpretation: many coefficients are just zero, and furthermore you clearly see the block structure, so you immediately see that for the child only a few specific mechanisms are at play. The other example, here on the right, is a prediction example where I used data on vaccination to predict whether a vaccine works well, so I think this is a relevant example for the time being. Using my method, sparse principal covariates regression (SPCovR), the top line in the table, we get an out-of-sample prediction that is much higher than for other state-of-the-art methods. What is interesting here is that I had 55,000 predictor variables, and the method retained only 200 of them. Because these are genetic data, they could be annotated with a tool, and this showed that the selected genes formed a very interpretable set, all relating to viral sensing and the immune system.
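The weights-versus-loadings distinction mentioned above can be made concrete in a few lines (a generic PCA sketch, not code from the paper): weights map the variables to the component scores, T = XW, which is what prediction needs; loadings reconstruct the variables from the scores, X ≈ TPᵀ, which is what you interpret as mechanisms. In plain unpenalized PCA the two coincide, which is exactly why it is easy to overlook that imposing zeros on one or the other gives different methods.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
X -= X.mean(axis=0)                      # column-center the data

U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt.T[:, :2]                          # weights: scores are T = X @ W
T = X @ W
P = X.T @ T @ np.linalg.inv(T.T @ T)     # loadings: least-squares fit of X ≈ T @ P.T
# in unpenalized PCA, weights and loadings coincide ...
print(np.allclose(W, P))  # True
# ... so only under sparsity penalties does it matter whether the
# zeros sit on W (prediction) or on P (interpreting mechanisms)
```

Once a sparsity penalty is added, zeroing entries of W changes how scores are computed from new data, while zeroing entries of P changes which variables each component is said to describe, matching the who versus why distinction in the talk.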
Okay, so to conclude: the major challenge for multidisciplinary research is to deal with multi-block data in an appropriate way. For principal component analysis we have done work that does the job, and for clustering other people are working on similar approaches that also work, but for many methods developments are still needed. I think this is one of the main fields of data analysis where still a lot of work needs to be done, and I also see that the number of papers is increasing very rapidly, so this is becoming one of the main tasks for people developing novel statistical methods. If you are interested in applying this type of structured PCA method, I have both an R package and freely available code on GitHub. Thank you for your attention.