Good, welcome back from the coffee break for the final session of today. Margette is an ESR working with Chloé Azencott, who unfortunately cannot be here due to sickness, but we are very happy to have Margette here to report on her work. The floor is yours.

So hi everyone, I'm really happy to be here to present part of my PhD project. I am Margette, ESR 11, and I'm supervised by Chloé Azencott, who is not here today. I'm a PhD student at Mines ParisTech in Paris and at the Institut Curie, and today I'm going to talk about learning from multimodal data to improve cancer treatment. The presentation is divided into four parts: first I will talk about the context of the study, then I will present the methods, then the results, and at the end the conclusion and perspectives.

As general context, today we are going to talk about breast cancer. Breast cancer is the most common cancer among women, and it is a highly heterogeneous disease, with different molecular subtypes. The first one is called luminal A. It is characterized by the presence of hormone receptors, estrogen and progesterone, and the absence of HER2. These hormone receptors control the growth of the cancer cells. Luminal A is the most common subtype among breast cancer cases, and it has a good prognosis because the hormone receptors give us a target for the treatments. The second subtype is the human epidermal growth factor receptor 2 positive (HER2-positive): when it is combined with the absence of the hormone receptors it is called non-luminal, and when it is combined with their presence it is called luminal B. It represents about 15% of breast cancer cases, and here also the prognosis is quite good because of the advent of HER2-targeted treatments. The last one is called TNBC, or triple-negative breast cancer. Here we have neither the hormone receptors nor HER2, so it has a bad prognosis, because we don't have a clear target for the treatment.

Depending on these molecular subtypes, we have different prognoses, but we also have different treatment strategies. Clinicians use a multidisciplinary approach, a combination of therapies, and at the institute we usually use a combination of chemotherapy and surgery. Depending on which one comes first, we have two types of chemotherapy: neoadjuvant and adjuvant. The neoadjuvant chemotherapy is given before the surgery, and the aim is to reduce the tumor size so that the surgery can be less invasive. For the adjuvant chemotherapy, the aim is to kill the remaining cells after the surgery.

After those treatments, how can the clinician say that the treatment has gone well? We measure several endpoints. The first one is called pathological complete response, defined as the disappearance of the invasive cancer cells after the completion of the neoadjuvant chemotherapy. We also have two survival endpoints: overall survival, which is the length of time from the date of diagnosis or of the first treatment during which the patient is still alive, and relapse-free survival, the length of time during which the patient is alive without any sign of relapse.
In that context, we present the Preco-Multimod project, which relies on different databases. The first one is called the Son database; it contains information related to adjuvant chemotherapy and covers 15,150 patients. We also have the Alias database, which contains all the free-text reports written by the clinicians. For this project we extracted from it the records that belong to the same patients as in the Son database. So we have a multimodal dataset, with structured data from the Son database and free-text data from the Alias database. The objective of Preco-Multimod is to identify and evaluate prognostic factors using this multimodal dataset, and here I use the relapse-free survival status as the endpoint for my prediction.

Regarding the methods, here is the pipeline: first analyse and extract the data, then do the preprocessing, then the multimodal learning, and finally the interpretation and evaluation of the prognostic factors.

First, let's talk about the data from the two databases. From the Son database I have structured data: clinical information, with 162 features, and biological measurements, which include tumor markers and other biological markers; these are sequential features. From the Alias database I have free-text reports, and each patient has between six and around 600 reports in their record.

For the preprocessing, on the structured data I kept the clinical features with the most available values, and for the biological measurements I extracted statistical features (the mean, the variance, the maximum, etc.) as well as other features such as alerts, since we know the normal range of each biological measurement. For the free-text reports, I used the TF-IDF of the bigrams for the prediction: the TF-IDF measures the frequency of a bigram and its importance within the corpus. I was also able to extract another modality, the frequency of events: since we have the name of the event in each patient report, I could count the occurrences of each unique event per patient.

Now for the methods: I'm doing multimodal learning, and integration methods are important for multimodal learning. The first integration method I'm going to present is called early integration: we concatenate all the modalities into one big input, then apply the machine learning method, and then do the prediction and the interpretation. For late integration, we apply machine learning to each modality separately, and the prediction is a majority vote between the three machine learning models.

The first model I used was the random forest. As you can see, I have a highly imbalanced dataset, so I used the plain random forest as the baseline and combined the random forest with resampling methods. The first one is the balanced random forest: during the construction of the forest, the majority class is down-sampled so that each bootstrap sample is balanced. For the SMOTE random forest, SMOTE over-samples the minority class by creating synthetic samples that look like the minority class. I also combined the balanced random forest and the SMOTE random forest.
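To make the setup above concrete, here is a rough sketch of early and late integration with the resampling variants, assuming scikit-learn and imbalanced-learn are used; the random matrices, the report strings and the validation-score weights are purely illustrative stand-ins, not the project's actual data or code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Hypothetical stand-ins for the three modalities (one row per patient).
rng = np.random.default_rng(0)
X_struct = rng.random((200, 162))                  # structured clinical + biological features
X_events = rng.random((200, 50))                   # frequency-of-events features
reports = ["chimiotherapie neoadjuvante bien toleree"] * 200  # one concatenated report per patient
y = rng.integers(0, 2, size=200)                   # relapse-free survival status

# Bigram TF-IDF for the free-text modality.
X_text = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(reports).toarray()

# --- Early integration: concatenate all modalities into one input matrix. ---
X_early = np.hstack([X_struct, X_events, X_text])
baseline_rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_early, y)

# Resampling variants used to tackle the class imbalance.
balanced_rf = BalancedRandomForestClassifier(n_estimators=500, random_state=0).fit(X_early, y)
smote_rf = Pipeline([("smote", SMOTE(random_state=0)),
                     ("rf", RandomForestClassifier(n_estimators=500, random_state=0))]).fit(X_early, y)

# --- Late integration: one model per modality, then a weighted vote. ---
modalities = [X_struct, X_events, X_text]
models = [RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y) for X in modalities]
weights = np.array([0.80, 0.70, 0.75])             # e.g. each model's validation score
proba = sum(w * m.predict_proba(X)[:, 1]
            for w, m, X in zip(weights, models, modalities)) / weights.sum()
late_pred = (proba >= 0.5).astype(int)
```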
I then made predictions on the test set to measure the performance and identify the predictive features. Another model I used was a neural network. I used the same training and validation sets so that the models can be compared afterwards, did the hyperparameter tuning with Keras Tuner, and trained with balanced batches to tackle the imbalanced dataset.

Another important part of this project is the interpretability. For the random forest, I use the built-in random forest feature-importance algorithm. For the feed-forward neural network, I use SHAP and LIME. LIME is similar to SHAP: it gives a local interpretation by perturbing an instance and learning an interpretable model locally around the prediction. Both methods perform local interpretation (SHAP can also give a global one, but here I only take the local interpretations into account), and I then did a global aggregation of those local explanations.

For the global aggregation, the first method is the LIME aggregation, proposed by the authors of the LIME algorithm: the importance of feature j is the square root of the sum of its attributions over all instances i. On text data this first method can be biased, because a word that occurs many times will get a larger aggregation score; that is why I also used a second method, the LIME aggregation averaged over the occurrences of the word, since I apply the aggregation to text data. The third method of global aggregation is the average attribution, the mean attribution over the n instances; it is the one typically used for SHAP, for example. I evaluate all these global explanations using the AOPC (area over the perturbation curve), which plots how the score changes when we remove the supposed top features given by the global aggregation.

Here are the results of the models; I'm showing the mean scores. For the early integration, the scores are similar, but the baseline, meaning the random forest without any sampling method, outperforms all the others. For the late integration, the same thing happens on the structured data with the baseline random forest without any sampling method, and the same also holds for the frequency-of-events and text data. For the late integration we take the majority vote between the three models, on the structured, text and frequency data; here I performed a weighted late integration, where the contribution of each model is proportional to its score, and we end up with a score of about 0.79, which is quite good. When we compare the early integration to the late integration, they are similar in terms of F1 score, but the early integration has a slightly higher ROC curve.

Now for the interpretation, I'm showing the most important features from the random forest built-in algorithm, and here the results of the global aggregation: the first one using the LIME importance, then the average importance, and then the average attribution. I computed them for LIME and for SHAP.
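To make the three aggregation rules and the AOPC evaluation more concrete, here is a small numpy sketch of how they could be computed. This is my reading of the description above rather than the project's exact formulas; W[i, j] stands for the LIME or SHAP attribution of feature j on instance i, X is the corresponding feature matrix, and the model and feature ranking passed to the AOPC function are whatever is being evaluated.

```python
import numpy as np

def lime_importance(W):
    # Method 1 (global importance from the LIME paper): square root of the
    # summed absolute attributions of feature j over all instances i.
    return np.sqrt(np.abs(W).sum(axis=0))

def lime_importance_per_occurrence(W, X):
    # Method 2: method 1 divided by the number of instances in which the
    # feature (word) actually occurs, to keep frequent words from dominating.
    occurrences = np.maximum((X != 0).sum(axis=0), 1)
    return lime_importance(W) / occurrences

def average_attribution(W):
    # Method 3: mean absolute attribution over the n instances
    # (the aggregation commonly used to summarise SHAP values).
    return np.abs(W).mean(axis=0)

def aopc(model, X, ranking, k_max=10):
    # Area over the perturbation curve: average drop in the mean predicted
    # score as the supposedly most important features are removed one by one.
    base = model.predict_proba(X)[:, 1].mean()
    X_pert = X.copy()
    drops = []
    for j in ranking[:k_max]:
        X_pert[:, j] = 0.0
        drops.append(base - model.predict_proba(X_pert)[:, 1].mean())
    return float(np.mean(drops))
```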
Here, it's maybe not very readable, but I'm showing the ten bigrams that all the aggregation methods rank as the most important ones. I also compared SHAP to LIME to see whether there are bigrams in common, and there are indeed bigrams that appear in both SHAP and LIME, such as certain doctors' names, and other bigrams that are really relevant from a medical point of view. To evaluate the global aggregations, I plot what happens when we remove the top features given by each method; the dotted line is the baseline, where we remove a random feature. We can see that the LIME importance and the average attribution give good results; only the average importance actually increased the performance when its top features were removed.

As conclusion and perspectives: the first perspective, which I am currently working on, is to try a more complex model in order to improve the performance. I'm currently working on BERT, adapting BERT to electronic health records. From the electronic health record we can build a sequence of events that corresponds to the history of the patient, where a token is an event that happened on a given day for a patient. That's the perspective for now. As conclusions: for the random forests, the resampling methods did not improve the scores, so the plain random forest is quite robust on the data I have; the different integration methods used are similar in terms of scores; SHAP and LIME show similarities regarding the predictive features for the text reports; and the global aggregation methods used work in general, according to the AOPC. Maybe I should try other interpretation methods, and with BERT I would have additional information, such as attention scores for the events, so I would be able to compare the attention scores with what I obtained from the global aggregation methods.

That was it for me today. I would like to thank my supervisor Chloé, some colleagues here, all the CBIO members, my lab members, and all the ITN members. Thank you for your attention, and if you have questions, I'll be happy to take them.

Thank you, Margette, for this talk. Now we have time for questions. Jan is first.

Thank you, Margette. I found the presentation super clear, so thank you. I have a couple of questions, actually. For the first one: in your use case, it seems that early integration and late integration perform quite similarly, maybe early integration slightly better. Do you think that this is context-specific, or which integration method would you recommend, and why?

Actually, I would recommend both methods, to be honest. For early integration, the advantage is that all the modalities work together to give the best prediction. For late integration, the fact that we train a machine learning model for each modality means each model is specific to one modality, so it can work as well. Both of them work; I would recommend both, because each has its advantages, and it will just depend on the data you have.

I see, thanks. And my second question was: do you have any idea for an intermediate integration method?

We thought about an intermediate integration method; I haven't done it yet. It would be based on kernels.
It would be multiple kernel learning, where each kernel learns one modality. I haven't done it yet, but it's an option, to try an intermediate integration method as well.

Okay, thanks. Thank you.

I have a short question. Great talk, thank you. Have you tried methods like stacked generalization for late integration? It's an old method, described I think in the 80s, where a learner learns from the predictions of the input learners. It can be a neural network, for example, or any sort of algorithm, because you did bagging, if I understand correctly.

Yeah, no, I haven't tried that yet, but I would like to see whether it would have an impact.

Maybe as a comment: if you do early fusion, there is a certain risk in clinical reality that if you lack data, the model is not applicable anymore, because you need to have all the data, otherwise you cannot apply it. Yeah. Thank you for the questions.

Thank you for the talk. I have a very quick question, more out of curiosity, on the aggregation methods for the local explanations. Could you explain the third one again? I'm not sure I understood correctly how it works. Do you just average the features over the dataset and then use that?

I just sum all the weights, sorry: I sum the weights of a feature over all the local explanations, over all the instances in which we see the feature. That's the method that SHAP uses, and when I compare with SHAP we get the same top features.

Oh, okay. Okay, thank you.

A quick question: when you are comparing the area over the perturbation curve, maybe it would be interesting, just for the feature analysis, to recompute the relevances for the new model when you remove a feature, for example in the base random forest. What could be happening is that the random forest just picks the top feature, and when you remove it, it picks another feature that is highly correlated with the one you removed, because random forests are not very good at attributing relevances fairly. And maybe SHAP, if you are using something like a perturbation approach, is distributing the relevances across the correlated features more fairly than the random forest, or something like that. Okay. Okay.

Hi, thank you, great presentation. Just a quick question: are you also considering integrating images, like mammography, at some point?

For now, I have 15,000 patients and I have mammography for only 300 of them, so it would be complicated to do deep learning with only 300 patients. We are looking for multimodal datasets that include images and have more instances, but for now we are going to stop with the text and the structured data, because we don't have enough mammography images. Thank you. Okay.

Thank you, really nice talk. I was just a bit curious, because you said that you include the doctor's name or identifier as a feature. Doesn't that introduce some sort of bias? Because there is an inherent bias in how doctors evaluate, and... yeah.

That's a good question, because we were thinking that if there is a doctor who usually takes care of the most complicated patients, it's normal that we see their name come up for relapse.
So another way to deal with that, instead of keeping the doctors' names, would be to take the department or the specialty of the doctors, replace the names by the specialty in the corpus, and try that. I also had this comment from the clinicians, so it's a good question. Thank you.

Yes, because even with the specialty, I think, if someone is specialized in more extreme and more difficult cases that have a lower survival or recovery rate, then I would expect the models to pick up on that. Yeah.

Thanks for the talk. Just a quick question following up on the reports: have you cleaned the text somehow, or just computed the bigrams? How did you process the reports, also thinking about synonym mapping and so on? Thanks.

I cleaned the text. It's actually hard to clean the reports, because we have different report types and different doctors, and they all have their own jargon. So I did the basic preprocessing for text, like removing the most common words, etc. The doctors' names were inside the corpus, and I was not aware of it until I saw the results. But I did the basic preprocessing that we do in NLP: removing all the stop words and so on.

Sorry, for the YouTube stream I'll repeat: the question was about synonym mapping. No synonym mapping? No.

Good. So then let's thank Margette again and move on to the next speaker.
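One of the questions above suggested stacked generalization as an alternative to the majority vote for late integration. As a side note, a minimal sketch of what that could look like with scikit-learn's StackingClassifier is given below; the random data, the per-modality column ranges and the logistic-regression meta-learner are purely illustrative assumptions, not part of the presented work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical concatenated design matrix: columns 0-161 structured features,
# 162-211 event frequencies, the remaining columns bigram TF-IDF.
rng = np.random.default_rng(0)
X = rng.random((200, 300))
y = rng.integers(0, 2, size=200)

def select(cols):
    # Restrict a base learner to one modality's columns.
    return FunctionTransformer(lambda X: X[:, cols])

base_learners = [
    ("structured", make_pipeline(select(slice(0, 162)), RandomForestClassifier(random_state=0))),
    ("events", make_pipeline(select(slice(162, 212)), RandomForestClassifier(random_state=0))),
    ("text", make_pipeline(select(slice(212, 300)), RandomForestClassifier(random_state=0))),
]

# The meta-learner is trained on the base learners' cross-validated predicted
# probabilities, replacing the weighted majority vote of the late integration.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           stack_method="predict_proba",
                           cv=5)
stack.fit(X, y)
print(stack.predict(X[:5]))
```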