So, Pelin is one of our ESRs at the Spanish node, supervised by Joachim, and is the final speaker of the first day of our symposium. The slides are ready to go, so we're looking forward to your talk. Thank you so much, Karsten. Hi, everyone. Today I'm going to present a neural network project we developed. Before I start, I would like to mention my advisors, Joachim, Isabel and Carlos. I started in September 2019. As I said, our project is a neural network project. Neural networks are much more complex algorithms than regular machine learning algorithms; we use them because this complexity lets us work on really complicated problems, and they perform really well. However, this complexity also causes an issue called the black box problem: the calculations during training are well known, but the results themselves are hard to interpret. This lack of explainability can cause problems, especially when we are working in the health care field. For this reason, we are trying to implement biological knowledge in the network, more specifically signalling pathways. As Joachim mentioned in his earlier talk, a signalling pathway is a cellular-level process in an organism, and each pathway can either create a product or a helper signal for the next process. We are specifically using circuits. A circuit is the smallest element of a pathway, and it carries effector or receptor gene information. We use this information to improve the interpretability of the network. The network takes single-cell RNA sequencing data and predicts or annotates cell type information. To summarize, the project I'm presenting has three main parts: cell type prediction, cell type annotation and interpretation. Cell type prediction is a simple supervised approach; it just predicts cell types. Cell type annotation focuses on the hidden layer information.
And the interpretation part focuses on the integrated knowledge, which is the signalling pathways — I'm sorry, the circuits. Our design is a neural network with dense connections, and we use this biological knowledge in the first hidden layer. The implementation works by treating each node in the first hidden layer as a circuit, which means that if a gene in the input layer has no connection to the selected circuit, we remove the corresponding weight. As for the data, we use two data sets for this project. Before giving more details about them, I would like to mention one paper, called CyBET. This paper is important for us because both data sets are obtained from it, and we use exactly the same preprocessing steps as they did. The paper also shares results for its own algorithm, because CyBET is one of the well-known algorithms for cell type prediction tasks and produces really competitive results. As I mentioned, we use exactly the same data versions as they do. The first data set is called PBMC; it is a balanced data set with seven different cell types. The second one, Melanoma, has two subsets, called training and testing. Testing contains the same cell types as training, plus one additional cell type that training doesn't have. We use the PBMC data set for the cell type prediction analysis, and the Melanoma data set for both the interpretation and the cell type annotation tasks. The purpose of the prediction analysis is simply to see how well our proposed network performs, by comparing predictions against the ground truth. For this purpose, we designed an experiment with 100 iterations of stratified holdout validation, where each iteration uses 30% of the data set as the test set.
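The weight removal described above can be sketched as a binary mask on the first layer's weight matrix. This is a toy illustration only, not the project's code; the gene-circuit membership matrix and the layer sizes here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes, n_circuits = 6, 3
# Hypothetical binary membership matrix: mask[g, c] = 1 if gene g
# belongs to circuit c, 0 otherwise. In the real model this would
# come from the signalling-pathway (circuit) annotation.
mask = rng.integers(0, 2, size=(n_genes, n_circuits)).astype(float)

# Dense weights for the first hidden layer (one node per circuit).
W = rng.normal(size=(n_genes, n_circuits))

# Removing the weights of gene-circuit pairs with no connection
# amounts to an element-wise product with the mask.
W_masked = W * mask

x = rng.normal(size=(1, n_genes))      # one expression profile
h = np.maximum(x @ W_masked, 0.0)      # first hidden layer (ReLU)

# Every weight outside the circuit membership is exactly zero.
assert np.all(W_masked[mask == 0] == 0.0)
```

The same masking would be reapplied after each gradient update (or implemented as a fixed sparse connectivity) so the removed weights stay at zero during training.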
And these are the seven cell types, which all have exactly the same sample size. For evaluation, we have five metrics: accuracy, balanced accuracy, and F1, precision and recall scores. In addition to these five metrics, we also generate an average confusion matrix. These two figures show the evaluation results. The top figure belongs to the five metrics. Accuracy and balanced accuracy look similar here; in general we pay attention to the distribution of each cell type, but since our data set is balanced, the similar results are not an issue. All metrics take a value between 0 and 1, where 1 is the best and 0 is the worst score. As you can see, they are all quite close to 1, which shows that our network works for cell type prediction. In the bottom figure, in addition to the overall picture, we also look at the cell type details: precision, F1 and recall are calculated for each cell type, and they are all close to 1. There is a slight difference for three cell types, among them cytotoxic memory T cells and regulatory T cells; however, even there the scores are above 0.8. So the overall picture shows that the prediction works well for each cell type. This is the average confusion matrix — basically ground truth versus prediction — and the diagonal shows how well the predictions are. For these three cell types we get slightly lower performance; however, biologically they are quite similar, and as you can see, the worst predictions are distributed among exactly these three cell types. As I mentioned, the CyBET paper shared the three results in the bottom. These three algorithms are among the most well-known and most commonly used algorithms for cell type prediction tasks.
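The per-cell-type scores and the overall metrics mentioned here can all be read off a confusion matrix. A minimal sketch with made-up labels (three illustrative classes, not the PBMC data):

```python
import numpy as np

# Toy ground truth and predictions over 3 cell types (0, 1, 2) —
# illustrative labels only.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 0, 1, 2])

n_classes = 3
# Confusion matrix: rows = ground truth, columns = prediction,
# so the diagonal counts the correct predictions per cell type.
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

# Per-class precision and recall read straight off the matrix.
precision = np.diag(cm) / cm.sum(axis=0)
recall = np.diag(cm) / cm.sum(axis=1)
f1 = 2 * precision * recall / (precision + recall)

accuracy = np.diag(cm).sum() / cm.sum()
# Balanced accuracy is the mean per-class recall, which is why it
# matches plain accuracy closely on a balanced data set.
balanced_accuracy = recall.mean()
```

Averaging such matrices over the 100 stratified holdout iterations gives the average confusion matrix shown in the talk.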
And our results are in the top figure, which is quite similar to CyBET's — we produce really competitive results. However, in addition to matching these algorithms, we aim to interpret our network. After we saw that our predictions are really good, our assumption was: if we get good predictions in the output layer, then the encoding layer, which is the last hidden layer of the network, should also carry information about cell type differentiation. For this reason, the experiment has two steps. The first step is basic visualization, just creating clusters. The second is novelty detection, where we look at similarities between samples and try to understand which samples are relatively abnormal compared to the rest of the data set. The interpretation, again, looks at the biological layer, which is the first hidden layer in our design. More specifically, we look at the activation scores of each node in the first hidden layer for each cell type — basically, we take the most active nodes for each cell type and make the analysis based on these selected nodes. For example, for B cells, the most active pathways should be related to cell-to-cell communication, proliferation, protein expression and secretion. Our results show that the most active pathways are indeed related to these cell functions, and this tells us that the integrated biological layer is taken into account during cell type prediction. For the next part, as I mentioned, we have two steps for this analysis, and both use the encoding layer. The experiment is designed using 70% of the training set; after the model is trained, the full training set and the testing set are used for the experiment.
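Selecting the most active first-layer nodes per cell type amounts to ranking mean activations. A toy sketch with random activations — the array shapes and the top-k value are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical activations of the biological (first hidden) layer:
# 8 samples x 5 circuit nodes, with a cell type label per sample.
activations = rng.random((8, 5))
cell_types = np.array([0, 0, 0, 1, 1, 1, 1, 0])

top_k = 2
most_active = {}
for ct in np.unique(cell_types):
    # Average each node's activation over the samples of this cell
    # type, then keep the indices of the k highest-scoring circuits.
    mean_act = activations[cell_types == ct].mean(axis=0)
    most_active[ct] = np.argsort(mean_act)[::-1][:top_k]
```

The selected circuit indices would then be mapped back to their pathway annotations (e.g. proliferation or secretion for B cells) for the biological interpretation.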
And as I mentioned, the purpose of this analysis is just to provide a visual proof: we want to see whether the encoding layer can create clusters for each cell type. One important cell type here is the negative cells. As you can see, this negative cell type doesn't exist in the training set, and we are trying to obtain a separate cluster for it while also getting separate clusters for the rest of the cell types. These are the two results. As I mentioned, 70% of the training set is used for model training, and this figure shows the full sample set. As you can see, each cell type the model has seen forms its own cluster. The next one is testing, and again we see one cluster for each cell type — but also a separate cluster for the unknown cell type. This shows us that the encoding can be used for cell type differentiation. For this purpose, we use a similarity score from local outlier factor analysis. The analysis starts by calculating this score from the encoding information. The similarity score shows, for a given sample, how similar that sample is to the rest of the data set. It ranges from 0 to negative infinity: a score close to 0 means the given sample is similar to the rest of the sample set. In the next step, using this similarity score, we generate a distribution plot for each cell type, and from these distributions we calculate a threshold. This threshold helps us decide which samples are assigned as unknown and which samples are passed on to our network. If the calculated similarity score for a given sample is above the threshold, we execute our network, and at the end of the analysis we get a cell type prediction.
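The score-and-threshold procedure can be sketched roughly as follows. Note the hedges: this uses a simplified k-nearest-neighbour distance as a stand-in for the local outlier factor score, the percentile cutoff is an assumption, and the 2-D "encodings" are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 2-D stand-in for the encoding layer: the training encodings
# form a tight cluster; the last five test samples play the role of
# an unknown cell type and fall far away from it.
train_enc = rng.normal(0.0, 0.5, size=(50, 2))
test_enc = np.vstack([rng.normal(0.0, 0.5, size=(5, 2)),   # known types
                      rng.normal(6.0, 0.5, size=(5, 2))])  # unknown type

def similarity_score(x, reference, k=5):
    # Negated mean distance to the k nearest reference samples:
    # close to 0 means the sample resembles the reference set, large
    # negative values mean it is an outlier. A simplified stand-in
    # for the local-outlier-factor score used in the talk.
    d = np.sort(np.linalg.norm(reference - x, axis=1))[:k]
    return -d.mean()

# Threshold taken from the distribution of training scores (the
# exact percentile is an assumption for this sketch).
train_scores = np.array([similarity_score(x, train_enc) for x in train_enc])
threshold = np.percentile(train_scores, 1)

test_scores = np.array([similarity_score(x, train_enc) for x in test_enc])
# Below the threshold: label the sample "unknown"; above it: pass
# the sample on to the network for a cell type prediction.
is_unknown = test_scores < threshold
```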
And if the similarity score is below the threshold, we say that the network hasn't seen this label before, so it is unknown. These are the final results after following all the steps. As you can see, we get high accuracy for all the cell types, and also high accuracy for unknown cell type identification. One issue I would like to address is the performance on NK cells. As you know, networks work better with larger sample sets; for this cell type we unfortunately have a small sample set, which is why the performance is relatively lower than for the rest. This is the comparison between CyBET and our network. As you can see, we get higher performance except for NK, and a similar performance for unknown cell type annotation. The results I showed are from one of our ongoing projects, funded by several European agencies. Before ending my presentation, I would also like to mention our latest project. We are currently working on a cell type trajectory project. There are several studies in this field, and basically, we are trying to find the cell type behavior according to... yes. For this project, we are using a cancer data set, because we are trying to understand the changes in a cell type when it becomes cancerous. The idea is that if we can see this trajectory, which is like a path, we can understand how the cell behaves depending on the patient, the treatment or the cancer stage. These are the full sample sets we are using; as you can see, there are multiple. We are in the initial stage, but the results we have so far match what the literature says. We are using a variation of the autoencoder for this design, and we also use signalling pathways in this project. Thank you so much. Thank you, Pelin. Now, time for questions. Lukas. Thanks, Karsten. Thank you very much, Pelin.
It was a wonderful talk, very clear. And I loved not only the content but also the progress bar at the top — it really helps, and I think everyone here feels the same. I'm starting to work a bit with our single-cell RNA-seq data, and cell type annotation and prediction seems to be quite hard in some situations. I was wondering if you know how well these algorithms perform in situations where the cell types are not that different — for example, with single-nucleus RNA-seq, where the different cell types are, say, different types of neurons. From a biological point of view, I'm not an expert. But as you can see in our results, in the confusion matrix, if the cell types are really close to each other, the network underperforms. So for this purpose, in my opinion, working with some biologists to understand the cell types, or at least to find a route between them, might be helpful for the network. Thank you. You're welcome. More questions? I have one question. You mentioned the local outlier factor. Yes. This includes a number of parameters. Are your results sensitive to these parameters? Have you explored this, or how do you set them in the first place? Of course. How do we set the parameters? Actually, this was one of the topics when we used this tool. We used the default settings, because our aim was to see whether we can find this separation between cell types even with the simplest model. So just the default parameters of some software implementation of LOF. OK, good. Further questions? Yeah, Giovanni. Thank you, Pelin. I have a quick question on the structure of the first hidden layer, where you mentioned that you keep only the connections for the genes that are relevant to the circuits. Have you considered doing that as a soft version instead?
So instead of literally removing the links between the neurons, you could add a regularization term to your training procedure, so that the first layer is incentivized to mimic these connections but also allows a little bit of wiggle room to find something extra. Do you mean, what kind of regularization — like L1 or L2 regularization? No, you could just add a term to the loss that penalizes how far the connections go beyond the hard cutoff. Since it's just a term in the loss, you could still get a little bit of other contributions and perhaps a more general result. Actually, we didn't do that, but we did a related analysis where we shuffled the connections — that is, we kept the same number of connections but shuffled them. When we shuffled them, the results were really poor, but we didn't try anything like what you are suggesting. The shuffling experiment showed us that the connectivity is really important: shuffled connections are meaningless for the network, while keeping the biologically meaningful ones provides the performance. Thank you. Are there further questions? If not, then I would like to thank Pelin and all the speakers of the first day of this symposium. Thank you very much.
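The soft alternative raised in this last exchange could look like an L1 penalty restricted to the disallowed connections. This is a sketch of the suggestion only, not something the project implements (the project hard-zeroes those weights):

```python
import numpy as np

def masked_l1_penalty(W, mask, lam=1e-2):
    # Penalty on weights *outside* the circuit membership: instead
    # of hard-zeroing them, add lam * sum |W_ij| over disallowed
    # connections to the training loss, so the network is pushed to
    # mimic the biological connectivity while keeping some wiggle
    # room for extra contributions.
    return lam * np.abs(W * (1.0 - mask)).sum()

# Toy 2x2 example: mask allows only the diagonal connections.
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
W = np.array([[0.5, 0.2],
              [-0.3, 0.8]])
# Only the off-circuit entries (0.2 and -0.3) are penalized.
penalty = masked_l1_penalty(W, mask, lam=1.0)  # 0.2 + 0.3 = 0.5
```

With lam large this recovers something close to the hard mask; with lam small the off-circuit weights stay nearly free, which is the trade-off the questioner was pointing at.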