Hi everyone, thank you so much for coming. My name is Cindy, and I just finished my undergrad at the University of Toronto. Today I'm going to be talking about my project, predictive models of dataset-specific single-cell RNA sequencing (scRNA-seq) pipeline performance, which was supervised by Dr. Alina Selega and Dr. Kieran Campbell.

As a brief introduction, I'm sure all of us here are familiar with scRNA-seq and how it allows us to detect different cell types within tissues. Because of the high-dimensional nature of single-cell data, many computational methods have been developed for its analysis, and the number of these methods has grown rapidly over the past few years. A common scRNA-seq clustering pipeline has five major steps starting from the raw counts: filtering, normalization, feature selection, dimensionality reduction, and finally clustering. For each of these five steps there is an ever-increasing number of methods and parameter settings available, so it is up to the practitioner to select the appropriate methods and parameters so that their pipeline is optimized for their dataset. Due to the sheer number of methods, this is becoming more and more difficult.

Previous benchmarking studies have shown that pipeline performance is in fact dataset-dependent. For example, in this study the authors compared the clustering methods shown on the x-axis across the datasets on the y-axis. Even for very commonly used methods such as Seurat, when performance was measured with the adjusted Rand index, which measures the agreement between the pipeline's cell type labels and the ground-truth cell type labels, Seurat performed well on the majority of datasets, but there were still some datasets where it did not perform well.
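The adjusted Rand index mentioned here can be computed directly from the contingency table of two labelings. Below is a minimal pure-Python sketch of the standard formula (in practice you would use scikit-learn's `adjusted_rand_score`); a value of 1 means the two clusterings are identical up to relabeling, and 0 is the chance-level expectation.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same points."""
    n = len(labels_a)
    # Contingency counts: how many points fall in each (cluster_a, cluster_b) pair.
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    total = comb(n, 2)
    expected = sum_rows * sum_cols / total   # expected index under random labeling
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)
```

Note that the index is invariant to renaming clusters, which is why it suits comparing pipeline output labels against independent ground-truth annotations.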
Moreover, these authors also looked at different upstream gene filtering methods before benchmarking the clustering algorithms, and you can see that when they changed the gene filtering method to the third one on the right here, the performance of nearly all of the clustering algorithms dropped substantially. This goes to show that it is not only important to select a clustering algorithm suitable for your dataset; it is important to select all of the methods in your pipeline so that they interact well with each other and yield good clustering results.

This led us to the question we tried to answer: given dataset characteristics and pipeline parameters, can we predict pipeline performance on an unseen dataset? To answer it, we started by collecting the raw counts of 86 human single-cell datasets from the EBI Single Cell Expression Atlas. On these 86 datasets we ran 288 different scRNA-seq clustering pipelines using the pipeComp R package, which lets us define different methods and parameters for each of the five steps I mentioned and then run every combination of them. Next, we computed unsupervised clustering metrics on each of our 86 × 288 clustering results. We also had dataset-specific characteristics, such as the number of cells and the number of genes, along with the pipeline parameters. We used these dataset characteristics and pipeline parameters as inputs to supervised machine learning models, which we trained to predict the unsupervised clustering metrics, so that we could eventually generate predictions of pipeline performance on unseen datasets. I'm now going to talk a little more in depth about this workflow.
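The way per-step choices multiply into 288 pipelines can be sketched as a Cartesian product over the grid of options. The option lists below are illustrative placeholders, not the actual grid used in the talk; any lists whose lengths multiply to 288 behave the same way.

```python
from itertools import product

# Hypothetical per-step options (placeholders, not the study's real settings).
grid = {
    "filtering":      ["default", "stringent"],                     # 2 options
    "normalization":  ["lognorm", "scran", "sctransform", "none"],  # 4 options
    "feature_select": ["hvg_500", "hvg_1000", "hvg_2000"],          # 3 options
    "dim_reduction":  ["pca_10dims", "pca_30dims"],                 # 2 options
    "clustering_res": [0.1, 0.3, 0.5, 0.8, 1.0, 2.0],               # 6 resolutions
}

# Every combination of one choice per step: 2 * 4 * 3 * 2 * 6 = 288 pipelines.
pipelines = [dict(zip(grid, combo)) for combo in product(*grid.values())]
```

Each element of `pipelines` is then one fully specified pipeline configuration that can be run on every dataset, which is exactly how a combinatorial benchmark of this kind explodes in size.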
So, as I just mentioned, we took 86 scRNA-seq datasets from the EBI Single Cell Expression Atlas, selecting all of the human datasets with fewer than 100k cells due to computational constraints. You can see in this heatmap that these datasets vary greatly in different summary statistics, such as the percentage of mitochondrial counts. We then ran the 288 different pipelines on each of these datasets. For each of the five major steps mentioned before, these are the different methods and parameter settings we tried, and we arrived at 288 simply by trying every possible combination of these settings.

With our 86 × 288 clustering results in hand, we computed cluster purity metrics to evaluate how well our pipelines performed: the Calinski-Harabasz index, the Davies-Bouldin index, and the silhouette width. We also ran gene set enrichment analysis on each of the clustering outputs so that we could use the normalized enrichment score to measure the biological validity of each clustering output, and use that as a further quantification of pipeline performance.

Now that we have our metrics and our data, once again the question we are trying to answer is: given the dataset characteristics and pipeline parameters, can we predict these performance metrics on an unseen dataset? To this end, we tried two different models, random forests and penalized linear regression. We gave the dataset-specific characteristics and pipeline parameters as input to these models, and they output predictions of the clustering performance metrics. For each of the two models, we used ten-fold cross-validation to tune the hyperparameters.
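To make the purity metrics concrete, here is a tiny pure-Python sketch of the silhouette width, one of the three metrics named above (scikit-learn provides `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score` as production implementations). For each point it contrasts the mean distance to its own cluster against the mean distance to the nearest other cluster.

```python
def silhouette_width(points, labels):
    """Mean silhouette over all points; values near 1 mean tight, well-separated clusters."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        # a: mean distance to own cluster (self-distance is 0, so divide by size - 1)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Crucially, these metrics need no ground-truth labels, which is what makes them usable as prediction targets across all 86 datasets, including the ones without author-provided annotations.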
After predicting on the test set, which consisted of 30% of the 86 scRNA-seq datasets, so 25 held-out datasets, we found that our test-set predictions correlate significantly with the measured pipeline performance on those datasets. On the x-axis of both box plots here is the metric that the respective model is trying to predict, and on the y-axis is the correlation between the model's predictions and the observed metric values. Both models achieved correlations significantly above zero across all four metrics, as assessed with the Wilcoxon rank-sum test.

Next, we also assessed the performance of our models by comparing to previous cell type annotations. Sixteen of the 86 datasets we looked at came with author-provided cell type labels, and we made sure to include these 16 datasets in the test set so that we could compare our test-time predictions to the adjusted Rand index, which we computed against the cell type labels for each of the 16 datasets. Once again we have box plots with the metric on the x-axis and the correlation between our models' predictions and the adjusted Rand index on the y-axis. For at least three of the four metrics, the linear regression models correlated significantly above zero with the ARI, which shows that the pipelines our models predict to perform well actually produce cell type labels with good overlap with expert annotations.

Finally, we looked at which features were important for these predictions. Looking at the ten most important features on average across the four metrics, we found that our models mostly prioritized pipeline parameters, such as the clustering resolution and the normalization method, for predicting performance.
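The talk does not specify which correlation coefficient was used to compare predicted and observed metric values per test dataset; a rank correlation such as Spearman's ρ is one natural choice when what matters is whether the predicted ranking of the 288 pipelines matches the observed ranking. A minimal stdlib sketch, with ties handled naively (`scipy.stats.spearmanr` is the standard implementation):

```python
def spearman_rho(predicted, observed):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank  # ties get arbitrary order; fine for a sketch
        return r

    rx, ry = ranks(predicted), ranks(observed)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Computing one such coefficient per held-out dataset yields the distributions shown in the box plots, which can then be tested against zero.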
So, in conclusion, in this project we have created a new dataset of pipeline performance benchmarks, which we used to train supervised machine learning models. We have shown that these models can predict clustering pipeline performance on unseen datasets with significant correlation with the observed metric values, and that the pipelines we predict to perform well correspond to good overlap with expert cell type annotations. We hope this work leads to the exciting possibility that one day supervised machine learning models may be able to generate dataset-specific pipeline recommendations for scRNA-seq clustering, taking the guesswork out of the picture for practitioners. Thank you so much for listening.