Thank you very much, Andrea, for the generous introduction, and thank you to the committee for selecting me for this award. It's a huge honor, and I'm very excited to be here with you today to tell you about my work on inferring gene function, and to discuss some of the opportunities and challenges that I think we as a field need to address. So, characterizing gene function is important for understanding the behavior of biological systems, from cells to organs and organisms. It can help us identify potential mechanisms of disease and lead us to genes that we can target for therapy. However, despite the tremendous advances in our experimental technologies and the large amount of genomics data we are acquiring, we still don't know the function of so many genes. And even for the genes whose function we do know, our knowledge is far from complete. So, for example, the Gene Ontology database provides one of the best representations of gene functions and annotations of gene functions, by representing functions as a hierarchy with more general functions at the top and more specific ones at the bottom. However, in general, Gene Ontology doesn't provide us with cell-type-specific or tissue-specific gene functions. So, this is a huge limitation, because we need to interpret our data based on reference databases. And basically, when we are interpreting our data without context-specific annotations, we are very limited. So, I really believe we need to do a lot more to evolve these ontologies and databases in parallel with acquiring more and more data, so that the system grows more dynamically. And, of course, characterizing gene function can be challenging for many reasons. So, for example, because of redundancy between genes, like genes compensating for each other, or because of the pleiotropic nature of gene functions. But I truly believe that challenges in data analysis are a major obstacle.
So, by developing better bioinformatics tools, we would be able to learn much more about gene functions from existing data and existing experimental tools than we currently do. So, one powerful way to learn about gene function is through genetic screens, that is, by inducing gene loss and observing how the cells change after losing the gene. So, for example, here you see epithelial-like cells, and when we knock down a gene, if the cells become more spindle-shaped, then we can infer that the gene has a role in epithelial-to-mesenchymal transition. If we knock down another gene and we see that the cells become larger, we can infer a role in cell size. Or if the gene knockdown affects cell-to-cell contacts, we can infer that the gene regulates cell adhesion. So, the fantastic news is that for more than a decade now, we have been able to generate these data sets in a high-throughput fashion. So, what I'm showing here is a multi-well plate, where the cells in each square are grown independently and we can induce a different gene knockdown in each. So, using automated microscopy, we can then image different locations in these wells, and each of these images can contain between 100 and 2,000 cells. If we image four locations for two to five markers and for 20,000 gene knockdowns, you can imagine this results in a huge amount of single-cell data, millions and millions of single cells, which allows us to understand the heterogeneity and the gene effects on different functions in different cellular populations. So, to analyze these data sets, the first thing we need is to develop an automated image analysis pipeline to separate cells from their background. Then we can extract various features from each cell. So, for example, the length of the cell, the area, the context of the cells, so whether the cells are crowded or more sparse, et cetera. And this can give us tens of features for every single cell.
So, the current analysis pipeline has not developed a lot in the last few years. Basically, for every gene knockdown we have measurements from hundreds of cells, and we simplify this by taking the average for each gene perturbation. Then we apply dimensionality reduction, just to reduce the number of features, and then use clustering to group genes that have a similar phenotype, which we assume means they have a similar function. So then, to interpret what a cluster means, or what functions are associated with each cluster, we can use functional enrichment analysis. It turns out that this pipeline doesn't work very well in practice. So, in reality, we find it very, very difficult to interpret the resulting clusters, and this is a problem I faced myself during my PhD studies. So, what happens is that it is really difficult to understand what a cluster means in terms of function, so most scientists will go back and focus on one or two features that we really understand, do a hit analysis, and discard the rest of the data. So, 99% of the data in these screens is often discarded and we are not learning from it. So, I really wanted to solve this problem, because I feel there is huge potential. We have the tools to do these screens in very high throughput, and having an efficient tool is very important for learning from these data sets. So, there are many limitations with the pipeline I showed you, but I will focus on what I think is the major limitation. So, let's say there is a gene A that regulates function one and function two. If we knock down this gene, what we can observe is that in one sub-population of cells the features associated with function one will change, while in another sub-population of cells the features associated with function two change. So, then, when we look at the gene profile, we will see features representing function one and features representing function two.
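To make the averaging problem concrete, here is a minimal sketch in Python. The feature values, the two-feature layout, and the 50/50 split between sub-populations are made-up assumptions for illustration, not data from any screen. It shows that when half of the knockdown cells change in a function-one feature and the other half change in a function-two feature, the averaged gene profile shows a moderate shift in both features, even though no single cell has both changed.

```python
from statistics import mean

# Hypothetical single-cell measurements (illustrative values only).
# Each cell is (f1, f2): f1 tracks function one, f2 tracks function two.
control_cells = [(1.0, 1.0)] * 100

# After knocking down "gene A": half of the cells change in f1 only,
# the other half change in f2 only.
knockdown_cells = [(3.0, 1.0)] * 50 + [(1.0, 3.0)] * 50


def average_profile(cells):
    """Collapse a cell population into one profile (feature-wise mean)."""
    return tuple(mean(values) for values in zip(*cells))


control_profile = average_profile(control_cells)
gene_a_profile = average_profile(knockdown_cells)

# Cells in which BOTH features are elevated relative to control.
cells_with_both = [c for c in knockdown_cells if c[0] > 1.0 and c[1] > 1.0]

print(control_profile)       # (1.0, 1.0)
print(gene_a_profile)        # (2.0, 2.0)
print(len(cells_with_both))  # 0
```

The averaged profile mixes two distinct single-cell phenotypes into one vector, which is the kind of mixed signal that makes the downstream clusters hard to interpret.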
Now, if we consider all the genes and all the different gene profiles representing a multitude of functions, then when we apply clustering it becomes really difficult, because these gene profiles are very noisy. They represent different functions, and when they are all clustered together it's really difficult to understand what these clusters represent. So, what I think we really need to do is to discover the phenotypic signatures associated with each function, in order to be able to better interpret this phenotypic data. So, for example, find the features or the phenotypic signature associated with function one, then find the phenotypic signature associated with function two, et cetera. So, of course, the question is how we can do that. So, after thinking about this for a long time, I came up with this idea, which is to use our prior knowledge to guide the phenotypic discovery. So, for example, we can use Gene Ontology, what we know about genes involved in cell division, for example, and use this to find which features are affected by cell division genes, and then train a classifier that separates cell division genes from a random set of genes. This classifier can give us a functional phenotypic signature that we can then apply to see what other genes result in a similar phenotype, and then we can predict that these genes have a similar function. I call this approach knowledge and context-driven machine learning, or KCML. So, it is knowledge-driven because it's driven by our prior knowledge, but also context-driven because when we apply it in the context of a certain data type, in a certain cell type, we assume that the predictions are more specific to that context. Of course, this idea is very simple, but it turns out the implementation of such a classification framework is very challenging. So, as you might already guess, Gene Ontology provides very noisy labels. It's not perfect, it's based on so many studies from so many systems, and it's also not context-dependent.
Another problem that I faced is class imbalance. I chose the Gene Ontology terms or functions that have between 100 and 500 genes. So, they are not very specific and not very general, but still, with this number of genes, the negative class is huge. So, we have the rest of the genome, which is around 18,000 genes. To make a long story short, what I found to help in this classification framework is to use an effective sampling strategy, where we take multiple samples from the negative class and try to distinguish the positive class from each of them. I also found that support vector machines seem to perform better than other approaches; they are more robust to overfitting and reduce the false discovery rate. So, with this, what we can do is train a classifier for each function independently. So, for example, cell division, as I said, and then rank the genes, based on their phenotypic similarity, for their involvement in cell division. Of course, not all gene functions will be present in one data set. No data set is complete in terms of what we are measuring. So, we set a threshold for what we consider a successful classifier. So, if it doesn't give good performance, we don't include it. And by doing this for so many functions, what we have at the end is, for each gene, a ranking of the different functions it's involved in. And, of course, we can have the confidence of these predictions based on the phenotypic similarity. So, to test this approach, I collaborated with Lucas Pelkmans' lab at the University of Zurich, where I had access to a genome-wide siRNA screen in which 18,000 gene knockdowns were performed. What was interesting about this data set is that the entire well was captured. So, on average, we have around 6,000 cells per gene knockdown, which gives us a lot of information about heterogeneity and pleiotropic gene functions. In that study, they imaged two markers: DAPI for the nucleus, in blue, and rotavirus infection, marked with viral protein 6, in green.
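The sampling strategy for training one classifier per function, described earlier in this part of the talk, can be sketched roughly as follows. This is an illustrative toy, not the KCML implementation: the gene names and profiles are synthetic, the function `rank_genes_for_function` and its parameters are hypothetical, and a simple centroid-based linear scorer stands in for the support vector machine mentioned in the talk so that the example needs only the Python standard library. The performance threshold for keeping a classifier is omitted.

```python
import random
from statistics import mean


def centroid(rows):
    """Feature-wise mean of a list of gene profiles."""
    return [mean(col) for col in zip(*rows)]


def train_linear_scorer(pos_rows, neg_rows):
    """Stand-in for the SVM (an assumption, not the method from the talk):
    a linear rule pointing from the negative to the positive centroid."""
    cp, cn = centroid(pos_rows), centroid(neg_rows)
    w = [p - n for p, n in zip(cp, cn)]
    midpoint = [(p + n) / 2 for p, n in zip(cp, cn)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def rank_genes_for_function(profiles, positives, n_rounds=25, seed=0):
    """Rank all genes for one function by averaging scores over several
    classifiers, each trained on a different balanced negative sample."""
    rng = random.Random(seed)
    pos_rows = [profiles[g] for g in positives]
    negatives = [g for g in profiles if g not in positives]
    scores = {g: [] for g in profiles}
    for _ in range(n_rounds):
        # Draw as many negatives as positives to balance the two classes.
        sampled = rng.sample(negatives, len(pos_rows))
        scorer = train_linear_scorer(pos_rows, [profiles[g] for g in sampled])
        for gene, x in profiles.items():
            scores[gene].append(scorer(x))
    mean_scores = {g: mean(s) for g, s in scores.items()}
    return sorted(mean_scores, key=mean_scores.get, reverse=True)


# Synthetic data: feature 0 is shifted for genes with the phenotype.
data_rng = random.Random(1)


def make_profile(shift):
    return [data_rng.gauss(shift, 0.5),
            data_rng.gauss(0.0, 0.5),
            data_rng.gauss(0.0, 0.5)]


profiles = {}
for i in range(20):   # genes annotated to the function (positive class)
    profiles[f"annotated_{i}"] = make_profile(4.0)
for i in range(5):    # same phenotype, but missing from the annotation
    profiles[f"unannotated_hit_{i}"] = make_profile(4.0)
for i in range(200):  # rest of the "genome"
    profiles[f"background_{i}"] = make_profile(0.0)

positives = {g for g in profiles if g.startswith("annotated_")}
ranking = rank_genes_for_function(profiles, positives)
top25 = ranking[:25]
print([g for g in top25 if g.startswith("unannotated_hit_")])
```

In this toy setup, the unannotated genes that share the phenotype land near the top of the ranking, which is the sense in which a trained signature can suggest new members of a function. Averaging over several negative samples keeps any single draw from the large negative class from dominating the result.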
And the original study just looked at which genes increase or decrease rotavirus infection. And I really wanted to do a more comprehensive analysis of this, specifically focusing on multi-cellular organization. So, as I said before, we can extract many features describing the shape, the local cell density and the various intensity measurements. So, this resulted in around 160 features per cell. So, basically, I benchmarked KCML in different ways, and what I found is that it can actually learn much more functional information from phenotypic screening data. So, you see here clustering-based approaches like k-means and self-organizing maps. If we consider the area under the ROC curve, it's almost random, it's 50%. So, that's why we can't interpret these clustering results, because the performance is very poor, while KCML, on the other hand, performs much better. So, the challenge was, okay, what about the new predictions? Are they any good? And what I've done to validate that is this: basically, over the period of developing the method, new annotations were added to the UniProt database. So, there were thousands of those. So, what I've done is to compare KCML predictions to the new Gene Ontology annotations. And what I found is that a significant number of KCML predictions have since been added independently to the Gene Ontology database, so 15%. Which means that KCML can predict novel gene functions, and it gives us hope that in the future similar approaches can help us to start defining context-specific gene functions. So, the paper was published in Molecular Systems Biology last year, and we identified, as Andrea said, an interesting role for an olfactory receptor in cellular organization. So, please have a look, and there is also some interesting discussion of how to handle single-cell data and represent it. So, in summary, KCML predicts multiple functions for each gene based on phenotypic similarity.
The main purpose was to allow a much more comprehensive analysis of genetic screens than was possible before. Importantly, it can be applied to any type of phenotypic data. So, I discussed image-based screen data, but I also applied it to transcriptional data, so knocking down genes and measuring gene transcription, as well as CRISPR screens measuring viability across different cell lines. So, it's quite a generalizable framework that we can use to learn from any phenotypic data. There are so many interesting future directions for this work, but the aspect that I'm most excited about is now to apply this to screens performed in different cell types using similar markers, and to start comparing the differences in gene functions across these different cell types. So, I'm collaborating with AstraZeneca to look at various arrayed CRISPR screens to study that and build a portal of context-specific gene functions. Also, this framework is not limited to predicting gene function. You can imagine that with any data you have, if you have a good amount of prior knowledge, you can implement similar things. So, for example, if we have a drug screen where we know the mechanisms of action of a lot of the drugs, we can also apply a similar framework driven by our knowledge. So, at the end, I would like to thank you for listening. There is also another interesting visualization tool that I developed for bio-imaging data. It's called Shabography.com. So, please have a look and let us know if you have any feedback. And I'm very happy to take your questions. Thank you.