So, more in detail, what are the objectives for part one of this module, module eight? I have detected somatic variants in a cancer sample: what information and what tools can I use to interpret them? What variant annotations can I use? How do the impact prediction or assessment models used for annotation work? And then in the lab, we'll see how to take a VCF and get it annotated using a tool called ANNOVAR, which also has a number of built-in databases for allele frequencies and other types of information, so it's a fairly self-contained package. Okay, let's have a little bit of introduction. There are two levels, if you wish, that you can keep in mind: one is the variant, and one is the gene. If you have somatic mutations from cancer, let's presume in the form of single nucleotide variants, and you want to divide them between the ones that have an effect and the ones that don't, you have to keep these two levels in mind. What do I mean? Well, you may have a variant in a very important gene that doesn't really alter the function of the gene, because it falls in a region that has really diverged, or because it introduces a change in the protein that is benign. That's possible: just because you make a change doesn't mean you're going to break the protein. It may be a silent change, or have a very mild effect on functionality. That's one level, and it's the level we're mostly going to look at in this part. The other level is: am I really hitting a gene that's important, or am I not? There are plenty of examples where a somatic SNV lands on an olfactory receptor that's completely irrelevant to cancer, typically in 99.99% of cases, or where you have a damaging substitution in a gene like titin, which is important for muscle function, but which we know can tolerate a lot of substitutions without actual deterioration of the protein's function. So there's also a level that's about the gene, right?
Does a change in this gene produce an effect or not? This part is not so much about that. In the second part, we will look at how a gene connects into different functions and different pathways, and then there are, of course, also other considerations beyond the function and the pathway. So keep that in mind: we're going to dive deeper into the effect of a variant on a gene product, but don't forget that not all genes and gene products are made equal in how much they can contribute to cancer. In this diagram, I wanted to capture these aspects, but I also wanted to add variant recurrence as another dimension. We're going to focus, as I said, on the yellow area, which is the effect of a variant on the gene product, and we're going to look more at the second question later: what functions and pathways is the gene product implicated in, and do they have any relevance to cancer? But if you want to statistically determine whether a variant has an effect, its recurrence is of course also an important factor. For instance, in the next part we're going to take results from a very large study and pick the most recurrent variants, those scored as significantly more recurrent than we would expect based on a background model. So overall, if you are given a bunch of variants in cancer, you have to keep all three aspects in mind. Is it in the right gene? Does it disrupt, or create a gain of function in, the gene product? And do I have sufficient statistical support in terms of its recurrence, or at least the recurrence in that pathway or family of genes? Okay. As I said, we're going to focus on the variant's gene product effect in this part. The variant size we're going to work on is small variants, and specifically somatic SNVs, because that's what you learned how to call in the previous module, and somatic indels are a little bit more tricky.
But in general, if you look at annotation tools, you can really divide small variants from bigger variants. Small variants are typically 1 to 50 base pairs: substitutions, block substitutions, indels. They can change an amino acid, remove an amino acid, delete a small number of amino acids, or create a frameshift or a premature stop codon. Bigger variants, on the other hand, can remove entire exons, or decrease or increase the copy number of a gene. So they have a different kind of impact on the gene and the gene product, and they need to be annotated in a different way. When you annotate a variant, there are different angles, different types of information you can add. First of all, we're going to consider databases where variants have already been reported, and specifically databases where you can extract allele frequencies in either normal or disease populations. For somatic variants, this is useful mainly to make sure that your variant is really somatic. If you have a variant that's polymorphic in the general population, even if it's genuinely somatic, it's probably not going to be interesting. So you're going to use this mostly as a negative control. In other areas, like congenital disorders, frequency plays a more important role; for somatic variants, it's more of a negative control that the variant is really somatic. dbSNP is a general reference database; COSMIC is a reference database for somatic variants. Then we're going to look at gene mapping: how do we say that a variant disrupts the gene product of a given gene? The complication there is due to overlapping genes, other configurations, and different types of sequence. Then we're going to look at the effect type on the gene product. Does it change an amino acid? Missense. Does it leave the amino acid unchanged and do nothing else? Synonymous. Does it lead to a truncation or a frameshift? And so forth.
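The negative-control idea above can be sketched in a few lines. This is a minimal illustration, not ANNOVAR's actual logic: the field names (`af_1000g`, `af_esp`) and the 1% cutoff are illustrative choices.

```python
# Sketch: using population allele frequency as a negative control for somatic
# calls. Field names and the 1% cutoff are illustrative, not a standard.

COMMON_AF_CUTOFF = 0.01  # above this, the variant looks like a germline polymorphism

def likely_germline(variant, cutoff=COMMON_AF_CUTOFF):
    """Flag a called 'somatic' variant that is common in a population database.

    `variant` is a dict with optional allele-frequency fields such as
    'af_1000g' and 'af_esp'; None or a missing key means never observed.
    """
    for key in ("af_1000g", "af_esp"):
        af = variant.get(key)
        if af is not None and af >= cutoff:
            return True
    return False

calls = [
    {"id": "varA", "af_1000g": 0.12},    # common polymorphism: suspect germline
    {"id": "varB", "af_1000g": None},    # never observed: plausibly somatic
    {"id": "varC", "af_esp": 0.0004},    # very rare: keep
]
kept = [v["id"] for v in calls if not likely_germline(v)]
```

In a real pipeline the frequencies would come from the annotation output rather than being typed in by hand, and you would check each population database (and ethnic group) separately.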
Specifically for missense variants, which lead to a change in the amino acid, we're going to look at richer models that give you a quantitative readout of the impact of that amino acid change. We're going to see models like SIFT, PolyPhen, and MutationAssessor. And we're going to look at other types of information you can use to evaluate the effect on the gene product: conservation at the genomic level, using multiple-species alignments, and integrated models like CADD, which use many other predictors and information sources. Finally, I'm going to show you a work that was published very recently, and whose data should be made available soon, also as an ANNOVAR database, for predicting splicing regulatory effects. If you stay tuned to the literature, you'll see more good quantitative models that tackle the non-coding regulatory type of effects, which have historically been more challenging than amino acid changes. So we'll start from the variant databases and allele frequencies. I'm going to do a quick ride through the main allele frequency databases. One that's used a lot is 1000 Genomes. It was originally started as a project to map polymorphic variation, as opposed to private variation that only you have; variation that's generally present in the population. Right now there are about 2,500 subjects, apparently healthy, trying to represent different ethnicities in a balanced way: Caucasian Europeans, Latin Americans, Africans, South Asians, East Asians. The platform is Illumina. The whole genomes were initially done at very low coverage, then exomes were done at higher coverage, and then higher-coverage, higher-depth whole genomes were added. Mind that in the beginning the technology was expensive, and that was the rationale for doing shallow-coverage genomes; later on, of course, coverage was increased. Even at shallow coverage you can do a good job of calling polymorphic variants; what you tend to miss out on are the private ones.
But anyway, nowadays I think this is a go-to data set if you want allele frequencies genome-wide, so it's definitely a good piece to have. The other one is NHLBI ESP, which started as a project to investigate variation in disease: cardiovascular disease, cystic fibrosis, et cetera. It's now very broadly used as a control data set, even if it doesn't consist of healthy individuals. For somatic variants you can definitely use it, because there's no cancer in there. It has more subjects, about 6,500, although the ethnicities are less well represented, because it was never meant as an effort to systematically map ethnicities. That's something you should keep in mind. On the other hand, the exome sequencing was done at very good depth. Of course, you're going to miss out on the regions that were not captured, which are typically deeper intronic regions or non-coding genes, but you have good resolution over coding exomes. And finally, something that was added very recently to the landscape of available frequency databases is ExAC. ExAC is a very large-scale database compared to the other ones; it has about 60,000 unrelated subjects. The caveat is that it's not all healthy subjects. It was meant mostly for congenital disorders, so they made sure to almost completely deplete individuals with severe congenital disorders. I'm a bit more unsure about cancer; it's not an all-tumor data set, but you have to be careful when using it, because there are samples from cancer studies in there. So when you look at ExAC frequencies, be a bit more cautious. It's a great resource because there are so many exomes, but compared to the other ones, be more careful when using it as a control data set. Now, for databases that are meant as reference databases for variants, you've probably been introduced to them already. dbSNP is really just meant as a repository of variants.
So it's not a database of control variants, or polymorphic variants, or variants that are not disease-related; it's just a repository of variants. It really just tells you whether a variant has been observed before or not. So be careful not to use it as a hard filter ("I'm going to throw away everything that's in dbSNP"). If you're looking at somatic variants, that can get rid of somatic variants that are recurrent, have been characterized in the published literature, and identified as important. It's basically good for looking up a variant. It doesn't include annotations stating in which specific disease a variant has been reported. You can grab the subset that's so-called clinically flagged, and there's a smaller subset that's in OMIM that you can access, but dbSNP is not really meant to categorize the type of disease being associated; there's another database called ClinVar for that, and we'll talk about ClinVar. Okay, what about the "common" classification? It indicates whether a variant is presumed to be common in the general population or below the common-frequency threshold; it's just based on the allele frequency. I usually don't recommend using it, because it's relatively crude. It's much better to extract the allele frequencies from the actual databases and look by ethnic group and by database, rather than relying on that classification. It's really meant as a very crude classification if you want to do something quick and easy; if you want a very rough idea of what's common and what's not, you can use it, but otherwise the allele frequencies are the right way of doing it. And then there's COSMIC, which I'm not going to say much about, but it's the repository of somatic mutations. Again, you can use it to see whether your variants have already been reported as somatic, in which tissues, and with what recurrence. So let's move into gene mapping.
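The hard-filter caveat above can be made concrete. In this sketch the two sets are hypothetical stand-ins for real database lookups; the point is only the order of the checks.

```python
# The caveat in code form: do not use dbSNP membership alone as a hard filter,
# because recurrent somatic variants can also acquire rsIDs. The two sets
# below are hypothetical stand-ins for real dbSNP and COSMIC lookups.

dbsnp_ids = {"rs0001", "rs0002", "rs0003"}
cosmic_recurrent = {"rs0002"}   # a known somatic hotspot that also has an rsID

def keep_variant(rsid):
    """Keep a called somatic variant unless a dbSNP hit is its only annotation."""
    if rsid is None:
        return True                  # novel variant: keep
    if rsid in cosmic_recurrent:
        return True                  # recurrent somatic: dbSNP entry is no reason to drop it
    return rsid not in dbsnp_ids     # otherwise, a dbSNP hit suggests germline
```

Checking the somatic reference database before discarding dbSNP members is what keeps recurrent, literature-characterized somatic variants from being thrown away.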
In discussing gene mapping and the types of sequence within genes, I will pretty much adhere to the categories and concepts used in ANNOVAR, but of course they have general validity in biology. Some of them, you'll see, may be defined a bit differently than you would define them, but bear with me. Types of genes: protein-coding genes and non-coding RNA genes. So be careful when you hear "non-coding variants"; there are two types. The first are the ones with a regulatory effect: they sit in a transcription factor binding site, or in a UTR, or in an intron, and something binds there or recognizes the sequence and then regulates expression, chromatin state, or other events. The other type of non-coding variants are the ones in the exonic sequence of a non-coding gene, so they get spliced into the final mature transcript of that gene. There are several types of non-coding genes. The ones that have been known in biology for the longest time are the tRNAs, ribosomal RNAs, small nuclear RNAs, small nucleolar RNAs, and so forth. And there are newer categories that have been unveiled in the last five to ten years: microRNAs, lincRNAs (long intergenic non-coding RNAs), antisense RNAs transcribed in the other direction, and so forth. So when we talk about the type of gene for gene mapping, we're mainly going to rely on this distinction between coding and non-coding genes. Of course, the protein-coding genes are the ones that are better characterized. For non-coding genes, it really depends: there may be some that are well characterized, like some snRNAs or snoRNAs, and others where there's barely any literature, like a lincRNA that's in a catalog and nobody knows what it does. The other thing we have to keep in mind is: what part of a gene am I impacting? For a coding gene, there are the coding exons, which are the ones that code for the protein; they get translated, they code for a protein.
So what matters there is typically which amino acids they encode, and the effect of a variant in a coding exon is typically the effect it has at the amino acid sequence level. We'll see that the models look at whether you truncate the amino acid sequence, or whether you substitute an amino acid at a given position, which may hyper-activate or deactivate the protein. Variants in UTRs are harder to make sense of. In principle, they may have a regulatory effect, for instance on transcript stability, but I don't have any good predictor to give you for UTRs in this module. Then there are the introns, which are spliced out. They may also contain regulatory sequence, but most of the variants in introns are passengers. And then there are splice sites, which sit in the introns right adjacent to the exon, and contain consensus dinucleotides that are important for splicing. So if you disrupt a consensus dinucleotide and put something else there, it can have a pretty disruptive effect on splicing. As I said in the beginning, I will show you a predictive tool that is also able to interpret potential splicing effects in exons, even overlapping coding exons, or in introns, and that's based on more sophisticated features: it doesn't only look at the dinucleotides, but also at splicing enhancers, RNA-binding protein binding sites, and so forth. And then outside the gene, we have the upstream area, the downstream area, and then intergenic variants. So this is the categorization that ANNOVAR follows with respect to the type of sequence, for coding genes. For non-coding genes, most of it is the same; the only difference is that instead of having UTRs and coding exons, you simply have exons, because they do not code for a protein and do not get translated. ANNOVAR follows a priority mapping system, which I hope you can see well.
Since genes overlap in the genome, it simplifies your task of interpreting variants if you say: if I see a variant in a coding exon, and the intron of another gene is also there, I just report the variant in the coding exon. This priority system is meant to give you only the variants that are more likely to have an impact of interest, and it's fairly reasonable biologically: the coding exon of course comes first, a coding gene trumps a non-coding gene, the UTR comes second, intronic comes afterwards, and so forth. You can switch it off if you want, but I find it's actually helpful. For instance, consider this example where a microRNA overlaps a protein-coding gene. In the region where the two overlap, if the microRNA exon overlaps an intron, the prioritized mapping under the priority rule will be to the exon of the microRNA; otherwise it will be to the coding exon of the gene. So: a coding exon of a coding gene trumps an exon of a non-coding RNA, but an exon of a non-coding RNA trumps an intron of a coding gene, right? Look at the part in blue. The last thing to consider for gene mapping is that you can use different databases. I usually use RefSeq; somebody else may use Ensembl. We use RefSeq in the example. RefSeq tends to be more conservative in the number of exons and transcript isoforms, and Ensembl tends to be a bit more liberal. Okay, on to gene product effects; here we can talk about regulatory effects or effects on protein-coding sequences. These are the main effects on coding products to keep in mind. Stop gain adds a premature stop codon, so it leads to a truncated protein sequence, which is typically not functional, unless you're just removing the last 0.5% and there's no functional residue there. Frameshift has a similar effect by a different mechanism.
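The priority rule described above is just a precedence ordering. Here's a toy version; the ranking paraphrases the lecture's examples (coding exon > ncRNA exon > intron), while ANNOVAR's real precedence table is longer and configurable.

```python
# Sketch of priority-based gene mapping: when a variant overlaps several
# gene features, report only the highest-priority one. The ranking below is
# an illustrative subset, not ANNOVAR's exact table.

PRIORITY = {
    "coding_exon": 0,
    "splice_site": 1,
    "ncRNA_exon": 2,
    "UTR": 3,
    "intron": 4,
    "upstream": 5,
    "downstream": 6,
    "intergenic": 7,
}

def best_annotation(overlaps):
    """Given all features a variant overlaps, keep the highest-priority one."""
    return min(overlaps, key=lambda region: PRIORITY[region])

# The microRNA example: a variant in a miRNA exon that also sits in the
# intron of a coding gene maps to the miRNA exon.
print(best_annotation(["intron", "ncRNA_exon"]))   # the ncRNA exon wins
```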
Splicing: alteration of the core splice site, or of a splicing regulatory element, can have a similar level of impact by yet another mechanism. These three are often referred to as loss of function, because they're the most disruptive. Then there are others where it's a little more debatable whether they always have a strong effect, like in-frame insertions or deletions, or stop loss, which presumably adds more sequence until another stop codon is found, or translation stops for some other reason. And then you have missense SNVs, which change an amino acid, or synonymous SNVs, which do not change an amino acid but could still in principle have a splicing regulatory effect, or an impact because of codon usage. So, as I already said, the loss-of-function variants (stop gain, frameshift, splicing) are the ones with the strongest impact, but you always have to consider how many transcript isoforms are affected and what percentage of the coding sequence is affected. For missense variants, it's all about the type of amino acid change. You can get an intuitive idea by looking at the structures of the amino acids, but there are prediction models that are more accurate at this. First of all, we're going to have a look at conservation, and then at quantitative models to assess missense variants. Conservation is a broad concept. The measures of conservation I'm going to discuss here are conservation at the genomic sequence level: they're based on taking reference genomes of different species, doing multiple alignments, and checking whether a DNA nucleotide is conserved or not. That's the specific scope in which I'm going to use them here. There are two main metrics I find useful that are available from UCSC. One is phyloP, which is nucleotide-level (residue-level) conservation. The other is phastCons, which is regional conservation: whether a block of sequence is conserved overall or not.
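To make the site-level versus regional distinction concrete, here is a small sketch of using both scores together. The cutoffs are purely illustrative; real thresholds depend on which track you use (mammalian vs. vertebrate) and how the scores were computed.

```python
# Sketch: combining a phyloP-style site score with a phastCons-style regional
# score, as complementary lines of evidence. Cutoffs are illustrative only.

def conservation_flags(phylop_score, phastcons_score,
                       site_cutoff=2.0, region_cutoff=0.8):
    """Return the lines of conservation evidence supporting a position."""
    flags = []
    if phylop_score >= site_cutoff:
        flags.append("conserved nucleotide")   # the exact base looks constrained
    if phastcons_score >= region_cutoff:
        flags.append("conserved region")       # the surrounding block looks constrained
    return flags
```

A splice-site variant flagged on both counts is a stronger candidate than one supported by regional conservation alone.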
phyloP can be used as a complementary method to assess the effect of a splice site variant or a missense variant. For missense, you have to be careful, because of course not every change in nucleotide is a codon change that alters the amino acid, but it's a good complementary parameter to look at. The regional conservation, phastCons, is useful for getting an idea of whether you're looking at a region of the protein that's more conserved overall, like a conserved domain or some other local stretch of amino acids that's conserved across species. And you can get the mammalian or the vertebrate version from UCSC. Okay, let me cover this. For missense variants, there are several models used to score the impact of the variant in more detail. There are different ways to categorize them. One useful way is by the type of information they use: do they just take different protein sequences, align them, and see how conserved an amino acid is, or which substitutions at the amino acid level are observed across these multiple alignments? Or do they use more information: beyond the multiple protein sequence alignments, do they also look at secondary structure, or at other annotated protein features? The other consideration is whether they use a model that's not based on any prior knowledge of what the disease variants are (you just use the multiple protein sequence alignment and look at how often you see a certain substitution), or whether they rely on a positive and a negative training set, where the positive training set, the variants considered truly disruptive of the gene product, consists of known disease variants present in databases. This difference is important, because models based on a training set can be more powerful and can leverage more features, but can also be more biased.
What if most of the variants known to have an effect have been reported because people first looked at, say, genomic conservation, or at SIFT, or at something else? That doesn't mean that's the only mode of action; it's just the one we understand better. There can be a number of biases, and there's a lot of debate in the community about the pros and cons of different training sets, or whether you should use a training set at all. So these are interesting methodological considerations to keep in mind: why do two methods give different answers, and why is one method good at detecting modes of impact that are a bit more exotic? Just to give an idea of how some of these work: SIFT is probably the one that's been around the longest and is most broadly used. It's based on protein sequence conservation: it takes a bunch of protein sequences, does a PSI-BLAST-based alignment, and then looks up the frequency of substitutions at a specific position, so that you can assess whether a given substitution you observe in a variant is likely to be disruptive, because it's something that's never observed, or likely to be benign, because it is observed. Mind that you are collapsing together protein sequences from different organisms and paralogs from the same organism, so that's a caveat to consider. On the other hand, it's not based on any predefined set of known disease variants. PolyPhen-2 is another broadly used tool, and it has more features than SIFT: it uses sequence alignments, but also other features, like descriptors of amino acids (does this amino acid change, forgetting about its substitution frequency, seem likely on purely biochemical grounds to have a strong impact or not?), the presence of protein domains, and other secondary and tertiary structure features. But it is based on a training set.
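The core idea behind SIFT can be shown with a toy computation: given pre-aligned homologous sequences, ask how often the substituted residue is seen in that alignment column. Real SIFT builds the alignment with PSI-BLAST and produces a normalized tolerance probability; this only computes raw column frequencies, and the alignment below is made up.

```python
# Toy version of the substitution-frequency idea behind SIFT: a residue that
# is never observed at a position across homologs looks suspicious; one that
# is observed looks tolerated. The alignment is a made-up example.

def column_frequency(aligned_seqs, position, residue):
    """Fraction of (pre-aligned, equal-length) sequences carrying `residue`
    at `position`."""
    column = [seq[position] for seq in aligned_seqs]
    return column.count(residue) / len(column)

alignment = [
    "MKTAY",
    "MKSAY",
    "MKTAY",
    "MRTAY",
]
# A missense variant introducing 'S' at position 2 has been observed in a
# homolog, so it looks tolerated; 'W' at the same position never has.
tolerated = column_frequency(alignment, 2, "S")
never_seen = column_frequency(alignment, 2, "W")
```

The paralog caveat from the lecture shows up here too: if the "homologs" include paralogs with diverged function, an observed substitution is weaker evidence of tolerance.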
There are two main training sets used, HumDiv and HumVar, and of course, if those are biased or don't give you the full picture, PolyPhen doesn't give you the full picture either. In brief, MutationAssessor could be considered a SIFT with a more sophisticated approach to alignment; it was originally introduced to do a better job not just on backbone conservation but also on specific residues, improving the alignment approach. And then, of course, there are what I like to call meatloaf predictors, because they mince up a lot of different things; you see more and more coming out, and each one, of course, claims to do better than the others. These predictors don't have an original idea of their own about what features to use; instead they simultaneously use all the features they can grab from UCSC and from other predictors, including the type of effect (frameshift, stop gain, and so forth) and maybe other information, like where in the transcript the variant is, and they try to use a better training set and a better machine learning model to spit out an answer. The reason I'm presenting CADD is that I find its choice of training set particularly original and good. It uses simulated variants, generated by just applying a random background mutational model, so they are typically not observed in reality, as the deleterious training set. And it uses the changes you observe when comparing the chimpanzee and human genomes, which are presumed to be mostly non-disruptive of gene function because the two species are close enough, as the benign set. So it doesn't explicitly use variants that have been labeled by clinicians as pathogenic; that's what makes it interesting. It includes several of the predictors we saw before, so if you see correlation between CADD and SIFT, or CADD and PolyPhen, or CADD and phyloP, don't be surprised, right? It's a meta-predictor, so it has all of the previous ones in it.
And finally, I'm just gonna say something very brief about this new model for splicing regulatory prediction, which was recently published in Science by the Frey lab and collaborators. The Frey lab is here at the University of Toronto, and you may have seen this group publishing before on the splicing code, basically a model of how differential splicing is regulated in different tissues. This approach takes that model and applies it to disease: if we are able to learn, on healthy tissue, the features that drive splicing, we can take the model and adapt it to predict whether a variant actually disrupts regulatory sequence relevant to splicing. So this is another method that does not explicitly use disease variants, but of course you can show that it's good at predicting them, and that disease variants, or somatic variants from cancer, tend to have more predicted impact than the neutral polymorphic variants you observe in the general population. This is not available yet, so we're not going to try it out in our exercise, but hopefully it will soon be packaged as yet another ANNOVAR database, so it's easy to use. Okay, that's the end of the lecture. I'm actually going to skip these concluding remarks and just give a brief summary of what we've seen. We have focused mainly on annotating variants, looking at what genes they map to and what their effect on the gene products is. So we've been blind to other considerations, like whether a variant is recurrent or what the function of the gene is. With these tools, we can do a good job of assessing whether a variant significantly changes a gene product; then we have to consider what role that gene product has. We have mainly looked at mapping to protein-coding genes, and also to non-coding genes.
And we then looked mainly at effects on protein-coding gene sequences, and at models that give you a quantitative score, especially for those variants that can have a broader range of outcomes, like missense variants that change an amino acid. So this is still module eight, variants to networks, but we're not going to see much variant-level data in the second part. We're going to see mostly material about genes, pathways, and functional gene sets. The title of this part is "From genes to pathways". What are the learning objectives of this module? What identifiers can be used for genes? We've already seen some of them when we did variant annotation. What annotations are available for genes, and what is Gene Ontology? What is gene set enrichment analysis? What are gene set enrichment tests, and what are the statistical concepts behind them, multiple test correction for instance? Finally, how to visualize gene set enrichment results using a Cytoscape plugin called Enrichment Map, which will also be a segue into general principles of network visualization in Cytoscape. That last part may seem a little disconnected: Cytoscape is a tool to visualize networks, and we found a way to also use it for visualizing gene set enrichment analysis, but in the next module you'll do more network analysis. So we take advantage of this lecture to also introduce you to some principles of how to visualize biological networks in Cytoscape; bear with me. All right, a little bit of introduction. This is my conceptual map of how to go about gene set, pathway, and network analysis. On the left side, you have experimental data. You may have done RNA profiling in tumors, identifying differentially expressed genes or fused genes, or you have identified regions of the genome with focal amplifications, or recurrent somatic SNVs and the genes that are significant for those, et cetera.
So you may have different types of experimental data matrices; that's what the cubes represent. This experimental data gets mapped to the gene level: you define gene-level scores. Now you have to make sense of them in the context of the functions and interactions of the genes. So on the right side, you have prior knowledge about genes. You have genes organized in sets that relate to functions, to pathways, to whatever else. You have networks where genes are connected because their gene products physically interact or are co-expressed. And then you have pathways, which are diagrams explicitly representing a biochemical process, like a phosphorylation cascade, or a series of biochemical events going from receptor binding down to transcription factor activation, right? So you have your experimental data and this prior knowledge in computable forms, sets or graphs of some sort, and then you have some algorithms, some informatics. What you typically want to get out of it is: which gene sets are representative of my data, which areas in the network or which mini-networks are representative of my data, which pathways or patterns in pathways are representative of my data. In this lecture, we're mostly going to cover gene sets, and in the next lecture you'll see more about networks, okay? So typically, this gives you a map of the functional meaning of your experimental data. And besides the pretty picture, what is this really in practice? Well, this is what you get if you take a breast cancer cell line, feed it estrogen, compare it to a cell line that hasn't been fed estrogen, look at the gene expression, and look at the differences. Then you test Gene Ontology terms for being enriched in genes that are differentially expressed.
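The simplest statistical machinery behind such an enrichment test is over-representation: given a hit list and a gene set, how surprising is their overlap? Here is a sketch of the hypergeometric (one-sided Fisher-type) test; the gene counts below are made up for illustration.

```python
# Sketch of a hypergeometric over-representation test, the simplest form of
# gene set enrichment. Counts are invented for illustration.

from math import comb

def hypergeom_pvalue(k, K, n, N):
    """P(overlap >= k) when drawing n hit genes from a universe of N genes,
    of which K belong to the gene set."""
    return sum(
        comb(K, i) * comb(N - K, n - i) / comb(N, n)
        for i in range(k, min(K, n) + 1)
    )

# 20,000 genes in the universe, a GO term with 100 members, 200 differentially
# expressed genes, 10 of which fall in the term (expected overlap is ~1).
p = hypergeom_pvalue(k=10, K=100, n=200, N=20000)
```

When you run this for thousands of GO terms at once, this is exactly where the multiple test correction mentioned in the learning objectives comes in: the raw p-values must be adjusted before calling a term enriched.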
Take all those Gene Ontology terms, lay them out in a graph, a network, and then identify functional groups. This is telling you that activities like translation, RNA transport, DNA metabolism, RNA processing, and so forth are all upregulated in this cancer cell that's been fed estrogen, which is kind of trivial in this case, right? What happens to a breast cancer cell line if you feed it estrogen? It starts to grow and proliferate, so all these activities are upregulated at the transcriptional level as well. For other types of experimental data, the result may be less trivial; we'll see some application cases, okay? So first of all, gene identifiers: how to get the plumbing done right, basically. We've seen mapping variants to genes before; in that case, we used RefSeq. In general, if you have an experiment that produces some type of signal or information which can be mapped to genes, you have a wealth of gene identifier solutions to choose from. It depends in part on your application, but there are some principles you can keep in mind. First of all, ask yourself: what entity are you identifying? Do you have an assay that primarily identifies transcripts? If you're doing RNA-seq analysis and looking for isoforms that are differentially expressed, for instance using Cufflinks or another application, then your information is primarily at the level of isoforms. If you're doing something else, like annotating variants in ANNOVAR and getting output at the gene level with official symbols, like before, then your entity is a gene, and so forth. So be clear with yourself about what entity you are identifying. Then look at what databases are out there, and what the most stable identifier from those databases is. I like to work with the NCBI databases, with RefSeq and Entrez Gene; based on my experience, those identifiers will serve you well.
If you're working with transcripts, the RefSeq transcript ID; if you're working with genes, the NCBI Entrez Gene ID. They are stable, they are traceable, and they identify entities that make sense biologically. For instance, if you are identifying genes using the official symbol, that's also biologically clear, but number one, different species may have the same symbol; number two, symbols change with time. So the official symbol is a suboptimal choice. You will immediately make sense of it, but it's not fully stable and it's ambiguous across different species. The RefSeq transcript ID and the NCBI Entrez Gene ID do not have those drawbacks, so those are good choices. If you like to work with Ensembl, you can choose analogous identifiers from Ensembl, and if you're doing proteomics, you may prefer to first map your results at a protein identifier level and then secondarily at the gene level, for instance. So it partly depends on your experiment. Throughout the lab, we're gonna use the official symbols for a bunch of reasons, but the preferred identifiers in general are the Entrez Gene ID and the RefSeq ID — those are what I would propose people to use, or at least to have. Then you can also add the official symbol, and you can add other identifiers as an extra. Gene annotations: what is out there? What is available? How do they come as gene sets? Sets are a structure that's very easy to understand, and they're the easiest way that you can map cell biology to something that's computable. So you can say: in the cell, I have stuff like the nucleus; I have stuff like the cell cycle as a regulatory process. For each of these conceptualizations of cell structures and cell processes, I attach a list of genes as a set, right? That's a simple idea. There are many different types of gene sets, some of which you'll see used in the literature 90% of the time, and some are used less. Here's a complete list just to go through, and then we'll work mostly with the ones that are more popular.
So number one is functions. Nowadays, Gene Ontology is the controlled vocabulary that's clearly dominating in the world of gene functions, so we'll talk more about Gene Ontology and how it represents biological functions. Number two is pathways. As we said before, those are biochemical representations of a process using a graph, more or less formalized, but of course you can also look at them as sets. Give me all the genes in the EGF pathway from KEGG, right? So you can forget about the more detailed structure and just treat it as a set. Why would you do that? Well, just because pathways tend to be very carefully curated; they tend to be made of genes that are better characterized, so you can make sense of them, and there's probably lots of reagents that are already developed and so forth — as opposed to Gene Ontology functions, which tend to be typically broad. Then there are others: genotype-phenotype or disease associations; organizing genes by the protein families or protein domains in their gene products; cytobands or other genomic positions; co-expression modules; targets of regulators, like predicted targets of microRNAs or predicted targets of transcription factors; and finally protein interaction modules. Roughly as you go down this list, they become maybe less helpful for your standard application and you see them used less often, but it really depends on what you're looking at. Typically, in your standard analysis, I would say always consider using the first two: the functions from Gene Ontology and pathways from a couple of databases at least. So let's talk a little bit about Gene Ontology. It's a project that was launched in 1998 to standardize the way biological functions were described for genes in model organism databases. It has come a long way since then. It's become the dominant resource and there are constant revisions and updates to the structure. So stay tuned with the literature. Here I've just captured a few paper headers that are relevant.
It's meant to be useful for a large collection of species, and it's meant to standardize, in an ontology — so in a controlled vocabulary that's structured with formal relations — the normal physiological function of genes. So it's not about disease. It's not about abnormal phenotypes. It's about the physiological function, right? And each database uses the Gene Ontology terms to attach functions to genes. So each model organism database has a group of curators that takes care of that. It's composed of three main ontologies: molecular function, cellular component, biological process. Why is it important to know this? Well, we'll see for instance in the lab that since the genes we're analyzing are very well characterized, we just pick one of the three; in that case, since we're looking at cancer genes, we prefer biological process over the other two. It depends on your application. You may want to use them all, or maybe use only one of them. What do they represent? Molecular function is more focused around biochemical activities, or stuff that you can characterize in vitro or in isolation — it's a property that's, in that sense, biochemical. So for instance a ligase activity, or a kinase activity, or the propensity to bind DNA: those are all examples of molecular functions. Cellular component is structured around the parts of the cell. So organelles, like ribosomes or the endoplasmic reticulum, and specific components of them — the mitochondrion: matrix, outer membrane, inner membrane, and so forth. So it's like the anatomy of the cell, if you wish. And as for biological process, this is more about how things come together in processes like cell cycle, apoptosis, cell differentiation, immune response — from the purely cellular level up to the organismal level. So that's a bit of a difference between biological process and cellular component: cellular component is just concerned with the cell, not with cell types or organs.
It doesn't include any anatomy at the organ level, while biological process does include processes at the organ level, like immune response or cognition, right? So if you visualize the Gene Ontology terms in the ontology graph, this is what you're gonna see. You start from a term that's really, really general, like biological process. Then you go to terms that are a bit more specific, like cellular process, cell cycle, down, for instance, to checkpoints or specific parts of the cell cycle, like sister chromatid segregation. So it's a hierarchy of terms, where one is more generic and the others are more specific. And the relations among them — the ones that are most used are is-a and part-of. For instance, cell cycle is-a single-organism cellular process, and mitotic sister chromatid segregation is-a sister chromatid segregation, right? So these are the types of relations that you have. If we are just mapping these terms to genes, that means that for every term, we have a set of genes, right? And then of course, if one term is the child of another term, so it's more specific, you see that for the more specific term, those genes are a subset of the broader set of genes for the more general term. So if you are mapping Gene Ontology terms to genes and then have gene sets as a result, you have these inclusion or overlap relations among all these gene sets. And we use this property later on to visualize results, okay? Yeah? The gene ontology classes, can they be subsets of each other? Like molecular function, cellular component, and then on top would be biological process? No, well, those were born as separate ontologies, and now they're looking at cross-connecting them. When they were born, they were completely separate. They were not in a hierarchical relation between them; they were completely cut off from each other. Now they're looking at connecting them where it makes sense.
And the other case is like, if you have protein phosphorylation and kinase activity — well, one is a process and the other one is a molecular function, but their meanings are related, and therefore proteins get annotated with both, with overlap. Or if you had transcription and nucleus — well, the transcriptional machinery resides in the nucleus, right? So even if you create separate ontologies, there are obvious relations, and the Gene Ontology is now looking at building those cross-ontology connections. Yeah. How are these diagrams generated? Well, this is from QuickGO. QuickGO is a visualizer designed by the EBI where you can put in a Gene Ontology term ID or name, or search for stuff, and then you can see these trees. It's not really a tree, it's a directed acyclic graph. Now, there's something a little bit nuanced about how these Gene Ontology terms get appended to genes that you'll have to keep in mind a little bit. Qualifiers apply restrictions or alter the meaning of an annotation. The one that's most used, and confusing perhaps, is NOT. At some point there was a need to specify that, for instance, genes A, B, and C are NOT in a given cellular component. That means that if you don't process it, you get exactly the opposite of the meaning. Typically, when you are getting gene sets derived from Gene Ontology, if they are from a good resource, they will have processed the qualifiers and made sure that they never converted a NOT annotation into an annotation. But at times, if the tools are not good or are outdated, they may have missed that. The evidence codes are a different business: the evidence codes are meant to document what type of evidence you have for the annotation, so you can go back to the original experiment or piece of literature. And again, we don't have time to review this in a high level of detail, but you have a bunch that describe different types of experiments.
You have a bunch that describe different types of computational or in silico analyses. In all these cases, that means a curator has read a paper or a published analysis and then said: it makes sense, based on my expertise and based on this piece of literature or published analysis, that this annotation can be made. And these codes are just saying, well, the original experiment was an expression pattern or an interaction, or the original bioinformatic analysis was sequence homology or co-expression and so forth. Then there are others which are a little bit different. Traceable author statement means there was a review pointing to other experimental papers, where the author was sort of capturing the consensus from different studies — so this is very strong — and the curator read the review and made the annotation based on that. The weakest support, if you wish, is IEA, inferred from electronic annotation, which just means that, without a curator going through it, the annotation was simply imported from a database. Say Swiss-Prot has some keywords that they have built up over time without documenting the provenance, and at some point these get mapped to Gene Ontology automatically — that would be an example of inferred from electronic annotation. So some people like to remove inferred from electronic annotation, IEA; others keep it. It really depends. So, of all these nuances that we can't review completely just because of time, keep in mind those two key take-home messages: NOT qualifiers — your resource has to process them appropriately — and evidence codes — typically people consider whether to include or exclude IEA, inferred from electronic annotation. So, moving on and wrapping up about Gene Ontology, I put here a summary just to recap: ten tips on how to use Gene Ontology, from a published paper. I'm just gonna read them through. Number one, know the source of your Gene Ontology annotations. Are they up to date?
Do they filter out certain evidence codes? Do they process qualifiers appropriately? Number two, understand the scope of the annotations. Number three, be aware of the evidence codes, which we have reviewed. Number four, probe the completeness of annotations. What does that mean? Well, this is done by curators — Gene Ontology is built by interaction between experts and curators — so there may be areas of biology where the curators haven't caught up, or Gene Ontology is a little bit thin. So look at your biological area of interest. Cancer? Well, typically things are well covered, right? So it's not so much a problem for cancer. Processes like cell cycle, apoptosis, growth factor cascades — these are among the best characterized. Understand the graph structure, which we've seen before. Choose analysis tools carefully — which we're gonna see later for enrichment tests. Carefully report your GO annotation source and version; this is good practice. And give feedback that helps Gene Ontology construction. Here are some quick statistics just to give you an idea of the coverage of Gene Ontology for humans, and I've reported them with inferred from electronic annotation and without. They are a little bit out of date, because they're from the end of September 2014 — we're six months later, but the numbers won't have changed dramatically, no doubt. The first set of statistics is about the Gene Ontology terms that annotate at least one gene. If you include inferred from electronic annotation, there are about 19,000 terms that annotate at least one gene, and without, there are about 16,000. So there are a lot of Gene Ontology terms — about as many as the genes, all right? So let's go to the second table, again in the interest of time. The second table below is statistics with gene numbers.
How many genes are annotated by at least one Gene Ontology term? With inferred from electronic annotation, it's about 18,000, and without inferred from electronic annotation, it's about 15,000. So if you exclude the annotations whose evidence codes are maybe more shaky, you have about 15,000 genes. So it's a good number: the genes that are supposed to exist are, based on the estimates, between 20 and 24,000, depending on the estimates and databases. Be careful, though, what happens as you start removing terms that are too general. And how can you measure if a term is too general? Well, if it annotates a lot of genes, it's more general. If, for instance, you remove terms that have more than 500 genes, your number drops: without electronic annotation, you go from 15,000 to 12,000. So the genes that are really well characterized are not really 15,000; it's closer to 10,000 to 12,000. Some genes may just have a term like "regulation of biological process". What does that mean? It doesn't mean much. Or "it's in the nucleus". Yes, but what does it do? Right, so we have to be careful when we say how many genes have a functional annotation. Finally, gene set resources. In the lab, we will use a tool called g:Profiler for enrichment analysis of Gene Ontology and other sets, and that already has Gene Ontology in it, up to date, for different species. There are a couple of other resources reported here that you can use: there is one from the Bader Lab, there's MSigDB from the Broad, and there's a Bioconductor package. These are all good — maybe MSigDB is getting a bit outdated. How do they differ from each other? Well, species support, or the types of gene sets: most of them cover Gene Ontology and the essential pathway databases, but as you move to the other types of gene sets, different resources may capture or not capture something.
For instance, g:Profiler has microRNA targets, and so does the Bader Lab resource, but the drug targets, I think, are only in the Bader Lab resource and not in g:Profiler. So if you want some gene set that's more exotic for your experiment, you may have to look around a little bit more. Okay, yeah? Isn't there an online tool called DAVID? Yeah, well, DAVID. We used to do this lab using DAVID, and then it turned out they were not updating Gene Ontology very frequently — it seems to be from like 2009. So that's why we have dropped DAVID from this lab, sadly. Have you used GOrilla? I've heard about it. What was interesting about GOrilla — I mean, there's one problem I'll describe later, which is: where do I set my threshold? GOrilla was introduced so that it would scan multiple thresholds and find the optimal one. The problem is that that type of problem was solved, at least as a statistical problem, by GSEA in a better way. That's why I personally never used it and never put it in a presentation. I can't comment on whether it has its own Gene Ontology sets, or how good or up to date those are, because for every tool you have to do your own research: check their documentation; where did they get the annotations from; did they propagate them, so that whenever you have an annotation for a smaller term, you also have the annotations for all the bigger terms. So really, if you just take a random tool — especially one that's not broadly used, but sometimes, like DAVID, even a broadly used one — it can fall short on how frequently they update Gene Ontology. So you have to be very careful in checking the frequency of updates; that they did the up-propagation, which means there are primary annotations and they use the ontology to infer all the other annotations; that they either process the qualifiers like NOT, or have annotations already without the NOT qualifiers; and that they transparently tell you if they're keeping or removing the IEA evidence codes.
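The up-propagation just described can be sketched in a few lines of Python. This is a toy illustration only: the terms, is-a relations, and gene symbols below are hypothetical, not real GO data.

```python
# Toy "is_a" relations, child term -> parent term (hypothetical ontology slice).
IS_A = {
    "mitotic sister chromatid segregation": "sister chromatid segregation",
    "sister chromatid segregation": "cell cycle",
}

# Primary (direct) annotations per term; gene symbols are illustrative only.
DIRECT = {
    "mitotic sister chromatid segregation": {"BUB1", "MAD2L1"},
    "sister chromatid segregation": {"ESPL1"},
    "cell cycle": {"CDK1"},
}

def propagated(term):
    """Gene set for a term after up-propagation: any gene annotated to a
    descendant term is also annotated to this term."""
    genes = set(DIRECT.get(term, set()))
    for child, parent in IS_A.items():
        if parent == term:
            genes |= propagated(child)
    return genes
```

After propagation, a child term's gene set is always a subset of its parent's — which is exactly the inclusion relation among GO-derived gene sets mentioned earlier.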
I mean, it's something relatively trivial, but... For these tools, do you have to download the database? No, no — g:Profiler is just an online tool. You upload your gene list, set the parameters and get the output. So it's based on the current versions of the databases, yeah. Okay. Then how... If you want to tightly control the version of the database you're using, then you're better off using your own; for instance, for overrepresentation you can use R packages. You use an R package for Gene Ontology, use an R package for the statistical test, and then you tightly control what version you're using. So if you want to maintain reproducibility, then you'd better tie yourself to a particular version — remember the version. Yeah. There's traceability and there's reproducibility, right? They're different concepts. One important one is traceability, even more than reproducibility. Traceability means: when you use something, make sure to check, when did they do their last update? Do they have any versioning that you can copy and save for your analysis? That's good. If you want to have it reproducible, it's difficult to find tools that save their Gene Ontology versions across so many updates, so you may prefer to develop your own pipeline. Michelle has a suggestion. Okay — many people use Ingenuity, mostly because it has a lot more curated information in it, but I think you can already do a good job with the public resources. At the end of the day, maybe the greatest advantage is that you have one go-to point, and the greatest disadvantage for Ingenuity — which I don't use that often, but as of the last time I looked at it — is that you don't have a lot of choice for the statistical test that you're doing. They just use simple statistical tests, so they may not have all the tests that you would want to implement. So, moving on to pathways. What are pathways?
We saw before gene sets as flat sets of genes, without any relation or structure inside them. There was structure in Gene Ontology, but then we said we would use sets of genes corresponding to Gene Ontology terms, so within those sets there was no structure. Pathways have a lot of internal structure, but there's a lot of variability: many different standards, some of them converging on exchange standards, others not. You'll see Reactome in the next lecture, which is a fairly well-principled pathway database with regard to the standardization of the representation format. In comparison, KEGG has a less formal structure. The biggest advantage of pathways is that the information in them is curated and gives you the mechanistic details of the process. Gene Ontology, of course, is not telling you much beyond: these genes work in the process known as apoptosis; some of them are positive regulators of apoptosis; and maybe there's a subset that are functional in, say, neuronal apoptosis. But if you want to see the apoptotic cascades with the biochemical steps, like phosphorylations and dephosphorylations, you have to go to a pathway database. In gene ontology, where we have these terms — is there any relationship, like, will a gene set in one term always have interactions? Will all the genes in that term have interaction partners, interact with each other? What you could say is that, especially for cellular component, if you take the genes in a given cellular component like the ribosome, there's a very high likelihood that the mutual protein interactions are much denser than the protein interactions to the outside world. But for the others, like biological process or even more so molecular function, you cannot expect that things that belong to the same molecular function interact with each other. If you take kinases — well, there are kinase cascades, so they will have some connectivity.
But if you take, for instance, DNA-binding proteins — yeah, some of them interact, others will not, right? So you won't have a linear relation between higher internal protein-protein connectivity and being a member of the same Gene Ontology term. We won't have time in this lecture, but there's a tool called GeneMANIA that uses networks of different sorts to predict Gene Ontology functions. That's how it was born, and now it's used mostly as a browser of interactions: you start from one gene and fan out. So you can predict functions using interactions. But it's not a rule that things within the same functional set will all be forming an interaction module, if that's what you were asking. Okay, so here are some examples of pathways that we'll just fly over, because we still have about halfway to go. So, the second part: the first part was mostly about annotations; the second part is mostly about the enrichment tests and the relative statistics. Just to recap, we are here: how do I go from my experimental data to which gene sets are relevant or interesting for my experimental data? Typically, this is the conceptual layout of an enrichment test. You have an experiment that's identifying, out of the space of all possible genes — or, in statistical jargon, the gene universe — a set which are upregulated or in some other way of interest. In the lab, we take genes that have significant, recurrent somatic mutations. But in your experiment, it could be: has a fusion, has a focal amplification, yada yada yada, right? Mind one first methodological thing: always think of your experiment as — these are the genes that I found, and these are the genes that in principle I could find. Some experiments can find all the genes: if you do RNA-seq, in principle, if the gene is expressed, you can see it. But other experiments cannot see all genes.
If you're doing a phosphorylation assay, you may see only a subset of proteins. If you have an experiment that's working with kinases, you only see kinases. So always mind to think whether your gene universe is somewhat constrained, so that it is already sampling only one function or one group of genes. If so, I have to model that in the test. Then you have the enrichment test, which is a relatively simple or sophisticated statistical model. And what we're feeding it is not only your list of genes that are experimentally positive, but also the known gene sets, right? Derived from Gene Ontology, from pathways, or from anything else. Do the tools which do enrichment tests take into account which source the gene list has come from? So, in g:Profiler, there's a simple thing, which is that you can explicitly state what the gene universe is. You can give it a long list of genes and say: these are all the possible genes that I can detect. That's the easiest way. There are more nuanced issues — for instance, if it's not just that you can see certain genes and not others, but the probability that you can see a gene is proportional to some characteristic, like the length. That's more complicated to correct. In that case, the correction is not only in stating the gene universe: you have to go inside the test and modify the statistical model to take that into account. I have slides addressing that. Okay, so in the end you get an enrichment table with the gene sets and the p-values. And of course, don't forget that these rows in the enrichment table refer back to sets of genes, right? The statistical test that you will see used over and over and over is Fisher's exact test — sometimes, surprising people from the statistics community, referred to as the hypergeometric test. What that really means is Fisher's exact test, and Fisher's exact test is based on the hypergeometric distribution; it's just terminology.
Just to say that if you hear "hypergeometric test", we're talking about the same thing. So what is the idea here? The blue sphere is your gene set, like cell cycle. And you have your yellow genes, experimentally positive, and your brown genes, experimentally negative, so not found. And you wanna test what property? You wanna test whether the overlap between the functional gene set and your experimental genes, taking into account the gene universe, is larger than you would expect just by extracting things at random — so just by choosing a random selection of genes, stuffing them into a functional gene set of the same size, chosen at random and not based on annotation. This explanation based on random draws is intuitive, but the test is really not doing the random draws, because this is a combinatorial problem that can be resolved analytically. So there is a formula working inside Fisher's exact test to produce the p-value. And remember, the p-value is what? If you want to summarize inferential statistics in one minute, all these tests are about the following thing. If I'm looking only at a slice of the world, there's a problem that's called sampling: I'm grabbing a sample of the world, looking at it, and since it's just a subset of the full picture, there may be some stochasticity. So all these tests spit out a p-value, and that p-value is the probability that just by sampling and sampling and sampling, you could see that property. So in this case, the p-value from Fisher's exact test is telling you: if you were just to pull out gene sets of the same size at random from the gene universe, how often could you see that overlap with your experimental genes? If the p-value is closer to one, often; if the p-value is closer to zero, rarely — a p-value of 10 to the minus four means you need 10,000 draws to see it once, right? So that's the meaning of the p-value. It's all about sampling, and the stochasticity that comes with sampling.
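The analytical formula behind the one-sided test can be sketched directly with Python's standard library. The numbers below are made up for illustration; in a real analysis you would use an established implementation such as `scipy.stats.fisher_exact` or `scipy.stats.hypergeom`.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """P(overlap >= k) when an experimental list of n genes is drawn from a
    universe of N genes that contains K genes annotated to the functional set.
    This upper tail of the hypergeometric distribution is the one-sided
    Fisher's exact test p-value."""
    return sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    ) / comb(N, n)

# Toy numbers: universe of 1,000 genes, a 50-gene functional set,
# a 50-gene experimental list, 10 of which fall inside the set.
p = enrichment_pvalue(k=10, n=50, K=50, N=1000)
```

With these toy numbers, the expected overlap under random draws is only 50 × 50 / 1000 = 2.5 genes, so an observed overlap of 10 yields a very small p-value.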
So I've already said what's the importance of the background. The next concept that we have to assimilate: before, we were seeing the p-value when testing one gene set against my experimental genes. That's one statistical test. With one statistical test, I can look at the p-value — the p-value is small, below 0.01 or 0.05 — and I can reliably believe that the property I'm observing is specific: it's not just coming from the inherent stochasticity of sampling things at random, right? But what if I don't have only one test, but many tests? Do this mental experiment: perform n random draws and test enrichment on each. Then you set the p-value significance threshold, like 0.01. Well, there are gonna be n random draws multiplied by 0.01 that pass a threshold of 0.01. So if you're doing a million random draws, and you set the p-value threshold at one over a thousand — it's small, very small — you still have a thousand tests that will pass it, even if you're just doing random draws. So clearly, when you're doing many tests, and not just one — you're testing many gene sets and not only one gene set — you need something else beyond just looking at a pre-fixed p-value threshold that's arbitrarily small. Or in other words, how small that p-value threshold needs to be depends on the number of tests that you're doing. Because if you were just doing random draws, you know that the more tests you do, for a fixed p-value threshold, the more gene sets will pass — not as a fraction of the total, but as an absolute number, right? As I said before: I use a p-value threshold of one over a thousand, so 10 to the minus three; I do a million tests, 1,000 pass; I do a billion tests, one million will pass, right? So as a fraction of the total it's the same, but in absolute numbers it can grow. So you can't just do gene set enrichment over 3,000 gene sets and say, oh, there's a p-value of 10 to the minus two, it's good. No, it's not, okay? So this problem is solved in a couple of ways.
Here, I'm showing you one way of solving this problem, which is the Benjamini-Hochberg FDR. It's the most used for gene set tests, and also for gene differential expression and other applications. So first of all, what does FDR mean? Franklin Delano Roosevelt? No, it's false discovery rate, right? So how do you think about the false discovery rate? You have your gene sets, sorted by false discovery rate: 1%, 1%, 2%, 5%. That means, when you are at 5%: if you take the gene sets at 5% and lower, in that group of gene sets an estimated 5% are false positives, right? So it's really a descriptor of a group of tests — the ones that come with the best p-values — and it estimates, in that group overall, what's the false discovery rate. So think well about this difference: the p-value is a property of one test; the false discovery rate is a property of a set of tests. Don't confuse this detail. You cannot just extrapolate and take a gene set that is at FDR 5%, ignore everything else, and say this single gene set is FDR 5%. No, you have to say: if I take all the gene sets at FDR less than or equal to 5%, then in that group there's a false discovery rate — so a percentage of false positives — of 5%. So you see that it's very useful for stuff like gene set enrichment analysis, because you're willing to tolerate a fraction of false positives. You don't expect everything to be absolutely true, but if at least you contain it at around 5% or lower, you can minimize the incidence of false positives in the picture that you're getting. In the lab, we're gonna use a ridiculously low false discovery rate, but that's just because the set of genes of experimental provenance that we will use is so robust and so well characterized that the false discovery rate they produce in the enrichment is very, very low. There's a formula here for the Benjamini-Hochberg FDR, which I don't have much time to do more than read — you can spend more time on it. I just wanna show you how it works.
So here there's some imaginary data, where I have a gene set, transcriptional regulation, with a p-value of 0.01. The false discovery rate works very much along these lines. The first gene set has a p-value of 0.01, and I've done a total of 53 tests here — so for some reason I was testing just 53 gene sets. And then I say: if I were just to do 53 tests on random draws, at a p-value threshold of 0.01, how many would I get randomly? And how many of those, compared to the ones that I actually see? So you do 0.01 multiplied by 53, divided by one, and you get something that's very close to the FDR q-value — very close; I'll say in a moment why it's not exactly the same. As you walk down this list, you then go to a p-value of 0.02 multiplied by 53, but now there are two in the real data that pass it. So 0.02 multiplied by 53 is the number you expect to see by random draws, and two are the real ones, and again you can calculate the false discovery fraction. You see that these false discovery fractions, which are in black, are not changing monotonically — they go a little bit up and down — and they're not necessarily comprised between 0 and 1. So the final Benjamini-Hochberg FDR is transformed so that it is monotonic as you walk down the list, and it's between 0 and 1, right? It's just a matter of how you formulate the definition. So you see that the final FDR, in green and red, is 0.04, 0.04, 0.04, and then 0.53. Remember what I said before: you cannot take the first gene set and say it alone has FDR 4%. You have to take all the gene sets at FDR less than or equal to 4% and say those, as a group, have a false fraction of 4%. Okay? Any questions so far? Just to make sure I understand, let's take line 52 for a second. You calculate the alpha to be 1.004? Sorry — the alpha, usually in statistical jargon, is a set threshold, okay? So here there is no... well, okay. In this case, you could say, what is my alpha?
If I want to estimate how many false findings I have, that means that for every gene set, as I go through that list, I choose the alpha that sits exactly at its p-value. So if I wanted to use strictly proper terminology, I would say, well, which row do you want to pick? For instance, nuclear localization, number four? Sure. Okay. So for nuclear localization, I set an alpha at exactly 0.0031. And I take all the gene sets that are at or below that alpha, which is a total of four real ones. Then I ask: how many false findings do I expect? That alpha multiplied by the number of tests, so 0.0031 multiplied by 53. That's my expected absolute number of false positives. And then I want a fraction of false positives. So I take the expected false positives (which right now I cannot calculate mentally, 0.0031 multiplied by 53, but it is a number) and divide by the real ones, which are four. And that is my raw FDR, if you like. Then I have to transform the raw FDR so that it is between zero and one and monotonically non-increasing as I move toward better p-values. And I end up getting an FDR that, in this case, is identical to the raw one: 0.04. But why do you have the transformation? Well, the FDR is always meant to be for a group, right? So if at some alpha you would be getting a higher FDR than at an alpha further down the list, you would just never stop at that alpha; you stop at the one below. That's the trick to get a monotonic FDR: when you have an FDR that's higher and then it becomes lower, you never stop at the one that's higher, you stop at the one that's lower. And so that's how you get a monotonically decreasing FDR. Again, it's a matter of playing with the definition. If you were constructing an empirical FDR, you would get the raw, non-monotonic one. The formal one is monotonic decreasing, just by this artifact, sorry, this trick, of not stopping when it's a bit higher, but stopping later. Okay?
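The procedure just described can be sketched in a few lines of code. This is a minimal implementation of the Benjamini-Hochberg adjustment as explained above: compute the raw FDR p * m / rank for each test, then enforce monotonicity by taking the running minimum from the least significant test upward (the "never stop at the higher one" trick).

```python
import numpy as np

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values).

    For a p-value with rank k among m tests, the raw FDR is
    p * m / k. The final q-value is made monotonic by taking the
    running minimum from the least significant test upward, and
    is clipped to [0, 1].
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                              # ascending p-values
    raw = p[order] * m / np.arange(1, m + 1)           # raw, non-monotonic FDR
    q_sorted = np.minimum.accumulate(raw[::-1])[::-1]  # enforce monotonicity
    q_sorted = np.clip(q_sorted, 0.0, 1.0)             # bound between 0 and 1
    q = np.empty(m)
    q[order] = q_sorted                                # back to input order
    return q
```

For instance, the lecture's nuclear localization case, a p-value of 0.0031 at rank 4 among 53 tests, gives a raw FDR of 0.0031 * 53 / 4, which is about 0.04.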
So I have to move on. More new things. We've seen a simple type of test, Fisher's exact test, which I can call an over-representation test. We've seen that I can get a p-value for a test. And we've seen that if I do many tests, I have to do multiple testing correction. And this is the bread and butter of gene set enrichment: Gene Ontology terms and so forth. But is that the whole story? Well, we've seen the case where I had to explicitly say what my universe was, because I had a biased experiment. There are more insidious forms of bias. The general problem is that genes do not have a uniform probability of displaying some genomic signal, and that can come about in many different ways. The general solution is to modify the enrichment statistic, or the construction of the null hypothesis distribution, for example by doing a resampling driven by that probability curve. We're not going to see this in action, but I just want to give you a little bit of a primer, so you know the problem exists and you are aware: for a given type of data, you use the right tool. We don't have time to cover all these tools and examples in detail, so I'll just do a quick fly-through of a couple of applications. For RNA-seq, the problem is that, especially for count-based methods, the longer a gene is, the more reads you can count. The more reads you can count, the more accurate your estimate of abundance will be, because you have a larger statistical sample, so there's less stochasticity. That means longer genes have greater statistical power to detect differences than shorter genes. So if you have the same magnitude of change, a longer gene has more power to come out significantly different than a short gene, where "length" here is not the length of the gene but the length of the transcript, that is, the exonic sequence.
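For reference, the over-representation test mentioned above can be computed directly from the hypergeometric distribution, which is what a one-sided Fisher's exact test does. This is a minimal, stdlib-only sketch; the variable names (`n_hits_in_set`, `universe_size`, etc.) are illustrative, not from any particular tool.

```python
from math import comb

def over_representation_p(n_hits_in_set, n_hits, set_size, universe_size):
    """One-sided over-representation p-value (hypergeometric upper tail),
    equivalent to a one-sided Fisher's exact test.

    Computes P(X >= k) where X counts how many of n_hits randomly
    drawn genes (out of universe_size) fall in a set of set_size genes.
    """
    N, K, n, k = universe_size, set_size, n_hits, n_hits_in_set
    denom = comb(N, n)
    upper = min(K, n)  # cannot overlap more than the smaller of set and hits
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / denom
```

For example, if all 5 of your hit genes land in a 5-gene set drawn from a 10-gene universe, the p-value is 1/252, i.e. about 0.004.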
So that needs to be corrected, and there's a way to correct it, implemented in an R/Bioconductor package called GOseq; there's the header of the paper. Then binding sites, which you may obtain if you do ChIP-seq, methylation-seq, or some other capture-based sequencing. Here you're mapping binding sites to a gene, right? So if you define a window around the gene, again you have a problem: bigger genes have a higher probability, just by random sampling, of intercepting signals, peaks. And there's a tool called GREAT which corrects this problem: rather than using Fisher's exact test, it uses a binomial test that takes into account the size of the gene's window. Again, GREAT is an available tool that you can use if you have ChIP-seq or similar types of data. And finally, there are somatic mutations. In the lab we won't see a gene set test that corrects for these biases, but we will see a way of scoring genes that corrects for them; and after we score genes with those biases corrected, we can use the significant genes reliably. Well, if you look just at raw tallies of somatic mutations, which genes act as magnets for somatic mutations? The ones that are very long, and the ones that are not under strong negative selection, like the olfactory receptors. So you have to model those, and when you score the significance of somatic mutations at the gene level, or even the gene set level, you have to factor in these confounders. The experimental gene list that we will use comes from many cancers with sufficient data support to identify genes with significantly recurrent somatic mutations; these confounders were already corrected out at the gene level, so we can safely test the gene sets without screwing up. But if you have a smaller experiment, you may want to do a gene set test directly, so look into these types of issues. Wouldn't it also need to correct for expression level? All genes are not expressed equally.
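To make the general solution mentioned earlier concrete, here is a minimal sketch of a bias-aware null built by resampling, in the spirit of tools like GOseq (this is not GOseq's actual algorithm, just the resampling idea): instead of assuming every gene is equally likely to be a hit, the null draws gene lists of the same size with probability proportional to gene length, the assumed bias. All names here are illustrative.

```python
import random

def length_aware_enrichment_p(hits, gene_set, gene_length,
                              n_resamples=1000, seed=0):
    """Sketch of a bias-aware over-representation test.

    The null distribution is built by drawing len(hits) genes without
    replacement, with probability proportional to gene length, and
    counting how often the random overlap with gene_set matches or
    exceeds the observed overlap. gene_length maps gene -> length.
    """
    rng = random.Random(seed)
    genes = list(gene_length)
    weights = [gene_length[g] for g in genes]
    gene_set = set(gene_set)
    observed = len(set(hits) & gene_set)
    k = len(hits)
    at_least = 0
    for _ in range(n_resamples):
        draw = set()
        while len(draw) < k:  # weighted sampling without replacement
            draw.add(rng.choices(genes, weights=weights)[0])
        if len(draw & gene_set) >= observed:
            at_least += 1
    return (at_least + 1) / (n_resamples + 1)  # add-one to avoid p = 0
```

With this null, a long gene that is hit often contributes less surprise than a short gene hit just as often, which is exactly the correction we want.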
Okay, but is that something you really want to correct for? The concern is this: if a gene is expressed at a very low level in, say, the normal sample, and expressed slightly higher in the disease sample, you won't be able to pick up that difference, whereas a highly expressed gene would show it. I know, but that's because the difference is modest. There is a difference, yes, but why would you want to say: I have this tiny difference and I have this huge, super reliable difference, and I'm going to score them equally? Is that desirable? What you are saying is that if I have a gene that's expressed really, really high, I can pick it up better than a gene that's expressed at low levels. But we have to be careful about that. If I have a gene that's expressed at high levels and it also has high variability, that will be factored in by the test. And if I have a gene that's expressed at a low level, but it goes from completely absent to present at reliable levels, then with sufficient sequencing depth in RNA-seq I will be able to pick that up as significant. So I would want to see a more formal analysis showing that at a hundred million paired-end reads you cannot pick up differences for genes that go from being absent to being expressed at modest levels. There may be some residual bias, but it's not as strong as the length bias, where I have a gene that's a kilometer long and one that's a centimeter long, they have the same change, and one gets a p-value of 0.1 while the other one gets a p-value of ten to the minus six. Right, I have to accelerate a little bit, because otherwise we're only going to have half an hour for the lab.
One final thing about enrichment analysis. You've seen that Gene Ontology has a lot of terms, organized hierarchically. What if I bring in pathways and other sets? Then I have a zillion sets, and I need to organize them in a way that shows me their mutual relations. The way we found to do this is called an enrichment map. It basically connects gene sets that overlap a lot. So if two gene sets are very, very similar, with about the same genes in their belly, they end up being connected and clustering together in the network. If they are completely different, they end up in completely different places in the network. That's the image I showed at the very beginning, and this is a snapshot from it: here you can see a bunch of clusters of gene sets representing DNA metabolism, microtubule cytoskeleton, and other things. Finally, I'm going to do a really quick two-to-five-minute run through concepts of network visualization. You represent a network graphically with balls and sticks, where the balls are genes or other genomic entities like proteins, and the sticks are links: connections or physical interactions. You can represent the same thing just as a table of A interacts with B, B interacts with C, and so forth, or as a heat map. Cytoscape is a good open-source tool with a very broad user community for visualizing and analyzing networks; we're going to start using it in this lab and in the next one. A couple of key ideas in network visualization. Number one: automatic layout helps you see patterns, using the visual processing capacity that you have behind your eyes. Number two: you can take numerical or categorical attributes and map them to graphical attributes using Cytoscape. So here's a visualization: this was a network from yeast showing proteins colored by chromosome, before and after layout. The colors represent different chromosomes, and you see that before layout it's a tangled mess that doesn't have any meaning whatsoever.
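The core idea of the enrichment map, connecting gene sets that share many genes, can be sketched very simply. Here I use Jaccard similarity on the gene memberships (real enrichment map tools also offer overlap coefficients and combined scores; the function name and threshold are illustrative):

```python
from itertools import combinations

def enrichment_map_edges(gene_sets, min_similarity=0.25):
    """Sketch of the core of an enrichment map: connect gene sets
    whose gene overlap (Jaccard similarity) exceeds a threshold.

    gene_sets: dict mapping set name -> collection of gene symbols.
    Returns a list of (name_a, name_b, similarity) edges; highly
    overlapping sets end up connected and cluster together when
    the network is laid out.
    """
    edges = []
    for a, b in combinations(sorted(gene_sets), 2):
        ga, gb = set(gene_sets[a]), set(gene_sets[b])
        union = ga | gb
        if not union:
            continue
        jaccard = len(ga & gb) / len(union)
        if jaccard >= min_similarity:
            edges.append((a, b, jaccard))
    return edges
```

Feeding these edges into a network tool like Cytoscape and running an automatic layout is what produces the clusters of similar gene sets seen in the snapshot.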
After automatic layout, you see the individual parts form separate modules. That gives you an intuitive idea: after automatic layout, you can clearly see patterns, with things that interact more closely grouping together. You'll see with the Reactome FI plug-in in the next lecture how you can create a network of your cancer genes and then run a clustering algorithm to find the modules. But of course, that's somewhat complementary: if you do a good layout, you will see the modules by eye, without even running a formal algorithm. You will see there are areas that tend to connect more densely and areas that are sparser. Visual attributes: Cytoscape has a number of controls inside it so that you can say, I give you a table of attributes for my genes that interact in a network, and then I map them to graphical features: color gradients, thickness of lines, labels, and so forth. And we'll see some examples.