Welcome, everybody. As I mentioned earlier, my name is Gary Bader. I'm going to give you this morning's introduction to the gene list workshop, tell you a little bit more about what interpreting gene lists and pathway and network analysis is, and cover some basics that will be useful for the rest of the workshop. Okay, so the basis of this workshop is that you usually have some kind of screen or genomic experiment that you've done. It's taken maybe a while to get to the data stage, you've gotten your data, and you're like, great, everything's working with this new technology that I'm using. But then you realize that you have thousands of genes that are differentially expressed or bound by your transcription factor, and now you're wondering, how am I going to sort through all of these thousands of genes? I definitely don't want to do it one by one. And so that's where this workshop comes in. At this stage we generally want to know what's interesting about these genes, and a very popular way of finding out what's interesting about a set of genes is to find out how they relate to biology that we already know: how the cell works, information about biological pathways, complexes, gene functions. Typically my lab, and Quaid and Lincoln later, are going to use gene expression data as the main type of data in our examples. But the analysis that we're going to tell you about, and that Wyeth will also tell you about, is very general to any kind of gene list in the particular areas that we're going to be talking about.
So one way that people get a gene list from gene expression data, for instance, is by ranking. They might have an experiment that they've done, and they compare the gene expression in the experiment to a control. This is the same for methylation or any other kind of differential experiment where you want to find out how differently genes are acting in your experiment compared to control. And so you might get a ranking, or a big list of genes that are differentially expressed, for instance. If you have lots of different experiments, lots of time points, lots of samples, you can cluster those samples and cluster the genes by how similar their expression pattern is across the samples, and that can also give you a gene list. So it's the list of genes that are acting similarly across your experiment. So there are different ways of getting this gene list, and we'll talk about more in a little bit. And then what we want to do with this question here is compare it to information about pathways that we already know, and ideally we find something interesting, like a causative gene that we're looking for. Okay, so pathway and network analysis is really any type of analysis that involves pathway or network information. It's most commonly applied to help interpret gene lists, and the most popular type is probably pathway enrichment analysis, but there are many others that are useful, and we're going to talk about more than just this type of analysis. It helps gain mechanistic insight into genomics data, and I like to think more and more about how pathway and network information can help you get some insight into the mechanistic details of how things are working, which is ideally telling you something about the cause. If you see some effect in your experiment, you see a whole bunch of genes changing, you might do some pathway analysis and see that those genes can be thought of in terms of pathway activity, and that pathways are changing.
And you're wondering, why are those pathways changing? You might be able to find some reason. For instance, a splicing factor or transcription factor is changing. That's sort of the promise. It's not always that easy to get there, but definitely the more mechanistic insight we can get into our data, the more likely we are to get access to something causative. In general, we're moving in pathway analysis more towards this causation idea, from the correlation idea which is most popular in genomics. So for instance, with the gene expression data that I showed you, when we cluster data it's all based on correlation between gene profiles. Or in GWAS, genome-wide association studies, we find genetic markers that are correlated with disease, and it doesn't tell us that a genetic marker is the mutation that's causing the disease. It says that there's some mutation nearby, maybe, that's somehow correlated with the phenotype. So these correlation approaches are very powerful, but there are a few issues with them. One, they sometimes generate a huge amount of data. And there's a little bit of a paradox here: the better genomic technology gets, the more data we get out of it, right? And if you're looking for association in genome-wide association studies, for instance, that can reduce the statistical power of the approach, because you have so much information in which you're trying to find that signal. If you keep on doing correlation tests, or various different types of statistical tests, you have to correct for that multiple testing, which we'll talk about tomorrow, I think. Also, there are some issues with signal-to-noise. If the important part, the signal in your data, is very rare, then you really need a lot of samples to find it. Otherwise, you won't be able to get access to that signal. So this...
I'll tell you a little bit more about using pathway information to help address these problems. So for instance, when we're thinking about human genetics, or gene variants that are associated with a phenotype, instead of thinking about lots of different individual variants we can think about an individual pathway and how it's affected, and we might see that lots of different mutations are affecting the pathway in different ways. When we think about the individual mutations, we don't see how they're connected, but when we think about the pathway, we see this really nice strong signal. I'll give you an example of how that works soon. Okay, so I also want to mention what comes before the analysis; we're not going to cover these things. We expect that the data from the genomics technology that you use is appropriately normalized, the background is adjusted, and the appropriate and relevant quality control measures have been taken. Each genomic technology, and every generation of genomic technology, has its own ways of doing this, and increasingly people are going to core facilities to get a lot of the genomics done, or it's becoming more of a service. And so however you're getting your data processed, the people who are processing your data, as the experts, usually recommend how to do this. How many people work with a core facility to get their data processed? How many people do all of the sequencing themselves, or gene expression microarray work themselves? Okay, a few people. So we kind of assume that there's knowledge already in place to help do this properly, in particular to use statistics that will increase signal and reduce noise, specifically for your type of experiment. And that's important. So I'm going to present an example of a successful pathway analysis that we were involved in a few years ago.
This was a study in collaboration with Steve Scherer, who works at the Hospital for Sick Children and who studies, among other things, the genetics of autism spectrum disorder. When I got involved in this project, I was surprised to learn how heritable it is: there's 60 to 90% concordance among identical twins, and not as much with siblings. Although, as is typically the case with complex diseases, not a lot of the heritability had been explained, only 5 to 15%, from rare single-gene disorders and chromosomal rearrangements. They had some clues at this time (this was published in 2010) that de novo rare copy number variants may play a role. And so they wanted to study these further, and they mapped copy number variants in around 900 cases, children who had severe autism spectrum disorder, and over a thousand controls. They used the Illumina Infinium 1 million single SNP chip at the time to do this. The SNP array gives you an intensity for a particular SNP, the DNA level at the germline level, across the genome. And then they processed this to look for, say, regions of the genome that have no intensity, no measured DNA, and those are regions of deletion, or regions of the DNA that have a much higher intensity than you expect, and those are amplification regions. So they used different algorithms to convert the SNPs to copy number variants, which are larger, and they found hundreds of them. They were only interested in the rare ones, so they removed all the big common ones, which there weren't that many of, and focused on ones that were present in less than 1%. And then they looked for genes in these copy number variant regions, either deleted or amplified, that were correlated with the phenotype of interest, autism spectrum disorder.
And they found a few genes that were significant, but they didn't have enough statistical power to get lots of genes, so really there were just a few genes. What we wanted to do was try to see, as I mentioned earlier, how pathways were associated with the phenotype instead of individual genes. And when we did this (Daniele Merico, who was a postdoc in my lab at the time, did this analysis), we found a rich set of pathways that were really strongly associated with this phenotype. All of the circles here represent pathways, and the connections represent pathway crosstalk; you'll learn how to make images like this in the later part of this workshop. All of these circles represent pathways that are strongly associated with autism spectrum disorder. Here's one that's focused on central nervous system development, which made sense. Some of these other ones are more signaling related. And what we found when we looked at some of these pathways, like regulation of cell proliferation, or these particular pathways in central nervous system development, like cell projection and organization, was that there wasn't really an individual mutation that was affecting the same gene over and over again across 900 cases. Instead, the pathway was affected over and over again, but it was different genes. And if we didn't use pathway information, we would never have been able to see that signal. So we get a fairly weak signal with those individual genes, because any one gene was affected in only one or two of the 900 samples. But when we look at the pathway, we see more than a dozen cases that are affected, or 20 cases that are affected, and that ends up being a very strong signal in this dataset. Okay, so that's a good example of pathway analysis that shows how to increase statistical power by getting more signal out of the data.
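As a rough illustration of the point above, here is a minimal Python sketch of why counting affected cases per pathway can show a stronger signal than counting per gene. The case names, gene names, and pathway memberships are all hypothetical toy values, not data from the study, and a real analysis would also compare against controls and test for significance.

```python
from collections import defaultdict

# Hypothetical toy data: the gene hit by a rare CNV in each case.
case_hits = {
    "case01": {"GENE_A"},
    "case02": {"GENE_B"},
    "case03": {"GENE_C"},
    "case04": {"GENE_A"},
    "case05": {"GENE_D"},
}

# Hypothetical pathway membership (pathway name -> member genes).
pathways = {
    "pathway_X": {"GENE_A", "GENE_B", "GENE_C"},
    "pathway_Y": {"GENE_D"},
}


def cases_per_gene(case_hits):
    """Count how many cases carry a hit in each individual gene."""
    cases = defaultdict(set)
    for case, genes in case_hits.items():
        for gene in genes:
            cases[gene].add(case)
    return {gene: len(found) for gene, found in cases.items()}


def cases_per_pathway(case_hits, pathways):
    """Count how many cases carry a hit in at least one member gene."""
    return {
        name: sum(1 for genes in case_hits.values() if genes & members)
        for name, members in pathways.items()
    }


print(cases_per_gene(case_hits))
print(cases_per_pathway(case_hits, pathways))
```

In this toy example no single gene is hit in more than two cases, while pathway_X is hit in four, which is the same shape of result described for the autism dataset.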
And it also helps in dealing with these large gene lists, because instead of having thousands of genes, we now have a few major functional themes here, like cell proliferation and GTPase/Ras signaling. Those are central themes in this data, it seems. Okay, so now I'm going to switch topics to a more basic introduction about gene lists, and that's mostly in preparation for some of the future lectures in the course. Where do gene lists come from? I mentioned some of this already. A lot of gene lists come from molecular profiling, using technology to measure all the mRNA transcripts, or protein levels for as many proteins as you can, in your sample of interest. And there are two major ways of thinking about this data. One is just identifying all of the genes or proteins. Well, this is changing now, but classically proteomics often just identified all the proteins, so you could say proteins were there or they weren't detected, although now people are getting more quantitative information. And so there are certain types of experiments that just give you a gene list without any values associated with it. We just know the genes are there, or not detected. We can also quantify levels of these molecules that we're studying. And so we get the gene list, we identify the genes, and we also have some values associated with them, like gene expression: we know whether the gene is highly expressed or not expressed. And then, as I mentioned earlier, we can rank or cluster these using standard biostatistics or bioinformatics methods, and these can also generate gene lists. We can also get gene lists from molecular interaction type assays.
So if we have a set of proteins that interacts with another protein, or a set of targets of a microRNA, or transcription factor binding sites and the genes that they might be connected to, this also defines a set of genes. Usually we don't have a ranking for these. Then there are genetic screens, where we do an siRNA screen, or association studies, like the genome-wide association studies that I mentioned, where people are using some kind of genetic markers, like single nucleotide polymorphisms, to see what changes in the genome are associated with the phenotype. Okay, so gene lists come from multiple different sources, and we're trying to explain how to interpret gene lists in general. Before you use the methods that we'll be talking about in this course, you need to understand what your gene list means, and some of the methods might be more or less applicable to some of these types of gene lists. Some gene lists are really about a biological system, so we might get a set of genes in our gene list that's associated with a protein complex, or physical interactions, or a pathway. We might find genes that are related because they have a similar function, like they're all protein kinases. They might be similar because they're present in the same part of the cell or tissue, or they might be present in the same region of the genome. And you do need to think about what your gene list means. Does it mean something about biochemical mechanisms? Or is it a set of genes on the genome that's not necessarily related by biochemical mechanism, where we're looking for one gene in that gene list? So step one is just to consider that. Generally most people already do this naturally, but just to make it clear: hopefully this is part of your experimental design, and you need to think about your question.
And then there are some frequently asked questions. The most basic one is, I have a bunch of genes; what types of genes are they? You can summarize the biological processes or other aspects of gene function using pathway analysis. People are sometimes interested in differential analysis, like tumor versus normal. You see what's different, and then you want to know what pathways are different between the samples, and that might tell you something about the pathways that are involved in disease development in that case. The next one is very popular and sometimes a little bit difficult to answer, but Wyeth is going to tell you about it this afternoon: people are very interested in finding a controller, like a master regulator, for their process of interest, a transcription factor or a microRNA that might explain why things are changing. You might also be interested in finding new pathways or pathway members. Many people who do an siRNA screen have a phenotype of interest and are trying to figure out which genes are involved in that phenotype. They are looking for any pathways, or any genes that might be part of the pathway, that lead to that phenotype. So they're interested in discovering new gene function there. And also, as I mentioned with GWAS and the autism study, correlating with a disease or phenotype. Okay, so there are a number of different pathway analysis methods that help answer these questions. In particular, during this workshop, we're going to focus on regulatory network analysis from Wyeth for the remainder of the day today, after I've finished. Regulatory network analysis helps find controllers, regulators. Tomorrow we're going to talk about pathway enrichment analysis, which helps summarize and compare, so summarization and differential analysis.
And network analysis, which we'll talk more about on day three, and a little bit tomorrow, is useful for predicting gene function, finding new members of a pathway, and identifying functional modules in your data, which might be new pathways. Okay, any questions so far? Any other gene lists that people work with? Yeah? I wondered, related to the gene list ideas: we come with our own gene list, but in another sense there are a lot of other gene lists out there that people sometimes want to... So maybe this is premature, but I wonder if we're going to get into that a little bit in terms of the kinds of resources we're going to... Yeah, so the question is, other people have generated interesting gene lists, and how do we compare our gene list to those? The pathway enrichment analysis that we'll talk about is really about that: it's comparing your gene list to other gene lists. Often those gene lists are sets of genes that are involved in a pathway, but they could be other people's gene lists that they've previously found, like the set of genes that's known to be associated with a phenotype. And some of the databases that we'll use have actually collected a lot of those in one place to make them conveniently accessible, but you can also collect your own gene lists and use some of the same statistics to do comparisons. Any other questions? Okay, so now we're switching topics to some basic background information that we've found is very useful to get everyone on the same page before we go into the detailed analysis in the rest of the workshop. One of the most common types of pathway analysis, which we'll be talking about mostly tomorrow, is pathway enrichment analysis, and here is what it needs. Basically, if you have a list of genes, you're looking to see if there's a pathway that's enriched in that list more than you would expect.
So say you have a thousand genes, and you see that a lot of those genes seem to be involved in the cell cycle. And you know that 1% of the genome is involved in the cell cycle. So you can look in your gene list and ask, is more than 1% involved in the cell cycle, or less than 1%? If 10% of your gene list is involved in the cell cycle, that's 10 times more than you would expect given just what you know about cell cycle genes in the genome. There are a number of types of analysis and a number of tools available to do that, which we'll talk about in detail. But the basic idea before using these tools is that you need some kind of gene list, and you need some kind of attributes of those genes, which could come from pathway databases, for instance; these are lists of pathways and the genes that they contain. So I'm going to talk about these briefly, because there are a few issues to consider here. Okay, so very briefly, gene and protein identifiers. The idea of an identifier is that it's some unique number or name for something. So you have a social security or social insurance number, or your gene has an Entrez Gene ID. These are numbers that ideally don't change over the life of the gene. And they're ideally unique, so that if I tell you I'm talking about gene 41232, you will understand what I mean, because that gene is always called by this number in the Entrez Gene database. If that's not the case, then it can lead to problems. So this is the basic idea, but there are a few issues with it in practice. One is that there are lots of different pathway databases and gene and protein information databases, and each one has its own way of numbering and naming genes.
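Going back to the cell cycle example at the start of this passage, the arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genome size of 20,000 genes is a round number chosen so that 1% gives 200 cell cycle genes, the function names are made up for this sketch, and the hypergeometric test shown is one common choice for this kind of comparison, not necessarily the statistic used by the tools covered later.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Observed fraction in the list (k/n) divided by the genome fraction (K/N)."""
    return (k * N) / (n * K)


def hypergeom_pvalue(k, n, K, N):
    """Chance of seeing k or more pathway genes in a random list of n genes,
    when K of the N genes in the genome belong to the pathway."""
    upper = min(n, K)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


# Assumed numbers: 20,000 genes in the genome, 1% (200) in the cell cycle,
# a list of 1,000 genes of which 10% (100) are cell cycle genes.
N, K, n, k = 20000, 200, 1000, 100

print("fold enrichment:", fold_enrichment(k, n, K, N))
print("p-value:", hypergeom_pvalue(k, n, K, N))
```

With these numbers the fold enrichment is 10, matching the "10 times more than you would expect" in the example, and the p-value says how surprising that overlap would be for a randomly drawn list of the same size.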
There are also different types of records for the gene, the DNA, the RNA; there might be more than one RNA coming from a DNA, and the same thing with proteins. So it's important to recognize the correct record type when you're talking about a gene list. Here's a gene list with human gene symbols. These really talk about a gene; they don't necessarily talk about splice variants of the gene. And for instance Entrez Gene, which is a major database from the US National Center for Biotechnology Information, doesn't store protein sequence or DNA sequence. It just has the idea of the gene described, and then it links to other databases that contain that information. The NCBI, which keeps a huge amount of useful information, has a very complicated set of databases with lots of connections between them. You don't have to know all of these, but it's useful to know that this is there behind the scenes when you use the websites, so you might actually get different identifiers depending on which database you're in. And here's a list of a whole bunch of different identifiers that are commonly seen. The ones that are recommended are in red, and even more recommended are in bold red. So Entrez Gene is very useful, and species-specific gene identifiers, like official human gene symbols or similar official symbols from model organism databases, are often useful to use. The others are just a bunch of examples to show how diverse they are. Okay, so there are lots of different identifiers. We've tried to select pathway and network analysis tools that recognize common identifiers, but one of the problems you might have is that you have some identifiers from the technology that you're using, and they're not read by the system. So for instance you've used Affymetrix
microarrays (these are getting less and less useful as RNA-seq becomes more popular, but there's still a lot of use of chips and specific platforms that have specific identifiers for genes). If you ever use a chip platform, there are often microarray platform-specific identifiers for genes, and you need to map these to some more standard set of gene identifiers, like human gene symbols. There are a number of services available for this, which I'll mention. But before I get there, I just want to mention quickly some of the challenges with identifiers. One of the issues that can happen, if you're working with gene identifiers that are not unique or that are changing over time, is that you might make mistakes when you use these services to map identifiers. If you're using protein names, for instance, which are very ambiguous, or gene synonyms, which are very ambiguous, you can tell the system you're looking for p53 and it converts it to some other gene, because there are multiple genes that have a synonym called p53. All of these names here are actually synonyms for what people often call p53 at the protein level. The gene symbol is TP53, so this is the one that you should use. Sometimes it's confusing, because these look similar, or people often call this gene by this name, but this is the official gene name. So the point is we should stick to official names if possible, because they reduce this ambiguity problem. Another problem that sometimes occurs is with Excel, which is pretty popular. How many people use Excel? Quite a few. Excel tries to be smart, but it's smart for accounting or something like that; it's not smart for biology. So sometimes you type in an important gene and it changes it to a date. It recognizes OCT4, an important transcription factor, and by default thinks it's October 4th. How many people have seen
this problem? This is not a problem if you're just typing one gene in, but if you paste 5,000 genes into Excel and number 4,302 gets changed, you might not realize it. Then you copy it out, and you may have lost that gene, because downstream software doesn't recognize it. I think I should have used Veronique's updated slides here, and I have a couple of tips later, but the way to avoid this is, when you paste information into Excel, to paste as text into text-formatted columns. There are also sometimes problems reaching 100% coverage. If I give you a list of a thousand genes and you paste it into a pathway or network analysis program, the program might only recognize 95% of them, or something like that. You might lose a few percent because I might have given you an old gene list from an older publication, and some of the names have now changed slightly, because they were genes that had systematic names and now people have given them an official gene symbol that is more human readable, for instance. So there are version issues. These days, especially if you're working with organisms that are well studied, you don't often have such a problem with this, but if you do come across this issue, then you should try to use multiple sources or multiple paths to map the information. So if 95% of your genes are mapping well, take the last 5% that aren't mapping, try them in different websites, and see if you can get information from other websites that might have more up-to-date versions. And just as a cautionary note, there was a paper that people in the lab were really excited about, and then they quickly realized there was a problem. This was a paper in Nature about a particular miRNA target. The authors had said HES1 is a miRNA target, but they hadn't realized that there are two HES1s:
one is homolog of ES1 and the other one is hairy and enhancer of split. They were both called HES1 in the database, and the authors did a search at some point that gave them the wrong one. They did all their experiments on the wrong one and then unfortunately had to retract their paper. So it does happen. There are a couple of different identifier mapping services, websites that you can go to, paste in a whole bunch of genes, and convert them to another set of identifiers. Here I'm converting gene symbols to Entrez Gene IDs using Synergizer, and there are a few others. Okay, so just some quick recommendations. For proteins and genes, we should try to use gene identifiers, either Entrez Gene IDs or the official gene symbol. You might need to use different websites to do the conversion if you don't get 100% conversion, or manually curate the missing mappings. And be careful of these Excel auto-conversions, so remember to format everything as text before pasting, or paste as text. This doesn't consider splice variants. In general, this course and the tools presented in it, and pathway and network analysis in general, are really oriented around genes. We just don't have enough information about the differences in function between different splice variants on a genome-wide scale to do a lot of this analysis at the splice variant level. It's a question that often gets asked, and in the future it will be better, when we have more high-resolution data coming out. Presumably all the RNA-seq data that everybody's producing now will give us a much more accurate picture of the splice variants that are present in different tissues and in different disease conditions, and once we know about those, people can do experiments and test the different functions of the different splice variants. Right now there's a lot of information known about them, but not on a genome-wide basis compared to just looking at the gene level.
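The identifier recommendations above (watch for Excel date conversions, and check how much of your list actually maps) can be sketched as a small Python check. This is an assumption-laden illustration: the function names, the date-like pattern, and the mapping table with placeholder IDs are all hypothetical, and a real mapping table would come from a service such as the ones mentioned above.

```python
import re

# Values that look like Excel date conversions, e.g. "4-Oct" or "Sep-01".
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-({_MONTHS})|({_MONTHS})-\d{{1,2}})$", re.IGNORECASE
)


def find_date_like(symbols):
    """Return entries that look like dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


def mapping_coverage(symbols, symbol_to_id):
    """Split a gene list into mapped and unmapped entries and report coverage."""
    mapped = {s: symbol_to_id[s] for s in symbols if s in symbol_to_id}
    unmapped = [s for s in symbols if s not in symbol_to_id]
    coverage = len(mapped) / len(symbols) if symbols else 0.0
    return mapped, unmapped, coverage


# Hypothetical mapping table with placeholder IDs (not real database IDs).
toy_table = {"TP53": "id-1", "HES1": "id-2"}
gene_list = ["TP53", "4-Oct", "HES1", "p53"]

print(find_date_like(gene_list))
mapped, unmapped, coverage = mapping_coverage(gene_list, toy_table)
print(unmapped, coverage)
```

The unmapped entries are the ones to retry in a second mapping service or to curate by hand, and anything flagged as date-like is worth tracing back to the original file before it went through a spreadsheet.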
Okay, so that was a quick introduction to genes and identifiers, and some of the caveats that you have to be aware of when you're working with large gene lists. Any questions so far? Okay, so next I'm going to talk about this part here, gene attributes, which are pathways or functions of genes, the second thing that's used for pathway enrichment analysis. There's actually a huge amount of information about gene function available in databases: the function of the gene, chromosome position, disease association, transcription factors that might regulate it, protein properties, whether there are any protein domains with a known function on the proteins, and interactions with other genes and proteins. A lot of this information can be used for pathway and network analysis. I'll talk first about function annotation, where we want to know something about the biological processes that the gene is involved in, the molecular function that the gene product, the protein, might carry out, or where the gene and the protein are expressed. So I'm going to talk about an important source of this information that is quite useful in a lot of the rest of the course. How many people know about Gene Ontology already? How many people have never heard of Gene Ontology? Most people have heard of it, and probably just over half know about it in more detail. I'm going to give an introduction to this, and even if you know about Gene Ontology, there might still be some useful points here. Gene Ontology is a project that has been running for more than a decade that tries to capture all of the concepts in biology in a big dictionary, so concepts like protein kinase, apoptosis, membrane. This dictionary has the word or the phrase and a definition, so it's quite useful as a source of biological definitions. But it's also an ontology. An ontology is a formal system for describing knowledge, and in this case, Gene Ontology, the ontology
aspect incorporates relationships between terms. In particular, the terms are structured hierarchically from most general to most specific. So here's a specific term at the bottom, B cell apoptosis, and it's related to some of these other terms. There are two major types of relationships, is-a and part-of. So B cell apoptosis is a type of apoptosis, which is a type of programmed cell death, and you go all the way up: it's a type of death, it's a biological process. And B cell apoptosis is also part of B cell homeostasis, so it's a component of B cell homeostasis. So those are the two major types of relationships. It's important to understand this hierarchical structure, because Gene Ontology is a major source of the pathway information that we use for pathway analysis, and it describes gene function at multiple levels of detail. So you might find that you get a very specific term coming back from the analysis, like B cell apoptosis, or you might find that you get something more general, like physiological process. That's one aspect that's important to note. Also, terms can have more than one parent or child. So here cell death is a type of death and it's also a type of cellular physiological process. The structure usually means that when you have a gene associated with one of these terms, you often have a lot of other terms automatically associated. You can also have multiple terms associated with your gene in different ways. That's sometimes challenging to work with, because you get lots of information coming back for an individual gene. As I kind of mentioned already, Gene Ontology covers cellular component, which is where in the cell things are expressed; molecular function, what the enzymatic function is, for instance; and biological process, which is really pathways in general. And there are two parts. There are the terms, which I explained. Terms are added by editors, professional people who spend most of their
time curating Gene Ontology, and people can also add terms by request. There are over 30,000 terms: 23,000 biological process terms, so there are 23,000 pathway terms with definitions; many fewer cellular locations, only about 3,000; and around 9,400 molecular functions. This is as of this year; I should have put 2013.
So the second part of Gene Ontology is the annotations. This is really the valuable part for us, because we'll be using these annotations mostly starting tomorrow. The annotations are where people take terms from the dictionary and link them to genes. So you might have a gene and you say, okay, this encodes a protein kinase. It's not just that you take the term and you link it; you also provide additional information, including the evidence for why you've linked it. These associations are sometimes known as annotations, or gene associations, or GO annotations. As I mentioned, there are multiple annotations per gene, and the other important thing to know is that some of these Gene Ontology annotations are created automatically without any human review. So there are a lot that are curated by scientists, which are very high quality but more time consuming to create. There are also some automated methods that predict gene function, and then somebody reviews the result to make sure that the system is working properly, so those are also good. And then there's electronic annotation, which is derived from automated processes that nobody looks at, nobody checks. I mean, somebody programmed it to make sure that it's working as best it can, but in the end it's mostly computational predictions. The accuracy varies. Sometimes computational predictions can be very accurate. So for instance, if you give me a protein, I can tell you with almost perfect confidence whether that protein will have a transmembrane domain or not, based on sequence analysis, and then if you were looking for
proteins that are expressed in the membrane, you would be able to predict that almost perfectly using computational methods. For other predictions the quality is lower, and the prediction accuracy might be closer to 70% or 60%. So in general people treat this part of the Gene Ontology annotations with caution, although they're useful if you don't have any manual annotation. This happens for genes that are less well studied, and also for organisms that haven't had a lot of study and where you have just sequenced the genome, for instance. Then all of the annotations are usually predicted by orthology from a nearby organism that has some curated Gene Ontology annotation. So the key point is to be aware of where this information is coming from.
These are the different evidence codes, just for your information. So all of these guys here are experimental evidence codes, and these are evidence codes that are traceable to the literature somehow, like a traceable author statement, and these would include the publication reference of where this is coming from. Then there are computational analysis evidence codes, like inferred from sequence or structural similarity, and then IEA is the big one; that's electronic annotation that is not reviewed. So just to know, when you see this IEA, it means inferred from electronic annotation.
I mentioned a little bit already that the species coverage of Gene Ontology is not perfect across all species; it's far from that. Major eukaryotic model organisms and human are covered quite well, several bacterial and parasite species are covered through various different databases, and there are always new species annotations in development. You can always look at the Gene Ontology website to see the current list. Here's a little bit of an older slide, but it gets the point across that there's variable coverage, and it's not just variable coverage of annotations, it's variable coverage of curated annotations versus electronic annotations.
So here are a number of genomes, organisms, and this Y-axis measures the percent of genes in each genome with annotations. So this guy here is Saccharomyces cerevisiae, one of the first sequenced eukaryotic genomes, and it has been studied for a long time. The Saccharomyces Genome Database has basically reviewed all the literature that's ever been published on budding yeast and assigned a Gene Ontology term manually to every gene. A lot of those genes are unknown, but they still assign an unknown term in that case, and they've actually done a literature search to check that it's unknown. Here's another one here, I can't remember the first name of this fumigatus, I used to know this. Sorry? So this is another fungus, right, that has some manual coverage from non-electronic annotation sources, and then most of it is predicted, so electronic annotation.
Okay, so this is a number of databases that contribute to the Gene Ontology project, just for your information, with some links. They also make available some other forms of Gene Ontology. So I mentioned that Gene Ontology can often have multiple annotations per gene, and this is usually good because it provides extra information about the function of the gene. But sometimes people want to do certain things with their functions. In particular, especially early on, people often wanted to create a pie chart that said, okay, I have a thousand genes from a proteomics study and here is their distribution of cellular locations in general. When you have thousands of Gene Ontology terms and lots of terms per gene, it's difficult to make a pie chart like this, because a lot of the terms are specific, and if you make a pie chart with 1,000 wedges it won't be very useful. So the Gene Ontology project has created a few of what they call slim versions of Gene Ontology, where they have created a reduced set of terms that are more general. There's a generic version, there's a plant version,
there's a yeast version, there are some other versions, and so sometimes you can get your Gene Ontology terms mapped to the slim version, and that might be useful for higher level summaries.
There are lots of tools for working with Gene Ontology that are available from the Gene Ontology website. One that I like, if you have a Gene Ontology term and you want to know more about it, is this QuickGO website, where you can type in the term and it will tell you lots of details: the definition of the term, how it's related to other terms, all the proteins that are annotated to that term, and other terms that commonly annotate these proteins, so it will find related terms.
There are also other ontologies. Gene Ontology captures three main aspects of function, as I mentioned: cellular component, or where things are expressed; biological process, or pathways; and molecular function. But there are other ontologies, like cell type ontologies, for instance, where you might be interested to know if a protein is expressed in different cell types. There are a few different types here, so this is just to let you know that Gene Ontology is not the only ontology; there are a number of others out there. For most of the other ontologies, people have made the ontology but they haven't done as complete a job as Gene Ontology of annotating the terms to all the genomes. Some of them have done a really good job. For instance, the Human Phenotype Ontology is getting to be much more useful these days. They have terms about everything related to human phenotypes, basically, so long fingers, short fingers; there are thousands of terms about human phenotypes, and then they also annotate genes to those terms.
Gene Ontology is a primary source of information about pathways, in addition to pathway databases, which we'll talk about more tomorrow, and there are lots of other properties available for
genes, as I mentioned. Fortunately we don't have to go to a different database for all of these; a lot of them are present in central genome browsers like Ensembl, Entrez Gene, and individual model organism databases. I noticed that there's quite some diversity in the organisms that people in the class study and also the technologies that people are using, so this is a good time to just mention that the course focuses on eukaryotic systems, because those are the most well developed, but the concepts in general are widely applicable. We might focus on specific databases, but during the lab or any time you can ask instructors or TAs to recommend additional software that might be compatible with your technology or organism.
Okay, so Ensembl is one of these genome browsers that makes available a lot of information about genes, and it's a great place to access information about your gene list. They have a tool called BioMart. How many people have used BioMart before? Okay, so a few. How many people have never heard of BioMart? Okay, so many more than Gene Ontology. I really like BioMart because it's a very powerful tool to get information about a list of genes. You can upload your list of genes and you can ask for all sorts of information, like Gene Ontology terms and protein domains. You can also get DNA and protein sequences, variations, mutations that might be associated, and homologues in many other organisms, so it's a good way of converting your gene lists from one species to another, for instance. Here's a little workflow for the way this works. I don't think the website is very intuitive the first time you use it, but once you get the hang of it, after just following through this once, it's much easier the second time. So you need to select your genes database and your genome, and then once you've done that you can select your filters. Ensembl BioMart starts
with the idea that you have the whole genome, and you have to tell it what part of the genome you're interested in. So one way you can select filters is to say, give me all the genes that match a Gene Ontology term, and it will give you just those genes, so you'll get a smaller subset of the genome. Another way you can do that is by providing your gene list, and it will give you just the genes in the genome matching your gene list, so that's usually what's useful for us. And then we can select the attributes to download, and this is where you go shopping. It's called BioMart: you just pick a whole bunch of information that you want and you download it, and it can be downloaded in spreadsheet format or other formats. Yeah, so that's a great point. For people who are doing scripting, there are these nice scripting interfaces, and you don't even have to figure out how to script it. You can just go to the website, make your query, and then get the code that would run that for you, at least the query part.
Okay, so just to summarize, there are many attributes available in databases. I talked about Gene Ontology, but we'll talk a lot more about pathway databases tomorrow and other databases throughout the course. Ensembl and Entrez Gene are good sources for a lot of this information, and in particular Ensembl BioMart is fairly handy. This just summarizes some of the issues that come up with Gene Ontology, just because it's complicated, but it's not that complicated once you understand the structure.
Okay, so that's basically it for the morning session. I think we're supposed to end at 10:15, right? Yeah. So in the past we've had a lab related to this first part of the lecture, but in response to feedback that we got from previous courses, we don't have the lab anymore for this section, because people thought it was too easy and they'd rather use the time for the more interesting analysis in later courses. So we
still include a lab if you're interested. You can try out these tools, and this is just a little set of things to try out with your gene list, for instance. I can't remember if this gene list is now on the wiki. Okay, so yeah, this was put up on the wiki just to try it out with a small set of genes, so you can do this on your own time and ask questions about this. Okay, any other questions?
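For anyone trying things out on their own time, here is a small Python sketch of the hierarchy point made earlier in the session: a gene annotated to a specific term is implicitly associated with all of that term's ancestors. The parent links below are only the ones mentioned in the talk for the B-cell apoptosis example, so this is a toy subset rather than real Gene Ontology data, and the function names are hypothetical.

```python
# Parent links for a small slice of the hierarchy, as described in the lecture.
# Each term maps to (relationship, parent) pairs. Toy subset, not real GO data.
PARENTS = {
    "B-cell apoptosis": [("is_a", "apoptosis"), ("part_of", "B-cell homeostasis")],
    "apoptosis": [("is_a", "programmed cell death")],
    "programmed cell death": [("is_a", "cell death")],
    "cell death": [("is_a", "death"), ("is_a", "cellular physiological process")],
    "death": [("is_a", "biological process")],
}


def ancestors(term):
    """Return every term reachable from `term` by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def implied_terms(direct_terms):
    """Return the direct terms plus all of their ancestors."""
    result = set(direct_terms)
    for term in direct_terms:
        result |= ancestors(term)
    return result


# A gene annotated only to the most specific term picks up all the general ones.
terms = implied_terms({"B-cell apoptosis"})
print(sorted(terms))
```

This is why a single gene can return many terms from an analysis: one specific annotation brings along every more general term above it, through both kinds of relationship.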