For the next hour or so, I'm going to talk to you about gene lists and networks at an introductory level. I know some people here already have some knowledge in this area, and hopefully you will also learn something new, but there are also a lot of people who have not seen certain things, so this will hopefully get through a lot of the basics. I'll provide some general introduction and motivation for why we want to interpret gene lists. Then I'll talk about where to get information about genes from lots of different online sources, and we've recommended particular ones that we think are easy to use. Then I'll talk about how to deal with large gene lists, and if you haven't dealt with these, we'll provide some useful information there.
In the second part, I'll give you an introduction to network analysis software called Cytoscape that we, and many other people, are involved in developing. There's a lot to Cytoscape. Most of the analysis information will be presented tomorrow afternoon, although some of the various modules will mention it, but we felt it would be good to introduce the software at the beginning of the workshop. Because it's a complicated piece of software with a lot of functionality, you can try it out during the labs and ask questions starting from the beginning.
Okay, so typically the problem with interpreting gene lists, and probably one of the reasons why some of you are here, is that you've done a screen or a gene expression experiment, you've ranked it and clustered it, and there's a reasonable number of tools that allow you to do that quite easily. And then you have a thousand genes and the question is, now what? So one of the big problems with genomics information is how people use it.
You can query 20,000 genes in the human genome, then you rank them and find the top three most differentially expressed genes. So you use those three out of 20,000 and throw the rest of the information away. There's a lot of information that you're measuring there, and it would be great if you could use it all. To use it all you have to deal with large gene lists, and that takes some knowledge of how to do it. So that's the typical situation.
Typically what people do, as I mentioned, is rank or cluster data. We won't really be talking about how to rank gene expression data or how to cluster it. There is the statistical analysis course that Francis mentioned. A lot of people work with biostatisticians, and a gene expression core facility or other type of facility will hopefully have the typical software for that set up. We're familiar with that, so you can ask us questions about it if you're interested. But in this course we assume that your data has already been processed as part of the data collection.
So maybe you cluster the data. This particular example looks at a cluster of genes downregulated in heart disease. Most people will look at all the different text descriptions associated with each of these genes. If people haven't seen this, and most people have, each row here represents a gene, the columns represent different gene expression experiments, and the intensity of the square represents the differential expression of the gene, blue in this case being down. So most people will basically manually look through this text and see if they can link the information they see to the textbook knowledge that all biologists have in their heads, which you've worked so long to put in there. And you might see particular pathways coming up, like, oh, this is interesting.
There's fatty acid degradation, a few genes from there. So maybe fatty acid degradation is downregulated here. But there are a lot of other pathways and processes here. And how do you know that fatty acid degradation is significant? Maybe almost every gene in your experiment is involved in fatty acid degradation. So that's where computational tools come in. They help you manage all of this prior knowledge about cellular processes. We have tons of information about pathways, protein interactions, and different types of functional relationships between genes. Ideally you'd be able to automatically link all the prior knowledge that we have to your results and then, using analysis tools, find interesting patterns. That would make it easier to interpret the gene list and the gene expression data. So the goal of this course is to cover a number of areas of these types of analysis tools.
Just as a quick summary of different types of gene lists: I think many people here are using gene expression data, and that's the example that I used. But gene lists come from a number of different sources. People might identify genes without quantifying their expression level using proteomics. You may even have protein quantitation using proteomics. Ranking and clustering using biostatistics is the typical first step for dealing with that information, as I mentioned. You may have a set of genes that came out of an interaction screen, so you did a yeast two-hybrid experiment or a chromatin immunoprecipitation experiment. For chromatin immunoprecipitation, you found a whole set of transcription factor binding sites for a transcription factor of interest. And all the genes that are upstream of those binding sites form a list.
And you're interested to see if there are any pathways that are targeted by the transcription factor. A lot of people these days are doing genetic screens with RNAi, so knocking down genes and screening for a phenotype. If I knock down a thousand or a few thousand genes and see whether the cells are able to grow in a particular medium or not, the genes that affect that phenotype are all on a list and could be involved in a cellular process. And also association studies: a lot of genomics technology is being used to identify SNPs and copy number variants on a large scale. People associate hundreds of SNPs or hundreds of genetic markers with a particular disease, and if those are connected with genes then that's a gene list. So those are, just as examples, the types of gene lists that the tools we'll be discussing can analyze. There may be a lot of other examples and we can chat about that during the lab.
Okay, so what do these gene lists mean? That's an important question. A lot of people assume that the genes coming out of their screen, especially from gene expression, relate to biological processes. So we think that biological processes are being regulated somehow and we have a readout of that. That's something we'll focus on. But you might also find genes in your list that are linked just because they have the same function. Maybe many of the genes in your list are protein kinases, and that's not really related to one particular process. Or they share a similar cell or tissue location, or, for genetic association studies, chromosome location: maybe there's a whole linkage group that you're trying to study to see how it's related to other genes in your system.
So you obviously have to think about what types of genes you expect to come out of your screen, and ideally have some particular question that you're interested in answering. You should know in advance whether you want to summarize biological processes or other aspects of gene function. Maybe you want to find a controller for a process; maybe you're interested in finding transcription factors. That might be useful for a follow-up experiment where you perturb the transcription factor and see the response in a cell line or something, although it might not be reasonable to do that kind of thing for patient tissue samples. Your goal might be to find new pathways or new pathway members, or to discover new gene function. Maybe there's a particular gene that you're interested in and you want to know more about what it does in the system that you're studying, or its correlation to a disease or phenotype. And then finally, as a couple of people mentioned, you might only be interested in differential analysis. So what's different between two samples? I have a cancer sample that doesn't metastasize and one that does. What pathways are different between these? These are all different types of questions that have to be in your mind when you're doing this analysis. It's important to define that.
The tools that we'll go over can answer all of these questions in different ways. Gene set analysis helps summarize information in large gene lists, so it's more of a descriptive summary of what's in the data, which can then help you interpret it. Gene regulatory network analysis, which Wyeth is going to talk about, will help you find controllers like transcription factors that might be controlling processes. Pathway and network analysis might help you find new pathways, and we'll discuss gene function prediction as well. Does that sound good?
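To make the earlier question concrete (how do you know that a pathway like fatty acid degradation is significant in your list, rather than something almost every gene is annotated with?), here is a minimal sketch of one common way to ask it, a hypergeometric over-representation test. This is only an illustration and not necessarily what the tools in this course do internally. The function name `overrepresentation_p` and all of the gene counts are made up for the example.

```python
from math import comb


def overrepresentation_p(k, n, K, N):
    """Probability of seeing k or more annotated genes in a list of n genes,
    when K of the N measured genes carry the annotation (hypergeometric tail)."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# All numbers below are made up for illustration.
N = 20000   # genes measured in the experiment
n = 1000    # genes in the downregulated list

# Case 1: a small pathway (100 genes) with 15 of them in the list,
# where chance alone would predict about 5.
p_small = overrepresentation_p(15, n, 100, N)

# Case 2: an annotation that almost every gene has (19,000 genes),
# with 950 of them in the list, which is about what chance predicts.
p_broad = overrepresentation_p(950, n, 19000, N)

print(f"small pathway:    p = {p_small:.2g}")
print(f"broad annotation: p = {p_broad:.2g}")
```

The point of the two cases is that the raw count of pathway genes in a list says little on its own. What matters is how that count compares with what you would expect given how many genes carry the annotation overall.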
Okay, is that what you are interested in? Okay, but first I just wanted to go over some gene list basics and talk about the different types of attributes that ideally you would want to get about genes; this is often a starting point for most people. Any questions before we go on?
Okay, so as many of you are aware, given a gene, any biologist can find tons of information about it: different types of known functions, position on a chromosome, whether it's associated with a disease or not, properties of the DNA like the gene structure, any SNPs, a whole range of other properties, protein domains, post-translational modification sites, interactions with other genes or proteins; the list goes on. We are not really going to cover what all of these mean in biology. Mostly I just wanted to go over some of the bioinformatics methods that people use to collect this information and practical places to get it easily. There is a lot of information out there, and there are some sites and tools that give you access to it quite readily, so we'll just cover those.
So the first thing I wanted to go over is Gene Ontology. How many people here are familiar with Gene Ontology? How many people use it on a regular basis and really know a lot about it? Gene Ontology, for those people who don't know, is like a biological dictionary for gene function. It's a controlled vocabulary, which just means it's a set of terms that people have agreed upon as a standard way to represent gene function. So there are terms like protein kinase or apoptosis or membrane. And the word ontology, if you're not familiar with it, basically just means a formal system for describing knowledge. Interestingly, all these terms have definitions associated with them, so it's also a dictionary.
And if you're interested to know what anachoresis is or something like that, you can go and look at the definition in Gene Ontology, and it's usually pretty good. So it's like a dictionary in that it has definitions associated with each term, but it's unlike a dictionary in that the terms are related to each other. There are relationships like: apoptosis is a type of programmed cell death, which is a type of cell death, which is a type of death, which is a physiological process. Or B-cell apoptosis is part of B-cell homeostasis. So there are these relationships like is-a and part-of, and there are a couple of other relationships in Gene Ontology that relate terms to each other. In general, these relationships are set up so that more general terms are at the top of the hierarchy and more specific terms are at the bottom. This is a way of organizing all of these terms so that you can understand how detailed or how general a term is.
Do you have a question? Yeah, feel free to ask questions.
How did you build this... I think, for example, anti-apoptosis, or things that are related to the same thing that they're really involved in...
They would have a term called anti-apoptosis, but if it's anti-apoptosis, it wouldn't relate directly to apoptosis. You would have a separate term for anti-apoptosis. There's no relationship in Gene Ontology that says antonym, like the reverse or the opposite thing. So it would just be a separate term, and you have terms like negative regulation of apoptosis, things like that. They deal with it by adding additional terms. It would probably be a parallel child of some related physiological process. Something like control of cell death would maybe be a parent term.
You'd have to look to see what it actually is. Importantly, terms can have more than one parent or child, so it's not a strict hierarchy; it's not a real tree. That's important to understand because it actually causes issues. It's not as simple as you'd maybe like it to be.
So this is an example. The cell label sort of migrated over here; it's supposed to be here. You have cell as a general term. Membrane is part of the cell, and chloroplast could be part of the cell. Chloroplast membrane is a type of membrane and is part of the chloroplast. Mitochondrial membrane is a type of membrane. So you get the idea of how you go from general to more specific. This is a good example because it includes something that people recognize as part of plants, chloroplasts. GO itself is species independent. It does have some terms that are specific to a group of organisms, but the higher-level terms generally are not. So they don't try to be species specific in Gene Ontology; it's meant to be general.
Gene Ontology covers three main aspects: cellular component, which is parts of the cell; molecular function, like enzymatic activity; and biological process, which is typically pathways. It's fairly important to understand the distinction between those. There are thousands of terms; actually, right now there are 27,000 terms as of a couple of days ago. Most of them have definitions. There are many pathway terms and thousands of cellular components and molecular functions. These terms have been added by curators or editors at the European Bioinformatics Institute and by database groups over quite a number of years. And they're adding more: every year a couple thousand more terms are added to Gene Ontology. You can also contribute terms yourself: if there are no terms for the chemokine pathway that you're interested in, you could suggest that to Gene Ontology.
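Because a term can have more than one parent, the structure is a graph rather than a tree. Here is a minimal sketch of that idea using only the cell, membrane, and chloroplast example from the talk. The dictionary layout and the function name `ancestors` are assumptions made for illustration; this is not how Gene Ontology files are actually stored.

```python
# Toy graph from the example: child term -> list of (relationship, parent term).
# The layout is an illustrative assumption, not the real Gene Ontology file format.
PARENTS = {
    "cell": [],
    "membrane": [("part_of", "cell")],
    "chloroplast": [("part_of", "cell")],
    "chloroplast membrane": [("is_a", "membrane"), ("part_of", "chloroplast")],
    "mitochondrial membrane": [("is_a", "membrane")],
}


def ancestors(term, parents=PARENTS):
    """Collect every more general term reachable from `term`, following all parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# "chloroplast membrane" has two parents, so it reaches "cell" by two routes.
print(sorted(ancestors("chloroplast membrane")))
```

The `seen` set is what keeps a term like cell from being counted twice when it is reachable through both membrane and chloroplast, which is one of the practical issues that comes from the structure not being a tree.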
Experts in specific biological fields also help with major redevelopment of certain branches of Gene Ontology. So this is an ongoing effort by lots of different people, an international collaboration. The practical consequence is that Gene Ontology changes and is updated all the time. So if you reanalyze your data set from two years ago with the current Gene Ontology, you may find some new things.
The Gene Ontology project itself is mostly focused on defining this dictionary and hierarchy, and then people use that to annotate genes. So Gene Ontology is two parts: the terms and the annotations. The annotations link the terms to the genes. You can have multiple annotations per gene, so a given gene might have 50 different terms associated with it; genes can have multiple functions, so that makes sense. These annotations are often created by trained curators, meaning that they're manually assigned, but there are also some annotations that are created automatically, and practically it's very important to understand the distinction. Because there are so many annotations that you might get from Gene Ontology, they have different quality levels, and you may not be interested in including some of them in your analysis. The ones most people are ideally interested in are the manual annotations by trained scientific curators who basically said, okay, this is p53, it's involved in whatever processes. These are typically higher quality annotations, but they're smaller in number because it's time-consuming to manually read the literature and add these terms to each gene across many organisms. So they're supplemented with electronic annotation. And some of this annotation is derived without human validation.
So some of the electronic annotation is actually reviewed by people and some of it is not; it's just put into the annotations. Some computational predictions are actually quite accurate, and you don't really need to review them. Prediction of transmembrane domains or signal peptides, for example, is actually 98% accurate for many systems. But a lot of others that are just predicted based on sequence similarity, for instance, are not as accurate in all cases, and it's generally assumed that they're lower quality than the manual annotation. So, key point: be aware of the annotation origin.
The way to be aware of that is to look at the evidence type. Each Gene Ontology annotation is associated with the evidence that was used to assign that term to the gene. There are a lot of different evidence codes, and you can read through them in this list. For instance, traceable author statement from a paper is a particular type of evidence, and it would be associated with a PubMed ID, so you can actually go back to the paper and see what the statement was. Some of it is inferred from sequence or structural similarity. The important thing here is that all of these, even if they're computationally derived, are manually reviewed. The one that's not manually reviewed is IEA, inferred from electronic annotation. Increasingly, people are recommending that you do your analysis both ways: include electronic annotation, then exclude it, and look at the results, because a lot of electronic annotation is not correct. It may be reviewed in the future and cleaned up. The reason these codes are here is so that you can filter the information.
The other major concern for people is species coverage. In Gene Ontology itself, all the terms are species independent.
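As a rough sketch of what filtering on evidence codes looks like in practice, the example below runs the same annotation list with and without IEA. The evidence codes (TAS, ISS, IEA) are the ones mentioned in the talk; the gene names, term names, the `Annotation` record, and the function `filter_annotations` are all made up for illustration and do not come from a real annotation file.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    gene: str
    term: str
    evidence: str  # GO evidence code, e.g. TAS, ISS, IEA


# Made-up annotations; gene and term names are placeholders.
ANNOTATIONS = [
    Annotation("geneA", "apoptosis", "TAS"),           # traceable author statement
    Annotation("geneA", "membrane", "IEA"),            # electronic, not reviewed
    Annotation("geneB", "protein kinase", "ISS"),      # sequence similarity, reviewed
    Annotation("geneC", "protein kinase", "IEA"),
]


def filter_annotations(annotations, exclude=frozenset({"IEA"})):
    """Keep only annotations whose evidence code is not in `exclude`."""
    return [a for a in annotations if a.evidence not in exclude]


with_iea = filter_annotations(ANNOTATIONS, exclude=frozenset())
without_iea = filter_annotations(ANNOTATIONS)

# Genes that lose all of their annotations once IEA is removed.
lost = {a.gene for a in with_iea} - {a.gene for a in without_iea}
print(len(with_iea), len(without_iea), sorted(lost))
```

Comparing the two runs, including which genes drop out entirely, gives a sense of how much of a result depends on annotation that nobody has reviewed.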
While the terms are species independent, the annotations are specific to each organism. So you can download the human annotation, or the E. coli annotation, or the mouse annotation. All major eukaryotic model organisms are covered reasonably well. Human is covered by the main Gene Ontology annotation group, at the UniProt database. There are a number of bacterial and parasite species, and new species are constantly being added. I can bring up the website, geneontology.org, later in the lab to show you the list of organisms. But the important thing to note is that there's variable coverage between organisms.
This is a slightly older plot, but the general idea is still the same today. It represents the coverage of genes with annotation, either electronic or non-electronic. You can see that certain model organisms that have been studied using genomic techniques for a long time, like budding yeast, have 100% coverage. Every gene in yeast has been manually looked at and assigned a term. Some of the terms say unknown function. Unknown function is not really a term anymore, but there's something similar, just a very general term like biological process. But they've at least manually checked that nobody knows about that gene, and they're keeping it up to date. So, as XP mentioned, this hides the fact that some of the terms associated with genes are fairly general and may not really tell you much about the function of the gene. Like, oh, it's involved in a biological process; well, surprise, surprise. But certain species are better than others. For instance, I do some work with C. elegans sometimes, and I'm surprised at the low level of coverage of C. elegans in Gene Ontology.
It makes it more difficult to use, whereas human, mouse, yeast, and certain other well-studied species are much better annotated. Some of the tools that we'll talk about later depend on having Gene Ontology annotations. If the Gene Ontology annotations aren't that great for the species that you're studying, you either have to work to improve that in the community, or understand that it's worth trying these tools out but they may not work as well as in the human case, for instance. There are a number of databases that contribute to this, and you can look at this on the Gene Ontology website as well.
A couple of other useful tips for Gene Ontology. There are so many thousands of terms that some people have difficulty using Gene Ontology, especially for high-level summaries. Often you might want a pie chart in your paper that just says, I have a thousand genes, and I want to give a sense of where they're located in the cell. For these types of applications, there's something called GO Slim, which is a reduced set of GO terms that's official and made available on the Gene Ontology site. There are a few different ones: a generic one, a plant one, a yeast one, and you can sometimes use these for high-level views. All of the resources in Gene Ontology, all of the terms and annotations, are freely available. Anybody can use them without restrictions. There are also dozens of free software tools available; you can go to the site, and we'll cover some of them today.
A good place to access Gene Ontology, one of my favorite places at least because it's relatively fast, is QuickGO at the European Bioinformatics Institute. You can search for GO terms and get a lot of information, including charts of the relationships of the terms to other terms.
If you click through here, you can get statistics about how often terms are used in different organisms. There's a large amount of information. There are a lot of sites out there that do this; you may have heard of AmiGO and other ones. All of them use the same information in the background, but QuickGO, I think, is easy to use and relatively fast. There are also a lot of other ontologies. The success of Gene Ontology has led lots of other people to create similar dictionaries for cell type or anatomical part or other things like that. There's a list of quite a few, probably almost 100 of them, that are available. Those might be useful for other types of projects that you have.
Okay, so I spent a fair amount of time on Gene Ontology, mainly because it's used by so many people and by so many of the tools that we'll be using. It's an excellent source of gene function annotation that is computer readable and freely accessible, and a lot of people are contributing to it. I'm not going to spend as much time on all of the other types of attributes that are available for genes and proteins, but mostly these can all be retrieved from just a few sources. We're going to talk about one, Ensembl BioMart, which is mostly for eukaryotes, although now it's starting to branch out into plants, viruses, and bacteria. So every organism will be covered at some level with this tool. Entrez Gene, which most people are aware of, is probably the best general resource on genes that exists. And if you study a particular organism, often there's a community built up around that organism that has created a database, called a model organism database, like the one for yeast, yeastgenome.org, or the one for mouse, that most people working in that field would be familiar with. Those are all excellent sources of this information, a one-stop shop. There are lots of others.
If these resources, when we go through them, don't cover the particular organism or study that you're doing, we can discuss it and try to find others. So we'll go through this in the lab. How many people have used Ensembl BioMart? How many people have used Ensembl, the website, as a genome browser? So a few. Okay, that's great.
Ensembl BioMart gives convenient access to information about gene lists. You can submit your gene list and get information about Gene Ontology, sequences, SNPs, structures, homologs, all the different types of identifiers that might be associated with a gene, chromosomal locations, you name it. Most information associated with a genome is available. When you initially look at the website, I found that it's not that intuitive for first-time users. But you just have to get over the little initial hump of understanding how it works, and then it's very easy to use after that; it's just like shopping for information.
The first step is selecting a genome. Most people would select Ensembl genes, and once you select Ensembl genes, you select your species of interest, and then you can define filters. There's a button that says Filters, and filters allow you to narrow down from the entire genome to some subset that you're interested in. There are different ways of doing that. You can say, I only want genes on chromosome one, so there's a way to filter by region. Or I only want genes that have specific Gene Ontology terms. Or I only want genes that have these types of identifiers: I have a list of gene names, for instance, and I only want information about those gene names. Or genes that have membrane domains or other specific domains, or lots of different things. Once you define your filters, you can then select attributes to download, and that's step two;
that's where you say, okay, this is the information I want about the gene list. We'll cover that a little bit more later, but that's the general idea.
Okay, so just as a summary: we've mentioned that there are lots of different gene attributes available in databases. Gene Ontology is a great place for most information about gene function. It's not the only place, but it's probably the most convenient, and for some organisms it's definitely the most comprehensive. We talked about how it's a classification system with terms and annotations, and how GO can be simplified and used. We'll go over a lot more uses of this information later. And many other gene attributes are available from Ensembl BioMart. Any questions about that?
You talked about the species which are covered, or which can be covered. How about a species on which I did high-throughput sequencing, for example, which is a non-model species? Is there any way to electronically infer annotations? What we also tried is to BLAST it against a model organism genome and then look for equivalent annotations. What's the worth of this? Is there any sense or value in this kind of analysis?
So if you have a new genome sequence, and this is increasingly the case because it's getting easier and easier to get new genome sequences, the first step is exactly what you did: do a sequence similarity search against every other known protein and known gene, and transfer the annotation if it's over a certain sequence similarity. Most genome sequencing centers have automatic pipelines for this these days, but I'm not sure how it's working now, or how that fits in with the genome annotation pipeline. But typically, if the Sanger Centre or TIGR sequenced the genome, they do exactly what you said.
They just use BLAST or other existing software to do that. They also map protein interaction networks and functional association networks from other species, via those homology links, to the new genome and try to use that for network-based function prediction. We'll talk about how network-based function prediction works tomorrow afternoon. But the gene function transfer is always going to be by sequence similarity in some way. So if that's all you have, then you have to use it. And it's reasonably accurate for many cases. It's just that there are some cases where it's definitely going to fail. Certain enzymes, for instance, have almost the same sequence, but a few residues in the active site are different, so they're going to bind a different substrate. That's very difficult for these methods to analyze. So it's good for more of a general sense of the function, not for specific enzyme functions or other things like that. You may know it's a kinase, but you don't know what kind of kinase it is.
QuickGO is mostly for looking up information about GO terms. You can get information about annotation, but I think it's most useful for GO terms. And BioMart, I think, is better for gene annotation.
If you do this BLAST thing, you usually get a mixed list, right? Say I sequenced an insect; then I'd get all kinds of hits from Drosophila, but also from Bombyx and other species. Is it fine to have a mixed list for the GO annotations, or should it be only one species?
No, it can be a mixed list. Typically people use all available information, so they'll just BLAST the translated proteins against all known proteins; maybe there are two million known proteins or something.
Then with the mixed list, you want to make sure you only have one annotation per gene, though. So you've got to filter, so if you've got BLAST hits in multiple...
you've got BLAST hits in multiple organisms for the same gene, and that would bias your results. You want one annotation, so you've got to choose one of the hits, which is usually the highest... We can chat about that more. Yeah?
Which of the GO... and that would be like accessing the... Oh, actually, we can check that out. This is a newer feature that I don't use so often, so we can look at that during the lab. Okay, so any other questions about annotation?
Okay, so that covers getting information about your gene list, which is a pretty basic first step. Another quite basic first step is how to deal with all of the different names and identifiers for a gene. I'm sure anybody who's worked with a gene list has gotten stuck on this problem. You have your Affymetrix IDs, and you want to get the gene symbols for them and then convert those to Entrez Gene IDs because this other tool only recognizes Entrez Gene IDs, or something like that. This is a huge headache for everybody, and people are working to solve it. You do have to deal with this and it's a bit of a pain, but usually you only have to deal with it once for each experiment.
If you do deal with it and you get identifiers from, say, DAVID or some other software, and there's a difference from what is in the Affymetrix annotation, which one is more reliable?
So that's a hard point. For each question like that, like, is DAVID or Affymetrix more reliable, it's a different answer, right? In that particular case, I would trust Affymetrix because they're the source, and DAVID probably just downloads from Affymetrix. Maybe it's conflicting because DAVID is a little bit out of date; maybe they haven't downloaded the latest file or something like that. There could be many reasons for that.
You have to do some investigative work to figure out why, but it's always good to get it from the source if possible. So if you know what the source is, then that's good. If you know, by reading their paper or talking to them, that a resource like DAVID does more than just copy the Affymetrix data, that they also supplement it with additional things and correct errors and so on, then maybe it's value-added information. I don't think DAVID does in that particular case, but there are resources that try to fix problems, mostly just for Affymetrix. We won't be going through any particular case, but I'll talk about some general tools that will help you in many cases.

So as many people know, identifiers are used to track things. Ideally, if you want to track something like a protein or a gene in a database, the thing you use to track it is some unique and stable name, right? Social insurance numbers, for instance, are unique and stable. Entrez Gene IDs for genes are definitely unique and fairly stable; I think they almost never change. But a lot of identifiers for genes are not always unique and stable. And because gene and protein information is stored in many databases, every database assigns its own tracking number, and that causes a lot of confusion if you have different tracking numbers from different databases. So basically the problem is that genes have many identifiers. There are different records for genes, DNA, RNA, protein, structure, and it's important to recognize that not all the databases that talk about a gene are talking about the same thing. Entrez Gene, for instance, is talking about genes; it doesn't have the protein sequence as part of it. If you download Entrez Gene, it doesn't tell you what the protein sequence is.
Instead, there's a link from Entrez Gene to the protein sequence database, which has its own identifier. And the relationship is that a gene can give rise to multiple proteins if you have alternative splicing, so it's not necessarily one gene, one protein. So you have to understand not only the different types of identifiers, but also the relationships and the type of data that you're dealing with. That's just something to be aware of. Most of you probably know that. This is an example of all of the links between databases at the NCBI, the people who maintain Entrez Gene, as part of the National Library of Medicine. And you can see how complicated it is. All these different circles are databases, and the lines between them are pointers from one database to another. And each one uses its own tracking number.

Here are some examples of different IDs that people use. So for gene you have Ensembl and Entrez Gene, then RNA, protein, and various species-specific ones. Many people will recognize these. There are even tracking numbers for annotations like SNPs, disease associations or domains. Some are specific to experimental platforms. Just for your reference in your notes, we've highlighted and underlined the ones that we personally recommend. These are the ones that are most likely to be stable and unique and not change much over time. Whereas some of these other ones, even HUGO human gene names, change from year to year. It's very annoying, because you'll have problems like what Raman mentioned. So for specific cases, the lab is an opportunity for you to try the software that we'll recommend and also the software that you're used to using. There are lots of different places to handle this information. And if you have specific issues, we, all of the instructors, can try to answer them.
So ideally what you typically want to do with these IDs is map them from one type to another. That's the practical thing that you want to do when you have gene expression data and you want to map it to gene function annotation. So I mentioned that conversion is a headache. There are basically four main things that you want to do with this information. One is searching for a favorite gene name. Ideally you would have access to all the known gene names so that your search is more likely to get an answer. If you don't have all those gene name synonyms, then when you search for your favorite gene name, even though your gene is in the system, you may not hit it. So that's an important type of use. Second, you want to link out to related resources. This is getting information about the gene list: I have a set of genes, I want to link to related resources, I want to get sequence information and domain information, et cetera. Third is identifier translation. This typically happens when you have a gene expression or genomics experiment and there are particular tracking numbers for all the different measurements on the platform you're using. For gene expression, you'd have Affymetrix or Illumina or other ones. And so you want to translate those to recognizable gene symbols that you can then use to better figure out what's actually going on in your data. And the last thing that we'll discuss is unification. You want to merge data sets from different databases that have different IDs, and you want to make sure that you find the equivalent records and say that they're equivalent. So these are all related but slightly different uses.
The important thing, I guess, and I'm going to mention it later as well, is that even though the differences between these uses are subtle, it's important to understand them, because there are different ways of making mistakes in each, and I'll talk about the challenges next. There are a lot of different services out there now that help you map these IDs. DAVID, as was mentioned; there are actually a few dozen of them. The one I like recently is called Synergizer, and I like it because it's just so easy to use and it's focused on one thing, identifier translation. But you can also do identifier translation with BioMart, at UniProt, and with many other sources. Synergizer basically takes all the information from Ensembl, and you can put in a set of genes and say, okay, these are Affymetrix IDs and I want Entrez Gene IDs, or I want gene symbols, and you basically get a table as a result. That's all it does. It's fairly simple.

Should it be able to convert a list of gene symbols or names?

That's one of the challenges I'll mention. When you say gene names, do you mean synonyms, like protein names? Right. So it can handle standard gene names, but if there are synonyms that are not standardized, these tools are usually not that good at recognizing them. I'll mention that in a sec. For certain cases it works really well; in other cases it's a little bit more difficult.

So here are the challenges. This is probably one of the major sources of errors in genomics analysis, because if you don't do a good job mapping IDs and you map the wrong ID, then you'll think that your gene MDM2 is expressed in a certain way, but it's actually not MDM2 that you're talking about. It's something else. That's a pretty serious mistake. So it is important to check these lists over. There are a couple of different problems that may occur.
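As a rough illustration of what identifier translation amounts to, and of why you should check the result, here is a minimal sketch. It assumes you already have a translation table loaded as a dictionary; the probe IDs and the symbols `GENE_X` and `GENE_Y` are invented, and this is not how Synergizer or any other service is implemented.

```python
def translate_ids(ids, mapping):
    """Translate IDs with a lookup table and report what could not be mapped.

    `mapping` maps a source ID to a set of target IDs, because one probe
    can point to more than one gene.
    """
    translated = {}
    unmapped = []
    for identifier in ids:
        targets = mapping.get(identifier)
        if targets:
            translated[identifier] = sorted(targets)
        else:
            unmapped.append(identifier)
    return translated, unmapped


# Hypothetical translation table (probe ID -> gene symbols).
probe_to_symbol = {
    "probe_001": {"TP53"},
    "probe_002": {"MDM2"},
    "probe_003": {"GENE_X", "GENE_Y"},  # one probe, two candidate genes
}

translated, unmapped = translate_ids(
    ["probe_001", "probe_003", "probe_999"], probe_to_symbol
)
# Entries with more than one target are the ones worth checking by hand.
ambiguous = {k: v for k, v in translated.items() if len(v) > 1}

print("translated:", translated)
print("unmapped:", unmapped)
print("ambiguous:", ambiguous)
```

Keeping the unmapped and ambiguous entries in separate lists, rather than silently dropping them, is what makes it possible to check the list over afterwards.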
One thing is gene names. There are lots of different types of gene names, and in general a lot of gene names are ambiguous, which means that they're not necessarily unique to one gene. You can have two different genes with the same name, and that makes it very difficult for computer programs to sort out. Software is not smart, and it can't figure out which one you mean. It just has a big list and it looks things up in the list. So in general gene names and synonyms are not good identifiers. An example of this is p53. People typically say p53, but there's a whole bunch of different names if you go to Entrez Gene. Entrez Gene is actually the best source, I think, of gene name synonyms. You can see these four different names for p53, but the standard gene symbol is actually TP53. So it's best to use the standard gene symbols if possible for tracking your data, and the standard gene symbol is also what's recognized by these tools. The standard gene symbols are organized differently in different organisms. All of the gene databases usually have the standard gene symbol; often they just call it the symbol. So if you go to Entrez Gene, it'll say the standard symbol or standard name. If you go to a yeast database it will say this is the standard name, and then there are aliases or synonyms. Usually. Not every database makes that distinction, but it's an important distinction, and the ones that we discuss do make it. For every organism it's different; basically it's up to the people who are working with that organism to standardize the gene names. So for human it's the HUGO Gene Nomenclature Committee. For mouse it's the mouse database, and for yeast it's the yeast database that standardizes these names.

Yes? Yeah. So if you put something in the list and it's not recognized, it will tell you. And you can go and look at that manually to correct it.
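The "big list that it looks up in" can be sketched in a few lines, which also shows why ambiguous synonyms are a problem. The synonym table below is illustrative only: the TP53 aliases are examples of the kind of synonyms a gene database lists, while `GENE_A`, `GENE_B` and the shared alias `ABC1` are invented to show a collision.

```python
from collections import defaultdict


def build_name_index(genes):
    """Index every standard symbol and synonym (case-insensitively)."""
    index = defaultdict(set)
    for symbol, synonyms in genes.items():
        index[symbol.upper()].add(symbol)
        for name in synonyms:
            index[name.upper()].add(symbol)
    return index


def resolve(name, index):
    """Return (standard_symbol, status) for a user-supplied gene name."""
    candidates = index.get(name.upper(), set())
    if len(candidates) == 1:
        return next(iter(candidates)), "ok"
    if not candidates:
        return None, "unrecognized"
    # More than one gene shares this name; software cannot pick for you.
    return None, "ambiguous"


# Illustrative table: standard symbol -> synonyms.
genes = {
    "TP53": ["P53", "LFS1", "TRP53"],
    "GENE_A": ["ABC1"],  # hypothetical gene
    "GENE_B": ["ABC1"],  # hypothetical gene sharing the same alias
}
index = build_name_index(genes)

for query in ["p53", "TP53", "ABC1", "my_favorite_gene"]:
    print(query, resolve(query, index))
```

A standard symbol always resolves cleanly here, while the shared alias does not, which is the practical argument for tracking data by standard symbols or stable IDs.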
So I have noticed that there is a progressive loss of recognition of these identifiers as you go from, say, the Affymetrix code to the Entrez Gene ID and then to the gene symbol.

You mean as you do the translation you lose genes? Yeah. So that's this last problem: there are problems reaching 100% coverage. These are due to things that you are probably aware of. There may be version issues. Synergizer gets its information from Ensembl, which gets information from Affymetrix for Affymetrix data, from Illumina for Illumina data, etc. Some people have problems with different companies' annotations. When a company provides you with a chip for gene expression, for instance, they have these annotation files, and in the past people didn't like the Affymetrix files, but they've gotten a lot better. For certain organisms they still might not be ideal, and within the community people say, these are not as good, maybe we can do better, and they update them, maybe even manually. For other platforms like Illumina, sometimes people say, I don't really trust this Illumina annotation as much as getting it from this other source. So if there are problems with the annotation, people usually try to fix it, and then it's a problem that you have to deal with somehow and it's going to take more time. But generally a lot of the platforms that are fairly established have very good annotation.

But one of the reasons that you would not get full coverage right off the bat, and wouldn't even expect it, is if you have a probe ID on a gene expression array and you don't even know what gene it is. Maybe it's not even a gene anymore; maybe it was designed when they thought that an open reading frame existed and now they don't think it's a gene. There are hundreds of different reasons why that might happen. Also there may not be any known information about that gene; there may not be a gene symbol known.
So you will lose a whole bunch of probe information for a lot of reasons like that, and those are unavoidable; you just can't do anything without doing more experiments. But once you have a set of gene IDs, there are avoidable losses, things that you can do to increase the coverage where it is possible. Sometimes it's a problem with different databases. Entrez Gene is pretty comprehensive, but maybe it doesn't have an Entrez Gene ID for all of the genes; maybe instead there's a UniProt number for some of those genes that don't have Entrez Gene IDs. So it's usually good to try to map to a few different sources. What I usually do is put them all in Excel, and if I see there's a category of genes that are missing Entrez Gene IDs but have UniProt IDs, then I'll copy the UniProt IDs and see if I can get Entrez Gene IDs from those. Usually by doing the easy initial thing with Synergizer you'll get about 90% coverage, and then if you want you can do this additional work in Excel to improve your coverage by actually examining the results.

Just to comment on a couple of things: these favorite tools change from year to year, mostly because the people who are creating the tools lose interest, or the graduate student who built it moves on to a postdoc somewhere, so they don't keep updating it, and you get the sense that something like Synergizer may not stay up for long. I put a few other links up on the wiki; there's one in Spain. As we go along we'll find it.

Thanks, yeah, and by all means, the tools that we present in general are not the only tools; they're just ones that we're presenting, and the tools that I present are just the ones that I use personally.
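The Excel routine described above, looking for genes that lack an Entrez Gene ID but do have a UniProt ID and going through that second route, can be sketched as follows. The rows, the ID values and the `uniprot_to_entrez` table are all placeholders for illustration, not real identifiers.

```python
def coverage(rows, column):
    """Fraction of rows that have a non-empty value in `column`."""
    mapped = sum(1 for row in rows if row.get(column))
    return mapped / len(rows)


def fill_from_uniprot(rows, uniprot_to_entrez):
    """Return copies of the rows, filling missing Entrez IDs via UniProt IDs."""
    filled = []
    for row in rows:
        new_row = dict(row)
        if not new_row.get("entrez") and new_row.get("uniprot"):
            new_row["entrez"] = uniprot_to_entrez.get(new_row["uniprot"])
        filled.append(new_row)
    return filled


# Placeholder table after a first mapping pass (None means "no ID found").
rows = [
    {"probe": "p1", "entrez": "1001", "uniprot": "U1"},
    {"probe": "p2", "entrez": "1002", "uniprot": None},
    {"probe": "p3", "entrez": None, "uniprot": "U3"},
    {"probe": "p4", "entrez": None, "uniprot": None},
]
# Placeholder second mapping source.
uniprot_to_entrez = {"U3": "1003"}

before = coverage(rows, "entrez")
after = coverage(fill_from_uniprot(rows, uniprot_to_entrez), "entrez")
print(f"coverage before: {before:.0%}, after: {after:.0%}")
```

The row that has neither ID stays unmapped, which matches the point that some losses are unavoidable without more experiments.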
Yeah, but like you said, I did try this, and you do this but you always worry about it. So the point I'm trying to make is, would you then at some point go back to the source? For example, you go to the company's website and get the actual sequence and try to BLAST that sequence and see if it comes up with something different from what your UniProt or your other identifiers have come up with?

So that would be great if you did that. I'm not sure how many errors you'd find. This is mostly a question of trust and the reputation of the annotation file. Sometimes you can't even get the sequences because they won't give them to you, so that's a problem, and then, you know, who knows what you do then; use another platform. But that would be an ideal thing to do; it's just time consuming. If you want to go to that level of detail, you could go through and check every gene name, and you will definitely find errors and it will reduce the number of errors. It's just the amount of time that you're willing to spend.

I'm guessing you would be able to do that just for the ones that you didn't get? For the ones that you didn't get, you can definitely do that.

Another thing that you guys should be aware of is errors that are introduced by Excel. How many people have seen errors introduced by Excel? So quite a few. We have to write a group letter to Bill Gates to complain. Basically Excel has auto-formatting features where it tries to be really smart, but that's not the kind of intelligence that is good in biology; it's maybe good in business. When you type in the gene Oct4, it changes it to October 4, because Oct4 is obviously a date, not a gene. When you're using Excel you can turn off these features, but they're on by default. The nasty part of this is that when you're copying and pasting gene lists of a thousand or 20,000 genes you will not notice
that this is happening, because you don't even see it on the screen. So definitely be aware of that and turn off these auto-correct features if you're using Excel. Or, if you don't like that, you can compare the results from Excel against the original in a text editor, but that's probably too much effort. If you know how to do Excel macros and functions you can automate some of this double checking, but ideally you just turn off these features. There's a paper about that by Zeeberg et al. from 2004, just about all these different errors that are introduced, and it's in the book.

Sorry? OpenOffice? I haven't really used OpenOffice that much. I think it also has these auto-correct features, but I'm not sure if they're on by default, or the particular way that they work may be better. But most people are using Excel anyway, because, you know, you don't want to not use a very powerful tool just because it has this error. It's something to be aware of, and you can prevent it from happening if you are careful about the formatting when copying and pasting. But this paper, when it came out, was very interesting because they analyzed databases and found that this problem had made it in there: October 4th was listed as a synonym for Oct4 in Entrez Gene, or something like that.

What time did we say the break was, 10:50? 10:30. Okay.

So this is a summary of the main things that we discussed. There are lots of identifiers for genes and their products, and when you're working with large gene lists you have to understand how to convert them using available ID mapping services, and try to use standard, commonly used identifiers when possible to avoid some of these challenges. Okay, any other questions?
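For the double checking of Excel output, one option outside Excel is a small script that scans a gene list for entries that look like dates. This is a minimal sketch; the patterns are assumptions about what a converted gene name typically looks like after export (for example `4-Oct` or `2004-10-04`), and they will not catch every format or locale.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Assumed shapes of date-converted names; extend for your own locale.
PATTERNS = [
    r"\d{1,2}-(?:" + MONTHS + ")",   # e.g. 4-Oct
    r"(?:" + MONTHS + r")-\d{1,2}",  # e.g. Oct-04
    r"\d{1,2}/\d{1,2}/\d{2,4}",      # e.g. 10/4/2004
    r"\d{4}-\d{2}-\d{2}",            # e.g. 2004-10-04
]
DATE_LIKE = re.compile("^(?:" + "|".join(PATTERNS) + ")$", re.IGNORECASE)


def find_date_like(names):
    """Return (position, value) for every entry that looks like a date."""
    return [
        (i, name)
        for i, name in enumerate(names)
        if DATE_LIKE.match(name.strip())
    ]


# Made-up column as it might come back from a spreadsheet export.
gene_column = ["TP53", "4-Oct", "MDM2", "1-Sep", "2004-10-04", "OCT4"]
suspicious = find_date_like(gene_column)
print(suspicious)
```

A real gene symbol such as OCT4 passes untouched, so anything the script flags is worth comparing against the original list.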
The ones that we underlined in the book are fairly general and standard. For instance, I try to use Entrez Gene IDs wherever possible, but in figures, if you're working with human, you'd mostly use the standard symbol or something like that. I try to use a standard symbol even if I work with proteins and the protein has a different name; I try to use the gene symbol just because it makes it easier for people to look up. But there's no official standard, because whoever does research in biology has a license to name their own gene, and they want to name this particular thing, and two people working on the same gene name it different things, and people argue about this. Eventually it will have to be standardized, and there are standardization efforts like the HUGO Gene Nomenclature Committee. So try to use standards where they're available, and I think that's safe.

Okay, so in the next few minutes before the break I just want to completely change topics and give you a quick introduction to network visualization and analysis, with a particular piece of software, Cytoscape. Again, there is other network analysis and visualization software out there; this is probably the most commonly used one, and many people work on it, including my own lab and nine other labs. So the focus of the next section will be on just the basics of this software and what it can do, just to give you a flavor of it so that you can try it out during the open labs when you have time, and also this evening if you're planning to stay for the open lab. During that time you can ask questions. But the real use of Cytoscape for analysis will be covered tomorrow afternoon, and we'll go over it again in more detail in the lab. You can by all means try out all of these things before that. So, Cytoscape: I think many of you have installed it on your laptop already. How many people before this course, before
hearing about it through this course, have used Cytoscape? Okay. So it wasn't really successful; it needed more time. Okay.

So Cytoscape is freely available network visualization and analysis software that you can install on most types of computers. It's made by a number of different labs, people in academia and industry, in California and New York, in Paris, and also Agilent Technologies and Unilever contributing to the software. It's great for visualizing networks: if you have any kind of network information, like protein interactions or gene interactions, you can visualize it and overlay lots of different types of data on the network. It also has a lot of additional functionality that comes in from plugins that you can install, sort of add-ons. The typical way that you'd do network analysis on gene expression, for instance, is to collect information about networks, and this comes from a number of different sources. I'm giving you a quick intro now, but I'll go into more detail about what these sources are and how to use them tomorrow afternoon. The idea is that you get network information, you can analyze it, and you can combine it with gene expression data or protein expression data. Cytoscape allows you to visualize and manipulate networks: you can move parts of the network around with your mouse, and you can automatically lay out networks in different ways. Automatic layout basically just helps you see the network so that it's not all crowded and the nodes are not overlapping each other. You can filter the information to get information about networks. And maybe just before I continue: module 4 has a more general introduction to interaction databases, and also a more general introduction to networks and what they mean. I'll just give the example of protein interaction networks now, which is something
that people may be familiar with. So just think about protein interactions for now, and other types of networks later. One of the problems if you're dealing with lots of network information, like massive protein interaction sets, is that you get a hairball effect. Papers that show networks often show a big ball that looks like a hairball, and it's difficult to interpret. The way to deal with that is to focus in on particular areas. If you are interested in a particular pathway, you can just select the proteins from that pathway and look at them and their relationships, and Cytoscape allows you to do this. The software also allows you to visualize lots of different types of data on the network all together. So this is a protein interaction network from yeast that's centered around genes involved in DNA damage and repair. Each circle is a protein, and each line that connects the circles is a protein interaction from the BioGRID database. The nodes are colored by GO function, kinetochore, nucleosome, replication, a particular general type of function, and the size of the nodes is related to the transcription amplitude in a cell cycle experiment. This is a particular experiment that looked at the expression of genes over the cell cycle, and the highest level of expression is the size of the node. So you can see that the nucleosome genes are highly expressed at some points of the cell cycle, whereas the kinetochore genes are not as highly expressed over the cell cycle. The thickness of the lines represents the correlation of expression between two genes. So if there's a very thick line between two genes, like here, it means these two genes are highly correlated: they're always expressed at the same time or not expressed at the same time. So just by overlaying all of this type of
data onto a network you can start understanding some additional information about a network. For example, gene functions are clustered together if you're looking at protein interaction networks, and this property can be used to predict new functions. So if there's an unknown gene here that is not known to be involved in the kinetochore but it's connected to a lot of kinetochore genes, then maybe it's part of the kinetochore, or controlling the kinetochore, or somehow involved. This is also highly clustered; things like this are often protein complexes. You can see that, and it gives you an idea about how the biological processes are connected. So the advantage of visualizing genomics information on a network is that you can bring in a lot of prior information and put it together in a single picture. Most people are visual, so it's nice to look at a picture and start thinking about how that relates to your experiment. In this particular example, the color is linked to a GO term. I don't have the GO term ID here, but this information is derived from Gene Ontology. So you can use Cytoscape to make figures like this, and we'll show you how to do that in the lab and also tomorrow afternoon, and you can try it yourself. So one of the main advantages of Cytoscape is network visualization. The other main advantage is that there's a large active community around the software, which has developed a lot of help documentation, tutorials and different case studies. There's a mailing list for discussion, and different data sets that you can try out. We've included a paper, a protocol for how to use Cytoscape for gene expression analysis, in your binder. There's an annual conference, if you end up being really interested in this. There are thousands of users and quite a lot of downloads. Because there's an active community, people have extended the functionality of the software by
writing their own plugins, so you can go shopping for plugins that are useful for particular types of analysis, and some of those might be particularly useful for you. I will be covering more of the plugins tomorrow; I will cover about five or six of them that are most relevant for omics and interpreting gene lists. But there are quite a lot of other plugins available that you can go and look at. If you are a programmer, or you're friendly with a programmer, and there's no plugin that does what you need, you can write it yourself, because the software is open source and freely accessible to extend.

Okay, so that's really just a taste of what Cytoscape is, just to give you an intro: it's useful and free software for visualizing and analyzing networks, it provides all the basic routines for manipulating networks, and there are plugins to extend the functionality. Again, we'll go over that quite a bit more over the next day and a half or two days, but are there any questions right now about what we've mentioned so far? The latest version is 2.6.3, and you don't necessarily need to upgrade, but there are new features in 2.6, some of which will be mentioned here, so you may want to upgrade.

One of the plugins is not going to be... Oh yeah, so we'll look at why that is; I will have to just check. I know some people have had problems installing the Agilent Literature Search plugin. I couldn't verify that, but there's a checkbox that says "show outdated plugins." I don't know if you tried checking that. Sometimes a new version of Cytoscape comes out and not all the plugins have been updated and verified with it, but they're still usable with it. So if you check that box you'll see additional ones. And if that's really not working, there's a particular file that you can delete to reset things, and that may fix certain problems. During the lab we can hopefully get everyone set up with Cytoscape 2.6.3. The only difference between that and earlier versions of 2.6 is fixes for Macintosh computers that have been upgraded to the latest version of Java. So it's fairly technical, but if you have a Mac you probably do want to use the latest version. Any other questions?

Okay, so it's 10:30 now and we have a 20-minute break. I guess there are refreshments outside. After we come back from the break at 10:50 we'll have a demo of Cytoscape; I'll just show it to you on the computer. We'll also take you through the resources that we mentioned, BioMart and Synergizer, and you can check out the wiki for other links that some of the other instructors have posted, and we can chat about that as well.
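For anyone who wants to see the kinetochore idea from the network figure in code form, here is a minimal sketch of the simplest kind of network-based function prediction, where an unannotated gene takes the most common annotation among its neighbours. The gene names, the edges and the 0.5 cut-off are invented for illustration; this is a toy neighbour vote, not what Cytoscape or any of its plugins actually computes.

```python
from collections import Counter


def predict_function(gene, edges, annotations, min_fraction=0.5):
    """Guess a gene's function from the annotations of its direct neighbours.

    Returns the most common neighbour annotation if at least `min_fraction`
    of all neighbours carry it, otherwise None.
    """
    neighbours = set()
    for a, b in edges:
        if a == gene and b != gene:
            neighbours.add(b)
        elif b == gene and a != gene:
            neighbours.add(a)
    labels = Counter(annotations[n] for n in neighbours if n in annotations)
    if not labels:
        return None
    label, count = labels.most_common(1)[0]
    if count / len(neighbours) >= min_fraction:
        return label
    return None


# Hypothetical interaction network: UNK is connected mostly to kinetochore genes.
edges = [
    ("UNK", "K1"), ("UNK", "K2"), ("UNK", "K3"), ("UNK", "N1"),
    ("K1", "K2"),
]
annotations = {
    "K1": "kinetochore", "K2": "kinetochore", "K3": "kinetochore",
    "N1": "nucleosome",
}

print(predict_function("UNK", edges, annotations))
```

Three of the four neighbours of the unknown gene are kinetochore genes, so that is the guess; a gene with no annotated neighbours gets no prediction at all.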