So welcome everybody from me as well. I'm going to be giving an introduction to gene lists. One of the things that we find every year is that some people already know how to use some of the basic tools. They already have gene lists they work with. And some people have never really worked with large gene lists. So this intro might be familiar to some, but we need to have it here to get everyone on the same page with some of the basics. And there will be something that you'll likely learn even if you're an expert in gene lists. So I'm going to talk about just the idea of gene lists, what they represent, and then talk about places to go get information about gene lists and how to deal with identifiers. This is probably one of the things that's a headache for almost everybody: if you're getting gene lists from other people, or you're mixing gene lists from different sources, how do you connect them together? So the basic idea, which I guess most biologists are facing these days, is: I have some screen or gene expression experiment or ChIP experiment, and I have hundreds or thousands of hits. Now what? What do I do with all these gene lists? And basically, the main question that we want to ask is: tell me what's interesting about these genes. And this is what we're focusing most of the lectures on. So you can get your gene lists, say, from a gene expression experiment. You might get a list of genes ranked by fold change, comparing your cell line or state of interest to a control. Or you might have a lot of different gene expression experiments, and you cluster them. And that cluster represents a set of genes. It might have 100 genes in it. And you want to summarize that cluster, find out what that cluster means.
And the way we're doing this is by combining this gene list with prior information that we know about how the cell works, or about physiology, depending on the system that you're working with. And there's a huge amount of information that we know from databases. And there are a number of different ways of doing this analysis to try and figure out what biological processes this experiment is telling me about. And see, I think I'll just switch to another slide I wanted to use here. It's not working. So basically, we developed a kind of overview workflow that we'll also modify during the course. And the workflow will summarize the different areas of the course and the tools that we're using. But basically, as I mentioned, gene lists come from all different types of places. And that means that they mean different things. So you can have molecular profiling, transcript or proteomics data. Some of these technologies just identify genes, so they're not associated with any additional numbers. You just get a flat gene list. And sometimes you get some quantification, some additional number, so it's basically a gene list plus values, and that allows you to rank or cluster the information. We're not actually going to go over how to do that in this course. We expect that this course starts with a gene list, so you have to know how to do the ranking and clustering yourself beforehand. And it's fairly important to actually do that well so that you get a good gene list to input into the next tools. Protein-protein interactions or ChIP transcription factor binding sites, somebody mentioned that earlier, also generate lists of genes. Genetic screens are another common thing that we see. People are doing an RNAi screen, for instance, with a phenotype as the output. And you get a list of genes that are affected by RNAi when looking at that phenotype. And then there are more clinical genomics types of applications.
Next-generation sequencing of tumors generates a huge number of mutations, which can be mapped to genes. Or if you're doing an association study, looking at a specific disease, and you have a lot of markers like SNPs or copy number variants that are associated with the disease, converting those to gene lists as well, and then trying to understand what biological processes, for instance, are involved in the disease is interesting. So a gene list can mean different things. Most of the experimental methods that I mentioned before usually have something to do with understanding more about protein complexes, pathways, physical interactions. You might have a screen that just tries to find genes of similar function. For example, you get a screen that finds all protein kinases, which is not really related to one biological process. It's just a particular type of molecular function. Or it might be related to location in the cell or tissue. Just give me everything in the nucleolus. You could use mass spectrometry to do that. And then it might also be chromosomal location. So there might be linkage groups in your association study, and you have a whole bunch of genes that are on a section of a chromosome that is associated with your disease. That locus in your chromosome is associated with the disease, and you want to find out what's the causal gene or what's most likely interesting in the disease. So because gene lists mean different things, you have to know what you want to accomplish with your list. And hopefully that's part of the experimental design. But basically, as I mentioned, one of the things that you can do is just summarize all the biological processes or other aspects of gene function that the gene list is telling you about. And that is mostly hypothesis generating. So if you're doing a screen and you have no idea what to expect, that's usually a very useful place to start.
So the screen says, OK, all of the genes that are coming out of the screen are related to the cell cycle. So the cell cycle must be important in this phenotype. And then you can test that. People are also very interested in finding a controller for a specific process, and any kind of molecule could be a controller in the cell. But usually we don't have a lot of information about protein or small molecule controllers unless you're adding those things in your experiment. For instance, you're adding a small molecule or growth factor and you're seeing the result. But we do have a lot of information about transcription factors and microRNAs from gene expression experiments. And sometimes, if you have information about a lot of those, you can find a controller. We'll talk a little bit about that later. And another question that people are interested in is, if you know something about a pathway and you're very interested in that pathway and in finding new members of that pathway or new members of that complex: I have a set of genes and I want to get new information. I want to find similar genes that I could test as a new member of that pathway. And that's related to discovering new gene function. It might not just be a single gene, it might be the function of the set of genes that you have in your list. And then I mentioned correlating a disease with a phenotype. I mentioned candidate gene prioritization. This is another area that people are really interested in. Often when you have association studies in human genetics, you can get a list of genes. You don't know which one is really underlying the effect and you might be interested in finding the causal gene.
Maybe it's not one gene, maybe it's lots of genes, but trying to look at the function of genes and relate it to the function of other genes that are associated with that disease might help you better understand which set of genes is most likely to be functionally similar to what you expect. And also performing differential analysis. So what's different between samples? If I have tumor and tissue around the tumor, what's specific to the tumor, or to different stages of tumor? Or if you have time points in your experiments, you have 12 hours, 24 hours, et cetera, what is changing over time? And being able to compare all of the results of your gene function or your pathway analysis, for instance, across time points will maybe tell you something about the dynamics. It doesn't have to be time points; it could just be different stages of an experiment or different cell lines or something like that. So this is actually a good place for me to try and find this workflow slide. There it is. So we're going to be updating the slide over the course of the workshop and adding the tools that we go over for each section. But basically, this is the general workflow of the course that we'll refer back to in different instructors' lectures. So we start with some gene list. Here we have a set of human genes as an example. The first thing that we'll be talking about is annotation. So getting more information about these gene lists in an automated way, not just looking at the genes individually. We'll also be talking about something called gene set enrichment, which is trying to find out if there's something surprising in these gene sets. As I mentioned, you might get a whole bunch of genes and they're significantly enriched in cell cycle genes. Another thing is pathway or network analysis, where you consider connections among the genes, and there might be different types of connections.
The genes might encode proteins that interact, or they might be part of the same pathway, in which case you have a process diagram, so you have some understanding of the process, like A phosphorylates B and B phosphorylates C, et cetera. And then you're seeing where your genes fit in that process. And that helps you connect your gene list to biological pathways that you might be familiar with, a map of the cell that you might be familiar with if you've studied a lot of cell biology. So all of these things, I guess, when we're dealing with processes, are helping you connect your gene list to your model of how things are working, but network analysis and pathway analysis are usually focused on biological processes. And then gene function prediction, which is: I have a list of genes, tell me more genes that should be on this list. If I have some members of a complex, give me more members of that complex. So predict new members of the complex that you can then go and test. So we're going to go over all of those fun things, but first we just wanted to go over some of the basics. And the two major areas of basic information are attributes, and gene identifiers and mapping. So I'm just going to go over attributes in the beginning. There's a huge amount of information available about genes in databases, and it's useful to know how to get quick access to all of that information. So there's all the information you'd expect, a huge amount of functional annotation information. If you have a gene, you might already know the biological process and the molecular function, say that it's a kinase, or the location. You know where it is in the genome. There might be an existing association of this gene with a disease if it's human.
There might be a lot of information about the gene structure, the gene model, where the introns and exons are, if there are splice variants, if there are known transcription factor binding sites upstream of the gene, if there's known variation. And also protein properties: protein domains, secondary and tertiary structure, so the 3D structure of the protein might be known. It might be known that the protein is phosphorylated, so post-translational modification. And you also might know how the gene interacts with other genes. So this is a really huge amount of information and actually it's fairly challenging to collect all of this data for a set of genes. So I'm going to talk initially about gene function annotation, explaining the gene ontology. And then I'll talk about a place where we like to go to get a lot of this type of information, where it's available conveniently. So how many people already know about the gene ontology or use it? How many people really know about the gene ontology, like, are experts in gene ontology? So about a third of the class already knows about the gene ontology, so that's good. Okay, so the gene ontology is a system that's very widespread. It's used quite a lot to define gene function, and basically it's a set of biological phrases or terms which people have agreed upon to apply to genes. So there might be a term called protein kinase, another term called apoptosis, or a term plasma membrane or something like that. There are thousands of these terms and they're agreed upon, so they're standardized. Gene ontology is also a dictionary because almost all the terms have a definition, so if you don't know what a cilium is or something, you can look up the definition. And it's also an ontology. So this word ontology just means a formal system for describing knowledge.
Somebody has just thought about how to organize and structure information formally, and I'll explain a little bit more about what that means. So the ontology aspect is basically that the terms are related within a hierarchy. So you have relationships between terms. You don't just have a flat list of terms. That's also part of the ontology idea. And this hierarchy defines gene function at multiple levels of detail. So for instance, you have at the top the gene ontology and then there's biological process and physiological process, and you get more specific all the way down to tissue homeostasis, immune cell homeostasis, B cell homeostasis and then B cell apoptosis. In this hierarchy, the things at the top are more general, the things at the bottom are more specific, and there are relationships between these terms. There are mostly two types of relationships, "is a" and "part of". I can't remember what red is here; I guess it is "part of", and black is "is a". So B cell apoptosis is a type of apoptosis, which is a type of programmed cell death. And if you look at this hierarchy for different terms of interest, you can get an idea of how the terms are related. Most of the "part of" relationships are in the cell component, so you'd have something like nucleolus as part of the nucleus. Terms can have more than one parent, and B cell apoptosis, for example, has two parents. It's a type of apoptosis and it's also part of B cell homeostasis. That's important to understand. Here's another, more detailed example. One thing that's important is that gene ontology is in general species independent. So some more specific terms are specific to a group, like chloroplast is specific to plants, but higher level terms generally apply to any organism. So in general this can be used for any organism and that's the goal of the gene ontology. So I mentioned different types of gene function.
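The hierarchy described above, where a term can have more than one parent, can be sketched as a small directed graph. This is only an illustrative sketch: the term names follow the example from the slide, but the exact edges in the `PARENTS` dictionary and the helper name `ancestors` are assumptions made for the example, not copied from the real ontology.

```python
# Toy fragment of the GO hierarchy. Term names follow the lecture's example;
# the edges are illustrative assumptions, not taken from the real ontology.
PARENTS = {
    "B cell apoptosis": [("is_a", "apoptosis"), ("part_of", "B cell homeostasis")],
    "apoptosis": [("is_a", "programmed cell death")],
    "programmed cell death": [("is_a", "biological process")],
    "B cell homeostasis": [("is_a", "immune cell homeostasis")],
    "immune cell homeostasis": [("is_a", "tissue homeostasis")],
    "tissue homeostasis": [("is_a", "biological process")],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every more general term reachable from `term` via any relationship."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Both parents of "B cell apoptosis" contribute ancestors.
print(sorted(ancestors("B cell apoptosis")))
```

Because of the two parents, a gene annotated to B cell apoptosis is implicitly also associated with the more general terms on both branches.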
GO covers three; they divided gene function into three different types. Cell component: where things appear in the cell. Molecular function: what the enzymatic function is, for instance, protein kinase or glucose phosphate isomerase. Usually these terms have "activity" afterwards, so the protein has isomerase activity, or the gene encodes a protein with isomerase activity. And then biological process. This is more related to pathways, although the pathways can be very general. So the GO terms themselves are added by a group of editors at the European Bioinformatics Institute in Hinxton near Cambridge in the UK, and also database groups that are working with individual organisms, like the yeast genome database or the mouse genome database, are adding GO terms that are important for their communities. You can also request a term to be added. Anybody can do this. Experts also help with major areas of redevelopment in gene ontology. So gene ontology is actually being edited all the time. There are new terms added every week and people from all over the world are actually organizing this. As of this morning when I checked, there were over 32,000 terms, almost all of them with definitions. The majority are different types of biological processes, but there are also almost 3,000 cell components defined and almost 10,000 types of molecular function. So the first part of gene ontology is these terms, this dictionary of terms and their relationships. That's the real gene ontology. The second part of gene ontology that really makes it useful for us is annotations. Basically this is taking a term from the dictionary and linking it to a gene. And that's a separate group of people and a separate process that does that. And in general these things are known as gene associations or GO annotations. You can have multiple annotations per gene.
So a given gene might be a protein kinase involved in the cell cycle and present in the nucleus. And there are a lot of different types of gene ontology annotations that are reviewed manually, but there's also a section that is created automatically, and I'll go over that in more detail. So this is a fairly important point when you're working with gene ontology. Usually the type of annotation that people want is the high quality annotation that some scientist has verified is correct. It's curated by trained scientists, and it's usually higher quality. But unfortunately there's reduced coverage: your gene might not have been curated by a scientist. So there's a smaller amount of this type of information because it's time consuming to create all of it. So there's a second level of quality that is reviewed computational analysis. This is where a computer program has tried to predict something about the gene function but then somebody has checked that it's reasonable. And then there's pure electronic annotation, which is annotation derived without any kind of human oversight at all. And the accuracy varies. Some of these computational predictions are actually really good. But in general it's considered lower quality than the manual type of annotation. And here are a couple of examples of types of computational annotation that are really good. Computer programs are very good at predicting certain types of features on proteins; transmembrane domains, for instance, are very well predicted. Signal sequences or nuclear export sequences, nuclear import sequences, those are very well predicted because they're very clear motifs that you can find in a protein. And also protein domains: a protein kinase might be recognized because it's similar to other protein kinases.
And so often with things like transmembrane domains you'll see an associated annotation called membrane, and that might be more trustworthy because the computer programs behind it are well known to be very high quality. But other things where it's just, hold on, I'll come back to you in a sec. Other types of information, where it's purely enzymatic activity predicted by sequence similarity, are harder. Especially if there's a bunch of enzyme activities in related sequences, sequences that are very similar to each other but where the enzyme activity is different, computer programs do a terrible job of predicting that. So it's just really important to be aware of the annotation origin, and I'll tell you how to do that in a sec. Question? They are reviewing papers, and it depends on the source of the annotation. So I'll talk about the annotation sources in a sec. For some organisms it might be better than others, so I'll talk about that. So each time that one of these curators or annotators or automated systems assigns a term from the gene ontology to a gene, they add some more information about that. It's not just A and B connected. They assign an evidence code, which is their evidence for actually assigning that term to the gene. There are lots of different evidence codes. This is in your book for your information. All of these evidence codes in red are manually reviewed, and this evidence code in blue, inferred from electronic annotation, is the part that's not manually reviewed at all. So you can see that there are types of evidence codes from experiments, so inferred from experiment. There are computational analysis evidence codes, like what I mentioned, reviewed computational analysis, or inferred from sequence alignment or sequence orthology. And this is what you were asking about. There's a set of evidence codes that's probably the highest quality, which is some author statement in a paper, either traceable or non-traceable.
And then this is where the curators are synthesizing the literature and adding their own knowledge. You have a code called inferred by curator. They could also say there's no biological data available because they did a full literature search and they couldn't find anything. So these codes are associated with gene ontology annotations, and this no biological data available might not be very useful for you. Also, a lot of times people remove this inferred from electronic annotation before they do analyses, or at least for a first pass analysis. As I mentioned, the difference between these reviewed ones and the non-reviewed ones is coverage. So these reviewed ones only cover so many genes and you might have a number of genes that only have electronic annotation. And so if you have a set of genes that has very low coverage and there's only electronic annotation, you might be forced to try and use that, although you should understand where it's coming from so that you can decide for yourself if the results that you get from your analysis are valid. So all major eukaryotic model organism species are covered by gene ontology annotations. The UniProt database, which is a big protein database based mostly in Europe, has a gene ontology annotation group that is responsible for human annotations, but most model organism databases will also have their own gene ontology annotation groups that are basically just focused on updating these annotations. A number of bacterial and parasite species are available as well. Some of these are more automated and there are always new species annotations in development. If you're working with a genome that is fairly well established, usually you can go to the gene ontology website and just download the gene ontology annotations, and they'll be available in every software package that we'll be talking about today.
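The first-pass filtering mentioned above, removing annotations inferred from electronic annotation and those with no biological data, can be sketched like this. The evidence code abbreviations (IEA, ND, IDA, TAS) are the standard GO ones; the gene names and the annotation records are made up purely for illustration, and the function names are hypothetical.

```python
from collections import namedtuple

Annotation = namedtuple("Annotation", ["gene", "go_term", "evidence"])

# Hypothetical annotations, invented for this example.
ANNOTATIONS = [
    Annotation("GENE1", "protein kinase activity", "IDA"),  # direct assay
    Annotation("GENE1", "membrane", "IEA"),                 # electronic, unreviewed
    Annotation("GENE2", "nucleus", "IEA"),                  # electronic, unreviewed
    Annotation("GENE3", "biological_process", "ND"),        # no data available
    Annotation("GENE3", "cell cycle", "TAS"),               # traceable author statement
]


def filter_annotations(annotations, excluded=("IEA", "ND")):
    """Keep only annotations whose evidence code is not in `excluded`."""
    return [a for a in annotations if a.evidence not in excluded]


def covered_genes(annotations):
    """Set of genes that have at least one annotation."""
    return {a.gene for a in annotations}


kept = filter_annotations(ANNOTATIONS)
# Genes that lose all of their annotations once IEA/ND are removed.
lost = covered_genes(ANNOTATIONS) - covered_genes(kept)
print(kept)
print(lost)
```

Checking which genes drop out entirely, as `lost` does here, is one way to see the coverage trade-off before deciding whether to keep the electronic annotations.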
If you're working on a new genome, you've just sequenced a new genome, there aren't going to be any gene ontology annotations available. However, there are probably tools that are part of the genome sequencing center's pipeline: when they assemble the genome and then find the genes, they probably have an extra step in that pipeline that assigns gene ontology terms by orthology to closely related species, and that's an automated process. So those types of annotations will be only inferred from electronic annotation, but at least you get some gene ontology terms for that new species. So the other interesting thing is that there's very variable coverage. The slide is a little bit out of date; I just took it from this paper from 2005. I haven't seen an update yet, but the message still stands that, depending on the species that you're looking at, gene ontology annotation coverage can be really good or maybe not so good. So one of the best is yeast. In yeast, Saccharomyces cerevisiae, the yeast genome database had a very large project to make sure that every single gene in yeast was looked at by someone at least once, even if only to verify that there's no data available. So the light gray is electronic annotations and the dark gray is non-electronic annotations, and you can see that yeast is 100% covered by non-electronic annotations; those are the human reviewed ones. And there's a number of model organisms starred here with the red star. So you can see the difference. Some of them, like mouse, only had 50% coverage in 2005; it's much higher now. And some of them have 100% coverage. So sometimes we've worked with, for instance, C. elegans or fruit fly, and the gene ontology annotations for fruit fly haven't been as good as when we're working in yeast, but they're improving all the time. So there's just a further answer to your question.
There are a number of different databases contributing to the annotations and, in fact, anybody can make annotations available. If you have a way of predicting gene function, you can make annotations available and you can advertise them. If they're official, like from these various databases, then they would be available on the gene ontology website. So you might actually be able to get gene ontology annotations from a place other than the gene ontology website. So one of the problems with gene ontology that you'll probably face when you work with this information is that there are sometimes too many terms. I've mentioned there are 32,000 terms, and if you want to just make a simple pie chart, say I have 100 genes and I want to summarize the function in a pie chart, like tell me how many things are in the nucleus, it might be difficult because there's not just a term called nucleus. There's a term called nucleus and one for every different part of the nucleus, chromatin, nucleolus, and there might be hundreds of terms related to the nucleus. So that might be really nice to look at if you're looking at individual genes, but if you're summarizing them like this then you get 100 or 1,000 little pie slices, which is not useful. So there's an additional set of ontologies called GO slim, which is an officially reduced set of GO terms that is sometimes nice for a high level view. They've traded off the specificity of the terms for a smaller number of terms. So you won't get very detailed terms, you'll just get more general ones, but there'll be fewer of them so they're easier to work with. There's a generic GO slim that's useful for basically every eukaryotic or non-eukaryotic cell, and then there's a plant and a yeast version, and there might be other specific versions.
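The GO slim idea above, rolling detailed terms up to a small set of general ones before summarizing, can be sketched as follows. This is a simplified illustration, not the official GO slim mapping procedure: the tiny `PARENTS` hierarchy, the two-term `SLIM` set, and the gene names are all assumptions invented for the example.

```python
from collections import Counter

# Toy child -> parents map; the edges are illustrative, not the real ontology.
PARENTS = {
    "nucleolus": ["nucleus"],
    "chromatin": ["nucleus"],
    "nucleus": ["cell"],
    "cytosol": ["cytoplasm"],
    "cytoplasm": ["cell"],
    "cell": [],
}

# Hypothetical slim: the small set of general terms we want in the summary.
SLIM = {"nucleus", "cytoplasm"}


def slim_terms(term):
    """Walk up from `term` and collect the first slim term on each path."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in SLIM:
            found.add(current)  # stop climbing once a slim term is reached
        else:
            stack.extend(PARENTS.get(current, []))
    return found


def slim_counts(gene_terms):
    """Count genes per slim term; each gene counts once per slim term."""
    counts = Counter()
    for terms in gene_terms.values():
        gene_slims = set()
        for term in terms:
            gene_slims |= slim_terms(term)
        counts.update(gene_slims)
    return counts


# Made-up gene list with detailed annotations.
GENE_TERMS = {
    "GENE1": ["nucleolus"],
    "GENE2": ["chromatin", "cytosol"],
    "GENE3": ["cytosol"],
}

print(slim_counts(GENE_TERMS))
```

In this toy data, GENE2 maps to both slim terms, so the slice counts add up to more than the number of genes. That is a design choice of this sketch: it counts a gene in every slice it belongs to rather than forcing it into one.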
So the other thing that's useful to say here is that when you're doing these kinds of summaries you often realize quickly that it's difficult to work with genes that have multiple annotations. So if a gene is known to be in the cytoplasm and the nucleus, maybe it's shuttling back and forth, then which pie slice does it go to? You'll have to basically decide whether you want to have it in one of them, because you're interested in only the nuclear part of its function, or you'll decide that it has to be part of both pie slices, so that the numbers get updated. So that's something that you have to think about. Often in genomic studies that I've been a part of, if people are really interested in getting a very study specific, high level summary like this, they'll actually go through the list and choose the individual terms that they're most interested in. So if they're only interested in the cell cycle, maybe they'd be interested in more nuclear related functions rather than some other unrelated function. But that really requires manual curation, although if you're making a figure for a paper in Nature or Science or Cell, it might be worth doing that. And in general, actually, a lot of tools on the visualization side don't do a great job of visualizing multiple different functions, because if a gene has 20 different functions it's actually hard to visualize that in a pie chart or something like that. We can talk about that later. Okay, so there are a number of different tools that are available. You can go to geneontology.org slash GO tools if you're interested, but we will be introducing and talking about specific ones here. So a good place to get information about gene ontology is QuickGO. If you type QuickGO into Google or type in this URL, then you can search for a GO term or search for a protein or compare GO terms.
And that chart that I showed you in one of the previous slides is accessible from this page as well. So you can navigate through this hierarchy of gene ontology information. You can see statistics about the GO term. And if you're just looking to get familiar with gene ontology, this is a good place to browse around just to get an idea of the types of information that are available. Gene ontology is not the only ontology that's available. There are quite a few others; it's just that gene ontology is the most popular one. And there's a site called the Ontology Lookup Service. Actually, I don't know if this URL is being updated, but you can browse around about 100 different ontologies, and some of the ontologies might be useful for you. For instance, there might be a human phenotype ontology or a disease ontology or a cell type ontology, and the gene ontology doesn't cover those things. Okay, so most of the information that you can collect about biological process, molecular function and cell location comes from the gene ontology. So it's important to know where that information comes from, and then there's all this other information. Most of this information can be gathered from different databases. There are actually two types of information here. One is sequence related information, chromosomal position and individual information about genes, and then there are also interactions with other genes, and the interactions we'll be talking about mostly tomorrow in the network session. But luckily people have built nice resources that you can just go to and get all of this information in one place, a one stop shop. The three main ones that most people would be interested in start with Ensembl, which is a website from the Sanger Centre in the UK, which was involved in sequencing a lot of genomes, including part of the human genome.
At ensembl.org, you can type in a gene and get a huge amount of information about that gene, including the sequence. This used to be mostly eukaryotes, but actually it's recently been updated so they have bacterial and plant genomes as well. So there's now Ensembl Plants and Ensembl Bacteria. How many people have used Ensembl or use Ensembl? So in the lab we can try and play around with these tools, QuickGO and Ensembl, and I'll take you through some of those. Those are really good websites to get familiar with because there's a huge amount of information there. Entrez Gene: how many people use Entrez Gene? So quite a few. Entrez Gene is a website from the NCBI, which hosts PubMed, which probably everybody's used. It also contains a lot of information about genes, including protein-protein interactions, but mostly it's linking to other sites. It's fairly general: almost every gene that has been sequenced and put into NCBI is in Entrez Gene. And then if you're working on a very established model organism, chances are there'll be a specific database organized by the community. If you're working on yeast there's the Saccharomyces Genome Database, or for mouse there's the Mouse Genome Database. Similarly for rat and Arabidopsis. If you're working in these areas you've probably heard about these databases, and those are typically the best databases to go to if you're working in that area. And WormBase, which Lincoln runs.
So during the lab, we'll be giving people a chance to get familiar with some of these things, and if you're working in an area that doesn't have good coverage in these databases, then the instructors will help you find some resources that might be interesting, because there are quite a lot of others. So one really useful tool that we're going to go through in the lab as well, that helps you get information about gene lists, is called BioMart. It's actually developed by people in this building, at the OICR upstairs, but it was originally developed in the UK. BioMart is a general system for getting information out of biological databases. The idea is kind of like Walmart: you go into Walmart and Walmart has everything; BioMart has everything about genes, about biology. In particular, the BioMart that is accessible on top of the Ensembl database is particularly useful because there's a huge amount of information in Ensembl. So how many people have used BioMart? A couple, a few, okay. So this is really, really fantastic, because you just select a genome of interest. Hopefully the genome of interest is in Ensembl, but there are quite a few. So here I selected Homo sapiens from the latest version of Ensembl Genes. If you just select Ensembl Genes, which is all genes in Ensembl, and then you select your organism, Homo sapiens, you're progressively telling BioMart what you're interested in. And if you just select a genome or an organism, BioMart thinks, okay, you're interested in human, there are 20,000 or 30,000 gene records related to human, help me filter this further. So there are a number of different ways you can filter.
You can ask for all the genes in a region of a chromosome, or for a set of genes that you already know, like your gene list. You can upload that to BioMart, and if you have specific identifiers that BioMart recognizes, you can just say, okay, I have 100 genes in my gene list, tell me more about those 100 genes, which is what we'll be doing in the lab. You can ask for genes based on Gene Ontology term, so if you're interested in apoptosis, you type in apoptosis and it'll give you back all the genes that are associated with the apoptosis Gene Ontology term. There's expression, multi-species comparisons, protein domains, so this says limit genes to those with transmembrane domains, or things that have signal domains, or maybe they have SH2 domains if you're interested in those. And variations: give me genes that have specific types of population variation. So there are a lot of different ways you can slice and dice Ensembl using this tool, and as I mentioned, we'll go over using this gene filter to upload your gene list and get more information about it. Those are all called filters. You select the filters that you're interested in, and once you've selected your filters, they show up here. I'll go through this in the lab with a live demo. And then you can select what you actually want to download. Oh, sorry. What do they mean by expression? I'd have to look; expression is probably things that are expressed in a specific tissue. Different tissues, specific tissues. So you're a BioMart developer? Okay, so we have a BioMart person here who's actually developing BioMart. I think we asked you a question last year. Yeah, you came to us a few months ago. Yeah. And he commented that it was that amazing. Yeah. So once you tell BioMart what types of genes you're interested in, then you can select things to download, and there are a few different things to download.
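The filter-then-download idea described above can also be written down as a query document rather than clicked through on the website. The following is a minimal sketch, not the workshop's method: it only builds a BioMart-style XML query string and does not contact any server. The dataset name, the `go` filter name and the attribute names are assumptions for illustration; the names that actually exist depend on the BioMart instance and release, so check them on the site before relying on them.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string (no network access).

    dataset    -- dataset name (assumed example: "hsapiens_gene_ensembl")
    filters    -- dict of filter name -> value; lists are joined with commas
    attributes -- list of attribute names to download
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters narrow down which genes you get back.
    for name, value in filters.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    # Attributes are the columns you want to download for those genes.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Hypothetical example: genes annotated to one GO term, two output columns.
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"go": "GO:0006915"},  # filter name is an assumption
    attributes=["ensembl_gene_id", "external_gene_name"],
)
print(query_xml)
```

The point of the sketch is the separation the lecture describes: filters say which genes you are interested in, attributes say what you want to download about them.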
You can download information about genes, like all the Gene Ontology terms, you can get links to external systems, or download expression. These are just from this little features category that I highlighted here. We can go through the website and see what other things are available. You can download all the protein sequences for your genes, and there's actually quite a lot of information, or all the structures, or all the SNPs, or all the homologs. Yeah. How does the Ensembl build relate to the genome build? When you're in the database you have to pick the number, right? So it's build 37. How does the Ensembl build compare to the NCBI build, and is there a quick website that you could point us to where we could compare them, to see whether they're getting all their information from the same build of the genome or what the differences are? So Ensembl is actually a big pipeline for automatically annotating genomes, and they get their DNA originally from the NCBI. So that's this GRCh37 version of the human genome. And then Ensembl, I'm not sure, do they reassemble everything themselves or do they use the assembly? I think they use the assembly that NCBI gives them and then just run a pipeline on that. Typically, you might get different versions of Ensembl with the same version of the genome, and that's just because they've updated their pipeline, which brings in new information. They're also curating information. They have different sections of Ensembl that curate homology and gene models, and they keep track of that. So as the databases behind the scenes grow, the next time Ensembl runs their pipeline and releases it, you'll get more information and more coverage. So one build of the human genome can have multiple Ensembl release numbers. Yeah, so Tim, yeah? Yeah, so the GRC is actually a consortium, and they all agree on a set of coordinates.
So the coordinates are the same for the three web browsers. They all have the same coordinates, but each one has its own pipeline for how they decorate their genome, how they annotate it. So you will find variation between NCBI, EBI and UCSC in which annotations they include and how they build their gene models. They all use the same reference, but then they each have a bunch of other things that they add. When you compare different parts of the genome, you may find that one is better than the other. So they're all different in the way they decorate their genomes, but the coordinates are exactly the same for all of them within one build, basically. So build 37 is one build, and from that you'll have various versions of the annotation. But it's an internationally agreed-on coordinate system that is the same for everybody. And that'll get updated periodically. I don't know what the frequency of update of that is. Every couple of years. Every couple of years. UCSC and NCBI will have different sets of genes. There's a consortium to validate a common set of genes. So that's the CCDS project, with about 18,300 genes that they all agree on. And where they disagree, they try to work that out. So that's for human, and if you're working on a model organism, typically Ensembl is actually getting their builds from the model organism databases, which are usually the keepers of the model organism genome. So because there's only one source, they're usually more standardized. Whereas for human there are actually a number of different groups interested in assembling and annotating the human genome. So we mentioned Ensembl; we should probably also mention the UCSC Genome Browser. UCSC, the University of California, Santa Cruz, has a really popular genome browser that is used by a lot of people.
And the only reason why we don't cover it here is that this BioMart is on top of Ensembl right now, and I don't know if there's a BioMart on top of the UCSC Genome Browser; I haven't seen one. But this is really easy to get data from. But as Lincoln said, you might find that the different genome browsers out there have slightly different annotation for an individual gene. And I guess there's a general comment to be made about genome annotation. Once the genome is sequenced, the genome annotation, as you probably know, changes over time. It eventually stabilizes, but there are always questions about whether the gene starts here or here, or whether this is an exon or not. The gene model, where the start site is and where the introns and exons start and end, changes over time. And sometimes new DNA sequence information is coming into the system and the build changes the coordinates a little bit, and that updates all the genes, for instance. So you might find that if you give this system a gene list today and then next year you give it the same gene list, there might be different coordinates for those genes. Hopefully not changing chromosomes, but it could happen. So okay, throughout this lecture series we try to have summary slides at the end of each section. The basic take-home message is that there's a huge number of useful attributes about genes in databases. Gene Ontology provides gene function terms, and there's a lot of annotation relating to those terms, and there's a lot of information in Ensembl and Entrez Gene, among other sources. Any questions about that before we move on? Okay, so I'm going to move on. What's our timing, Michelle? Yeah, yeah, I see that, okay.
So probably one of the biggest headaches that people have when dealing with gene lists is gene identifiers. In this next section I'm mostly going to give you some tips about how to do things so that you will reduce that headache. So identifiers are ideally unique, stable names or numbers that help track database records. So your social insurance number is an identifier, and likewise the Entrez Gene ID 41232 is an identifier for a specific gene. They're ideally unique and stable, but there are problems with this uniqueness and stability in biological databases. Gene and protein information is stored in many databases, so the first thing to know, and you probably know this, is that genes can have many identifiers. If you go and look at a human gene in Entrez Gene, it has a whole bunch of aliases and links to other databases where that gene is also stored. And so you might get from collaborator A a list of genes with Entrez Gene identifiers, and another person gives you a list of genes with UniProt identifiers, and you want to somehow connect them. The names look different, but they might be the same genes. The other important thing to understand is that there are different types of information in these databases. Even though we sometimes think about gene, DNA, RNA and protein all at the same time, there are actually different databases for those different types of things. So Entrez Gene has only gene information; it doesn't store the sequence of the protein. RefSeq protein stores the sequence of the protein, and Entrez Gene links to it. Entrez Gene will have its identifier relating to genes and RefSeq protein will have its identifier relating to proteins, and you need to understand the type of information pointed to by an identifier in order to properly connect it. So the NCBI, U.S.
National Center for Biotechnology Information, part of the National Library of Medicine, has Entrez Gene, PubMed and a lot of other databases. This is something from their website that shows you the way all the databases link to each other. So it's actually very complicated. Each of these little circles is a database, they each have their own identifier, and if you have a gene it could have an identifier in all of these databases. Just for your information, here's a bunch of common identifiers to give you a sense of what these things look like. So Ensembl identifiers look like this long thing here; Entrez Gene IDs are numbers. UniGene usually has the organism followed by a dot and a number. And these are for genes, RNA transcripts and proteins, plus species-specific ones from different organisms. Annotations and domains can have identifiers too. It's not just genes and proteins; SNPs have identifiers. Experimental platforms have their own specific identifiers, so if you're using Affymetrix GeneChips you'll be familiar with these strange identifiers. This list is in your binder to give you a reference. If you're working with a lot of identifiers, you should be able to look at an identifier list and guess where the information is coming from. I've highlighted in red the ones that I personally recommend, because they're the ones that are most often unique and stable. Okay, so there are so many identifiers, and converting between them ends up being a headache. How many people have had problems with this already? So a few people are nodding their heads; they understand this problem. One of the reasons why it's a headache is that there might be different versions of the database. We talked about Ensembl having different versions and the human genome having different builds.
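The habit of looking at an identifier and guessing which database it comes from can be sketched with a few regular expressions. This is only an illustration under simplifying assumptions: the patterns below are rough approximations of common formats, real identifier schemes have more variants than this, and the function name `guess_id_type` is made up for the example.

```python
import re

# Rough, illustrative patterns only; order matters, most specific first.
ID_PATTERNS = [
    ("Ensembl gene", re.compile(r"^ENS[A-Z]*G\d{11}$")),
    ("RefSeq", re.compile(r"^[NX][MPR]_\d+(\.\d+)?$")),
    ("UniGene", re.compile(r"^[A-Z][a-z]{1,2}\.\d+$")),
    ("Affymetrix probe set", re.compile(r"^\d+(_[a-z])?_at$")),
    ("Entrez Gene", re.compile(r"^\d+$")),  # plain numbers are ambiguous
]


def guess_id_type(identifier):
    """Return a best-guess label for where an identifier comes from."""
    ident = identifier.strip()
    for label, pattern in ID_PATTERNS:
        if pattern.match(ident):
            return label
    return "unknown"


for example in ["ENSG00000141510", "7157", "Hs.654481", "1007_s_at", "NM_000546", "TP53"]:
    print(example, "->", guess_id_type(example))
```

A plain number is only a guess at an Entrez Gene ID, and a gene symbol such as TP53 comes back as unknown here, which is a reminder that symbols are harder to recognize reliably than database accessions.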
If you get a gene list that was published in a paper 10 years ago, in the supplementary material, maybe it's using identifiers from a database version from 10 years ago, and those numbers have changed. Ideally they'd be stable, but maybe they found a gene that people later realized wasn't a gene, and so that identifier has disappeared. Or the gene has moved around and the sequence has changed, so they've changed the identifier to make sure that you don't confuse two genes or two proteins that have different sequences. Also, when you go from one database to another, one database might have links that are out of date. Because of those issues, it's sometimes almost impossible to get a nice clean conversion. So I'll explain a little bit more about how to do that, but there are basically four main things you want to do with gene identifiers. One is searching for your favorite gene name. If you type your gene name into a website, you find all the information about that gene; but if the website doesn't recognize the particular ID that you used, you're out of luck. So you should know that a gene might have other names, and you could try those other names, and maybe one of them is recognized and you get all this great information about that gene. So obviously that's one use of understanding this. The other thing is linking to related resources. This is sort of related. If you're on Entrez Gene, you should understand all the different types of identifiers and where they link to, so that you can get the information that you need. Then there's identifier translation, which is what we'll try in the lab. Basically moving from genes to proteins.
A very common thing for people working with gene expression data is that you have your gene expression data in Affymetrix IDs and you want to get information about them, so you want to translate those to gene IDs, like Entrez Gene IDs, so that you can put them into other systems. And you also want to merge. If you're merging data sets from different collaborators, you need to move them into a common identifier system so that you can do this. Luckily there are a number of identifier mapping services that help you do this. The one that we've chosen to use for this workshop is called Synergizer, because it's extremely simple, but it uses information from BioMart and a couple of other sources. There's also a new one called PICR, the Protein Identifier Cross-Reference service from the EBI, which is only for proteins. If you're working with proteins, this one looks like it's pretty good. Basically, in Synergizer you choose your species, you choose your authority, in this case Ensembl, and you say, okay, I have Affymetrix IDs, I want to convert them to Entrez Gene IDs, and you can do that. We'll go through this in the lab. It's particularly important to understand a couple of ID mapping challenges that happen to everybody, so that you can avoid errors. These are a couple of things that might crop up in your day-to-day work, and if you just know that they're possible, you can try to avoid them. One thing that people often do is use a common name for a gene, like the protein name, which might not be standardized. And so it's not unique and stable. These IDs are all relating to p53. Everyone who knows about cancer research uses the term p53, but the actual standard HUGO gene symbol is TP53, and so this is the symbol that you should use in your gene list, not these guys.
If you use these guys, there might be another protein or another gene somewhere that is called one of these things, and so you'll get cross-links. And if you have an ID that you think is one gene and it's really another gene, that could be a pretty big mistake. In fact, there was a Nature paper about five years ago that was retracted for this reason: people looked at HES1, and there was another gene called HES1 that had a different capitalization. They thought they had this big new story, but it was actually the wrong HES1, and their paper was retracted. So if you get to the stage where you're really going to town on a gene, you'd better know that it's the right one, right? And not have this identifier mapping problem. Another thing that people might have seen is with Excel. A lot of biologists use Excel spreadsheet software to work with their genes. If you copy and paste a really big gene list into Excel, Excel by default tries to be smart and recognizes dates. So if you have a gene called Oct4, which is a pretty important transcription factor, it'll change it to October 4th, or worse, it could change it to some number or some weird date format. You might have seen this, and it's incredibly difficult to get Excel not to do it. You can kind of do it by ensuring the format of the columns is text before typing in. By default all of the cells in Excel are General, and so it tries to guess whether something is a number or text. If you really force it to be text, then it usually won't do that, although you really have to check, because sometimes it just goes back to General and you don't realize it. As I said, it's not a problem when you're typing in a few genes, because you can fix it, but if you're pasting in a thousand genes and one at the bottom gets changed and you don't realize it, then you're stuck with that, and you might only find out later.
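One way to catch the Excel problem before or after it happens is to scan a gene list for symbols that look like dates. The sketch below is an assumption-laden illustration, not a complete check: it only looks for month-like symbols followed by a number (at risk of conversion) and for values that already look like a converted date such as 4-Oct, and it does not cover other kinds of auto-conversion. The function name `flag_excel_risks` is hypothetical.

```python
import re

# Symbols such as Oct4, SEPT2, MARCH1 or DEC1 look like dates to a spreadsheet.
AT_RISK = re.compile(
    r"^(?:JAN|FEB|MARCH|MAR|APR|MAY|JUN|JUL|AUG|SEPT|SEP|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)
# Values such as 4-Oct or 1-Sep suggest a symbol has already been converted.
ALREADY_MANGLED = re.compile(
    r"^\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def flag_excel_risks(symbols):
    """Split a gene list into date-like symbols and already-converted values."""
    at_risk = [s for s in symbols if AT_RISK.match(s.strip())]
    mangled = [s for s in symbols if ALREADY_MANGLED.match(s.strip())]
    return {"at_risk": at_risk, "mangled": mangled}


gene_list = ["TP53", "Oct4", "SEPT2", "MARCH1", "4-Oct", "DEC1"]
report = flag_excel_risks(gene_list)
print("May be converted on paste:", report["at_risk"])
print("Looks already converted:", report["mangled"])
```

Running a check like this on the list you paste in, and again on the list you export, is a cheap way to find the one gene at the bottom of a thousand that got changed.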
And the other thing is that if you have a thousand Affymetrix identifiers coming from your gene chip and you want to get information about all thousand, there's often a problem reaching 100% coverage, and this is due to the version issues that I mentioned before. So you can put your thousand Affymetrix probe set IDs into Synergizer and say, give me all the Entrez Gene IDs. Synergizer will often return fewer gene IDs than the number of Affymetrix IDs, because there's sometimes more than one Affymetrix probe set ID for a gene, so two Affymetrix IDs will map to one gene. But Synergizer will tell you if it doesn't recognize the Affymetrix ID, or whatever ID you put in, and if it doesn't recognize it, you can go to another source and try to find out if that ID is recognized there. Maybe the ID is out of date, or it's too new and not yet in Ensembl. So typically what I do is use one of these systems like Synergizer, which gets me 90 or 95% of the way there, and then I manually take the remaining things that weren't mapped and go to the other websites to try to increase my coverage. You might not care that much about increasing coverage, but these recommendations are mentioned here. So, okay, recommendations for doing this, and we can do this in the lab. If you're working with gene lists, I recommend mapping almost everything to Entrez Gene IDs using a spreadsheet. And if you want 100% coverage, manually create the missing mappings. Be careful of these Excel auto-conversions. I forgot to mention that there's actually a paper here.
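The coverage bookkeeping described above (how many input IDs were mapped, which ones were not, and why the number of genes can be smaller than the number of probe sets) can be sketched in a few lines. This is a toy illustration: the mapping table and the probe and gene IDs are invented, and a real table would come from a mapping service such as Synergizer or BioMart.

```python
def map_ids(input_ids, mapping):
    """Map IDs through a lookup table and report coverage.

    Returns (mapped, unmapped, unique_targets, coverage), where coverage is
    the fraction of input IDs that were found in the table.
    """
    mapped = {}
    unmapped = []
    for ident in input_ids:
        target = mapping.get(ident)
        if target is None:
            unmapped.append(ident)  # follow these up by hand in other sources
        else:
            mapped[ident] = target
    # Several probe sets can point at the same gene, so count genes once.
    unique_targets = sorted(set(mapped.values()))
    coverage = len(mapped) / len(input_ids) if input_ids else 0.0
    return mapped, unmapped, unique_targets, coverage


# Invented example: two probe sets share a gene, one probe set is unknown.
toy_mapping = {"probe_A_at": "1001", "probe_B_at": "1001", "probe_C_at": "1002"}
probes = ["probe_A_at", "probe_B_at", "probe_C_at", "probe_D_at"]

mapped, unmapped, genes, coverage = map_ids(probes, toy_mapping)
print("genes:", genes)
print("unmapped:", unmapped)
print("coverage:", coverage)
```

Keeping the unmapped list separate is the useful part: it is exactly the set of IDs to look up manually if you want to push coverage toward 100%.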
I think we gave it as pre-reading. It talks about all the different problems with mistaken identifiers and name errors when using Excel. And the only caveat with using gene IDs for everything is that they don't consider splice forms. So if splice forms are important to you, then you need to use an ID system that differentiates splice forms. Many tools these days, especially the tools that we're going to be talking about, just recognize genes. So even if you have splice variants, there probably isn't specific Gene Ontology annotation about the splice variants, or things like that. Splice variants are very important, but they haven't been annotated to as high a level of detail as genes have. So typically we move to genes and then use those for the downstream analysis. Okay, so genes and their products have many identifiers. If you're working with genomics information, you need to learn how to convert these from one type to another, but there are services available to do this. And if you use these tips, then you can reduce the headaches, yeah? That's one of the big different things that NCBI is not able to serve in. Any questions? So that's it for the intro morning session. Hopefully that was interesting to people who haven't seen some of that stuff before. We now have a break, and what time do we come back? Ten to 11? So we have a 20-minute break, and then after the break we're going to open up our laptops and try out some of these sites. What I'll do for the lab is go through some of the websites that I showed, with specific examples, just to show you how they work, and you can all watch me do that. And then the rest of the lab is basically trying out some of these things.
There's a specific lab assignment that is very simple, but basically the idea is just to get you to try out these websites, and to provide time to answer all your questions, to make sure everybody's on the same page, and to give you some time to get familiar with these tools, okay? Also, the idea of the breaks, as Francis and Michelle mentioned before, is to meet other people in the class, other researchers who might be interested in similar things. Networking is an interesting side effect of these courses that's pretty cool.