This lecture is part of a Canadian Bioinformatics Workshop and is made available under a Creative Commons license. Welcome to Module 1, Introduction to Pathway and Network Analysis of Gene Lists, part of the Pathway and Network Analysis of Omics Data workshop. In Module 1, I'm going to give you a brief introduction to pathway and network analysis and some background information about gene lists and pathways that will be required for the rest of the workshop. The basis of this workshop is helping people interpret gene lists. The general idea is that you perform some kind of experiment that yields thousands of genes, and then you want to know how to interpret them. Typically, any kind of genome-scale analysis or omics, such as genomics, proteomics, RNA-seq, transcriptomics, etc., produces lots of information like this. One of the main ways that people interpret this data is by trying to understand some mechanistic story that pulls these things together. So if we want to know what's interesting about the thousands of genes that resulted from our transcriptomics experiment, we might ask if they are enriched in known pathways, complexes, or functions. This picture represents a transcriptomics experiment: once you have collected the raw data, you might rank or cluster it to generate a gene list. Then, in this course, we want to compare that gene list to prior knowledge about cellular processes using various analysis tools and ideally make some new, interesting discovery. Pathway and network analysis saves time compared to the traditional approach. If you had hundreds or thousands of genes and you wanted to interpret them manually, you'd have to look each one up, understand its function, and understand the relationships among the functions of all the genes on your list to figure out whether there are particular mechanisms in common that might be interesting to highlight. 
And so this approach traditionally is very time-consuming. Pathway and network analysis helps gain mechanistic insight into omics data in a more automated way. For instance, we could identify a master regulator, find drug targets, or characterize pathways that are active in a sample. Any type of analysis that involves pathway or network information is a type of pathway or network analysis. Pathway and network analysis is most commonly applied to help interpret lists of genes, and the most popular type is pathway enrichment analysis, but there are many others that are useful, and we'll be covering pathway enrichment analysis and a number of others as part of this workshop. The benefit of pathway analysis, compared to analyzing information at the transcript or protein level one by one, is that considering the data in the context of pathways usually makes it easier to interpret, because we're working with familiar concepts like the cell cycle. We can identify possible causal mechanisms, which is useful, especially if we're interested in therapeutically targeting those mechanisms. We can use pathway and network analysis to predict new roles for genes whose function we don't know. Pathway and network analysis also helps improve statistical power, which we'll cover more in the workshop, and it might lead to more reproducible results. For instance, if we do multiple experiments on multiple samples, we might get different gene expression profiles or signatures for each sample or condition, but they might all affect the same pathways. So looking at the level of genes, we might not see exact replication of the signatures across conditions, but looking at the level of pathways, we might see replication of pathways across the conditions. Pathway and network analysis also facilitates integration of multiple data types in a multi-omics approach. I've mentioned pathway and network analysis. What are pathways, and how are they different from networks? 
Pathways are usually detailed, high-confidence models of a biological process, usually formed of steps. We might have biochemical reactions as part of the pathway. Usually the models are developed based on many years of study and many publications. The EGF receptor pathway and metabolism are examples. Networks, on the other hand, are sets of relationships of different types between genes. For instance, we could have genes activating each other or binding each other. Networks represent a more simplified view of cellular logic, they're frequently noisier, and they often come from large-scale genome-wide assays, like co-expression relationships from transcriptomics or large-scale protein interaction screens. There are many types of pathway and network analysis. One is enrichment of fixed gene sets, which could answer the question of what biological processes are altered in my sample. De novo subnetwork construction and clustering usually answers questions like what pathways are altered in this sample, and it's not limited to known pathways, so it could also identify new pathways. It can also be used, for instance, to identify clinically relevant tumor subtypes. Pathway-based modeling is more advanced and can help identify whether there are particular molecules that are important or critical in a pathway, given the data that you've collected. The general pathway analysis workflow is summarized here. First we collect some genomics data, for instance gene expression data. This workshop uses gene expression data extensively, although the concepts that we'll cover are applicable to other types of genomics data. Transcriptomics or gene expression data is a very good example type of data to cover for this workshop, because it's popular and frequently highly informative. Once we've collected our genomics data, we normalize and score it. 
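As a rough illustration of the "score it" step, here is a minimal Python sketch that ranks genes by log2 fold change between two conditions. The gene names and expression values are made up for illustration, and the pseudocount of 1 is an assumption; real workflows typically rely on dedicated differential expression tools rather than a bare fold change.

```python
import math

# Toy, made-up normalized expression values (not real data):
# gene -> (mean expression in condition A, mean expression in condition B)
expression = {
    "GENE_A": (10.0, 80.0),
    "GENE_B": (50.0, 50.0),
    "GENE_C": (64.0, 8.0),
}


def log2_fold_change(a, b, pseudocount=1.0):
    """Score a gene as log2((b + pseudocount) / (a + pseudocount))."""
    return math.log2((b + pseudocount) / (a + pseudocount))


def ranked_gene_list(expr):
    """Return (gene, score) pairs, most up-regulated in B first."""
    scores = {gene: log2_fold_change(a, b) for gene, (a, b) in expr.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = ranked_gene_list(expression)
for gene, score in ranked:
    print(f"{gene}\t{score:.2f}")
```

The output is a ranked gene list with a score per gene, which is the kind of input that ranking-based enrichment methods expect.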
For instance, we can compute differential expression between two conditions, and this generates some kind of gene list. Then we want to learn about the underlying mechanism that this gene list might be telling us about, using pathway and network analysis. To break this box out a little bit more, pathway and network analysis usually starts by running analysis and visualization methods that help us identify interesting pathways and networks. Once we've found one that's interesting, we can focus in on it to better understand the molecular mechanism and eventually develop a publication-quality model. The simplified workflow that I just showed you is actually more complicated. We can use that workflow to analyze many different types of data, shown in blue at the top. This middle level shows us how we might process the data. Ultimately, all of these processing steps result in a gene list that we can interpret using pathway analysis or network analysis. And here's the mechanistic drill-down box. We'll be going over these in more detail in the workshop, and in some more detail in the coming slides. So where do gene lists come from? There are many genomics data types, there are many ways to get gene lists from these data types, and the gene lists have different meanings. For instance, molecular profiling, like measuring all the mRNA transcripts or protein levels in a sample, can help us identify molecules that are important for our sample, and this generates a gene list. We might also get some quantification of these, which generates a gene list plus the quantification values. If we have many samples, we can also rank and cluster them using biostatistical methods, and the resulting ranking or clustering can also generate gene lists, like the set of genes associated with a cluster. Another type of omics data is interactomics. 
For instance, we can measure protein interactions, microRNA targets, or transcription factor binding sites, and frequently these interaction-measuring methods provide lists of genes that are targeted by a protein, a microRNA, or a transcription factor. A genetic screen, like a CRISPR screen, can also provide a list of genes, and genome-wide association studies can identify genes that are linked to SNPs associated with a phenotype of interest. Each of these methods generates a gene list, but the gene lists frequently mean different things depending on the method, and you can think of other examples that might be relevant to your own work. So, as I mentioned, gene lists have a meaning, and you have to understand that meaning to know how to interpret them. Frequently a gene list might relate to a biological system, where we then want to understand the protein complexes, pathways, or physical interactions that are part of that system. A particular screen might identify a list of genes that's related to a gene function or a molecular function, like protein kinases. It could identify a list of genes that's related to a similar cell or tissue location, or a chromosomal location. So clearly these lists of genes have very different meanings. Again, we need to understand that meaning to understand how to interpret the gene list. Okay, so the first part of this larger, more detailed workflow is this blue layer here, which covers many different types of omics data that can be converted into gene lists. We talked about gene expression data and protein expression data; those are very similar. DNA methylation is similar, but we might need to score methylation of gene promoters to get a gene list, or link methylation groups to genes. Similarly for microRNAs: if we measure their expression, we have to link them to other genes using known or predicted targets. 
For protein binding to DNA or RNA, for instance with chromatin immunoprecipitation, again we have to identify target genes, because we might identify a region of the genome that is bound by a protein of interest, but that region might not be a gene. It might be near a gene, or it could be quite far from a gene, so additional work is required to identify target genes. Mapping protein interactions directly identifies a set of genes that interact with the bait protein, for instance, and looking at mutations from whole-genome sequencing or exome sequencing identifies a set of variants, which can also be linked to genes through various means. So before analysis, when we have data from any one of these methods, we need to normalize it and do proper background adjustment and quality control, to avoid giving too much noise to the pathway analysis methods. Any kind of statistics that will increase signal and reduce noise is important. We need to consider how many genes result from this. We don't want to generate a single list filled with the entire genome; it needs to be more specific. We also need to make sure that the gene identifiers are compatible with the software that we're using, and we'll talk about that more later. In general, these days many genomics data types are handled by core facilities. Those facilities typically have standard workflows that they apply to the data, and they give you the result, which can then be directly input into pathway and network analyses. Sometimes people run their own analyses in their lab, and frequently established workflows are available for these too. Occasionally there are new genomics data types for which workflows might not be established, and then you need to know something about how those workflows work, or what the latest is in that area, to figure out how to apply them to your data. 
But again, in the general way of working with this data, especially for something like gene expression data, the measurements are typically made by a core facility and you'll just get the result as a gene list. So these types of established methods process the raw data and convert it into a gene list using various strategies. Once you have a gene list, hopefully part of your experimental design was asking what you want to accomplish with it. For instance, you might want to summarize the processes or other aspects of gene function. You might want to compare two different samples or two different types of samples. You might want to identify a controller for a process, like a transcription factor, that you could then overexpress or knock out to test its causal nature. You could find new pathways or new pathway members and discover new aspects of gene function. You can correlate a disease or a phenotype with information from your gene list, or you can find a drug that targets genes in your list. Okay, so the green boxes here relate to pathway and network analysis. Pathways are on the left and networks on the right. We'll cover those in more detail in the workshop, but I just wanted to introduce one very basic type of pathway analysis called pathway enrichment analysis, and then I will tell you about some background information that's important for understanding how these types of analyses work. A standard pathway enrichment analysis considers a list of genes from your experiment, for instance all the genes that are down-regulated in a brain cancer cell line, and that's represented by this blue circle. Then we compare the genes in this blue circle with known pathways. So we might have a list of genes that are associated with a pathway, for example neurotransmitter signaling. This might have 100 genes in it, and we ask the question: what's the overlap of the neurotransmitter genes with my list? 
You calculate the overlap, and then, given the information about all the genes in your list and the size of this overlap, you can compute a p-value using a standard statistical test like a chi-square test or Fisher's exact test. You're asking whether there are more neurotransmitter signaling genes, in this case, in my list than I would expect by chance, given the size of the neurotransmitter signaling pathway and the size of the genome. So if the neurotransmitter signaling pathway is only 1% of the genome and I see that 5% of the genes in my gene list are part of the neurotransmitter signaling pathway, that's more than I expect, and I can calculate a p-value for that. We do this with many pathways: we test neurotransmitter signaling and then hundreds or thousands of other pathways. Once we're done, we need to do some multiple testing correction, because we want to avoid false discoveries that arise just from trying many pathways. In the end we might develop some kind of hypothesis, for example that drug sensitivity in brain cancer is related to reduced neurotransmitter signaling. So pathway enrichment analysis as I've described it requires two things: the gene list that you provide from your experiment, and a set of pathways, which have to come from a database. We compare these two and we find enriched pathways, for instance using tools such as GSEA and g:Profiler. To give you some background information about pathway enrichment analysis, I'm going to tell you about gene identifiers, to understand how gene lists work, and I'll also talk about places where we get pathway information and other gene annotation, like the Gene Ontology and other sources. First I'll talk about the gene list. A gene list is a list of genes, and the important concept to understand is that it is not just a list of names but a list of unique identifiers that allows us to understand unambiguously what genes we're talking about.
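The enrichment test and the multiple testing correction described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by GSEA or g:Profiler: it computes a one-sided Fisher's exact (hypergeometric) p-value with the standard library and applies a Benjamini-Hochberg correction, which is one common choice. The genome size, pathway size, list size, and the extra p-values are made-up numbers chosen to mirror the 1% versus 5% example.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, pathway_size, genome_size):
    """One-sided Fisher's exact test: probability of seeing at least
    `overlap` pathway genes in a random list of `list_size` genes drawn
    from a genome in which `pathway_size` genes belong to the pathway."""
    max_overlap = min(list_size, pathway_size)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(genome_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up example: the pathway is 1% of the genome (200 of 20,000 genes),
# but 5% of the gene list (5 of 100 genes) falls in it.
p = hypergeom_pvalue(overlap=5, list_size=100, pathway_size=200, genome_size=20000)
print(f"enrichment p-value: {p:.4g}")

# Correct this p-value together with made-up p-values from other pathways.
adjusted = benjamini_hochberg([p, 0.2, 0.5, 0.9])
print(adjusted)
```

In a real analysis the correction would be applied across all of the hundreds or thousands of pathways tested, not the four values shown here.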
Identifiers, or IDs, are ideally unique, stable names or numbers that help track database records, for instance a social insurance number or an Entrez Gene ID. Gene and protein information is stored in many different databases, and thus genes have many different identifiers. There are records for genes, DNA, RNA, and protein, so each gene might have corresponding records in a database of DNA sequences or RNA sequences, and each of those records or entries would have a different identifier associated with it. So it's important to recognize the correct record type or entry type. For instance, the Entrez Gene database doesn't store information about sequence. It links to sequence in other databases, where you might have DNA regions, RNA transcripts, or protein sequences, for instance the RefSeq database, which does store sequence. As I mentioned, you need to use some standard identifier if you want to share your genes unambiguously and have other systems, like pathway analysis methods, understand them. You can't just use colloquial names for genes, as they won't be recognized. Some common identifiers are listed here; there are many. The red ones are ones that are frequently encountered. Entrez Gene, which I've bolded here, is one that we recommend because it's quite stable. We also have species-specific information. For human, for instance, there's the HUGO Gene Nomenclature Committee, part of the Human Genome Organisation, and it standardizes gene symbols, which are different from gene names. A gene symbol, again, is an identifier, which means that it's globally unique and it should be unambiguous: there shouldn't be two genes with the same symbol. This is not the case for the regular gene names that you might see in the literature. Similarly, other organisms have their own standard gene naming systems. 
So there are many different identifiers, and sometimes we need to convert one to another. You might have Entrez Gene IDs, which are useful for pathway analysis methods, but you want to see the list as gene symbols, which are easier to read. There are software tools that recognize these identifiers, although not every tool recognizes every identifier, and you may need to map your gene list identifiers to some standard identifiers, or between two different sets, as I've mentioned. So there are different reasons to do this identifier mapping. The main one is converting database numbers, like the integers used for Entrez Gene IDs, to human-readable names, or converting between identifier types when you need to put your gene list into a tool that only recognizes some of them. One example identifier mapping service, which is quite useful, is g:Convert, at this URL. You can input some genes, either official names or identifiers, and you can choose the type of output identifier from a very long list. When you click run, it will say that these genes are also called this name or this identifier in a given database. I selected Ensembl genes as my target database here. So g:Convert is a useful tool for mapping identifiers if you need it. You have to be aware of ambiguous and missing identifier mappings. Here I entered OCT4, DDT, and H3F3A. OCT4 is not recognized. This suggests it may not be an official gene symbol, even though it's a well-known gene name, and actually that's correct: it is not an official gene symbol. So if you use it, it will not be recognized by tools that only recognize official gene symbols. The DDT gene is mapped to two different genes on the genome, and if you look at the description you can see that one is the real, well-known gene that we might think of. 
The other gene is a novel protein that is some kind of read-through of this open reading frame, so it is probably duplicated because of uncertain genome annotation in that region of the genome. Here's another example, H3F3A, which is a type of histone. This one gene symbol corresponds to two different names and two different genes, which are very closely related histones. Sometimes you might see this, and these might be linked because perhaps they have identical protein sequences, or they might have identical gene sequences but slightly different promoters, for instance. So when you're working with identifiers, like long lists of genes, you have to be aware that there are some challenges, and you need to be very careful to avoid errors. You always have to make sure that you're working with correct identifiers and that you map them correctly, because if you make a mistake, then instead of the gene you think you have, you'll actually have another gene in your list, and that will enter into your analyses and make them wrong. I mentioned gene name ambiguity. All of these names are aliases of the official gene symbol TP53. You should never use any of them; they are not good identifiers, some of these terms might point to more than one gene, and you can introduce mistakes into your gene list if you use them. There are also errors that are frequently introduced by spreadsheets like Excel. The OCT4 example that I mentioned, if you have that in your list, is frequently changed to October 4th. You've probably seen this if you've worked with Excel a lot. There are a lot of different gene names that are converted to dates, and the way to avoid this is to paste as text, or to set the format of the cells that you paste into as Text, not General. 
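One way to catch this problem after the fact is to scan a gene list for entries that look like dates. The sketch below is a rough heuristic, assuming that a mangled symbol shows up as a short day-month string such as "4-Oct"; the exact form depends on the spreadsheet's locale and display settings, and the example list is made up.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches day-month ("4-Oct") or month-day ("Mar-01") strings.
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}[-/ ]({MONTHS})|({MONTHS})[-/ ]\d{{1,2}})$",
    re.IGNORECASE,
)


def find_date_like(identifiers):
    """Return the entries that look like spreadsheet-converted dates."""
    return [x for x in identifiers if DATE_LIKE.match(x.strip())]


# Made-up gene list in which two symbols were converted on paste.
gene_list = ["TP53", "4-Oct", "1-Sep", "MARCH1", "OCT4"]
suspicious = find_date_like(gene_list)
print(suspicious)
```

Intact symbols such as OCT4 or MARCH1 are not flagged, because the pattern requires a separator between the day and the month. The original symbol generally cannot be recovered reliably from the date, so any flagged entry has to be traced back to the source data.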
You might also have problems reaching 100% coverage because of some of these challenges. Different databases or database versions might not recognize 100% of the genes in your list, so if you want to reach 100% coverage you could do additional work using multiple sources of gene identifiers, and you might have to do some of that work manually. Gene name errors are widespread in the scientific literature. There's an interesting paper about that, and there's also an interesting paper that tries to quantify the problem of mistaken identifiers and how gene name errors can be introduced when using Excel in bioinformatics. Just to give you a bit of a warning about the importance of this, here's an example that was quite unfortunate. A paper published in Nature in 2003 proposed that HES1 is a target of a given microRNA, and what the authors later found was that the gene termed human homologue of ES1, HES1, is not the same as the transcriptional repressor Hairy and Enhancer of Split, which is also called HES1. So when they did all of their searching for HES1, they didn't realize that there are two genes both called HES1. They made a mistake, and it resulted in completely wrong information in their paper and a retraction. So the general recommendation for working with gene lists for proteins and genes is to map everything to Entrez Gene IDs or official gene symbols, using a spreadsheet to make sure it's all working correctly. If you want to reach 100% coverage, you might need to manually curate missing mappings using multiple resources. And be very careful of Excel auto-conversions, especially when you're pasting a large gene list, because you don't see all of the conversions that are happening off screen. So it's important to remember to format cells as text before pasting, or to paste as text. 
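A simple way to keep track of these issues is to report, for every mapping run, which identifiers mapped cleanly, which are missing, and which are ambiguous. The sketch below assumes you already have a mapping table from some service; the table contents and the target IDs ("ID_0001" and so on) are placeholders invented for illustration, not real database identifiers.

```python
def map_identifiers(symbols, mapping):
    """Split symbols into uniquely mapped, missing, and ambiguous.

    `mapping` is a dict from input symbol to a list of target IDs,
    as might be built from an identifier mapping service's output.
    """
    mapped, missing, ambiguous = {}, [], {}
    for symbol in symbols:
        targets = mapping.get(symbol, [])
        if not targets:
            missing.append(symbol)
        elif len(targets) > 1:
            ambiguous[symbol] = targets
        else:
            mapped[symbol] = targets[0]
    return mapped, missing, ambiguous


# Hypothetical mapping table with placeholder target IDs.
toy_mapping = {
    "TP53": ["ID_0001"],
    "DDT": ["ID_0002", "ID_0003"],  # one symbol, two target genes
}

symbols = ["TP53", "DDT", "OCT4"]
mapped, missing, ambiguous = map_identifiers(symbols, toy_mapping)
coverage = len(mapped) / len(symbols)

print("mapped:", mapped)
print("missing:", missing)
print("ambiguous:", ambiguous)
print(f"coverage: {coverage:.0%}")
```

The missing and ambiguous entries are the ones that would need manual curation to approach 100% coverage.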
This assumes that you're working with proteins and genes, and it recommends mapping everything to genes. The reason for this is that, in the current era of genomics, all of the pathway analysis methods that we will cover in this course, and pretty much all of them that exist, don't consider splice forms, so they don't differentiate between two different forms of a gene. At the protein level they're obviously different, and they could have totally different functions. At the gene level we don't know that, because we just refer to them as a given gene. Unfortunately, most tools currently just work at the gene level, so it's important to understand that protein-specific or isoform-specific information is not captured by the majority of tools. So what have we learned in this section? Genes, their products, and their attributes have many identifiers. Genomics often requires some understanding of these identifiers and potentially conversion from one type to another. There are useful identifier mapping services available, and whenever you can, use standard, commonly used identifiers to reduce ID mapping challenges in your workflow. So now, coming back to pathway enrichment analysis and its requirements: we talked about gene lists, and now we're going to talk about pathways. Pathways and other gene function attributes are used in pathway enrichment analysis. There is lots of information available in databases. A good source of pathway information is Gene Ontology biological process terms, and also pathway databases like Reactome, which we'll learn about in the workshop. But there are many other types of annotations associated with genes, for instance molecular function, cellular location, chromosome position, disease association, transcription factors that bind a set of genes, and many others. These all represent ways of annotating genes, and you can represent many of them as sets of genes, like the set of genes with a given molecular function, such as protein kinase. 
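To make the "sets of genes" idea concrete, here is a minimal Python sketch that turns gene-to-annotation pairs into one gene set per annotation. The gene names and the annotation pairs are made up for illustration; in practice they would come from an annotation database.

```python
from collections import defaultdict

# Made-up (gene, annotation) pairs for illustration only.
annotations = [
    ("GENE_A", "protein kinase"),
    ("GENE_B", "protein kinase"),
    ("GENE_B", "membrane"),
    ("GENE_C", "membrane"),
]


def build_gene_sets(pairs):
    """Group genes by annotation: annotation -> set of genes."""
    gene_sets = defaultdict(set)
    for gene, annotation in pairs:
        gene_sets[annotation].add(gene)
    return dict(gene_sets)


gene_sets = build_gene_sets(annotations)
for annotation, genes in sorted(gene_sets.items()):
    print(annotation, sorted(genes))
```

Once any annotation is expressed this way, it can be tested against a gene list with the same overlap-based enrichment statistics that are used for pathways.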
In this workshop we're going to be mostly focused on pathways, but there is a lot of additional information about genes that we can use in the same way as pathways in pathway enrichment analysis methods. Okay, so talking about pathways, first I'm going to talk about the Gene Ontology. The Gene Ontology, or GO, is a set of biological phrases, or terms, which are applied to genes, like protein kinase, apoptosis, or membrane. The Gene Ontology is also a dictionary, so each term has a definition associated with it, and it's actually quite useful for teaching, because tens of thousands of biological terms have nice definitions in this ontology. I've mentioned the word ontology. An ontology is a formal system for describing knowledge, and ontologies usually have a couple of different features. One feature is that terms are frequently related to each other, which I'll explain in a bit. You also have a standard way of naming terms. More information about Gene Ontology is available at the project homepage. Gene Ontology is structured as a hierarchy. The terms are these boxes here, and the relationships between terms vary, but two examples are is-a and part-of relationships. So we might see, for instance, that B cell apoptosis is a type of apoptosis, which is a type of programmed cell death, which is a type of cell death, which is a type of physiological process, and so on, all the way up to a very general part of the hierarchy. B cell apoptosis is also part of B cell homeostasis, which is a type of immune cell homeostasis. So you get the idea of how these different relationships have different meanings, and the advantage of organizing all of these terms in this hierarchy is that it describes multiple levels of detail about gene function. Terms can have more than one parent or child, and that sometimes complicates Gene Ontology analysis, but it's just important to understand that. 
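The hierarchy just described can be sketched as a small graph in Python. This is an illustration only: it encodes just the handful of terms and is-a and part-of links mentioned above, by name rather than by GO accession, and it is not a copy of the real ontology. Walking upward from a term collects every more general term it is connected to, including through multiple parents.

```python
# Child term -> list of (relationship, parent term), using only the
# example terms mentioned above. This is not the full ontology.
parents = {
    "B cell apoptosis": [
        ("is_a", "apoptosis"),
        ("part_of", "B cell homeostasis"),
    ],
    "apoptosis": [("is_a", "programmed cell death")],
    "programmed cell death": [("is_a", "cell death")],
    "B cell homeostasis": [("is_a", "immune cell homeostasis")],
}


def ancestors(term, parent_links):
    """Return all terms reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parent_links.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


result = ancestors("B cell apoptosis", parents)
print(sorted(result))
```

Because a term can have more than one parent, the traversal tracks visited terms so that shared ancestors are only counted once.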
Gene Ontology covers three different aspects of gene function. The cellular component branch, or aspect, of the ontology covers where genes are located in the cell. The molecular function aspect covers molecular functions, like enzyme functions. The biological process aspect covers pathways, like cell division, or more general processes. Gene Ontology is actually composed of two parts. The first part is the terms, which I talked about just now. GO terms are added manually by editors at various annotation groups worldwide. They're added by request, and experts help with major redevelopment. There are many Gene Ontology terms: over 44,000 currently in the ontology. This grows over time, so the ontology is evolving, although over the past few years it has remained relatively stable, since a lot of the known concepts in biology are already covered. The second part of Gene Ontology is the annotations. This is a particularly interesting part of Gene Ontology that we use for pathway enrichment analysis, because it links terms to genes. Annotations take a term and connect it to a gene using an association. Usually trained curators of genome databases do this. These links between genes and terms in the ontology are known as gene associations or GO annotations. Multiple annotations are possible per gene, and frequently genes have many different terms associated with them. Some Gene Ontology annotations are created automatically without human review, and these tend to be of lesser quality, so it's important to understand that there are two types of annotation, which I'll explain more in a bit. Because of this hierarchy, and because multiple terms can be associated with a gene, associating a gene with a term automatically associates that gene with all of the other terms that are parents of that term. That's one way of generating a lot of additional terms that are associated with a gene. As I mentioned, there
are two major sources of annotation. The first is manual annotations, which are curated by scientists. These are very high quality, although they're time-consuming to create, so there are relatively fewer of them. Another type of manual annotation comes from computational analysis methods that generate a lot of annotation automatically, but where the results are then reviewed by people and filtered to keep just the ones that make sense. The second source is electronic annotation, which is generally annotation derived without human validation, frequently by computational prediction. The accuracy here varies. Some computational methods are extremely accurate: methods that predict transmembrane regions in protein sequences are more than 95% accurate, frequently reaching 99%. But others are lower quality, so in general electronic annotation is treated as lower quality than the manual evidence sources. A key point is to be aware of the annotation origin. We can determine the annotation origin because each annotation, each link between a term and a gene, is associated with an evidence code. There are lots of different evidence codes, like the ones listed here. All of the ones in red are manual, meaning human curated or reviewed, and the one that's not is called IEA, or inferred from electronic annotation. All major eukaryotic model organism species and human are covered by Gene Ontology annotations. Many bacterial and parasite species are covered, and new species annotations are always in development. There's a current list of official GO annotations that can be downloaded from the Gene Ontology website, and if you are working with a genome that does not have standard Gene Ontology annotations available, you can always map Gene Ontology terms from a closely related species to the new one. Let's spend a bit more time on the relationship of Gene Ontology to these different species. One point to be aware of is that, while Gene Ontology annotations are available for many species, not all species are covered at
the same level. Some species have many more annotations than others. For instance, human has the most annotations. Orange here is non-experimental and blue is experimental, so there are over 200,000 terms associated with genes that come from primary experimental sources. If we look at mouse, there's quite a lot, but chicken doesn't have that many, and it's mostly non-experimental, which might be from computational sources. You can see how some species have far fewer annotations. They might also have smaller genomes, but you can see the variability, both in the split between experimental and non-experimental annotations and in the number of annotations. There are many contributing databases. Just to list a few, most model organism databases, major genome databases, and protein databases like UniProt contribute a lot to the Gene Ontology project. Gene Ontology has too many terms for some uses. People don't do this very much anymore, but imagine you wanted to create a pie chart of gene functions to summarize your genes at a very broad level. You might only want to show a few high-level Gene Ontology terms, and so GO Slim was developed as an official reduced set of GO terms, available in generic form or for plants or yeast. You can map your GO terms to Slim terms if you want a very compressed, small set of generic terms. There are many software tools available that allow you to browse the Gene Ontology, browse gene associations, and use Gene Ontology in different types of analyses. The website that I recommend, if you're interested in browsing the Gene Ontology and its annotations, is called QuickGO, from the European Bioinformatics Institute. It's a very nice website and it gives you lots of information about Gene Ontology. Just so you're aware, Gene Ontology is one of many types of ontologies. Anyone can create an ontology themselves by standardizing terms and their relationships, and I just wanted to give you a sense that there's more than one
type of ontology. Usually in the process we only come across the Gene Ontology, but occasionally we come across other ontologies, like a tissue type ontology or cell type ontology. Okay, so that covers Gene Ontology. I next want to briefly cover pathway databases. Pathway databases store information about biological pathways in more detail than Gene Ontology. There are hundreds of pathway databases that exist. There are also databases that are derived from pathways, and large meta-databases like pathwaycommons.org, which collects major databases and unifies them into one single portal. We'll go over pathway databases in more detail and cover databases like Reactome in the workshop. For pathways, you can derive pathway information from Gene Ontology biological processes and from pathway databases like Reactome that I mentioned, and I mentioned Pathguide as a way for you to go and browse pathway databases that exist. There are many other types of annotations that I mentioned, and you can usually get these from genome browsers. One that I like is Ensembl. Ensembl is a genome browser that pulls information from many different sources. Another one that's great is Entrez Gene, from the US NIH National Center for Biotechnology Information. If you're working with a model organism, the model organism databases usually are the best sources of gene attributes for those genomes, and there are many others that we could discuss during lab time in the workshop. Just as an example of how to use Ensembl BioMart: Ensembl is a genome database that stores lots of information about genes, and BioMart is a query tool that Ensembl makes available to allow convenient access to gene list annotation. First you select your genome, then you select some filters and select attributes to download, and then you can download that information. I'm going to give you a demo of Ensembl BioMart because I find that it's not very intuitive the first time you visit it, but just a few pointers can get you on the right track. So I can access Ensembl BioMart
from the Ensembl homepage. At the top here, I'm going to click on BioMart. Sometimes BioMart takes a while, and you may have to wait half a minute or so for it to do its job. The first thing we do is choose a database. Because I'm interested in genes, I'm going to choose the Ensembl Genes database, version 100. Then I'm going to choose a dataset. I'm interested in human, so I'm going to choose human genes. Once I choose that, a set of filters loads here, and I have to wait for it to finish loading. I can click on filters and I can ask for all the genes that are in a given region, like all the genes on chromosome 1, if I want to click that. If I want to know that my query is recognized and properly created, I can click the count button here, and it will count up the genes that match my query. In this case there are 5,475 genes on chromosome 1. I'm going to uncheck that and ask for a specific set of genes, so I'm going to limit my genes to ones with particular gene symbols, a couple of which I'm going to add here. Again, just to check that all of my genes are recognized: I've entered 2 genes, so I'm going to click count, and the count should be 2 out of all the genes that are in the database, and that is correct. So now I've verified that BioMart recognizes the input that I've added in the filter section. I can then click on the attributes section. There are lots of different attributes I can download. I can download features of genes, like different types of identifiers. I can download information about gene structures, like the exon start and end positions. I can download sequence information, like the peptide region or the 5' UTR region. I'm going to go back to features here and select the gene stable identifier to download, but I'm not going to download these other things. I'm going to scroll down to external identifiers and ask for the GO term name and the evidence code, and also the gene symbol. When I'm finished selecting
features, I can click results, and the results will show me an example result. You can use this to make sure that the information you've requested is there in the right columns. Here I'm getting a lot of different Gene Ontology terms for TP53. Then I can download everything to a file in different formats. I can choose unique results, because sometimes these results are duplicated, and when I'm finished I press go to download all the information, and then I can load it up in a spreadsheet and work with it. So let's get back to our presentation. That's Ensembl BioMart. So what have we learned? Pathway and other gene attribute databases exist, and there are many different types. We can get pathway information from Gene Ontology and pathway databases. Gene Ontology is a classification system and dictionary for biological concepts. There are many types of annotations contributed by different groups. We can allow more than one annotation term per gene. Some genomes are annotated better than others. Annotation comes from manual and electronic sources, and Gene Ontology can be simplified for certain uses using GO Slim. And many other gene attributes are available from genome databases like Ensembl. So, coming back to the pathway analysis workflow, just to remind everyone: in this background lecture we've been talking about the identifiers that we're using at these steps here. Where we collect genomics data, the results will use some kind of gene identifier to label genes. We'll probably keep those during the normalization and scoring approach, and we'll probably keep those when creating a gene list. Then we might clean our gene list to make sure that all the gene identifiers are correct, in preparation for analyzing them with a more detailed workflow. So that's it. For the lab for this lecture, this lab is made available if you want to practice working with gene identifiers and attributes and learning about g:Profiler and BioMart. You can use the Integrated
Assignment Gene List or your own gene list to play with the g:Convert system and g:Profiler, or use Ensembl BioMart. I highly recommend familiarizing yourself with those tools because they're very useful. I use them regularly, a few times a week. That is the end of our lecture. The main point of this lecture, again, is to bring everyone up to the same speed on some very basic concepts that everyone will need for the main workshop. Thanks for your attention.
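
For anyone practicing with BioMart in the lab, the demo steps from the lecture (choose the human genes dataset, filter by gene symbol, select the gene stable identifier, GO term name, evidence code, and gene symbol, and request unique results) can also be written down as a BioMart XML query. The following is only a minimal sketch under stated assumptions, not the method used in the lecture. The service URL, the dataset name `hsapiens_gene_ensembl`, and the internal filter and attribute names (`external_gene_name`, `ensembl_gene_id`, `name_1006`, `go_linkage_type`) are recalled from the public Ensembl BioMart and may differ between releases, so they should be checked against the XML that the BioMart web page itself displays for your query. The second gene symbol, `BRCA1`, is an arbitrary placeholder, since the lecture only names TP53. The helper function names are hypothetical. The code only builds the query and the request URL. It does not contact the server.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

# Assumed address of the public Ensembl BioMart service (verify before use).
MART_URL = "https://www.ensembl.org/biomart/martservice"


def build_biomart_query(dataset, filters, attributes, unique_rows=True):
    """Build a BioMart XML query string (hypothetical helper for illustration).

    dataset     -- BioMart dataset name, e.g. the human genes dataset
    filters     -- dict mapping a filter name to a list of values
    attributes  -- list of attribute (column) names to download
    unique_rows -- ask the server to drop duplicated rows
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated output, easy to open in a spreadsheet
        "header": "1",        # include a header row
        "uniqueRows": "1" if unique_rows else "0",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # Multiple filter values are passed as one comma-separated string.
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def build_request_url(xml_query):
    """Encode the XML query as the 'query' parameter of a GET request URL."""
    return MART_URL + "?" + urlencode({"query": xml_query})


# Names below are assumptions about BioMart's internal identifiers.
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"external_gene_name": ["TP53", "BRCA1"]},  # BRCA1 is a placeholder
    attributes=[
        "ensembl_gene_id",     # gene stable identifier
        "external_gene_name",  # gene symbol
        "name_1006",           # GO term name
        "go_linkage_type",     # GO term evidence code
    ],
)
request_url = build_request_url(query_xml)
print(query_xml)
```

Opening `request_url` in a browser, or fetching it with any HTTP client, should return the same kind of table shown in the demo, assuming the names above are still valid. Keeping the evidence code column makes it possible to separate IEA annotations from the manually curated or reviewed ones afterwards.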