Okay, so we're still talking about databases a little bit generically, and then I'll talk a bit about a few cancer-specific ones. So UniProtKB used to be PIR, Swiss-Prot, and so forth, all merged together. It's the set of European protein sequence databases, which is the de facto standard for protein data exchanged worldwide. They're actually U.S.-funded, mostly, not entirely, and they include what used to be referred to as Swiss-Prot; TrEMBL, which was a translation of EMBL; and PIR, the Protein Information Resource, which was the first bioinformatics database of protein sequences and dates back to the 60s.

The importance of protein databases in general cannot be overstated, in the sense that that's where the business end of things is. Proteins are the things that do everything that happens in the cell, and all the analysis we do at the genomic level is really in light of the impact it has on gene expression, on proteins being expressed, on gene regulation and so forth. So although with next-gen sequencing we're looking at DNA and RNA sequences, we're really trying to see what's happening at the protein end.

The protein database level is where we'll have a lot of annotations about function and what these proteins do. There's a lot of projection from similar proteins. A lot of people assume that similar proteins have the same function. That is often true, but not always, so you have to take it with a grain of salt. And what's happening more and more, which is really good, is that there are evidence codes associated with annotations, so we now know what the evidence is for a particular declaration. If we think this enzyme is such and such, the record will state that we think so because it's similar to another one where it's been proven biochemically that it does this.
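As a rough illustration of why evidence codes matter, here is a minimal Python sketch that separates annotations backed by experiment from ones projected by similarity. The two ECO identifiers are the ones UniProtKB uses for these two cases; the `Annotation` class, the protein names, and the statements are made up for the example and do not come from any real record.

```python
from dataclasses import dataclass

# Two Evidence and Conclusion Ontology (ECO) codes as they appear in UniProtKB.
EXPERIMENTAL = "ECO:0000269"   # experimental evidence, manually asserted
BY_SIMILARITY = "ECO:0000250"  # inferred from sequence similarity


@dataclass
class Annotation:
    protein: str    # hypothetical identifier, not a real accession
    statement: str  # what the record claims about the protein
    evidence: str   # ECO code attached to the claim


def experimentally_supported(annotations):
    """Keep only the claims that were shown at the bench."""
    return [a for a in annotations if a.evidence == EXPERIMENTAL]


# Toy records: same claim, different strength of evidence.
annotations = [
    Annotation("protein_A", "kinase activity", EXPERIMENTAL),
    Annotation("protein_B", "kinase activity", BY_SIMILARITY),
]

for a in experimentally_supported(annotations):
    print(a.protein, a.statement)
```

The point of the filter is only that the two claims look identical until you read the evidence field; whether to trust the similarity-based one is a judgment call.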
And so whether it's through similarity or identity and so forth, that's captured through the evidence codes. The thing you'll find in a UniProt record is that there are a lot of features, and some of it is computed, some of it is real. So in this case... is it just me, or is it out of focus? Yeah. Okay, well, I can't do anything about that, and I'm not going to start fussing around. I think it's okay. Anyways.

Swiss-Prot actually still exists and is still referred to, but it's part of UniProtKB. The claim to fame of Swiss-Prot is that it was better curated, with more carefully annotated records.

With respect to the human genome, and cancer genomics in general, it's really important to know where we are. With respect to where genes are, where proteins are encoded, where transcripts are, there's one reference genome. There are multiple browsers that you're probably familiar with: Ensembl, UCSC, and the NCBI genome browser. The one at UCSC is probably the most used, Ensembl second, and the NCBI genome browser third. The one thing that, fortunately, they all share is a coordinate system. The coordinate system of the nucleotide sequences is exactly the same between these three if they're referring to the same version. They have different version numbers, but there's a one-to-one match between the version numbers. So GRCh37, which is the Genome Reference Consortium coordinate system for human, is hg19, and build 36 is hg18, and so forth. It used to be that the three browsers had different coordinate systems, and that was a big headache. Now they all have the same coordinate system, so that's very, very good. But so what's the difference between them, besides the user interface? Like what? Transcript data.
So basically each one decides on its own, independently of the others, what they're going to decorate their genome with: what they're going to annotate, which gene set they're going to put on the genome, which alternative transcripts they're going to put on the genome. So you may look at one browser and find a transcript there, then look at the same genome in another browser at the same position and not see anything. That's kind of rare. What's more common is that in one you'll see three alternative splice variants and in the other you'll see five or ten. Sometimes you'll see the same thing, which is good, but often you'll see differences. UCSC and Ensembl will both incorporate a lot more external data sets, like ChIP-seq data sets, and computed things, like similarity between different organisms. UCSC is famous for its multi-organism tracks for comparing against similar organisms.

What's important to keep in mind is where you are in the genome and which version you're on. So look at the dates: hg18 was March 2006, hg19 is February 2009. When is hg20 going to be? What do you think? It says right there, yes, in the yellow box. So this summer probably, or sometime later this year, we'll have hg20. A lot of people are freaking out over this, because a lot of tools are dependent on a version. Every one of the labs you're going to be doing this week is based on either 18 or 19. And 18 is, you know, six, seven years old, and it's still being used, still being referred to in papers and so forth. At the ICGC, we make it a rule that everybody has to submit their data on hg19 coordinates. That's a requirement. But TCGA, which was started before ICGC, has some hg18 data.
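The version-naming point can be made concrete with a small Python sketch. It covers only the two assemblies named in the lecture (hg18 = NCBI build 36, hg19 = GRCh37); the function names are made up for illustration, and a real pipeline would need a fuller alias table.

```python
# Map the names people actually write down to one canonical assembly label.
ASSEMBLY_ALIASES = {
    "hg18": "NCBI36",
    "ncbi36": "NCBI36",
    "hg19": "GRCh37",
    "grch37": "GRCh37",
}


def canonical_assembly(name):
    """Return the canonical label for an assembly name, or fail loudly."""
    key = name.strip().lower()
    if key not in ASSEMBLY_ALIASES:
        raise ValueError(f"unknown assembly: {name!r}")
    return ASSEMBLY_ALIASES[key]


def same_coordinates(assembly_a, assembly_b):
    """True only if two data sets are on the same coordinate system."""
    return canonical_assembly(assembly_a) == canonical_assembly(assembly_b)


print(same_coordinates("hg19", "GRCh37"))  # same build under two names
print(same_coordinates("hg18", "hg19"))    # different builds: do not compare positions
```

Failing on an unknown name is deliberate here: a silently guessed reference is exactly the kind of mistake that recording the assembly is meant to prevent.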
So there are tools to copy things over from one version to the other, but it's not perfect and there are challenges. It's really, really critical that when you record your experiment, as I mentioned earlier about knowing your reagents and the database you're referring to, you know which reference system you're using. If you're using 19, which I assume you will be most of the time, you need to indicate that.

If you go to the UCSC Genome Browser, they also have the history of the various versions. You'll see that before 2001 they each had their own assembly, and in December 2001 they started using NCBI builds. NCBI did a build that everybody used, so they had the same coordinate system from that point on, which is a really critical and essential thing. Now it's not called build 37 anymore, but it's basically the same architecture, run by the Genome Reference Consortium.

The one thing, though, and you see a hint of it here: if you go to this page, you'll see that there are patch levels. GRCh37 is at patch 12 now, which means there have been 12 updates to the genome since 2009. These updates are made so that they don't change the coordinates outside the updated region. They're bounded updates: they change things within the region, so they don't affect downstream or upstream genes. Things don't move ten nucleotides over after one of these updates. Within a region that's been updated, things will have moved around. But the things that get updated are mostly regions which were hard to sequence and hard to assemble: the HLA locus on chromosome 6, the histone cluster, the olfactory receptor genes and so forth. All these are the things that have been updated over the years.
But that said, everybody basically uses patch zero of the release, not the first update but the original file, because that's the only one where you don't have to deal with alternative variants at a given locus. For example the ABO locus, the blood type locus, at some patch level, I forget which, has multiple alleles represented. So there are variants, and very few software packages are able to deal with multiple alleles at a given locus. They deal with FASTA files. They're very happy to deal with a FASTA file, which is just a string of letters where the next letter is plus one, plus one, plus one, as opposed to dealing with graphs, where there's a bifurcation and you go one way or the other, so you have two versions of that locus, and then the paths come back together again. That's a complexity that some software deals with, but most doesn't. Some assembly tools deal with those kinds of graphs, de Bruijn graphs and so forth, but most algorithms are not able to, so they use not patch eight or nine but what we call patch zero.

A new way of organizing things, which has not fully gained popularity yet but I think is really starting to, is BioProjects and BioSamples, which is basically a way of organizing groups of information together. For example, for a tumor project, or for the ICGC or the TCGA, which are these large initiatives, there'll be an umbrella project. Take the pancreatic cancer genome project at OICR that John referred to today: there'll be genome sequencing, transcriptome, epigenome sequencing, different data sets, different samples and so forth, all together under one umbrella.
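To see why tools like the linear FASTA representation mentioned above, here is a minimal Python sketch: once the file is read, a genomic coordinate is just an offset into a string. The sequence and the name `chrToy` are made up, and the reader is deliberately bare-bones; it is not a substitute for an indexed FASTA library.

```python
import io


def read_fasta(handle):
    """Read a FASTA stream into a dict of {sequence name: sequence string}."""
    sequences = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record and opens a new one.
            if name is not None:
                sequences[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def fetch(sequences, chrom, start, end):
    """Return bases start..end, using 1-based inclusive coordinates."""
    return sequences[chrom][start - 1:end]


# Toy input standing in for a reference file (made-up sequence).
toy = io.StringIO(">chrToy made-up sequence\nACGTAC\nGTTTGA\n")
genome = read_fasta(toy)
print(fetch(genome, "chrToy", 5, 8))
```

An alternate allele at a locus has no natural place in this structure, since every position has exactly one next position; that is the simplicity most tools rely on when they stick to the unpatched release.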
So BioProject is at the NCBI, but EBI is also starting to use it, and it's useful because sometimes these data sets are in different places. Some of it could be at the EBI and some at NCBI, and this BioProject umbrella, which like I mentioned is just starting to get used, is going to be able to handle that. So keep an eye out for it. Basically, an umbrella BioProject will have an accession number, and under it you'll have all the pieces that belong to that project.

So we want to get some insights about cancer biology. Whether we're geneticists or molecular biologists, there are all sorts of questions we want to ask, and that we'll be asking this week. We'll be spending some time with clinical data, with pathway information, with GO, with transcriptome and microarray data and so forth. So it's really important to keep track of database identifiers, what links to what, and how things are related to each other.

We talked a bit earlier about ESTs, human expressed sequence tags, which were really at the beginning of the Human Genome Project. Then we had genome mapping and sequencing, then more population analysis. GWAS, genome-wide association studies, are still quite popular, but less so. They were fully open data sets until the infamous Homer paper came out, which rattled the cage and said: although GWAS data is only SNPs across a genome, I can use that information to re-identify people. After the Homer paper, NIH closed the door on GWAS data being fully open, and it became controlled-access data, the same way as whole genomes. So it was kind of a sad day, I think, the Homer paper.
But it actually happened again recently with the 1000 Genomes data, which was fully consented to be fully open: some people have managed to re-identify participants. The bad, naughty thing they're not supposed to do is try to re-identify people, but they were able to come quite close with several individuals. The reason they did it was just to show how cool and smart they were, but they're also causing concerns within the bioethics community.

So the Cancer Genome Atlas pilot was a small project initially, to do at least 20 or 50 different tumors from two or three tumor types. It was started about 2006, 2007, and it was really the first whole-genome, exome, transcriptome, large-scale analysis of tumors. It was a pilot in the sense of: is this feasible? Can we do it? What mistakes are we going to make, and how can we correct them? They did make some mistakes and they did learn from them, and the ICGC learned from their mistakes and didn't repeat them. So that was very useful from that point of view.

The other thing that happened, and this is really beyond the scope of this course but it's an interesting tidbit, is that the TCGA was originally separate from the ICGC. It was politics within the NIH, between the NCI and NHGRI, and we weren't talking to each other nicely. Now that's all changed and we're all friends. TCGA is now part of ICGC, as John showed this morning on the map: all the U.S. projects are TCGA projects, and they're all part of the ICGC consortium. The ICGC is truly international, I think every continent except Antarctica is represented, and it's a large-scale project where we'll be sequencing 25,000 cancer genomes.
I'd say more than a third, but less than half, of all ICGC data will come from the TCGA project.

Then there's the 1000 Genomes Project, which is also very interesting, and on which a lot of variant-calling software has been developed. The one big difference compared to these other projects is that the 1000 Genomes Project does not have any phenotype data. That's the big thing the TCGA and ICGC bring to the table: clinical information about the samples, whether or not the patients are alive, how long they lived and so forth. That's really a key contribution, a new space that these data sets are bringing in. So I showed this... Yes? Healthy? Yeah, so they're normal, at the time they were harvested. If that changed, if they became sick, we wouldn't know about it, because there's no reporting back of that kind of data. It's not a bad data set. It's actually a very nice data set, because it's also open, and that's a big plus. I think we know the sex of the person, but you can figure that out yourself, and we know the ethnic background, which you can also figure out yourself, and I think we know the parameters for whatever they define as normal, so they were within those bounds at the time the DNA was collected.

So we talked about the various controlled and open data sets. This is the portal. I'm not going to spend much time on it, except to say that we at the OICR are doing a major rewrite, and come September you will not recognize this website. We're changing the technology, the way it's built, the user interface, everything about it. So I won't spend too much time on it, except that in the circled area there it says 'download data'. That part will still be there, and actually
that data is going to be what's used to build the new website as well. So it's going to be the same data, just a different user experience in the way it's organized. One of the main concerns we had about this site, and why we've been working on a redesign for the last year, is that it wasn't going to scale up to the amount of data we're expecting in the years to come. We're at about 5,000 genomes now and we're expecting 25,000, and it wasn't going to scale up across multiple sites. So it's a new architecture and new everything in the back end. You can go look at the current site; it's pretty intuitive. For those of you who know BioMart, it's a BioMart back end right now, so it's like all the other marts.

But if you click 'download data', it brings you to the FTP directory, and right there you see 12 versions of the data sets. Each version is cumulative: it has what was in the previous version plus the new data sets that have been added. From this page you go either to 'current' or to the latest one, which is version 12. That takes you to this page, which lists all the projects that have submitted data so far. If you click on any one of those projects, you see the files that are available for it. This is the OICR pancreatic cancer data set. If you click on, let's say, simple somatic mutation, SSM, you get a table that looks like this, with assembly version, chromosome number, chromosome start and so forth. If you go to the next slide, it has 49 columns, and the first column is the cancer type and so forth. These are the numbered columns that you have in that table. There's actually more data represented here than on the website:
if you go to the website itself and use the mart, there are some columns which are not present. And this is the raw data, which is very much not raw. It's very much processed data: interpreted, processed data showing where we think the somatic mutations are. This is all open data, right? Somatic mutations are open. They're not germline mutations, so there are no germline mutations in here. It's a tab-delimited table. It has the mutation, it has the genes, and so on and so forth. We're going to work with that a little bit later today, but that gives you an idea. That's what's available at the ICGC for every tumor type.

This is the same piece I mentioned earlier about the importance of reporting errors. I just removed 'GenBank' from the slide, because it holds for any database: if you discover an error, report it to the database so it can be rectified.

Another very well curated database is COSMIC, the Catalogue of Somatic Mutations in Cancer. The thing you have to be very much aware of with COSMIC: it's been around for, I think, about eight or nine years, maybe almost ten. Initially, a team of about four or five curators would read cancer papers that had mutation data and tabulate it in this database. In the days before whole-genome analysis, what people did was go and sequence BRAF in a thousand individuals: just that one gene, ignoring the rest of the genome. That kind of data is in COSMIC. In COSMIC you'll also have whole-genome analyses, where every gene was sequenced, and all the mutations from those genes were tabulated and put in as well.
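Going back to the tab-delimited somatic mutation table described above, here is a minimal Python sketch of the kind of file manipulation it invites: counting mutations per gene. The column names (`assembly_version`, `chromosome`, `chromosome_start`, `gene`) and all the rows are hypothetical stand-ins; the real file has 49 columns whose exact headers should be read from the file itself.

```python
import csv
import io
from collections import Counter

# Made-up rows in the same spirit as the real table: one mutation per line,
# tab-separated, with a header row. None of these values are real data.
SAMPLE = (
    "assembly_version\tchromosome\tchromosome_start\tgene\n"
    "GRCh37\t12\t100\tGENE_A\n"
    "GRCh37\t12\t250\tGENE_A\n"
    "GRCh37\t7\t900\tGENE_B\n"
)


def count_by_gene(handle, gene_column="gene"):
    """Count mutation rows per gene in a tab-delimited table with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[gene_column] for row in reader)


counts = count_by_gene(io.StringIO(SAMPLE))
for gene, n in counts.most_common():
    print(gene, n)
```

With a real download you would pass an open file handle instead of the in-memory sample, and check the assembly column before comparing positions against anything else.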
But if you count all the mutations, what's the most frequently mutated gene in COSMIC? Lo and behold, BRAF is at the top of the list, because there are a lot of targeted sequencing efforts in COSMIC. The COSMIC user interface allows you to search for one kind of study and not the other. We're going to do another exercise this afternoon where we can look at that. And the COSMIC people come and get data from ICGC: they'll take the whole-genome data from ICGC and incorporate it into COSMIC. So COSMIC is a curation effort, and it's useful: if you're interested in BRAF and you want to see all the mutations that were ever reported in that gene, this is a good place to look. There are lots of different tumors, lots of different mutations. How many genes are there in the human genome? Probably close to 21,000. Every gene in the human genome has a mutation in COSMIC. So there's lots of very nice data. It's really good work. Also available at COSMIC is a table which has all of the data. I don't have the URL here, but you can go find it. The file is the COSMIC complete export, version 64, the fourth line from the top. We have that file on here, and we're going to be working with it.

I'm going to skip that; I already talked about it. Oh yes. So at the ICGC there are the cancer genome project pages, there's the DCC that I talked about, and there's also DACO, which is where you apply to get access to the data. If you want access to the BAM files, or to the germline information, you have to get DACO approval. And DACO is as scary as the name sounds: it's the Data Access Compliance Office. There are DACO officers, who are lawyers and bioethicists, who will grant you permission to use this data.
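The BRAF effect described above, where a gene tops the list partly because it was sequenced so often, can be sketched in a few lines of Python. Everything here is made up for illustration: the gene names, the two studies, and the counts. The only point is the difference between a raw count and a count divided by the number of samples in which the gene was actually screened.

```python
from collections import defaultdict

# Two made-up studies: a targeted screen of one gene in many samples,
# and a whole-genome study of fewer samples covering both genes.
studies = [
    {"samples": 1000, "mutated": {"GENE_X": 80}},
    {"samples": 100, "mutated": {"GENE_X": 5, "GENE_Y": 30}},
]


def raw_counts(studies):
    """Total mutated samples per gene, ignoring how often it was screened."""
    totals = defaultdict(int)
    for study in studies:
        for gene, n in study["mutated"].items():
            totals[gene] += n
    return dict(totals)


def frequencies(studies):
    """Mutated samples divided by samples in which the gene was screened.

    Assumption: a gene was screened in a study only if it is listed there.
    """
    mutated = defaultdict(int)
    screened = defaultdict(int)
    for study in studies:
        for gene, n in study["mutated"].items():
            mutated[gene] += n
            screened[gene] += study["samples"]
    return {gene: mutated[gene] / screened[gene] for gene in mutated}


raw = raw_counts(studies)
freq = frequencies(studies)
print(raw)   # GENE_X leads on raw counts
print(freq)  # GENE_Y leads once screening effort is accounted for
```

This is why being able to separate targeted screens from whole-genome studies in the search interface matters when you rank genes.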
So you fill out the form and you submit it. Has anybody ever gotten access to dbGaP? It's similar, the same sort of pain. You have to identify yourself on the website, and you have to fill out contact details and what your project is going to be. You have to describe your IT infrastructure for keeping the data secure, the data access agreement, and whether you have REB approval or why not. So this is how you log in, and here's the form. If you don't fill it out, it's all red; if you fill it out properly, you get all green marks, good for you. Then you get a PDF, which gets signed by the person who's going to fire you if you ever transgress. For us at the OICR, it would be Tom Hudson, or the president of OICR, who would sign off and say: yes, these are all legit scientists within the OICR, and if Francis ever tries to re-identify somebody, we'll fire him. I haven't done that, anyway.

One problem is that we don't actually hold all the data at OICR for the ICGC; we're working with our friends at the EBI. The TCGA data is at dbGaP... actually, that's an old slide. Where is it now, anybody know? It's not at dbGaP anymore. Where is the cancer genome data in the U.S.? CGHub. Have you ever heard of CGHub? CGHub is a new data center at UCSD, San Diego, and it's confusing, because it's actually David Haussler from UCSC, the guy who does the genome browser, who maintains CGHub at UCSD. But it's basically the same sort of validation and dbGaP permissions that give you access, and then you have access to the BAM files, the germline VCF files and so forth. The ICGC side is held not in the U.S. but in Europe, at the EGA, the European Genome-phenome Archive. Unfortunately these two paths are not a single path, so you have to get permission from one and
the other to get access to everything. We're trying to fix that, but it's not fixed yet, so right now you need both permissions to get the whole data set. If you want access to everything, you have to go through both paths.

The other thing that's complicated is that very, very few places can go through all of this, get everything, and hold it on site; hence the cloud computing ideas. Within the ICGC, with our friends in TCGA, we're doing a pan-cancer experiment where we're going to look at 8,000, I think it is, whole-genome tumors across multiple cancer types. We're going to do it in a private cloud, which we're all getting permissions for right now so that we can send our files there, and then a group of people will be able to compute across multiple cancer types all in one place. But it requires a lot of coordination. If you're looking at one tumor and one cell line, or two or three samples' worth of DNA, you can just go get them; that's not a problem. But as soon as you start scaling up to hundreds and thousands of samples, very few places can do that. So that's a bit of a challenge.

Some of the cancer data is structured clinical data, which we're going to hear more about tomorrow: basically information about the samples and the treatment. Although in this case there isn't any treatment per se, because ICGC samples are all collected before any treatment; what we have is when in the development of the disease they were caught. What we're implementing right now, as we mentioned earlier, is the BioProject idea, so that for the ICGC and the TCGA there will be projects that are put together. For example, there are two groups doing pancreatic cancer, so they'll be organized and curated together like this. This is work in development.

A few quick words about the UCSC Genome Browser. It has many eukaryotic genomes, so it doesn't do
prokaryotes per se, although technically it could, except that it doesn't really do circular genomes. It's a really useful site for evolutionary variation. Data representation is flexible and configurable, and it has graphical and table views, which makes it very easy to export sets of data. Galaxy, which we're not going to talk about, uses this table view feature: it uses UCSC as one of its data sources. You can also upload your own data and share it with colleagues or with the world. It's a great app.

One challenge with this genome browser, and with other browsers like IGV, is that we're using a two-dimensional coordinate system to represent multi-dimensional data. That's probably the biggest shortcoming of any genome browser, and the challenge of interpreting and representing very complicated data on a two-dimensional coordinate system has not been resolved yet. There are people who will jump from here into Cytoscape and so forth, so there are ways to represent other types of information, and UCSC has lots of different ways of representing graphs, both in graphical view and in table view. But it's definitely a limitation of genome browsers in general.

A reminder about annotations, which each of these browsers represents. What's a quick 140-character, or 20-character, definition of what annotations are? Yeah. So the big thing about annotations is that they're an interpretation of what we think is present: where we think a gene is, where we think a transcript is. It's probably correct in most cases, and the good browsers like UCSC, NCBI, and Ensembl are so rich, so well documented, and so full of corroborating and supporting evidence that it's probably all true, but it is a human
interpretation of where a gene starts, where it ends, where a transcription regulatory site is, where similarity exists and so forth. You have to keep that in the back of your mind all the time when you're using this information: it is somebody's interpretation of where to hang this feature.

So the data types in many human genome projects, but specifically in cancer genome projects, are: phenotype and clinical data (tumor pathology, gender, age, treatment, survival and so forth); germline data, which is controlled access, so these are the genomic variants that are transmitted in the family; and somatic mutations in the tumor, which fall into multiple types. You can have single-nucleotide variations, short indels, structural rearrangements, large-scale structural variations, translocations and so forth. There's copy number variation, which to this day is still open data, until it's proven to be identifiable, which I'm sure will happen someday. RNA abundance and splicing data is itself open, but the RNA sequences are closed, because RNA sequences show genetic variants and could become identifying. And then there's DNA methylation data: measuring it is part of the ICGC, but very few centers have submitted any so far. It's really the last one to come. So these are just some URLs for you to have a look at, with a lot of the various data types available.

How many of you have used the UCSC Genome Browser before? Yes, yes... this room has, that room has not. So we're going to do some of that in the lab this afternoon, and depending on how much time we have, we'll do more and more complicated things. Here, basically, is KRAS, a fairly well characterized oncogene. If you just do a KRAS lookup, the first thing UCSC does is bring up all the various places where the string KRAS shows up. You'll have the RefSeq reference, which will be the canonical
reference; then you'll have multiple versions of various transcripts, at the top for the genome of interest. You usually start by looking at human, so this will be based on human. If I start from dog or cat or whatever, then I'll have those: where you start from will be the reference, and you'll see the other references thereafter. One of the things about the UCSC Genome Browser is that it's rich in data. Each line is called a track, and each track is a data element of a certain type. What you see in the top part is controlled by the knobs, if you like, that you see in the bottom part. By default, the cross-organism similarity, some of the SNPs, and some of the transcripts are shown, but you can go in here and turn the various tracks on and off as you see fit, then refresh, and a new image loads at the top. You can also move around within the image. It's a very flexible user interface. Here, for example, you can see a representation of various RNAs, various exons and so forth: the blocks are exons, and the lines are the linkers between them. Again, the challenge is showing, on a DNA coordinate system, things which are definitely beyond that coordinate system.

Okay, so what time is the coffee break? Three? Okay. So another tool that was installed on your computers for this week is IGV, from the Broad. It's a Java tool that doesn't run within a browser; it runs independently on your desktop, and it can make calls out to many different sources, including the Broad itself, so it will have reference genomes, reference annotations and so forth. One of the gaps that IGV tries to fill is to be a browser for next-gen data, but initially it was built as a browser for microarray data, and it was really a way of
integrating a genomic view of large-scale microarray data, and eventually RNA-seq and so forth. Now it's used by lots of different communities in lots of different ways, and many of the instructors this week will use it in their own way, so they'll show you what works for them. What I'm going to do today is a quick tour of IGV, then we'll have the break, and then we'll do some labs, which will basically involve going on Amazon, doing some Unix file manipulation, and looking at large records from COSMIC and from ICGC. We'll look at some of the gene sets, and then we'll take some of the gene sets we get out of that and look at them in the UCSC Genome Browser. That's what we're going to do for the rest of the afternoon. And I have till 5? Yeah, okay.

So as I mentioned, it's very important for any genome browser, human or for that matter any organism's, to have a fixed coordinate set. One of the challenges for people who don't do human biology is that they don't necessarily have a reference genome. That's not our problem. What is our problem is knowing which reference to work against. If you have some data you're picking up from older work, it may be against an older version of the genome, so if you want to compare your work against theirs, it's very important to use the same coordinate system.

One of the strengths of IGV is its ability to read multiple file types. You can go on the website and it will describe these file types; some I've never worked with, some I use often. Basically there's BAM, BED, bedGraph and so on. They're mostly from UCSC, except for BAM, which is the binary version of SAM, another file type listed here, the Sequence Alignment/Map format; and VCF and so forth. They're all files used by many different tools as input
or output, and this tool is able to compute on, format, and export to many of the other formats as well. One of the main formats that we're going to look at this week will be the BAM file format, which is the binary version of SAM, which is the text version of the alignment. Basically, once you've mapped your reads against the genome, you then have coordinates for each read against the reference genome, and from there you can do various analyses to see polymorphisms, to see genomic variants, to see structural rearrangements and so forth. So many of the things we want to do are possible. IGV, for example, if we look at a whole-chromosome view, can allow you to look at different data types and different clinical parameters, which is not very viewable here, but it's more in the paper that I also attached with the lecture, and it allows you to look at copy number changes and amplifications and so forth. So there's lots of different data types.
As I mentioned before with data and databases, it's really important for you to understand where things are from. There's a lot of things we're going to learn this week, but there's a lot of things that you'll have to go discover yourself after we've teased you with too much information this week. With respect to next-gen sequence analysis, SEQanswers is actually a good source of information. Another one is, what's it called again? Bio... no, that's not the one I'm thinking about, not SEQanswers but the other one. Biostar, thank you. Yes, Biostar is the other one, which I use. There's another version of the slide which I forgot to include here, so my apologies, I'll add it on the wiki. But it's actually taking over; I'd say Biostar is more up to date. Yeah.
So as I mentioned, IGV can do a lot of things: mutation analysis, copy number variation, gene expression, lots of different data formats. And so you have the portal here for the description of all the various data formats I talked
about. So do I want to do this now? You're saying yes; I'm saying no. Actually, I want to do the other lab first. I'm going to reorganize my lecture notes a little bit, so I'm going to jump, and I'm going to come back to this after we've done some of the lab. So we're going to skip some slides; actually, we're going to stop now.
Okay, so I'm going to do a little change of pace here. We're going to go back to the IGV lecture, and really I think the best thing will probably be for me to go through this lecture, and some of you may want to try it. Are you using it tomorrow? Do you use IGV? No, you're not. Okay, so it won't be used till Wednesday, I think. I have my schedule here, and I'll be back on Wednesday. I won't be here tomorrow morning, but I'll probably be here tomorrow afternoon, and then I'll be back on Wednesday for sure. So I'm going to be here most of the week. I apologize ahead of time: I'm a good heckler in the back, I sort of bug the other instructors. Yes.
Okay, so the first thing when you start IGV is you have to select a genome, for the very important reasons I've already explained many times. On the left panel here you select, and there's a bunch of other genomes available: mm for Mus musculus and so forth, and hg for human, that is, human genome. So you choose your genome from the list, or you import if you have your own genome file. If you have a different one, if you have a pre-release of hg20, for example, you can import it there. Then the alignment file: load from file. You can also load files from a URL or from a DAS server, and lots of various sample file sources are available here, or by FTP from the Broad. So again, File, Load from File, and then you put in the file that you want to load. You can have files locally, of course, URLs, servers and so
forth; all the various options are available here. The genomic annotations will be the equivalent of what I talked about from the various genome browsers: the transcripts, where the genes are located and so forth. And that's actually, sorry, I'm confusing myself here. Yeah, the BAM file: the BAM file is the binary file with the reads. And then you select the chromosome. It's a little scary, well, not scary, but when you first start IGV and load up a chromosome, you're at such a high level that you don't see any details, and so it doesn't seem like you've loaded anything. You have to start zooming in before you can see something. So let's say you picked a chromosome and zoomed in, and then you're inundated with data. If we zoom in some more, then you're starting to see reads, paired reads. In this case it's paired; it doesn't have to be paired reads. There's the coverage histogram at the top, and white reads are low alignment scores; other colors depend on the color assignment, and you can change what the color assignment means. Sometimes the color assignment will be by pairing. And as you can see at the bottom here, you're zoomed in so much that each of these little bars is actually a nucleotide. These paired ends, if I recall, are relatively short, something on the order of 35 or 50 bases. And again, you zoom in some more, with the zoom controls at the top right here, and you see that each letter is a different color. This is the reference sequence, and then you see differences against the reference; if there are differences against the reference, they're highlighted. Zoom in some more here, and then you have a summary at the top.
So yes, I talked about downloading genomic annotations. One place that's common is UCSC. It has table formats here; you can go to the table view, and
the table view has an explanation of how to get the Ensembl genes. You go to UCSC to get Ensembl genes; I mean, that's sort of weird, but anyway, these are tools that are made to work together, so UCSC works well with IGV. So there's the file format, the organism you want, the build you want. You've got to make sure you have the build of the coordinate assembly that you've selected, and again that the annotation is against that same build. So these are, uh, 16, 18, 18. And so here you have an example of gene models. The more data and annotation you load, the more memory you'll need. We told you to take the 1.2 gig version of this software on the instruction page, which should be fine for most of the things that we need to do. It also obviously has to fit with the amount of memory you have available on your machine.
So here's an example of a tumor-normal pair where there's a deletion. This is an example of where you'll see a diagnostic, sort of upper-level view of deletions, insertions and so forth. So this is a deletion: you'll have unmatched paired ends on the other side of the deletion, these red colored boxes. And so, copy number variation: I'm not sure if anybody's going to be using it this week. Do you know? So our lab will be just like this? Okay.
And then this is just some other places to get information, and I'm going to add the last one I talked about. So this is the exercise I told you about: go complete this page, and you can do that now. And this is some of the things that we did in the file. This one here, it's the space that's no good. So all of these are better on the wiki; they're corrected on the wiki, not on the, and so forth.
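Editor's illustration: the deletion signature described earlier (paired ends that map much farther apart than expected, shown as red boxes in the tumor-normal example) can be sketched in a few lines of Python. This is only a minimal sketch, not how IGV itself does it. It assumes plain-text SAM records with the standard eleven mandatory tab-separated fields; the function name `flag_wide_pairs`, the 1000 bp cutoff, and the example records are all hypothetical and were not part of the lecture.

```python
def flag_wide_pairs(sam_lines, max_insert=1000):
    """Return (read name, chromosome, position, template length) for paired
    reads whose template length exceeds max_insert (a hypothetical cutoff)."""
    flagged = []
    for line in sam_lines:
        if line.startswith("@"):           # SAM header lines start with '@'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:               # not a complete alignment record
            continue
        qname, rname = fields[0], fields[2]
        flag, pos, tlen = int(fields[1]), int(fields[3]), int(fields[8])
        is_paired = bool(flag & 0x1)       # read is part of a pair
        read_unmapped = bool(flag & 0x4)   # this read did not align
        mate_unmapped = bool(flag & 0x8)   # its mate did not align
        if not is_paired or read_unmapped or mate_unmapped:
            continue
        if abs(tlen) > max_insert:         # mates are farther apart than expected
            flagged.append((qname, rname, pos, tlen))
    return flagged


# Made-up records for illustration only (50-base reads).
seq, qual = "A" * 50, "I" * 50
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    f"r1\t99\tchr1\t1000\t60\t50M\t=\t1200\t250\t{seq}\t{qual}",
    f"r2\t97\tchr1\t2000\t60\t50M\t=\t9000\t7050\t{seq}\t{qual}",
    f"r3\t0\tchr1\t3000\t60\t50M\t*\t0\t0\t{seq}\t{qual}",
]

wide = flag_wide_pairs(example)
print(wide)
```

Here only `r2` is reported, because its mates sit about 7 kb apart; `r1` has a normal insert size and `r3` is not paired. On real data the cutoff would have to be chosen from the library's actual insert-size distribution, and a binary BAM file would first need converting to SAM text or reading with a dedicated library.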