Welcome to the MOOC course Introduction to Proteogenomics. In the last lecture, Dr. Kelly Ruggles gave you an overview of genomics and genomic technologies and how they are revolutionizing the study of various diseases, especially cancer. Today is the second lecture by Dr. Kelly Ruggles, and she will talk to you about sequence alignment with respect to the reference genome, and what terms like coverage and depth mean when reads are aligned to the reference genome. She will talk to you about exome and whole genome sequencing, and how they help us understand the genome and different variations, like copy number variants and mutation status, which may lead to various clinical conditions. This lecture will also describe the various types of SNPs, SNP arrays, and applications of GWAS, or genome-wide association studies, in a population with reference to disease. She will then cover transcriptomics, which looks beyond the genome: how transcripts are formed, how you can study RNA expression using RNA sequencing data or NGS technologies, and how one can get functional information by looking beyond the genome. So let us welcome Dr. Kelly Ruggles for her second lecture.

Okay, so we are going to just continue where we left off. At this point I have sort of walked us through getting to the FASTQ raw data files, and now we are going to talk about what we do once we have those data files, for all different kinds of omics analysis. So the first thing that we do is align these sequences to a reference genome, if a reference genome exists. I am going to assume for the purpose of this that we have a reference genome. What that means is you take the short sequences and you match them against a genome that represents whatever species we are looking at.
So, for example, for humans there is a reference genome. The current updated reference genome is hg38, and the version before this was hg19. A lot of things are still in hg19, some things are in hg38. If you are doing an alignment, or you are using data that is aligned to a genome, you should definitely always check the reference genome, because it will completely mess you up if you assume it is one version and it is actually aligned to the other. So a reference genome is just a sequence database that acts as a representative sample of a species. As I mentioned, for humans the current version is hg38, and there is a mouse version, mm10. Every species that has been sequenced has a reference genome that you can look up, if these do not include your favorite species to work with.

What the alignment does is find perfect matches, so anything where that 150 or 200 base pairs perfectly aligns to the genome, or it can allow for a certain amount of mismatch. Depending on the aligner and the settings, you can put in how much mismatch you want to allow. And then, if you use, let's say, Illumina, you can get about 80 to 90% of the reads to map to the human or mouse genome. There are a lot of problems that occur. If you have a chunk that maps to many different parts of the genome, you don't know which one it came from. Depending on which aligner you're using, you can read about the limitations and the strengths of each of the different aligners.

Typically the output is what's called a SAM file, which is a Sequence Alignment/Map file. I'm not going to go into the details about this; we could have a whole day on SAM files. But if you want to learn more, I did include a reference here that's pretty thorough in terms of the SAM file format. And then there are some common tools I mention here too: Bowtie, which is typically now used for genomic alignment, and STAR, which is used for RNA-seq.
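To make the idea of mismatch-tolerant alignment concrete, here is a minimal Python sketch. It is not how Bowtie or STAR work internally (real aligners use indexed data structures and scale to whole genomes); it is a brute-force scan that assumes substitutions only, with no gaps. The function names `hamming` and `align_read`, the `max_mismatches` parameter, and the toy reference sequence are all made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, max_mismatches=1):
    """Slide the read along the reference and keep every placement
    whose mismatch count is within the allowed tolerance.

    Returns a list of (0-based position, mismatches) tuples.
    """
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(read, window)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy reference; a real one would be a whole genome such as hg38.
reference = "ACGTTGCAACGTAGCTAGGCTTACG"

perfect = align_read("GCAACG", reference, max_mismatches=0)   # exact match
strict = align_read("GCATCG", reference, max_mismatches=0)    # one base differs
relaxed = align_read("GCATCG", reference, max_mismatches=1)   # tolerated
multi = align_read("ACGT", reference, max_mismatches=0)       # maps twice

print(perfect, strict, relaxed, multi)
```

The read `ACGT` lands in two places in this toy reference, which is the multi-mapping problem mentioned above: from the sequence alone you cannot tell which copy it came from.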
I also included here some references if you're interested in learning more about either of those. We don't have time to go through all of the details about them today, but I did want to mention them. Has anyone used Bowtie? No? STAR? Neither? Great, all new. So that's good.

Something to keep in mind here too is that you hear a lot about coverage and depth when you hear about next-gen sequencing. What that really means is that the coverage is the percent of the reference genome that you were able to sequence, and the depth is the redundancy of coverage, so how many reads, on average, you were able to get at a certain point within the reference genome. So, for example, 10x would mean that on average 10 reads cover each position across the reference genome, and the reads counted here are the uniquely mapped reads.

Okay, so I wanted to go through some examples of next-gen sequencing methods and how they're used. We'll start with genomics, then we'll move into transcriptomics, and then we'll end on epigenomics. For genomics, the two commonly used methods are whole exome and whole genome sequencing. Whole genome sequencing just means that you're taking everything in the genome and you're sequencing all of it. Depending on the species you're working with, if you're working with humans, that's an enormous amount to sequence. So if you don't care about the things that are in the intergenic or intronic regions, and you just care about protein-coding genes, or exons, then you just want to look at the exome. What you can do is actually capture the exome sequences. You use these oligo probes that match exon sequences, which are able to pull out and enrich for these exon sequences during the library prep, and you kind of get rid of everything that's intergenic. So you enrich for these exons while you're doing your library prep, and then you do all of the
sequence amplification and sequencing following that, so that you're only really looking at the exons and you get rid of everything else. This is 2% of the whole genome, which really cuts back on the costs quite a bit if you don't care about the other stuff. So it's another method that a lot of people use. It's cheaper and faster. With whole genome sequencing you get more coverage, but it's much more expensive. So depending on what you care about, you decide which one you actually want to do.

Okay, so two of the main things that you can do with genomics. One is to identify single nucleotide polymorphisms, or SNPs, which are these single base pairs that vary. So, for example, if in your reference genome there's a T and in your sample there's a C, then you know that there was some sort of mutation that occurred. As I mentioned in the beginning, some of these have been shown to be drivers of tumor progression, so in cancer these are particularly interesting. You can also look at copy number variation, which is just changes in the genome because there's a large duplication or deletion of DNA. So you can see here, if this is along the chromosome, that there is this big chunk that's been duplicated here. And then, if you look at the copy number level, you can actually see that there's approximately double the number of copies of this area of the genome in your actual reads. So you're able to actually get information on these duplications and deletions using these sequencing methods.

There are a couple of different kinds of SNPs. So let's say this is your DNA. As I mentioned, there are codons that encode the different amino acids of a protein. In this example, this is your reference genome, so you have no mutation, and this encodes for this RNA sequence, which then encodes for a lysine. You can have a synonymous mutation, where this C is turned to a T, which causes the RNA to change to three A's
during transcription, and the protein still becomes a lysine, because there is some redundancy: several RNA codons encode the same amino acid. So this doesn't actually cause a change at the protein level. But then you can have some non-synonymous SNPs. So, for example, if you change the T to an A here, you get a UAG at the RNA level, which encodes for a stop codon. So now, instead of your protein going on and continuing to grow, you actually have a truncated protein. Or you can have a missense SNP, where the middle T becomes a C, at the RNA level it's AGG, and then you have an arginine, so it's changed the protein. These are the mutations that people typically focus on, because they have an impact at the protein level. But when we do SNP calling, you actually identify all of the mutations, regardless of whether they're synonymous or non-synonymous.

In addition to the standard next-gen sequencing we were talking about, there are also SNP arrays, and these are still pretty commonly used, so I did want to talk about them a little bit. These are actual arrays that have specific SNPs that they measure, so you are only going to measure the SNPs that are on the array. When you do whole genome sequencing, whatever you find, you find. With an array, you are actually asking: do I find these SNPs, and what's the quantitation of them in those different populations? So here you have your genomic DNA, you fragment it, and then you lay it across the chip surface. There are a couple of different kinds of chips; I just chose one of the newer ones. Then the DNA is amplified and hybridized to whatever your array is, which is here, and then it scans, and you're able to quantify which of the SNPs is occurring. In this case it's because they have these fluorescent labels that
specifically hybridize to different SNPs, so you're able to actually measure which SNP is present in which sample. There are a couple of different ways you can do this, but that's essentially the overview of how this works.

These are commonly used in genome-wide association studies, so these GWAS studies. If anyone's done 23andMe, or any of these "sequence your own SNPs" services, they're done with SNP arrays. What GWAS studies do is measure and analyze these SNPs across different populations. Typically it's a case-control study: if you have a population of people with disease X and a population of people without disease X, can you find a SNP that occurs more often, at a statistically significant level, in disease X versus the control? These were super popular maybe 10 years ago. People still do them, but they definitely were a huge deal for a little while, and there are certain cases where they're still really useful. So you can see here, this is just looking across all the chromosomes, and it's showing that at this point in a chromosome (I can't tell which one this is based on the color) there is a significant association with the disease versus the control. This is a negative log10 p-value. You can also use SNP arrays if you're doing a cancer study and you just want to look at specific SNPs in your population. It's one way of doing it, and it's cheaper than doing the next-gen sequencing.

Okay, so another way that you can do SNP detection is just using either the whole genome or whole exome sequencing that we talked about before. So you have your sequencing data and you align it to the reference genome, again as we discussed. Then you have those quality scores, those Phred scores, for each of the different reads, so you know how confident you are that the base that you're calling at that specific location is true or not.
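As a small illustration of what a Phred score means, here is a hedged Python sketch. The relationship Q = -10 * log10(P), where P is the probability that the base call is wrong, is the standard Phred definition, and the offset of 33 is the usual FASTQ encoding. The function names and the `min_mean_q` threshold are assumptions made up for this example, not part of any particular QC tool.

```python
def decode_quality(qual_string, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores
    (Phred+33 encoding assumed)."""
    return [ord(ch) - offset for ch in qual_string]


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def passes_qc(qual_string, min_mean_q=20):
    """Toy QC rule: keep a read if its mean Phred score is high enough."""
    scores = decode_quality(qual_string)
    return sum(scores) / len(scores) >= min_mean_q


scores = decode_quality("I5!")              # three bases: Q40, Q20, Q0
probs = [phred_to_error_prob(q) for q in scores]
print(scores, probs, passes_qc("I5!"))
```

So a Q20 base has a 1 in 100 chance of being wrong and a Q40 base a 1 in 10,000 chance, which is why low-quality reads or bases are often removed before variant calling.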
And then you can remove some of these reads, or, you know, you do a QC step. Then, depending on the number of samples, you can do either multi-sample calling or single-sample calling. There are many algorithms that do this; I'll talk a little bit about which ones are available. So the algorithms will call different SNPs and then output the SNP calls, typically in VCF format, and I'll talk about what that format looks like.

Oh, here we go. So this is a VCF file, and what it has is information on these SNPs and where they're located. You can see that the first column is the chromosome, and the second column is the position of the SNP. The third column is an ID. In this case it's just left blank, but sometimes it has information from a different database that exists, like COSMIC or dbSNP, which we'll talk about. Some of these have been annotated, so it will just put in whatever that SNP is. Then it has the reference base, so what it is in the hg19 reference, and then it has whatever the allele is that's different in your sample. Then it has a quality score, and it can have all sorts of columns that go on and on that talk about what it is. So it depends on the data, but the first six are always there and they're the most important.

There's a paper that came out in 2018 that reviewed some of these variant calling pipelines; I've included it here. So again, the general steps for all of these algorithms are to align to the reference genome, do this recalibration and QC step, then do the variant calling and look at the quality of those calls, and then filter out variants based on the quality that comes out of your variant caller. There's a whole bunch of pipelines. Everyone has a favorite; it's usually the one that they created, or they know the person who created it. That's how these things work, right? But a lot of
people use several of them and then look for overlaps, and that seems to be the best way of doing this, because all of them have different strengths and limitations, and if a variant was called by several, then the overlap is probably the best way to go. And then there are several SNP databases that are really useful if you're working with these SNPs. So, for example, there's dbSNP, which is just a collection of essentially every SNP that's been identified.

Yes? Explain variant calling? Yeah. So essentially it's taking your chunk of sequence and comparing it to the reference, and then it's saying that this one nucleotide always comes up different in a whole bunch of reads. It will pull out the fact that that nucleotide is different in a whole bunch of reads, and then it gives it a quality score. So they will output that information.

Okay, so there's a whole bunch of SNP databases. dbSNP, which I mentioned, is just a collection of a whole bunch of SNPs that have been identified. COSMIC is specific for somatic mutations in cancer, so if you're working with cancer, this is a really good one to look at. openSNP is where you can actually upload your SNP data. I can't believe people do this, but apparently people do: they get their SNP data from these companies and then they upload it so that other people can use it. They're very trusting. There's IGSR, which was started as the 1000 Genomes Project. Has anyone heard about the 1000 Genomes Project? Great. It's pretty interesting, and they keep adding more and more data, so it's really useful if you're looking at SNPs in different populations. They're just trying to get DNA from people all over the world to try and see what SNPs occur in different populations. And then there's GO Exome, which is a SNP database from a lung, heart, and blood disorder project.
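The variant calling idea described above (one nucleotide coming up different from the reference in a whole bunch of reads) can be sketched in a few lines of Python. This is a toy pileup, not the method of any real variant caller: it assumes the reads are already aligned without gaps, and it uses the fraction of reads supporting the alternate base as a crude stand-in for a proper quality score. The names `pileup` and `call_snps` and the thresholds `min_depth` and `min_alt_fraction` are hypothetical.

```python
from collections import Counter, defaultdict


def pileup(aligned_reads):
    """Count the bases observed at each 0-based reference position.

    aligned_reads is a list of (start position, read sequence) pairs.
    """
    counts = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1
    return counts


def call_snps(reference, aligned_reads, min_depth=3, min_alt_fraction=0.5):
    """Report positions where a non-reference base dominates the reads.

    Returns (1-based position, ref base, alt base, depth, alt fraction).
    """
    calls = []
    counts = pileup(aligned_reads)
    for pos in sorted(counts):
        depth = sum(counts[pos].values())
        if depth < min_depth:
            continue  # too few reads to trust a call at this position
        ref_base = reference[pos]
        for base, n in counts[pos].most_common():
            if base != ref_base and n / depth >= min_alt_fraction:
                calls.append((pos + 1, ref_base, base, depth, n / depth))
    return calls


reference = "ACGTTGCAACGT"
reads = [
    (0, "ACGTTACA"),  # carries A instead of G at 0-based position 5
    (2, "GTTACAAC"),  # same variant
    (4, "TACAACGT"),  # same variant
    (3, "TTGCAACG"),  # matches the reference
    (1, "CGTTACTA"),  # same variant, plus a lone mismatch at position 7
]
calls = call_snps(reference, reads)
print(calls)
```

The G to A change is seen in four of the five reads covering that position, so it is called. The lone T at position 7 appears in only one read and is treated as a likely sequencing error, which is the kind of filtering the quality step is for.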
Okay, and then copy number variation, as I mentioned, is just looking at these changes in the genome due to duplications or deletions of large regions of DNA. This also really often occurs in cancer, so it's something that we pay a lot of attention to, and we use whole genome or whole exome sequencing to get information on it.

So you think that CNV is less popular now? I mean, all of the projects I've worked on, we still do it, and we still include it in our data analysis. I think also, if you're doing the whole genome or the whole exome sequencing already to get your SNPs, then why not do copy number? So I think if you already have the data, you're going to do it.

Okay, so I'm going to move on to transcriptomics. The old way of doing high-throughput transcriptomics was using microarrays. These are gene expression arrays, similar to the SNP arrays, where you have a set number of, in this case, genes that you want to measure. You have this chip, and you make cDNA: you take your RNA and you make a cDNA using reverse transcriptase, so you're just taking your single strand and making it a double strand. Then you fragment it, you label it with this fluorescence, and you add it to this microarray and it hybridizes. So you have probes for specific genes or transcripts in your microarray that you then measure. If you have a lot of a certain transcript, there will be a lot of things that stick to that probe, and then you can measure, based on how much fluorescence there is in each of the cells, how much transcription you have of that gene. So this is sort of the old way. I think there are some people who are still using this, but for the most part people have moved right into RNA-seq.

So I'm going to spend some time on RNA-seq. RNA-seq is similar to the whole genome and whole
exome sequencing, where you have your sample, you isolate your RNA, you do a library prep like we discussed before, you load it on your flow cell, and you do this next-gen sequencing. In this case the only difference is that you're measuring RNA instead of DNA. What this can be used for is gene expression, so you can look at the expression of genes and transcripts in all of your samples. You can do differential expression analysis, and you can also look at alternative splicing. With RNA you're going to be able to see the splicing of different exons, because at the genome level the exons are separated by introns, and once they're transcribed you can see what they look like after the alternative splicing has occurred. So that's a benefit of RNA-seq: you get more information than you would get from your genome sequencing. Some people also do SNP calling from RNA-seq. I've talked to a lot of people about this, and there's a higher error rate, so there are some worries about using RNA-seq to do SNP calling, but it's something that people do.

So how does this work? You actually have to enrich your RNAs when you do this. You have your sample of interest, and you isolate your RNAs either by using a poly-A enrichment, so you pull them out of your sample, or you can deplete ribosomal RNA. Those are the two different methods you can use, so you get this enrichment of mRNAs. Then you select for specific sizes from the RNAs that you've enriched for, and you add adapters, similar to what I showed previously with genomic sequencing. Then you just do the next-gen sequencing as we discussed, and again, a lot of times it's done using Illumina or similar instruments.

There are a lot of applications for RNA-seq. I think if you're in the field at all, you've read a lot of papers where people use RNA-seq. It's very
popular right now. You can look for fusion transcripts, which we'll talk a little bit about, and mutations. The TCGA, which I'll talk about more later, has used RNA-seq to characterize thousands of tumors. ENCODE, which I'll also talk about, has characterized dozens of cell lines. You can look at annotation of the genome, so how genomes are actually structured, and you can identify RNAs that are associated with disease. You can also look at microRNAs. That takes a totally different process of sample prep, but it's the same once you isolate the microRNAs: you can sequence them and look similarly at how they're expressed in different diseases.

So in today's lecture you have learned how sequence alignment is done, and the factors that help increase the efficiency of your analysis for the big data sets obtained from genomic studies. You also learned about GWAS, which relate SNPs in various genes to different clinical conditions. Dr. Kelly has also helped us understand how to use the raw NGS files in SNP data analysis, and I hope you've also learned about various SNP databases, like COSMIC, which contains somatic mutations in human cancer, GO Exome, which is a SNP database for lung, heart, and blood disorders, and many more. So I hope that by listening to these lectures you are not only getting refreshed on genomics, transcriptomics, and the basics of these technologies, but also learning about some of the databases and resources from which you can obtain a lot of new information using publicly available data sets. In the next lecture, Dr. Kelly will talk to you about how one can use RNA sequencing for transcriptomic studies, and how to interpret the data for more meaningful insights into a given disease. Thank you.