Welcome to the MOOC course on Introduction to Proteogenomics. In the previous lecture, you heard from Dr. Henry Rodriguez, who gave you a very nice and broad overview of the field of proteogenomics, especially cancer proteogenomics: the major milestones which have been achieved and the kinds of challenges which lie ahead for the community. Now, we are going to talk in much more depth, first about the genomics module, then proteomics, and then we will try to integrate the big data and proteogenomics to make meaningful insights. The first lecture today is going to be by Dr. Kelly Ruggles. She is an assistant professor at New York University in the USA, and she is going to talk about genomics, one of the major aspects of building a comprehensive view of any disease or any system. Dr. Kelly will talk to you about the diversity of omics in biomedicine, especially the milestones which have been achieved using genomic technologies, and the various types of mutations, hereditary or acquired, which affect different types of diseases, especially in the context of oncology. She will also cover the publicly available resources and data sets where one could study these mutations. Along with that, the basics of gene sequencing and its methodologies will be covered as well. Okay, hi, so I am Kelly Ruggles, I am an assistant professor at NYU, and I am going to give you an introduction to genomics. Just to start, I wanted to give you an idea of what my lab does, for a little context about who I am and what I focus on. Our lab is really interested in applying multi-omics integration methods across a lot of different questions and diseases. This includes cancer, the human microbiome, and a little bit of work with population health. I am obviously just going to talk about the cancer work today, but we do think about integration across not just cancer but lots of different scientific questions. 
And really what we are interested in is taking a systems approach to human disease: combining omics at the cellular level to understand how proteogenomics within molecular networks interacts and how that impacts disease. We also think about intercellular multi-omics, so not just within one cell but across many different cells, and in some cases, if we have the data, which is of course always the limitation, we can look at interorgan multi-omics. If you have many different organs and many different omics, you can have a very large and comprehensive view of human disease, and I think this is really the goal; we are starting to get there as we are able to generate more and more data. Everybody here is eventually going to touch on many of the things in this slide, so I wanted to bring it up in the beginning. We are going to be talking about all levels except metabolomics, which I don't think anyone here will cover: genomics, epigenomics, transcriptomics and proteomics. There are many different data types that we can gather from these different omics levels, and today I am going to touch specifically on the genomics, epigenomics and transcriptomics data and go through an overview of how we collect the data and then how we analyze it. So as I mentioned, there is a large diversity of omics that we are currently studying in biomedicine. This slide was taken from the TCGA, which I am going to talk a bit about, and I added proteomics and phosphoproteomics here. We can look at the genome, which is really the long-term information storage of the cell; the transcriptome, which is the retrieval of this information; the proteome, which is the short-term information storage; and then how the signaling networks interact, and the interactome, which we will also touch on. 
So here are a lot of the different levels of data that we can gather at this point, and in terms of next-gen sequencing we are really covering everything from the microRNAs up through the mutation calls, and that is what we are going to be focusing on. There has been a long history of studying the genetics of cancer, which predates the proteogenomics of cancer, and the reason why genomics and cancer have become such an enormous field is that it is known that, at the molecular level, cancers are caused by mutations that result in aberrant cell proliferation. These mutations can either be germline, meaning that they are inherited, or they can be somatic, meaning that they are acquired at some point in life. If these mutations occur in oncogenes, genes that increase tumor proliferation, they can cause overexpression of these oncogenes, they can create fusion genes, or they can produce an altered gene product. There are many examples of this and we will cover some of them in the hands-on sessions, such as ERBB2, which is one we are going to talk about specifically in the genomics hands-on session. And then there are tumor suppressor genes, genes that normally regulate cell differentiation and suppress the proliferation of the cell. 
So if there are mutations in these genes that reduce their function, there is an increase in proliferation, so we look at both tumor suppressor genes and oncogenes in terms of the mutations in the cell. As I mentioned, there are many genomic drivers; in 2010 there was an article, in Nature I believe, yes, in Nature, about the cancer genome, and the TCGA, The Cancer Genome Atlas, which I'm going to cover in a bit more detail later, was really one of the projects that spearheaded this. They sequenced, and actually this is an outdated number, I think it's more like 11,000 cancer genomes across 33 different kinds of cancer, so it's a really wonderful data set that everyone has access to. There's also the International Cancer Genome Consortium, another large project that has worked to collect lots and lots of cancer genomics data that you can also access, and then there's COSMIC (there's a typo there and I apologize), the Catalogue of Somatic Mutations in Cancer, which is cataloging all of the mutations that have been found to occur in tumors, so this is another really good resource that's publicly available. So now I'm going to jump into genomics. I think a lot of you probably know this, but I realized I should have a slide in here just in case, because otherwise you'll be completely lost after this: the chemical structure of DNA and RNA. DNA is double-stranded, RNA is single-stranded, and there are nitrogenous bases that make up the nucleotides. They differ between DNA and RNA in that there's a thymine in DNA and a uracil in RNA, and triplets of these bases encode different amino acids, which is something that we're going to go into. Okay, so just two slides on the history of sequencing, because I think it's important to start there. Sanger sequencing: how many people have heard of Sanger sequencing? Great, wonderful. So this was 
developed by Frederick Sanger, who received the Nobel Prize in 1980 for these methods, and it was the most widely used method until next-gen sequencing came around; we're going to spend most of our time talking about next-gen sequencing. What this method did was use modified nucleotides which attach to the DNA strand, each tagged with a fluorescent label that identified which nucleotide it was. I'll show the schematic of this: here we have our DNA template, so the double-stranded DNA is denatured to get single-stranded DNA, and then you add in these nucleotides, which have either no fluorescent tag or a fluorescent tag on them, along with a DNA polymerase, which allows the strand to grow and become double-stranded because the polymerase is adding on these deoxyribonucleotides. Whenever it gets to one with a fluorescent probe, it stops. So it keeps growing, and these fluorescently tagged ddNTPs are randomly added, and as soon as one is added on, synthesis stops, so you know that, say, the nucleotide six positions up is a C, seven is a G, and so on. Then you run this on a polyacrylamide gel, separate out the strands, and use a fluorescent detector to figure out which nucleotide is at each length. This worked really well, it just took a really long time, so eventually next-gen sequencing came along. This was established in the mid to late 90s and was based on the Sanger method, but it incorporated some new innovations. How many of you have used, say, Illumina sequencing, for example? Okay, so there are more Sanger sequencing people here than Illumina. So now what happens is you still have this DNA template, a single-stranded DNA template, and it's fragmented, and there are adapters that are put on the template; we're going to talk a lot about these adapters because we can do some cool stuff with them. And then what happens is 
these fragments are attached to a solid support like a flow cell, which we'll talk about as well, and then there's an amplification that occurs: you stick your fragment onto this flow cell and then you amplify it, so you end up getting a cluster of the exact same sequence stuck in a certain part of your flow cell. This is what's called library prep, so this is how the library is prepped from your DNA template. Then what happens is we do what's called sequencing by synthesis. It's similar to what I mentioned in the Sanger method, except now, with these fluorescently labeled nucleotides, when they bind to the DNA they stop the synthesis, but then you can remove the fluorescent probe so that you can continue to grow the strand. In Sanger sequencing, once it hit there, it stopped; now you can take that fluor off so you can keep growing. So instead of having to do all of the different lengths and then running them through a gel, now you can just keep adding more and more on, taking a photo every time a fluor is put on: you take a photo, you remove it, and so on, 
until you grow it all the way up the strand, and then the data is exported as a text file. It reads out what cluster it is, what read it is, and then the actual sequence, so this is a much faster and more efficient method than Sanger sequencing. What we're going to talk about is a lot of different methods. In genomics we have de novo sequencing, mutation discovery, exome and whole-genome sequencing, and copy number alteration or variation detection; in epigenomics we have assays to look at transcriptional activity, protein-DNA interactions, and methylation analysis; and at the transcriptomics level we can look at mRNA expression, alternative splicing, and microRNA expression. And there's more; these are just some of the basics. So I included some current next-gen sequencing technology that's being used at the moment, sort of the newest instruments. The most common one at this point is the Illumina HiSeq; I'm going to talk about why it's the most commonly used and what technologies they've come up with that have made it the most popular. The Illumina HiSeq is commonly used for whole-genome and exome sequencing, targeted sequencing, RNA sequencing, and ChIP-seq; you can parallelize a lot of samples at once, it's pretty fast, it's highly sensitive, and it has a high output. Then there's the Illumina MiSeq, which I wanted to introduce, but we won't talk about it much because it's not used much for cancer; it's mostly used for metagenomics, so microbiome data, small bacterial sequences, or 16S ribosomal sequencing. The cost is higher compared to the HiSeq, but it's faster and has longer read lengths, which makes it good for these other smaller genomes that are less annotated. There is PacBio, and I'm going to go into each of these in more detail, which produces very long sequence reads but has very high error rates; at this point it's frequently used in combination with Illumina because they offer different strengths. And then 
there's the Oxford Nanopore MinION, which is super cool; has anyone seen one of these? Okay, they're like USB sticks, which makes them pretty exciting, and they're also really good for long reads, which makes them ideal for small bacterial and viral genomes. They have historically had very high error rates, but they are trying to improve this; we'll talk about that a little bit more too. So for Illumina, we talked about the library prep: it happens in the same way that we discussed before, where you fragment the DNA. Here it says 200 to 500 base pairs, but I've seen different quoted numbers for how long these fragments are. You then do cluster generation on the flow cell, and the flow cell is really what I think made Illumina the top of the market, because they created this patterned flow cell so you could get really high cluster density, and this increased the output of the actual sequencing analysis. Once you do the cluster generation on the flow cell, and I'm going to talk a little bit about what this solid-phase PCR is on the next slide, you do the same sequencing by synthesis that we talked about before, where you just keep adding on different nucleotides and taking pictures of them as you go. Okay, so the solid-phase PCR, which is also known as bridge amplification: what happens is you have your DNA that is fragmented, and then you have these adapters. Again, the adapters are put on so that you can attach the fragment to the flow cell. The DNA is denatured into single strands, you attach it to the flow cell, and there are these other adapters sitting on the flow cell, so the single strands end up bridging over to attach to the complementary adapter sequence. It creates this bridge, and then it's elongated using DNA polymerase, so it creates a double strand, and then it's 
denatured again, so it creates single strands again, and then it bridges over again, and it keeps creating more and more of these clusters. Yes? Yeah, correct, so you're saying why not just put, like, all five-prime adapters on? I don't think you can control which ones are going to stick, but the adapters themselves attach to the plate regardless. Oh, I see what you're saying. It's a good question; I'll look into it more, but my guess is that if you do it single-stranded you want to get it from both directions, so paired-end versus single reads, which we'll talk about. I have a feeling that's why, but I'll actually look into it. Thank you, that led me perfectly to my next slide. So, paired-end sequencing versus single-end sequencing. Paired-end sequencing is now really the norm, but I think some people likely still use single-end reads as well, so I figured we should talk about it. What it means is that you're sequencing both ends of the fragment. As I mentioned, the fragments are, let's say, 500 base pairs long; typically, based on the reagents that are used for Illumina, you only measure 50 to 200 of those base pairs, so you're not measuring the full fragment, you're just measuring the ends of it. If you do it just from one end, you're only going to be measuring, let's say, those 100 base pairs on one end of your fragment, say the three-prime end. But if you read both the forward and the reverse, 100 base pairs on each side, then you have a 300 base pair gap in the middle, but you have 100 and 100 on both sides and you know that they're from the same strand, so you know that that gap exists, and you get more information from that cluster than you would if you just read from one direction. So really you're producing double the reads for the same amount of time. 
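The paired-end arithmetic just described (a 500 bp fragment read 100 bp from each end leaves a 300 bp unsequenced gap whose size, but not sequence, is known) can be sketched in a few lines. This is a minimal illustration, not part of any sequencing software; `inner_gap` is a hypothetical helper name.

```python
def inner_gap(fragment_len, read_len):
    """Unsequenced bases between the two mates of a paired-end fragment.

    Assumes both mates have the same read length and do not overlap.
    """
    gap = fragment_len - 2 * read_len
    if gap < 0:
        raise ValueError("mates overlap: fragment shorter than 2 * read_len")
    return gap

# The lecture's example: a 500 bp fragment read 100 bp from each end.
print(inner_gap(500, 100))  # -> 300
# Longer reads from the same fragment shrink the gap.
print(inner_gap(500, 150))  # -> 200
```

The aligner uses exactly this known gap size (the "insert size") as an extra constraint when placing the two mates on the reference, which is why paired-end alignment is more accurate than single-end.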
So it's more expensive, which is the limitation, but if you can do it, it's a better use of your sample and your time, and the alignment is much more accurate. We'll talk about alignment: when you actually take these reads and try to align them to a reference genome, you have a lot more information on how to align them, because even though you have that gap, you know exactly what is on both sides of it. How this is done, and you sort of alluded to this in your question, is that there's a sequencing primer specific to the three-prime and the five-prime ends, and that's included in the adapter; that is how the paired reads work. Any other questions on this? Yes? Sorry, say that again. So, no, the fact that it only reads 100 or 150 bases is actually based on the reagents: Illumina gives you a certain amount of dNTPs to actually make the reads, so it's more about the reagents you use than the primers. Is it random? No, it's not random; the 150 is not random, because you tell the instrument how many times to take a picture. You're going to say take 150 pictures and then stop, because I know that my reagents can get me to 150. You'll always have 150 if you decide to do 150. So another thing you can stick in the adapters is a way of actually multiplexing your samples, which means you're pooling lots and lots of samples together so you can run them all at once instead of just running one sample at a time. What is done is there's a unique sequence added to the adapter that indicates what sample that sequence came from, so during the library prep you add these unique sequences on, and that way, when you mix them all up and have all of your 
library prep done, and then do your sequencing, after you have your data you have this sequence that identifies, say, that the blues came from sample one and the pinks came from sample two, and then on the bioinformatics side you can just go in and pull out which reads came from which sample. This is a very common technique at this point as well. Paired-end would be better for targeted? I think paired-end is always better, right, if you can afford it; that is always the limitation with these things, right? Is that your question? Yeah, paired-end is always better if you can swing it. So I did want to touch on the two other sequencing platforms that are not as commonly used, but I think they're going to become more and more present, so I wanted to talk about them a little bit. There's PacBio, which uses what's called the SMRT method, which can produce really long reads. The limitation with Illumina, as we mentioned, is that it will only go up to about 200 base pairs, so when you're aligning and trying to figure out where those sequences belong on the genome, you don't have a lot of information, you only have those base pairs. But if you have, you know, 40 kb, then you have a lot of information and you can do de novo sequencing. Say you have an unknown species that you know nothing about and you don't have a genome for: you can gather these long reads and then do de novo assembly and figure out the actual reference genome of a new species. That's where these platforms typically come in. I'm not going to go into all the methods behind this; I did include, if you're interested, a paper that goes into a lot of the details on it. But it doesn't require this pause between the reading steps: in Illumina, you know, you have the fluors added on and you have to take a picture and you have to take the fluor off and then you put 
a new one on and you take a picture, so there's this pause step every time that you add a new one on. In this method that does not happen, so it's a lot faster. Unfortunately, though, it has a very high error rate; 15 percent is pretty bad. I think they're likely working on lowering this, so it's something to keep in mind, but it is still really good for de novo assembly of small genomes and structural variant identification, and you can actually directly identify epigenetic modifications without having to do a completely different type of omics analysis, which is very cool. And then the other one, which is actually quite similar in terms of some of its benefits, is the Oxford MinION, which again is this USB-stick sequencer, and it can do really ultra-long read lengths, up to hundreds of kb. The way they do this is there's a protein pore, actually a mutant of an E. coli CsgG protein, and they figured out that if you put this in a membrane, you can actually read the DNA strand through this pore: depending on which base is coming through the pore, it disrupts the current in a specific way that you can read out electronically. I think this is super cool. The problem that I've heard from people who are more involved in the actual instrumentation side of the field is that it goes too fast; the reason why there are high error rates is that the strand is moving through too quickly, and they're trying to figure out how to slow it down. From what I've read, it has gotten a lot better, and there was a recent paper, which I included here, that came out this year and actually used this to assemble a human reference genome. It was a proof-of-concept paper to show that you could use this to do some of the things that we've previously just used Illumina for, so I think that this is a very exciting development in next-gen sequencing, 
especially because you can just hold it and take it places and sequence, I guess, whatever you want to. So, there are a lot of different file formats that we'll talk about, just because, you know, I sit on the data analysis side of most of this, so I don't typically do the sequencing myself; I work with people who do sequencing, and I work with the genomics technology center at NYU a lot, so I'm familiar with what they do, but I typically just get the data after; that's what I prefer. So I wanted to talk a little bit about file formats. The raw data is typically in what's called FASTQ format; there's the GenBank format and the SRA; there are alignment formats, which are SAMs and BAMs; there are the genome browser formats; and then there are genomic variant formats, and we'll talk about each of these in some level of detail. The FASTQ: has anyone seen a FASTQ? Who's seen a FASTQ? Okay, some people. So this is an example of what it looks like. It starts with information on the actual instrument and then the run number, and then it has information on the flow cell. This is an Illumina FASTQ; it depends on the instrument, but I'm showing you the Illumina FASTQ because I think that's, again, the most common for cancer genomics. So there's the flow cell ID, the lane, and the tile; you can see here that the flow cell contains eight lanes, and each lane has two columns of tiles, so you get exactly where the sequence was in terms of the flow cell coordinates, and you also get an x and a y coordinate of the cluster on the tile. You get a read number, so if it's single-end reads you'll only have a 1; if it's paired-end you'll have a 1 and a 2. If it's filtered out for quality reasons you get a Y here; if it's kept you get an N. Then there's a sample number, then there's the actual sequence, and then you get this quality score, 
which is an encoded score, using ASCII characters to represent the quality of each of the bases; it ranges from 0 to 40 in terms of quality, and there's a key that goes along with the Illumina documentation that shows exactly what each of these characters corresponds to. So, in today's lecture you got an overview of genomic technologies, especially their relevance in the context of cancer, and how they can provide a broader view for understanding any disease or clinical condition. You also learned why, in genomic sequencing, paired-end reads are important compared to single-end reads, and how they can affect the accuracy of sequencing. You also learned how gene sequencing has evolved over the years: not only have we achieved much higher sequencing accuracy in a much shorter time frame and in a very cost-effective manner, but the instruments have also become much more miniaturized, as one can see in examples like the Oxford Nanopore Technologies MinION, which is very easy to carry and transport anywhere as per the requirements. Dr. Kelly has also familiarized you with the availability of data sets and how to obtain different types of raw files and formats for further analysis. Let's continue to the next lecture in the same flow of genomics, where Dr. Kelly is going to talk to you about sequence alignment and the various factors which affect it. Thank you.
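As a supplement to the FASTQ discussion in this lecture, here is a minimal sketch of one Illumina-style FASTQ record and of decoding its Phred+33 quality string. The record itself is invented for illustration; real headers follow the pattern instrument:run:flowcell:lane:tile:x:y, then read:filter:control:sample, as described above.

```python
# One FASTQ record is four lines: header, bases, separator, quality string.
# This record is made up for illustration.
record = [
    "@M00001:25:FCX123:1:1101:15589:1332 1:N:0:1",  # header (read 1, N = kept)
    "GATTACAGATTACA",                               # the called bases
    "+",                                            # separator line
    "IIIIIIIIIIIII!",                               # one quality char per base
]

header, seq, _, qual = record
assert len(seq) == len(qual)  # one quality character per base

# Phred+33 encoding: quality = ASCII code of the character minus 33,
# so '!' (ASCII 33) is quality 0 and 'I' (ASCII 73) is quality 40.
phred = [ord(c) - 33 for c in qual]
print(phred[0])    # 'I' decodes to 40
print(phred[-1])   # '!' decodes to 0
```

A Phred quality of Q means the base caller estimates an error probability of 10^(-Q/10), so Q40 is a 1-in-10,000 chance that the base call is wrong.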