Welcome to the MOOC course on Introduction to Proteogenomics. Next-generation sequencing has seen wide application, especially in clinical settings, so it is a good idea to catch up on the use of new NGS platforms. In this light, we have invited industry experts to give you a hands-on session on how to use the latest NGS technologies. Today Dr. Arthi Desai from Illumina will give you a brief lecture on sequencing technology, especially sequencing by synthesis. She will also introduce how the sequencing technology actually works and talk about two key concepts: sequencing by synthesis and paired-end sequencing. She will show you how to open a BaseSpace account, and we will proceed to the hands-on session in the next lecture. So, let us welcome Dr. Arthi Desai from Illumina.

Before we actually get started with the hands-on session, I wanted to show you guys a short video that recapitulates the Illumina sequencing technology. Mukesh did a great job of explaining all the platforms that we have, the key oncology applications, and the panels that are currently available from Illumina. But we are not sure whether the concepts of reads, read lengths, paired-end sequencing, and depth of sequencing are known to everybody. So, before we start the actual hands-on session, which is going to be short, we would like to give you a brief understanding of how the Illumina sequencing technology actually works. There are plenty of videos available on YouTube; we have picked one that quickly and easily demonstrates how the sequencing technology works.

Sample preparation begins with extracted and purified DNA. The first step in Nextera sample preparation is tagmentation. During tagmentation, transposomes simultaneously fragment and tag the input DNA with adapters.
Once the adapters have been ligated, reduced-cycle amplification adds additional motifs such as the sequencing primer binding sites, indices, and regions that are complementary to the flow cell oligos. Clustering is a process wherein each fragment molecule is isothermally amplified. The flow cell is a glass slide with lanes. Each lane is a channel coated with a lawn composed of two types of oligos. Hybridization is enabled by the first of the two types of oligos on the surface. This oligo is complementary to the adapter region on one of the fragment strands. A polymerase creates a complement of the hybridized fragment. The double-stranded molecule is denatured and the original template is washed away. The strands are clonally amplified through bridge amplification. In this process, the strand folds over and the adapter region hybridizes to the second type of oligo on the flow cell. Polymerases generate the complementary strand, forming a double-stranded bridge. This bridge is denatured, resulting in two single-stranded copies of the molecule that are tethered to the flow cell. The process is then repeated over and over and occurs simultaneously for millions of clusters, resulting in clonal amplification of all the fragments. After bridge amplification, the reverse strands are cleaved and washed off, leaving only the forward strands. The 3' ends are blocked to prevent unwanted priming. Sequencing begins with the extension of the first sequencing primer to produce the first read. With each cycle, four fluorescently tagged nucleotides compete for addition to the growing chain. Only one is incorporated, based on the sequence of the template. After the addition of each nucleotide, the clusters are excited by a light source and a characteristic fluorescent signal is emitted. This proprietary process is called sequencing by synthesis. The number of cycles determines the length of the read. The emission wavelength, along with the signal intensity, determines the base call.
For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel process. This image represents a small fraction of the flow cell. After the completion of the first read, the read product is washed away. In this step, the index 1 read primer is introduced and hybridized to the template. The read is generated in the same way as the first read. After completion of the index read, the read product is washed off and the 3' end of the template is deprotected. The template now folds over and binds the second oligo on the flow cell. Index 2 is read in the same manner as index 1. The index 2 read product is washed off at the completion of this step. Polymerases extend the second flow cell oligo, forming a double-stranded bridge. This double-stranded DNA is then linearized and the 3' ends are blocked. The original forward strand is cleaved off and washed away, leaving the reverse strand. Read 2 begins with the introduction of the read 2 sequencing primer. As with read 1, the sequencing steps are repeated until the desired read length is achieved. The read 2 product is washed away. This entire process generates billions of reads representing all the fragments. Sequences from pooled sample libraries are separated based on the unique indices introduced during sample preparation. For each sample, reads with similar stretches of base calls are locally clustered. Forward and reverse reads are paired, creating contiguous sequences. These contiguous sequences are aligned back to the reference genome for variant identification. The paired-end information is used to resolve ambiguous alignments.

There are two key takeaways from that video. One is the chemistry that is used for generating data on the Illumina sequencers, which is known as sequencing by synthesis. So as you saw, we actually add one base at a time and record that base.
So we are literally reading one base at a time, which is why we have very high accuracy in our data sets. The second one was paired-end sequencing. We are essentially using the same fragment of DNA that we used in our library prep and reading it from two ends: we read it from the forward end and then we read it from the reverse end. That is why a lot of the Illumina data that you will see has two reads for every fragment. They are called R1 and R2, read 1 and read 2. As was alluded to in the video, this again gives you very high confidence in the data that you are generating, primarily because the fragments are short. The fragments are about 150 to 200 base pairs, and when you read them from both ends the two reads overlap. So the chances of reading each base twice are very high, and you again have very high confidence in the data that you generate. And because you are reading the fragment from both ends, it is very useful for detecting things like translocations, deletions, insertions, and so on, because the distance between the two reads is fixed. So every time you map a pair back to the reference genome, if there is any deviation from this fixed length, you can infer that there may be a structural variant in that particular region of the genome. So it is a very powerful way of sequencing genomic DNA, and as you can imagine, Illumina is the leading provider of sequencing technology globally today. More than 90% of the data available in public databases comes from Illumina technology, and this is true not only for research but also very much for clinical applications. We ended the video on the BaseSpace application, so I wanted to take you guys to BaseSpace, if I can.

BaseSpace Sequence Hub is Illumina's cloud-based next-generation sequencing platform that performs automated sample-to-result workflows for your lab.
We recently released a number of new features designed to enhance your laboratory's efficiency, including a new biosample-centric data model that enables easy tracking of all biosample activity from lab preparation through analysis delivery, new automation and quality control features to streamline the efficiency and consistency of your workflows, and an improved interface that helps you access your data and perform functions more quickly. The new biosample-centric data model lets you easily track all biosample activity from lab preparation through sequencing to processing and uploading of data to the cloud and analyzing results. Biosamples support data aggregation: they can be linked to multiple libraries, runs, requeues, and analyses, and can have multiple data sets associated with them. FASTQ data sets have replaced samples and are now stored inside biosamples. Your existing samples have already been converted so that you can use the new file types as inputs when launching apps. BaseSpace Sequence Hub includes new features that allow you to automate analysis workflows, reducing the time it takes to process samples and eliminating costly errors. New features include automatic lane QC settings, automatic data aggregation, automatic app launches, automatic analysis QC settings, and enhanced status tracking. The updated interface provides quick access to all your data from a single place, while the new Action Toolbar contains new and improved app functions like requeues, QC status changes, workflows, and collaboration tools. Access your workgroups and review your compute and storage usage from the account menu. Inline tooltips help you understand what is occurring with your data, and the enhanced filters widget has been added to more places, letting you get to the data you care about more quickly. The API, BaseSpace CLI, and BaseMount tools provide access to your data from a command-line interface and have been improved to facilitate more advanced integration and automation.
With these new enhancements, BaseSpace Sequence Hub takes your work from sample to result more quickly, more efficiently, and with more control than ever before. To learn more about BaseSpace Sequence Hub and all of our new features, please visit www.illumina.com/basespace or contact us at techsupport@illumina.com.

Okay, here it is. So BaseSpace is nothing but a cloud application developed by Illumina. It is hosted on Amazon Web Services (AWS), and it is a free application in the sense that anybody can access it. Some of the apps on BaseSpace are paid, which means that you have to pay for using those apps, but BaseSpace itself is freely available. What I want you guys to do is log in to your BaseSpace account and refer to the handout that was given to you. But before we actually start the hands-on session according to the handout, I wanted to show one very interesting property of Illumina data, something that Mukesh touched upon during his presentation: the quality score, the very high quality of the data that is generated on the Illumina platform. He talked about Q30. Q30 is nothing but a Phred score. Those of you who have used Sanger sequencing will be aware that a Phred score is a quality score assigned to a base call made by any sequencer, and it is nothing but the confidence that the caller has in the base that it has called. So it is a probability. When we say 99.9%, we are 99.9% sure that the base we have called, let us say an A, is really an A. So, as Mukesh said, the error rate at Q30 is 1 in 1000: we are going to be wrong once in every 1000 bases that are read. The read lengths that we generate from our platforms are no more than 600 bases and our error rate is 1 in 1000, so you can imagine that the chances of there being an error in the data that we generate are very small. This is IIT data.
So we had run a project for one of the PIs here, and you guys do not have access to this. I am just going to show you the data because I really wanted to show you the quality of the data that gets generated. On BaseSpace, when you have some time and if you are interested, there are multiple apps available. Apps are nothing but small widgets that are created either by Illumina or by third-party researchers and companies that support data analysis on the Illumina platform. So based on the application you are working on, you can choose the appropriate app and run the analysis. What I wanted to show you today is data generated by an application known as FastQC. FastQC is an open-source application. I forget which university it came out of, but it has been around for at least 8 or 9 years now and is very widely used to evaluate the quality of the data generated by sequencers. The way to read this plot is horizontally, from 1 to 99, so this is a 100-base read. We talked about the size of the fragment and the size of the read that you generate. When we say read, we mean the contiguous output that is generated by the sequencer. So in this particular instance we are looking at 100-base read data. The Illumina platforms can generate reads as short as 36 bases and as long as 600 bases. The Y axis, or rather the vertical axis, shows you the Q score. This is again a measure of the quality of the data generated on the Illumina sequencers, and as you can see we are literally touching the ceiling of the scale. The Phred score here ranges from Q10 to Q40. Q40 corresponds to 99.99% accuracy, a 1 in 10,000 error rate, and Q10 is a 1 in 10 error rate. That is the way to interpret it. So you can see that for the majority of the length of the read, our Q scores are well above Q30.
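The relationship between a Q score and its error rate can be made concrete with a small sketch. It uses the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. The function names are illustrative only, and the offset of 33 used to decode a FASTQ quality line is an assumption (Phred+33 encoding); neither comes from the lecture itself.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Inverse mapping: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_line, offset=33):
    """Turn a FASTQ quality string into per-base Phred scores.

    Assumes Phred+33 encoding, where each character's ASCII code
    minus 33 is the quality score of the corresponding base.
    """
    return [ord(ch) - offset for ch in quality_line]


def expected_errors(qualities):
    """Expected number of wrong base calls in a read with these scores."""
    return sum(phred_to_error_probability(q) for q in qualities)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Q{q}: 1 error in {round(1 / p)} bases")

    # A hypothetical 100-base read with every base at Q30.
    print("Expected errors in a 100-base Q30 read:", expected_errors([30] * 100))
```

Under these assumptions, a read whose bases all sit at Q30 is expected to contain well under one miscalled base, which is the intuition behind treating Q30 as a working threshold.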
Having Q scores above Q30 across the read essentially means that practically all the data you are generating is usable. You do not have to throw out or filter out any data because it is low quality, and this is very critical in clinical applications, primarily because you want to make sure that any data you generate is of high quality. Because what are you looking for when you are generating sequence data? You are looking for differences from the reference genome, which can be in the form of single-base variations. And what are errors in sequencing, generally? They are also single-base variations. So you want to make sure that whatever variations you are calling, on the basis of which you may be making clinical decisions, are accurate. You have to be sure that the base you are calling as a variant is actually a variant. This is where something like this becomes extremely critical, and you can pick up any data. This is actually data that we generated here itself, in-house, or rather for IIT Bombay. These are actual patient samples; these are tumor samples. So these are not even very well maintained cell lines or blood samples, which is where you would pretty much expect to get high-quality data. These are tumor samples. So again you can see that on real biological samples you get very high-quality data.

In today's lecture, we learnt about NGS technology platforms, especially how the Illumina chemistry works. Dr. Desai also talked about the importance of the Phred score and how it gives us an idea of our data quality. She also showed what real data from a biological sample actually looks like and how to read that data. I hope all of you have now opened a BaseSpace account, which is available for free. If not, please open an account and get ready for Dr. Arthi Desai's next hands-on session, which will be based on the BaseSpace account. In the next hands-on session, she will take you on a journey where you can use various data sets, from your own experiments or from publicly available sources, analyze them, and draw meaningful insights from the data. Thank you.
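The paired-end reasoning from the lecture (read 1 and read 2 sit a roughly fixed distance apart, so a pair that maps much closer or farther than expected hints at a structural variant) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the example distances, and the threshold of three median absolute deviations are all hypothetical, and real analyses rely on the mapped distances reported by an aligner and on dedicated structural variant callers.

```python
from statistics import median


def flag_unusual_pairs(insert_sizes, tolerance=3.0):
    """Return indices of read pairs whose mapped distance looks abnormal.

    insert_sizes: distance in bases between the outer ends of R1 and R2
                  after mapping each pair to the reference genome.
    tolerance:    how many median absolute deviations from the typical
                  distance a pair may differ before it is flagged
                  (an arbitrary choice for this sketch).
    """
    typical = median(insert_sizes)
    # The median absolute deviation is not distorted by the very
    # outliers we are trying to find, unlike a standard deviation.
    mad = median(abs(size - typical) for size in insert_sizes)
    spread = mad if mad > 0 else 1.0
    return [
        index
        for index, size in enumerate(insert_sizes)
        if abs(size - typical) > tolerance * spread
    ]


if __name__ == "__main__":
    # Hypothetical mapped distances for nine read pairs.
    sizes = [180, 185, 175, 182, 178, 2500, 181, 179, 40]
    for index in flag_unusual_pairs(sizes):
        print(f"pair {index}: mapped distance {sizes[index]} bases looks unusual")
```

A pair mapping much farther apart than the library's typical fragment size is consistent with a deletion in the sample relative to the reference, and a pair mapping much closer together is consistent with an insertion, though a single unusual pair is never conclusive on its own.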