Welcome to the MOOC course on Introduction to Proteogenomics. Today we are going to talk about next-generation sequencing, sequencing by synthesis. In this session, Dr. Arthi Desai will give you a glimpse of BaseSpace Sequence Hub, which is Illumina's cloud computing environment for analyzing sequencing datasets. She will mainly focus on how to search public data, import data into your personal dashboard, and create your own project. Dr. Arthi will also explain how data is shared among users and how this hub plays a big role as a collaborative platform. Finally, she will show you how to launch an analysis in BaseSpace with the help of the many applications already available there. So let us welcome Dr. Arthi for today's lecture.

So now we can actually move on to... Clicking on the link will take you to the BaseSpace login page. Log in, or create an account by clicking "Register now". This will give you instant access. The first time you log in to BaseSpace you may need to read and accept the user agreement. After logging in you will be taken to the tumor-normal BaseSpace project page. A project is a container for various analysis files. These include samples, which are the FASTQ files that result from runs, and output files from running analyses, which are stored under app sessions. In contrast, a run in BaseSpace corresponds to data from sequencing a single flow cell on an instrument. The project overview page contains a brief description of the dataset. You can navigate around using the quick launch buttons, the named buttons on the left tab, or the direct links. These all link to the same place. The collaborators button indicates who the run has been shared with. If you are the owner of a run or project, you can change the name and transfer runs or projects to collaborators. The share button allows you to share your projects. Note that we cannot share this project since we are not the project owner.
You can also access the BaseSpace blog from this page. The blog provides useful updates and information on BaseSpace. There are three samples for this dataset. HCC1187C is the cancer sample from the breast ductal carcinoma cell line. HCC1187BL is the matching lymphoblastoid cell line from the same individual, representing a normal sample. HCC1187Somatic represents the subtracted data: the normal and tumor samples are compared and matching sequences removed, leaving only the tumor-normal differences. Inside the sample hyperlinks you will find FASTQ files. These will usually be full FASTQ files, but in this case empty FASTQ files were uploaded as placeholders. You will also find information about the samples, including the read length, the total number of reads, and whether this was a single-read or paired-end run. Inside the app sessions there is a folder for each analysis performed. Here they are named Illumina's Uploader, as they were uploaded directly by Illumina. The naming convention should be more informative in the future, but here the sample names correspond to the app sessions, so you can use the sample names as a guide. Clicking on the hyperlinks will also show you the folder name. Inside the cancer (HCC1187C) and normal (HCC1187BL) folders you will find BAM files, VCF and genome VCF files, and there will usually be a summary report for the analysis. Inside the somatic folder are VCF files from the subtracted data. These show the variants that differ between the tumor and normal data. There is also the somatic summary report. You can open the somatic summary report within BaseSpace or download it directly to your computer.

I hope everybody has a BaseSpace account and you are able to log into your BaseSpace account. Go to Public Data and search for TruSight Cancer. Go to Public Data here; TruSight Cancer is there on your screen. Okay, TruSight, spelled T-R-U-S-I-G-H-T, Cancer. It is the second project, the MiniSeq TruSight Cancer one.
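The tumor-normal subtraction described for the HCC1187Somatic sample can be pictured as a set difference between two variant lists. The sketch below is only an illustration of that idea: the variant records are invented, the function name `subtract_normal` is hypothetical, and real somatic callers use statistical models rather than exact matching.

```python
# Hypothetical variant records: (chromosome, position, reference allele, alternate allele).
# These values are invented for illustration and are not taken from the HCC1187 dataset.
tumor_variants = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 2500, "C", "T"),
    ("chr7", 4200, "G", "A"),
}
normal_variants = {
    ("chr1", 1000, "A", "G"),  # also present in the matched normal, so not tumor-specific
    ("chr9", 8800, "T", "C"),
}


def subtract_normal(tumor, normal):
    """Return variants seen in the tumor sample but not in the matched normal."""
    return sorted(tumor - normal)


somatic_candidates = subtract_normal(tumor_variants, normal_variants)
for chrom, pos, ref, alt in somatic_candidates:
    print(f"{chrom}\t{pos}\t{ref}>{alt}")
```

Only the variants unique to the tumor sample remain, which is the sense in which the somatic VCF holds "the tumor-normal differences".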
So do you have this MiniSeq TruSight Cancer project? You all saw that? Okay, click on that. When you click on it, it will expand and you will see two options there: one is Run and one is Project. What I want us to do is import the project, so click on Import. When you click on Import, did you guys see Import, or shall I go back? Everybody with me so far? Okay. So when you click on Import you will get a pop-up that says, you know, "I want to share a project with you". Accept it. Once you accept it, you should see in your notifications that you now have this project, MiniSeq TruSight Cancer. What I want you to do is click on this notification that says MiniSeq TruSight Cancer. Once you do, go to the Samples tab; there is a tab here that says Samples. Take one of the samples, the second-last one, which is HD701 rep 1. This is a reference sample from Horizon, a company that provides reference RNA and DNA samples so that you can check the quality of the data, the quality of the panels that you have developed, and so on and so forth.
So it is essentially QC data, right, but it is very easy to look at, which is why I wanted to show this to you. So you select this and do Copy. Do you guys have any projects? No? Okay, sorry, my bad, I missed a step. Let us cancel this and go back to Projects. Is everybody in Projects? So just create a project and name it whatever you want. I am just going to call this "IITB demo", very creative. It will create a project. Do you now have a project? Yes? Okay, now let us go back to the dashboard and again click on the MiniSeq TruSight Cancer project and go to Samples. Sorry, okay, I will go step by step. So you created a project, no? Okay, do you see this Projects? If you click on that, there is a file icon which says New Project. Click on that and give a name to the project. Everybody has created the project? Okay, so let us go back to the dashboard, into the TruSight Cancer project, go to Samples, and go to this HD701 rep 1; that is the sample that we want to copy. If you scroll to your right, on the top there is a menu option that says "Copy To". Now you should have a project, the one that you just created, so copy the sample there. The reason we are doing this is that I wanted to show you how data sharing works on BaseSpace. Like I said, this is not a very critical function in the application; the only thing we wanted to show you is how you can share data across different users, because this is meant to be a collaborative platform. If there are any public datasets that you want to analyze, this is how you would import those data into your own workspace so that you can run your own analysis. That really was the crux. What else I wanted to show you was how to launch an app, okay, so
let's say now, just for the sake of time, I am going to copy this one particular file onto the project that I have just created. Let us go back to Projects. In whatever project you have just created, if you were able to copy that particular sample, it should be there. I think I have copied it twice, so you see it two times. But if you have been able to create the project, select the sample, and copy it to the project that you created, you should be able to see a sample in your project. What I wanted to show you now is launching an analysis: once you have your data in BaseSpace, how do you launch an analysis? I just said that there are multiple apps, or applications, available on BaseSpace. This particular data was generated for a panel known as TruSight Cancer, which is available from Illumina. It is a panel that is designed to detect germline mutations. So what we are going to do now is, once I select the sample that I want to run the analysis for, I am going to click on this Launch App. Everybody, or at least most of you, with me so far? Those of you who have a project and can see this, select the file and click on Launch App. When you click on Launch App, there are multiple apps available, depending again on the analysis that you want to do. Because this is a cancer panel that we have done enrichment on, I can choose an enrichment app. What this will do is run what is known as variant calling: it will call mutations and it will call small indels. And then I can select the enrichment app. So what I did was select a file from my project and click on Launch App, and from Launch App I selected the enrichment app, because the data that I am using for the workshop today is
from a cancer panel, TruSight Cancer, and we want to run an app that will give me variant calls, you know, mutations and small insertions and deletions. And these are again third-party apps. See, the whole idea behind BaseSpace is to make these apps, which are traditionally developed as command-line tools, available to the end users, because if they are in command-line form, unless you are good at some level of scripting it is hard for you to use them. And the majority of the apps that are developed for NGS data analysis are command-line apps, because they are developed by advanced bioinformaticians.

So there is an analysis name, which is a unique name that you want to give to your analysis; that is again your call. Then, where do you want to save your results? By default it will pick up the same project from which you picked the input data files; if you want to save it somewhere else, you can do so. Then select samples: you can go to the project and select any sample that you want to run. If there are BAM files, and BAM files are nothing but aligned data files that are generated by aligners, you can select those. Then there is the reference genome that you want to use for your analysis, because, if you remember from the video I showed you, once you generate the read data, in order to do the analysis you have to map it back to the reference genome, right? In this case we are dealing with human genome data, not an unknown species, so we have a reference genome against which we will map the data files. You can also choose the targeted panel. In this case, as I told you, this is TruSight Cancer data, so what I will do is choose this TruSight Cancer version 1 region file. Essentially this defines for the application the regions in which it should look for variant calls, so it will significantly shorten the time it takes for analysis.
Otherwise, the app may end up scanning the entire genome, which may result in two things. One is extremely long analysis time. Second, because we have repetitive regions and homologous regions in the genome, you may end up with random alignments, which will give you false information. So the target files essentially help you in streamlining your analysis. Now let us hope that it works. Let me check that I have actually done everything that I am supposed to. Okay, I think there are more things. If you continue scrolling down you will see that there are certain other options, for example whether this is for germline or for somatic. I told you earlier that the TruSight Cancer panel is meant for germline variant detection, so that is what I will use. We are going to leave all the other parameters as default. There are many, many options, because most of these algorithms come with multiple options that you can use to tweak the way the analysis is performed and the kind of results that are reported. We recommend that our new users, basic users, use the default analysis; as you become more comfortable with the data that you are handling, you can play around with the parameter options. I think I have done everything, yep, so I am going to launch the application. When I do this, the application should take the input files that I have given and start running the analysis. This analysis will take a little bit of time. You will get a small pop-up; accept it. Now you should have one more notification saying that you have accepted data from one of my colleagues. Let us look at the first analysis, so just click on it. You will see four analyses once you put in that address. So this is essentially the output of the analysis that we just started.
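The role of the target region file discussed above can be illustrated with a small sketch that keeps only candidate positions falling inside the targeted intervals. Everything here is an assumption made for illustration: the interval coordinates are invented, the BED-like 0-based half-open convention is assumed, and the helper `in_target` is hypothetical; the enrichment app's internal logic is not described in the lecture.

```python
from bisect import bisect_right

# Hypothetical target regions: chromosome -> sorted, non-overlapping (start, end) intervals.
# Coordinates are 0-based and half-open, as in BED files; the values are invented.
targets = {
    "chr17": [(100, 200), (500, 650)],
    "chr13": [(1000, 1200)],
}


def in_target(chrom, pos, targets):
    """Return True if pos lies inside one of the target intervals on chrom."""
    intervals = targets.get(chrom, [])
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


# Hypothetical candidate variant positions, some on target and some off target.
candidates = [("chr17", 150), ("chr17", 300), ("chr13", 1199), ("chr13", 1200), ("chr1", 50)]
on_target = [c for c in candidates if in_target(c[0], c[1], targets)]
print(on_target)
```

Restricting the search this way is what shortens the analysis and avoids reporting calls from repetitive or homologous regions outside the panel.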
What this is showing you is some metrics of the data that we have analyzed. You can see that for this particular sample more than 98% of the reads aligned back to the reference genome, which means the majority of the read data generated for this sample is usable. It gives you information like read-level enrichment, base-level enrichment, and target-level enrichment. These are all really quality metrics that you want to use to make sure of two things: one, that your read data is of high quality, and two, that you have very specific sequence data. If you see here, the target-level enrichment is close to 100%, 99.77%, which means that the majority of the sequences that you have are from your target region. If this number is low, it means that you have off-target sequence data in your file, so there may be something wrong with the way the data was generated, or something wrong in the way the library was prepared, and so on and so forth. You can use this to make sure that your workflow was error-free, so to speak. What it will also give you is the number of SNVs, structural variations, sorry, SNVs, that is single nucleotide variations, or one-base changes, that were identified in this particular sample. It will also give you the number of indels, that is insertions and deletions, that were called in this sample. It gives you a coverage summary and also the depth of coverage in the targeted region. What you see here on the horizontal axis is the depth of sequencing coverage. I think Mukesh talked about the x coverage, right? When you are doing next-generation sequencing you are actually sequencing (just two minutes) every base multiple times, and then depending on the application you are running, your depth of
sequencing can be as low as 30x. As Mukesh mentioned, for whole genome sequencing we are really looking at 30x sequencing. For somatic mutations you may want to sequence as deep as 5000x or 10000x, depending on the frequency of the mutation. If it is a rare variant, and in cancer samples especially, if it is a heterogeneous tissue sample, you may have to sequence deeply. Liquid biopsy is another example where you have to sequence deeply, because you are really trying to identify cell-free cancer DNA or CTCs in your blood, which will also have DNA from your normal cells. So the depth of sequencing in this case is very high. The report also gives the median fragment length and so on and so forth, so you are really getting a lot of metrics. You can also see the specific mutations that are called; that is not shown here, but if you download some of this data you will be able to see the specific variations that are called.

So we have looked at NGS data generated on the Illumina platform, we have seen how we can share data amongst our collaborators, run some analysis, and looked at what the output may look like depending on the application that you have run. Since you all now have BaseSpace accounts, there are many public datasets that are available. If you go to the Public Data section on BaseSpace you will see that there are thousands of datasets available, so you can look at those in your free time and reach out to us if you have any questions. Hope you guys found this session useful. Thank you.

I hope today's session by Dr.
Aarti Desai was really informative, where you learnt how BaseSpace can be used for sequencing data analysis and data sharing, and what the data output looks like. She also showed that this platform is a collection of multiple applications and demonstrated a single application, variant calling. She gave detailed information on the parameters, on launching the app, and on how to run it. I suggest you play with the other applications available there; you will find them really interesting. As Dr. Desai also told you, you will learn more when you play with the parameters of each application on a new dataset. I would like to mention that there is a large amount of publicly accessible data available on various portals, such as The Cancer Genome Atlas, or TCGA, and there are various ways one could download those datasets and, by using these tools, try to understand and analyze the data. So the more you play with these tools and become familiar with the software features, the more you can make use of the large amount of publicly accessible data from various large genome sequencing projects. In the next supplementary lecture we will learn about Ion Torrent, a benchtop next-generation sequencing technology, from Dr. Atima Agarwal. Thank you.
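For readers who want to connect the quality metrics and coverage depths from the lecture to simple arithmetic, the sketch below computes a percent-aligned figure, an on-target percentage, a mean depth, and the expected number of reads supporting a variant at a given depth. All numbers are invented, the function names are hypothetical, and the exact definitions the enrichment app uses for read-level, base-level, and target-level enrichment are not given in the lecture, so the formulas here are assumptions.

```python
def percent(part, whole):
    """Return part as a percentage of whole (0.0 if whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


# Hypothetical read counts, invented for illustration.
total_reads = 1_000_000
aligned_reads = 985_000
on_target_reads = 960_000

pct_aligned = percent(aligned_reads, total_reads)        # share of reads that mapped
pct_on_target = percent(on_target_reads, aligned_reads)  # assumed on-target definition

# Hypothetical per-base depths across a tiny target region.
depths = [120, 95, 30, 0, 210, 180]
mean_depth = sum(depths) / len(depths)


def fraction_covered(depths, min_depth):
    """Fraction of target bases sequenced at least min_depth times."""
    return sum(d >= min_depth for d in depths) / len(depths)


def expected_variant_reads(depth, allele_fraction):
    """Expected number of reads carrying a variant present at allele_fraction."""
    return depth * allele_fraction


print(f"aligned: {pct_aligned:.1f}%  on target: {pct_on_target:.1f}%")
print(f"mean depth: {mean_depth:.1f}x  bases at >=100x: {fraction_covered(depths, 100):.0%}")
for depth in (30, 5000):
    print(f"{depth}x at 1% allele fraction: {expected_variant_reads(depth, 0.01):.1f} reads")
```

The last loop shows why rare somatic variants call for deep sequencing: at 30x, a variant present in 1% of the DNA is expected in fewer than one read, while at 5000x it is expected in about fifty.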