Welcome to the MOOC course on Introduction to Proteogenomics. In the last lecture, our guest scientist Dr. Ratna Thangudu began giving you an overview of large-scale data sciences. Today, he is going to continue his talk, focusing mainly on the Clinical Proteomic Tumor Analysis Consortium, or CPTAC. I think it is really important to know that there are publicly available resources which share large amounts of data. TCGA, The Cancer Genome Atlas from the National Cancer Institute, was one such initiative, and it has really provided large data sets to the broad community. The entire genomes of thousands of patients, across various tumor types, were sequenced and the data was made publicly available. It was interesting to note that although the original data was published in Nature, many people then started probing specific questions using the same data sets, for example what the effect of different genes might be on survival; they did meta-analyses and published papers based on that. So, not only were the original data-generating papers published, but many associated papers came out based solely on the data available in the public repository. The Cancer Genome Atlas really made a huge impact on the broad community by sharing its data and making it publicly available.

Similarly, CPTAC provides a very good resource for the community to look at proteomic data for different tumor types. Dr. Ratna will talk about the Common Data Analysis Pipeline of CPTAC and about The Cancer Imaging Archive. Various omics data repositories, such as The Cancer Imaging Archive for imaging data on various cancers, the NCI GDC portal for genomic data, as well as dbGaP, are good repositories for obtaining the large data sets already generated by very carefully done experiments using next-generation sequencing and mass spectrometry. Huge resources have already been put into obtaining these data, and now the data is made available to the public for further analysis. Dr. Ratna will enlighten you with more information on how these repositories and resources can be accessed, what kinds of features they offer, and how you can make use of them for your own research, whether to analyze the data directly or to use it for your various comparisons. So, let us welcome Dr. Ratna for his second lecture on Big Data Sciences.

The interesting thing with the CPTAC program is that we release all of the data even before the publications come out, and that is a pretty interesting way to look at the data. If you are generating the data, most of the time you are already worried that if you put your data out there, somebody else will publish first; but that attitude actually hinders the progress of research, in the way the government sees it. So, there are guidelines put forth saying: we will give you the data, but there is an embargo date. I am just showing that we are pretty active; every other month we will release a study. We harmonize all the studies, and at least the top few listed here do not have any publications yet. The groups are working on them: Dr. Mani is working on one publication, Dr. David Fenyö is working on another right now, and Bing Zhang is working on yet another. So, there is something called an embargo date.
The expectation is that all the data is free and we expect you to use it, but you wait until that date before you publish, because that way we are giving the data generators from the consortium credit to publish first.

This is the common data analysis pipeline that we run. For every piece of data that comes into the CPTAC data portal, we have an Amazon-based, Galaxy-based infrastructure to run a pipeline on it. As you can see here, it is an MS-GF+ search, followed by a custom-built protein report summary generation. At the end, the files I was pointing out earlier are produced: the identified peptides, the identified proteins, the quantitation data, and also the study sample consistency report, which is basically the QC metrics. So, there is a lot of information out there; without downloading any of the raw files you can just use this information, but if you are not satisfied with the results you can always get the raw files and regenerate them.

Another thing you will appreciate, if you notice, is that all of the mass spectrometry files coming from the instrument are proprietary. You need to have a Windows machine, and there are a lot of complications in converting them into open formats, whereas most pipelines run very smoothly and seamlessly in a Linux environment. So, there is a bottleneck in that we have to do the first steps in the Windows environment. What we do is follow open standards and convert all of the raw data into mzML. mzML is a re-representation of the proprietary formats in an open data format, so all of that information is available as mzML too. All the other information, such as the peptide-spectrum matches, we also convert into the open formats, mzML, mzIdentML, and so on, wherever possible.

Once you are on a particular study page (you will see a lot of studies listed, 43 of them, and you can click on any study to go there), the first thing to know is that, because it is all cancer-related and patient-derived, we take the utmost care to curate all of the data: the clinical information and also the experimental design. We put all that information in simple Excel files that you can just download, but they carry a lot of information that connects you to the files in there. We also have a metadata packet, which has all the protocol information, and we have the protein reports that I mentioned earlier; they are all packaged into one single packet that you can download. You can just download those two, for example, and not need anything else; the ones at the bottom are the raw files. And as you might have noticed, the names carry a version suffix like r1. What that means is that we version all of the protein report results. Sometimes we find something interesting that requires updating the pipeline itself; then we rerun the pipeline and report the results as a new version, so you can always trace back what changed.
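To make the raw-to-mzML conversion step above concrete, here is a minimal, hedged sketch. It assumes ProteoWizard's msconvert is on the PATH (natively on Windows, or through its Docker image elsewhere) and that the pyteomics package is installed; the file names are hypothetical.

```python
# Minimal sketch: re-represent a vendor .raw file as open mzML, then peek at
# the converted spectra with pyteomics. File names are hypothetical.
import os
import subprocess

from pyteomics import mzml

raw_file = "sample01.raw"   # hypothetical file straight from the instrument
out_dir = "converted"
os.makedirs(out_dir, exist_ok=True)

# msconvert (ProteoWizard) writes sample01.mzML into the output directory.
subprocess.run(["msconvert", raw_file, "--mzML", "-o", out_dir], check=True)

# Each spectrum comes back as a plain dict with peak arrays.
with mzml.read(os.path.join(out_dir, "sample01.mzML")) as reader:
    for spectrum in reader:
        print(spectrum["id"], len(spectrum["m/z array"]), "peaks")
        break  # just inspect the first spectrum
```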
So, I talked a lot about all this mass spectrometry data, but where is all the other omics data: where is the genomics data, where is the clinical data, where is the imaging data? CPTAC is producing all of it, so where is it sitting? Let us see. The imaging data sets sit in The Cancer Imaging Archive, and the genomic data sits in the GDC data portal. The other part of the genomic data, the SNP array data, sits in dbGaP, the database of Genotypes and Phenotypes, and the proteomic data sits here. Excellent. So, we have data from one single patient sitting in four different places. How do you connect all of them? We are the ones generating the data, and it is extremely difficult even for me to connect all the dots; that is not helpful, and a lot of information is being lost. So, what to do?

Dr. Henry mentioned the precision medicine initiative that came to light in the last four years, Joe Biden's Cancer Moonshot initiative. As part of that, the National Cancer Institute is developing a Cancer Research Data Commons. It is a big ecosystem, a theoretical ecosystem, where all the repositories (these tags are basically kinds of repositories: genomics, proteomics, imaging, and so on) coexist together. Physically they are at different locations on different servers, but in the ecosystem they are together, and the ecosystem provides analysis across all of the information. Earlier I talked only about the common data analysis pipeline for proteomic data, but the expectation is that we will also have tools to analyze the proteogenomic component. Then we will have data models and dictionaries to represent the data, so that when I call up a patient, I am calling that patient from all the different resources at one time. Then we have visualization and quality features; basically, if you go to any portal, you will find all these features.

The kind of users we are expecting is a wide variety: patients, clinicians, computer scientists, tool developers, and biomedical researchers. Everyone's expertise is different and everyone's expectations are different, so an ecosystem that tries to support so many different kinds of users is exactly what is needed.

Then there are the cloud resources. Suppose I have all this data sitting there, but I want to analyze the data myself: there are a hundred data sets sitting there, I will pick and choose three or four of them, and I want to analyze them by myself. How will I do it? That is where the cloud resources come in. Dr. Mani talked about FireCloud the other day; FireCloud is one such resource. You just need a login, and you do not need anything other than your credit card. You go there and log in; the tools exist there as pipelines, and the data sits there too, so everything is in one place. You pick a pipeline, attach your data, and you are on the pipeline; that is it, you click a button. The only thing is that you have to understand what you are doing, but all the tools are available. That is the vision NCI has for this research data commons.

One piece already exists: the NCI Genomic Data Commons, which came to life about three years ago. Most of you might have heard about TCGA, The Cancer Genome Atlas; all of its data used to be available in a TCGA data portal that was very specific to TCGA. But now there are so many programs coming up that NCI thought all of this genomic data has to be brought together in one place: not a program-specific resource, but a common resource holding all the genomic data. That is what they did, and right now there are forty-plus programs there, so there is a huge amount of data. It is a free resource, no login required, and you can see there are already so many different cancer types available.
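As a hedged illustration of what that no-login access looks like in practice, here is a minimal sketch that lists a few GDC projects through the commons' public REST API. It assumes only the requests package; the requested fields follow the documented API but should be checked against the current GDC API reference.

```python
# Minimal sketch: list a few NCI GDC projects via the public REST API,
# fetching only metadata (no data files are downloaded, no login needed).
import requests

resp = requests.get(
    "https://api.gdc.cancer.gov/projects",
    params={"size": "5", "fields": "project_id,name,primary_site"},
)
resp.raise_for_status()

# The GDC wraps results in a data/hits envelope.
for hit in resp.json()["data"]["hits"]:
    print(hit["project_id"], "-", hit.get("name", ""))
```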
So, this is all about the genomic data, and then I mentioned the NCI cloud resources. Dr. Mani already talked about the Broad's FireCloud. The cloud here is a public cloud: somebody else is offering you services, so you do not have to do anything yourself, and you do not need to have a cluster. On clusters, yesterday you were asking me: I have this much data and no disk space, what do I do? So, I buy a new disk, attach it, reconfigure everything, and ten days later more samples come in and I generate more data. What do you do then? There is no solution there. So, people are slowly moving towards public clouds such as Amazon Web Services and Google Cloud. Anyone can have a free account; you can log in and explore. The good thing is that you do not need a lot of informatics expertise to begin with; do not be scared, just go out there and see what is available. A lot of it is offered as services: you want to do something, you click a button, and you pay a minimal price for it.

So, the Broad Institute has FireCloud, the Institute for Systems Biology has its Cancer Genomics Cloud, and then there is Seven Bridges. Seven Bridges is a private company, a commercial partner with NCI in developing these resources, and they built theirs on AWS. FireCloud and Seven Bridges are the two resources I would recommend you take a look at. The Institute for Systems Biology took more of a programmer-centric approach: they made all the data available in a certain way so that you can access it through APIs, programmatically. You need a little expertise to understand that, but they do have some kind of a UI where you can look at the results.

Now, the Genomic Data Commons and the cloud resources are the only things from the ecosystem I showed that exist as of today. Last year, NCI thought about how all the CPTAC data was sitting in its own specific resource, just the way the TCGA data used to. So, what do we do about that? That is why they came up with the idea: why not have a Proteomic Data Commons? There are more programs coming out; it is not just CPTAC, but also APOLLO and ICPC, which I will talk about a little later and which Dr. Henry mentioned. So, we need a similar resource, but we were asked to combine both the GDC-style portal and the cloud resources, so that we have the data and also the analysis tools in one place for the proteomic data. That is pretty ambitious: the proteomic node needs to do both of the things that the GDC and the cloud resources do, together.

The very high-level goals of the PDC, the Proteomic Data Commons, are these. Unsilo the mass spectrometry data: everybody is storing it in their own local spaces, and we say do not do that, share it publicly. And move away from the situation where people move the data to their local tools. I am telling you the same thing: do not download the data; bring your tool to the data. The data is so humongous that it is not just about your local storage, it is about how you transfer the data. Does IIT's network allow you to transfer that much data? I do not think so; it is not worth it. So, instead, you bring your tool to the cloud, go there, and analyze.
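To make "bring the tool to the data" concrete in the programmer-centric, API-driven style the Institute for Systems Biology resource takes, here is a minimal sketch where the query executes in the cloud, next to the data, and only a small aggregate travels back over the network. It assumes the google-cloud-bigquery package and a Google Cloud account with billing set up; the table and column names are illustrative stand-ins, so check the current ISB-CGC documentation.

```python
# Minimal sketch: run an aggregate query where the data lives, instead of
# downloading the data set. Table and column names are illustrative only.
from google.cloud import bigquery

client = bigquery.Client()  # picks up your Google Cloud credentials

query = """
    SELECT project_short_name, COUNT(*) AS n_cases
    FROM `isb-cgc-bq.TCGA.clinical_gdc_current`  -- illustrative table id
    GROUP BY project_short_name
    ORDER BY n_cases DESC
    LIMIT 5
"""

# The computation happens on the cloud side; only these few rows come back.
for row in client.query(query).result():
    print(row.project_short_name, row.n_cases)
```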
Then there is an interesting thing: a shift from a data graveyard model to a data workspace model. What this means is that in the graveyard model, the one I talked about earlier, you deposit your data after the life cycle of your research is over. You are dumping it somewhere, saying "I am done with this, I do not want it anymore," and nobody looks at that kind of data. The workspace model that we are proposing with the Proteomic Data Commons is that you connect your instrument directly to a workspace on the PDC, so the data moves straight from the instrument to the workspace. There, when you are ready to analyze the data, you attach all the metadata, the samples and the study design, choose which tools you want to use, click a button, and you are on the pipeline.

A question came up here: at present it very much works like you first store the data on the cloud, then you download it to analyze it, and then upload it back, which takes you back to the same problem. So, has any initiative been taken to encourage software developers to build cloud systems where the analysis can also be done, so that there is no need for a physical drive to explore the data? Yes, that is exactly what the cloud resources are doing. Whatever tool you develop, you can dockerize it. You may have to learn a little about dockerization, but basically you package not just the tool, but also the environment, the computer system where it runs, and put the whole thing somewhere; there is a registry for this called Dockstore. Once your tool is there, you can take that particular tool to the cloud and run it.

And like I said, these are high-level goals; they are goals, so we are not there yet, and we have a lot of hurdles to cross to reach them. In the PDC we are starting small. We will make a couple of pipelines available initially, and based on the users' interest we will add more. That way we know what the interest in the community is. We cannot build a resource that tries to do so much, hand it to the users, and then find that they are not interested in 90 percent of the stuff; that is not a good use of money, time, or effort. Did I answer your question? Then, we have to improve the metadata annotations and ensure the data is annotated well, following the standards I was talking about.

Just to give you an idea of what kind of data you will see in the future: the CPTAC data that I already talked about; then APOLLO and the International Cancer Proteogenome Consortium, which Dr. Shiva is part of; then the Human Tumor Atlas; and lastly, user-generated data, that is, your data, which you can upload once you start generating it.

So, about six months ago we started building a prototype, which we call the Proteomic Data Commons pilot, or in other words an MVP, a minimum viable product. I talked about a lot of things just now; what we did is build a minimal product with some of the basic features and put it out there. It will not do everything, but it does bits and pieces of everything I talked about, and then we take the feedback from the community and develop on that, rather than developing the whole product, putting it out there, and, like I said, finding that nobody is interested and wasting all our effort. So, the commons will have a data submission system, a portal where you will see all the information, and a workspace, which I talked about, where you can upload your own data and run pipelines; and you will also be able to access that information programmatically, computationally.
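As a hedged sketch of what that programmatic access can look like: the Proteomic Data Commons exposes a public GraphQL endpoint, and a query like the one below asks it for the list of programs. The query and field names follow the PDC's documented examples, but treat them as illustrative and check the current schema before relying on them.

```python
# Minimal sketch: ask the PDC GraphQL endpoint for its list of programs.
# Field names are illustrative; verify them against the live schema.
import requests

PDC_URL = "https://pdc.cancer.gov/graphql"
query = "{ allPrograms { program_id name } }"

resp = requests.post(PDC_URL, json={"query": query})
resp.raise_for_status()

for program in resp.json()["data"]["allPrograms"]:
    print(program["program_id"], "-", program["name"])
```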
Data processing and harmonization here means the common data analysis pipeline that we run on all of the data. What happens within CPTAC is that the data for each of the cancer types is generated by different groups and analyzed with different software and different protocols. So, when we start putting the data out through the portal, we are one single consortium, but we are actually providing results from many different kinds of pipelines. The same holds when you go to the Genomic Data Commons. With both of these portals, what we try to do is called the data harmonization process: any data that comes into these resources, we run through one single common pipeline. It is not an ideal way, but it is one way to get all the information onto a single footing. That is also what PRIDE and ProteomeXchange are doing: you deposit all the data in one resource after your publication is done, and it is run through one common pipeline. When you have so much data, if it is not harmonized with one single pipeline, it is very difficult to make sense of it. That is what I call processing and harmonization. The processing workspace means that you will be able to run the pipeline by yourself. We run the pipeline as a starting step, but say you are not very happy with a result and you want to change some parameters; you should be able to do that. As an example, we will have EncyclopeDIA as a DIA pipeline, alongside the DDA pipeline that I described earlier.

This slide is just showing the software architecture of our system, and you do not have to spend a lot of time here, but to give you an idea: this is called S3, which you can think of as a cloud hard drive where you put all your data. It is scalable; as your data grows, they will make available however much you want, so there is no limitation. The same goes for compute power: with a click of a button you can add as many processors as you want.

And this box is called authorization. The idea behind it is that there are so many portals (I talked about the Genomic Data Commons, the Proteomic Data Commons, imaging, and the cloud pilots), so does the user have to remember all of their user IDs and passwords, and how do these commons talk to each other if there is no single sign-on? The purpose of this box is to say that we will have a single sign-on: once you sign into any of these resources, you will be able to access the data from the other resources seamlessly.
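Purely as a conceptual sketch of that single-sign-on idea: you sign in once, receive a token, and present the same token to every commons. The endpoints and the token below are hypothetical placeholders, not real NCI APIs.

```python
# Conceptual sketch only: one token, issued once at sign-in, reused across
# several data commons. Endpoints and token are hypothetical placeholders.
import requests

token = "eyJhbGciOi..."  # hypothetical token from the single sign-on service
headers = {"Authorization": f"Bearer {token}"}

for service in [
    "https://genomics.example/api/cases",      # stand-in for a genomic commons
    "https://proteomics.example/api/studies",  # stand-in for a proteomic commons
]:
    resp = requests.get(service, headers=headers)
    print(service, "->", resp.status_code)
```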
So, in conclusion, today you have learnt about one of the large initiatives from NCI, the CPTAC, which has contributed immensely to scientific knowledge for cancer research. You also learnt that, to obtain the omics data for a single patient, you currently need to search several different repositories. Hence, the NCI has taken the initiative to build a common data portal, so that you do not have to look through a variety of different lists to search for data, and all the data can be accessed in one place through the Cancer Research Data Commons, or CRDC. You also got a glimpse of how the CRDC offers different features such as visualization, analysis, query, and many more. We have also learnt how cloud platforms, which are freely available to everyone, can be very useful in handling and analyzing big data, or even for meta-analysis. In the next lecture, Dr. Ratna will continue his talk about large-scale data sciences and tell us more about the different publicly available portals. Thank you.