Welcome to the MOOC course on Introduction to Proteogenomics. We have heard from Dr. Ratna in the last two lectures about large-scale data science. Today, in his last lecture, he is going to continue sharing more about large-scale data science. Dr. Ratna will talk about the Proteomic Data Commons (PDC) and its various dimensions, such as ownership of the data, quality management and the life cycle of the data. He will also talk about the FAIR principles for developing a proteomic data commons. FAIR stands for Findable, Accessible, Interoperable and Reusable. It includes assigning each dataset and patient a UID, a unique identifier which remains the same across the world and hence makes the data reusable. He will also discuss the importance of managing and updating the repository, with unique UIDs and versions recorded along with the reasons for change. So, to understand this in depth, let us now welcome Dr. Ratna for his last lecture.

This is the minimally viable product I talked about, pdc.esacinc.com. This is an alpha program, so it is even before beta, like I said. So, feel free to log in; I will run you through some screenshots of how the portal looks, but feel free to log in and see it yourself. Like I said, this is the minimally viable product. All of the CPTAC data portal datasets that you see, the colorectal, ovarian and breast cancer datasets where previously you could just download the information, all of that has actually been harmonized into the PDC. So you can explore all of those datasets here to a much greater extent.

Just to give you an idea of what goes into the management: there is stewardship, which is about who owns the data. Once you put it there, is the PDC the owner of the data, or do you still own it? That is called stewardship. Then there is data governance: what is the life cycle of the data, and what kind of policies will guide open access to a particular dataset? And then all the other things about standards, processing and quality management are attached to these large-scale programs.

So, we have to represent the data in a certain way, right? If I am getting data from so many different programs, with so many people submitting, how do I represent it other than with a common data model? So I need a conceptual model, and this is ours. Because all of this is cancer-related data to begin with, it is patient-centric; we are not talking about any model organisms at this point, but even data from model organisms can fit here quite easily. So you have a program and a project; a case is basically a patient or a donor who gives the tissue, and from that you get a sample. There is a lot of clinical data attached to a case, so clinical information is attached to the patient; I am showing that part right now. Then you have an aliquot, which is what actually goes into your mass spec. You group a bunch of samples run together as an experimental study, and then you have all of the run metadata, which is nothing but your experimental design, and we capture it. Then you generate the raw files and we run the workflow. When we run the workflow, it generates other information, and every piece of information the software generates, we index, so it is captured in the model. So when you ask which proteins are expressed in a particular aliquot or sample, I have an answer for you, right? That is because we have a model.
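To make the conceptual model concrete, here is a minimal sketch of how those entities and their relationships could be represented in code. The class and field names are illustrative assumptions for teaching purposes, not the actual PDC schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of a PDC-style conceptual model; all names here
# are assumptions for teaching purposes, not the real PDC schema.

@dataclass
class Aliquot:
    aliquot_id: str                 # what actually goes into the mass spec

@dataclass
class Sample:
    sample_id: str
    aliquots: List[Aliquot] = field(default_factory=list)

@dataclass
class Case:
    case_id: str                    # a patient or tissue donor
    clinical_data: dict = field(default_factory=dict)   # diagnosis, demographics, ...
    samples: List[Sample] = field(default_factory=list)

@dataclass
class Study:
    study_id: str                   # an experimental study grouping several runs
    run_metadata: dict = field(default_factory=dict)    # the experimental design
    raw_files: List[str] = field(default_factory=list)
    workflow_outputs: List[str] = field(default_factory=list)  # indexed results

@dataclass
class Project:
    project_id: str
    cases: List[Case] = field(default_factory=list)
    studies: List[Study] = field(default_factory=list)

@dataclass
class Program:
    program_id: str                 # e.g. "CPTAC"
    projects: List[Project] = field(default_factory=list)
```

With the entities linked this way, a question like "which proteins are expressed in this aliquot" becomes a traversal from Program down to the indexed workflow outputs.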
So, we are working hard to actually define this. We tried to fit all of the CPTAC data into this model and it is working quite nicely, but a lot of exceptions have come up. If something comes up that actually needs a change in the model, we have to make that change, but for now it is working all right. Then there is the data dictionary: every node in this graph has a description, so you can go there and try to understand what it is, and to each of these terms we apply standards. For example, diagnosis and demographics here are clinical terms, so we use established standards, for example ICD codes or SNOMED CT terms, to define the disease.

So, I will briefly talk about the FAIR principles. When you are developing a data commons, the guiding principles for such an effort are called FAIR: findable, accessible, interoperable and reusable.

What is findable? Like I said, every piece of data that comes in, whether a patient or a file, is assigned a unique identifier, which we call the UID or global ID, and then we attach a lot of metadata to it. It is unique in the system so that nothing will ever change; even if there is a change, we will version it, recording why it changed and how it changed, so that you can always refer back to that particular patient successfully. Then, at the file level, files often get corrupted when you upload or download them. You do not know whether what you started uploading and what was received at the other end are the same or not: you think you uploaded a file, but it was not fully uploaded and got corrupted. The same thing happens when you download; you do not realize that an error occurred somewhere, and when you start processing those files your software throws an error, but it is very difficult for you to understand why the software is throwing that error in the first place. That is why we record metadata called MD5 or SHA values; these are checksums, a small datum attached to a digital object. You generate the checksum locally on your computer, you upload the file, and you give the receiver this MD5 value. The receiver generates the same value on his computer and compares the value you gave with the value he computed, to check whether they are the same or not. Basically, it confirms that the data transferred end to end completely. All of that information is captured as metadata, so you can find any file or any entity in this model by its unique ID and its attached metadata.
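Here is a minimal sketch of that checksum verification, using only the Python standard library; the file name is a hypothetical example. The submitter computes a digest locally and sends it along with the file, and the receiver recomputes it to confirm the transfer completed end to end.

```python
import hashlib

def file_checksum(path: str, algorithm: str = "md5", chunk_size: int = 8192) -> str:
    """Compute a checksum, reading in chunks so large raw files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Submitter side: compute the digest before uploading and send it as metadata.
expected = file_checksum("run01.raw")        # hypothetical raw file name

# Receiver side: recompute on the received copy and compare.
received = file_checksum("run01.raw")
print("Transfer complete." if received == expected else "File corrupted in transit.")
```

SHA-256 works the same way: pass algorithm="sha256". If the two digests differ, you know the copy is corrupted before any downstream software throws a confusing error.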
Then accessibility: who owns the data and what is its life cycle? When you first generate the data and put it in a system like the PDC, you still own the data until you make it public. So the user has the authority to say, this is when I want to make it public, or, my program requires me to publish in a particular journal, so now I have to make it public. And some of the data can be protected: especially in genomics, we know that all the germline mutations are protected and you cannot just download them. There is a data access committee, and you have to go through that whole cycle to actually get the information.

Then interoperability. Interoperability is basically about controlled vocabulary, how we define the terms: if you are referring to trypsin, are you calling it trypsin the same way every time? It gets confusing, so what people do with controlled vocabularies, which are standards, is assign IDs to the terms, so that you can use the ID instead of the word. Then, programmatically, when you use the ID, we can look up the information and know, okay, this is called trypsin; that lookup information is metadata. With that kind of agreement, if, for example, three different resources, the GDC, the Proteomic Data Commons and the Imaging Data Commons, all use the same terminology to describe common terms, I can use any tool across these resources without much hassle.

And finally, reusable. Reusability comes with the standards I mentioned. For the PDC we use a lot of standards from the Proteomics Standards Initiative. PSI is a special body within HUPO that formulates guidelines for how you represent the data, covering both file formats and the controlled vocabularies that need to be used. It is not as mature as in genomics, but it is slowly getting there; I think in the next several years we will have more structured terms that people will start using. At this time, the vocabulary itself actually has about 100-plus terms you can use, but so far we could only map about 10 of them to the metadata we are collecting.

Just as a general example of why we need standards, I want to show you this; that is the point. Yesterday I came from the US, so I do not need a converter, I just need an adapter. I was looking everywhere and did not find one, but it is okay; when I go back to the hotel I will charge fully, since my computer is not fully charged, but anyway. All these plugs are actually based on standards; they are not random, each country has a different standard that it uses. In such a situation you can have a converter or an adapter as a solution, and that is only possible because those plugs follow standards. And here on the right, I am showing an example of how a date can be represented: you cannot write it just any way, right? It will probably make sense to some of you, but when you give it to a computer, the computer gets confused. So you come up with standards: just as there are converter plugs, there are ISO codes, and you have to represent the date in a certain way. So when you are submitting data to, for example, the PDC or the GDC, we will tell you that if you are putting in a date, it has to look like this. That is a simple example of standards.

All right, so why do we need standards? People who are into bioinformatics will understand this easily: when the same data sits in many different forms, it is a nightmare for them. The idea is, say you just want to compare the protein quantitation data with the gene expression data you got from the GDC. You have two files; why don't we just compare them? But if they are in different formats and call the genes by different names, you have to write parsers and programs, and there is so much effort; you waste all your time.
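The sketch below illustrates both points: resolving a controlled-vocabulary ID to its human-readable term, and writing a date as an unambiguous ISO 8601 string. MS:1001251 is the PSI-MS accession commonly listed for trypsin, but treat the exact IDs here as examples to verify against the current ontology release.

```python
from datetime import date

# A tiny controlled-vocabulary lookup: systems exchange the stable ID,
# and each resolves it to the human-readable term locally.
# Verify these accessions against the current PSI-MS ontology release.
PSI_MS_CV = {
    "MS:1001251": "Trypsin",        # cleavage agent
    "MS:1000584": "mzML format",    # file format
}

def resolve_cv_term(accession: str) -> str:
    return PSI_MS_CV.get(accession, "unknown accession")

print(resolve_cv_term("MS:1001251"))   # -> Trypsin

# ISO 8601 removes the ambiguity between, say, 04/05/2019 and 05/04/2019.
print(date(2019, 5, 4).isoformat())    # -> 2019-05-04
```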
So, I put something here to give an idea of the different resources; this is not exhaustive, just a sample. For example, we use caDSR, the Cancer Data Standards Registry and Repository, for representing all of the clinical data in the PDC and GDC; as I mentioned, every clinical term has an entry in caDSR. And for next-generation sequencing, we all know there are some very well-adopted formats: FASTQ, BAM, VCF, FASTA and so on.

Then proteomics standards; like I mentioned, there are a few already. Some of you might have heard of MIAME, minimum information about a microarray experiment, which came out several years ago. It is a recommendation, not something forced on you: when you submit microarray data to, say, GEO or dbGaP, this is the minimum information you should provide. Because it is just a recommendation or guideline, if you go to the GEO database, the Gene Expression Omnibus, it is a mess; there is so much data out there, but if you want to use it, there are actually meta-analysis papers where people just corrected the data in GEO, and that correction itself is a scientific article. That is not worth it. In the same way, proteomics also started off saying, we do not want to force standards on anybody to begin with, but we will state the minimal guidelines to follow; that is MIAPE, the minimum information about a proteomics experiment. We already have a lot of partners, which I will show on the next screen, and we also have some controlled vocabularies from the Proteomics Standards Initiative, PSI. There are representation standards like mzML and mzIdentML; a lot of formats came out, but the widely adopted one so far is mzML, because each piece of software, each pipeline you run, generates a different kind of output. The field is not as mature as genomics yet, but we are slowly getting there, and the expectation is that over the next several years all of these will be adopted. Then, whether it is PRIDE or PeptideAtlas or the Proteomic Data Commons, if you have the same file format, comparing two things is much easier.

With all that in consideration, we built the minimally viable product I mentioned earlier. As you can see, there is only one program right now, CPTAC; we started off with that. We have about 6 terabytes of data, and this many proteins and peptides identified, with a lot of summary information on the home page. We also have an application programming interface: basically, whatever you do on the UI, you should be able to do programmatically as well if you are proficient in programming. And this is the workspace I was talking about, which I will show, and this is the kind of common data analysis pipeline we would implement within the system.
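As a rough illustration of what "whatever you do on the UI, you can do programmatically" means, here is a sketch of a programmatic query. The endpoint URL and the query fields are placeholders assumed for illustration; consult the PDC API documentation for the actual endpoint and schema.

```python
import json
import urllib.request

# Hypothetical GraphQL-style query; the field names and endpoint below are
# illustrative placeholders, not the documented PDC schema.
query = "{ allPrograms { program_id name } }"
endpoint = "https://pdc.example.org/graphql"   # placeholder URL

request = urllib.request.Request(
    endpoint,
    data=json.dumps({"query": query}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
```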
Before I end, I just want to touch on proteogenomic integration, because that is what CPTAC and the Proteomic Data Commons are about. At the system level, when two systems like the PDC and the GDC are interacting, what do we need? There are some sequence-centric proteogenomic approaches: because all of this information is patient-derived and patient-centric, there is genomic information available for the patient on the GDC side, and you want to use that information to run your database search against. So how would you get that information? In the model we are developing, we are working closely with a framework team from the NCI, and also with the Genomic Data Commons and the cloud pilots, trying to figure out how to bring that information together, so that when I look up a particular patient in the PDC, it automatically tells me: hey, there is some genomic data available for this patient at this resource, do you want it? That way you can get that information and use it as input for your proteogenomic database search.

Then the other thing; that was at a very high level, even before you start your pipeline. Suppose you have already generated the quantitation data, so it is in a gene matrix, and on the GDC side they already have the FPKM/RPKM gene expression information, and, if they have the BED files, the genomic variants. You bring them together and start correlating them. That means you are comparing the results from two harmonization pipelines: you are not comparing the raw data, you are not doing anything there, you are just using the information that came out of the pipelines and comparing it. The R examples that you are trying basically do that.
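That correlation step can be sketched in a few lines of pandas. The file names and the matrix layout, genes as rows and samples as columns with matching sample identifiers, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical harmonized outputs from the two pipelines,
# both laid out as gene-by-sample matrices.
protein = pd.read_csv("pdc_protein_quant.tsv", sep="\t", index_col="gene")  # PDC quantitation
rna = pd.read_csv("gdc_gene_fpkm.tsv", sep="\t", index_col="gene")          # GDC FPKM expression

# Keep only the genes and samples present in both matrices.
genes = protein.index.intersection(rna.index)
samples = protein.columns.intersection(rna.columns)

# Per-gene Spearman correlation of protein abundance versus mRNA expression.
correlations = pd.Series(
    {g: protein.loc[g, samples].corr(rna.loc[g, samples], method="spearman")
     for g in genes},
    name="protein_vs_rna_spearman",
)
print(correlations.describe())
```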
And finally, can we see all the peptides we identified in a genome browser? Yes: you can download the information, generate the BED file, upload it and do all those things yourself, but we are trying to do that for you, so that for any given dataset it is automatically available. These are some of the visions; I will show some of the data that is already there later.

The idea is that the PDC will do some basic analysis for you, so when you go there, all these reports are already generated. But if you want to change something, there is the personal workspace I talked about earlier, where you can do all those things without affecting the public side, because whatever we provide publicly is for everyone. The high-level goal is this: we have all this information on the PDC data portal, which does not require a login, so you can just go there and analyze it. But if you have some data of your own that you want to correlate with what is already there, you should be able to analyze your data alongside the existing data in the workspace: you load your data, provide all the metadata, and there will be tools available for you to do that kind of analysis.

This is one use case somebody actually asked us about: I know there is genomic information available for a PDC proteomics study; how can I seamlessly integrate all this information? These are the use cases I was talking about. The way we implemented it at the system level is: find all the projects in the PDC that have genomic data; that is a very easy question to ask. But we also had: find all the programs in the GDC that have proteomic data. You should be able to ask that question whichever way you want, in different combinations, and the system should be able to answer it. That is where we are trying to go. There are some more examples of the same thing.

So, I will just summarize here. Right now the Proteomic Data Commons is in the build phase. Like I said, six months ago we started building the MVP, and we released it at the HUPO meeting in October. We got some feedback, but in terms of the system, like I talked about, we will basically have storage, workspaces, tools and containers, models and orchestration. So plan early: what can you do with this information right now? If you have data and some ideas, I think I have convinced you of what to do. If you have no data, that is okay; some people came to me and said, we are interested, but we do not have any data, where do we start? There is so much data out there; you can start by looking at that. And as for where the data will live: at least on the PDC side, it is in the cloud, and if you do not make it public, you are the owner of the data.

So, I hope today you have learned that the PDC Data Portal brings together all the omics data on a single platform, with a UID given to each dataset or patient that remains the same across the world, enabling users across the globe to access and reuse the data. You also learned about the Proteomics Standards Initiative, PSI, and the importance of such initiatives, and you got a glimpse of how difficult it is when the same data exists in different formats; hence a converter or a standard notation can play a major role in developing repositories that are accessible to all. We also learned about the proteomics standards currently in use, such as mzML, mzIdentML and mzQuantML, among others. In the next lecture we will shift topics: another guest speaker will talk about data-independent acquisition and SWATH Atlas. Thank you.