Welcome to the MOOC course on Introduction to Proteogenomics. Today we have a new speaker, Dr. Ratna Thangudu. He is the bioinformatics lead at Enterprise Science and Computing in Rockville, MD, USA. His company deals with large-scale data management and provides bioinformatics solutions to various institutes and companies. Dr. Ratna is going to talk to us about large-scale data sciences. He will explain what exactly the term big data refers to and how it can be managed. He will also talk about the major issues with big data and how they can be overcome by sharing data across all fields, whether academia or industry. He will also talk about the importance of multi-omics data in understanding biology, especially in the context of precision medicine. So, let us welcome Dr. Ratna to talk to us about large-scale data sciences.

My name is Ratna Thangudu. I am the bioinformatics lead for a company called Enterprise Science and Computing. We are located in Rockville, MD, USA, about 10 miles from the National Institutes of Health; that is about a 20-minute drive, and to put it in context, we are about 30 minutes from the White House. So, what do we actually do as a company? We are in large-scale data management and provide bioinformatics solutions for government clients, academia and also industry, and we work closely with the National Cancer Institute. You have heard a lot about the CPTAC program over the last few days; we actually built and manage all of the resources that were used, and I will talk a little bit about that.

So, how many of you are aware of what big data is? I see a few hands going up. I think you are all part of big data; every day you are contributing to big data. So, I will start from there and try to put big data in biology into perspective: what we are actually doing with proteomics and where it is all going. As the term suggests, big data means extremely large data sets that may be analyzed computationally to reveal patterns, trends and associations which are not easily seen in our regular day-to-day analysis. As the name implies it is very large, it is very dynamic, it keeps growing and changing, and it is too complex for traditional data processing techniques. For example, take an Excel file: Excel can handle a little over a million rows at most, that is it. So, if your data crosses that, what would you do? Once it crosses the point where you can no longer analyze it with the tools you have on your desktop, that is big data. With a large amount of data you get a lot of statistical power, and the complexity can also lead to some false discoveries, but that is okay; we are gaining a lot of statistical power.

To give you an idea of what big data is, consider the social media big data that we all contribute to every minute of every day. These numbers from 2017 show the kinds of interactions we do on a daily basis that contribute to big data. I will not go through everything, but take Google: every minute we do almost 3 million searches, and every search you do is a data point. It is not necessarily the data being returned; the search itself is a data point.
Similarly, you watch a lot of YouTube videos. The amount of time you spend on YouTube is a data point, and the amount of video content you upload is a data point. Every interaction we do on a day-to-day basis is a data point that contributes to big data.

So, what is big data again? It certainly involves large quantities of data, but it has some characteristics. If I hand you one big file and say this is data, that is not big data, right? Just because a file is big does not make it big data; it has certain features. The description started with volume, which we just discussed, and then velocity: how fast you can access it, how fast the data can move from point A to point B, for example video streaming on Netflix or YouTube. Then there is the variety of the data. Is it all the same kind of data? No, there is a lot of structured data and a lot of unstructured data. Structured data refers to the kinds of interactions you do with an airline ticketing system, banking transactions, all of the e-commerce things you do; you know which person is doing what, at what time, and what amount they are spending, so there is structure to it. What is unstructured? All the emails you send every day are unstructured, because they are free text; you cannot directly categorize them. That is just one example (there is a small sketch of this distinction below). As people kept looking into big data, these descriptions were not enough, so they came up with more. They added veracity: is the data actually valid, does it make any sense at all? It should be valid data, right? Then some more people said, let us add one more: value. Does it actually add any value? At the end of the day you have so much data, like all the numbers I showed; if all of that is not generating money for companies like Google or YouTube, it has no value, right? So, it has to have value too. Then people came up with even more: visualization, can we actually look at the data and say something about it? And there is something called viscosity: does it stick with you, does it make any sense at the end of the day? What I am trying to say is that big data is not just one big file or a bunch of files that you have; it has to have some meaning attached to it when you analyze it.

All right, so now I jump to big data in biology. We all know about next-generation sequencing; we have been discussing it for the last several days, and maybe you are doing your own research in that area. We biologists joined the big data club a long time ago, I would say with the advent of the human genome nearly 20 years ago. And there are many forms of data: even within biology there is genomics, proteomics, molecular pathways, and there is healthcare. All the doctor visits you make are recorded in electronic health records. That is quite common in the US and the Western world; here I am not so sure, but probably the new corporate hospitals are already doing it. So, there is a lot of EMR data.
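To make the structured-versus-unstructured distinction above concrete, here is a minimal Python sketch; the transaction records and the email text are invented purely for illustration. Structured records follow a fixed schema and can be queried directly, while free text has to be given some structure (here, a crude word count) before you can ask similar questions of it.

```python
# Structured data: every record follows the same schema, so it can be
# queried directly (who spent what, and when).
transactions = [
    {"user": "alice", "amount": 120.0, "timestamp": "2017-03-01T10:15:00"},
    {"user": "bob",   "amount":  45.5, "timestamp": "2017-03-01T10:16:00"},
]
total_by_user = {}
for t in transactions:
    total_by_user[t["user"]] = total_by_user.get(t["user"], 0.0) + t["amount"]
print(total_by_user)  # {'alice': 120.0, 'bob': 45.5}

# Unstructured data: free text with no schema. Before it can be analyzed
# it has to be tokenized, categorized, or otherwise given structure.
email_body = "Hi team, attached are the slides from yesterday's meeting."
word_counts = {}
for word in email_body.lower().split():
    word_counts[word] = word_counts.get(word, 0) + 1
print(word_counts)
```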
And beyond the EMR, all the healthcare tracking devices you use, like a Fitbit, record a lot of data. If you bring all of this together, it tells something about you, for example whether you are predisposed to a particular disease. So, it allows us to develop new tools, new techniques, new ways of understanding the data.

Just to give you an idea of where we are going with the data we have: around 2000 we had the first Sanger-sequenced human genome, and then the Human Genome Project. Then we had the 1000 Genomes Project, which is about 1,000 genomes, and TCGA, which is about 11,000 patients' data. Then large-scale exome sequencing came, which is about 68,000 exomes. Now there is the Geisinger health system in the US that is doing about 100,000 genomes, and people are already talking about a million genomes. There is also a new initiative in the US called All of Us, which is looking at a million-plus participants across the country. So, we are adding data day by day and the curve keeps going up. The usual assumption is that the data will double every two years, but Illumina, when they were developing these NGS machines, projected a much steeper curve, and if you look at it, the growth is actually tracking that projection. So, we are talking exabytes of data; I think I have convinced you about how much data there is.

Coming back to the actual multi-omics data: the example I showed on the earlier slide is just the genomic data. But there are many facets to multi-omics data; we have the transcriptome, proteome, metabolome, exposome and epigenome, and also the social graph with the demography of the patient and the people around them, not just the patient. Then there is imaging data and there are biosensors. You have to bring all of these together to make sense of it if you want to achieve the goal of precision medicine, that is, personalized medicine.

All right, so where is all this data coming from? People are generating a lot of data, but if people are not sharing it, it is not big data anymore, right? If your group generates data for 10,000 patients and you keep it to yourself and do not share it with anyone else, then it is not a 1000 Genomes resource; it is just your lab's genomes. All of the databases we have now changed rapidly precisely because of public sharing of data. If you are running a BLAST search on NCBI, you are running it against a reference genome; where is that coming from? It is there because people share the data publicly, and because it is all funded by governments. In Europe and the US, basically whatever data is funded by the government has to be in the public domain; that is a requirement.

But we are all into proteomics, right? So, where does proteomics stand here? There is a lot of proteomic data out there in the public domain; I am not sure how many of you are aware of that. I will show some slides; probably you have seen some of these resources, or some of you have actually used them. But with the advent of high-resolution mass spectrometry, collecting and sharing the data is a big challenge; you run the instruments, and you saw how much time it took just to understand the data.
So, after that, what do you do? You run your experiment, you analyze it, you probably have a publication, and then what? The publisher now wants you to put that data somewhere that is accessible both to the publisher and to everyone else, right? That is what contributes to the public data. And even though the publisher simply says submit it here, for the people who are managing that repository it is a Herculean task; it is extremely difficult to manage data that is coming from so many different places.

We have made very good progress in terms of where the proteomics community stands. There is the ProteomeXchange consortium, which probably a lot of you have already heard of. It is a consortium of about six groups, six repositories. PRIDE is the central one, the largest public proteomics data resource, and it sits in Europe. Then we have MassIVE and PeptideAtlas; I think David Campbell from ISB is here, and he is part of that. Then there is Panorama, which most of you have probably accessed, and there are resources from China and from Japan. So, there are different resources out there, and they make all of the data publicly available.

That is a very good question; I have a slide coming up where I will talk about that. Earlier, when these resources started, they simply took your raw files and your results. The problem then is that we do not know the format of the data, and we do not know its validity; let me go to the next slide. Many data sets are coming from many different places. In 2016 there were about 4,000 data sets available, which shows you the growth of the data; they started by just taking whatever you had, including the results. It started in 2012, and by 2015 (this is a dated slide) it had grown to that level, and as of a couple of days ago, when I checked, there are almost 7,000 data sets publicly available through ProteomeXchange. That is not just from one resource but from all the six or seven resources I showed you. I will come back to your question in a bit.

Here is another list of public databases. Some I have already listed, and some are derived databases, for example specialized databases that capture post-translational modifications. And like I said, it is not so easy to manage all these data resources; it needs a lot of money, a lot of manpower and a lot of expertise. For example, Tranche was one data resource that does not exist anymore; it is gone because of lack of funding. That is actually one place we recovered data from for CPTAC, because CPTAC used to deposit data in Tranche. As part of the CPTAC Data Coordinating Center, which I will discuss later, we got all of that data and made it available again.

So, I think I will try to answer your question here: what are the uses of this data out there? One, you get a publication. A publisher requires you to submit the data somewhere so that it can be validated; the primary use of the data is the reason you generated it in the first place, namely your publication. Then it adds evidence. For example, in UniProt you have all these manually curated, validated protein sequences; where does that come from?
It takes evidence from all the publications that people are publishing, and that adds value; that is the primary use. Then there is reuse, meta-analysis. You can take data from ten different data sets that are similar to your work but that you never knew existed, because you went to one of these resources and searched for, say, colorectal cancer, and found something there (there is a small sketch of this kind of search below). Compare that to a publication: you go to PubMed and search for something, and you get literature, you do not get data. There is a difference. Things are slowly changing, and the idea is that, as time goes by, when you search in PubMed it is not just the publication that comes out; it also tells you where the data is sitting, what pipeline has been used, how the pipeline has been run, and whether you can reproduce the results yourself with the click of a button. That is where people are trying to head; we have a long way to go, but the vision is there.

To answer your question, some of these resources actually reanalyze your data. MassIVE, for example, will reanalyze all of your data through their own pipeline, and if they find something interesting that did not come through in your results, they are nice enough to send you an email saying, we did find this, maybe it will be interesting to you; that is from UCSD. It is very difficult to go back and reanalyze all of the data, and sometimes they try to analyze all of it together, which needs a lot of computing power.

I talked about meta-analysis and reanalysis; here, for example, is reprocessing. In the center of this diagram you see all the primary ProteomeXchange consortium members, the deposition databases; once you deposit there, other resources take the data and reprocess it. That gives you insights, and it adds value, for example to the phosphoprotein and post-translational modification data repositories. And finally there is repurposing: the purpose of your experiment was one thing, but the data is used to add value in a different way. One example is the proteogenomics approach. You deposit the data, and as a user I see that it is very interesting and I try to find out whether there are any novel splice junctions in it; if I find some, I add that value back, to NCBI or wherever else I deposit it. So, what I am trying to say is that data sharing helps with use of the data, reuse, reprocessing and also repurposing. Reprocessing helps because pipelines continuously and constantly evolve and new algorithms keep coming; in the first attempt you might miss something, but a new algorithm might find new information in the same data set. That is one of the things MassIVE does.

So, we talked about everybody generating data and depositing it; I showed you several thousand data sets there. But if you actually take a look at any of these resources, they collect very minimal metadata, because collecting it is very onerous; metadata is data about the data.
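To make the reuse point above concrete, here is a minimal sketch of a keyword search against a public repository such as PRIDE. The endpoint path, query parameters and response field names here are assumptions based on the PRIDE Archive REST API, and should be checked against the current API documentation before use.

```python
# Minimal sketch of keyword-searching a public proteomics repository.
# Endpoint path, parameters, and response fields are assumptions based on
# the PRIDE Archive REST API; consult the current API docs before use.
import requests

BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"  # assumed base URL

def search_projects(keyword, page_size=10):
    """Return the raw JSON for a keyword search of PRIDE projects."""
    resp = requests.get(
        f"{BASE}/search/projects",
        params={"keyword": keyword, "pageSize": page_size, "page": 0},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    results = search_projects("colorectal cancer")
    # Field names below are assumptions; inspect `results` to see the
    # actual structure returned by the API.
    for project in results.get("_embedded", {}).get("compactprojects", []):
        print(project.get("accession"), "-", project.get("title"))
```

How useful such a search is, of course, depends entirely on the metadata attached to each data set, which is exactly the problem discussed next.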
You describe your data: which patient the sample is coming from, what the samples are, what protocols you used, how you did it, what the experimental design is. If you do not provide all that context about your data sets, they are essentially useless, right? The data just sits there, and if I go and get all of it I cannot do anything with it; I do not even know which patient it came from. The problem here is that data submission, making data available in the public domain, comes after your research goal. You finish your research objective, you achieve it, and only then do you go and deposit, because somebody else, the journal, tells you that you have to put it somewhere. So, it feels like a burden for a lot of people: I am done, and now I have to do all this. If you actually go and submit to any of these resources, it is not so easy; you have to collect the metadata in a certain way, reformat it and submit it to them. Then they validate it, and if something is missing they come back and say, this is missing, we cannot accept the submission unless you provide it. The amount of metadata they require is something they keep trying to shrink, because it is becoming a burden and people would stop submitting; we want the data to come, so the attitude becomes: give us minimal metadata and we will keep it there and see how we can process it. That is not really helping. Then there is a shortage of expertise: the number of data sets and the volume are growing, so we need expert people to handle that, and right now that expertise is limited to resources like PRIDE and PeptideAtlas. And then there is a lack of adoption of standards. I will talk more about standards a little later, but standards are about how you represent your data: are you using any of the existing controlled vocabularies to define what it is? You say TMT and somebody else writes it in capital letters or small letters; that is a very basic example, but if you go to the clinical data side, for the same disease, breast cancer, there are so many subtypes. If you just say breast cancer it does not help; you have to say exactly what it is. There are standards to help you do that, but you have to actually apply them when you submit (there is a small illustrative sketch of this below).

I have talked a lot about what is already out there, but I have not said anything about CPTAC yet. CPTAC produces a lot of data; that is the consortium we are part of. We developed and manage the Data Coordinating Center, and we also distribute all of the data. We started pretty small: the CPTAC 2 program, three cancer types, breast, colon and ovarian, about 600 patients. They started off with some TCGA samples that were already there, for which genomics data already existed: why not take some 100 samples from each of these cancer types, reanalyze them with proteomics and try to combine the two? That was the first phase. After that they realized that those samples were not optimized for proteomics, so they needed to collect more, and they collected 300 more for each of those cancer types. There has been a lot of success, and the currently running program is CPTAC 3, where they have added at least 10 more cancer types.
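To illustrate the standards point above (the TMT spelling example and the breast-cancer subtype example), here is a small hypothetical sketch of checking submitted metadata against a controlled vocabulary. The vocabulary lists, field names and the validate_metadata helper are invented for illustration only; real submissions would be validated against community ontologies rather than hand-written sets.

```python
# Hypothetical controlled vocabularies -- invented for illustration only,
# not any official ontology. Real submissions would validate against
# community CVs (e.g. an MS ontology for labelling reagents, or a
# disease ontology for diagnoses).
LABELING_CV = {"TMT10", "TMT11", "iTRAQ4", "iTRAQ8", "label-free"}
DISEASE_CV = {
    "breast invasive ductal carcinoma",
    "breast invasive lobular carcinoma",
    "colon adenocarcinoma",
    "ovarian serous cystadenocarcinoma",
}

def validate_metadata(record):
    """Return a list of problems with one submitted metadata record."""
    problems = []
    # Normalise case so 'tmt10' and 'TMT10' are treated the same.
    label = record.get("labeling", "").strip()
    if label.upper() not in {t.upper() for t in LABELING_CV}:
        problems.append(f"unknown labeling reagent: {label!r}")
    disease = record.get("disease", "").strip().lower()
    if disease not in DISEASE_CV:
        problems.append(f"disease term not in controlled vocabulary: {disease!r}")
    return problems

# 'breast cancer' alone is rejected: the subtype has to be spelled out.
print(validate_metadata({"labeling": "tmt10", "disease": "breast cancer"}))
```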
Coming back to CPTAC 3: it is a very ambitious and large program, not so much in terms of the volume of data being generated, because proteomic data is much smaller than genomic data, but in the breadth of coverage of cancer types and in what they are trying to do in terms of proteogenomics; it is pretty big. We manage the CPTAC Data Coordinating Center. The consortium has about 15 to 20 different groups or institutions; for example, over the last three days you saw at least three groups represented, the Broad Institute, NYU, and Bing Zhang from Vanderbilt University, but there are another 15 or so groups behind them. All of these people are generating data, and we have to coordinate that. The data here means clinical data, biospecimen data, genomic data, proteomic data, imaging and many other aspects. The private portal is specifically for consortium members; it is controlled access, only they can log in, and that is where they exchange data. Then we have a public portal. How many of you have actually used this resource? I really encourage you to go to it; the purpose of this talk is to introduce you to all these resources. There is so much data already there; start taking a look. You do not have to generate anything, so much is already sitting there; just try it, you do not even have to make a new discovery, just try. We also have an assay portal, which we built in collaboration with Georgetown University and put a lot of effort into, because each assay carries a stamp from NCI; you might have seen the check mark there indicating that it is, in a way, branded. If some information is missing there, we will definitely take a look.

This is the portal's landing page: you can go to proteomics.cancer.gov, it will have some information, and you can click through from a link there. We use the Aspera technology for transfers; Aspera supports very high-volume data transfers at very high speeds, but it uses UDP, and most university and academic institutions block it. It is not too hard to fix: you reach out to your IT department, here at IIT for example, and ask them to open a particular port. We can provide sufficient information if that is a problem; you cannot just go and ask them to open a port, since that is a security hole, but if you need it we can write to them, because users all over the world use this resource.

We have about 13 terabytes of data right now, organized into 43 studies; I talked about only three cancer types, but the data is split into 43 different studies. Over the last six years we have had a large number of visits, people coming, clicking and browsing the resource. We have only 13 terabytes of data, but the total amount downloaded is close to 400 terabytes. That is not a lot when you compare it with genomics, but what is interesting here is the number of files downloaded: almost 2 million, close to 3 million files.
What those download numbers mean is that, along with the raw data, we also provide on the portal the results from the common data analysis pipeline that we run. Those result files include the gene-sample matrices you are using in the Morpheus hands-on sessions (a small loading sketch is given at the end of this section). For each study we run a common data analysis pipeline, and that generates the result files: the summary reports, the protein parsimony results, the identified peptides, and the identified proteins at a certain threshold. Most people probably want those files; they do not need the raw files. Downloading the raw files does not help you unless you have an established pipeline and the resources to actually reanalyze the data. So, what I am trying to say is that the sheer volume of downloads shows that a lot of people are interested in the result files, and that is usually what you want; you do not have to go there, see that it is a lot of data and feel you must download everything. It is always there, it is not going anywhere, and you do not have to download it at all if you do not want to; you can just go through the results and use that information. This last slide shows how many people access the data from around the world. I see very few dots from here, but I think that will increase after I go back home; I hope to see a lot more dots.

I hope you have learned why sharing correct data is important and how it can help people across the world find solutions to problems that individuals alone cannot solve. You got a glimpse of how contributing to big data also helps in obtaining reproducible data sets, which can help in finding the most reliable candidates, or even potential biomarkers, across data sets from different studies. I hope you have also learned about ProteomeXchange and how its data has grown over time. Hence, one should place further emphasis on sharing correct data with the community. In the next lecture, we will continue Dr. Ratna's lecture, where he will talk more about large-scale data sciences and give you a few examples. Thank you.
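As a practical follow-up to the summary result files mentioned above, here is a minimal sketch of loading one of the gene-sample matrices with pandas, as you might do before a Morpheus-style analysis. The file name and the layout assumed here (tab-separated, gene identifiers in the first column, one column per sample) are illustrative assumptions; check the documentation that accompanies each study on the portal for the actual format.

```python
# Minimal sketch of loading a gene x sample summary matrix downloaded from
# the CPTAC data portal. The file name and layout (tab-separated, gene IDs
# in the first column, one column per sample) are assumptions; check the
# study documentation for the actual format.
import pandas as pd

matrix = pd.read_csv(
    "proteome_gene_sample_matrix.tsv",  # hypothetical file name
    sep="\t",
    index_col=0,          # gene identifiers as the row index
)

print(matrix.shape)        # (number of genes, number of samples)
print(matrix.columns[:5])  # first few sample identifiers

# Basic sanity checks before feeding the matrix into a tool like Morpheus:
# how much is missing, and what do per-sample distributions look like?
print(matrix.isna().mean().head())   # fraction missing per sample
print(matrix.describe().iloc[:, :3]) # summary stats for the first few samples
```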