Hello, my name is Pratik Jagtap. I am a research assistant professor at the University of Minnesota and part of the Galaxy-P team, where we develop proteomics tools to be deployed within Galaxy. The team, led by Professor Tim Griffin, works on adding tools and workflows for proteogenomics and metaproteomics research. We have been doing this for the last six years; we conduct tutorials and have started using these workflows in multiple research projects. I am very excited to be part of this cancer proteogenomics workshop here in Mumbai, because it is exactly the research we work on, and it gives us the ability to reach the audience here in India as well as the rest of the world.

I will try to answer this question in two parts. The first part you mentioned was about a common platform, and that is exactly what we do with the Galaxy platform; I will be giving a talk about it very soon. Galaxy helps genomics, transcriptomics, and proteomics researchers use a common set of tools on one platform. More generally, one of the main barriers I see is that researchers and developers tend to be specialized in one field and not the others, and a platform like Galaxy helps provide the common ground where all of these researchers and developers can build tools and integrate data. There is also a need for researchers to understand that developers and users have to work together, because what a developer builds might or might not be useful to users. It could be a great algorithm, all of those fantastic things, but if it does not align with the question the researcher is asking, it remains an academic tool. Conversely, the user has to understand the possibilities and challenges a developer faces, so that together they arrive at tools that work and give a multiomic, systems-biology perspective on the data.

One observation researchers have made with multiomics, or trans-omics research as it is called, is that when comparing transcriptomic data to proteomic data, at least in the early days, the correspondence in quantitative expression was far from 100 percent. In fact, the concordance was much lower, and that was a concern at first. It is now understood that RNA expression and protein expression are not instantaneous: protein expression can lag behind RNA expression, and the stability of the RNA molecule can determine how much protein is ultimately expressed. So I think it is really important that researchers undertake temporal, time-dependent expression studies for both transcriptomics and proteomics, so that conclusions about the expression of both protein and RNA are better grounded. If you find that RNA is low while protein is high, that by itself does not mean much; it is just one snapshot rather than the full cycle of that expression. So I would say the answer is time-dependent studies, and the technology is getting there: RNA-seq is already in place, and with newer developments in mass spectrometry the scan speeds are getting really fast.
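To make that time-resolved comparison concrete, here is a minimal Python sketch of correlating RNA and protein abundance per gene across shared time points. The file names, column layout, and normalization are hypothetical assumptions for illustration, not part of any particular workflow described here.

```python
# Minimal sketch: comparing time-resolved RNA and protein quantification.
# All file names and column layouts here are hypothetical placeholders.
import pandas as pd
from scipy.stats import spearmanr

# Assumed layout: rows are genes, columns are time points (t0, t4, t8, ...),
# values are normalized abundances from RNA-seq and from MS-based proteomics.
rna = pd.read_csv("rna_timecourse.csv", index_col="gene")
protein = pd.read_csv("protein_timecourse.csv", index_col="gene")

shared_genes = rna.index.intersection(protein.index)
shared_times = rna.columns.intersection(protein.columns)

records = []
for gene in shared_genes:
    rho, pval = spearmanr(rna.loc[gene, shared_times],
                          protein.loc[gene, shared_times])
    records.append({"gene": gene, "spearman_rho": rho, "p_value": pval})

correlation = pd.DataFrame(records)
# Genes whose protein levels track (or lag) their transcripts can then be
# inspected, instead of relying on a single-snapshot comparison.
print(correlation.sort_values("spearman_rho", ascending=False).head())
```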
We hope that mass spectrometry will give us deeper as well as much more data, so that it can be matched with transcriptomic data.

I have a little bit of concern with that, and I should say I have personally not studied it as much. I have worked on domains, and I have looked at how one can use domains to predict the function of a protein. But in terms of interactions, either we do not have enough information, or, if we do have predictive models, those models need to be backed up by experimental data before one can make that claim. Until we actually have a good correlation between the two, I think experimental data is going to be far more determinant of what actually interacts with what than computational modeling. But that is my opinion.

I think it is important to start with RNA-seq data, and I will cover a little of that in the talk I am giving today. If you start with RNA-seq data, you obviously have your genomic coordinates, or you can go back to them, and if you use that as a template you can transfer the information to your proteomics data, or at least have a database schema that lets you go back and find the genomic coordinates, a gene-centric approach. That is possible; we have shown it. Obviously, the tools and workflows first need to be developed, then optimized and made robust enough that one can do this consistently, not only for well-characterized organisms but also for organisms that are only now being sequenced. It definitely is possible going through RNA-seq data. For the proteogenomics field to develop, one would almost have to make this a requirement, because if you do not correlate your protein to your DNA or RNA you are losing that information, and you want to maintain it: you know the information flows from DNA to RNA to protein, and the fact that the tools or coordinates are not readily available is not a good enough excuse to lose it. So I think this is going to be necessary as proteogenomics becomes a more established field.

I have basically worked on two workflows, or two areas of research, one of which is proteogenomics. When we started using Galaxy for proteomics, what we did not want to do was simply build another platform for routine proteomics research, where you take your mass spectra and a protein FASTA file and get peptides and proteins; there are many tools that do that. What we wanted to do was take on challenging areas, and this was six years ago, when proteogenomics and metaproteomics were, if not brand new, certainly emerging, and we saw both promise and challenge there. What we did was work on the post-processing of the identified peptides and on making it easier for a user to use. The challenge there is getting a user, or a project, and a developer to work together. As I was mentioning during the break, a developer is extremely enthusiastic about his work, the user is very focused on the questions being asked, and sometimes the two do not meet, which leads to a program or workflow that is great in its own right but not usable.
So I think developers and users have to work together on a project with specific questions in mind, and the creativity starts coming in once the basic building blocks are in place. That, I think, is how tools are going to develop.

In terms of its current status, I think it is in pretty good shape. The mzML and mzIdentML formats are getting accepted. The one part where I see a need for development is the quantitation portion of these standards, mzQuantML. I know there are developments taking place there, but making it more robust and optimized is going to be the need, because there are going to be many quantitative studies, especially quantitative studies that are correlated with RNA-seq data or other quantitative analyses. So getting the quantitative portion of proteomics and mass spectrometry data standardized is going to be important.

I will again answer this in two parts. First, you want to develop something to show that it works, and we have been doing that: we take small data sets, generate workflows, develop them on a cloud, build a Docker instance, share it with the world, and give presentations saying this works. But that is almost like playing a sport in a small backyard. The real data that comes out is not going to be just a few raw files or a few FASTA files; it is going to be many, maybe with multiple replicates and hopefully many time points as well. So the infrastructure at the back end needs to be able to support that, and it is equally important that the tools and workflows can run on it and use the vast resources available, either in the cloud, such as Amazon and Google, or on whatever supercomputing infrastructure you might have. So I think it needs to go through stages: you make it work first, and then, just as a child develops, you want to see it graduate from school to college and eventually into the real world. There is definitely a need for that. In an academic setting it is sometimes not possible, because teams get funded for five years and then the focus changes, but I hope the field in general understands this and makes it possible, because it is really not much fun to go back, develop a new workflow, and start again and again. Hopefully the momentum keeps going.

At multiple levels. Definitely the ability to ask good questions, and that does not come easily. Some people are naturally talented and ask good questions very early in their education, but one thing you develop as you do advanced studies is the ability to ask really sharp questions; as a young student you always have ten questions, but you are not able to decide which of those ten are good, because you think all ten are. So, first, asking good questions; secondly, designing experiments, because it is always good to have great ideas, but putting them into a practical, stepwise plan is important. The third part is sample preparation. I work in the bioinformatics area, and I know that many researchers either take it for granted that sample preparation is going to be good or simply blame it afterwards, but there is a lot of quality control that needs to be done and a lot of things that need to be considered, and that is true for data acquisition as well.
You need to have QC parameters to ensure that if you have a replicate generated on day one and another on day ten, you can compare them, or, if you cannot, you know the reason why. So you need measures for that, and there are tools in place to do it as well.

But I think most importantly, the researcher has to be able to interpret the data, and interpretation can happen at various levels. It could be through programming, or it could be a purely biological interpretation, where someone says, "Good, I have seen this data, I know what it means, but I need tools to work with it." That is where the researcher can work with the developer, which is what I do: I work with developers because I have a sense of where the data is going and what could be important. You can start in one area and then develop into another, or you can become an expert in one area; once you have the ability to look at the overview of a project, ask good questions, and publish good science, then I think you have achieved the ability to do these things.

While I was answering, one more thing came to mind that I learnt during my career: the ability to communicate, and not just through manuscripts or anything else that is public. You need to state things clearly to the person you are collaborating with, because team science is going to be very important. You cannot be an expert in mass spectrometry and an expert in everything else; you might be, but you might not have the time. So you need to communicate your expectations to your collaborator, get the best out of them, and also offer the best of what they want from you. Communication is very important. The younger generation already has a certain strength there because of the amount of social media available, but it is important to communicate effectively while avoiding the noise: how do I get across the part that is a useful signal, rather than handing someone ten pages of data and saying, go find your answer? So these are a few of the skills that develop, and I am sure new skill sets will come up as the field develops.

Hi, my name is Ratna Rajesh Thangudu. I am from a company called ESAC; we work closely with the National Institutes of Health in the USA. We are a bioinformatics and health IT company, and we provide a lot of services to both the National Institutes of Health and the Office of the National Coordinator in the USA.

I think we can start with the sheer volume of the data, and then there is the lack of resources and infrastructure to manage that volume. Everyone has their own tools, processes, and pipelines, but not the capacity to handle really large volumes of data. The other thing I would stress is the lack of data standards, or rather the lack of adherence to them. What happens is that you have a large amount of data but no common standard, so you lose the ability to combine your analysis with other data sets that already exist or that are coming from other programs. I think those are the three main things that come to mind to begin with.

Data standards are basically a set of agreed-upon rules so that you represent and annotate your data in a consistent way.
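As a small illustration of why such agreed-upon rules matter, here is a minimal Python sketch that maps free-text gene labels from two hypothetical datasets onto one shared identifier. The synonym table and dataset values are hand-made examples; real harmonization would rely on maintained vocabularies such as HGNC.

```python
# Minimal sketch: why agreed-upon identifiers matter for integration.
# The synonym table and dataset values below are hypothetical examples.
SYNONYM_TO_STANDARD = {
    "TP53": "HGNC:11998",
    "p53": "HGNC:11998",
    "tumor protein p53": "HGNC:11998",
    "ERBB2": "HGNC:3430",
    "HER2": "HGNC:3430",
    "HER-2/neu": "HGNC:3430",
}

def harmonize(labels):
    """Map free-text gene labels from different datasets onto one standard ID."""
    return [SYNONYM_TO_STANDARD.get(label, "UNMAPPED:" + label) for label in labels]

dataset_a = ["p53", "HER2"]            # labels as used by one program
dataset_b = ["TP53", "HER-2/neu"]      # the same genes under different names

# Without a shared standard the computer treats these as different entities;
# after mapping, both datasets refer to the same identifiers and can be joined.
print(harmonize(dataset_a))  # ['HGNC:11998', 'HGNC:3430']
print(harmonize(dataset_b))  # ['HGNC:11998', 'HGNC:3430']
```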
Standards are what make the data harmonization part possible. You start generating data that is specific to a particular program, country, disease type, or population, and when you want to integrate it into a bigger platform and bring in data from other platforms, you find, for example, that the same disease or the same gene is called by different names. You know it is the same thing, but the computer does not, until you tell it. Data standards get you to that point. The other aspect of data harmonization is that you analyze all of the data through one particular pipeline, so that you see all of it through the same eyes. Even though each of those data sets could be analyzed independently with a different pipeline, harmonization brings them together and supports understanding on a larger scale.

Earlier I mentioned the lack of infrastructure and resources; cloud computing removes that barrier. It is on demand and elastic, and there are well-established providers out there, such as Amazon, Google Cloud, and now Microsoft Azure. You do not need a data center on your premises, and you do not need an IT crew adding disk space and networking; what you need is basically a good internet connection. Based on the volume of your data it scales up quickly: you can increase the disk space you are using and the compute power you need. It is very helpful in that sense.

In precision medicine specifically, it is not data about a particular program but data about the individual patient, or individual subject, however you want to call it. Within the personalized genomics space the data is growing exponentially every day, faster than Moore's law can keep up with, so cloud computing comes into the picture to handle that kind of data.

I do not see that as such a big problem, because the connections between genes and proteins at the level of accession numbers in the most common databases, for example RefSeq and UniProt, are very well annotated and very well maintained, and most proteomic pipelines already roll up the final protein parsimony results to the gene level. So it is fairly easy to map back to the individual isoforms, and if you start from an isoform you can easily come back to the gene.

Now the big buzzword everywhere, at least at the level of the governments involved, the US and I would say the European Union, is big data and the data commons. A data commons brings everything together in one place: the storage, the tools, the compute power, all of it. What the user needs is a simple login, just like logging into an email account; the user does not have to bring anything to the table, and the only thing left is to take some results of the analysis back directly from the cloud computing platform.
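To illustrate the point above about rolling parsimony-level protein results up to the gene level, here is a minimal Python sketch; the accessions, gene symbols, and intensity values are hypothetical examples rather than the output of any specific pipeline.

```python
# Minimal sketch: rolling isoform-level protein results up to the gene level.
# The accessions, gene symbols, and intensities below are hypothetical examples.
from collections import defaultdict

# Assumed input: (protein accession, gene symbol, quantitative value) triples,
# e.g. from a parsimony-level proteomics report annotated against UniProt.
protein_results = [
    ("P04637-1", "TP53", 1.8),
    ("P04637-2", "TP53", 0.4),
    ("P00533",   "EGFR", 2.6),
]

gene_totals = defaultdict(float)
for accession, gene, intensity in protein_results:
    gene_totals[gene] += intensity   # sum isoform-level signal per gene

# Gene-level values can now be compared directly with gene-centric RNA-seq data,
# while the isoform accessions remain available for mapping back the other way.
for gene, total in sorted(gene_totals.items()):
    print(gene, round(total, 2))
```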
Like I said earlier: data standards, standards, standards. That is very important for achieving the goal in front of us, precision medicine, with these large volumes of data. The siloed nature of the data never helps: you have genomics, proteomics, imaging, immunology, all of these data sitting side by side, but they cannot talk to each other because there are no standards there. Actually, let me take that back: there are standards, but they are not adhered to. So you end up calling the same thing by different names, as I said earlier, and you cannot actually do the integrated analysis.

Keep your eyes wide open, because there is a lot of open data already available, so you do not necessarily have to generate the data yourself. You can start looking into the existing resources, bring the data onto your laptop, and start analyzing it. For the analysis, the simplest tools I would currently recommend are R and Python. They are fairly easy to learn, and all of the public data that is available can be processed through those kinds of parsing tools and statistical packages, and you are ready to go; a small sketch of getting started that way appears at the end of this section.

There are some portals I know of that collate all the data from the literature, and Nucleic Acids Research, a journal, publishes the available databases and tools on a yearly basis. At the same time, it is now a requirement with most journals that you make your data available in a public repository, along with the tools you used and their versions. So a lot of things are already in place that provide the metadata of the analysis performed on a particular dataset: the data is available, the tools used are available, and the versions are available. That helps others reanalyze the data with different settings or on a different dataset, or simply reanalyze it to cross-check the validity of the results reported in the publication.

That is critical to all of the data commons efforts we talked about: if you do not share the data, there is no data commons. Sharing data is always welcome, and now the governments actually require it; as long as the research funding comes from public money, researchers are required to submit their data to one of these resources.
In the U.S., for example, there is the Genomic Data Commons, which makes all of the genomic data available, and a Proteomic Data Commons is now being built, which is what we are working on in our group. The other point is that this actually helps innovation. Sometimes you start with a small dataset; the TCGA pan-cancer analysis is one such example. There are thirty-plus different cancer types that they analyzed, and the program closed, I think, at least three or four years ago, yet research is still coming out of it, simply because the data is shared, everything is available in the public domain, and people are using it.

A lot of proteomic data is not protected; in other words, it is open access, and that is a good thing. But if you are looking specifically at proteogenomic integration, some of the genomic data is under controlled access. There are data access committees: you need to submit an application, and there is no cost involved other than your research interest. Once you submit, they review your application, look at the research statement, and come back to you with their decision, and there is the expectation that you adhere to the guidelines set by those repositories. This is all patient-related data, so patient privacy is paramount in all of the data sharing.
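Following the earlier advice about starting with public data and simple tools such as R or Python, here is a minimal Python sketch of a first look at a downloaded open-access expression table; the file name and layout are hypothetical placeholders for whatever dataset you pull from a public repository.

```python
# Minimal sketch: a first look at public data in Python, as suggested above.
# The file name and column layout are hypothetical placeholders.
import pandas as pd

# Assumed layout: rows are genes, columns are samples, tab-separated values.
expr = pd.read_csv("public_expression_matrix.tsv", sep="\t", index_col=0)

# Quick look at what was downloaded before any serious analysis.
print(expr.shape)              # genes x samples
print(expr.describe().T.head())

# A simple first pass: the most variable genes across samples, often a
# reasonable starting point for exploring an unfamiliar public dataset.
most_variable = expr.var(axis=1).sort_values(ascending=False).head(20)
print(most_variable)
```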