Yeah, so my name is David Fenyö and I'm a professor at New York University, and my group works on integrating different types of biomedical data, with a focus on integrating proteomics and genomics. By looking at the correlations between these different data types, we can better understand the underlying biology: for example, we can find modules of proteins that work together in complexes, and we can study transcriptional regulation and also signaling at the phosphorylation level. By looking at how these different measurements are correlated, we can understand the biological function of the cell and try to elucidate the complexity of these functions, and in cancer we can see how these cellular networks are dysregulated in ways that lead to cancer.

When we're building these predictive models, we always have to make a trade-off between a complex model and a simpler one. When we make the model too complex, we risk overfitting our data, and then the model won't generalize very well. On the other hand, when we make it too simple, it won't have very good predictive power. So we have to find this balance between simplicity and complexity, and that depends very much on our data, on both its size and its quality, and on how complex we can therefore afford to make our models.
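A minimal sketch of this trade-off, assuming scikit-learn and a synthetic regression problem (a toy illustration, not anything from the speaker's actual work): as polynomial degree grows, the training score keeps rising while the cross-validated score collapses.

```python
# Illustration of the complexity/overfitting trade-off on synthetic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 60)  # noisy observations
X = x.reshape(-1, 1)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2 = model.fit(X, y).score(X, y)          # fit on everything
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()  # held-out performance
    print(f"degree={degree:2d}  train R^2={train_r2:.2f}  CV R^2={cv_r2:.2f}")

# Too simple underfits (both scores low); too complex overfits
# (training score high, cross-validated score drops).
```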
Yes, so I think this is a question we get a lot. Mainly, the measurements are very complementary. When we do genomic and transcriptomic measurements in tumors, we see a lot of things that change, and it's very difficult to prioritize which of these changes are important, which of them drive the cancer. By adding in the proteomics, we can use the proteomics to prioritize which of the genomic changes actually matter. The main reason is that the proteins are the functional gene products, so they are the ones closest to the phenotype. If, for example, we have genomic changes that don't result in any changes in the proteins, then they are probably less important than the ones that lead to dramatic protein changes.

Yes, so people have done that; other groups, including my group, have worked on applying proteogenomic techniques to different infectious diseases, including HIV and malaria. The angle that we and our collaborators took was to look at the immune response to these infections. There, the approach was to first do targeted sequencing of the variable regions of antibodies to get a survey of the immune response, and then to use a targeted mass spectrometry approach to find which of those antibodies have high affinity to the infectious agent.

Yes, so we are very interested in expanding bioinformatics and computational education into the many different aspects of medical education. Already during medical school, doctors should learn to handle and analyze both clinical and molecular data, because as we have seen, more and more measurements are being done on patients that result in data. It is very important that medical doctors understand the possibilities of computational methods for analyzing these data sets, because they are the ones best placed: they see what is needed in the clinic, and the data is available to them. If they also understand what is possible with computational methods, I think we can much faster change and improve healthcare and really make it predictive and based on data.

So I think in some respects we are already achieving it, but only in very specific areas. In treating diabetes, for example, we practice personalized medicine today, because based on measurements we adjust the medication on maybe even a daily basis. But these are still very limited cases, and to really do it on a larger scale we still need a lot more research. We are slowly getting there, and at least today we can imagine how we would get there, even though it might be quite a few years away.

So about 50% of our genome is made up of remnants of retrotransposon sequence. This is a very large part of our genome, and most of these copies are truncated, so they are not active anymore, although at some point they were. But there are about 100 positions in the genome with a full-length copy of the retrotransposon called LINE-1 that have the potential to be active. Activity in this case means that they are transcribed and two proteins are made; both bind to their own RNA, and one of them is also an endonuclease and a reverse transcriptase. After this ribonucleoprotein particle is imported into the nucleus, the endonuclease can cut the genome and the reverse transcriptase can insert a copy at a new position. If it is inserted into a gene or a promoter region, this can of course cause all sorts of problems, because it can disrupt the gene, so the host has developed very efficient suppression mechanisms, and in most somatic cells the retrotransposons are not active. But what people have observed in a lot of tumors is that we get transcription, and we approach this with a proteogenomic approach where we look at both the transcription and the proteins. What we've seen is a lot of transcription factor binding to the retrotransposon, we can see that it is highly transcribed in certain tumors, and we can see that the proteins are produced. Finally, we have also developed a method to look for novel insertions in the genome. Taking this together, we are now also looking at what host proteins are needed for this process, what proteins they interact with, and what regulates this, both at the transcription factor level and at the translation level. And the nice thing is that with the proteogenomic data on tumors that is coming out of several labs nowadays, we can actually do these studies with existing public data.
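The interview doesn't describe how novel insertions are detected, so purely as a toy sketch of one common strategy (discordant read-pair clustering, not the speaker's actual method): assume reads were aligned to the reference genome plus a LINE-1 consensus sequence added as an extra contig, and that pysam is available. The file name, contig name, and thresholds below are all hypothetical.

```python
# Toy sketch: cluster read pairs where one mate anchors uniquely in the
# genome while the other maps to a LINE-1 consensus contig, as evidence
# for candidate non-reference insertions.
from collections import defaultdict
import pysam

L1_CONTIG = "L1HS"    # hypothetical name of the LINE-1 consensus contig
WINDOW = 500          # group anchor reads into 500 bp windows
MIN_SUPPORT = 5       # require several independent supporting pairs

clusters = defaultdict(int)
with pysam.AlignmentFile("tumor.bam", "rb") as bam:   # hypothetical file
    for read in bam:
        if read.is_unmapped or read.mate_is_unmapped or not read.is_paired:
            continue
        # Keep pairs whose mate maps to the LINE-1 consensus while this
        # read maps elsewhere in the genome.
        if read.next_reference_name == L1_CONTIG and read.reference_name != L1_CONTIG:
            key = (read.reference_name, read.reference_start // WINDOW)
            clusters[key] += 1

for (chrom, win), support in sorted(clusters.items()):
    if support >= MIN_SUPPORT:
        print(f"candidate insertion near {chrom}:{win * WINDOW} "
              f"({support} supporting read pairs)")
```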
So my name is Karl Clauser. I am a principal scientist at the Broad Institute of MIT and Harvard in Boston, and for quite a few years now I have been doing research in proteogenomics, mainly oriented around cancer. The research we do mostly involves analyzing tumors that come directly from patients, where we're trying to get an overview of the proteomics and the genomics of cancer. What we seek to do is have tumors from over a hundred patients, and we want a depth of coverage that on the proteomics side is, say, 10,000 proteins or more.

So in order to make effective use of instrument time, we use a multiplexing strategy that involves TMT labeling, peptide fractionation, and then automated mass spectrometry doing LC-MS/MS. Doing that, we get both proteome and phosphoproteome information; the phosphoproteome information comes from a step of isolating and enriching phosphopeptides using immobilized metal affinity chromatography, which gives us one set for proteomic work and one set for phosphoproteomic work. This collects millions of mass spectra, and software is used to interpret the spectra; I'm responsible for building some of that software. That creates large amounts of information that we then seek to integrate with genomic information, to learn things about cancer processes, to put different types of cancer into better classifications, and ultimately to help get better treatments and diagnostics for patients.
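To make the quantification step concrete, here is a minimal sketch of rolling TMT reporter-ion intensities up from peptide-spectrum matches to relative protein abundances, assuming pandas. The column names, the median normalization, and the median roll-up are illustrative assumptions, not the Broad's actual software.

```python
# Sketch: PSM-level TMT reporter intensities -> relative protein abundance.
import numpy as np
import pandas as pd

# One row per peptide-spectrum match; one column per TMT channel/sample.
psms = pd.DataFrame({
    "protein":  ["P1", "P1", "P1", "P2", "P2"],
    "tmt_126":  [1.0e5, 8.0e4, 1.2e5, 3.0e4, 2.5e4],
    "tmt_127N": [2.1e5, 1.5e5, 2.4e5, 1.4e4, 1.6e4],
})

channels = [c for c in psms.columns if c.startswith("tmt_")]
log2 = psms.copy()
log2[channels] = np.log2(psms[channels])

# Center each channel on its median to correct for loading differences,
# then summarize each protein as the median of its PSM-level values.
log2[channels] = log2[channels] - log2[channels].median()
proteins = log2.groupby("protein")[channels].median()
print(proteins)
```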
Well, so DIA is data-independent acquisition and DDA is data-dependent acquisition, and I think amongst practitioners in the field there's a bit of controversy at this point. DDA is the older technique, and the engineers who build and design instruments have been working for many years to do it very well. DIA has emerged in the last few years, with people basically trying to run those instruments in a way that gets what they like to think of as more comprehensive data by collecting many things at a time, and some people would say that it makes the data a bit more of a mess. Simply put, I think fans of DIA are a bit like New York Yankees fans who live in Boston. I myself am a Boston Red Sox fan, and as far as where I stand, let's just say the Red Sox won the World Series this year. Now, if I put my scientist hat back on, I would say we're probably headed for an era, say with the next generation of instruments or so, where the two techniques are going to become more merged. The instrumentation is already becoming faster, and I think DIA doesn't make enough use of that information to trigger acquisition; some of the compromises currently made to collect data in a DIA fashion will no longer be limiting with, say, one more generation of instruments. You will have instruments that are fast and sensitive as well as specific, and I think that combination will look a bit more like traditional DDA.

Well, I think my lab is already producing data that's of high quality and gets published in top-notch journals, but at the same time it's a bit like being a homeowner: you live in the home, your family grows, and you want to make the home a better place. Things are constantly being improved, but I think we're already at the point where we can claim to be robust and reproducible. That's not to say we're satisfied; we would like to be able to do things more efficiently, and to make them more robust and more reproducible, so that we can get even higher quality data.

Well, so it's already possible to do phosphoproteome analysis in a high-throughput manner. We are now also doing acetylome analysis, where we enrich for peptides that are acetylated on lysine residues using an anti-acetyl-lysine antibody; phosphopeptide enrichment today is done using an immobilized metal affinity chromatography approach. When we have a complete data set, those data sets often have significant numbers of missing values that make drawing conclusions from them harder. If we can improve anything, I think the enrichment process is probably one of the most limiting things at this point, with the most room for improvement; sensitivity, any way you can get it, always helps these things, and I guess those are the major factors.
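As an illustration of the missing-value problem, here is a minimal sketch assuming pandas and a synthetic phosphosite-by-sample matrix; the 70% completeness filter and the per-row median imputation are arbitrary illustrative choices, not CPTAC's actual procedure (left-censored imputation is often more appropriate for MS data).

```python
# Sketch: filter sparse phosphosites, then impute the remaining gaps.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(100, 10)))  # stand-in log intensities
missing = rng.uniform(size=data.shape) < 0.3
data = data.mask(missing)                        # ~30% missing values

# Keep sites quantified in at least 70% of samples...
kept = data[data.notna().mean(axis=1) >= 0.7]
# ...and fill what remains with a simple per-site median.
imputed = kept.apply(lambda row: row.fillna(row.median()), axis=1)

print(f"{len(data)} sites -> {len(kept)} after filtering; "
      f"{int(imputed.isna().sum().sum())} missing values remain")
```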
Inevitably, I think right now we also have a certain amount of uncertainty with regard to localizing the sites of modifications when there are multiple residues in the same peptide that could be modified. The limitations in doing that are not really software; they are more the underlying data. MS/MS fragmentation tends not to give complete sequence coverage, so you often end up with uncertainty. Today, when we do phosphoproteomic analysis, for maybe 70% of the phosphopeptides we identify we can confidently localize the site to a particular serine, threonine, or tyrosine residue, and if we had more complete fragmentation that would improve. For acetyl-lysine-containing peptides it's much less of a problem, because there is not as much potential for multiple lysines in a peptide, and if there is, one lysine will be at the C-terminus, because it's a tryptic peptide, and the other lysine somewhere else is probably the acetylated one. So localization is less of a problem there.

I think in order to do good science you always need good samples to start with. Most of the work I'm involved in these days is related to the CPTAC program, which in the United States is run by the National Institutes of Health, and at this point the tumors are collected under a protocol that has been optimized to make sure we get proteomic data of as good a quality as possible. Those samples come from different places in the world, and it's critical for an effective program to have partnerships with hospitals and cancer centers that can provide those materials. Now, if we were to improve the technical aspects of our work, particularly by being able to work with less and less material while still getting the data quality we want, we could do even better. Right now we tend to require somewhat larger tumors that are easy for surgeons to obtain from patients, and what we're actively trying to do is reduce the amount of material it takes to generate data, so that we can work effectively with just biopsies of tumors. I think that is going to open up larger studies that hopefully can produce even better data.

Hi, so I'm Kelly Ruggles. I am an assistant professor at NYU School of Medicine, and I'm also the director of academic programs for the Sackler Institute, also at the NYU School of Medicine, so I'm involved in both research and education at NYU. My lab focuses on multi-omics integration, proteogenomics, and the microbiome in cancer, lots of different areas, but we are specifically interested in how we can integrate these diverse data types to understand human disease.

Yeah, so a lot has happened since the Human Genome Project, mostly because the technology has become so much better. We're able to assess genomes at much higher depth, we have much higher coverage, and sequencing has become much cheaper, so many more organisms have been sequenced, and we can look at things like epigenetics and the transcriptome and all sorts of levels of omics data. So a tremendous amount has been done. In terms of the limitations, we still have only sequenced a small percentage of the total organisms in the world, so there's a lot more we could do, and getting really good depth with whole-genome sequencing is still very expensive. So I think with time we may see even more improvements as the technology improves.

Yeah, so mutation status is something we deal with a lot, specifically with our cancer data, because we're really interested in understanding how somatic and germline mutations are identified and how they affect tumors. The actual identification of variants occurs through several different pipelines. TCGA, the Cancer Genome Atlas, has been instrumental in coming up with informatics pipelines that allow for variant calling from either whole-genome or whole-exome sequencing, and there are also SNP arrays available. So there are a couple of different ways you can do this, and it has been really well developed because of all the work that's been put into it, a lot of it by the Cancer Genome Atlas as well as other big consortiums and smaller groups. We've made a lot of progress with that, and I think it's become a really interesting way to understand cancer.
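As a small concrete example of what comes out of such pipelines: variant callers typically emit VCF files, and downstream analysis often starts with simple filtering. A minimal standard-library sketch, assuming a hypothetical file name; real work would use a dedicated parser such as pysam or cyvcf2.

```python
# Sketch: keep PASS variant calls from a VCF and tally them per chromosome.
import gzip
from collections import Counter

counts = Counter()
with gzip.open("somatic_calls.vcf.gz", "rt") as vcf:   # hypothetical file
    for line in vcf:
        if line.startswith("#"):        # skip header lines
            continue
        # Standard VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        chrom, pos, _id, ref, alt, _qual, filt, info = line.split("\t")[:8]
        if filt != "PASS":              # drop caller-flagged artifacts
            continue
        # Crude classification (multi-allelic records not handled here).
        kind = "SNV" if len(ref) == 1 and len(alt) == 1 else "indel"
        counts[(chrom, kind)] += 1

for (chrom, kind), n in sorted(counts.items()):
    print(f"{chrom}\t{kind}\t{n}")
```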
So microarrays have somewhat gone out of style as RNA-seq has become the primary method for measuring transcriptomics, and the main reason is that with microarrays you really need to choose your genes before you measure them. There are specific probes that you pick, so you pick a certain number of genes that you're able to measure, and then you can only measure those. With RNA-seq you're able to measure everything that's in your sample, so it's an unbiased approach, and that has really pushed the field forward: having the ability to not choose beforehand what you're measuring, and being able to measure whatever is in your sample.

It's a good question. They're both really important, good methods, and I think the main reason to use one or the other is usually cost and the question you're asking. With whole-exome sequencing, the exome is about two percent of the human genome, so if you want really high depth and really high coverage of what you're sequencing and you don't have unlimited funds, you would choose exome sequencing. If you're lucky enough to have a ton of money, then whole-genome sequencing is a great way to go, because you can get a lot of information about the non-coding regions, which we're learning more and more are extremely important to understanding cellular processes. So it really depends on your question, and it also depends on how much money you have to spend.

Single-cell RNA-seq is a really hot field and technique right now. It's something I'm not doing myself, but I work and collaborate with a lot of people who are doing single-cell RNA-seq. It's a method where you're able to separate cells out one by one and measure the gene expression specifically within each cell. There are a couple of methods you can use to do this, one of which is to use droplets: you put each cell in a different droplet, do the whole library prep within that droplet, barcode the RNA for each cell within the droplet, and then sequence all of them together and pull out afterwards the cell-specific RNA expression. It's a really interesting and great method for understanding the heterogeneity of samples. The coverage right now is not very high, I think about a thousand genes depending on how people do it, so again, as the technology improves, I think this is going to become an even more exciting field.
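The barcode bookkeeping behind droplet methods can be illustrated with a toy example: each read carries a cell barcode and a unique molecular identifier (UMI), and per-cell expression is recovered by grouping on the barcode. The read layout below is a simplified assumption, not any specific platform's format.

```python
# Toy sketch: recover per-cell counts from barcoded single-cell reads.
from collections import defaultdict

# (cell_barcode, umi, gene) triples as they might come out of alignment.
reads = [
    ("ACGT", "AAC", "TP53"), ("ACGT", "AAC", "TP53"),  # PCR duplicate
    ("ACGT", "GTT", "EGFR"),
    ("TTAG", "CCA", "TP53"),
]

# Collapse duplicate UMIs, then count unique molecules per cell and gene.
molecules = defaultdict(set)
for barcode, umi, gene in reads:
    molecules[(barcode, gene)].add(umi)

for (barcode, gene), umis in sorted(molecules.items()):
    print(f"cell {barcode}: {gene} = {len(umis)} molecule(s)")
```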
Sure, yeah. So CPTAC, the Clinical Proteomic Tumor Analysis Consortium, is funded by the NCI within the NIH, and it's a great consortium because it's a large group of us working together to try to understand cancer using proteogenomics. We're using proteomics, phosphoproteomics, genomics, and transcriptomics to really try to understand: if we integrate this data, can we identify biomarkers, can we find signatures, can we understand drug toxicity and predict how people will respond to different drugs? So we're trying to harness all of this data to really understand the clinical aspects of cancer and come up with new ways of treating and diagnosing people. We're working on a lot of different tumor types right now, and it's something that's going to continue. I'm part of one of the data analysis teams, so we're really in the data, trying to figure out how best to analyze it and how best to understand cancer using even more levels of data than we've used before.

That's a good question. Yeah, so I would love it if we also had metabolomics data. Metabolomics data really complements proteomics and genomics data in that you can see exactly what sort of enzymatic reactions are occurring, and you can try to figure out whether certain things are building up in the cell, or whether certain pathways are up or down and causing different metabolites to change in concentration. So that's another data type we can really benefit from, and it's becoming more and more popular in these multi-omics analyses; I personally would be really excited to work with it as well.

My lab works a lot on creating open-source tools, and we work with a lot of people who create open-source tools, and I think it's a really important thing for us as a scientific community to contribute. Open source meaning we create these pipelines and make them public. Something my lab and I are particularly interested in is making things interactive: having them on a web server, available to people who aren't computational, so they can upload their data and really explore it in an interactive way. That way they're able to ask their own questions, instead of relying on a bioinformatics expert to always take their data, do something with it, give it back, and repeat that iterative process. Making data available to the scientists themselves, so they can ask their own questions and play around with the data, is something I think is really important and that we should all work towards, especially in the computational field: allowing other scientists who don't have the same skills to be able to look at their data themselves.

So I'm very involved in our computational biology program at NYU; I help lead the master's program and I'm very involved in the PhD program. We're training our scientists at this point to really understand how to do some programming, or at least to understand how the programming works. They don't have to become experts; we can't all be experts in these fields. I think we're also really teaching people how to be collaborative. We're at a point in science where we all rely on each other, and it's hard to run a lab and just be insular and do everything yourself. So having people who do the wet lab, people who do the informatics, and people who translate between those two, and training our next generation of scientists to understand this and be able to work better in groups, is something that's really important. And also just training them to have the statistical and computational background that is required to drive the field forward, the entire scientific field forward, is something we all need to think about and invest in.