My name is David Campbell. I work at the Institute for Systems Biology, in the lab of Dr. Rob Moritz, which does a variety of proteomics techniques. I myself am a software engineer, so I work on a number of projects such as PeptideAtlas and the TPP.

I think there are several different barriers that need to be overcome. First, there are many different file formats: all the different data types produce their own distinct file types, with different fields. Even within the proteomics field there are many different data types for protein expression, so merging those disparate data types can be difficult; you essentially need a translator between the different types. Other barriers are that the instrumentation can be expensive, and the knowledge required to successfully analyze the data takes some time to obtain, so it's difficult for any one lab to have all the expertise and instrumentation needed to do these types of analyses.

Protein levels do fluctuate with cell cycle, disease state, and things like that. The more transient fluctuations, say with cell cycle, should be averaged out if you sample a population of cells or a whole tissue. This may change in the future when we actually do proteomics on single cells. But if you are interested in such cycle-related transient differences, you can start with a synchronized cell population and do a time-course experiment, for instance, to see the rise and fall. Basically, it's good to keep in mind that these differences do occur and to include that in your interpretation.

I think this is very difficult to do, because from a primary sequence it's difficult to tell exactly how a protein will fold, and once it's folded, what the characteristics of that particular charged surface are. You can approximate this by looking for conserved regions and homology to existing proteins or sub-sequences that have been solved. So you can approximate it, but never really de novo.

I guess ultimately I think proteomics data can be thought of as gene-centric as well, because every protein is the product of some gene, plus possibly some post-translational modifications or splicing. So yes, proteins all come from genes, and therefore there is a mapping between the two. One difficulty is that there are different accession spaces between genomics and proteomics, so coming up with a robust and reliable way to translate these back and forth is useful.

The main projects I've been involved with at ISB are PeptideAtlas and the TPP, more so PeptideAtlas. They are both evolving in response to user demand, and I think that's very important in software development in general: a lot of times developers want to build what they think will be a good solution without continually getting feedback from the users. The most important thing is to listen, get feedback, and develop what is necessary, not what you think you want to develop. The TPP is pretty mature in the context of data-dependent acquisition, but we're expanding it into the realm of DIA and other techniques. We're also continuing educational outreach, holding TPP courses literally all over the world; in the last two years we've had courses in Brazil, Ireland, Taiwan, India, and other places, plus several in the United States. So it really is an educational effort to help people understand what the tool is and how to use it.
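Picking up David's earlier point about translating between genomic and proteomic accession spaces: a minimal sketch of such a translator is below, assuming a hypothetical two-column TSV that maps protein accessions to gene identifiers. The file name, columns, and accessions are illustrative, not part of PeptideAtlas or the TPP.

```python
import csv
from collections import defaultdict

def load_accession_map(path):
    """Build a protein-accession -> gene-ID lookup from a two-column TSV.

    The mapping is one-to-many in general (several protein isoform
    accessions can point back to the same gene), so values are sets.
    """
    protein_to_genes = defaultdict(set)
    with open(path, newline="") as handle:
        for protein_acc, gene_id in csv.reader(handle, delimiter="\t"):
            protein_to_genes[protein_acc].add(gene_id)
    return protein_to_genes

def translate(accessions, protein_to_genes):
    """Translate protein accessions to gene IDs, flagging any misses."""
    hits, misses = {}, []
    for acc in accessions:
        if acc in protein_to_genes:
            hits[acc] = sorted(protein_to_genes[acc])
        else:
            misses.append(acc)  # unmapped accessions need manual review
    return hits, misses

# Hypothetical usage, mapping UniProt-style accessions to gene IDs:
# mapping = load_accession_map("protein_accessions.tsv")
# hits, misses = translate(["P04637", "Q9Y6K9"], mapping)
```

The robustness David asks for mostly lives in the misses list: any accession that falls through the mapping has to be resolved by hand or with a secondary source rather than silently dropped.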
One of the main challenges with PeptideAtlas is that it depends on public data sets, so as soon as we make a build it's almost obsolete: it's time to go collect more data, reprocess it, and remake the atlas. And we're talking about pretty vast amounts of data. There's a scientist at ISB named Zhi Sun who does most of this data wrangling, and she's very good at processing large amounts of data in a consistent and efficient way.

One interesting technical issue from a programming standpoint is the protein supergroup issue. One of the problems I alluded to earlier, between proteomics and genomics, is that with proteomics you typically get peptide sequences, and peptides can map to multiple proteins, so it's sometimes difficult to infer from the peptide sequences alone exactly which proteins you have. There's a program called ProteinProphet, and there are others, that solve this protein inference problem: given the peptides you've identified, what is the minimal group of proteins that explains all of your experimental data? ProteinProphet is good at this because it applies Occam's razor, which says the simplest explanation is the correct one. In PeptideAtlas there are so many peptides that we've gotten what is called a supergroup: peptides that map to multiple proteins tie the different protein groups together, so it's a little hard to explain, but because of sequence homology and the massive coverage in PeptideAtlas we end up with one huge group of proteins which we think we've seen, and it's difficult for the existing tools to deconvolute it and decide exactly what we have seen.

I think the most important thing is to take the time to learn about the data, the technique, and the tools available before jumping into your analysis. It's too easy to be eager to push your project forward, order some sequencing or what have you, and then start running tools without really understanding them. The other thing is to read the literature and see how other people who have analyzed the same kind of data approach the problem. And finally, one of the most powerful ways of doing any data analysis is from the command line. It allows you to string together pipelines of programs, as we heard discussed today, and it allows you to do your analysis on the cloud, which is very scalable. So learning to use programs at the expert level, which generally means the command line, is very useful, especially for a graduate student.

When you first started the question, I thought you were talking about commercial packages, which claim they can knit together these different data types, and I think they often can for very specific data types. In the open-source community, which the TPP is part of, along with open genomics projects like SAMtools and others, there are efforts to make a common data language. If you have a common data language, then pretty much any tool you use can be shared, or can be extended to use other types of data. So in the open-source space there is a desire and a recognition that interoperability is important. A commercial company is more interested in having you buy their software, so they're less interested in making everything interconvertible.
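The protein inference problem David describes a few answers back can be illustrated with a toy greedy set cover: find a small set of proteins whose peptides explain everything observed. This is only the Occam's-razor intuition, not ProteinProphet's actual probabilistic model, and the peptide and protein names are made up.

```python
def minimal_protein_set(protein_to_peptides, observed_peptides):
    """Greedy set cover: repeatedly pick the protein that explains the
    most still-unexplained peptides. Approximates the minimal protein
    list; real tools like ProteinProphet use peptide probabilities."""
    unexplained = set(observed_peptides)
    chosen = []
    while unexplained:
        # Protein whose peptides cover the most remaining evidence.
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        covered = protein_to_peptides[best] & unexplained
        if not covered:
            break  # leftover peptides map to no protein in the database
        chosen.append(best)
        unexplained -= covered
    return chosen, unexplained

# Toy example: pep3 is shared by all three proteins, so protein C is
# redundant once A and B are chosen.
proteins = {
    "A": {"pep1", "pep3"},
    "B": {"pep2", "pep3"},
    "C": {"pep3"},
}
print(minimal_protein_set(proteins, {"pep1", "pep2", "pep3"}))
# (['A', 'B'], set())
```

The supergroup issue appears when shared peptides like pep3 chain many groups together at atlas scale: the cover is then neither small nor unique, which is exactly what makes the deconvolution hard.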
Another scientist at ISB, Eric Deutsch, does a lot of work on standards initiatives like the Proteomics Standards Initiative. They come up with defined file types and defined ontologies, descriptive languages for communicating information, and I think coming up with common formats and language is the most important part of interoperability.

My name is Luis Mendoza. I'm a software engineer at the Institute for Systems Biology, in the proteomics lab of Dr. Robert Moritz, where I've been working for almost 15 years. My main goal has been to develop the software for the Trans-Proteomic Pipeline (TPP), a free, open-source collection of analysis, validation, quantification, integration, and visualization tools that enables the advanced analysis of high-throughput proteomic data from many kinds of instruments under many conditions. We've had great success: thousands of people across hundreds of labs around the world, from small labs all the way to big pharmaceutical companies, use our software to some extent. That's what I've been doing there for the last 15 years.

Well, there are many barriers as I see it. Some of them are purely technical. Obviously there's all kinds of data being acquired on different omics platforms, but bringing all this data together is still a challenge, mostly because many researchers specialize in one area or another, and when they try to integrate their data there's no easy way to do it. So it's largely up to software to enable connecting genes to proteins to transcriptome data and many other data types. At the moment, just being able to provide a good software platform, or even portals where we can integrate the data, and this has already been done to some extent, is probably the biggest barrier. There's also learning about the different kinds of data and how to interpret them: a genomicist may not really understand proteomic data very well, for example. That's also a common barrier, and if you don't have a good collaboration with someone else, it can make things a little more difficult.

For the TPP specifically, even though a lot of effort is going toward proteogenomics, there will always be a need, to some extent, to do the identification and validation of just pure peptides and proteins, so that part of it might not go away. With these new techniques, we have so far proven over the years that the TPP has been able to evolve to accommodate different kinds of data sets, techniques, and analyses; we're handling RNA-seq already, and we're doing other things. So if this is where the field is moving, it is very much a possibility that we will expand our tool sets and create new software if that is what the field requires, or in some cases integrate third-party tools, as we already do, to offer a more complete solution in a single software platform that people can rely on.

This is always a tricky question, and we actually have many users who use the TPP for organisms with poorly characterized proteomes. It is indeed fairly difficult to do. The TPP specifically does require some sort of reference; that is the basis of sequence database searching. But there are other tools out there, which we don't specifically have within the TPP beyond a simple gene-to-protein translation, that allow you to generate a customized database. For example, from the RNA-seq data of the organism you're studying, you can use a set of tools, a pipeline that is not something we developed but that we are using at ISB, to generate a customized database that looks very much like the sequence database you would then search against. The reason we don't have this in the TPP yet is twofold: first, other groups have already written these tools, so why write them again; and second, and more importantly, at the moment this analysis requires a very large amount of processing power, memory, and time. Most normal computers can't do it even in several days. We have access to large computers, and even then it easily takes one full 24-hour day to analyze, so we're trying to figure out ways to make this more efficient and faster, and we're still working on that.

I think in general in science you want a combination of collaboration and competition. There are other teams that develop software that in some ways the community, or even we, think is better than something we have, or that covers something we don't have, and we're able to integrate it with ours to make the whole platform better. Obviously there are also other teams that make software similar to the TPP, which gives us a little bit of competition, or a lot, and I think that keeps us all providing a better software product, even if it's free to everyone. So I don't think one model is necessarily better than the other. A tool is only as good as your ability to use it; even a very fast car is no good to you if you don't know how to drive it. I think all of the popular tools out there score fairly evenly; in the end it's really up to the researchers to figure out whether, in their hands, they can use a tool to get to their answers, and whether it's easily available to a large audience. Obviously we think our tools are worth a look, since they work very well for us. We often compare them to the other ones and we think they are very competitive; in our hands they perform the best, at least among free software.

The TPP, like some other software out there, is great because it has been evolving for more than 15 years. What started as a simple program to validate peptide assignments has grown into a suite of software that can do all kinds of end-to-end analysis: validation, quantification with different methods, visualization in different ways, alignment. It has been great to provide researchers, mostly in our lab initially, and then of course a large part of the worldwide community, tools that enable them to do great things. As a software developer, I feel like a very small part of someone else's success when they do something very interesting with our tools, and that's where the true value of the tools lies. It's not exactly something I do myself, but it enables researchers to find cures for disease, or new ways to maybe eradicate some other organism, some type of virus or something. In order to stay relevant over the years, we've had to collaborate with our own scientists and external scientists, and when you reach out and teach or present at a conference, for example, we get a lot of feedback, and that enables the tools to become more mature and perhaps more useful to everyone.
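Luis's proteogenomics answer above, generating a customized protein database from RNA-seq to search against, boils down to translating transcripts into candidate protein sequences. Below is a minimal sketch of the core step, a six-frame translation into FASTA entries; this is not the ISB pipeline he mentions, and the transcript sequence is invented.

```python
# Standard codon table, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[i]
               for i, (a, b, c) in enumerate(
                   (a, b, c) for a in BASES for b in BASES for c in BASES)}

def revcomp(seq):
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[base] for base in reversed(seq))

def six_frame_proteins(seq, min_len=7):
    """Yield protein fragments (between stop codons) from all six frames."""
    seq = seq.upper()
    for strand in (seq, revcomp(seq)):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            residues = "".join(CODON_TABLE[c] for c in codons)
            for fragment in residues.split("*"):
                if len(fragment) >= min_len:  # skip tiny fragments
                    yield fragment

# Hypothetical usage: print FASTA entries for one assembled transcript;
# a real run would loop over a whole transcriptome assembly.
transcript = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
for n, protein in enumerate(six_frame_proteins(transcript, min_len=5)):
    print(f">txp1_orf{n}\n{protein}")
```

The cost Luis mentions is easy to believe from this sketch alone: six frames over millions of transcripts multiplies the database size, and search time and memory grow with it.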
Some of the capabilities students need for this: because we now have high-throughput, very large data sets, you have to have a basic understanding of statistics and experimental design, but also the basics of how to use several pieces of computer software, and you need to be able to evaluate the results to see which works best for you. At the most basic level, many of the software packages out there have a fairly easy-to-use interface, but there can be many reasons why they are a little difficult to use, such as different data formats, so you have to familiarize yourself with those pipelines. It's also very useful for students to learn, even a little bit, to use things on the command line, or tools like the statistical language R, because then you can unlock features that may not be obvious from a simple graphical user interface. Especially if you're doing high-throughput studies with many, many samples, this makes your life far more efficient and easier. There are little things you can do that wouldn't be too difficult to learn and that will really give you a lot of value for your time. And obviously, talk to other students, talk to your professors, talk to the people who develop the tools, to help you figure out how best to use them and how to get to the results you're looking for, so that you spend more time doing interesting research than just trying to run some software.

The challenges are still around; there's always a challenge, and, like they say, you should always consider that an opportunity. There are always times when something doesn't work, or a new data set gives you a strange result you weren't expecting because you didn't have that kind of data before. This comes up constantly, and it makes the tools more robust: once we've solved that problem, anyone with this type of data will hopefully have the problem already solved for them by the tools. So one of the challenges is just not having all the data around, but that makes it fun for us, trying to solve the problem and provide an answer.
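Luis's command-line advice, echoed by David earlier, pays off most in exactly this many-samples situation. A minimal sketch of scripting a search over a directory of runs is below; `search_engine` and its flags are placeholders for whatever tool a lab actually uses, not a real TPP command.

```python
import subprocess
from pathlib import Path

# Run a (hypothetical) command-line search tool over every raw file in a
# directory -- the kind of loop that is tedious to repeat through a GUI.
DATABASE = "proteome.fasta"      # placeholder reference database
RAW_DIR = Path("raw_data")
OUT_DIR = Path("results")
OUT_DIR.mkdir(exist_ok=True)

for raw_file in sorted(RAW_DIR.glob("*.mzML")):
    out_file = OUT_DIR / (raw_file.stem + ".pep.xml")
    cmd = ["search_engine",      # placeholder executable name
           "--database", DATABASE,
           "--input", str(raw_file),
           "--output", str(out_file)]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop the batch on the first failure
```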