I guess we will start. So welcome everybody to this SIB computational biology seminar series. Today we have the pleasure to have Christian Ahrens from Agroscope. Christian studied Technical Biology at the University of Stuttgart in Germany, and he earned his PhD in genetics from Oregon State University in the US in 1995, working on the identification of essential regulatory genes involved in bacterial denitrification. From 1995 to 1997 he went on to his postdoc work at Oregon State University and then at the Institute for Genetic Research in Karlsruhe in Germany. Christian then worked on target discovery and bioinformatics research at several companies in Germany and in Denmark from 1999 to 2004. He then moved to Switzerland, to the University of Zurich and the ETH, where he worked as a scientist in proteomics bioinformatics. Since 2013 he has been based at Agroscope, leading the molecular diagnostics, genomics and bioinformatics group, and since 2014 he is also a group leader at the Swiss Institute of Bioinformatics. Christian's research group centers around the bioinformatic integration and analysis of data sets from state-of-the-art omics technologies. The data sets they work on come from close collaborations with experimental biologists and include genome sequences, gene and protein expression, as well as metabolomics data. One particular focus of the group is to develop strategies to achieve complete proteome coverage, including the membrane proteins, and to identify all the proteins encoded in a genome, which is called proteogenomics. The group has also contributed to the development of several software tools. One of them is called Protter, a software tool that allows the visualization of the topology of membrane proteins and integrates annotations and experimental evidence in the form of publication-ready plots.
Another resource is called PeptideRank, which allows the user to select the best-suited peptides to quantitatively measure protein amounts in several organisms. So today Christian will share with us some insights on proteome discovery in prokaryotes, based on an integrative analysis of genomics, transcriptomics, and proteomics data. Christian, thanks for coming, and the stage is yours. Thank you very much for the kind introduction. Hello, good afternoon. I will try to give a bit of an overview of the activities of the group in the next 45 minutes or so. Right now I'm employed by Agroscope, the Federal Research Institute for Agriculture and Nutrition. The work that I'm going to be talking about has mainly been done at the University of Zurich as part of a systems biology priority project. And let's go right in. Not yet. Yeah, now it's working. So yeah, thank you. Here, from the SIB, you see some of the activities of the group. With the main activity being proteins and the proteome, I will also cover some of our efforts integrating genome and transcriptome information, and cover a bit of what we do in terms of infrastructure; all of this is related to systems biology. Here is a recent picture for those attending online. To give you an outline of the talk: I will briefly try to remind you of the advantages of proteomics data, and why this actually gives us a lot of unique insights. Then I will tell you about a generic strategy to cover complete expressed proteomes, and do that for prokaryotes, where this is feasible. I will give you some insights on the follow-up studies we've done to exploit this rather unique data set, a complete, condition-specific expressed proteome. And then, for the last 15 minutes or so, I will show you what we're working on right now, proteogenomics, that is, using the proteomics data to try to find evidence for all the genes encoded in the genomes.
And as I will show you, currently I believe all of the genome annotations, even of the best-annotated model organisms, including E. coli, are still draft genome annotations, far from complete. Then I'll finish with a summary and outlook. So, in an era where we can sequence prokaryotic genomes, or even lower-complexity eukaryotic genomes, with ease within days, I still think it's important to remind ourselves that there are some major advantages that proteomics data provide that you cannot get from gene expression and genomic data. One of them is that if we look at the correlation of the average mRNA concentration and protein concentration in the cell, typically there's an okay correlation, something like an R-squared value of 0.5, clearly indicating that gene expression, where you can measure all the elements of the genome, all the genes, does not give you the information about the true expression of the proteins, which are much closer to function in the cell. Moreover, proteins can of course be modified by different post-translational modifications, including, for example, phosphorylation. So for signaling cascades and so forth, you will not be able to deduce their activity using gene expression; you would need proteomics information for that. Protein modification, and also maturation, as for example by proteases, is information only provided by proteomics. Furthermore, often you want to get insights about the function of a protein or a gene, and often that's done by guilt by association, by looking at its interaction partners. Proteins don't work alone, they work in complexes. So this information about which protein works together with another protein is going to be very important to help you unravel the function of certain processes in the cell. A fourth advantage is called proteogenomics; we'll cover that in more detail later.
Basically, you take the mass spectrometry data, the peptides that you analyze in the mass spectrometer, and you match the spectra against a database, which can be a protein database, but could also be a six-frame translated genome or any other ab initio gene models, to provide evidence for the expression of these gene models: new evidence that has not been contained in the annotation, for example additional exons that have not been annotated, and even completely new gene models. A last advantage I'd like to mention is of course that if you fractionate your samples by subcellular compartment, the protein expression data can also tell you where a certain protein is active, which can be of functional relevance as well. Now, in a review that we published a couple of years ago, we looked at how proteomics and proteome coverage had developed over the last 10 or 15 years, and what is important to notice is that a lot of developments in terms of experimental approaches, instrumentation, better mass spectrometers, software, and computational approaches have really helped to propel the proteomics field. One way to look at that is, for example, to plot the proteome coverage, from zero to 100%, against the year. Looking at the example of yeast, Saccharomyces cerevisiae: by 2D gel studies in 1999, going over really a long period of time, years of experimentation, about 6% of the proteome was covered. In 2001, there was a publication where 23% of the theoretical yeast proteome was covered by shotgun proteomics, the gel-free method that didn't rely on 2D gels anymore, but took the proteins, digested them into peptides, separated those on chromatography columns, and measured them by mass spectrometry. So shotgun proteomics provided a big boost.
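Matching spectra against a six-frame translated genome, as mentioned above, starts from translating both strands of the DNA in all three reading frames. A minimal sketch of that translation step in Python (simplified: stop codons are kept as '*', whereas a real search database would split at stops and apply a minimum ORF length):

```python
# Standard genetic code (NCBI table 1), codons ordered by TCAG digits.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(seq):
    # Translate a nucleotide sequence codon by codon; 'X' for ambiguous codons.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame(dna):
    # Return all six reading frames: +1..+3 on the forward strand,
    # -1..-3 on the reverse complement.
    dna = dna.upper()
    rc = dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for strand, s in (("+", dna), ("-", rc)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

For example, `six_frame("ATGAAATAA")["+1"]` yields `"MK*"`; concatenating such translations over a whole prokaryotic genome is what makes the six-frame search database so large and computationally expensive to use.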
Actually, in 2006, the coverage achieved by one experiment alone was 31%, achieved by Matthias Mann's group. And in 2008, using extensive fractionation, a really massive effort, they were able to define what they called the first completely expressed yeast proteome, under two conditions. At the University of Zurich, we had measured proteome catalogs for different model organisms, including Drosophila, Arabidopsis and C. elegans, but had not come to this complete proteome coverage. One of the things that we wanted to achieve in this initiative at the University was basically to learn from and adopt some of the very useful strategies from genomics. There was a two-step strategy in genomics that has been very useful: in the first, discovery phase, you go in and identify all the elements of the system, and in terms of the genome, that's the genome sequence, which, as you may remember, for the human genome took years to accomplish. But once this initial discovery phase is finished, you can, and that's what people have done, develop technology and design specific probes, using microarrays with specific oligonucleotide probes, to then go in and score the system for the quantitative expression of all the elements. We wanted to develop something similar for the proteome: in the discovery phase, go and measure, from many different tissues, time points, and developmental stages, the proteome as comprehensively as possible, and then, in analogy to these specific oligonucleotides that you are aware of from microarrays, use specific signature peptides, so-called proteotypic peptides, that uniquely and unambiguously identify one protein. You then use a different mass spectrometry technology that doesn't look at all the proteins, but only at the peptides that you tell the mass spectrometer to look at, which is about 100-fold more sensitive and can generate the complete quantitative data series required for systems biology.
One important point where this analogy breaks down is that with the genome sequence, except maybe for a varying percentage of repeats that are very difficult to sequence, you can parallelize this; it is really a process that can be parallelized and sped up. But if you try to discover a complete transcriptome or proteome, you have a lot of transcripts and proteins of varying abundances, so you will typically always hit the most abundant ones first, and if you want to describe a complete transcriptome or proteome, you really need a different strategy. This is also reflected by the fact that we have maybe even more than 30,000 complete genome sequences; as I mentioned before, today, for prokaryotes, you can sequence some of the low-complexity ones in less than a day, whereas of complete expressed proteomes, at that time, there was no more than yeast. So, switching gears now and introducing the model system, the prokaryote that we used, Bartonella henselae: this was in collaboration with the Biozentrum in Basel. The model organism was most interesting to us because it has a relatively small genome, about two megabase pairs, with about 1,500 protein-coding genes. It was, or is, a model organism to study host-pathogen interaction, and importantly, it can be grown in pure culture, so we could isolate pure bacteria, which is important when you try to measure a complete proteome. We did have an in vitro model system available, and the group also had technologies to do subcellular fractionation.
This in vitro model system is based on the life cycle of Bartonella henselae. Here you see the phase in the intestine and midgut of the arthropod vector, from where the bacterium is deposited onto the skin, gets into the dermis, and then basically comes into the endothelial cells lining the blood vessels, which are the primary, the first, reservoir where replication takes place; later on it goes into a replicative cycle within erythrocytes. For these endothelial cells, we knew that a certain set of genes was required: a surface protein and the VirB/D4 type IV secretion system. There are actually two of these secretion systems; VirB/D4 is the one absolutely required for the infection of these endothelial cells, and this is the stage the model system was trying to replicate, whereas as a negative control there is the other type IV secretion system, the Trw type IV secretion system, which would be required for the infection of the erythrocytes and which we did not expect to see expressed under our conditions.
Importantly, of the many advantages I told you about, if you want to really go and do such a complete proteome study, you have to focus; we did focus on the expressed proteome, you cannot cover all these advantages at once. And we developed a generic strategy for how we could get these condition-specific expressed proteomes. The idea was to isolate, from matched samples, both total RNA, which we sequence to saturation to define an endpoint estimate, and, from the same samples, the proteins, which we subject to subcellular fractionation to reduce the complexity, so that we can also get at the lower-expressed proteins that we were after. Then, in a pilot phase, we measure the proteins, and afterwards target the biases that you see against this expressed endpoint, basically going through an iterative cycle to complete the proteome, doing some tests to convince yourself, and once you have that, you can do a lot of interesting experiments. So, to show you a few numbers: we took biological replicates from the two stages that I mentioned, uninduced and induced, the induced stage being the one where this type IV secretion system, which straddles both the inner and outer membrane of those bacteria, is expressed. We isolated and sequenced about 55 to 80 million reads per sample, including about 10 to 25 million unique reads.
You can see that you soon get to the saturation phase I mentioned before; we extrapolated that even by sequencing double the amount of reads, we could not get much deeper into this expressed transcriptome. We also applied some additional criteria and parameters: we required at least five reads at the 5' end and an RPKM value of greater than 10, to come up with an endpoint estimate of about 1,353 expressed genes in those conditions, and you see that nicely in the induced condition the target genes are, of course, several fold up-regulated. So that gave us an endpoint from the transcriptomics, for both these conditions, against which we could now work. Again, we would go in on these matched samples and look at fractionations of the proteome under the different conditions. Basically, we looked at the cytoplasm and total membrane, and further fractionated that into inner and outer membrane, all of that with high-mass-accuracy mass spectrometers. We went in with a pilot phase where we did eight experiments on both conditions, induced and uninduced: cytoplasm, total membrane, and both inner and outer membrane. That gave us about 920 proteins. Again, we did some extrapolation here, some modeling: how would this continue if we did not change the experimental approach? But we knew that from there to the endpoint, there are of course other protein classes that we would likely have to target, and this is shown here in some density plots, where you see in gray, for the pilot phase, the distribution, for example, of the protein length. In orange, hard to see, is the length distribution of all proteins, and there's a bias against short proteins, which is typical of shotgun proteomics. We went in with directed experiments, in this case gel filtration, where you enrich for short proteins, and could overcome this and identify many new short proteins.
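The expressed-gene endpoint criteria just described, a minimum read support at the 5' end plus an RPKM cutoff of 10, amount to a simple filter over per-gene read counts. A minimal sketch; the data layout (a dict of per-gene tuples) is purely illustrative:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    # RPKM = reads per kilobase of gene length per million mapped reads
    return read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)

def expressed(genes, total_mapped_reads, min_rpkm=10, min_five_prime_reads=5):
    # genes: name -> (read_count, length_bp, reads_at_5prime_end)
    # A gene is called expressed when it passes both the RPKM cutoff
    # and the 5'-end read-support requirement.
    return [name for name, (n, length, n5) in genes.items()
            if rpkm(n, length, total_mapped_reads) > min_rpkm
            and n5 >= min_five_prime_reads]
```

For instance, with one million mapped reads in total, a 1 kb gene with 1,000 reads and solid 5' coverage passes (RPKM 1,000), while the same gene with only 5 reads (RPKM 5) does not.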
We could do the same for low-abundant proteins with a certain enrichment method called ProteoMiner, which helped us to identify low-abundant proteins. Most importantly, you see here these last experiments, done with OFFGEL electrophoresis at the peptide level, which really helped us to massively increase coverage of basic proteins, typically underrepresented in shotgun proteomics, and of the membrane proteome, another very difficult part of the proteome to get. This led us to the 1,250 proteins that we identified overall, and we extrapolated and could show that there is almost no benefit in adding another quarter more experimentation to this. One thing that is important for the strategy I mentioned is that we also used different search engines: one that is commonly used, Mascot; then a statistical extension, Mascot Percolator, which helps you identify about 50% more correct peptide-spectrum matches; and one publicly available software that added another 6% of correct matches on top of those. That's of course quite important. At the spectrum and peptide level, you can see here the benefit of this MS-GF+ software, developed by Pavel Pevzner's group at UC San Diego. At the peptide level this also gives a much larger increase in identifications; down at the protein level this becomes less pronounced, but importantly, with very stringent control of the false discovery rate, we came to identify 1,250 proteins. Only seven of those were matches against the decoy database, indicating the really very stringent false discovery control, which you also need for doing this complete proteome coverage. We went about 10-fold to 100-fold more stringent than other people do at the spectrum level, which then translated to the very low protein-level false discovery rate that we were aiming for.
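The stringent target-decoy FDR control mentioned here comes down to counting decoy versus target peptide-spectrum matches above a score threshold. A minimal sketch of the idea; the scoring and threshold search are far simpler than what Mascot Percolator or MS-GF+ actually do:

```python
def psm_fdr(scores, is_decoy, threshold):
    # Target-decoy FDR estimate at a score threshold:
    # FDR ~ (#decoy PSMs above threshold) / (#target PSMs above threshold)
    targets = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and not d)
    decoys = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and d)
    return decoys / targets if targets else 0.0

def threshold_at_fdr(scores, is_decoy, max_fdr=0.0001):
    # Walk candidate thresholds from strict to loose and keep the loosest
    # one that still satisfies the (very stringent) PSM-level FDR.
    best = None
    for t in sorted(set(scores), reverse=True):
        if psm_fdr(scores, is_decoy, t) <= max_fdr:
            best = t
    return best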
Just one thing to mention: we also classified these peptides according to their gene-model information content, something we had described in Nature Biotechnology in 2010. This gives you what I mentioned before, namely how information-rich a peptide is. If you want to go to targeted proteomics, you really want those peptides that are information-rich. If you're interested, you can follow this up. For eukaryotes there are many classes; there are a lot of splice variants that can make this picture much more complicated. For prokaryotes it's not so difficult, but of course when you think about moving on to studying host-pathogen systems with such a proteomics technique, you would definitely want to do this. We can then visualize the data on the genome: that's the expressed genes, adding the 1,250 proteins, which is about 90% of the expressed protein-coding genes. And we start to see that there are certain areas where no proteins are expressed, the red ones, and when we add genome information we can see several things. Of course, our positive control: all the members are expressed, both transcribed and at the protein level; the Trw operon, our negative control, is not expressed at the protein level, several of them only at the transcript level. And then we see that there are certain genome regions, annotated here as prophages or genomic islands, where many of the ORFs are not expressed, which is actually to be expected, because often the genes on these genomic islands are required only under certain conditions. One thing to note: among the 1,488 annotated ORFs, there were about 50 that did not have any annotation in eggNOG, which is a very broad annotation resource, and of these 50 we only found evidence for two, a really very statistically significant underrepresentation for these ORFs, and we believe that's also evidence that some of these ORFs are actually over-annotated.
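The peptide classification by gene-model information content mentioned above boils down, for a prokaryote, to asking whether a peptide maps to a single protein, to several proteins of one gene, or to several genes. A coarse sketch that collapses the finer eukaryote-specific classes (splice variants etc.) handled by the published classifier:

```python
def classify_peptides(peptide_to_proteins, protein_to_gene):
    # A peptide mapping to exactly one protein of one gene is
    # information-rich (proteotypic) and suitable for targeted
    # proteomics; anything shared carries less information.
    classes = {}
    for pep, prots in peptide_to_proteins.items():
        genes = {protein_to_gene[p] for p in prots}
        if len(prots) == 1:
            classes[pep] = "unique_protein"
        elif len(genes) == 1:
            classes[pep] = "unique_gene"  # shared by products of one gene
        else:
            classes[pep] = "shared_genes"
    return classes
```

In a prokaryote, most tryptic peptides fall into the first class, which is why the picture is much simpler than for eukaryotes with their splice variants.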
When we look at the measured relative protein abundance, we compared this to three studies that had been published on Bartonella before, against our proteome, the 1,250, and you see that these previous studies had really identified mainly the highly abundant proteins, the ones you can easily capture. We did manage to go in and get into the low-abundant protein range where we wanted to go. Just a couple of things now on the pilot phase: I told you we had identified 924 proteins with these first eight experiments. When we look at the transmembrane proteins, there is a significant underrepresentation, as is often typical of shotgun proteomics; after all these additional experiments, we observed that there is no underrepresentation of membrane proteins anymore. We followed this up with many more analyses, some of which you can find in the supplement, where we did 2D density plots based on the expression and some of the physicochemical parameters, to really show that this is what we could claim to be a completely expressed membrane proteome, likely for the first time. Showing you here now the genomic structure of the VirB type IV secretion system with the downstream effector proteins that are secreted into the eukaryotic host cell: you can see that at the RNA-seq level these are largely induced, many of them strongly, but the up-regulation at the protein level is much more pronounced. You can visualize this here, and this is the first time that complete coverage of such a type IV secretion system has been achieved, because typically, of course, the membrane proteins are very difficult to identify. And just to show you: the negative control, the type IV secretion system relevant for the subsequent infection of erythrocytes, is not regulated, as we expected.
Just this last one here: we can of course also do differential expression analysis, and basically rank the differentially expressed genes according to statistical significance, using here DESeq, or edgeR, developed by Robinson, for example. What we realized is that in this condition, what we really see is a massive reorganization of the membrane proteome: the key targets of the transcription factor that is up-regulated in the induced condition, the VirB/D4 system, plus autotransporters, hemin-binding proteins, possibly a virulence factor, and an ECF sigma factor known to be involved in membrane stress control. So this was just to give you a bit of an overview of what we have done with this data set. Again, this is published, and there are many more analyses that you might find interesting. What we did do, having such a data set in hand, is work on different aspects of it. Ulrich Omasits, who was also the first author of the Genome Research paper, developed the open software Protter for the integrated visualization of both annotations, coming for example from UniProt, as well as experimental proteomics information, onto the predicted topology of these membrane proteins. This gives you the chance to load the data, analyze it, and create publication-ready plots within half an hour; for example, we did that for the 280 Bartonella membrane proteins. It is free to use, and it's heavily used by the community. There's another aspect I mentioned: the idea was to go towards targeted proteomics. As you may remember, in this discovery phase we are measuring, for example, this bacterium, and you will reach a certain protein coverage; we went very high with all this massive effort, but typically, with a few experiments, you may come to 50% protein coverage.
What you can do is to think: well, actually, without doing more experiments, we want to create a predictor that tells us what would be the best peptides if we wanted to do targeted proteomics for the 50% of the proteins that we did not identify. We did this with a learning-to-rank model developed by Ermir Qeli, a postdoc in the group several years ago, who came up with the idea to learn, from the most often observed, information-rich peptides per protein, a ranking of the best peptides, and then predict that ranking for proteins that were not observed. That way, people can actually use such a software to predict, for an organism where there is no proteomics data, which would be the best peptides for the quantitative aspect of proteomics. Another study we undertook: we had the subcellular information, and Daniel Stekhoven, a postdoc in the lab at the time, went in and basically exploited the information he had from the cytoplasm, total membrane, inner membrane, outer membrane, and periplasm fractions. It's important to know that at the time we did not use harsh conditions to separate those proteins, so actually many of the cytoplasmic proteins were still attached to their inner membrane partners. We could see this very nicely when we looked at the spectral counts and asked: do we find proteins exclusively in the cytoplasm? There are very few; there are more in the total membrane fraction. But when we look at the ratio of the spectra observed in the total membrane over the cytoplasm, we can find certain proteins that are predominantly cytoplasmic, and when we looked at marker proteins, we found many of these marker proteins with very highly significant enrichment. We can do the same for the predominantly total-membrane proteins, and that way get an idea of where these proteins are preferentially located.
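The membrane-over-cytoplasm spectral-count ratio described here can be sketched as a simple per-protein calculation; the pseudocount and the cutoff of 4 below are illustrative choices, not the values used in the study:

```python
def predominant_localization(counts, ratio_cutoff=4.0):
    # counts: protein -> (total_membrane_spectra, cytoplasm_spectra)
    # A high membrane/cytoplasm spectral-count ratio suggests a
    # predominantly membrane-associated protein, a low one a
    # predominantly cytoplasmic protein.
    calls = {}
    for prot, (tm, cyto) in counts.items():
        ratio = (tm + 1) / (cyto + 1)  # pseudocount avoids division by zero
        if ratio >= ratio_cutoff:
            calls[prot] = "membrane"
        elif ratio <= 1 / ratio_cutoff:
            calls[prot] = "cytoplasm"
        else:
            calls[prot] = "ambiguous"
    return calls
```

A protein seen 40 times in the membrane fraction and once in the cytoplasm would be called membrane-associated, while the reverse pattern would be called cytoplasmic; intermediate ratios stay ambiguous, which mirrors the carry-over of cytoplasmic proteins into the membrane fraction mentioned above.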
And this is to show you, on this data, this very simple spectral-count proportion: the spectral-count proportion of total membrane over cytoplasm is high for the membrane proteins, of course, and the cytoplasmic markers are located down here; interestingly, the periplasmic markers are even lower, and these are the inner and outer membrane markers, which are much higher. You can actually even use these ratios to paint onto the proteins how closely they are associated with the membrane. So, another thing you may want to check out. What Daniel then went on to do is a principal component analysis and k-nearest-neighbor classification: first on the markers, then adding the predominantly localized proteins, he came up with a classification that was able to assign the predominant subcellular localization for 94% of the proteins in our dataset, including many for which PSORTb, one of the key predictors, does not make a prediction. And he came up with an experimentally derived outer membrane protein catalog. These are of course of great importance: if we think about the resurgence of infectious diseases, it's going to be more and more important, for Gram-negatives, to be able to identify the entire outer membrane proteome and to look at these proteins. That's what we described in that paper.
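The k-nearest-neighbor step, assigning a protein the majority localization of its nearest marker proteins in the fraction-profile space, can be sketched as below (plain Euclidean distance on spectral-count proportions; in the study this was done on principal components, and k is illustrative):

```python
import math

def knn_localize(markers, query_profile, k=3):
    # markers: list of (profile_vector, localization) for marker proteins,
    # where profile_vector holds per-fraction spectral-count proportions.
    # The query protein gets the majority localization among its
    # k nearest markers.
    dists = sorted(
        (math.dist(profile, query_profile), loc) for profile, loc in markers
    )
    top = [loc for _, loc in dists[:k]]
    return max(set(top), key=top.count)
```

A protein whose spectra fall almost entirely in the outer membrane fraction thus lands next to the outer membrane markers and inherits that call, even when a sequence-based predictor like PSORTb returns no prediction for it.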
Now, for the last part, I'm going to tell you about proteogenomics, basically using this information, the protein mass spectra, to inform the genome annotation. You've seen this picture before: in this peptide-spectrum matching process, you match the spectra against a protein database, and what you can find depends on what you add to this protein database. The most extensive database would be a six-frame translation of the genome, very computationally intense, and there are likely better ways. We basically used our Bartonella data set and started to look: we had relied on the NCBI reference genome annotation, and it has been described before that if you look at other reference genome annotations, Genoscope, the French reference genome annotation, CMR from the Craig Venter institute, and KEGG Genome, each with a different principle for predicting what is a protein-coding gene or not, you see that there are a lot of differences. Many of these differences are based on a different start position of the proteins, but not all of them; there are many ORFs that are only predicted by one annotation. This is a common thing that you see, and not just for Bartonella. Here is what I showed you about the start-codon position differences: RefSeq would call this position the start of the protein; Ensembl, Genoscope and the Craig Venter annotation say no, it starts here, another 11 amino acids downstream; KEGG Genome says no, it starts here, 27 amino acids further upstream. This is common for many proteins, and of course it would be nice if you could capture this information and use it to search your proteomics data against. This is why we devised what we think of as a novel, generic proteogenomics approach: in a first phase, bringing together the different reference genome annotations, and adding on top of them; so here we show you RefSeq, Ensembl, the ones
I've shown you before, and adding, on top of those, for the regions where there is no prediction, also in silico ORFs, where we can decide which length cutoff we want to take. We then create a minimally redundant protein database from these annotations, and we keep track, in an identifier, of which of the annotations are identical and which ones differ. Then we come with the experimental proteomics data, high-quality proteomics data with very stringent PSM-based FDR filtering, and we search this proteomics data against the database that we created. What we're working on is an expert-based system that helps us to process and prioritize this data, in terms of, for example: show me the novel ORF candidates best supported by the high-quality mass spectrometry data; what are the amino-terminal annotation differences (you've seen the start sites that often differ); and other questions, for example, what about pseudogenes: we also included some of the larger differences for pseudogenes, and we can look for that evidence. We can integrate this in IGV with the RNA-seq data that we have and these predictions. Just showing you what we came up with for Bartonella: for the roughly 1,500 RefSeq-predicted genes, we found evidence in our data sets, which covered two conditions. So we went quite deep, but it's only two conditions; ideally you want to do this with more conditions if you can, because then you have a much better chance of finding those ORFs that are likely expressed only under specific conditions. But for all of these different reference genome annotations, including the in silico ORFs, we found evidence for new ORFs. Just showing you one here, where the RNA-seq evidence, strand-specific RNA-seq, the red reads, shows that this short ORF is in fact expressed. And then we look at a plot where we plot the protein length of the proteins versus the spectral count, as a
measure, roughly, of their abundance, and show two positive controls: the transcription factor that is up-regulated and drives the expression of all these green ones, the VirB/D4 system I showed you before, so you get an idea that these are quite nicely expressed. Now look at all the red ones: these are selected new ORFs that we found with this method, and what you can see is, first of all, that there are even some very short ones that were missed by RefSeq; they can be small, something like 26 amino acids, and based on their length and expression they are not just spuriously expressed, so these may indeed have a function. Okay, so moving on. I'm just showing you now the integration of all this data onto the genome sequence. This is the reference sequence from NCBI for Bartonella, which was just one strain, 1,931 kilobase pairs. Looking at the forward strand, there is RNA-seq evidence that this region is expressed, and then we have several ORFs that we integrated into our database: there was an in silico ORF with no evidence for it; however, both KEGG Genome and MicroScope, the French Genoscope site, predicted novel ORFs for which we do find spectral evidence, and in this case the spectral evidence does point at which is the relevant ORF here. So this is a generic thing that we can do. I now want to switch gears a bit: one way of doing it was to integrate these reference genome annotations, but if you think about it in a broader way, the best way to do it is actually to sequence that genome, because, you know, somebody 10 years ago sequenced a strain, at who knows which passage, deposited it at NCBI, and it may have changed since. So we actually went in to sequence Bartonella henselae, this 2-megabase-pair genome. We thought, okay, cool, with PacBio technology coming along at that time, with 8,000-base-pair reads, it's going to be a piece of cake, a walk in the park, do it in
A month, we thought, and we would be done. Then we teamed up with Mark Robinson and realized that, no, it is not so easy. We have since developed a server, now in my group at Agroscope, that lets you visualize repeat sequences along a genome, and that is when we realized that Bartonella is not easy, not a walk in the park. This had been described by Sergey Koren, one of the key movers and shakers in this field. He plotted, for many genomes, the number of repeats longer than 500 base pairs that are about 90 to 95 percent identical against the maximum repeat length, and classified them. In the easy class I genomes, the longest repeat is the rDNA operon. The class II genomes have no repeats much longer than about five or six kb, but they have many of them, so they are very difficult; with short-read data alone, without PacBio, you cannot assemble them. And then there are the class III genomes, the difficult ones, which can have much larger repeats; Bartonella clocks in at something like 12 kb, and when we actually sequenced our strain the longest repeat turned out even larger, which is one of the differences. But clearly we are now moving into an era where we can sequence such genomes rather straightforwardly. So we developed a pipeline for de novo genome assembly; we do not do a reference-based one. HGAP turned out to be a good assembly algorithm for us; it has been described elsewhere, so I will not discuss it in detail. We do quality assessment of our assemblies, try to find the best genome draft using several metrics, then annotate it, for example with Prokka, and see what we get. Now we can take all the proteomics and RNA-seq data I showed you before and search it against, on one side, the NCBI RefSeq sequence, what we thought was the genome we were dealing with, 1,931 kilobase pairs, and on the other side our actual assembly, which we believe is the real sequence and is about 23 kilobases larger, so there are some insertions and some more larger repeats.
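The "find the best genome draft using several metrics" step can be made concrete with a toy scoring rule of my own; contig N50 and contig count are assumed here as the metrics, since the actual criteria were not specified in the talk.

```python
# Toy draft-assembly selection: score candidate assemblies by N50 and
# contig count and keep the best. A sketch, not the group's pipeline.

def n50(contig_lengths):
    """Smallest length L such that contigs of length >= L
    cover at least half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

def best_assembly(drafts):
    """drafts: dict mapping draft name -> list of contig lengths.
    Prefer high N50; break ties by fewer contigs."""
    return max(drafts, key=lambda d: (n50(drafts[d]), -len(drafts[d])))
```

In practice one would add further metrics (total size versus expected genome size, gene completeness, read-mapping rates), but the bookkeeping has this shape.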
For both we have the Prokka-predicted protein-coding ORFs, and then we can match our spectra, this huge data set, under very stringent control. We can match about 10,000 spectra more, about 4,000 peptides more, and about 13 proteins more against the assembly, so we can really use this to work with these de novo genomes. One thing we have also developed, which we think is quite helpful: if you have two close genomes, you can globally align the assembly to the reference and generate a virtual genome that contains essentially all the nucleotides from both, and then map our experimental data, our annotations, our reads, proteins, and peptides onto it for visualization. This helps us to identify, for example, SNPs. In this case you see the assembly here in green and the reference; because of the SNP, mapping back to the reference genome there are no spectral counts, no peptide spectrum matches at that position, while against the assembly we find that this is expressed and we can detect the peptide carrying the SNP. We can do the same with insertions: here the assembly has an insertion within an open reading frame, and again no peptides match when we use the reference sequence, which does not know about this insertion, whereas with our assembly, which we think is correct, we do see them. That of course has implications for using this in a clinical setting. Some more examples: a frameshift within a filamentous hemagglutinin on the outer membrane, where we have an insertion, or rather a deletion, in our assembly; no reads map there on the reference sequence, while on our assembly we get a read-through and longer reads mapping to it. Another example concerns a likely operon with three open reading frames in close vicinity.
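The virtual genome described above, a shared coordinate system containing all nucleotides of both the reference and the assembly, can be illustrated with a toy pairwise alignment. This is hypothetical code of my own, not the group's implementation; it assumes the global alignment (gapped strings) is already computed.

```python
# Toy "virtual genome" coordinates: every alignment column becomes one
# virtual position, so data mapped to either the reference or the
# assembly can be projected into the same coordinate system.

def coordinate_maps(ref_aln, asm_aln):
    """ref_aln, asm_aln: equal-length aligned strings with '-' gaps.
    Return (ref_map, asm_map): original 0-based position -> virtual column."""
    assert len(ref_aln) == len(asm_aln)
    ref_map, asm_map = {}, {}
    r = a = 0
    for col, (rb, ab) in enumerate(zip(ref_aln, asm_aln)):
        if rb != "-":          # this column consumes a reference base
            ref_map[r] = col
            r += 1
        if ab != "-":          # this column consumes an assembly base
            asm_map[a] = col
            a += 1
    return ref_map, asm_map
```

With an insertion in the assembly ("ACG-T" versus "ACGAT"), reference positions simply skip the inserted column, so reads and peptides from both sequences line up on the virtual axis.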
The latter two of these are nicely expressed under our settings, one with eight different peptides and 47 spectra, the other with five peptides and 23 spectra. The first one is not expressed, and potentially the cause is an insertion into that open reading frame; that is difficult to prove, but it could be one explanation. A last example: a very highly expressed translation elongation factor. In the wild type we see this high expression, but there is a SNP at this position; this is from an older version of the analysis, where we did not yet use the virtual genome, but the SNP is supported by the PacBio data. When we look at the real genome, our assembly, the entire protein sequence is covered by peptides, because this is a highly, highly expressed protein. So clearly we can sequence clinical genomes and use proteomics to find single nucleotide polymorphisms at the protein level. We have worked with different collaborators and applied this proteogenomics approach elsewhere: on an obligate symbiont, where we could show that some ORFs missed in the annotation, important for secondary metabolites, are actually highly expressed; and on Bradyrhizobium, where we were able to integrate transcription start site data, 5-prime-end transcription start sites. Going back to that example: this is a large genome, nine megabases, and for one gene there are two different transcription start sites, one of them internal, and we find unique peptide evidence for both of them, showing that there are protein isoforms starting within a predicted reading frame. We actually found a large number of ORFs compared to another study that had used this data without the internal transcription start sites.
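The isoform argument rests on peptides that are unique to one of the two start sites. A toy tryptic digest makes the set logic concrete; the assumptions (cleavage strictly after K/R, no missed cleavages, no proline rule) are my simplifications, not how a real search engine digests.

```python
# Which peptides prove the longer isoform? Those present in the
# full-length protein but absent from the isoform that starts at the
# internal transcription start site. Toy digest, illustrative only.

def tryptic_peptides(protein):
    """Naive tryptic digest: cleave after every K or R."""
    peptides, current = [], ""
    for aa in protein:
        current += aa
        if aa in "KR":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides

def isoform_unique_peptides(full, internal_start):
    """Peptides of the full-length isoform that cannot come from the
    isoform beginning at position internal_start."""
    short_isoform = set(tryptic_peptides(full[internal_start:]))
    return [p for p in tryptic_peptides(full) if p not in short_isoform]
```

Detecting any of the returned peptides supports the longer (upstream-start) isoform; peptides from the shared suffix cannot distinguish the two.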
Without them, that study found fewer; we had quite a lot more, both in terms of new open reading frames and new N-termini. So this is a generic approach, and so far, in most of the prokaryotes we have looked at, we can find evidence for such missed ORFs. Coming to the end and summarizing: I have told you that the proteomics field, like genomics, has moved from a discovery phase, where you basically catalog what is there, to a targeted proteomics phase, where you can generate quantitative data series with much higher sensitivity and completeness. We developed a generic strategy to reach complete condition-specific expressed proteomes in prokaryotes, where this is feasible with a manageable effort. Because we had it available, we also looked at subcellular localization, which often provides interesting insights and could lead to functional predictions for ORFs without any functional annotation. Clearly, we did not find expression evidence for all the proteins. Since we found a complete membrane proteome, we could now go in with targeted proteomics and look at the surfaceome, the entire surfaceome, under many, many different conditions, and use our software, PeptideRank, to predict the best peptides for such targeted assays. What is important to realize is that genomes, even today, are still annotation drafts, and that proteomics data can help a great deal in correcting this. I should also mention that the NCBI annotation we had relied on, those 1,488 ORFs, actually changed a couple of months ago: they deleted 104 ORFs and added another 140, using a new pipeline based on GeneMark. When we go in with our data, we can show that for 14 of the ORFs they removed we still have expression evidence, so they are likely expressed, and for some of the newly added ones we find evidence as well.
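Cross-checking such an annotation update against expression evidence, as with the 104 removed and 140 added ORFs here, is plain set arithmetic. A hypothetical sketch with made-up identifiers:

```python
# Reconcile two annotation releases with an expression-evidence set:
# which dropped ORFs are still supported by peptides or RNA-seq?
# Identifiers and structure are illustrative assumptions.

def reannotation_report(old_orfs, new_orfs, expressed):
    """old_orfs, new_orfs, expressed: sets of ORF identifiers."""
    removed = old_orfs - new_orfs
    added = new_orfs - old_orfs
    return {
        "removed": removed,
        "added": added,
        # removed by the new pipeline, yet with expression evidence:
        "removed_but_expressed": removed & expressed,
    }
```

The "removed_but_expressed" bucket is exactly the class of 14 ORFs the speaker argues should not have been dropped.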
I'm just saying: this is really a moving field, and doing it de novo, with your own sequence, with your strain from your lab, is going to give you the best possible setting; and clearly this has implications for clinical proteomics. Then, most importantly, many people were involved who did all this work. I want to point out Ulrich Omasits, who was the key mastermind behind the Genome Research study and the subcellular localization work, and Daniel Stekhoven, a statistician from ETH, who helped us a lot by bringing statistical rigor to these analyses. We are grateful to Christoph Dehio and his student Maxime for sending us the protein extracts and the total RNA-seq data; to Mark Robinson, our collaborator on the PacBio genome assembly; and to Olga, who did the analysis of the first assembly efforts, and indeed we went through several rounds and only lately, with the latest long reads, did we succeed. The Functional Genomics Center prepared the protein fractionation from the extracts, measured the proteomics, and sequenced the DNA on the PacBio; thanks for the access and support. Then the group at Agroscope: a postdoc, a PhD student who has been with us for three years, a civil service worker, and a former diploma student. We are also very grateful for the great collaboration with Bernd at ETH, and we are now moving into a collaboration with Beat to look at orthogonal mutagenesis data, where he finds genome regions that are essential but where no ORF is predicted, and we check whether we find evidence for those. That would bring this to another level and show that it is not only new ORFs; there are some essential ORFs that we did not predict. There are other people we collaborate with, and the funding agencies. With that, I thank you a lot for your attention, and I hope you could follow what I was telling
you about. I am glad to take any questions you may have. Thanks.