So welcome all to this SIB Virtual Computational Biology Seminar Series. Today we have the pleasure to host Markus Müller from the Proteome Informatics Group, which is part of the University of Geneva and the SIB. Markus grew up in Zurich, where he studied physics at ETH Zurich, and after some time at the Institute of Theoretical Physics at ETH as a researcher and teaching assistant, he moved to industry as a software engineer for several years, working on data streaming, databases, image processing and code reviews. Then Markus moved back to academia and got his PhD in bioinformatics at the University of Geneva, in Ron Appel's group, in 2003, and he worked the following four years as a group leader in Ruedi Aebersold's Institute of Molecular Systems Biology at ETH. Since then, Markus is back at the University of Geneva, and the focus of his work has shifted towards the usage of large spectral libraries to improve protein identification and towards statistical modeling and validation of proteomics data. The Proteome Informatics Group, in which Markus is now a senior scientist, is involved in developing databases and software for the benefit of the proteomics and glycomics community. These resources are made available through the ExPASy portal. So today Markus will share with us his work on mining large-scale proteomics data for protein modifications. Markus, thank you again for accepting our invitation, and the floor is yours. Thank you very much, Diana. Protein modifications are central to most cellular mechanisms. To illustrate that, I'll show you this picture here from a publication by DeHart et al. in MCP in 2014. What they did is they took cells that express p53 at high levels but don't die. They extracted p53 from these cells and isolated it, and then they used a combination of open modification and variable modification searches to find as many modifications as possible on this p53 protein.
As you can see, you can see two things. First, this was the state up here before the publication, and this is after the publication, so there is a lot to be discovered on the PTM level. They more than doubled the number of PTMs known for this p53 protein, which is a very well researched protein. On the other hand, if you have so many modifications, you can imagine the astronomical number of combinations you get, and it is a difficult task to actually find out which of these combinations are present in the cell, which of these combinations are functional, and which function they perform. So this is a little illustration here. These PTMs can be seen as devices that sense the environment of a protein. You have proteins like kinases that can write on these devices — they can write the record. You have proteins like phosphatases that can actually erase the record again. There is crosstalk between these modifications: for example, this modification here can only be put on this specific site if the other modification is present. Then you also have competition between several writers that try to write on the same modification site, and the writer that is more active has preference and will be able to write the modification. As a result of all this, you get a code, the so-called PTM code. One talks a lot about the histone code, but there are other such codes as well. And that code actually represents, somehow, the state of the cell and defines the function of the protein. There is a large number of functions one can implement with this type of PTM mechanism. You can implement switches — very robust switches, which are for example used to start cell division. You can implement amplifiers, you can implement signal integrators that integrate different signals at the same time or over different time points. Mass spectrometry is, I guess, the method of choice to find these post-translational modifications.
I don't think there's any other method that can do that in an unbiased way on a large scale. So very briefly, in this standard shotgun mass spectrometry approach, what you do is you extract the proteins from a cell lysate, you digest them, you separate the peptides on an LC column, you inject them into a mass spec, and the most intense signals are then chosen for fragmentation. You get thousands of these peptide fragmentation fingerprints, which are then used to identify the peptide in a sequence database. Now if you have a modification on this peptide here — for example, a trimethylation sitting on this arginine — all the fragment ion peaks that contain the arginine will shift by the mass of the trimethylation, and all the fragment ions that do not contain the arginine will remain the same. With that, you can detect the modification and you can pinpoint it with quite high accuracy on the right residue in your peptide. This is somehow a motivation for what we're doing. This is a publication from Steve Gygi's lab. It came out just briefly before our own publication. They analyzed more than a million spectra from a cell line and concluded that at least one third of the unassigned spectra arise from peptides with substoichiometric modifications. So there are a lot of modifications present in these MS/MS spectra which usually go undetected. They used a so-called open modification search to achieve that. A different, but also very interesting, approach — and this is becoming more and more common — is to reuse or reanalyze existing proteomics data. For example, in this publication they make use of the fact that in a phosphopeptide enrichment study, where you enrich for phosphopeptides, you usually do not only enrich phosphopeptides, but also other molecules that contain a phosphate group.
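The fragment mass-shift logic just described can be sketched in a few lines. This is a purely illustrative toy — the peptide, the +42 Da shift, and the restriction to singly charged b/y ions are my own choices, not the speaker's data — showing that a modification on one residue shifts exactly those fragment ions that contain it.

```python
# Monoisotopic residue masses (a subset) and b/y fragment ions only.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
           "D": 115.02694, "K": 128.09496, "E": 129.04259, "M": 131.04049,
           "F": 147.06841, "R": 156.10111, "Y": 163.06333}
PROTON, WATER = 1.00728, 18.01056

def b_y_ions(peptide, mod_pos=None, mod_mass=0.0):
    """Singly charged b/y ion m/z values; a modification at mod_pos (0-based)
    shifts every fragment that contains that residue."""
    n = len(peptide)
    mods = [mod_mass if i == mod_pos else 0.0 for i in range(n)]
    b, running = [], 0.0
    for i in range(n - 1):                      # b1 .. b(n-1), from the N-terminus
        running += RESIDUE[peptide[i]] + mods[i]
        b.append(running + PROTON)
    y, running = [], WATER
    for i in range(n - 1, 0, -1):               # y1 .. y(n-1), from the C-terminus
        running += RESIDUE[peptide[i]] + mods[i]
        y.append(running + PROTON)
    return b, y

# Put a 42.01057 Da shift (acetylation-sized, purely illustrative) on the
# C-terminal arginine: b ions never contain it, y ions always do.
b0, y0 = b_y_ions("TASGYR")
b1, y1 = b_y_ions("TASGYR", mod_pos=5, mod_mass=42.01057)
print([round(m1 - m0, 5) for m0, m1 in zip(b0, b1)])   # all 0.0
print([round(m1 - m0, 5) for m0, m1 in zip(y0, y1)])   # all 42.01057
```

This is exactly the pattern used to pinpoint a modification: the unshifted peaks anchor the identification, the shifted subset localizes the modified residue.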
In this case it was ADP-ribosylation, and you can actually use this already measured, existing data to pull out a lot of these ADP-ribosylation sites. This was a poorly characterized modification, and they could largely increase the knowledge about it just by reanalyzing this data. By the way, most people who do not use open modification searches or similar things do not extensively look for additional modifications. The reason is simple: if you look for these modifications, you're in the situation of the picture here on the lower right. It is kind of difficult to do, it's a slow process, and there's sometimes a lot of manual work involved. So why is it a slow process? This has to do with the search space. Let's assume you have human proteins, about 20,000 proteins. You have your FASTA file, which you use as a search space, and then you digest the proteins in silico and you get a list of peptides. In that case, it's about 1.3 million peptides with a mass in a certain mass range. However, if you allow modifications — let's assume we want to look for phosphorylation on serine or threonine and for oxidation on methionine, and we allow only two modifications per peptide — then each serine has to be counted as either modified or not, and each methionine has to be considered as either modified or not. So the number of peptides in our search space goes up to almost 5 million. And this is only for two modifications. If you have more modifications, it gets more complex.
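The combinatorial blow-up above is easy to reproduce. A minimal sketch (the example peptide is invented; real search engines count the actual mass-range-filtered variants across the whole digest):

```python
from math import comb

def n_mod_states(peptide, max_mods=2, sites="STM"):
    """Count modification states of one peptide, assuming each S/T may carry a
    phosphorylation and each M an oxidation, with at most max_mods modifications
    in total. Each subset of modified sites is one extra search-space entry."""
    s = sum(peptide.count(aa) for aa in sites)
    return sum(comb(s, k) for k in range(max_mods + 1))

# A peptide with 6 modifiable sites already explodes into 1 + 6 + 15 = 22 variants:
print(n_mod_states("MSTESTPEPTIDE"))  # 22
print(n_mod_states("PEPTIDE"))        # 2 (one T: modified or not)
```

Multiplied over ~1.3 million tryptic peptides, this per-peptide factor is what pushes the search space towards 5 million entries for just two variable modifications.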
If you don't know at all which modifications are present in your sample, then you might have the idea to just take a database like Unimod, which lists all the potential modifications, and look for all of those. But then it becomes prohibitive: the search time will be too slow, and you also have to deal with false positives. Another effect — this here is from a modification search with six variable modifications that was run at a global FDR of 1% — is that the peptides with one modification already have a higher error of about 2.5%, and the peptides with two modifications have an error of about 8%. So the more modifications you include, the more error you include as well. Now, instead of configuring the modifications you want to look for prior to your search, as is usually the case, in our lab we use a different approach, an open modification search approach, where we actually read the modifications directly from the sample. So we don't need any prior knowledge about which modifications are present; we just use the spectra that are given to us. We have a rather special approach here, which is used in some other labs as well, but was mainly developed in our own lab, and I'll give some justification for this approach later. For the moment I'd just like to describe it: it is based on a spectrum library approach. A spectrum library is an assembly of spectra that were identified with high significance. So we take a sample, we run a standard identification tool with no or just a few modifications — we identify the spectra with, for example, MaxQuant or Mascot or whatever. Then we take the identified PSMs and spectra and merge them together, so we take all the spectra that match to the same peptide — that is, the same peptide with the same modification state and the same charge state. And then for this peptide we calculate a consensus spectrum, which usually has better quality than an individual spectrum.
We put aside the non-identified spectra, and then we build this spectrum library — I'll give more detail on this. Since you want to control the error rate in this approach, we calculate decoy spectra as well, and then we use our tools: Liberator builds the spectrum library, DeLiberator builds the decoy spectra, and MzMod does the open modification search on this spectrum library using the unidentified spectra. These open modification searches are based on spectral alignments, and these spectral alignments actually allow you to extract the modifications directly from your data. So how do they work? They're actually very simple. Assume you have a query spectrum with a given mass and you have a mass tolerance. You assume that the modifications you are looking for are smaller than a certain value delta M, which is usually in the range of 100 to maybe 500 Da — most modifications fall in this mass range. Then you have your query spectrum and you have a database, which contains either sequences or spectra from the spectrum library, and you extract all the spectra that are within this mass tolerance. For each of these extracted spectra you check whether a modification could explain the mass difference between the query spectrum and the library spectrum. How do you do that? It's actually quite simple. You just take the two precursor masses — the query spectrum has a precursor mass of 772, the library spectrum of 692, so the difference is 80 Da. Then you assume that this 80 Da is due to a modification. So you try to put the modification on each of the residues in the peptide and you shift the fragment ions accordingly. So you do that.
And if you find a residue where these shifted ions give a significantly larger overlap with the query spectrum, then you assume that you found the modification, because the modification explains the difference between these two spectra, and you assume that this 80 Da on tyrosine is actually present in your sample — without any prior knowledge, just by comparing spectra. The alignment we do is of course slightly different than that — I mean, we worked a lot on the performance to speed these things up — but in principle this is what we're doing. Then we also spent quite some time on trying to create decoy spectra that work, because it's not that easy to create a decoy spectrum: it still has to be a spectrum, it still has to contain peptide fragments somehow. Also, most of the peaks are actually not matched to peptide fragments, but they also have a structure, and you have to try to retain this structure as well. So what you do is you try to create a random, or randomized, spectrum that is still a spectrum. Additionally, what is often the case is that if you randomize too little, the randomized spectrum stays quite similar to the original spectrum, so you have to make sure that the similarity to the original spectrum is low enough. You have to be really careful how you create these decoy spectra, but I guess we found something that more or less works. We also regularly check, when we do these searches, whether our decoy calculation is actually right. We do this in the following way. We have our standard sequence search, for example X!Tandem. We search for a couple of modifications which are quite abundant, and some of them are then also picked up by the open modification search in MzMod. Both the X!Tandem search and the MzMod search are run at a false discovery rate of 1%, so there's about 1% error in each of these results.
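The shift-and-rescore idea can be sketched as follows. This is a deliberately crude toy, not the real MzMod alignment: it pretends the sorted library peaks form a pure singly charged y-ion ladder, so "fragments containing residue i" are simply the peaks from index i onward, and it scores by counting matched peaks within a tolerance.

```python
def count_matches(query_peaks, lib_peaks, tol=0.02):
    """Number of library peaks matched by some query peak within tol (m/z)."""
    return sum(any(abs(q - p) <= tol for q in query_peaks) for p in lib_peaks)

def best_mod_site(query_peaks, query_prec, lib_peaks, lib_prec, peptide_len):
    """Place the precursor mass difference on each residue position in turn and
    keep the placement with the best peak overlap. Site -1 means 'no shift'.
    Crude simplification: sorted library peaks are treated as a y-ion ladder."""
    delta = query_prec - lib_prec
    ladder = sorted(lib_peaks)
    best_site, best_score = -1, count_matches(query_peaks, ladder)
    for site in range(peptide_len):
        shifted = [p + delta if i >= site else p   # fragments containing the
                   for i, p in enumerate(ladder)]  # candidate site get shifted
        score = count_matches(query_peaks, shifted)
        if score > best_score:
            best_site, best_score = site, score
    return best_site, best_score

# An 80 Da precursor difference, best explained by shifting from ladder index 3 on:
lib = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
query = [100.0, 200.0, 300.0, 480.0, 580.0, 680.0]
print(best_mod_site(query, 772.0, lib, 692.0, 6))   # (3, 6)
```

The real tool additionally handles b ions, charge states, noise peaks and proper probabilistic scoring, but the core move — try the mass delta on every residue, keep the placement that best explains the query spectrum — is the same.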
Now, in the overlap, some of the spectra will be identified by both X!Tandem and MzMod, and the disagreement between the two in this overlap region cannot be larger than 2%, because maximally 1% will be wrong here and 1% will be wrong there. If it's larger than 2%, something is wrong. You don't know whether the FDR calculation of X!Tandem is wrong or the FDR calculation of MzMod is wrong, but you know something is wrong. This is the way we check whether our FDR calculations are more or less okay, and they seem to be more or less okay. But we also had cases — for example, we used this MODa tool with a reversed sequence database, and the reversed sequence database just doesn't work for this type of modification search: it gave FDR values that were quite drastically wrong. So we did a lot of these types of open modification search analyses. We tried to identify modifications introduced by sample processing protocols, we tried to identify modifications introduced by certain chemicals, and most of the time these projects were actually quite successful. What we tried to do in this work here is this type of open modification search on a very large scale. So we took a dataset that is very heterogeneous. This is from the quite famous publication from Pandey's lab, where they tried to obtain a fairly complete map of the human proteome; this was published in Nature. They took a lot of tissues — 17 adult tissues, seven fetal tissues and some blood cells — extracted the proteins, separated them by SDS-PAGE, digested them with trypsin, separated the tryptic peptides by reversed phase, and acquired the MS/MS spectra on an HCD Orbitrap, and this resulted in about 25 million spectra. Okay, this is a fairly large dataset. At the time this was about one tenth of the spectra that were available in public repositories.
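The consistency argument is simple arithmetic and can be turned into an automatic sanity check. A minimal sketch (function names and the record layout are invented for illustration):

```python
def max_disagreement(fdr_a, fdr_b):
    """Upper bound on the fraction of conflicting identifications among spectra
    identified by both tools: at worst, every error of tool A and every error
    of tool B lands in the overlap and disagrees."""
    return fdr_a + fdr_b

def check_fdr_consistency(overlap_pairs, fdr_a=0.01, fdr_b=0.01):
    """overlap_pairs: list of (peptide_from_A, peptide_from_B) for spectra
    identified by both tools. Returns (is_consistent, observed_disagreement);
    exceeding the bound flags at least one FDR estimate as suspicious."""
    disagree = sum(a != b for a, b in overlap_pairs) / len(overlap_pairs)
    return disagree <= max_disagreement(fdr_a, fdr_b), disagree

# 1 conflicting call in 100 shared spectra: within the 2% bound for two 1% FDRs.
ok, rate = check_fdr_consistency([("PEPA", "PEPA")] * 99 + [("PEPA", "PEPB")])
print(ok, rate)   # True 0.01
```

Note that this test can only expose an FDR that is too optimistic; it cannot tell you which of the two tools is at fault.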
Also interesting is that the dataset is quite heterogeneous, so it's not only one cell line or one tissue, it's an assembly of different cell lines and different tissues. And we wanted to find out whether open modification searches actually work and whether they're feasible for this type of data. The first thing we did is just check how long it actually takes to search this data. We used X!Tandem, a tool we like very much — it is fast, it is reliable and you can configure many things with it. So we ran X!Tandem on the Vital-IT cluster, and without modifications this took about 40 hours. With six variable modifications this took about 240 hours. So without modifications this is feasible — you can almost do it on your own laptop. With modifications it becomes a bit more difficult, and especially if you have many modifications you need a large cluster. For the open modification search we tried this MODa tool; that needed a bit longer, about 1,700 hours, to run on these 25 million spectra. At the time we had QuickMod, the tool we had developed before MzMod, and we quickly saw that QuickMod didn't scale — it was difficult to deploy on the cluster — and since we like developing software, this led us to the development of a new software tool, MzMod. This was not yet the fully optimized version here, but it performed quite well; it had about the performance of MODa. Another difficulty that you have if you analyze large data — in this Pandey dataset we had about 2,200 LC-MS runs — is that if you combine these different runs together, they are statistically not independent, and if you calculate an FDR of 1% for each run and combine the result lists of the 2,200 runs, what you get at the end is an FDR of maybe 20 or 30%.
This is just because the true hits come from a fairly small space — maybe 100,000 peptides — and they saturate fairly quickly; after maybe 20 runs or so they start saturating, and you can do more and more experiments but you won't find more and more true hits. The wrong hits behave completely differently, because — especially if you allow modifications — most of the peptides in your search space will never match a real peptide in your sample; they're just there in the search space. And if you randomly pick from these random hits, they won't saturate quickly, because the wrong-hit search space is much larger than that of the true hits. So your true hits will saturate, your wrong hits won't, and your false discovery rate will grow. Okay, that's a standard thing in proteomics, something you encounter all the time, and you have to be careful dealing with it. Another thing — this is the statistical variation — is that if you calculate the score threshold at 1% FDR for each run separately, you see in this threshold histogram here a large variation of these thresholds, which introduces non-reproducibility and additional noise in your data. However, if you do that for all the data together, you get a very clear picture and a more precise threshold. So it makes sense, at least in my view, to analyze all the data in one go, and for that you need software. Now, fortunately we like Scala and we like Java, and fortunately there is now this Apache Spark framework, which was recently developed. It actually came out of the thesis work of Matei Zaharia — and I think it's incredible, indeed, for a thesis — so I highly recommend reading the thesis if you want an introduction into Spark and the satellite projects around Spark. Now it's part of the Apache Software Foundation, so it is developed by many developers.
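The saturation argument can be made concrete with a toy model. All the numbers below are invented (a small shared true-peptide space, fresh wrong hits every run), but they reproduce the qualitative effect: per-run 1% FDR lists, naively concatenated, drift towards a much larger combined FDR.

```python
def combined_fdr(n_runs, true_space=1000, per_run_true=800, per_run_fdr=0.01):
    """Toy model: every run re-identifies true peptides drawn from a small,
    shared space (unique true hits saturate), while its ~1% wrong hits are
    essentially fresh every time (they keep accumulating)."""
    per_run_false = per_run_fdr / (1 - per_run_fdr) * per_run_true
    unique_true = min(true_space, per_run_true * n_runs)   # saturates
    total_false = per_run_false * n_runs                   # never saturates
    return total_false / (total_false + unique_true)

# FDR climbs from 1% for a single run towards tens of percent for many runs:
for n in (1, 10, 100):
    print(n, round(combined_fdr(n), 3))
```

The saturation here is deliberately crude (a hard `min`); the real effect is smoother, but the direction is the same, which is why computing one global threshold over all runs is preferable.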
It has several objectives. It is easy to use: it allows writing parallel and distributed code with little overhead, as you will see. It runs on the Java virtual machine, so you can write your code in Scala, Java or Python. It has easy IO: you can easily read Hadoop files and Hadoop key-value files, you can read HBase tables, which are very large distributed tables, you have an SQL interface that allows you to extract data from databases, and you can read things from NoSQL databases as well. Its big advantage compared to older frameworks like Hadoop MapReduce is probably that it is fast — and it is fast because it allows caching your data in memory, so you can distribute your data on different nodes and keep it in memory. It also has fast serialization, and it has a mechanism to avoid stragglers — these are very slow processes that sometimes occur on a very large cluster — so you can just work around them and finish the job quickly. It is safe: if a cluster node fails where a part of your job runs, it matters — it slows down your process — but it doesn't kill your whole job; Spark has a system to restart this part, as we will see. And it's very configurable: you can configure your data persistence, your serialization, your data partitioning and so on. The main data structure in Spark is the so-called resilient distributed dataset, or RDD. This is a read-only collection of immutable objects — you can only read it, you can't change the objects that are in the RDD — and this collection is partitioned across a cluster of machines. RDDs can be built mainly in the following ways: you can take a Java collection and parallelize it, turning it into an RDD; you can read an RDD directly from stable storage — from your disk space, maybe a Hadoop file system or a local disk, or from a database; or you can build a new RDD by transforming an old RDD.
Spark offers quite a large set of possible transformations: there are map and reduce, you can sample from an RDD, you can join different RDDs, you can filter an RDD, you can build the union of two RDDs or the intersection of two RDDs, and so on. These are all offered by Spark, and it also offers some actions. The transformations are lazy — they are actually only executed after you call an action — and these actions collect the data from the distributed RDD, from the distributed machines, and bring it back to the master node. Spark has fault tolerance, which is very important if you work on large clusters: if one of the cluster nodes fails, Spark keeps the transformation graph describing how the partition on that node was built, and it will just rebuild this partition from scratch and deploy it on a new node. It also has two types of global variables: broadcasts and accumulators. Broadcasts are read-only variables that you distribute to the slave nodes, and accumulators are variables that the slaves can only write to, or add things to, and that can only be read by the master. This is a maybe more visual form of this RDD: you have your RDD here, you read it in, you partition it, and each partition is sent to a slave node together with the global variables, for example a broadcast variable. Then on each node you perform the transformations. In the easy case it's a map transformation, which only uses data from the same node; in a more complicated case — for example, global sorting — the nodes need to exchange data, which obviously slows down the whole process. And at the end you have an action, and this action will collect all the data, or some of the data, and transfer it back to the master.
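The lazy-transformation / eager-action split can be illustrated without a cluster. The toy class below mimics the programming model in pure Python — it is not the Spark API (no partitioning, no fault tolerance), just a demonstration that transformations only build a pipeline while an action runs it:

```python
import itertools

class ToyRDD:
    """Pure-Python stand-in for an RDD, for illustration only."""

    def __init__(self, source):
        self._source = source                    # zero-arg function yielding items

    @classmethod
    def parallelize(cls, data):
        return cls(lambda: iter(data))

    def map(self, f):                            # transformation: nothing runs yet
        src = self._source
        return ToyRDD(lambda: (f(x) for x in src()))

    def filter(self, pred):                      # transformation
        src = self._source
        return ToyRDD(lambda: (x for x in src() if pred(x)))

    def union(self, other):                      # transformation
        a, b = self._source, other._source
        return ToyRDD(lambda: itertools.chain(a(), b()))

    def collect(self):                           # action: executes the pipeline
        return list(self._source())

# Nothing is computed until collect() is called:
squares = ToyRDD.parallelize(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(squares.collect())   # [0, 4, 16, 36, 64]
```

In real Spark the same chain looks almost identical (`sc.parallelize(...).map(...).filter(...).collect()`), with the crucial difference that each partition of the pipeline runs on a different node.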
So this is Spark, and we used it to build our tools. The first tool is the library creation tool, Liberator. We store our spectra and our peptide-spectrum matches in Hadoop files, which can be read by Spark and turned into a spectrum RDD and a PSM RDD directly. Then we join the two RDDs — you can do that with this Spark operation here — so each spectrum is joined to its peptide-spectrum match. Then we put aside, with a filter operation, the unidentified spectra and keep the identified spectra. We group them, so we put all spectra together that belong to the same peptide — same peptide meaning same sequence, same modification state, same charge state — and then we use this map function here to calculate the consensus spectra and also the decoy spectra. We put a lot of effort into just this part here — I can't go into details, but most of the work actually went in here — and Spark gives you a nice framework to parallelize this code. Then we take the non-identified spectra, as described earlier, and use MzMod to do the open modification search. This is not the real MzMod code — the real MzMod code is actually much more complicated than this — but you have a little bit of initialization, you read your spectrum library, you distribute the spectrum library to the nodes with this broadcast, for example. Then you do the actual open modification search: you read your query spectra, you filter them — for example, you keep only spectra with a charge larger than one and smaller than five — then in this map function you do the actual open modification search against the spectrum library, and the results of this open modification search are stored in this RDD. You can choose the persistence of this RDD — you can say whether you want to keep it in memory and on disk, for example, for later use. So Spark is really — I mean, that's about all you have to do, and it's really an easy framework.
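The Liberator dataflow (join, split identified/unidentified, group, build consensus) can be re-created with plain Python containers. Everything below is invented for illustration — record layouts, peak values, and the naive consensus merge — the real consensus building is far more elaborate:

```python
from collections import defaultdict
from statistics import mean

spectra = {  # spectrum_id -> list of (mz, intensity) peaks
    "s1": [(100.0, 10.0), (200.0, 20.0)],
    "s2": [(100.1, 30.0), (200.1, 40.0)],
    "s3": [(150.0, 5.0)],                       # no PSM -> stays unidentified
}
psms = {  # spectrum_id -> (peptide, modification state, charge)
    "s1": ("PEPTIDE", "none", 2),
    "s2": ("PEPTIDE", "none", 2),
}

# 'join' then 'filter': split into identified and unidentified spectra.
identified = {sid: (spectra[sid], psms[sid]) for sid in spectra if sid in psms}
unidentified = {sid: spectra[sid] for sid in spectra if sid not in psms}

# 'groupBy': same sequence, same modification state, same charge state.
groups = defaultdict(list)
for sid, (peaks, key) in identified.items():
    groups[key].append(peaks)

def consensus(spectra_list, tol=0.5):
    """Merge peaks agreeing within tol across replicates; average m/z and intensity."""
    peaks = sorted(p for s in spectra_list for p in s)
    merged, cluster = [], [peaks[0]]
    for p in peaks[1:]:
        if p[0] - cluster[-1][0] <= tol:
            cluster.append(p)
        else:
            merged.append((mean(x for x, _ in cluster), mean(y for _, y in cluster)))
            cluster = [p]
    merged.append((mean(x for x, _ in cluster), mean(y for _, y in cluster)))
    return merged

# 'map': one consensus spectrum per group -> the spectrum library.
library = {key: consensus(sl) for key, sl in groups.items()}
print(library[("PEPTIDE", "none", 2)])   # two averaged consensus peaks
```

In the Spark version, each of these dict comprehensions corresponds to a distributed `join`, `filter`, `groupByKey` or `map` over RDDs, so the same logic scales to millions of spectra.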
All you need to do is make sure that these things — this broadcast and this object here — can be sent over the network, so you have to make sure that they are serializable. Obviously, all the intelligence of the open modification search is within this object here, and you have to implement that intelligence as well. This is an open modification search like it would run on one single node, and using Spark you can just deploy this open modification search over the whole cluster. So these are some results — these are just some tests we did — and we also worked a little bit on the scoring. We used the classical scores that are used in proteomics, with no big wonders here. We used a kind of normalized dot product score to evaluate the correlation between the query spectrum and the library spectrum; then we have a positioning score that evaluates how well the alignment works, taking the sequence information into account; and we have a score that evaluates how well the aligned spectrum actually covers the sequence of the peptide. All these scores are essential; they improve the results. Here you have the number of spectra sorted by their score, and here you have the number of wrong hits — the flatter this curve is, the better: the smaller the number of wrong hits for a certain number of identified spectra. We also have a delta score, which is also quite well known in proteomics. It is like an A-score, a score that evaluates how well the best position distinguishes itself from the runner-up positions, and you can see that by adding these scores we can improve the whole performance of MzMod and actually get more correct hits. One thing that is kind of the curse of these open modification searches are the ambiguous hits. Each time you do these searches you stumble over them. There are actually things you can do to avoid them, but most of the tools don't do that, because these tools are somehow blind.
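Two of the scores just mentioned are easy to write down. A minimal sketch (the vectors here are toy intensity lists, assumed already binned to common m/z positions; the real scores are more involved):

```python
from math import sqrt

def normalized_dot(a, b):
    """Cosine similarity of two aligned intensity vectors: 1.0 for identical
    spectra, 0.0 for spectra with no shared peaks."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def delta_score(site_scores):
    """How much the best modification position stands out from the runner-up;
    a large delta means the site localization is unambiguous."""
    top = sorted(site_scores, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]

print(round(normalized_dot([1, 0, 2], [1, 0, 2]), 3))   # 1.0 -> identical spectra
print(round(delta_score([0.9, 0.5, 0.4]), 3))           # 0.4 -> confident site
```

Combining a spectrum-similarity score with a localization delta score is what flattens the wrong-hit curve: the first rejects bad spectrum matches, the second rejects hits where the modification could sit almost anywhere.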
Now let's assume in your search space you have two peptides, this one and this one here, and they only differ in these two positions — an aspartic acid here versus a glutamic acid there. So the two peptides have almost the same mass: they have a mass difference of 14 Da. Now, for an open modification search tool, which is kind of a blind tool, it is somehow the same thing whether you identify this peptide here, or whether you add a methylation on the lysine here and identify that one. Apart from that one peak that matches here, there's no difference. The open modification search tool is blind: it doesn't know what is for us somehow clear — okay, this one is more likely; maybe this one also belongs to a protein which is more abundant than that one, which makes things clear — but for the open modification search tool these things are almost equivalent and score about the same. And this is actually where spectrum libraries help, as can be seen in this example here. We have again some ambiguous hits. We have the true hit here: this is a peptide with an oxidized methionine on myosin. MzMod also found this oxidized peptide here, because the non-oxidized version is present in the spectrum library. And MODa found this peptide here, which is very, very similar: instead of a methionine it has a lysine, and then it has to correct for the mass shift, so it just adds a mass difference of about 34 Da. For MODa this is almost the same thing — again, it doesn't know that a mass shift of about 34 Da on lysine does not really exist; it just looks for a mass shift and finds this. And this is wrong in this case, and MzMod was saved because this peptide here was not in the spectrum library, so it just did not even consider it. The reason why MzMod often works better than MODa is that a modification is much more likely to be real if the non-modified peptide is also present — as simple as that.
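The "library-backed hits are more plausible" heuristic could be expressed as a ranking rule. This sketch is entirely my own (the peptides, the small known-modification table, and the 2:1 weighting are invented), not the actual MzMod scoring:

```python
def prefer_library_backed(candidates, library_peptides, known_mod_masses, tol=0.01):
    """Rank ambiguous open-search hits: prefer a hit whose unmodified peptide is
    already in the spectrum library, then one whose mass shift matches a known
    modification. Weights (2 for library support, 1 for known shift) are
    invented for this sketch."""
    def plausibility(hit):
        peptide, shift = hit
        in_library = peptide in library_peptides
        known_shift = any(abs(shift - m) <= tol for m in known_mod_masses)
        return 2 * in_library + known_shift
    return max(candidates, key=plausibility)

known_mods = [15.9949, 42.0106, 79.9663]   # oxidation, acetylation, phosphorylation
library = {"LMNPEPK"}                      # unmodified peptide already identified
hits = [("LMNPEPK", 15.9949),              # oxidation on a library-backed peptide
        ("LKNPEPK", 33.99)]                # exotic shift, peptide not in the library
print(prefer_library_backed(hits, library, known_mods))
```

This captures the talk's argument in code form: between two near-equivalent explanations of the same spectrum, the one supported by an already-observed unmodified peptide and a chemically sensible mass shift should win.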
I mean, there are some differences in scoring and all that as well, but I guess one of the main differences is due to that. It doesn't always work: for example, if the wrong peptide is also in the spectrum library, MzMod will also pick the wrong one if that gives the higher score. But in many cases you are in the opposite situation, where MzMod picks the right one. So this is the summary of this comparison between MODa and MzMod. The MODa curve here is in red and the MzMod curve in blue, and the flatter these curves are, the better. If you had a means to identify these ambiguous hits and to pick the right hit among them, the performance of MODa would improve quite largely, and so would the performance of MzMod. But right now we don't have a criterion to identify the right hit among the ambiguous ones, and I think we should work on that rather than on improving other things. Can open modification searches replace standard sequence searches? The short answer is no. If you compare, for a specific modification, how many modified peptides are picked up by X!Tandem — if you configure that modification as a variable modification — with MzMod, for example, you see that in all cases X!Tandem finds more hits than MzMod and MODa. That's just because X!Tandem has an easier task: it only has one modification to search for, whereas MzMod or MODa have the whole space of allowed modifications, and that creates a lot of false positives and requires more restrictive thresholds. So if you know the modifications you want to search for, you don't need an open modification search: you configure X!Tandem, you run X!Tandem, and you're done.
However, the open modification search tools make sense if you don't know the modifications, or if you have too many of them — then they become useful, and we think that MzMod is quite a good tool for that; in general it gives you more identifications than MODa on this very large dataset. So this is kind of a summary of the modifications that were picked up by all three tools. You see that about two million spectra could be matched to a modification. X!Tandem searched for the standard modifications like phosphorylation, acetylation and so on, while MODa and MzMod looked for all modifications. Each tool finds modifications the other ones don't find; all the tools were run at an FDR of one percent. It's about two million spectra, so quite a bit — at the end about 30 percent of your PSMs are modified PSMs, so that's significant. However, if you look at these modifications — if you do the modification mass histograms, where you just count how many times each modification mass shift was detected — you will see that the vast majority of these modifications are chemical. These are the well-known things like methionine oxidation, deamidation, double oxidation, carbamidomethylation, carbamylation — things that are introduced by sample processing; they react with the proteins, and we pick them up because these searches are quite sensitive. Here you have, as a tiny little peak, phosphorylation. It looks tiny compared to this huge peak, but it's actually not that tiny: MzMod alone picked up about six thousand unique phosphorylated peptides, which is comparable to a standard phosphorylation study where you enrich phosphopeptides — it found quite a lot of these phosphopeptides just due to the sheer volume of the data. Are these chemical modifications completely useless? You could say, okay, we computed for over two thousand hours just to find mainly chemical modifications.
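The modification mass histogram mentioned above is a simple counting exercise. A minimal sketch (the shift list and counts are invented; real pipelines work on millions of shifts and use finer, calibrated bins):

```python
from collections import Counter

def delta_mass_histogram(mass_shifts, bin_width=0.01):
    """Bin observed (non-negative) precursor mass shifts into bin_width-wide
    bins using floor division; recurrent chemical artifacts show up as
    dominant bins. Negative shifts would need math.floor instead of int()."""
    return Counter(int(m / bin_width) * bin_width for m in mass_shifts)

# Toy shifts: oxidation (~15.99), deamidation (~0.98), phospho (~79.97), +57.02.
shifts = [15.994, 15.995, 15.995, 0.984, 0.984, 79.966, 15.994, 57.021]
top_bin, count = delta_mass_histogram(shifts).most_common(1)[0]
print(round(top_bin, 2), count)   # 15.99 4
```

Ranking the bins by count is exactly what makes the chemical artifacts jump out of the data, and what lets a biological signal like phosphorylation be spotted as a smaller but distinct peak next to them.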
This is recent work done by Thibaut Robin, and he showed that if you include the most abundant chemical modifications in your sequence search, you actually find more peptides than if you do a sequence search without any modifications. Since tools like MzMod or MODa tell us which chemical modifications are abundant, you can extract this information, configure your sequence search accordingly and search your data again with these modifications. In this case this is from human sperm data; this is a large project together with French groups, and we could identify quite a lot of proteins that don't have experimental evidence. If you add these chemical modifications to your search, it can increase the number of peptides that match to these proteins quite significantly. So these chemical modifications are not completely useless; the question is whether you want to take into account the increase in search time, and maybe other factors, in order to find them. So at the end, briefly, some results: these are things I found with MzMod. For example, you have something like oxidation of proline. You could imagine that oxidation of proline might not be the most interesting thing, and maybe it is not, but the question here was more whether oxidation of proline, compared to oxidation of methionine, is an artifact or whether it is real. And the data seem to indicate that at least part of this proline oxidation is real. For example, in this graph here you have on the x-axis the tissues, and on the y-axis the modification counts, that is, how many times this modification was detected in a specific tissue. You see there's no correlation between proline oxidation and methionine oxidation. If it were an artifact due to the amount of oxygen in your sample, there should be at least some correlation, but there's nothing.
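Feeding the open-search results back into a sequence search can be sketched as a simple selection step: take the most frequent mass shifts and turn the ones you can name into a variable-modification list. The shift-to-name table and counts below are illustrative inventions, not output of any real tool.

```python
from collections import Counter

# Hypothetical mapping from nominal mass shift (Da) to modification name.
KNOWN = {16: "Oxidation", 1: "Deamidation", 43: "Carbamylation",
         57: "Carbamidomethyl"}

def top_variable_mods(shift_counts, n=3):
    """Return the names of the n most frequent shifts we can name,
    to be configured as variable modifications in a sequence search."""
    mods = []
    for shift, _count in shift_counts.most_common():
        if shift in KNOWN:
            mods.append(KNOWN[shift])
        if len(mods) == n:
            break
    return mods

counts = Counter({16: 5000, 1: 3200, 57: 800, 122: 40})
print(top_variable_mods(counts))
# ['Oxidation', 'Deamidation', 'Carbamidomethyl']
```

The trade-off noted in the talk applies here too: each added variable modification enlarges the search space and the run time, so only the genuinely abundant shifts are worth including.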
It's also quite interesting to see that proline oxidation is very low in the cell lines. Proline oxidation happens mainly in collagen, and collagen forms the extracellular matrix, so in cells where you don't have an extracellular matrix you have almost no proline oxidation. The overlap with the annotation in UniProt is very small, and the main reason is that most of the oxidations we found were in collagen, and these are not annotated in UniProt. So there is only a very small overlap, but I think at least a large part of these proline oxidations might be real. Another quite important modification is citrullination on arginine. This is basically deamidation of arginine, which changes the mass by one dalton and removes the charge from the arginine. Removing the charge from arginine can have a very drastic effect. It often occurs in scaffold proteins like myelin basic protein, tau, collagen and so on, and removing the charge can either condense or decondense these large protein complexes. Recently it was also discovered that it has a function in the decondensation of chromatin: if you citrullinate arginine on, I think, histone H1, the positively charged arginine becomes neutral, it detaches from the negatively charged DNA, and this leads to a decondensation of the chromatin. There are some recent papers out on that, and this can have quite a drastic effect. Here again we have the tissues; we found most of the citrullination in the spinal cord, which makes sense: there's a lot of myelin there, and myelin is known to be citrullinated. Other modifications were on tau and other structural proteins. This GO analysis showed that we have a lot of these structural proteins in our list of citrullinated proteins. The overlap with UniProt was this time a little bit better: about 30% of the UniProt proteins that have a citrullination annotation were also found in our sample.
However, we found a lot of these citrullinations which were not annotated in UniProt, and, quite interestingly, we also found a lot of citrullination on proteins annotated in UniProt but on different sites. What that means one can speculate about, but it needs to be verified. As a last example, there's another quite common modification, succinylation on lysine. It's a mass shift of about 100 dalton, which also gave a very clear spike here. It occurs both in liver and frontal cortex, and again this makes sense. In the liver we detected the succinylation on alcohol dehydrogenase, a protein which is predominantly found in liver. This succinylation is not annotated in UniProt, so there might be something new. We also found it on albumin, where it is annotated, but the albumin succinylation was only found in the liver and not in the other tissues. We found it in the brain as well, on many of these brain proteins, and one has to go through each of them and look in detail at whether these modifications actually make sense at these positions. So, as conclusions: open modification searches on very large datasets are feasible, you actually get some results out of them, and you can add modified PSMs amounting to about 40 percent of your non-modified PSMs. If you search for specific modifications, don't use open modification searches; it doesn't make sense, you just configure these as variable modifications and search that way. Our spectrum library approach has some advantages over MODa in terms of sensitivity and accuracy. Apache Spark is really a great framework, and I think it's very well suited to deploy these calculations on a cluster. Most modifications you find are chemical; that's the way it is, and it's not necessarily bad. And there's still a lot of work that needs to be done on post-processing these results.
There's a lot of manual validation still involved, and it would be nice to automate at least part of this validation process. So at the end I would just like to thank the people involved in this project. This is especially Oliver Horlacher, who did most of the coding. He's really a great programmer; he implemented and re-implemented the library code and the MzMod tool, and he did really a great job there. Then Thibaut Robin, who now uses the MzMod tool to look for modifications in sperm data, which is a collaboration with neXtProt, especially Lydie Lane, and labs in France; Charles Pineau and Yves Vandenbrouck are the main people involved there. Then I would also like to thank the Proteome Informatics Group. It's a very nice group to work in, and we have a lot of discussions on these subjects. Most of the people here do either databases or glycomics, but there's a lot of overlap between the fields, and we have a lot in common there. Okay, and I would like to thank you for your attention.