Welcome to the MOOC course on Introduction to Proteogenomics. In the last lecture, Dr. Karl Clauser introduced you to the basics of mass spectrometry-based proteomics. Today's lecture will focus on the crucial steps in sample preparation for mass spectrometry-based proteomics and will also provide a glimpse of label-based quantitative proteomic approaches. Further, the concepts of peptide spectrum matches, or PSMs, and spectral library matching will be covered. So, let us welcome Dr. Clauser for his second lecture.

We are almost never these days going to collect one MS spectrum at a time. We are going to stick the sample in a machine, and the machine is going to keep generating mass spectra for as long as it can, alright. To make that automated workflow happen, you start out with a source of material, which could be tissue or could be cell lines. You extract that into proteins, then most often we digest the proteins into peptides. The peptides then go into a mass spectrometer, and this automated system does three basic steps, okay. It separates the peptides chromatographically, eluting them over time based on their hydrophobicity; in the description here that run takes about 120 minutes. At some given point in time, a scan is going to happen. This takes about 10 to 100 milliseconds now. The first thing you do in a cycle is take an MS scan, where you measure the masses of all of the peptides that are present, and then you very quickly collect some number of MS/MS spectra. A common number now is 10. So it takes the 10 biggest things that were observed in this MS scan and does MS/MS on them, okay.
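The cycle just described, one MS scan followed by MS/MS on the 10 biggest things seen in it, can be sketched in a few lines. This is only a minimal illustration: the function name, the peak format and the example intensities are made up for this sketch, and it ignores whatever additional selection rules real acquisition software applies.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); format assumed for this sketch


def pick_top_n_precursors(ms1_peaks: List[Peak], n: int = 10) -> List[Peak]:
    """Return the n most intense peaks of one MS scan, most intense first."""
    return sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)[:n]


# Hypothetical MS scan with five peaks; the values are invented for illustration.
scan = [
    (400.2, 1.0e5),
    (512.3, 8.0e6),
    (633.8, 2.5e6),
    (745.4, 4.0e4),
    (820.9, 9.0e5),
]

# With n=3, the three most intense precursors would be sent to MS/MS.
top3 = pick_top_n_precursors(scan, n=3)
```

In the lecture's setup n would be 10, and the whole cycle (one MS scan plus the MS/MS scans) repeats throughout the chromatographic run.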
Today a cycle like this can happen in a second, okay. This huge collection of spectra that are generated automatically then gets put into a software program, which tries to assign a peptide sequence to each of the mass spectra. Then you have some additional software that groups the peptides that belong to the same protein, and you get out a list of the proteins that were observed along with all of the peptides that you observed for each of them, okay.

This is, I want to say, the most desirable instrument, but it is most desirable only if you can afford it. It is one of the most expensive instruments and it can do a whole lot of things, okay. The CPTAC program is currently generating almost all of its data on this kind of instrument, but we do not use the entire capability of the instrument. With this instrument, a Fusion Lumos from Thermo, you put ions in here, you have a way of isolating precursor ions here, and then you can do MS/MS by three different techniques: higher-energy collisional dissociation (HCD), collision-induced dissociation (CID), or electron-transfer dissociation (ETD). Then you measure the spectra in the Orbitrap. It is also possible to measure them in the ion trap out here at lower resolution; you can go faster with lower resolution if you go out here, okay. In practice, in the CPTAC program for generating proteogenomic data, we generate only HCD spectra and we collect mass spectra only in the Orbitrap. So we are not taking full advantage of the instrument. The reason we are using those instruments is that this instrument, which does only the things that we really want, did not become available until earlier this year, and we started the grants that we are working on a year before that, okay.
All right, so this I think of as today's workhorse type of instrument. If one were setting up a lab to do proteogenomics in the way that we are going to describe having done it in the CPTAC program, this instrument would be the one to get today.

Sample preparation, okay. In proteomics, these are some of the basic considerations in designing your experiment. Quite often you are going to start with cells, tissue, or fluid (fluid might be blood, for example), and then there is some set of separations that you choose to do, okay. You have the choice of maybe doing some fractionation at the protein level. You might want to do some enrichment or depletion at the protein level. If you are working on cells and you care about mitochondria, you might do a preparation that gives you an enrichment of the subcellular fraction that you are interested in. From the standpoint of proteogenomics, we do not do any fractionation at the protein level. The first thing we do is digest into peptides, and then it is all about separation of peptides after that, okay. If you are going to do some fractionation at the protein level, it is usually because you are after some particular subset of things. Or let us say you are doing plasma. The most abundant protein in plasma is albumin, right, and it is the least interesting protein. So the first thing you want to do is get rid of it, and you use a depletion step to get rid of albumin before you go to peptides. But for the purpose of doing cancer proteogenomics, we take our tissue, grind it up, go to peptides, and then we fractionate peptides. Typically, if you are going to fractionate offline before you go to the instrument, you want to choose a methodology that gives you a different kind of separation than the one that is going into the instrument.
So two common ways of doing that are either ion exchange or what we most commonly do now, which is basic reverse phase, okay. That means we are running a reverse-phase separation, but we run it at pH 10, while the separation that goes into the instrument runs at pH 3. Another thing that you may want to do is enrichment. If you are after phosphopeptides, you do not have to sequence everything else to get to your phosphopeptides. You use something to pull them out; we use immobilized metal affinity chromatography. If you are interested in lysine-acetylated peptides, you can isolate them with an anti-acetyl-lysine antibody, okay. In choosing what you want to do, you are looking to make a trade-off among these criteria.

All right, most proteomics today is done in a way where there is a digestion step to peptides. Trypsin is by far the most common enzyme to use, and it gives you convenient lengths of peptides that generally tend to work well in a mass spectrometer. They have the property that they have a basic amino acid at the C-terminus, which gives you somewhat better fragmentation than if it is not at the C-terminus, okay. Cysteines can be disulfide linked when they are in a protein. If you just reduce them, they are very hard to maintain chemically throughout your process. So what we typically do is reduce them to break the disulfide bonds and then alkylate them with some agent like iodoacetamide, which makes them readily detectable. All right, so here you want to think about whether you are doing enrichment or depletion at the protein level, then there would be a digestion step, and then you consider fractionation or affinity enrichment. The reason that you would make all of these kinds of strategic decisions is usually some goal of increasing your depth of coverage.
So if you start out with a complex sample and you are only interested in the things that are low abundance, there is typically going to be some form of affinity enrichment involved, or depletion of more abundant components, all right. Of the most common post-translational modifications that people can work on by large-scale methods, phosphorylation is of course the most significant one. In our lab we also do a lot of ubiquitination work. This is done by way of a glycine-glycine that starts out on the ubiquitin. The ubiquitin is covalently bonded to a lysine in a protein. When you treat it with trypsin, it cuts off the ubiquitin but leaves behind the two glycines that were the C-terminus of the ubiquitin, okay. Acetylated lysines are something else that we are now also doing routinely in the CPTAC program, and you can do these by using an anti-acetyl-lysine antibody. We also have done this in a way where we don't have to split up the sample, which would require more sample, and then dedicate only some of it to one modification, some of it to another, and some of it to a third. Instead, we do the enrichments one after the other, okay. The things that don't bind to the IMAC column come through, and that flow-through can then be used for a next step of enrichment for something else. In our case we've published work on doing those three things in serial. So you start with less total sample and still achieve each of those enrichments, all right.

Okay, quantitation and multiplexing. Almost anything that we do in our lab today in proteomics is quantitative, and quite a lot of labs are trying to do things that are quantitative. The basis of doing something quantitative with some statistical power is that you have replicates, okay.
So not only do you want to have replicates, but you typically want to compare at least two conditions. Examples of this might include wild type versus mutant expression, treatment with a drug or without a drug, or capturing something with a bait or not. Most of what you detect is probably going to be unchanged between the conditions, and you're looking to do statistics to recognize the subset of things that change between the conditions, okay. So what sort of experimental design considerations go into this? I'm going to show you three different techniques. One here is called label-free, where you basically combine the samples at the end, okay. And then I'm going to show you two labeled techniques. One is SILAC, stable isotope labeling by amino acids in cell culture. The third one is something called TMT or iTRAQ, where you're using a chemical labeling agent. You then combine the samples and put them into the mass spectrometer. You do MS/MS, and the quantitation comes at the level of the MS/MS spectrum, okay. With SILAC, this technique here, you combine earlier in the process and the quantitation comes at the MS level. Multiplexing-wise, with SILAC you can do three samples at a time, although only two are pictured. You can do light and heavy; the third one would be medium, okay. With the TMT-10 reagent you can put together 10 samples. There are actually two iTRAQ reagents: one is called iTRAQ-4 and the other is called iTRAQ-8, and that tells you how many samples you can put together. There's also something called TMT-6, okay. And I will get into some of the differences in what you have to have to be able to do those kinds of experiments. So here are some of the features of these. With label-free, it takes a lot more time to do an experiment this way, a lot more instrument time.
With chemical labeling, there is some loss of accuracy in the quantitation due to compression, which I'll talk to you about, and the reagents can be expensive, okay. With SILAC, you have less potential for multiplexing, and in order to get a heavy label in, you have to be able to add it to the cell culture while the cells are growing. That means you can't label humans, okay. So this is really suited to working with cells in cell culture. The quality of the quantitation is shown here, and it would be highest over here. Why is it highest over there? Ideally, when you're going to mix things together, you'd like to mix them as early in the process as possible, so that any experimental variability that happens, happens to all the samples together, okay. But because of the way you do the experiment, you can't necessarily mix things until a later stage. In the case of chemical labeling, you have to mix after you have done digestion. But if you label in cell culture, you get to do the combination way back, just after the cells have grown. I think I've already said some of the pros and cons about this, so let's move on.

Let's go straight to what happens when you do a chemical labeling approach, okay. Here I'm illustrating TMT-6, so you would have six samples. You lyse each of the samples, which gives you proteins. Then you reduce and alkylate, and digest with trypsin into peptides. After you have peptides, you use the TMT labeling agent. These are amine-chemistry-based reagents, so they put a label on the side chain of lysine and on the N-terminus of the peptide, okay. The reagent comes in six colors. These are actually masses, and the masses shown here are the reporter ion masses that are present in the MS/MS spectrum, okay. After you do the labeling, you mix the samples, and then you have six different things labeled.
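Because these reagents react with amines, the number of labels a peptide picks up follows directly from its sequence: one for the peptide N-terminus plus one for each lysine. The sketch below counts those sites and the total mass added. The per-tag mass of 229.162932 Da is not given in the lecture and is an assumption supplied here for illustration, as are the function names, the example peptide, and the simplification that labeling is complete.

```python
# Assumption for this sketch: mass added per TMT tag, in Daltons.
# The lecture does not state this value.
TMT_TAG_MASS = 229.162932


def count_label_sites(peptide: str) -> int:
    """One amine at the peptide N-terminus plus one per lysine side chain."""
    return 1 + peptide.count("K")


def added_label_mass(peptide: str, tag_mass: float = TMT_TAG_MASS) -> float:
    """Total mass added to a peptide, assuming every site is labeled."""
    return count_label_sites(peptide) * tag_mass


peptide = "LKSEAPK"  # hypothetical tryptic peptide with two lysines
sites = count_label_sites(peptide)   # N-terminus + 2 lysines = 3 sites
extra = added_label_mass(peptide)    # 3 tags' worth of added mass
```

Since every channel's tag adds the same total mass, the same peptide from each of the mixed samples lands at the same precursor mass.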
The purpose of doing it this way is that the labeling reagent causes the same peptide from all of the samples to have the same mass, okay. The labels only show a different mass after you do MS/MS. So the signal that you see in an MS1 scan is the sum of all six samples. Which is good, right? It means you get more signal when you combine the samples. Then after you fragment, you have reporter ions, and the peak heights shown here enable you to do the quantitation back to the samples that they came from. All right, this is the chemical structure of the label. Shown where the asterisks are is where you would put 13C or 15N, okay. To do the labeling, you put in five heavy atoms, but depending upon where you put them, you can end up with, for example, a 126 ion. I have another slide that will show you where. The idea is that you put them in different places so that you have different labeling capability. Now, I told you at the very beginning that you can tell the difference between 15N and 13C, right. If you have high enough resolution, you can separate those and get 10 different things, okay. And that all goes back to whether there was a 15N on this nitrogen or a 13C in this position. This slide is harder to see, but the dots show you where the labels are for each of the different reagents. The things colored in black correspond to the reagents for TMT6. The additional things in red are the extra channels that you use for TMT10, okay.

Unfortunately, it's a little bit complicated. You have some impurities that you have to deal with, and there are two types, okay. One sort of impurity comes from how pure the 13C is that you start with to put in the label. You can get over 99% pure 13C to incorporate these days, but there's some level of impurity.
The same is true of 15N, okay. But there's a second kind of impurity, which is in the unlabeled positions. There are these carbons over here that carry naturally occurring levels of 13C. If you end up with a 13C in one of these positions, the label is going to be about one Dalton higher in mass than it should be, okay. So when you obtain the reagents, they also give you a set of correction factors that software will apply to correct the intensities to account for the impurities present in the label. If you obtain data from some public repository and you want to reprocess it all from scratch, make sure you get the correction factors that are provided by the people who generated the data. Unfortunately, they don't always remember to provide the correction factors when they deposit the data, so you might have to send an email asking for them, and hopefully someone will write back and give them to you, okay. One of the things that we try to do in our lab is always provide these, but I often have to chase down the people who did the experiment and say you need to provide these before we can put the data in a public repository. Those correction factors are then used by an algorithm that applies them and corrects the intensities. This is a publication that's about ten years old. We use the same method that they describe. We don't use the exact same software, because this publication is old enough that it only applied to iTRAQ 4; we've modified it to work with TMT6 and TMT10, okay.

All right, there's another sort of complication in working with TMT quantitation, and that comes down to interference, which goes back to what happens if you fragment more than one thing at a time. What I've tried to do is draw a cartoon here to illustrate how this works, okay.
Say you had two peptides of very similar precursor mass that were present at the same time, and let's say one of them is uninteresting: none of the six samples has either up- or down-regulated levels of that protein. But for the peptide that's right next to it, one sample has up-regulated levels, the red one, okay. So there's way more red in this one than there is in the other one. If you're using an older instrument, you might have to set the precursor mass window to be two Daltons wide, which would cause both of these things to be transmitted at the same time. The labels produce the same reporter ion masses, and you can't tell which peptide they came from, all right. So if you were to transmit this whole thing, you would have a reporter ion set that looked like this, only when the data came off the instrument it wouldn't have this white line through it that allows you to tell which one was which. You would just have the sum of these things, okay. If you were able to use an instrument that had a narrow precursor window, then the information you would get would be derived from just this one peptide. So if you combine these things and calculate the red divided by the pink (I'm sorry, let's call that orange), you would get a ratio of 2.5. If you had only the one peptide by itself, you would get a ratio of four, okay. The ratio of four is what you wanted to observe, but it is compressed to 2.5 because of this effect. This is just an example of what might happen. There are a couple of things you can do to deal with this, right. The first is you could do a better experiment, right.
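Before going through those remedies, the arithmetic behind the 2.5 versus four in the cartoon can be reproduced with made-up intensities. The lecture only gives the two ratios; the individual reporter intensities below (4 and 1 for the peptide of interest, 1 and 1 for the unchanging co-isolated peptide) are assumptions chosen so that the numbers come out the same, and the function and channel names are hypothetical.

```python
from typing import Dict


def reporter_ratio(target: Dict[str, float], interferer: Dict[str, float],
                   numerator: str, denominator: str) -> float:
    """Ratio of two reporter channels when the reporter signal from a
    co-isolated peptide is summed with that of the peptide of interest."""
    num = target[numerator] + interferer[numerator]
    den = target[denominator] + interferer[denominator]
    return num / den


# Assumed reporter intensities (arbitrary units), not taken from the slide.
target = {"red": 4.0, "orange": 1.0}      # up-regulated in the red sample
interferer = {"red": 1.0, "orange": 1.0}  # unchanged across samples
nothing = {"red": 0.0, "orange": 0.0}     # narrow window: no co-isolation

narrow_window = reporter_ratio(target, nothing, "red", "orange")   # 4 / 1
wide_window = reporter_ratio(target, interferer, "red", "orange")  # 5 / 2
```

With these assumed values the narrow window gives the true ratio of 4.0, while summing in the unchanging peptide pulls the observed ratio down to 2.5, which is the compression effect.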
The better experiment is possible if you have an instrument that allows narrower precursor transmission. All of the CPTAC work that is going to be presented later in the week and is already published is iTRAQ data run on Q Exactive instruments that at the time had a window of two Daltons for precursor transmission. What you can now routinely do on a Lumos instrument or a Q Exactive HF-X is run a 0.7 m/z window width. So in this kind of case you would be able to transmit only the one thing, okay. The second thing you could do is have data analysis that goes back and looks at all your MS scans and says, ah, if we've got this kind of thing, let's throw away that data point. Because we're expecting most of our proteins to be detected by multiple peptides, we have some way of recognizing that some data points are better than others, and so we can exclude the poor ones, okay. A common thing to do, although different people don't all call it the same thing, is to take some measure of how many things are in the window and what their relative abundance is, and when the relative abundance of the other things is high, you throw away the data point. Now that's an approximation, because although I have shown you in this cartoon example that the ratio of the MS1 peaks is the same as the relative ratio of the reporter ions, that's not always what actually happens, okay. When an individual peptide fragments, you get some reporter ion signal and some sequence ion signal. But sometimes the balance is like this, sometimes it's like that. So even if this peak right here in the MS1 scan is taller, it doesn't necessarily mean it's going to contribute more reporter ion signal, okay. So that's some of the uncertainty that's present in this type of data, and there's room for improving our data analysis here.

All right. Scoring peptide spectrum matches, all right.
All right, so this slide I already showed you once; it was several equipment failures ago. The idea is that you take a sequence database and your experimental spectrum. You have programs that approximate what the spectrum is expected to look like for each candidate peptide and then score the match. These are some names of programs that do this, okay. If you are going to design an algorithm to do this, these are the kinds of things that you have to think about. And when you start to look at other programs, these are some of the things that you could think about in terms of evaluating or reading about what they do, right. They're all going to have to deal, one way or another, with these kinds of things. There has to be some step, maybe not within the search program itself, maybe a program that you run ahead of time, that does peak detection, okay. It does things like deisotoping, it could assign fragment charge, and it does some sort of signal-to-noise processing so that you're hopefully using only real signal peaks. When you design the algorithm, you have to know what fragment ion types are possible, okay. And when you start to use a program, you often have to choose what instrument type you used to generate the spectra. When you've done that, behind the scenes it consults a configuration file that has the appropriate things, like what ion types are possible for that instrument and potentially some different scoring values for the different ion types, okay. Then, your algorithm has not only mass information, it also has intensity information. Today, search programs generally make not very much use of intensity other than to say present or not present, okay.
With some of the machine learning approaches that are starting to be used, one of the goals is to make better use of intensity information, okay. All right, you have to choose some fragment tolerance units. I told you the resolution is different across the mass range in certain instruments; that is particularly true in Orbitraps and time-of-flight instruments. The mass accuracy is also different across the mass range, and so we use different units. If you use parts-per-million units, a typical value for good mass accuracy on a high-res instrument ought to be plus or minus five parts per million, and you would say that across the entire mass range. But when you convert that parts per million into Daltons, it means a wider mass window at high mass and a narrower mass window at low mass, okay. So if your instrument specifies its mass accuracy in units of ppm, ideally you would like to use a search program that also supports mass tolerance in ppm units. But it is actually quite common to use a program that only supports a Dalton tolerance. Then what you have to do is compromise and set the tolerance to the value appropriate for high mass, when in principle you should be able to use a narrower tolerance at lower mass, okay.

All right, most search engines produce a primary score that's used to make most decisions, but along the way they might calculate extra things that might be possible to use in reducing your list to the confidently assigned peptides. Okay, some scoring systems are dependent upon the size of the database. Others are dependent only upon the scoring of the ions against a particular sequence, so if you take that sequence and put it in a big database or a little database, the score is going to be the same, okay. Some search engines will, however, take the size of the database into account.
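To put numbers on the ppm versus Dalton point made above, the conversion is just a multiplication by the mass. The sketch below uses the plus or minus five ppm figure from the lecture; the function names and the example m/z values are chosen here purely for illustration.

```python
def ppm_to_da(mz: float, ppm: float) -> float:
    """Width in Daltons (m/z units) of a ppm tolerance at a given m/z."""
    return mz * ppm / 1e6


def matches_ppm(observed_mz: float, theoretical_mz: float, ppm: float) -> bool:
    """True if the observed peak lies within +/- ppm of the theoretical m/z."""
    return abs(observed_mz - theoretical_mz) <= ppm_to_da(theoretical_mz, ppm)


# The same 5 ppm tolerance expressed in Daltons at low and high mass.
low_mass_window = ppm_to_da(200.0, 5.0)    # 0.001 Da
high_mass_window = ppm_to_da(2000.0, 5.0)  # 0.01 Da, ten times wider

# A 0.004 Da error is acceptable at m/z 1000 but far too large at m/z 200.
ok_at_1000 = matches_ppm(1000.004, 1000.0, 5.0)
ok_at_200 = matches_ppm(200.004, 200.0, 5.0)
```

A program that only accepts a single Dalton tolerance would have to use something like the high-mass value everywhere, which is looser than necessary for the low-mass fragments.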
All right, so that's what you have to consider if you're designing an algorithm. If you're going to use one, you have to consider these kinds of things, okay. You have to choose a database. Most of the time today, there's also an option to somehow search a decoy database that is used to calculate false discovery rate, okay. As you read literature, you will find that there are certain groups that always allow for partial enzyme specificity, while other people require full specificity, meaning the trypsin had to cleave on both ends of the peptide. If you're using partial enzyme specificity, that increases the search space that the spectra are matched against. The program runs slower and you usually have to have higher score thresholds to meet your FDR, okay. When you choose fixed and variable modifications, you want to choose things that you can expect to find in your sample. If you are interested in things that are rare, especially if you choose many of them, it's going to slow down the search. I have a slide where we'll talk a little bit about expansion of search space. Then you have to choose, like I said, precursor ion tolerance and fragment ion tolerance, okay.

All right, this is how the spectrum is scored in my software, called Spectrum Mill. Up here it is shown with all of the peaks that are present in the spectrum as it's generated from the instrument. The instrument doesn't color it blue, red and green though; it's all black, okay. There's a preprocessing step that does peak detection, which does these several things: deisotoping, signal-to-noise thresholding, and removal of parent-related neutral-loss peaks. So these are the only peaks that are left and are subject to the scoring, okay. The scoring has three components to it. There is a positive component, which means the mass matches a fragment ion type.
The score of that match is independent of the intensity, but it is weighted by what ion type it is, okay. There is a bonus for having composition information like immonium ions. And then there is a negative portion of the score for peaks that are not assigned. Basically a tall peak in a spectrum that's unassigned, that's bad, right. That suggests that you have an incorrect interpretation or you've got multiple things being fragmented at the same time, okay. The different ion types have different scores. b and y have the highest score; they have a score of one. Things like b minus water, y minus water, and a ions give you less information about the sequence, because you've already got information from the presence of b and y ions, so the a ions, b minus water, b minus ammonia score less; they score a half, okay. You do all of those things and you end up with a score. In this particular case the score is 12. The peak detection will produce no more than 25 peaks, so the maximum score is 25, okay.

All right, now something that's quite a bit different and less intuitive is one of these scores that uses a probability-based approach, and this is the binomial probability equation. It is the basis for scoring in the Andromeda search engine that's part of MaxQuant. Roughly the same approach is used in Mascot. The way this works is that all ion types are given the same weight, okay. In order to calculate the probability, you have to account for the chance of there being a randomly matched peak. The way that this is put into the binomial probability essentially comes down to breaking up the mass range into 100 Dalton chunks, and you say, if we're going to look for, say, six peaks in a chunk, then the chance of randomly matching would be six out of 100, okay.
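The binomial idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Andromeda or Mascot implementation: it takes n theoretical fragment ions, k of which matched a peak, uses the six-out-of-100 chance of a random match from the example above, and reports minus ten times the log of the probability of getting at least k matches by chance. The minus-ten-log convention and the function names are assumptions made for this sketch.

```python
from math import comb, log10


def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability that at least k of n theoretical fragments match by chance,
    when each one matches independently with probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def binomial_score(n: int, k: int, peaks_per_100_da: int) -> float:
    """Score = -10 * log10(chance probability); higher means less likely random."""
    p = peaks_per_100_da / 100.0  # e.g. 6 peaks per 100 Da chunk -> p = 0.06
    return -10.0 * log10(binomial_tail(n, k, p))


# Hypothetical peptide with 12 theoretical fragments, 6 peaks kept per chunk.
weak_match = binomial_score(n=12, k=4, peaks_per_100_da=6)
strong_match = binomial_score(n=12, k=8, peaks_per_100_da=6)
```

Note that every matched ion counts the same here regardless of ion type or intensity, which is the contrast with the weighted scoring described just before.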
It may not be immediately obvious, but breaking the mass range into chunks that way also implies that the mass tolerance you are allowing is plus or minus half a Dalton, okay. Now in practice MaxQuant allows you to specify a fragment ion tolerance, but that is not used as part of the scoring. And up until about one or two years ago Mascot did not allow you to use parts per million as a fragment tolerance; you had to use Daltons. It's because of the way that the scoring is built into the probability, okay. So from my point of view, the probability is not a true probability, but the scores are still effective. The reasons that the probability is not true are the ones that I've listed here and have already talked through, okay.

All right, now what I want to show you here as a contrast is what you would do if you knew what the peptide fragmentation was going to look like. You would know what the spectrum is going to look like because you already have a spectrum that you trust and that is used as the reference, presumably because you knew you had the peptide; maybe you made it synthetically. You generate a spectrum, that spectrum gets put in the library, and then all the experimental spectra you generate you just match to the library, okay. The particular case that I'm showing is actually one of these things where somebody is trying to demonstrate that the thing they observed in a complicated experiment is what they said it was. They made the synthetic peptide, the spectrum looks almost the same, you can calculate a spectral similarity metric, it passes the threshold, and they can say, see, this is what we said it was, okay.
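One widely used spectral similarity metric is a normalized dot product between the reference and experimental intensities. The sketch below is one simple variation, written under assumptions of its own: peaks are paired within a fixed Dalton tolerance, only peaks present in the library spectrum are considered, library peaks are assumed to be farther apart than the tolerance, and the peak lists are invented. It is not the formula of any particular library search program.

```python
from math import sqrt
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def normalized_dot_product(library: List[Peak], experimental: List[Peak],
                           tol_da: float = 0.02) -> float:
    """Cosine similarity over the library peaks: for each library peak, take
    the intensity of the closest experimental peak within tol_da, or 0."""
    lib_int, exp_int = [], []
    for lib_mz, lib_i in library:
        candidates = [(abs(mz - lib_mz), inten)
                      for mz, inten in experimental
                      if abs(mz - lib_mz) <= tol_da]
        lib_int.append(lib_i)
        exp_int.append(min(candidates)[1] if candidates else 0.0)
    dot = sum(a * b for a, b in zip(lib_int, exp_int))
    norm = sqrt(sum(a * a for a in lib_int)) * sqrt(sum(b * b for b in exp_int))
    return dot / norm if norm > 0 else 0.0


# Hypothetical reference spectrum and two experimental spectra.
reference = [(175.119, 30.0), (304.161, 100.0), (417.245, 55.0)]
same_pattern = [(175.120, 60.0), (304.160, 200.0), (417.246, 110.0)]
missing_peak = [(175.119, 30.0), (304.161, 100.0)]

identical_score = normalized_dot_product(reference, same_pattern)
partial_score = normalized_dot_product(reference, missing_peak)
```

Unlike the database search scores above, the result depends directly on how well the relative intensities agree, not just on whether a peak is present.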
All right, so the equation that gets used here is a dot product score. There are a few different variations on this and I'm not going to go through the math, but the point is that you're really taking advantage of the intensities, and you're not allowing for all possible fragment ion types that could occur for a peptide; you're only allowing for the ones that actually occurred when the reference spectrum was generated, okay. Software programs that do this kind of spectral library search are listed right here. The FDR method that's used is today not thought of as being as statistically rigorous as what is used for database searching, and I would characterize good false discovery rate calculation here as a work in progress, okay. It's also the case that in order to do this effectively you have to have a good set of reference spectra to match to. One of the things we found in our lab is that once we'd got a good reference library, somebody came up with a new chemical labeling agent, we switched over to it, the peptides all fragment differently, and now we have to start over, okay. But after we've done a lot of work, somebody can collect all of our spectra and use that as the basis for creating a library, okay.

All right. Now let's talk a little bit about localizing post-translational modification sites. What I've got here is an MS/MS spectrum of a phosphopeptide. This is not two spectra, this is one spectrum. It's just labeled two different ways, okay. You can see it's the same peptide sequence. Hold on. All right, and the only difference is whether the phosphosite is on this serine or on this threonine. I want you to raise your hand if you think it's on the threonine. Now I want you to raise your hand if it's on the serine. You have to pick one. Come on, come on. Okay, serine? Okay, who else wants to go serine?
Anybody who doesn't vote doesn't get the lunch coupon tomorrow. All right, so the answer is the serine. And you should have been able to vote, okay, because you don't have to know anything to see that when you look at the labeling, there's something that's not assigned here and it is assigned here, okay. All right, but let's talk about what is assigned and why, okay. So the fundamental premise of being able to pick where the thing is, is that you have to have fragmentation between the possibilities, all right. So in this case, you have this single ion right here in the spectrum, which can be interpreted as the Y7 ion for cleavage right there, where 101 would be the mass of threonine in its unmodified form. 167 is the mass of serine, which is 87, plus 80, which is the phosphate, okay. All right, so if you were instead to allow 87 for the serine, that would stick this in here in a sort of messy part of the spectrum. And then the gap out here would be shown as this ion to this ion and then that would leave that unexplained, okay. But because those two residues are right next to each other, you're not gonna get much information to work with in order to make your decision, okay. So in cases like this, and it's gonna be often the case that if you have to decide between two choices that are right next to each other, you're gonna have to make that decision on maybe one or two peaks in the spectrum, okay. All right, let's talk about the range of possibilities now that could happen here, okay. If you're looking for phosphorylation sites, the precursor mass is 80 daltons higher, so you know you've got phosphate, and then you look at the sequence, okay. All right, if you look at that sequence, there's only one serine, threonine or tyrosine in it. So you don't even need to look at the mass spectrum to figure out which one is phosphorylated. It's gonna be that one.
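The reasoning above (precursor mass shift tells you how many phosphates there are; the sequence tells you how many places they could go) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the tolerance, and the example sequences are hypothetical, and 79.96633 Da is the approximate monoisotopic mass added by one phosphate (HPO3).

```python
PHOSPHO_MASS = 79.96633  # approximate monoisotopic mass of HPO3 in daltons

def count_phosphates(observed_mass, unmodified_mass, tol=0.02):
    """Infer the number of phosphates from the precursor mass shift, or None if it does not fit."""
    delta = observed_mass - unmodified_mass
    n = round(delta / PHOSPHO_MASS)
    if n < 0 or abs(delta - n * PHOSPHO_MASS) > tol:
        return None
    return n

def candidate_sites(sequence):
    """1-based positions of serine, threonine, and tyrosine residues."""
    return [i for i, aa in enumerate(sequence, start=1) if aa in "STY"]

def localization_is_trivial(sequence, n_phospho):
    """True when every candidate site must be modified, so no spectrum evidence is needed."""
    return n_phospho > 0 and n_phospho == len(candidate_sites(sequence))

# Hypothetical peptide with a single possible site.
n = count_phosphates(1079.96633, 1000.0)
trivial = localization_is_trivial("AGLPSVEK", n)
```

When `localization_is_trivial` returns False, the decision has to come from fragment ions that fall between the candidate positions.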
All right, I'm gonna switch to the pointer here, okay. It takes a lot of time to get the red spot back on the thing, so I'm just gonna switch, okay. So in this case, you will have a peptide sequence where you have a serine or a threonine, and so, if you have enough information, you could confidently say that the phosphate is on the serine, and we would call that a 99% chance of being correct, okay. Let's suppose here you have one phosphate, and it could be on any of these three serines out here. If you have fragmentation between them, you can tell the difference, okay. And I'm gonna show you a spectrum where there's fragmentation between serines two and three, so we can say it's not on serine three, but we can't tell the difference between the first and second serine, okay. When you get multiple phosphosites in the same peptide, that gets a bit trickier, and this is illustrating all the possible places, the combinations in which you could put them, and then I'm gonna show you a spectrum that gives you the ability to tell that there has to be one on this serine, not on that threonine, but then for the second one we can't tell where it is, okay. So this is how complicated this kind of stuff gets, and when you are doing proteogenomic work, and you wanna look at the phospho data set, and you look at the list, and you're like, there's all these things in this list that don't have clear assignments to a serine or threonine. Well, that's a feature of the data that you gotta deal with. One of the ways you might deal with this is to throw out everything that's not confidently localized to a particular position, okay. Here are the spectra that give you the cases that I just described, okay. So here is a spectrum where we can confidently put the phosphate on the serine, and these ions in the spectrum, Y5, 6, and 7, are separated by the right masses, oh, they should have been labeled, okay.
This is gonna be a 113 gap, this is gonna be a 167 gap, and then 97, okay. So that can place the phosphate on that serine, not on that threonine. This gap over here is gonna be 101, okay. Here's the example where the Y13 ion, Y13 doubly charged right here, shows fragmentation between the second and third serine. So we know that the third serine is not phosphorylated, but there is no fragmentation between the first two, okay. So we can't tell where that is. All right, here's the complicated one where there are two phosphorylation sites. The precursor mass is 160 greater than the unmodified version for this sequence. We have fragment ions, Y9 and 10, and that gap there is 167, which is gonna say that that's phospho-serine, and then between Y5 and 9 here, there's not very good fragmentation, and so we can't tell where the localization is, okay. All right, so I tried to show you graphically, when you look at the spectrum, whether you have the information. If you're gonna write a program to do this, right, these are some of the things that you've got to put into the design. You're gonna think about all of these things, okay. I think the most important of those things are shown here: the choice about how you decide what peaks are gonna be used to make your decision, and then how you clearly represent the certainty or ambiguity in the localization decisions that the program has made, okay. There will be different choices made by people that write the programs about how to deal with the rest of these issues, okay. And then today, there is not a universally applied way of determining a false localization rate from these scoring things, whereas the target decoy calculation for identification is practiced throughout the field, okay.
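The gap reasoning used on the spectra above (113, 167, 97, 101) can be sketched in code: take consecutive singly charged y ions, compute the mass differences, and see which residue each difference matches. This is a minimal sketch; the small mass table, the tolerance, the function names, and the example ladder are assumptions for illustration, and the masses are approximate monoisotopic values.

```python
# Approximate monoisotopic residue masses in daltons (only a few residues, for illustration).
RESIDUE_MASSES = {
    "S": 87.03203,
    "P": 97.05276,
    "T": 101.04768,
    "L/I": 113.08406,  # leucine and isoleucine have the same mass
}
PHOSPHO = 79.96633  # approximate mass added by one phosphate (HPO3)
RESIDUE_MASSES["pS"] = RESIDUE_MASSES["S"] + PHOSPHO  # about 167
RESIDUE_MASSES["pT"] = RESIDUE_MASSES["T"] + PHOSPHO  # about 181

def assign_gap(gap, tol=0.02):
    """Return every residue label whose mass matches the gap within the tolerance."""
    return [name for name, mass in RESIDUE_MASSES.items() if abs(gap - mass) <= tol]

def read_ladder(y_ions, tol=0.02):
    """Assign a residue to each gap between consecutive singly charged y ions."""
    ordered = sorted(y_ions)
    return [assign_gap(b - a, tol) for a, b in zip(ordered, ordered[1:])]

# Hypothetical y-ion ladder with gaps of roughly 113, 167, and 97 Da.
start = 500.0
ladder = [start,
          start + 113.08406,
          start + 113.08406 + 166.99836,
          start + 113.08406 + 166.99836 + 97.05276]
assignments = read_ladder(ladder)
```

A gap of about 167 points to phospho-serine, while a gap of about 101 next to it points to unmodified threonine. If no fragment ion falls between two candidate residues, there is no gap to read and the site stays ambiguous, which is the situation described for the first two serines.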
All right, this is one of the first automated scoring approaches, and it is again using this binomial probability, but instead of basing the calculation on all the possible fragmentation of the peptide, it's limited to just the fragmentation between the sites that you're trying to distinguish for the localization, okay. But otherwise it uses the same framework, the same mass accuracy assumptions. And when you get down to what score threshold you're gonna use, it comes down to essentially saying that there have to be two good peaks that meet the scoring threshold, okay. At the time that this was published, the authors used a particular score threshold; I forget exactly what the value was. And then like a year or two later, they decided they could say they have more identifications if they made the threshold lower, okay. And it was essentially by saying that instead of two peaks, you were gonna allow one peak to make the decision, okay. All right, but you have this nice descriptive way of using a mathematical calculation, okay. When I wrote the calculation, I tried to think of it more intuitively. I calculate the difference in the identification scores given the various possible places, and the decisions I made were on the quality of the information that gave those score distinctions, okay. I said that the ion type that you allow to make the decision has to be one of the highest information ion types. It's gotta be a B or a Y ion. You're not gonna make the decision based on one ion that's a B minus water ion. You're also not gonna make the decision based on a tiny little peak that could be mistaken for noise, okay. So what I sought to do is say that it's gonna be a B or Y ion and the relative intensity has to be at least 10% of the base peak. So it's a solid peak, it's not noise. That works out to giving a score threshold of 1.1.
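The peak-quality rule just described (the deciding ion must be a B or Y ion with at least 10% of the base peak intensity) can be sketched as a simple filter. This is a minimal sketch, not the actual scoring code of any program: the `MatchedPeak` class, the ion type strings, and the function names are hypothetical, and the 1.1 score threshold is not reproduced because the underlying score formula is not given here.

```python
from dataclasses import dataclass

@dataclass
class MatchedPeak:
    ion_type: str     # hypothetical labels such as "b", "y", "b-H2O"
    intensity: float  # raw intensity of the matched peak

def qualifies(peak, base_peak_intensity, min_relative=0.10):
    """True if the peak is a b or y ion and at least 10% of the base peak."""
    is_high_information = peak.ion_type in ("b", "y")
    is_solid = peak.intensity >= min_relative * base_peak_intensity
    return is_high_information and is_solid

def site_determining_peaks(peaks, base_peak_intensity):
    """Keep only the peaks that are allowed to decide between two candidate sites."""
    return [p for p in peaks if qualifies(p, base_peak_intensity)]

# Hypothetical peaks that fall between two candidate sites.
peaks = [MatchedPeak("y", 25.0), MatchedPeak("b-H2O", 80.0), MatchedPeak("y", 5.0)]
usable = site_determining_peaks(peaks, base_peak_intensity=100.0)
```

In this example only the first peak survives: the water-loss ion is excluded by type no matter how intense it is, and the weak Y ion is excluded as possible noise.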
In conclusion, we hope that this lecture and the series of five lectures so far have helped you to appreciate the importance of sample preparation for mass spectrometry-based identification of peptides, and the need for enrichment of post-translationally modified forms of peptides prior to MS analysis. The lecture has also provided you a glimpse of how impurities in the sample can lead to errors during the identification of peptides. Additionally, you were introduced to the concepts of PSMs and how a specific software like Spectrum Mill uses PSMs to score the hits. Lastly, the concepts of phosphosite localization and scoring were explained using suitable examples. In the next lecture, Dr. Karl Klauser will conduct hands-on sessions to help you interpret MS-MS spectra manually. Thank you.