Hello, my name is Lennart Martens and this lecture in the statistical genomics course will be focused on giving you a broad introduction into mass spectrometry-based proteomics data analysis. So first let's have a quick look at where we can situate proteomics in the central paradigm of biology. As you are all very familiar with, we start from a genome; transcription then takes care of writing this genome into RNA. RNA is then processed in eukaryotes to create messenger RNA, which is translocated out of the nucleus, and in the cytosol it connects to ribosomes, and these ribosomes then translate the mRNA sequence into protein. A protein is sometimes called a polypeptide; a peptide is essentially a shorter stretch of amino acids, and if you link many of these together, you obtain a polypeptide or protein. And of course proteins are not just linear sequences. Proteins also undergo several types of post-translational changes. First and foremost, there is the fact that a protein actually has a certain fold, a certain three-dimensional structure, which means that the protein then becomes functional in most cases. A linear stretch of amino acids would probably not have the desired function. And then finally, and this happens very often with many proteins, they are post-translationally modified in a chemical way. For instance, the addition of phospho groups in the process known as phosphorylation, or acetyl groups in the process of acetylation, and similar things such as glycosylation or ubiquitinylation, which are also quite well known modifications. Some form of processing is also possible, and in those cases particular parts of the protein may be cleaved off. Very often this is an N-terminal segment, so the front of the protein, the first bits are being cut off, because these are either targeting sequences that target the protein to a particular subcellular localization, or these could be activation segments. So what this means is that we can characterize the protein at several different levels. We have the primary structure, its sequence, as given here in single-letter notation, where every letter represents an amino acid. There's secondary structure like an alpha helix, which is shown here; a beta sheet is the other typical secondary structure; and there is of course a loop, which essentially means an area without a particular fixed structure in it. Usually these loops serve as connecting elements between the beta sheets and the alpha helices. And then finally there's tertiary structure, so the 3D shape of the finished protein after the folding has completed. As we mentioned, there are modifications like phosphorylation. These tend to be dynamic and tend to influence the function of the protein. Dynamic means they can be put on or taken off. Very often, putting a modification on or taking the modification off means that the protein gets activated or deactivated, depending on the particular modification and the way the protein works. Finally, processing can happen, where as I said we can target the protein to a subcellular location or we can activate it. A very common example of activation is trypsin, which is a protein that we will also use in proteomics in the experimental setup. This protein is actually a protease, which means it's a protein that cleaves other proteins. It's active in our intestines.
And because it is a protein that cleaves other proteins, it's very important to make sure that it is not active in the cell, or it would start destroying the proteins in the cell. So there is a little flap, if you like, that covers the active site of trypsin. And as soon as the trypsin is pushed out of the cell into the intestine, this bit gets cleaved off, which means that the protein now becomes active because the active site is now accessible for function. Another well-known example is the clotting of our blood, which is actually tightly regulated via a very specific cascade of proteins that all have to be cleaved in very specific ways prior to the clotting taking place. And the reason for this is also obvious: you do not want the blood to clot in your arteries or veins. You want the blood to clot only in the case of tissue damage and bleeding, essentially. Right. So what is being studied by mass spectrometry-based proteomics is predominantly the primary structure, so determining the sequence of the protein and determining its modifications, potentially also processing sites. Increasingly, people are also studying secondary structure elements and tertiary structure with very specific types of experiments, but in this basic introduction we will not cover those particular means of employing mass spectrometry-based proteomics, although it is gaining in relevance and you may come across it. Now in this lecture, what we will do is focus first on the building blocks of proteins, which are the amino acids, and we'll talk about the peptides that come from that. And then if you make a long stretch of amino acids, it becomes a protein. So a short stretch is a peptide, a long stretch is a protein. We'll then talk a little bit about the basics of mass spectrometry, but there's a more extensive set of lectures on YouTube that you can find, and that is linked from the statistical genomics website. You're warmly recommended to actually watch these videos as well, because they really do teach a lot more about the specifics of how the instruments work. Knowing about these instruments is important because it allows you to understand the data that comes out of the instruments, and these are the MS/MS spectra. And that in turn helps us understand how we identify the spectra, how we extract sequence from the acquired information, and all the limitations that come with that, because the instruments are unfortunately far from perfect. The way we do this identification is then the topic of the remainder of the lecture, where we talk about database search algorithms and we'll go through the evolution of these algorithms in three phases. And the reason for that is that it'll actually allow us to get a good idea of how the field has progressed over time. And then sequential search algorithms are a particular type of algorithm: rather than use a database as these database search algorithms do, sequential search algorithms have a different approach. We'll talk about that. And then we'll discuss how the field is trying to correct for multiple testing, because of course every time we test a spectrum for an identification, we do a statistical test of some sort. And of course, we have more than one spectrum. In fact, we have tens of thousands to millions of spectra these days in experiments, which means that we effectively run into the multiple testing problem. So each time we try to identify a spectrum, we have a certain P value, a certain probability.
And then if you do this a million times, of course, the probability of having a significant P value by random chance shoots up dramatically. And we have to correct for that. So in proteomics, we do that empirically using decoys, and we then do a false discovery rate calculation based on that. Finally, we'll talk a little bit about protein inference because, as we will see later on, our mass spectrometer never really sees proteins. The mass spectrometer sees peptides. And because it sees only fragments or sections of proteins, we then have to match these sections back to the proteins, which in fact is a really error-prone and difficult process to control, which is why I don't have many good things to say about this particular problem. But it is a relevant problem, because it does come back when you later on want to quantify the proteins. We'll come back to that at the end of this lecture. Let's first start with the basics: amino acids, peptides and proteins. Now, what I'm showing you here are the structures of the 20 common amino acids that are found across the tree of life in the proteins that every living organism makes. You can see that they're actually split up into several groups, and the reason for this is that they are chemically different. Now, before we look at the differences, let's first look at what they all have in common. They are called amino acids because they have an amino group, which is this NH3 plus here, and they have an acidic group, a carboxylic group, that's this carboxyl here. So it's an amino acid. That's something that all amino acids have in common. And of course, there are many more than these 20 amino acids, but these 20 amino acids are the ones that we find throughout the entire tree of life; we find these 20 amino acids back in the proteins. Now, apart from this common backbone, if you like, which is identical for every single one of these 20 amino acids, you can see that there is a side chain, and this side chain can vary quite dramatically between different amino acids. So for instance, in glycine, which is the smallest amino acid, this side chain consists of a single hydrogen. But in the largest amino acid, tryptophan, the side chain is far more complex: a methylene group connects here to a five-ring fused to a six-ring, creating a very large side chain. Noticeably, this side chain of tryptophan is highly hydrophobic as well because of the ring structures. And the same goes for all the amino acids that are essentially here near the top: they're quite hydrophobic. The exception could potentially be tyrosine, because it has a hydroxyl group, so it is maybe a little bit more soluble. But let's say isoleucine, methionine, leucine, valine and alanine are all aliphatic. They're very hydrophobic amino acids because they mostly consist of carbon and hydrogen in the side chains. Methionine is special because it also has a sulfur atom here in the side chain. Now, apart from these aliphatic and aromatic side chains, there are also the polar uncharged ones. So here, serine and threonine are quite well studied amino acids because they have a hydroxyl group which can be phosphorylated. The same goes for tyrosine, which also has a hydroxyl group here on the ring structure. There's cysteine, which is very important for disulfide bridges, which help proteins maintain their structure, because the disulfide creates a bond between one cysteine's thiol group, the SH group, and another cysteine's thiol group.
And so these are quite rare amino acids in proteins because they serve such a specific function. But wherever they're found, they tend to be highly conserved and highly important, because they kind of staple the protein together in these places. So they really help maintain a certain structure. Then there's proline, which is a cyclical amino acid. It's the only amino acid whose side chain folds back onto the backbone through the amine here. There are asparagine and glutamine, and then aspartate and glutamate, which have lost that amine group and replaced it with another oxygen, which means that they have carboxylic side chains. And then in the case of the positive amino acids, we have lysine, which has an epsilon amine group, so it's a second amine, just like on the backbone. And then arginine, which has a guanidinium group, which is actually far more basic, as you can see here, because the double bond actually resonates between the amine groups, and therefore it is a little bit more stable and can accept a proton much more readily. Finally, histidine has a five-ring structure, which has an amine here as well, and this five-ring structure can also accept an additional proton. Although in physiological conditions, so at a pH of around 7, it tends not to be ionized automatically. So it tends to only get ionized in certain circumstances, usually when there is a conjugation through, say, a proton acceptor like glutamate or aspartate in an enzyme. So now we've seen that these 20 amino acids are actually quite different. And this is important, because if we look at the structures of the nucleotide bases, A, C, T, G and U, you will see that they actually are quite similar. So chemically, the differences between the nucleotides are very, very small. The difference is noticeable, if you like, or readable by the cellular machinery, but it doesn't change their chemical behavior all that much, which essentially means if you can dissolve one of these nucleotides in a buffer, you can dissolve all of these nucleotides in a buffer. And if you can analyze, say, the DNA strand of an E. coli bacterium, you can also analyze an mRNA from a eukaryotic cell, because fundamentally, chemically, they're very, very similar. Now, that does not happen at all with proteins, because their building blocks are so dissimilar. There are extremely hydrophobic proteins, which consist mainly of these non-polar aliphatic or aromatic amino acids, and they may live, for instance, in the membrane, in the cellular membrane, which is of course a very hydrophobic milieu. On the other hand, there are also proteins that have to be extremely positively charged, for instance containing many lysines and arginines, for the simple reason that they have to interact with DNA, which itself is negatively charged because of all the phosphate groups on the backbone. And so proteins can take on a wide variety of different physicochemical properties, based on the specific amino acids that are contained within the protein. And so this is an important realization. When we think about how the proteins are built, this means that we have this immense diversity in physicochemical properties. And that also means that analyzing every protein in a cell is practically impossible in a single experiment.
The reason being that if you have a buffer that dissolves, say, all the cytosolic proteins, which are very water-loving, very hydrophilic, then there is a very, very high probability that only very few of the membrane proteins will dissolve in that buffer, because these proteins, necessarily, are extremely hydrophobic and thus will precipitate very readily in the kind of buffer that dissolves the hydrophilic cytosolic proteins. So we are actually faced with a wide variety of proteins, each tailored exquisitely to their specific task in the cell. But it makes it extremely difficult for us to analyze everything. So immediately, you can feel that getting all of the RNAs out of a cell in a buffer, in solution, ready to be analyzed by some process or other, is actually not that difficult, because you have molecules that all behave extremely similarly when it comes to their physicochemical properties and, for instance, the buffers in which they will dissolve. For proteins, we do not have that luxury. Proteins tend to not like the kind of buffer that their neighbor will dissolve in. So as a result, proteins are harder to analyze and pose more of an analytical challenge than your typical nucleic acid strand. Now, the upshot is that we can actually readily differentiate between the different amino acids based on, for instance, their molecular weight, which is essentially calculated from the number of atoms and which atoms are in there, so their molecular formula. So 3 times the mass of carbon at 12, plus 7 times hydrogen at almost 1, plus the nitrogen at 14, plus 2 times the oxygen at 16 will give you the actual molecular weight of that specific amino acid. Now, apart from the weight, there are also two other aspects. There's the pKa and pKb, which actually give us the acidity and the basicity, respectively, of the carboxylic moiety and of the amino moiety. So as you can see, the pKa, so the acidity of the carboxyl group, actually varies depending on the side chain. If your side chain is different, has a different chemical structure, that may stabilize or destabilize the loss of the proton. Now, all of these values are very low. They're all below two and a half, say; the highest one is actually tryptophan at 2.83, so they're all below that. And that means that at normal pH 7, essentially all of these carboxyl groups will be ionized, will have lost the proton. And that is exactly how these are rendered, because we always speak of the carboxyl as COO minus. On the other hand, the pKb, which is the basicity of the amino group, also varies, but it's typically above eight and a half, say. I think the lowest is here, asparagine, with 8.80, so everything is higher than that. And again, at pH 7, that means that all of these amines will be protonated, and again, that's how we draw them, with all of the amines protonated. You can also see that in many of these amino acids there is therefore a balance between the positive charge on the amino group and the negative charge on the carboxyl group. We call that a zwitterion, from the German 'Zwitter', because the net charge is then simply the sum of the positive and the negative charges, in this case zero. Now, what is relevant here is that this pKa and pKb do change, and that may affect the way that these amino acids behave, and therefore it also allows the way in which a peptide or protein behaves to differ a little bit. However, the most important value is this pKx.
Now, the pKx is special because we don't have a pKx for everything. What is this pKx? It is the acidity or basicity, if you like, of the side chain. Now, let's have a look at glycine, for instance. If we look up glycine in this alphabetical list, we can see that glycine does not have a pKx, and that is because the single hydrogen will never behave as an acid or a base. It is not a proton acceptor, it is not a proton donor. The same goes for the methyl group here on alanine, because we can see here that alanine also does not have a pKx. So where do we expect a pKx? In the charged amino acids: these are lysine, arginine, aspartate, glutamate, and to some extent histidine. And that's exactly where we find these numbers. So arginine is extremely basic. I mentioned this before: this guanidinium group is very, very good at accepting a proton, and that translates into a pKx, in this case a pKb if you like, a basicity of the side chain, of 12.5, which is rather high. It's much higher than any of the amino groups which are on the backbone. Contrast that with lysine, for instance, which is here halfway down the list, and that's 10.5. So remember, the p is minus log, which means that this difference of essentially two units on the scale actually means that it's about a hundredfold more basic. So the guanidinium group of arginine is about a hundred times more basic than the single amino group on the epsilon of lysine. So you can see huge differences here as well. Again, that is something the cell can play with. So huge variety, even on these aspects. Histidine, we mentioned, is a potentially positively charged amino acid. And you see why we don't actually place this as a charged residue: because the pKb is only six. That means that at a pH of seven, it tends not to be protonated. So you have to lower the surrounding pH in order to get this amino acid to protonate, which is why you find it very often in active centres of enzymes, surrounded by carboxyl groups, which effectively lower that local pH, if you like, or at least thermodynamically stabilize, through conjugated bonds, an extra proton added to this five-ring. Now, an interesting observation here is that we can actually also find a pKx for tyrosine. The hydroxyl on the tyrosine ring, stabilized through the conjugated double bonds, can also lose its proton, so a tyrosine can behave like a charged amino acid as well. Finally, we have of course aspartate and glutamate, which both have a very low pKa for the side chain, so acidity constants; aspartic acid is about ten times as acidic as glutamate because its pKa is about one unit lower. But again, both are very clearly always deprotonated at a physiological pH. Cysteine here, you can see, actually can also act as a proton acceptor and therefore has a slight basicity, but it's about a hundredfold less than, say, lysine. Finally, there's pI, the isoelectric point. What does that mean? This is the pH at which the net charge of the amino acid is zero. And quite a few of these are around five-ish, which has to do with the balance between the pKa and the pKb of the backbone, because there are no other charges anyway, right? So for tryptophan, for valine, for alanine and so forth, this is completely determined by the balance between the carboxylic group and the amino group. Of course, this is not the case for, say, the acidic amino acids, where you can see the pI drop quite dramatically towards the pKa values, because the pI is then essentially determined by the two carboxylic groups; the little calculation sketched below illustrates this.
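As a small aside for the practically minded, here is a minimal sketch of how these pK values translate into a net charge and an isoelectric point, assuming simple Henderson-Hasselbalch behaviour for each ionizable group. The pK numbers in the example are typical textbook values for glycine, aspartate and lysine, used purely as an illustration; note that the "basic" pK values are given as the pKa of the corresponding protonated group, which is how they are usually tabulated.

```python
# Sketch: net charge of a free amino acid as a function of pH, assuming simple
# Henderson-Hasselbalch behaviour for each ionizable group. The pK values used
# below are typical textbook numbers, given as pKa of the protonated group.

def net_charge(ph, pk_carboxyl, pk_amino, pk_side=None, side_is_acid=False):
    """Net charge at a given pH from the ionizable groups of one amino acid."""
    charge = -1.0 / (1.0 + 10 ** (pk_carboxyl - ph))   # fraction of COOH that is COO-
    charge += 1.0 / (1.0 + 10 ** (ph - pk_amino))      # fraction of amine that is NH3+
    if pk_side is not None:
        if side_is_acid:
            charge += -1.0 / (1.0 + 10 ** (pk_side - ph))
        else:
            charge += 1.0 / (1.0 + 10 ** (ph - pk_side))
    return charge

def isoelectric_point(**groups):
    """Find the pH where the net charge crosses zero (simple bisection)."""
    lo, hi = 0.0, 14.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if net_charge(mid, **groups) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Glycine: only the backbone groups, so the pI sits between their two pK values.
print(round(isoelectric_point(pk_carboxyl=2.34, pk_amino=9.60), 2))
# Aspartate: the acidic side chain pulls the pI down towards the carboxyl pK values.
print(round(isoelectric_point(pk_carboxyl=1.88, pk_amino=9.60,
                              pk_side=3.65, side_is_acid=True), 2))
# Lysine: the basic side chain pushes the pI up.
print(round(isoelectric_point(pk_carboxyl=2.18, pk_amino=8.95,
                              pk_side=10.53, side_is_acid=False), 2))
```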
So for an acidic amino acid, the pI gets pulled down: before the net charge reaches zero, you already have to start protonating the side chain carboxyl as well, so that's an interesting effect. And you don't have this limitation on the basic residues, because there the pI goes up towards the basicity of the side chain, since the alpha amino group is already protonated anyway. Right, so here again you see what you expect: you get a higher value for the amino acids which have a basic side chain, okay? So it shows us a very important thing: amino acids are all very different. They don't just weigh differently, they actually chemically act very, very differently. And that gives rise to the huge variety of different properties of amino acids, and therefore of proteins, which consist of amino acids. Now, let's move on. How do we assemble these amino acids into peptides? Well, here's an amino acid: alpha carbon, amino group, carboxyl group, and here's a side chain, whichever one this may be, one of the 20. And then here is another amino acid. And what will happen now is that this carboxyl group and this amino group will form a peptide or amide bond: water is actually being expelled, and what you get is a bond between the nitrogen and that carbon of the carboxyl group, and that is the way that amino acids are coupled together. Now, you can of course do this multiple times in a row, daisy-chaining, if you like, all these amino acids, and then you get what we call a peptide. So naturally the side chains are drawn pointing in opposite directions along the backbone. In the three-dimensional structure this can be warped, so it doesn't have to be that way, but in the linear sequence they tend to point the other way each time, which is what I've replicated here. And now you can see that you have a residue, which is what is left of the amino acid after it has participated in the bond. So the amino group loses one proton to the water that is being ejected, if you like, from the reaction, and the carboxyl group loses that oxygen-hydrogen combination. So that is the residue, and on the previous slide we actually had the molecular weight and the residue weight. And as you can see, the difference between the molecular formula and the residue formula is that it loses water. So there's an H2O gone: H7O2 becomes H5O1. And then you can see here that the residue weight is in fact minus H2O. So you just subtract the mass of water from the molecular weight to give you the residue weight. Now you know exactly why that is, because actual water has been lost by each of these residues. There are two notable exceptions, of course: the first amino acid in the chain retains the original amino group here at the amino terminus. So that end is obviously called the amino terminus. And you can guess what we will call the other end: the carboxyl terminus, because the last residue in the chain will retain its own carboxyl group, okay? And so the notation of a protein is always amino terminus to carboxyl terminus. It's just a convention, like five prime to three prime for nucleic acids. You just choose to always write the amino-terminal amino acid in front, and then everybody knows what you're talking about. So this is about the nomenclature and, very briefly, about the basic chemistry of how these bonds are formed. Maybe one more thing that is relevant to mention is that this peptide bond is quite weak. So it's not a very strong bond, and that's useful because that means it's relatively cheap to assemble proteins. It's also relatively cheap to disassemble proteins.
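To make that residue-mass bookkeeping concrete before we move on, here is a minimal sketch of how a peptide mass follows from the residue masses, with one water added back for the free amino and carboxyl termini. The residue masses are standard monoisotopic values; the example sequence is made up purely for illustration.

```python
# Sketch: monoisotopic mass of a peptide from its residue masses.
# Each residue mass is the amino acid mass minus one water (lost when the
# peptide bond forms); the peptide then gets one water back for its two termini.

RESIDUE_MASS = {  # standard monoisotopic residue masses, in dalton
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # monoisotopic mass of H2O, in dalton

def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified peptide, written N- to C-terminus."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER

# Hypothetical tryptic peptide, for illustration only.
print(round(peptide_mass("SAMPLER"), 4))
```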
So in the way they are assembled and disassembled, proteins look more like Lego than like, say, Meccano. Meccano is where you screw everything together quite tightly; here you just plug the pieces onto each other. That also means, of course, as I mentioned, that you can unplug them very easily, which means that proteins are photosensitive, so UV light breaks them, they're heat sensitive, and they're acid and base sensitive. So practically anything can break this bond, and proteins are therefore not particularly stable when you expose them to these kinds of things. Maybe a small sidestep here. We're currently in the grip of the coronavirus, which is why of course you're watching this on video rather than seeing me talk about this live. And one of the things that we know is that the virus is actually built mostly out of protein; the nucleic acids are just a very, very small part of the virus, an important part, mind you, but a small part. The proteins, however, envelop the nucleic acid and make sure that the virus can attach to the host cell and invade it. These proteins are, as I mentioned, very sensitive to UV light, which is one of the reasons why UV light is such a potent disinfectant, if you like, for viruses in most cases: the UV light will break down the proteins that the virus needs to attach to and enter the cell. If these are broken, then the virus is unable to attack the cells. So thank God that these bonds are relatively weak. Right, so now we know a little bit about amino acids and how they form peptides. And obviously, if you make a peptide long enough, we'll call it a protein. So that's the only distinction. Polypeptide and protein are kind of the same thing, and a peptide is a short version of a protein. In fact, some peptides arguably are proteins because they are synthesized from a gene; they're just not very long. But it's a purely semantic distinction, if you like, between a peptide and a protein, both of which are amino acid polymers. Now let's have a look at mass spectrometers, how they are built and why they're so interesting for analyzing amino acids. So a mass spectrometer in a generalized form consists of three main parts. The first part is an ion source. The second part is a mass analyzer, and this is the real mass spectrometer if you like; the ion source is really an auxiliary system. And then finally there's a detector, which is also kind of an auxiliary system. Obviously you need an ion source to fuel the mass analyzer with sample, and you need a detector to then record the result. So all together they form the actual mass spectrometer. I've also highlighted here the sample, because of course something needs to be pushed through the mass spectrometer, otherwise it's an idle instrument. And then finally I mention here a digitizer; I'll talk more about that in a second. That is more of a historic note. Throughout the presentation, whenever I talk about the mass spectrometer, and it won't be that often, I tend to use the same colors for everything. If you look at my mass spectrometry basics presentation, I will actually go into much more detail on how these things work, and there again the same colors will be used throughout. Now the idea here is that these three parts together will record a mass spectrum. It will give you a set of masses that have been measured, as well as the intensity for each of these signals. Now why is the digitizer here? The digitizer is here because in fact what happens is that the mass spectrometer records a continuous signal.
These detectors are effectively current measurement devices, built around electron multipliers, which we will briefly discuss. An incoming ion, which originally comes from the sample, was created by the ion source, and was analyzed by whatever this thing in the middle is, the mass analyzer, then hits the detector, which is an electron multiplier. The detector feels this impact from a charged particle and will then start generating more electricity from that impact, using of course an external power source. So that is being measured by a current measurement device, an ammeter essentially, which is continuously recording the signal. So it's kind of a continuous curve coming off the instrument. But that's not really something we can work with. So we have a digitizer, which cuts the signal into very small sections. So fundamentally it is turning an analog signal into a digital signal. Now, as you can imagine, the frequency with which you sample, the rate at which you cut up the signal, determines how accurately you reproduce the signal. So if you have a smooth, curved signal and you cut it into three sections, one just before, one in the middle and one just after, the best you can do is reproduce it afterwards from these three data points as a triangle, which is a very poor approximation of the real signal. Now imagine you sample faster, say five times the sampling rate: then you would have one point before, one at say 20%, one at 40%, at 60%, at 80%, and one after, and now if you retrace the original signal from these points you get a much better approximation of what that signal looked like, which, as you can imagine, is quite important if you want to measure the signal accurately. So what happened in the old days, when computer chips were still speeding up quite dramatically from year to year, is that you could upgrade your instrument by upgrading your digitizer, because the signal from the detector would not change, but the frequency at which we could sample that incoming signal would shoot up, and that would give us far more fine-grained control of that data, far more fine-grained recording of that data. Now of course, as you may know, digitizers have kind of leveled out. It's difficult to build a faster digitizer, especially since these digitizers are now so fast that the detector is actually the limiting step. So the resolution of the detector is now worse than the resolution of the digitizer. But it's interesting because this instrument fundamentally relies on an analog-to-digital conversion at the end, and this analog-to-digital conversion is key in transmitting the signal down to the downstream analysis. Right, so we'll talk more about these parts in a bit, and the first thing I want to talk about is the first ion source, which is the MALDI ion source. MALDI is short for, as you can see here at the bottom, matrix-assisted laser desorption and ionization, and it very accurately describes the whole process. So what happens in the MALDI source is that we take the analyte, put it on a metal plate which we call the target, and then add another chemical molecule called a matrix, and this matrix molecule is an acidic small molecule that is very good at absorbing UV light.
Mind you, we do all of this in very high vacuum, and all mass spectrometers will always operate at very high vacuum. Then what will happen is that we use a UV laser to shoot a laser beam onto the target, and of course the matrix molecule, as I just mentioned, absorbs this UV light and starts vibrating violently because of the strength of the energy transfer from that laser source. The net result is an explosion. Quite literally, the laser blows the sample, together with the matrix molecules of course, off the plate in a mushroom cloud, and so you can see why we call this matrix-assisted laser desorption, desorption being a fancy word for blowing it off the plate; it's because of the laser, and the fact that the matrix molecule absorbs the laser energy, that we obtain desorption. So that's the first part of the name. And that desorption process now gives us a cloud of acidic matrix molecules along with our sample molecules, in our case of course peptides. Now, interestingly, I mentioned that these matrix molecules are acidic molecules; that means that they are proton donors, and what we will see is that one of these acidic matrix molecules will transfer its proton onto the analyte, and that process is ionization. So now we have a positively charged peptide in the gas phase under high vacuum, and so this is the last bit, the ionization phase, and the matrix is actually doing both, which is of course why matrix-assisted works for laser desorption as well as ionization. Now if you work with a peptide, which, as we remember, we defined as a short stretch of amino acids, say up to 20 amino acids or so, you will only get a single charge transferred. We don't actually know how this works in a lot of detail, because this is very strange chemistry: these are high-energy, high-vacuum gas-phase molecules that interact in bizarre ways, but fundamentally we expect to see only a single charge. However, if you use entire proteins for your sample molecules, if every one of these blue balls is an intact protein, then you can actually transfer multiple protons, and then you can have multiple charges on a single analyte. But for peptides, which is what we usually analyze, it's a single charge. That will become important later on. This technology, by the way, made it so that Koichi Tanaka received the Nobel Prize in chemistry in 2002, because this process was deemed so important and fundamental. There's another source which is actually used far more often these days, I'll explain in a second why, which is called electrospray ionization, or ESI for short. So electrospray ionization: again, the name explains pretty well what happens. It's an electric-current-based spray mechanism which leads to the entry of the analyte molecule as an ion into the gas phase. So let me illustrate this. First, where does the electrical field come from? There is a voltage difference between the intake of the mass spectrometer, right here, and the needle, which is just a spray needle. From this intake onwards, powerful pumps make sure that there's a high vacuum. The needle of course has to have a conductive surface, because we want to have a voltage differential, and it's quite a high voltage differential: three to five thousand volts. You can also see that this is at a right angle, so the mass spec inlet is at right angles to the needle. That need not necessarily be the case, but these days it very often is, and I'll show you in a second why that makes sense. And then here there's just a barrier, and that's just for sanitary purposes, as you will see.
So what happens is we actually push our sample through this needle. You may have a buffering flow of nitrogen on the side; nitrogen, because it is an inert gas, will not affect the chemistry of the molecules. And the reason why this is sometimes done is to ensure a laminar flow inside the needle, but it's not crucial in this particular setup; you may encounter it, though. Now, what happens when you push liquid through the very tight nozzle of a needle? Everybody who has ever sprayed water with a garden hose by squeezing the end of the open hose knows what happens: you get a spray. The fancy word is of course nebulization. You can see here a picture of what this looks like in real life. This is the needle tip, and here at the small tip of the needle the liquid, because of the constant flux, actually obtains a much higher velocity. That velocity is kinetic energy, and that kinetic energy helps overcome surface tension, and you can get very small droplets in a plume formation. So I've tried to mimic that here. What we have now is our sample in very small droplets. Now, what you can see is that I've made some of these droplets neutral and some of these positive. So why are some of these positive? Actually, the sample is dissolved in a bit of water and acetonitrile, because in the sample solvent we want to create a slightly hydrophobic condition: all amino acids, and especially peptides, have a little bit of a hydrophobic nature. Not a very strongly expressed hydrophobic nature, but just slightly. So dissolving them in pure water is never a good idea; peptides do not really like to dissolve in pure water. It helps if you add a bit of acetonitrile to make the mix a little bit more hydrophobic. But we also add a very small amount of formic acid, and the formic acid is there as a proton donor. So remember, the matrix molecule did that for MALDI, in the gas phase; here we actually do it before we do any nebulization. We actually already have charged peptides in the solvent, because we have lowered the pH so that we're absolutely sure we can get these peptides ionized. Now, what happens here is that there's an electrical field, but these neutral droplets don't notice the electrical field. If you don't have a charge, this electrical field is invisible to you. And these droplets, they start to evaporate, because this needle tends to be at a temperature of 60 to 100 degrees, so it tends to be pretty hot. And then the evaporation makes the droplets smaller and smaller, and they just continue through inertia to fly forward until they hit the barrier, and that's game over for them. However, if you are a positively charged droplet, you actually see this electrical field, and in fact you will respond to it, because you will want to go towards the negative pole of that electrical system, which is of course the mass spec inlet, and you will start turning. And as you can see, what happens here is we split the unwanted droplets from the droplets we really want to see, which is exactly why we would put this inlet at right angles. Now, what you can see here is that evaporation plays its role as well and the droplets become smaller, but then, as droplet evaporation progresses, at some point we will achieve charge-driven fission. What's effectively happening is that all the positive charges in the droplet, because of course there are multiple of these positively charged peptides in the droplet, are being squeezed together tighter and tighter and tighter as time progresses.
And at some point the positive charges will start to repel each other. And of course, when this repelling force is strong enough to overcome surface tension, the droplet splits into two new droplets, each with roughly half the charge. So that goes on for a while, and it actually allows these droplets to shrink in size faster than the neutral droplets, which are only subject to evaporative influences; here we have a combination of evaporative and charge-repelling influences. Now, at some point the droplet becomes too small to harbor multiple charges, and what happens then is ion expulsion, which means that a charged molecule is simply thrown out of the droplet and now becomes a charged molecule in the gas phase, sucked into the mass spec inlet and into the high vacuum. So just like with MALDI, we end up with a charged analyte molecule in the gas phase in a high vacuum, which is exactly what we want. So the ion sources do exactly what it says on the tin: they create ions of our sample molecules and in the process transfer them into the gas phase, into the mass spectrometer, under high vacuum. Maybe something relevant here: compared to MALDI there is one giant difference, namely that most peptides, which is what we are analyzing in proteomics, tend to have multiple charges. So two plus, three plus, four plus; five plus and six plus are also not uncommon but are rarely identified, for numerous reasons; we'll come back to that. Now, I said earlier on that electrospray ionization has actually become much more popular than matrix-assisted laser desorption ionization. Why is that? Well, this process is a continuous process, and you can easily feed the flow of sample through this needle from an autosampler, which can easily take a bunch of vials or a 96-well plate, and have a robot move from one position on the sampling plate to the next and continuously inject sample. So this is perfectly automatable, it's very, very easy to automate, which means that we can run these instruments 24/7. This is not the case for MALDI where, while we can move the laser across say 96 spots on the target plate, once the 96 spots have been analyzed you have to physically remove the plate from the vacuum, which means that you have to switch the vacuum off for a while, take the plate out, move a new plate into position, restore the vacuum, and then shoot with the laser again. That process is much, much harder to automate, and much more error prone once it has been automated. As a result, ion sources have tended to evolve towards ESI sources. And then finally, another thing we do with peptides, which is not discussed here in very much detail, is that we separate them beforehand on liquid chromatography systems, which allows us to separate different peptides in time. If you want to know a little bit more about this, definitely check out that mass spec basics course, which goes into a little bit more detail as to how this process works. But fundamentally, the LC system automatically creates a continuous flow of sample, which meshes perfectly well with this spraying-needle system. So that is why electrospray sources are much more popular these days. They interface directly with the upstream machinery, and secondly, they actually allow the system to run 24/7, which is good because you get more sample analyzed. John Fenn got the 2002 Nobel Prize, which he shared with Koichi Tanaka, for demonstrating ESI ionization of proteins and peptides.
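One practical consequence of those multiple charges is worth spelling out: the instrument reports mass over charge, so one and the same peptide shows up at different m/z values depending on how many protons it picked up in the ESI source. Here is a minimal sketch of that relationship; the proton mass is the standard value, and the 2,000 dalton peptide is simply an assumed example.

```python
# Sketch: the m/z value at which a peptide of neutral mass M appears when it
# carries z extra protons, as is typical after electrospray ionization.

PROTON = 1.007276  # mass of a proton, in dalton

def mz(neutral_mass, charge):
    """m/z of a peptide that has picked up `charge` protons."""
    return (neutral_mass + charge * PROTON) / charge

# A hypothetical 2,000 Da peptide observed at several ESI charge states.
for z in (1, 2, 3, 4):
    print(f"{z}+ : m/z = {mz(2000.0, z):.4f}")
```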
Just as an aside, there are many different other ionization techniques, but they are never used for peptides or proteins, because these other ionization techniques are called harsh or hard ionization techniques, and they quite literally blow up a sample of the complexity of a peptide or a protein. They inject so much energy that the peptide bonds, which remember are very fragile bonds, would just fly apart. So MALDI and ESI, despite looking like pretty tough conditions for molecules, are actually known as soft ionization protocols, because they allow a very fragile macromolecule, if you like, such as a peptide or a protein, to stay in one piece throughout the entire process. Right, so now we have our ions in the gas phase. The next thing we want to do is analyze these analytes. Now, mass analyzers are described in a lot more detail and in a lot more variety in the mass spec basics course, so I won't go into too much detail here, but suffice it to say that we are measuring masses and we're using an advanced type of scales, if you like. Now, we're not measuring single molecules, or rather single ions, because remember all our molecules are ionized; the problem is that we cannot measure a single ion's mass, it's far too small. So what we are doing is we're actually seeing a whole population of the same molecule, so say 5,000 or 50,000 copies of the same peptide, and they're all flying through the instrument. As you can imagine, there's a stochasticity to the way we measure the resulting mass. And so what you get is pseudo bell curves for each of these measurements. And now we get in trouble, because if two things have quite similar but not exactly the same mass, it becomes important to know what the spread on that stochasticity is, to see whether or not these will overlap and whether or not we will be able to separate them as two different signals. So a lot of effort has been put into mass spectrometers to make that stochasticity, that variance if you like, on that measurement as small as possible. And there are many technical ways in which this can be done. Again, this is explained in a lot of detail in the mass spec basics course, but fundamentally, if you run the same peptide through a very advanced instrument you will see something like this, and if you run it through a very simple instrument you will see something like this. It's the same thing, only here we cannot separate the different isotopes, whereas here we can separate the different isotopes and you see different peaks at different masses for the same peptide analyte, whereas here you just see one giant blob and you cannot tell them apart. If you don't know what isotopes are, I'll tell you in a second, but for the moment please accept that one peptide will give you multiple of these peaks. Now, why is it important that we can tell the difference between these peaks? Well, to get an accurate mass reading, because if you have to put a mass on this blob, then you will have to somehow find the center of this blob and call that the average mass. But as you can imagine, and you see the spikes here everywhere, small shifts in these spikes can actually shift the mass, dramatically enough to create quite some uncertainty on the actual thing that you've measured. Whereas here we're actually using a far more narrowly defined peak. So it is a much higher resolution, which is the technical term of course, and we can therefore assign this mass with far greater precision.
We call this the monoisotopic mass, because we use a single isotopic peak for it, and by convention we use the lightest isotope. That comes with a few problems; we'll come back to that in a second. So what are these isotopes I'm speaking of? Well, you know that every chemical element consists of different isotopes. So for instance carbon: most people have heard of carbon 14, which is a radioactive isotope of carbon that is used in carbon dating. It has a certain decay rate, and based on that decay rate and the number of carbon 14 atoms left in a particular sample, we can estimate when it was made. Something similar happens here, but it's not carbon 14, because carbon 14 is actually very rare. There isn't much carbon 14 out there, or we would all be very radioactive, and that would be a problem, because don't forget that most of us consist of a whole bunch of carbon. There's another carbon isotope, carbon 13. Remember, 12 is the basic isotope, that's the normal carbon, that's 99% of all the carbon in our nook of the universe, but carbon 13 makes up roughly 1% of the carbon in our part of the universe. So it's quite abundant. And roughly one in a hundred carbons is that carbon 13, which, with one additional neutron in the nucleus, will actually weigh one dalton more than the carbon 12. So for a peptide, if it contains only carbon 12, you end up at this mass; if it contains one carbon 13, you end up at this peak, one dalton higher. What is then the chance of putting in two carbon 13s? And of course that's a multiplicative series, right? So very quickly it goes down, it has an exponential decay, and you can see that very nicely here. Now, why do I make such a fuss about this? It is because, if we know that the difference between this peak and this peak is one neutron, that's a mass of one dalton, we can actually look at the distance between these two peaks to figure out what charge the original molecule had. Why is that? Because, and again this is explained in more detail in mass spec basics, the mass analyzer uses electrical fields to measure the mass. But electrical fields affect molecules with different charges differently. Very briefly, if an ion carries double the charge, the instrument will read that mass as if it were only half the original mass, because usually we use acceleration of some form, and double the charge means you get double the accelerating force, which makes it look as if you are lighter, because you go faster. So we never really measure mass in these instruments. We measure mass over charge. And when we measure mass over charge, well, mass is an unknown, but so is charge. And we really want to know what the charge is, because once we know the charge, we can resolve the mass. And here we have that opportunity, because if this peak is at a certain m over z, say (m/z) zero, then this one will be at (m/z) zero plus one, because it has one additional dalton. However, that is only correct if this is a singly charged peptide. If it is a doubly charged ion, then this is m zero divided by two, and this is m zero plus one divided by two, so the extra dalton only shifts the peak by half. So the distance between these two peaks, instead of being one for a singly charged ion, will become 0.5 for a doubly charged ion; for a triply charged peptide this will become 0.33, and for a quadruply charged peptide this will become 0.25. And so that distinction allows us to read, from the distance between the isotopic peaks, what the original charge was of that particular peptide. So not only does this isotopic envelope, this resolution of the isotopic envelope, give us a better mass estimate. It also gives us charge information.
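Two small calculations can make this isotope reasoning concrete: reading the charge state back from the spacing of the isotope peaks, and the binomial argument for why heavier peptides have ever bigger carbon 13 peaks. The 1.0034 dalton spacing is the mass difference between carbon 13 and carbon 12, and 1.07% is the standard natural abundance of carbon 13; the carbon count in the example follows the rough back-of-the-envelope figures used in this lecture, so treat it as an assumption.

```python
# Sketch: (1) infer the charge state from the m/z spacing of isotope peaks, and
# (2) the binomial reasoning behind the growing carbon-13 peaks of heavy peptides.
from math import comb

C13_C12_DELTA = 1.0034   # mass difference between 13C and 12C, in dalton
C13_ABUNDANCE = 0.0107   # natural fraction of carbon that is 13C

def charge_from_spacing(mz_spacing):
    """Isotope peaks of a z-charged ion are roughly 1.0034/z apart in m/z."""
    return round(C13_C12_DELTA / mz_spacing)

print(charge_from_spacing(1.0034))  # spacing of ~1    -> charge 1
print(charge_from_spacing(0.5017))  # spacing of ~0.5  -> charge 2
print(charge_from_spacing(0.3345))  # spacing of ~0.33 -> charge 3

def isotope_envelope(n_carbons, max_peaks=4):
    """Probability of containing exactly 0, 1, 2, ... carbon-13 atoms (carbon only)."""
    p = C13_ABUNDANCE
    return [comb(n_carbons, k) * p**k * (1 - p)**(n_carbons - k)
            for k in range(max_peaks)]

# Rough assumption from the lecture: ~20 residues at ~8 carbons each, so ~160
# carbons; the all-12C (monoisotopic) peak is then no longer the most intense one.
print([round(x, 3) for x in isotope_envelope(160)])
```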
So nowadays, every single mass spec used in proteomics is capable of resolving these isotopes very nicely, because it is fundamental information, especially when you're working with these very convenient ESI sources, because you will remember that the ESI source gives higher charges to these particular peptide analytes, so two plus, three plus, four plus, and that creates a problem because now we need to find out a posteriori what the charge was. In MALDI, that problem is far less important, because in MALDI everything has a single charge, so you can just assume single charge. But you know, ESI has other advantages, so we needed to resolve that. A final thing I can say is: why is the carbon 13 peak here at a higher intensity than the carbon 12? Well, this is due to the fact that this is probably a very heavy peptide. And in fact, a mass spectrometrist with a bit of experience could put a mass on this from this pattern, and it would be pretty accurate. I'm estimating this one at 2,200, maybe 2,300 daltons from the looks of it. Why is that? If an average amino acid weighs about 100 to 110 daltons, depending a little bit on how you count it, and if you have 2,000 daltons, that means that you have some 20-odd amino acids in that chain. When you have 20 amino acids, and you say that on average an amino acid contains eight carbons, then you have a whole bunch of carbons in there. So that's 160 carbons. Now, what is the probability, if 1% of your carbons are carbon 13, that you have at least one carbon 13 in there? Well, that probability is pretty high. And so as a result, the probability of finding at least one carbon 13 is actually higher than the probability of never finding a carbon 13, because you have so many samples, so many carbon atoms in your peptide, that the probability of always ignoring the very few carbon 13s when you're picking your carbons is very, very low. And at about 1,800 daltons, we see that these two peaks get to equal heights, to equal intensities. And from 2,000 daltons onwards, you will see the carbon 13 peak go up compared to the carbon 12. And at about 3,500 daltons, this carbon 12 peak is essentially gone, because the probability of having such a big peptide with no carbon 13 whatsoever becomes effectively zero. And that creates a problem, because then we're mismatching the mass. We're taking the monoisotopic mass from the first peak that we can see, that we have recorded, and that first peak is now the C13 peak. So it's not actually the monoisotopic mass, it's that mass plus one neutron. So that creates some problems; later on in the practical, we will see how that is dealt with. Right, that's all I wanted to say about resolution. Now let's briefly say something about detectors, because this becomes important if you want to talk about quantification. So fundamentally, the principle is this. It's an electron multiplier that uses an amplification system that is rather straightforward. A single ion hits an electrode, and it's attracted to that electrode because it's at a slightly higher voltage than the surrounding, or the previous, electrode. And so you can easily attract that ion. The ion will impact this particular electrode and will then set loose quite a few other electrons because of the impact. So the kinetic energy of the impact ejects a few electrons from this plate. Now we have a second plate at an even higher voltage, a little bit further away. And so these electrons will now see that 20 volt differential between these two plates and head over to this plate.
They acquire kinetic energy in the process and again impact the secondary plate. Now, each of these impacts will set free another set of electrons, which means that we get a multiplication of the signal, and then this plate has a tertiary plate behind it, with again the same voltage differential, and so forth and so forth and so forth. And what you actually get is more or less a runaway cascade, where you get an exponential increase in the number of electrons moving between plates. Now, this means that you can detect a single ion, in theory. In practice, there's a baseline: you need a certain number of incoming ions before the whole system actually kicks in and gives you a signal coming out again. These plates are known as dynodes (you will also hear the name Faraday cup, for a related, simpler detector). Plates like these are not being built in that way anymore; nowadays it's all microelectronics with nano cavities, very complex stuff, but the fundamental physics are very similar still today. One thing that is worth noting is that over time these plates can get depleted. If you keep kicking electrons out, at some point you have nothing left. And so these detectors are prone to signal degradation over time. And at some point the signal gets so weak that the sensitivity is lost, and then you have to replace the detector, which means the sensitivity shoots up again. So if you're ever looking at longitudinal data and somebody replaced the detector in the mass spectrometer in between, you're going to have to do some nice normalization of your data, because of course the signal on the first samples will go down continuously until it hits essentially rock bottom, analytically speaking, and then they plug in a new detector and the signals all shoot up again, and then they will start going down again. So it's very important to take these kinds of things into account if you do long-term longitudinal sampling and your mass spectrometer is serviced in the meantime. Now, the primary principle behind quantification in mass spectrometry is that we can use this output signal as a means to quantify the presence of a particular peptide between, say, two samples. Now, mind you, I said that very specifically: we can quantify the relative abundance of a specific peptide. The reason for this is that it is not easy to correlate the output of a detector to the input, the number of ions. We'll come back to that, but keep that in mind. Fundamentally, what we need to do is compare intensities of signals, as we can do here, where I say this signal is at half of that signal, these are roughly equal, and of course keep in mind there is some stochasticity here, and then here it's a two over one ratio between this signal and that signal. What we need to do, if this is our patient and this is our control, is to somehow make these samples distinguishable, and you can already see what we did here. We actually moved one of the samples in the mass space. So we made these peptides heavier, which is something we can do chemically: we can add in some heavy isotopes and then move these things away. We don't have time to go into all of this, but this is essentially what is usually being done in mass spectrometry: you either move the analytes in mass by introducing some kind of chemical label, which you have in a light and a heavy form, so this would be the light form and this would be the heavy form, or you can separate them in time.
And that would mean that we run this sample on one day, or at one moment, and then later on run this sample, but then we have to somehow align them a posteriori. We have to find a way to say: this peptide that we saw on Monday at this moment in time corresponds to this peptide, which we saw on Wednesday at a slightly different moment in time, because of course nothing is ever perfectly the same. And so you somehow have to merge these signals together again. But that is what is very frequently done. That is known as label free, because you don't need to introduce a chemical label, but you now have to run your samples separately, and so they become separated in time. So that's essentially what these two lines give you: introduce a mass difference, that's labeling; perform distinct experimental runs, that's label free. And so the idea, as I said, is to measure the intensity of the signal for each analyte in each sample, either together in the same workflow, like it is shown here, or in two separate workflows, and then statistically process the accumulated information, which is something that Professor Clement will of course give you a lot more background on. Now, I mentioned earlier that we have to be careful to compare intensities for the same specific peptide. Why is that? Well, if we take a protein and we cut that protein into different peptides and then analyze these on the mass spectrometer, the problem that we have is that the intrinsic properties of each peptide, and remember these can be very different because the amino acids can be very different, mean that this one, for instance, acquires a charge really easily and therefore, as we call it, flies really well in the mass spectrometer. So it gives a big signal, because if you put 10,000 of these peptides into the machine, into the ion source, let's say 8,000 of them become ionized, so these 8,000 all give a signal. Now, we know that these are all from the same protein. That means that there are also 10,000 of these, right? Because there must be equal amounts of protein and therefore equal amounts of peptide. Now, if there are 10,000 of these peptides but they don't ionize very well, maybe they only yield say three and a half thousand ions, and the other ones never ionize and are lost, then we end up with a much smaller signal, even though it's the same original protein. So correlating this measurement, or this measurement, to the amount of protein present is extremely difficult, because in order to know that, you would have to know how much of this peptide you would expect to see back. And that's not something that is easily done. It also means that we cannot compare the signal intensity for this peptide, which is present in exactly the same amount as this peptide in this sample, because of course the signals are very different due to these other parameters. And as you can see, each of these peptides can have either a poor signal or a high signal, depending on what the nature of that peptide is. And since we can't predict that a priori, which is a very complicated thing that also depends on the circumstances in which this peptide finds itself in the mass spectrometer, it is always of paramount importance to compare a peptide with itself, which is exactly what we do here, because we have one peptide, and then we have the same peptide, but we made it heavier, okay?
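As a tiny illustration of that principle, here is a sketch of how such within-peptide comparisons are made once the intensities have been read out: every ratio is computed between the light and the heavy form of one and the same peptide, never between different peptides. The peptide sequences and intensity values are entirely made up for illustration.

```python
# Sketch: relative quantification always compares a peptide with itself.
# Hypothetical intensities for the light and heavy form of the same peptides.
import math

intensities = {
    # peptide sequence: (light intensity, heavy intensity) -- made-up numbers
    "ELVISK":   (2.0e6, 1.0e6),
    "LIVESR":   (4.5e5, 4.4e5),
    "TRAGEDYK": (8.0e4, 1.6e5),
}

for peptide, (light, heavy) in intensities.items():
    ratio = math.log2(light / heavy)   # within-peptide ratio: this is meaningful
    print(f"{peptide}: log2(light/heavy) = {ratio:+.2f}")

# Comparing the raw intensity of ELVISK with that of LIVESR, by contrast, says
# nothing about their relative amounts, because each peptide ionizes differently.
```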
Now, if you run parallel analysis runs instead, you identify a peptide in one run, you identify the same peptide in the other run, and then you look up the intensity for this peptide and the intensity for that peptide; since you know it's the same peptide, you can now compare these intensities. Of course, taking into account things like batch normalization and so forth; again, Professor Clement will tell you a lot about that. But fundamentally the message is: you always have to compare peptide ion intensities for the same peptide, not across peptides; that does not work. Right, here's how the detector behaves. This is not specific to mass spectrometry: every detector ever built, including the detectors built into the senses of your body, will have the same kind of response curve. If a very low signal comes in, you tend to get no response from the detector at first. Your detector is only activated once you reach a certain threshold, once enough signal reaches it. So it's not enough to have a single quantum of signal; you need many quanta before your detector responds. That's the flat part of the curve, a leveling off at zero, and then at some point you reach the threshold, the limit of detection. As you hit the limit of detection, your detector starts sending output signals; it says, hey, something is here, I just noticed something. And so that flat curve starts to slope up. Then you hit the sweet spot, which is the area of quantification, starting at the lower limit of quantification. That means you have a linear response: twice the input gives you twice the output. If you go twice as far on the x-axis, you get twice as high on the y-axis. That is the sweet spot for quantification, because now we can extrapolate: if the signal doubled, the amount in the sample doubled. That's this line here on the plot, and this is real-life data. You can see more or less a flat line here, then you pass the limit of detection, which goes into the limit of quantification, and that linear behavior lasts for a while before your detector reaches saturation. Saturation means you put more input signal in, but the detector simply cannot output more than it already does. The output starts to level off again because the detector cannot give you a louder, stronger signal; it's at maximum capacity. And that means that here we level off again. The resulting curve is a very specific sigmoid, which is specific to the detector. It has very little to do with the sample and everything to do with the detector. Every detector ever built by nature or by man follows this sigmoidal curve, always, okay? And so with a mass spectrometer you can see here what happens. This is a log2 scale, which means we're here at about one over eight to one over ten. So this is one over ten and this is ten over one. You see that we have already more or less reached the lower and upper limits of quantification, where the behavior is linear, and then we lose that as we go further out. We can zoom out on that. What is shown now is not the actual quantification but the error bars, right? These are essentially the error bars on each measurement, which become larger here because the signal-to-noise gets worse, and larger there because saturation becomes an issue.
And what you can see is that around one over ten to ten over one is the sweet spot where the error is under control. Outside of that, the error goes sky high. What this means is that when you do quantification, you have heterogeneity of variance: the variance is not constant throughout the entire measurement range. In fact, the variance goes completely off the scale as soon as you get outside the range from the lower to the upper limit of quantification, okay? So these are very important things. The limit of detection: from what minimal amount do I start getting signal? And up to what maximal amount can my detector give me a reasonable output? And then you have the lower and upper limits of quantification, which show you where the linear behavior is, which is also usually where the error is under control. Now, this is a very extreme example. You can very nicely see a very flattened sigmoid. It's very flat, but it's a sigmoid nonetheless, where here you are at the low level, here you hit the sweet spot for quantification, roughly here, which is one over ten, and then you can go all the way up to about ten, maybe fifteen, and then it starts leveling off again. Notice this goes from one over a hundred up to a thousand over one. So this is five orders of magnitude, from ten to the minus second to ten to the third. And as you can see, you can quantify, but if you really want linear quantification, twice the input giving twice the output, you need to be here in the middle section, which gives you a smaller range of quantification. Here you can still do quantification, but it becomes a little more imprecise. So effectively here you can say there is a whole lot less in one sample than in the other, but whether that whole lot less is one in a hundred, one in fifty or one in thirty is very difficult to assess, because effectively they look kind of the same. You would not be able to differentiate a signal here from a signal there, because everything has leveled off. Mind you, usually a biologist doesn't really care about that, because if you can say the difference is one in ten or more extreme, that is already an extremely strong signal. Most biology actually plays out around one over one, two over one, one over two. So you tend to be kind of here in biology, very fine-grained changes; these massive changes are not frequently encountered, and most of the time they're already known because they're much more easily detectable than with a fancy machine like a mass spectrometer. So is this a fundamentally big problem in mass spectrometry? No. What is important is this area; that's where you will find most of your changes. And again, you will see when you process some real-life data that you practically never see huge, huge effects. You always see subtle effects, because biology tends to be quite subtle. And again, every detector you will ever work with behaves this way: sigmoid response curves. Okay. Now you also have to take into account that we have this digitizer, the analog-to-digital converter; remember I talked about that. That means our signal is not perfectly approximated, and it depends on the quality of the mass spectrometer how well we approximate the signal. Once we have the signal digitized, we essentially have a bunch of dots, and we have to reconstruct some kind of model, if you like, of what that curve originally looked like. And so what we do is we fit a curve.
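As a small aside, here is a toy sketch in Python of that sigmoidal detector response; the logistic shape and all parameter values are arbitrary assumptions, purely to illustrate the flat region below the limit of detection, the roughly proportional middle, and saturation at the top.

```python
from math import exp

# Toy detector response curve (arbitrary logistic parameters, not a real calibration):
# flat below the limit of detection, roughly proportional in the middle (the
# quantification sweet spot), and saturating at high input.

def detector_response(amount, capacity=1000.0, midpoint=100.0, steepness=0.05):
    """Toy logistic response: output signal as a function of input amount."""
    return capacity / (1.0 + exp(-steepness * (amount - midpoint)))

# Doubling the input only (roughly) doubles the output in the middle of the curve:
for a in (10.0, 100.0, 500.0):
    print(f"input {a:6.0f} -> {detector_response(a):7.1f}   "
          f"input {2 * a:6.0f} -> {detector_response(2 * a):7.1f}")
```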
But fitting a curve to a somewhat imprecise data set is very tricky, especially since there's noise at play as well; especially at the lower intensities, noise will bump up the intensity a little too much and make this job hard. So what we do is we try to fit a variety of shapes, but in the total surface area covered we can easily make about a 10% error. This is technical error. So a lot of technical error comes from the way in which we fit curves to the signal coming from the machine. While the machine may be analytically very reliable, we still have to interpret the signal, and there too you can make quite a substantial error. So always keep in mind: there's going to be measurement error, and it's going to be a compound of errors in the instrument, or variation in the detector, and error in the software. That's all I will say about this; if you want to know more, there's a lot of discussion in this paper from 2010. There's a variety of ways in which you can quantify: you can use apexes, you can use Lorentzian curves, you can do quadratic fits, you can do a lot of different fits, and each has its own pros and cons. We will not worry too much about that in this course because it's very low-level signal processing, but if anybody ever goes deep into mass spectrometry data, it is something you will encounter, and you will see differences between the different algorithms that people have made available. One last thing to show you is what an actual mass spec signal looks like, because it has three dimensions, not two. It's not just m/z versus intensity. Remember that an electrospray source is a continuous thing, so it also has a time dimension, and that's the third dimension here. Over time, a signal will start to come off the column, and again this is a somewhat stochastic process. You start seeing a peak emerging and it goes up, up, up, up; then it reaches an apex, which is kind of the average, or rather the mode in this case, the mode of the signal, which is where you would expect the signal, and then it starts to wane again. And what you can see is that the rising side of the curve tends to be steeper than the declining side. So you get tailing, which is quite subtle but you can see it here as well, in realistic data. You have a sharp and sudden rise, you reach a peak which is reasonably well defined, and then you get a slight tailing at the end, which is why very often these peaks are approximated with Lorentzians, which allow for this tailing. It's a very typical physical phenomenon that things start eluting roughly at the same time but there are always trailing elements. Why do I show you this? To show you that extracting the signal is not just an area in two dimensions; it's a volume in three dimensions. So you really have to calculate the volume of these peaks, and you have to fit a model to a 3D peak, which is a little more convoluted. Seen from above, this is what it looks like, and now we have to show yet another thing; you have seen it here already: there are multiple of these peaks side by side. Why is that? This is in mass over charge. This is the C12, this is the C13 and this is two times the C13. So first isotope, second isotope, third isotope.
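To make the curve-fitting step above a bit more concrete before we come back to the isotope peaks, here is a minimal Python sketch that fits a Gaussian to a handful of noisy, digitized peak points and estimates the peak area; the peak shape, noise level and all numbers are illustrative assumptions, not what any particular software actually does.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy sketch of peak fitting: the digitizer gives us a handful of noisy (m/z, intensity)
# points, and we fit a simple Gaussian model to estimate the peak area. Real software
# may use Lorentzian or other shapes; everything here is illustrative only.

def gaussian(x, amplitude, center, width):
    return amplitude * np.exp(-0.5 * ((x - center) / width) ** 2)

rng = np.random.default_rng(42)
true_area, center, width = 1000.0, 500.0, 0.02
amplitude = true_area / (width * np.sqrt(2 * np.pi))      # amplitude consistent with that area
x = np.linspace(499.9, 500.1, 25)                         # digitized m/z points around the peak
y = gaussian(x, amplitude, center, width)
y = y + rng.normal(scale=0.05 * amplitude, size=x.size)   # detector / digitization noise

params, _ = curve_fit(gaussian, x, y, p0=[y.max(), 500.0, 0.05])
fit_area = params[0] * abs(params[2]) * np.sqrt(2 * np.pi)
print(f"true area {true_area:.0f}, fitted area {fit_area:.0f} "
      f"({100 * abs(fit_area - true_area) / true_area:.1f}% off)")
```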
And you can see that the isotopes which are less abundant, like the third isotope, remember we've got a roughly exponential decay in intensity, will start showing up a little later, because it takes longer for the small peak to breach the signal-to-noise barrier, and they will disappear a little faster, because they get subsumed into the noise more quickly than the more intense peaks. And so here they actually use a convex hull to map that particular aspect. So again, the signal processing is more involved than you would think at first. It is a very important aspect of the quantification, so I wanted you to know about it. Don't expect a full engineering-style signal processing tutorial here; if you want to know about this stuff, there's plenty of literature on it in the mass spec field. For you, it's important to know what's behind all of this, because once you start playing with the numbers, you need to understand what kind of errors might occur on the numbers and where they might come from, and also how different datasets may be processed differently and therefore show slightly different errors. Now, proteins are also tricky, because one of the reasons we study proteins is, for instance, to find biomarkers: marker proteins that show up in, say, early disease stages. One of the typical places to look for such proteins is in the blood. The reason is simple: when cells die, they tend to dump a lot of stuff into the blood, so blood is a place where we can find a lot of these things. Now, if you take blood from a patient and analyze it straight away, here is a bunch of peptides that you will see, simply peptides that are present in the blood. If you leave the sample on the counter for five hours, you will see that some of these start to shrink and others start to disappear entirely. After 24 hours, the signal has degraded quite dramatically, and after 48 hours practically nothing is left. After 72 hours, the signal is essentially gone. What is happening is that your blood is chock-a-block full of proteases, other proteins that eat proteins. The reason for this is that your blood is essentially a sewer: cells, when they die, drop all of their contents into the blood, and your body is prepared for that and always keeps cleaning crews ready in the blood, which digest proteins. So in your own body, every second of every day, protein is being eaten up and is disappearing from the sample, which means that this signal is highly variable. If you measure one of these peptides and it's leaking from somewhere in your body into the blood, then depending on a host of factors the signal can vary from moment to moment, because it may or may not have been digested already, okay? So it's very important to realize that there can be lots of variability here, because proteins are fragile, and proteins that are not inside cells tend to be degraded very quickly; your body doesn't want your blood to be clogged up with circulating stray proteins that don't belong there, okay? There's a host of biological reasons for that, but for you it's important to understand that this process happens. Another thing that can confound quantification is modifications. Modifications actually change the mass of a peptide.
If you have a peptide with a certain mass and you now put a phospho group on it, it weighs 80 daltons more, which is one phosphorus and three additional oxygens. The result of that additional weight is that the signal will shift. So now you have two populations of your peptide: the original molecules, which have not been modified, at the mass where you expect them, but also a few phosphorylated versions of the same molecule, which weigh 80 daltons more. Suppose you're missing these, you're not analyzing them, but some of the original molecules have disappeared to give rise to them. What effectively happens relative to the baseline situation is a decrease in one and an increase in the other. So this signal is diminished because these molecules moved to this side, and the problem is that you're not looking for those; you only measure this one. So you're misjudging the amount of peptide present, because some of it disappeared in a way you did not anticipate. And just to illustrate how complex this problem can be, this is some actual data that we processed. These are very common and widely abundant proteins which form part of glycolysis, a key biological process in nearly all living cells. We actually found that each of these can carry over a hundred different modifications. A lot of these modifications have very little to do with biology but everything to do with sample processing: things we introduce inadvertently when we mess with proteins prior to putting them on the mass spectrometer. Or they are things that happen in the cell, like glycations and dioxidations and dehydrations and oxidations. These kinds of things can happen to these proteins, and if you don't take them into account they will lead to variability in the signal, because these happenstance occurrences are very stochastic in nature. So again: not only is every signal prone to error, not only is there error that depends on how the detector responds and what the ratios are, there are also going to be different errors for different peptides in the same protein and different errors between different proteins. There are errors everywhere. And again, Professor Clement will teach you how to deal with this when you build statistical models to evaluate whether or not a particular protein has changed its expression level. Right, so now you know a lot about the background of this mass spectrometry stuff and how knowing about it can influence the way you work with the data. What I will do now is walk you through how we take the data that we have recorded and turn it into identifications. For that, let's have a look at the MS/MS spectra that come off the instruments. So MS/MS spectra: first, what are these things? Remember that we had a mass spectrometer with an ion source, then a mass analyzer and a detector. That gives us an intact peptide measurement. But an intact peptide measurement is not all that useful. It's essentially the same as saying: all the people in this class will stand on a scale, we will write their whole body weights down, and afterwards we will blindly be given a body weight measurement and have to assign it to somebody in this class.
As you can imagine, given the rather imperfect scales we typically use and the fact that most of you are roughly the same age, height and build, we will probably get a few people that weigh essentially the same. So it becomes very difficult to tell two people apart, right? Now how could we fix that? Well, we could do something very morbid: instead of measuring people, let's measure the bits. We chop you into a head, a torso, two arms and two legs. So we've got your whole body weight from before we chopped you, and we've got your head weight, your torso weight, your left arm, right arm, left leg and right leg weights. If we do that, there's a high probability we'll be able to make the distinction, because even if two people have the same body weight, it is unlikely that their heads, arms, legs and torsos will all have the same weights as well. And this is fundamentally what we're going to do: we're going to break intact peptides into smaller bits, measure the bits, and then use this much more specific information to find out the identity of these analytes. Now, how do we do the breaking? Well, we use a source again; we need ions. The intact peptides fly through, and here my morbid analogy actually still works. Imagine I let everybody into the room and then start chopping everybody up. I would have a tough time telling whose legs belong to which torsos, which arms belong to which legs or torsos, and which heads belong to which torsos. It would become one giant mess. It's much easier if I put a bouncer at the door and let you in one by one, because then when I chop somebody up, I can actually match the arms, the legs, the torso and the head, because there's only one person in the room. And that's what this ion selector is for. It's essentially a bouncer which will let only one set of ions with the same mass over charge through; everything else gets thrown out. If you want to know how these things work: the mass spec basics lectures, I know I keep saying this ad nauseam, but for those of you who are curious, everything is explained there. The ions that make it through then enter our little chamber of horrors, which is called the fragmentation chamber, where we blow these peptides to bits. Again, if you want to know how, look at the mass spec basics section. Then a fragment mass analyzer measures the masses of the fragments that come out of this chamber, the bits: the arms, the legs, the head, the torso. And of course there's a detector involved, because ultimately you need to record a signal. We call this tandem mass spectrometry because this ion selector tends to be a mass spectrometer and the fragment analyzer tends to be a mass spectrometer: MS, fragmentation, MS, hence tandem MS. Now, what happens is this. This is our peptide backbone, remember: amino terminus, carboxyl terminus, amide bonds, the side chain of each residue and the alpha carbon here each time. We can break here, which, if I continue the morbid example where we are amino acids and we make a peptide by all holding hands, would be cutting off at my right shoulder.
If I am the first amino acid in the chain and this is my free amino group, then cutting at my right shoulder means you end up with my carboxyl group, essentially my right arm, cut away from the rest of this one amino acid. If we break here, we break where we are holding hands, because this is your N-terminus, your left hand, and this is my right hand. So that is where we hold hands, and that's where we break. And of course, if we break here, we would break at your left shoulder, so I would now be holding your ripped-off arm and you would lose an arm. Sorry for the morbid imagery, but it helps you remember, because this is literally what happens to the peptides. Now, you can see that we give these remaining fragments names. So this bit, this bit and this bit: this is A1, this is B1 and this is C1. So A, B and C, and the 1 because the fragment contains one amino acid. You can also see that we can keep the other bit instead: rather than the bit that contains the amino terminus, we can keep the bit that retains the carboxyl terminus. And for those we say: this first one, which is again your left shoulder, is the Z1; this is the Y1, where we are holding hands; and this is my right shoulder, the X1. So you have A, B, C and X, Y, Z. And again the 1 is there not because it contains the first amino acid but because it contains one amino acid. And of course there is the coordination here: the C1 is paired up with the Z3, as you can see, right? And why is this a three? Because it contains one, two, three amino acids. The nomenclature is very simple. So which of these ions do we expect to see most often with the most common types of fragmentation? Well, I hope it comes as no surprise that it's mostly the Y ions and the B ions. Why is that? Because the most commonly used fragmentation technique is rather soft, and that soft fragmentation breaks the weakest bond, and the weakest bond here is where we are holding hands. So the analogy really works very well. Each of us, as an amino acid, is pretty tough; we're not meant to fall apart, and amino acids are built like that, otherwise they would be pretty useless in biology because they would constantly be falling apart. So really think of them as Legos, if you like. Each Lego block is pretty sturdy, but the connection between the Lego blocks is the weak spot. Okay, and it's the same here. When you shake a Lego structure, it will not necessarily break in the block; if you shake softly enough, it will break at the connections. There are techniques that will saw the Lego blocks in half, but we're not discussing those in detail. So you most often get fragments here. Okay, so we have all of these fragments. So, going back here for a second: we select ions, we fragment them, and we mostly cut them into these B and Y ions, and then these things fly through. So what kind of things can we expect to see flying through the mass spectrometer? Well, we can see this thing fly, we can see this thing fly, and we can see this thing fly, but we can also see this thing fly, and this thing, and this thing. Okay, so all of these can be flying through, and each of them has a specific mass. This is light. This is light, but different. This is a little bit heavier. This is a little bit heavier still. This is heavier yet.
And again, this one will hopefully also have a different mass from this one, because it has a different set of amino acids inside it. So if we assume, like any good narcissist, that my name is a chain of amino acids, that's a peptide, then we can have the break here, which is the B1 ion, and that will be the L alone. But we can also have the LE, the B2 ion, which is this one here; it's a bit heavier, of course. Then we have the B3, which is this one, the B4, which is that one, and so forth. So you can see all the different ions, with increasing masses, obviously. And we can see that we find quite a bit of this one, but a lot more of this one, and even more of this one, and that's because each of these bonds has a different propensity for breaking. Now, if we're lucky, we will find every single one of these ions, and then we can do something really quite nice. We can jump from the zero mass to the first peak and read this off: oh, this is the mass of leucine. Then we can jump to the next peak and say the difference in mass, the delta between these two, is, oh, that's a glutamate. So the sequence so far has been: first a leucine, then add a glutamate to get to this peak. Now, what do we need to add to get to the next peak? An asparagine. And from this one to that one? Asparagine again. And so we ladder-sequence, one by one, all the amino acids in this peptide, very similar to how we ladder-sequence DNA. Unfortunately, that's not going to cut it, because in real life things are more complicated: we also have the ions on the other end, remember? The Y1, Y2, Y3, Y4, Y5 and so forth. But these peaks do not announce themselves: I am a Y ion, I am a B ion. We can't tell. So we don't know whether we should jump from here to here, or from here to there, or from there to there. So it gets very complicated. Even so, if we had perfect coverage of B and Y ions, we would still have a relatively straightforward task ahead of us, but that's also not how it works. This is what a real spectrum looks like. A lot of the B and Y ion peaks are not found at all, which makes it extremely difficult for us to analyze any of this. Rather, what we see is that some of the peaks are left, and they may be mixes of B and Y ions, which again makes it difficult, because you should not be jumping from this peak to that peak; they have very little in common. And we also have these red peaks everywhere, which are something else. They are not derived from our peptide, they're derived from something else, and so they really are very heavily in the way of our analysis, but we don't know that. So we could jump from zero to this peak and from this peak to that peak, and if we're really unlucky, that really would look like two amino acids. Now, even if we jump perfectly and we say we jump from here to here, which would be the first B ion jump because B1 is missing, so we have to jump straight to B2, we can say: oh, this mass corresponds to LE. And that would be nice. The problem is it also corresponds to EL, of course, because that is exactly the same mass. So this jump gives us composition: it says there is an L and there is an E, but it doesn't give us sequence. It doesn't tell us the difference between LE and EL, which is problematic.
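To make this ladder idea a bit more tangible, here is a minimal Python sketch that computes singly charged b ion m/z values from standard monoisotopic residue masses; the example sequences LE and EL come straight from the ambiguity just described, and everything else is illustrative.

```python
RESIDUE_MASS = {  # monoisotopic residue masses in daltons (a small, illustrative subset)
    "L": 113.08406, "E": 129.04259, "N": 114.04293, "A": 71.03711,
    "R": 156.10111, "T": 101.04768,
}
PROTON = 1.007276

def b_ion_mz(peptide, charge=1):
    """m/z values of the b1..bn ions of a peptide at the given charge state."""
    mzs, running = [], 0.0
    for residue in peptide:
        running += RESIDUE_MASS[residue]
        mzs.append((running + charge * PROTON) / charge)
    return mzs

print("LE b ions:", [round(m, 4) for m in b_ion_mz("LE")])
print("EL b ions:", [round(m, 4) for m in b_ion_mz("EL")])
# b1 differs (L vs E), but b2 is identical: same composition, hence same mass.
```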
Moreover, and this is really nasty, isoleucine and leucine have exactly the same chemical composition and exactly the same mass, because they contain the same number of carbons, hydrogens, oxygens and nitrogens. So the problem now is that this is either EL or LE, which I wrote here in regular expression notation, but it could also be EI or IE, because I and L are the same. So essentially, in regular expression notation, nesting these options, that becomes something like E[IL] or [IL]E. Now, and I'm not sure whether this is exactly correct, but I think it's close: suppose that Q and N, so glutamine and asparagine, together have roughly the same mass as this pair, then it could also be QN or NQ; and I happen to know that glutamine, Q, and lysine, K, have roughly the same mass, so it could also be KN or NK. So right now, we already have one, two, three, four, five, six, seven, eight different possibilities for one jump. For the next jump we are lucky: there is only one option, asparagine, that's it, so it doesn't add to our confusion; we still have eight different versions. But then going here, again, we have more versions, and I haven't really calculated how many we end up with, but there are quite a few. So again, you see an exponential explosion where you get ever more options. And in fact, what happens is that rather than assigning the spectrum to a single sequence, we're now assigning this spectrum to a whole list of possible sequences. So fundamentally, we are faced with tremendous ambiguity, and this is really the biggest problem in how we interpret mass spectra: we cannot extract a single, reliable sequence from a mass spectrum. Instead, we extract a whole range of possible sequences, and now we are stuck because we don't know which one to pick. So that will be the key challenge: how do we resolve this inherent ambiguity in the spectra? Well, to make a long story short, we're going to cheat, and the way we cheat is by limiting the number of choices we have. For that, we use a database. So let me explain that based on database search algorithms. The principle is: rather than allow any possible sequence of amino acids, we're going to allow only those sequences of amino acids which we find in the known protein sequences for the sample we are studying. If we're studying a human tissue sample, we're going to look at the database of known human proteins. And of course we know the human genome and proteome really well, so we actually have a pretty comprehensive list of human proteins; that's about 20,000 of them. And then we can say: of all of these proteins, here are all the peptides we expect and here are all the possible masses. So effectively we use these peptide sequences to create theoretical spectra, very much like this one, right? We put in all of the masses for all of the expected Y and B ions that could possibly show up, even though many of these will not show up, as I've just shown you; everything that possibly could show up, we put there. Now, there is a big difference here: in a real spectrum the peaks are at different heights, but in fact we don't know in advance what the height of each of these will be. So what we do, and that's what I drew here, is predict them all at the same height. We know that this is not correct, because not all of these peaks are going to be found, and because a lot of these peaks will actually have intensities different from this one standard intensity we give all of them. But it's good enough. We have a theoretical spectrum.
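As an illustration of what such a theoretical spectrum could look like, here is a minimal Python sketch that generates the singly charged b and y ion m/z values for a database peptide, all at the same nominal intensity; the residue mass table is a small subset and the seven-letter example peptide is just for illustration.

```python
RESIDUE_MASS = {  # monoisotopic residue masses in daltons (subset, as in the sketch above)
    "A": 71.03711, "E": 129.04259, "L": 113.08406, "N": 114.04293,
    "R": 156.10111, "T": 101.04768,
}
PROTON, WATER = 1.007276, 18.010565

def theoretical_spectrum(peptide):
    """Sorted (m/z, intensity) pairs for the singly charged b and y ions of a peptide."""
    total = sum(RESIDUE_MASS[r] for r in peptide)
    peaks, prefix = [], 0.0
    for residue in peptide[:-1]:
        prefix += RESIDUE_MASS[residue]
        peaks.append((prefix + PROTON, 1.0))                  # b ion (N-terminal fragment)
        peaks.append((total - prefix + WATER + PROTON, 1.0))  # complementary y ion
    return sorted(peaks)

for mz, intensity in theoretical_spectrum("LENNART"):
    print(f"{mz:9.4f}   {intensity}")
```

Every predicted peak gets the same intensity of 1.0, exactly because we do not try to predict peak heights.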
Then what we do is take our spectra that come from the instrument, and each of these we score against all of the possible theoretical spectra. Then we do some kind of mathematics to calculate a score for each match; I'll show you how that is calculated in a second. Each of these matches gets a certain score assigned to it, and we tend to take the peptide with the highest score and say: this must be the match. And with that we have a peptide; we don't have a protein yet. So then we have to find out which proteins contain this peptide. If we're lucky, there's only one; if we're unlucky, there are multiple, and I'll show you at the very end what happens when there are multiple. Okay, so this is the overall picture. What we're going to focus on right now are these scoring functions. How do you determine which of these theoretical spectra is actually the best possible match for this spectrum, taking into account that we can have matches that look very good, that look very reasonable, but are in fact wrong? We have to find a way of scoring this somewhat reliably. That is now going to be our challenge in these database search engines, pieces of software that search a database for a good match against the experimental spectrum. And again, we use this database as a means to filter down the number of possibilities, from a huge number of highly ambiguous possibilities to hopefully very few that will give us an unambiguous answer. To show you what kind of decrease you get from this: the human proteome contains about one and a half million of these peptides. One and a half million peptides is a whole lot less than all possible amino acid combinations between nine and twenty amino acids long, which quickly runs into the very-many-zeros realm. So that would be a much, much bigger set. When I say we filter down the possibilities, we do so in a much more impressive way than I can show you with my hands: we're really going from a ridiculously large number to a very manageable one and a half million. So this is a crucial technology for us. Now, three algorithms are going to be discussed, Sequest, Mascot and X!Tandem, because they arose more or less in that order, and they show you how things have evolved from the earliest, which is Sequest, to the more recent algorithms. Sequest is the original search engine; it was published in 1994. We won't go through all of these details, it's just a list of some relevant points that you can read up on, but I'll tell you how it scores, because that's the key thing that we want to understand. So here's our experimental spectrum and here's our theoretical spectrum. You know where these come from: this comes from the database, this comes from the machine. Remember, this one looks a bit silly because we have calculated all these masses and put all the intensities at equal height, whereas this one, of course, looks real: it is variable, the differences in intensity are noticeable, and there are quite a few noise peaks in there which have nothing to do with the signal, right? That's something we expect. What we're doing now is essentially a dot product: we treat this as a vector and we treat that as a vector, and we take the dot product.
So we take all of the peaks one by one in, say, the experimental spectrum, and we multiply the intensity of each peak with the intensity of the corresponding peak in the theoretical spectrum, provided there is a match. If a peak here matches a peak there, we multiply their intensities; if this peak does not find a match, we multiply by zero, okay? So it disappears from the sum. Now, if you really think about it, the multiplication is quite simple, because a peak in the theoretical spectrum has intensity one or intensity zero; it will not have another intensity. You could give it another value, but the effect would be the same: it's essentially a binary thing, peak there, peak not there, okay? So if you think of the theoretical intensity as being one or zero, what actually happens is that you add the intensity of an experimental peak to a sum if it matches a predicted peak. And so this becomes the sum of matched intensities. Every peak in here that matches an m/z that you predicted to be present gets added to one big sum. So you say, for instance: oh, I've matched this peak, so add 90 to my sum; then I matched this peak, so add 150, now I'm at 240; then I matched this really big one, so add 440, and 240 plus 440 is 680; and then I matched a small peak, so add another 10, and I'm at 690. That's my sum of matched intensities. All the other peaks did not find a match and therefore don't get added, okay? That's how this works. Now, the problem is this: I tell you that the matching score between this spectrum and that spectrum is 690. Is that good or bad? We don't know, right? It's very difficult to assess the quality of that particular match. So in order to assess the quality, we're going to empirically create some nonsense to get an empirical null distribution to compare our score against. Fundamentally, the idea is: give me a score for a random theoretical spectrum, do that many times until we get a distribution of scores that we can expect for random spectra, and then see, statistically speaking, whether the score for the real theoretical spectrum is part of that collection or not, okay? So how do you create your empirical bad spectra? Very simply: you create other spectra R_i, where this one is R_0 and i is the shift by which you move all the peaks left or right. For instance, R_1 moves all the peaks to the right by one dalton; as you can see, every peak is shifted by one dalton. R_2 shifts every peak by two daltons, and so forth, all the way up to 75 daltons. We do the same shifting the peaks to the left: R_-1 shifts every peak by minus one dalton, R_-2 by minus two daltons, and so forth down to minus 75. Now, I put these in red because every single one of these R's different from zero is in fact a nonsense spectrum; it has no bearing on a real peptide. It's just a real peptide's theoretical spectrum that you've mangled by shifting things. And now you can calculate the same score for each of these red spectra. What you get when you do this is something like this: a bunch of scores, a histogram, so frequency versus the actual score you calculated.
From minus 75 to plus 75, and you can see most of these scores are pretty bad; some are a little better than others, but most of them are pretty bad. I deliberately did not make this a normal distribution, because there's no reason to assume it will be normal. It may be, it may not be; there might be a pattern behind the theoretical spectrum or behind the real spectrum that makes the scores non-normal. So I deliberately did not put a particular distribution on this, because it's also not important. What actually happens is something relatively straightforward, relatively dumb if you like: we take the average of this set of scores and then calculate the distance between our score and that average crap score, if you like. That's shown in this formula: the score of R_0 minus the sum of the other scores divided by the number of scores, which is the average crap score. Mind you, you have to exclude R_0 from this summation if you want to be mathematically correct, because obviously R_0 is not supposed to be a crap score. We are calculating an average, and an average has a very low breakdown point in statistics: a single extreme value can skew it. So you definitely do not want to include R_0, because if it's a good score it will skew that average, and the whole measurement would be skewed by the very thing you're trying to validate, which is not a good idea. That's why we omit R_0 here and have only the 150 shifted measurements. Now, that gives you an idea of how different this score is. Mind you, this is not statistics; this is a distance. So it's very simplistic, but keep in mind it was the very first score developed, and in those days you would do this for five or ten spectra that you had recorded during the last two or three days. The whole statistics thing wasn't yet big enough in this field to be a problem; in fact, people could calculate this with calculators or in Excel, or rather Lotus 1-2-3 in those days, rather than with dedicated software. Nevertheless, Sequest was written to do exactly this and automate it. Now, there's another problem that we haven't discussed, and that is that you actually have more than one theoretical spectrum to compare against. If I quickly move back, you will see that of course we have a whole bunch of possible theoretical spectra; we may have thousands for which the precursor mass matches the intact mass of this peptide, which of course is an important filter here, as you can imagine. So now we've scored the spectrum against one theoretical spectrum, but we have to do this for every other theoretical spectrum in our candidate list, which means we will get a whole bunch of these cross-correlation scores, as they are called. So here the blue spectrum, which is the example we just looked at, shows a pretty big difference from its null, so a pretty high XCorr score. Now there's another spectrum, this green one, which also looks quite different from its random distribution; but look, this random distribution is also quite different from that random distribution, and at least the difference here is also noticeable. So now this one is the best in class and this one is the runner-up, and the question becomes: is there a sufficient distinction between this separation and that separation?
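Here is a minimal Python sketch of this whole idea: a sum-of-matched-intensities score, an empirical null from shifted theoretical spectra, and an XCorr-like difference from the average of that null. The tolerance and the tiny example spectra are made-up assumptions, not the original Sequest implementation.

```python
# Sketch of the Sequest-style cross-correlation idea: the raw score is the sum of
# experimental peak intensities that match a theoretical m/z (within a tolerance), and
# the XCorr-like value is that score minus the average score obtained after shifting the
# theoretical peaks by -75 .. +75 Da (excluding the zero shift).

def matched_intensity_sum(experimental, theoretical_mzs, tolerance=0.5):
    total = 0.0
    for mz, intensity in experimental:
        if any(abs(mz - t) <= tolerance for t in theoretical_mzs):
            total += intensity
    return total

def xcorr_like(experimental, theoretical_mzs, max_shift=75):
    score_at_zero = matched_intensity_sum(experimental, theoretical_mzs)
    null_scores = []
    for shift in range(-max_shift, max_shift + 1):
        if shift == 0:
            continue  # R_0 is the match we are testing, so it is excluded from the null
        shifted = [t + shift for t in theoretical_mzs]
        null_scores.append(matched_intensity_sum(experimental, shifted))
    return score_at_zero - sum(null_scores) / len(null_scores)

# Tiny made-up example: four experimental peaks, three predicted fragment m/z values.
experimental = [(200.1, 90.0), (300.2, 150.0), (350.0, 20.0), (420.3, 440.0)]
theoretical = [200.0, 300.0, 420.5]
print("raw score:", matched_intensity_sum(experimental, theoretical))
print("XCorr-like score:", round(xcorr_like(experimental, theoretical), 1))
```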
So for instance, if in this class I want to find the person who is absolutely the best in statistical genomics, and at the end of the year three of you have 18 out of 20 on the final exam, then who is the absolute best, right? It's going to be very difficult to tell, because the three of you have exactly the same score. And if somebody has 18, the next one has 17 and a half and the next one 17, that's pretty close too. It's very difficult to say that the 18 is clearly so much better than the 17 and a half, because that could also be luck, right? That could be the way a question was asked, or the sun shining in your eyes at the wrong moment. So it's really down to nitty-gritty details. And what we try to see here is: is that distinction between the top of the class and the runner-up substantial, or is it minor? Of course, if we're really lucky, this one will have 20 out of 20 and the second person only 15 out of 20, and then we can reasonably state that this person was probably really the best in class. So we're going to measure that distance between the best-scoring and the runner-up theoretical spectrum. We take the cross-correlation score that we calculated here, subtract the runner-up cross-correlation score, and then normalize by dividing by the best cross-correlation score, which gives us a kind of percentage difference between the two. We call that the delta Cn, okay? That's the relative difference between the top-scoring match and the runner-up, and it gives us an idea of how clearly this one is a better match than that one. If you remember, I talked about ambiguity being the problem; if you don't remember, you can rewind this video to where I was talking about that. What we are essentially assessing is: how ambiguous is this match? If it is highly ambiguous, this score will be low; if there is no ambiguity and this match is clearly better, it will be high. So this gives us a discriminating score for the quality of that identification being the correct one, which is actually pretty clever. The downside is that you now have two scores to contend with: a cross-correlation score and a delta Cn. That will create a bit of a problem, as we will see soon. But first, let's have a look at what happens when you assess the score for different charge states of spectra, because something funny happens. This is the cross-correlation score. In gray, you find the scores for truly random spectra, which were simulated in this case, and in white you find the cross-correlation scores for correct spectra, or at least things we assume are correct. And what you can see is that there is a certain average score, with a standard deviation, because we've done this many times, so we can calculate the distribution. You can see that the cross-correlation score for one-plus spectra tends to be quite high, around two, and that for the random spectra it is pretty low, around one. So that's good: the score seems to differentiate quite nicely. You could even say: if I want to separate these distributions, and let's do this very crudely, visually, because it's the concept that matters, then if I put a line here, I could exclude most of the bad stuff up to, say, one standard deviation, and I would include most of these as being pretty purely good results.
And that line would be at around one and a half, give or take. So I can differentiate between these scores, and there's a clear separation between the distributions. Now, if you look at two-plus spectra, you will see that both of these scores rise. The 1.9 here becomes a 3.5, which is not quite double, but the 0.9 here becomes a 2.1. So this score goes up faster than that one; this line rises faster than that line. And that's annoying, because it means the random spectra are somehow catching up to the real spectra in terms of score. More disquietingly, the standard deviations start to overlap, so these distributions have a higher degree of overlap than the one-plus distributions, which were more clearly separated. It gets worse when we reach three-plus: there's practically no increase in score for the correct hits, but the score for the bad hits continues to increase. Mind you, it is also leveling off, but not as fast. Why is this happening? It's happening because the number of theoretical peaks changes, and it's a multiple testing problem in a way. If you have 10 amino acids, that means you have 10 theoretical B ions and 10 theoretical Y ions. If that is confusing, rewind to where I explain the B and Y ions and look at the example with my name: that's seven letters, so seven B ions and seven Y ions. So here, for a 10 amino acid peptide, you have 10 Y ions and 10 B ions, and therefore a certain probability of matching a random peak in the spectrum, based on the fact that you have 20 peaks in your theoretical spectrum. If you like, it's like going to the lottery and buying 20 tickets. You have a certain probability of winning the lottery with those 20 tickets; that probability will not be very high, so the score for random hits is low. That's good. The score tends to be much better for the real match, because there you actually expect some of these peaks to match really nicely with the theoretical spectrum of the peptide that actually gave rise to the experimental spectrum. Now, if you go to two-plus, the issue is that if your precursor has two charges and you break it into a B and a Y ion, very often the Y ion will take one charge and the B ion will take the other. So you have a singly charged B ion and a singly charged Y ion, but you can also have both charges on the Y ion or both charges on the B ion, in which case you get a two-plus Y ion or a two-plus B ion. 'Or' rather than 'and', because you would never see both complementary fragments doubly charged at the same time; there are only two charges to go around, right? But that means that in our theoretical spectrum, we now have 10 singly charged B ions and 10 singly charged Y ions, plus 10 doubly charged B ions and 10 doubly charged Y ions. That makes 40 lottery tickets, and now it gets a lot easier to win the lottery by random chance. Why does this curve not go up twofold as well, even though it too has more lottery tickets? Because the real spectrum tends not to behave in a way where everything duplicates. In reality, a certain Y ion will show up as one-plus but never as two-plus, and another Y ion will show up as two-plus but never as one-plus. So despite the fact that you have more peaks in the theoretical spectrum, you're not necessarily going to match more real peaks, because they simply are not present in the spectrum.
So you're raising the noise level, but you're not raising the signal level, and that creates a signal-to-noise problem which the random hits exploit. It's a multiple testing problem: your probability of randomly matching a peak in a spectrum goes up. As you can see, for three-plus this gets even worse, because now we have 60 peaks; all the three-plus peaks for B and Y ions are added as well. And we definitely don't see a big increase for the correct hits, which means that practically none of these additional peaks lead to additional signal recovery. But we also see the random score leveling off a little. Why is that? Because the axis is m over z. A peak which was one-plus, if you make it two-plus, shifts to half that m over z, because instead of dividing by one we now divide by two; if you divide by three, it shifts even further to the left. So this was the original, this is divided by two, this is divided by three. The problem is that all of those three-plus peaks get pushed towards the low end; they all become very light, so it gets very dense in that area. And because all your peaks are grouped together, they don't behave like a fair lottery anymore. There is a bias, and that bias means they cannot match that many different peaks anymore; in fact, some of them will start matching the same peaks, because we're compressing them into a smaller area where they all become more similar in a way. So fundamentally, even the random guys now start having trouble recovering more signal from the spectrum, because they can't use the whole spectrum, only this small region. Nevertheless, they are better at this, because they are random, than the real peaks, which don't find matches because those simply don't exist. So this is a really big problem, and it has everything to do with counting the number of lottery tickets you buy for the matching. That is important, because we'll come back to it in the next search engine, which tries to solve precisely this problem. There's another problem, which I already mentioned: the fact that we have two distinct types of scores, the delta Cn and the cross-correlation score. Which one do you use? Or, if you use both, how do you combine them? And then you get this problem. Here you see a bunch of curves; each corresponds to a cross-correlation cutoff. Let's say you take a cutoff at 1.5; then you follow this curve, and what you see on the y-axis is how many false positives you will get for that cross-correlation cutoff, given the filter you put on the delta Cn. So for instance, if I have a cross-correlation cutoff of 1.5 and a delta Cn cutoff of 0.06, I end up here on that curve, I go to the left, and I get about 1.25% false positives, okay? So let's do our thresholding, and let's agree that we want to threshold at a 1% false positive rate; we want to cut off here. Now I start and I say: I'll be rather lenient on the cross-correlation, I put the cutoff pretty low, at 1.5. Anything with a cross-correlation of 1.5 or higher, I will accept as good. However, I want 1% false positives maximum. So let's look up what kind of delta Cn cutoff I need, and in order to get to 1%, I need to push my delta Cn cutoff all the way to 0.08, which is pretty stringent.
So I need a very good separation between the best score and the runner-up, because I'm not very strict on the best score itself, okay? But I am more or less sure that, in this data set at least, I will get at most 1% false positives, okay? Now you say: no, no, this is a really bad idea, you want to be stringent from the start, right? So you say: we want the cross-correlation score above 2.1, which is much more stringent, because you need a much better separation between random junk and the spectrum to reach that threshold. So you're very strict. Now let's see where you need to put your delta Cn cutoff to get 1% false positives. Here we have the 2.1 cross-correlation cutoff line, and you want to be at 1%, which is right here already, at a delta Cn of 0.02, okay? So you can be much more lenient on the delta Cn because you've been much more strict on the cross-correlation, and you too will obtain at most 1% false positives. So the two of us now have completely different threshold scores, and we both obtain at most 1% bad hits in our data sets. The key question, however, is: will you have the same data set that I have? And the answer, unfortunately, is no, you will not. If I have a peptide in my list with a cross-correlation of 1.8, it passes my first filter, so it's in; then I need to filter on the delta Cn, and let's say the delta Cn is 0.1, which is very good, so it passes that filter too. I include it in my data set as a good hit. You, however, will say: a cross-correlation of 1.8, I throw that out immediately. So it will not show up in your data set. Conversely, if we have a peptide in our sets with a cross-correlation score of 2.5, which is pretty good, I will accept it, because it's above 1.5; but the delta Cn for that spectrum is 0.06, and I will reject it because I need 0.08. You, however, will say: 2.5 is above 2.1, so that's fine for me, and 0.06 is well above 0.02, so I will include it. So it will be in your set but not in mine. As you can see, we ran the same Sequest search, but because we used different filters, both of which give at most 1% false positives, we end up with different lists of final identified peptides. In terms of reproducibility, this is a very serious problem. There are too many degrees of freedom here, too many places where you can make a choice, where the choice influences the final results, and where none of the choices sounds really objective. So that is a fundamental problem, and it is no coincidence that these papers start coming out around 2002. Why did people start worrying about the scoring function of Sequest around that time? Because mass spectrometers had gotten a whole lot faster, and because they had gotten so fast, the problem was no longer scoring five or six spectra; now we were scoring 500 or 600 spectra, and that creates a lot of uncertainty, because now we can potentially make so many mistakes that we cannot manually inspect and correct them anymore. So people were getting worried that the algorithms might not be powerful enough anymore, okay? And at around the same time, the whole world started looking at that.
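Here is a tiny Python sketch of that reproducibility problem, with made-up scores: two threshold pairs that could both sit on the 1% false positive curve, yet keep different peptides.

```python
# Sketch of the two-score filtering problem: the PSM list you keep depends on *where*
# you put the XCorr and delta Cn cutoffs, even if both settings claim the same false
# positive rate. All scores below are made up for illustration.

psms = [  # (peptide, xcorr, delta_cn)
    ("PEPTIDEA", 1.8, 0.10),
    ("PEPTIDEB", 2.5, 0.06),
    ("PEPTIDEC", 2.3, 0.09),
]

def accepted(psms, xcorr_cutoff, delta_cn_cutoff):
    return {p for p, xcorr, dcn in psms
            if xcorr >= xcorr_cutoff and dcn >= delta_cn_cutoff}

lenient_xcorr_strict_dcn = accepted(psms, xcorr_cutoff=1.5, delta_cn_cutoff=0.08)
strict_xcorr_lenient_dcn = accepted(psms, xcorr_cutoff=2.1, delta_cn_cutoff=0.02)
print(lenient_xcorr_strict_dcn)   # {'PEPTIDEA', 'PEPTIDEC'}
print(strict_xcorr_lenient_dcn)   # {'PEPTIDEB', 'PEPTIDEC'}
# Both settings may sit on the same 1% false positive curve, yet keep different peptides.
```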
Now, funnily enough, some people in industry had figured this out well before that time. These two people are David Creasy and John Cottrell, and they built a search engine called Mascot. Mascot does away with the intensity-based scoring that Sequest used. Remember that the fundamental Sequest score is the correlation score, which is built on the sum of matched intensities? Well, these guys said: no, that's not your problem. Your problem is: how many lottery tickets did you buy, how many peaks are in your theoretical spectrum, and how probable is it that, given this number of lottery tickets, you happen to win the lottery? Because that gives you a probabilistic framework to calculate the chance that your match is a random match, okay? This is not built on intensity; it is built on how many peaks you match, given how many peaks there are in the theoretical spectrum, okay? And that actually allows you to calculate an a priori threshold on statistical grounds. Now, that core algorithm was never published, because this is a business, a company that made it. However, another group built a search engine that mimics Mascot so well that the R squared correlation between the scores is 0.99, okay? Which is probably down to rounding, because it's essentially the same thing; it's actually not that difficult to reverse engineer the scoring system. That search engine is called Andromeda, it's part of the MaxQuant package, and it was developed by Jürgen Cox and colleagues. And, funnily enough, Andromeda was published. So if we look at the scoring function of Andromeda, we effectively know what the scoring function of Mascot was. And this is that scoring function: the score is minus 10 times the log10 of this probability, yeah? And you can see it's a probability because you see the binomial here. So it's a binomial probability, where p is the probability of finding a single matched peak by chance, okay? And j here iterates over the number of theoretical peaks, the number of lottery tickets we buy, and k is the number of matched peaks within a given fragment tolerance. So you can see that this whole probability is predicated on: how many peaks do we have in the theoretical spectrum, how many of these do we match, and is that what you would expect by chance or not, given, of course, this probability of randomly matching a peak? I won't go into the specifics; a former PhD student of mine, Şule Yılmaz, wrote a really nice review in which she picks this apart, and there are open-source implementations of the scoring algorithm that you can look at, which is really interesting and really nice. But what I want to focus on here is that the scoring function has now become truly statistical. We have moved away from calculating a distance and then wondering whether that is good or bad. We now have a theoretical function which predicts the probability of random matching without us having to construct an empirical null; in fact, the score is derived directly from that probability, and that automatically translates into statistical control over the match. This is a major, major step forward. So it's based on peak counting instead of intensity sums, which is a very different thing: one says matched intensities, the other says number of matched peaks, okay? Very different.
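In the spirit of that published scoring function, here is a minimal Python sketch of such a binomial, peak-counting score; the per-peak random match probability used here (0.04) is an illustrative assumption, not a calibrated value from Andromeda or Mascot.

```python
from math import comb, log10

# Sketch of a binomial, peak-counting score: given n theoretical peaks, k of which were
# matched within the fragment tolerance, and a per-peak probability p of matching by
# chance, score = -10 * log10( probability of at least k matches out of n by chance ).

def binomial_peak_score(n_theoretical, n_matched, p_random=0.04):
    p_at_least_k = sum(
        comb(n_theoretical, j) * p_random ** j * (1 - p_random) ** (n_theoretical - j)
        for j in range(n_matched, n_theoretical + 1)
    )
    return -10.0 * log10(p_at_least_k)

# Matching 8 of 20 theoretical peaks is far less likely by chance than matching 2 of 20,
# and matching 8 of 60 (more lottery tickets) is less surprising than matching 8 of 20.
for n, k in [(20, 2), (20, 8), (60, 8)]:
    print(f"n={n:2d}, k={k}: score = {binomial_peak_score(n, k):6.1f}")
```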
Now, if you think of the number of matched peaks, which is the Mascot side, as gin, and of the sum of matched intensities as tonic, you know what happens next: you mix them into a gin and tonic. Or, if you are like me and don't drink alcohol, think of chocolate and milk: you make chocolate milk. And that is exactly what happened with X!Tandem. X!Tandem was originally also a commercial engine, but the company got caught up in the dot-com bubble bursting and lost its investors, so unfortunately it went bankrupt, and then they did the right thing and made their algorithms available for free as open source, which was extremely nice of them. It was published in 2003, a few years after the dot-com bubble burst, and you can see here how the scoring function works: it is gin and tonic. The tonic bit, the Sequest bit, is the intensity in the experimental spectrum multiplied by a binary function P, which is zero if the experimental peak does not match a theoretical peak and one if it does. If that sounds familiar, it is exactly what the Sequest score is built on: a sum of matched intensities. The Mascot bit is where you see a factorial, which should automatically make you think of probabilities, and indeed this is a shorthand probabilistic model for how many b ions and how many y ions we match. So we take into account the number of matched ions, with a little statistical sauce on top, and we take into account the matched intensities. Really the gin and tonic, or chocolate milk, version of the previous two: a logical consequence.
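A minimal sketch of such a combined score, in the spirit of the X!Tandem hyperscore described above: a Sequest-like sum of matched intensities multiplied by factorials of the numbers of matched b and y ions. The intensity values below are made up.

```python
from math import factorial

def hyperscore(matched_b_intensities, matched_y_intensities):
    """The 'gin and tonic' score: sum of matched intensities (the Sequest-like part)
    times Nb! * Ny! (the peak-counting, Mascot-like part)."""
    intensity_sum = sum(matched_b_intensities) + sum(matched_y_intensities)
    n_b = len(matched_b_intensities)
    n_y = len(matched_y_intensities)
    return intensity_sum * factorial(n_b) * factorial(n_y)

# Made-up matched fragment intensities for one candidate peptide: 3 b ions, 2 y ions.
print(hyperscore([1200.0, 850.0, 430.0], [2100.0, 990.0]))
```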
The problem, however, is that while Mascot, with this type of model (a rather more sophisticated one in their case), could predict a theoretical function and an a priori threshold, Sequest, because it uses a sum of matched intensities, had to construct an empirical null. As soon as you add something like intensities, you need an empirical null, because you cannot calculate a theoretical model for the result. That is what X!Tandem has to do as well, but it is a lot smarter about the way it builds that empirical null. In Sequest we had to do 151 calculations for every theoretical spectrum: the R-zero and then 75 shifted one way and 75 shifted the other way. But think about it: how many peptides could really match a given spectrum in real life? Probably none, because the spectrum may come from something that is not in the database, which is always a possibility, or exactly one; there cannot be two different peptides that both give rise to the same spectrum. So if at most one is correct, all the other theoretical matches are already models for random matching. We get them for free. So what you do then is, for a given experimental spectrum (that is important, a given experimental spectrum), you calculate the scores for all the theoretical spectra and you make a histogram of number versus score. And look what happens: very few theoretical spectra have a really bad score, most have a mediocre bad score, some have a reasonably high bad score, some have a pretty good score, and then there is this one hit with an amazing score. In Sequest you would take the difference between that hit and the average and ask whether that difference is big or not, but that is really wasteful: you take all of these numbers and reduce them to an average, while you have a beautiful distribution here. So you can do proper statistics on this empirical null, because keep in mind there is either one or zero real hits in it; this is for sure an empirical null, you just know it, and only the very top point is questionable. Now, how do you know what kind of model to fit to this? Well, this score is called a hyperscore because the way it is engineered should make it look like a hypergeometric distribution, and for a hypergeometric, if we log-scale the descending part, we should get more or less a straight line, which we can regress through. The regression fit gives us the model that we want to apply. This is essentially a log-odds style transformation, and you really want to find hits that fall at or below zero; in fact, you want to go a little lower than zero, because you don't want every hit that looks a little bit out of place, you want something that looks substantially out of place. We'll see that in a second. Once you have fitted your regression line, you extrapolate it and find out where it hits the hyperscore you observed for the nice hits. So take a score of 50: where does the regression line hit 50? At about e to the minus 2. Now e to the minus 2 is one divided by roughly 7.4, which is about 0.14, and that does not sound like a good alpha for a cutoff: there is roughly a 14% probability that this hit belongs to the random distribution, which is too high; for us it is not special enough, so this is probably not a good hit. This other hit, however, with a score of about 80, hits the regression line at about e to the minus 8. That is a much smaller number, something with at least three zeros after the decimal point, so it is definitely well past, say, a 5% tolerance. This hit is quite clearly statistically distinct from the random distribution, whereas the first one was somewhat distinct but not distinct enough. This can now be expressed directly as a kind of E-value: the expectation value for this spectrum match to belong to the random distribution, and it is very, very small. So now we know how reliable this hit is.
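Here is a small sketch (using numpy, with made-up scores) of that empirical-null idea: histogram the scores of all candidates for one spectrum, fit a straight line through the log of the descending tail, and extrapolate that line to the best hit's score to obtain an expectation value. This only illustrates the principle; it is not the exact fitting procedure that X!Tandem uses.

```python
import numpy as np

def expectation_value(candidate_scores, top_score):
    """Histogram all candidate scores for ONE experimental spectrum, fit a line
    through the log of the descending tail, and extrapolate to the top score."""
    counts, edges = np.histogram(candidate_scores, bins=30)
    centers = (edges[:-1] + edges[1:]) / 2
    # Survival counts: how many candidates score at least this well.
    survival = np.cumsum(counts[::-1])[::-1]
    tail = (survival > 0) & (centers > np.median(candidate_scores))
    slope, intercept = np.polyfit(centers[tail], np.log(survival[tail]), 1)
    # Extrapolate the fitted line out to the best hit's score.
    return float(np.exp(slope * top_score + intercept))

rng = np.random.default_rng(0)
null_scores = rng.gamma(shape=5, scale=4, size=2000)   # made-up random-match scores
print(expectation_value(null_scores, top_score=80.0))  # tiny value -> very reliable hit
```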
So it is a lot faster than Sequest, because Sequest did 151 calculations for each theoretical spectrum whereas here we do one per theoretical spectrum, and on top of that we actually use all of those values to get a proper statistical estimate of how special the top score is. And that is how most search engines work today: they either work like Mascot, only gin or only chocolate, or they work like X!Tandem, half and half, gin and tonic or chocolate milk, mixing the two systems. Whenever you use a mix, you have to use an empirical null; if you use only the chocolate or only the gin, you can work with a theoretical model and you don't necessarily need an empirical one. Right. Now, one thing that influences this probabilistic model is how many candidate matching peptides you have, because with more candidates this curve starts to look different. If you have more candidates, not only do all the counts go up, you will also find more good-looking scores by random chance, because there is more variation in the database: some candidates will get higher scores and some lower. So this curve broadens and moves outward, and like a tsunami it starts flooding the top point. The top point itself does not change, because it is one theoretical spectrum against one experimental spectrum, a deterministic value that stays where it is. But the null distribution creeps closer and gets a larger variance, the regression line shifts, and a score of 80 suddenly does not look that special anymore: with so many possibilities, some candidates will, by random chance, start scoring that high, and they eat away at the significance of that point until it becomes insignificant. That is what happens when your database grows too big. Mind you, when your database grows too small, this curve collapses: you are left with only a handful of points, your regression line essentially goes flat, and it simply tells you that you don't have statistics anymore. Fundamentally your N, your number of samples, becomes so small that you cannot make a significant call, and the algorithm throws up its hands and says: there is not enough data, I cannot work with this. So you want your database, your number of potential matches, to be reasonably large; but if it becomes too large, the line flattens out and it becomes very difficult to identify anything. Now, the statistics still hold, because they will correctly tell you that this point is no longer special; the problem is just that your sensitivity drops. The specificity is guaranteed, the sensitivity suffers, so the algorithm errs on the side of caution. Have a look at this paper in Mass Spectrometry Reviews by one of my PhD students if you want to know the effect of the size of the database. Again, the bigger the database, the higher these violin plots go and the flatter that regression line becomes. And as you can see, allowing for something like phosphorylation actually increases the content of the search space quite a lot compared to a normal database, so changing the search settings influences your ability to sensitively identify things.
Remember: the specificity is guaranteed as long as you don't mess with the score, but the sensitivity drops, so you get fewer identifications. If you want the details, I won't go into them here; have a look at that Mass Spectrometry Reviews paper, where it is all explained at length. This is the output that you get, the identification rate, and as you can see, when you search for something like phosphorylation, your identification rate is actually lower than what you would get with the original baseline analysis. So here you can see the effect of all these different parameters, and again this is explained in a lot of detail in the review. How popular are the different search engines? That is also discussed in the review. You can see that the most popular ones are Mascot and then Andromeda, and as Mascot goes down, Andromeda comes up. Why is that? Because Andromeda is a clone of Mascot, but Mascot is paid for and Andromeda is free; you can see what happens when a free competitor appears, it starts eating away at the commercial one. This is Sequest, which has only recently started to decline; Sequest is really showing its age. Interestingly, the person who made Sequest wrote a new algorithm called Comet, which you will find in the 'other' category here. You can see Comet appear around 2014, which is where Sequest starts to dive. So Sequest, which is also commercial, is being replaced by the free and open-source Comet, which is actually faster and better and looks a lot more like X!Tandem than it looks like Sequest. And then MS-GF+ is interesting, because that search engine is completely different from all the others. I won't have time to go into the details, but if you really care about these search algorithms, have a look at the MS-GF+ paper, because it is a really original take on the search engine question. The only downside is that it is very slow, but it is a very good search engine. In the practical you will be using these search engines through our software SearchGUI, and you can see that most of the popular freely available engines are implemented in SearchGUI. That is for the practical: when you go through the tutorials you will play with these engines, and it is very easy to use them because we built nice software around them. To look at the results we have a tool called PeptideShaker, which we will also use extensively in the practicals. I just wanted to preview it for you: you can see the match between the theoretical and the real spectrum, you can see the b and y ions plotted on the sequence and annotated on the spectrum, and there is lots of information to play with, including multiple possible matches to the same spectrum and how they differ, comparisons between different search engines if we run several, mapping onto 3D structures, and a close-up view of the scoring and the false discovery rate calculations; how those are calculated is something we will talk about next, but all of it will be visualized and played with in the practical, where you can find out about all of these things. There is other stuff in there as well that we won't have time to go through in detail.
Now there is a special kind of search engine called a sequential search algorithm, and these actually try to do what I tried to do at the very beginning of this whole identification story: jump between peaks like some kind of crazy Super Mario and find distances between peaks that correspond to amino acids. But we don't do it throughout the whole spectrum; we just do it for a handful of peaks, obtaining a few possible amino acids, and then we append the mass that is not explained on one side, in this case almost 1,080 Da, and the mass that remains on the other side, which is this peak subtracted from the precursor mass. We call this a sequence tag: there is 1,080 Da of something, we don't know what, but it weighs this much; then there is S-D-I (or L, remember we cannot tell I and L apart); and then there is 303 Da left until the end. Then we ask the database: have you seen any peptide with this total mass, which is already a good filter; it has to contain SDI or SDL; whatever comes before that tag in the peptide has to have this mass, and whatever comes after it has to have that mass. If there is a match, we say that is the one, we have identified it. So this is a search engine built on a completely different principle: it starts from the spectrum, extracts sequence a priori, without looking at the database, and then matches that to the database. This approach was originated by Matthias Mann, still a very famous scientist in the field, in 1994, exactly the same year as Sequest. Interestingly, this approach rather quickly lost traction, because databases got too big and you got too many ambiguous results: you only use a few peaks, not the full power of the spectrum, which is not the best way to go about this. Sequest, in contrast, can in principle use all of the peaks, because each could be matched to a theoretical peak, so it is better at resolving the ambiguity. So the tag approach was kind of lost, the Sequest approach won out, and that is how everybody built their algorithms afterwards. Nevertheless, there is something really fun you can do with tags. David Tabb made a tool called GutenTag in 2003 and another called DirecTag in 2008, which build on the Mann approach and do something special. They generate these short bits of sequence, the sequence tags, from four peaks; note that you can have multiple sequence tags from a single spectrum, and they may or may not overlap. You then look up in the database all the possible peptides that could match any of the tags, which is what you see here, and this is where the original algorithm would stop. But Tabb had the clever idea to add a second-stage sequence scoring: for each of those candidates he creates a theoretical spectrum and then scores that theoretical spectrum against the real one, Sequest-style. So the tags are used as a filter, then spectra are generated and scored, and then you get a result.
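A toy sketch of that first-stage tag lookup: the residue masses are standard monoisotopic values, but the mini-database and the flanking masses of the tag are invented for the example (and the precursor-mass filter is folded into the flanking-mass check here).

```python
# Monoisotopic residue masses in Da; I and L are identical in mass.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def matches_tag(peptide, prefix_mass, tag, suffix_mass, tol=0.5):
    """Does a database peptide fit a (prefix mass, short sequence, suffix mass) tag?
    I and L are collapsed because their masses are identical."""
    canon, canon_tag = peptide.replace("I", "L"), tag.replace("I", "L")
    for start in range(len(canon) - len(canon_tag) + 1):
        if canon[start:start + len(canon_tag)] != canon_tag:
            continue
        before = sum(RESIDUE[aa] for aa in peptide[:start])
        after = sum(RESIDUE[aa] for aa in peptide[start + len(canon_tag):])
        if abs(before - prefix_mass) <= tol and abs(after - suffix_mass) <= tol:
            return True
    return False

# Hypothetical mini-database and a tag 'SDI' flanked by the masses of 'LK' and 'AR':
database = ["LKSDIAR", "PEPTIDEK", "GGSDLAAR"]
print([p for p in database if matches_tag(p, 241.18, "SDI", 227.14)])  # ['LKSDIAR']
```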
The cool thing is that because you use this additional scoring stage as a filter, you can now also allow, say, an amino acid mutation, see how that scores against the spectrum, and potentially identify a mutated sequence. That is how they positioned DirecTag: it can allow unusual modifications and mutations that would otherwise be very difficult to find. It is a very interesting take, but unfortunately DirecTag and GutenTag never really took off, so they are still not very popular; they are built into SearchGUI, though, so if you ever want to play with them, you will learn how to do that in the practical. Finally, you can do de novo sequencing, which is what I really tried to do at the very beginning: jumping from peak to peak, from start to finish, to read the whole peptide out of the spectrum. That is very, very difficult, because as I mentioned, for it to work you need very many, preferably all, of the peaks in the spectrum, and that unfortunately does not happen very often; you need really beautiful spectra. Nevertheless, there are many algorithms available for this; probably the best one out there right now, published in 2015, is called Novor. It is also implemented in SearchGUI, but it is only free for academics, not for commercial use. I am not going to spend much more time on de novo, because the only time you really want to use it is when you deal with peptides for which you don't have a database, and most people use proteomics on samples from organisms for which a good database exists. So, to recap: we have looked at amino acids, peptides and proteins, at mass spectrometers and how mass spectra are made, we talked a little bit about issues with quantification, and then we talked about identifying spectra, about how we figure out what we saw in the mass spectrometer. Now we are going to look at the whole list of identifications that comes out of an experiment. There are going to be some mistakes in there; the question is how many, can we control them, and can we even pinpoint roughly where those mistakes might be. That is essentially the false discovery rate calculation. If you take all the experimental spectra in a set, and this can easily be hundreds of thousands to millions of spectra, each of them is assigned its best possible hit; this is the top hit that we extracted, say, from X!Tandem, remember the hit with a score of about 80 that was so special. For that particular experimental spectrum, that peptide is the hit, but of course we have many more experimental spectra, so we have many more peptides: each spectrum has one peptide, or zero, and spectra with zero are simply not counted. Here you see the scores of all those top identified hits, and you can see that some have a really poor score, though not that many, quite a few have a mediocre score, and some have really good scores. Now, we know that some of these are going to be wrong. In fact, here we remove all the filtering we did previously; even in X!Tandem, all the E-value filtering is gone and all the E-values are represented. So we start here with very low E-values, and low E-values mean high reliability, and over here are the high E-values with the low reliability. The question is: we have to put an E-value cutoff somewhere, so that all the good stuff ends up to the right of the cutoff and all the bad stuff to the left. But how do we do that, given that we don't know where good or bad is?
Now, if you have learned anything from my explanation of how we do identification in proteomics by now, it is that when we need to find out where the good stuff is and where the bad stuff is, we somehow need an empirical null distribution. So how do we create an empirical null distribution here? Ideally, and this is purely theoretical, we would have perfect knowledge of where all the bad hits are and what their scores are (mostly low, with very few high-scoring ones), and where all the good hits are, some of which look like bad hits but most of which are special and do not. Mind you, this is what such a curve looks like if you have a good scoring function, and we do have decent scoring functions, so we expect it to look like that. Here we could set a threshold where the number of bad hits is small compared to the number of good hits. How small? We can decide that: we could say the ratio of bad versus good has to be less than five percent, for instance, or less than one percent, and then we choose where to put the line so that we achieve the desired condition. That is the theory. In practice, however, you may want a bit more fine-grained detail. As a morbid example, suppose that in the whole statistical genomics class there is one psychopath. That gives us a percentage of psychopaths in the class: if there are 35 students and one psychopath, the ratio is one in 35. But that does not help you figure out whom to avoid in the group, right? If instead I say: the psychopath is not online, the psychopath is participating in class in person today (which is a bit silly because we are doing this all online, but you get the drift), then that automatically makes it easier for you to limit the scope of where that particular person could be. And that is what we are aiming at here: limiting the scope of where the bad hits could be. The way we do that is by calculating a local false discovery rate, which means asking, at every step in this histogram, what the ratio of bad versus good is. For instance, over here we have practically no good hits and only a handful of bad hits, in fact zero good hits and say five or six bad ones, so the ratio of bad versus good is 100%: everything is bad. That continues for a while, up to the point where we start to see maybe one or two good hits among the bad ones, so the ratio becomes 99.9%. The ratio then keeps falling as more and more good hits creep in, until we reach the point where it is half and half: at that score, 50% of the hits are good and 50% are bad. That is very precise information, something you can work with. As you continue, you get fewer and fewer bad hits until all your hits are good and the ratio reaches 0%. If we estimate something like this empirically, we call it a local false discovery rate; if you had a model for it, you could call it a posterior error probability. I mention this just to remind you that these are essentially the same thing.
This local false discovery rate gives us, for every score, the probability of finding a false positive hit. Here, for instance, it is 50-50, here it is 0%, and over here it is 100% because everything is bad. So we know that this region is super reliable, this is very reliable, this is kind of reliable, this is 50-50, which is actually already quite bad, and this is 100% bad. Now we can again place a threshold, but we can base it on the sum of the local false discovery rates: in effect an integration exercise, a bit of calculus, because we fitted a model to this. That is all nice in theory; how do we do it in practice? Well, in practice, just like we always do, we make a lot of crap on purpose. If we make a lot of crap, we can see how crap behaves, and then we extrapolate: the made-up crap behaves like the real crap, so by looking at what the made-up crap does, we roughly know what the real crap is doing. How do you make crap? You make sequences that don't make sense. The easiest way, which everybody uses nowadays, is to take the sequence of a protein and flip it around; this reversed sequence is put into the database as a fake protein, labelled as such, and we call these fake proteins decoys. Of course we make one such fake protein for every real protein. A nice effect of reversing is that the amino acid composition stays exactly the same: if there are three N's in the real protein, there are three N's in the decoy; two A's here means two A's there; and residues that tend to co-occur next to each other, like those two N's, still co-occur in the decoy, and my A's that tend to sit close together still sit close together. So we maintain a lot of the features of the sequence, all the patterns remain, which is very important, because messing with those patterns can have consequences. You could also shuffle the letters, think of them as Scrabble tiles that you put in a shaker and roll out in a new order, but then you destroy a lot of those patterns (the A-R-T pattern, for instance, is gone), and shuffling is not reproducible either: if I shuffle and you shuffle, we get different decoys. Reversal, on the other hand, is reproducible and computationally trivial, so that is what most people use. You can do other, fancier things, but it is not worth it; people have tried, and reversal is as good as the rest, if not better. So reversing is what everybody does.
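Building such a concatenated target-decoy database takes only a few lines; here is a sketch with invented protein sequences and a hypothetical "DECOY_" accession prefix as the label.

```python
def reversed_decoy(sequence):
    """Reverse the protein sequence: same length, same composition, and the same
    local residue patterns (just read backwards), but a nonsense protein."""
    return sequence[::-1]

def concatenated_database(targets):
    """Append one labelled decoy per target to the real entries."""
    db = dict(targets)
    for accession, seq in targets.items():
        db["DECOY_" + accession] = reversed_decoy(seq)
    return db

# Made-up target proteins:
targets = {"PROT1": "MARTANNRTLK", "PROT2": "AAGELVISNNK"}
print(concatenated_database(targets))
```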
Now, with these reversed sequences, we plug them in together with the real search space, the actual database, as one concatenated database, and then we look at how often a spectrum matches a decoy, and at what score. So we have all the target hits, and we know that the target hits against the real database contain crap, we just don't know where it is; and we have the hits against the decoy database, and every decoy hit is crap by construction, that is the assumption, because we made the decoys to be crap. So let's look at where the decoys fall: look at that, the decoy scores start dropping off dramatically here, whereas the real data has a tail out here. So what can you do? You divide the estimated number of crap hits, which are the decoy hits, our empirical null, by the number of target hits, and that gives you your false discovery rate. And as you can see, the local false discovery rate starts falling here; at this point 50% of your hits are expected to be random, because the decoys make up 50% of the total, and over here it goes to zero, whereas back there it is 100% because you have as many decoy hits as target hits. So this is what we do: we append a decoy database to our searches, which lets us calculate the red curve, and at each point of that curve, together with the blue curve, we can calculate the local FDR; we then set the cutoff so that the summed FDR above that cutoff is no larger than our chosen threshold, for instance one percent, five percent, two and a half percent, whatever you choose. That is the way we do it in proteomics, and we will also look at this in the practical. Of course, once you set a threshold, you know there are true positives and true negatives, which you want to maximize, and there are going to be false negatives, which we really hate because that is good data we are missing, and false positives, which we hate even more because that is bad stuff we think is correct, and which will therefore mess up our downstream analysis. Usually people try to limit the false positives, because they are really annoying downstream. Mind you, if downstream you have a very fast, cheap and easy assay to discriminate between the two, then maybe that black line can shift all the way over here, because you don't mind still having to filter things out afterwards if you can do it cheaply, quickly and reliably. But if downstream you are going to invest an enormous amount of time and effort in validating each hit, maybe you want to move the black line over to this side, because you will not have time to follow up on more than this anyway, and that way every ounce of effort you invest is likely to be good effort. So where you put the threshold really depends on what you want to do with the data.
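In code, the global target-decoy calculation is essentially this: a sketch with simulated scores, where the FDR at a cutoff is estimated as the number of decoy hits above it divided by the number of target hits above it. Real tools refine this further (q-values, posterior error probabilities), but the principle is the same.

```python
import random

def fdr_filter(psms, max_fdr=0.01):
    """psms: list of (score, is_decoy) tuples, higher score = better.
    Keep the most permissive score cutoff whose decoy/target ratio stays <= max_fdr,
    and return the accepted target PSMs."""
    ordered = sorted(psms, key=lambda x: x[0], reverse=True)
    targets = decoys = 0
    cutoff = -1
    for i, (score, is_decoy) in enumerate(ordered):
        decoys += is_decoy
        targets += not is_decoy
        if targets and decoys / targets <= max_fdr:
            cutoff = i
    return [p for p in ordered[:cutoff + 1] if not p[1]]

# Made-up scores: targets tend to score higher than decoys.
random.seed(1)
psms = [(random.gauss(60, 15), False) for _ in range(900)] + \
       [(random.gauss(30, 10), True) for _ in range(900)]
print(len(fdr_filter(psms, max_fdr=0.01)))  # number of accepted target PSMs at 1% FDR
```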
Right, and now the very last bit, because this has gone on for long enough, I think you will agree: protein inference. I will very briefly explain what is bad, ugly and not so good about this. Fundamentally, in my hypothetical example, we have four peptides identified and three proteins in our database; all nice and simple, real life is much more complicated. Peptide A occurs in protein X, but we also find it in protein Y, so it is a peptide that matches more than one protein; B is found only in protein Z; C is found in X and Z; and D is found only in Z. Again, a very simplistic example. The question now becomes: given the identification of A, B, C and D, which proteins were in the sample? There are many ways to go about this, and fundamentally you have to make a choice based on philosophy. As you may know, there is something called Occam's razor, usually quoted as "the simplest solution is the best", something we use a lot in science, where we take a simple system and assume no crazy stuff happens. In this particular case Occam's razor says we should take only those proteins needed to explain all the peptides, and it turns out we can explain everything with protein X and protein Z: protein X explains peptide A, protein Z explains peptide B, either X or Z can explain peptide C, and protein Z explains peptide D. So we are done, we only need these two proteins, and we call this the minimal set of proteins that explains our peptide observations, the Occam set. However, there is a problem, because those of you who are perceptive and still awake will have noticed that you could also take protein Y and protein Z and still explain A, B, C and D. So why do we not take Y and Z as our identification? Well, as some of you will figure out, protein X has two peptides and protein Y has only one, so it is capitalism: the protein with more peptides must be better, right? And that is how most algorithms do it: this one is richer in peptides, therefore we like it better. That is a bit sad, because the peptide that protein X has and Y does not have is itself a shared peptide, so is it really from protein X? We don't even know. Nevertheless, that is how these algorithms work, and you can see that there are sometimes equivalent sets and it is not always that clear-cut; it is also algorithmically quite complex to calculate. Now, Occam's razor is a particular philosophical point of view, and for every philosophical point of view there is the exact opposite, which here is called anti-Occam. Anti-Occam says: wait a minute, why are you throwing protein Y under the bus? You are declaring a protein absent from the sample even though you have some evidence that it may be present; you are not God, you are not supposed to do that kind of thing; you have to include every protein for which you have tentative evidence. That is not an unreasonable claim, and we call the result the maximal explanatory set: you must include every protein with at least one matching peptide, shared or not. Notice that algorithmically this is ridiculously easy: protein X, does it have an identified peptide? Yes, it is in the list. Protein Y, does it have one? Yes, in the list. There is a third thing you can do: you can say we don't have enough information in the data alone. Suppose protein X turns out to be purely hypothetical, nobody has ever seen it, it is a pure prediction from the genome, we don't even have an mRNA for it, while protein Y is beta-actin, which is literally everywhere. Then we might say: we do not believe in hypothetical proteins when there is an explanation that makes a whole lot more sense; beta-actin is found by everybody all the time, so we claim that protein X is absent and protein Y is present, because we use additional evidence from knowledge acquired earlier. We call this a minimal set that retains maximal annotation, because we again exclude a protein, but on the basis of prior knowledge, and in a way you could call this the true Occam's razor.
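Using the toy example above (peptide A in X and Y, B in Z, C in X and Z, D in Z), here is a small sketch of the two extremes: the anti-Occam maximal set, and a greedy set-cover heuristic for an Occam-style minimal set, with ties broken in favour of the protein with more peptides overall, which mirrors the "capitalism" behaviour described above. This is an illustration, not the exact algorithm of any particular tool.

```python
from collections import defaultdict

peptide_to_proteins = {
    "A": {"X", "Y"},
    "B": {"Z"},
    "C": {"X", "Z"},
    "D": {"Z"},
}

# Anti-Occam / maximal explanatory set: every protein with at least one peptide.
maximal_set = set().union(*peptide_to_proteins.values())

# Occam-style minimal set: greedy set cover, preferring the protein that explains
# the most still-unexplained peptides (ties go to the protein with more peptides overall).
protein_to_peptides = defaultdict(set)
for pep, prots in peptide_to_proteins.items():
    for prot in prots:
        protein_to_peptides[prot].add(pep)

unexplained, minimal_set = set(peptide_to_proteins), set()
while unexplained:
    best = max(
        protein_to_peptides,
        key=lambda p: (len(protein_to_peptides[p] & unexplained), len(protein_to_peptides[p])),
    )
    minimal_set.add(best)
    unexplained -= protein_to_peptides[best]

print(sorted(minimal_set))  # ['X', 'Z']
print(sorted(maximal_set))  # ['X', 'Y', 'Z']
```

The knowledge-based variant from the lecture would simply add a further step that prefers well-annotated proteins over purely hypothetical ones before reporting the set.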
Why do I call this the true Occam's razor? Because what you have effectively done is eliminate hypothetical entities in favour of real, known, solid entities, and if you look up Occam's razor, the original statement is not "the simplest solution is the best" but rather the solution with the fewest hypothetical entities, and in a way that is exactly what you do here. So this is something I personally think is a much better way of doing it; unfortunately I have no hard evidence to back that up, I have some, but I don't have time to go into it, and it is also not that important. What you need to remember is that the list of proteins you end up with is different in every single scenario: they are all different, and that difference does not come from the data, it comes from your philosophical point of view. And that is a problem, because it means you cannot reproduce the result unless you know which philosophy was behind it. So protein inference is a really big problem. Why do we make such a fuss about it? Well, if you want to quantify a protein, you first need to know whether the protein is really there: in one scenario you will not quantify protein Y even though it may very well be there, in another you will quantify all proteins even though protein X may not even be there. And then there is peptide A: do you let it contribute to protein X, to protein Y, or to both? It becomes very complicated once you go into quantification, and as a result some quantification algorithms only take into account peptides that match uniquely to one protein, like B and D, and say about the others: we don't know what to do with them, they confuse us, leave them alone. Now, how big is this problem of peptides matching multiple proteins? It depends on whether you care about isoforms: some of these databases include isoforms and some do not. This is all explained in a review from 2013, but to make a long story short: in SwissProt, which for human has one protein per gene, so no isoforms, 92% of all peptides match only one protein. So if you pick a random peptide, there is a 92% probability that it does not match more than one protein; situations like B and D cover 92% of that database, and only 8% of the peptides are shared. The vast majority of peptides give you no problem at all. If you do care about isoforms, however, only somewhere between 26% and 36% of peptides, say roughly one in three to make it easy, match a single protein; the rest match multiple proteins, but these will most likely be isoforms, proteins spliced differently from the same gene. So if you care about the difference between splice isoforms, life can get very difficult in quantification; if you do not, life is going to be reasonably good. It really depends. PeptideShaker will show you this: a blue circle is a peptide and a red circle is a protein. This is a typical SwissProt situation: all of these peptides map beautifully to just that one protein, perfect and simple. This one is okay: most of the peptides match only this protein, and although some could also match other proteins, the evidence here is so overwhelming that we tend to believe this protein is present and those others are not.
This one, however, is really nasty, because this single peptide could match any of these proteins, and that doesn't really help us at all. This one is also nasty, because each of these peptides matches very many proteins, so which protein have you actually found? It is practically impossible to tell, so this is really bad, although it does look like a nice bunch of balloons. And this one looks simple but is nearly a fully connected graph, which means that these three peptides will not allow us to differentiate between the proteins, so if you come across something like this, quantification is going to be extremely difficult. The same goes here, and here, and here; but in this case it is easy: worst case you drop these three and these two peptides from the quantification, and you still have plenty of data left for this protein. So that is just to show you how PeptideShaker displays this, so that you can estimate how bad the problem is. In fact, back in 2010 we built a tool to look at protein inference issues, and what is relevant here is this: you see the protein sequence with the identified peptides on it, and when a peptide is coloured blue it matches only this one protein, while a yellow peptide matches more than one protein. Now look at the ratios, say patient versus control, for each of these peptides; remember we always compare a peptide with itself. The ratios between patient and control for these three blue peptides are the three lines here, and you can see they are very closely matched, which means the value is probably correct, and probably correct for the protein, because each of these peptides is found only in that protein. Now look at the yellow peptides: we can calculate the same patient-versus-control ratio for them, and they all end up over here. This, by the way, is the global distribution of all peptides of all proteins, and you can see that most peptides do not change; they sit at zero on this log-ratio scale. The blue peptides are actually going down a bit and the yellow ones are going up a bit, so they sit on completely opposite sides of the centre point, and that is probably because what we measure for the yellow peptides is the superimposed signal of two or more proteins: you see the signal of the original protein, but you also see, possibly more strongly, the signal of the other protein that shares the peptide, which may actually be up-regulated, whereas this one is probably down-regulated, although not significantly in this particular case. So these shared peptides completely alter your perception of the regulation of the protein compared to the unique peptides, and in green you can see the median and the average over all the peptides, which are heavily skewed by the yellow ones. In this particular case you would definitely do best to ignore all of the yellow peptides, because they do not give you an accurate representation of the quantification of that protein when you compare healthy and diseased. Right, that is all I have for you, thank goodness, because this has been going on for long enough. Thank you very much for watching this class. I hope to see you in the practical, in real life or online, and if you have any questions about this lecture, make sure to ask me those questions then,
or otherwise you can always forward them, either through Professor Clement or by sending me an email. Thank you very much for your attention, and with that we are done.