Welcome to the MOOC course on Introduction to Proteogenomics. In the last lecture, Mr. David Campbell began discussing SWATH Atlas and various comparisons of DIA versus DDA methods. He also introduced the concepts of DIA and the software tools available for analyzing DIA data sets. SWATH Atlas contains high-quality ion libraries for use in SWATH or DIA experiments. In today's lecture you will be provided an overview of the features available in SWATH Atlas and how you can utilize this valuable resource for analyzing your mass spectrometry data. So, let us welcome again Mr. David Campbell for today's lecture. So, this is DIALib-QC. Basically, what we have is an ion library, in any of a variety of different formats, plus a SWATH file. And the SWATH file basically says, okay, I am going to look from this mass to this mass, and it defines the width of each bin. You can also compare the library to a proteome: you take a proteome, digest it in silico, and you have all the tryptic peptides. You run this tool and you get an ion library summary in tabular format. And these are the different criteria. Library complexity is essentially how big the library is: how many peptides, how many peptide ions. There is precursor information: what is the average length of the peptides in the library? How many modifications are there, and what kinds of modifications? Fragment characteristics: which ions do you have, b or y? How many fragments do you have per peptide? Retention time is a very important part of any sort of targeted or scheduled analysis; you don't want to look over the entire retention time range. So how much does it vary? Is it consistent? Do you have marker peptides, or these things called iRT peptides? Library completeness: you compare to a proteome and ask how many of the proteins in that proteome the library covers. And finally, library correctness.
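The SWATH-file idea described here, a set of fixed-width precursor isolation bins spanning a mass range, can be sketched in Python. All names and numbers below are illustrative, not part of any real tool:

```python
def make_swath_windows(mz_start, mz_end, width):
    """Return a list of (lower, upper) precursor isolation windows."""
    windows = []
    lower = mz_start
    while lower < mz_end:
        windows.append((lower, min(lower + width, mz_end)))
        lower += width
    return windows

def window_for_precursor(windows, precursor_mz):
    """Find the index of the window that will isolate this precursor."""
    for i, (lo, hi) in enumerate(windows):
        if lo <= precursor_mz < hi:
            return i
    return None  # precursor falls outside the acquired mass range

# e.g. 100 windows of 8 m/z covering 400-1200 (hypothetical scheme)
windows = make_swath_windows(400.0, 1200.0, 8.0)
idx = window_for_precursor(windows, 652.3)
```

A precursor at 652.3 m/z lands in the 648–656 window under this scheme; anything outside 400–1200 is simply never sampled, which is why the library's mass ranges must match the SWATH file.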
So, it turns out that there are certain things relating to the SWATH file, or to the actual m/z values being reported: is this library in fact correct? Basically, I'm going to go through these very briefly, and I have two different libraries depicted here. This is the PHL, the Pan-Human Library, which was assembled from all these different experiments. Everything was run on a pretty modern instrument, the AB SCIEX TripleTOF 5600, and you can see that there are some 211,000 peptide ions (again, a peptide ion is a primary sequence plus modification plus charge) and almost 3 million fragments. The other library is one where I took the PHL and applied our SWATHs. When they did DDA, they were looking at wider m/z ranges. We, in contrast, were using these 100 SWATHs, so we don't want any fragment ions landing in our precursor windows, and what's more, we specifically took only the top 6 fragments. And you can see that going from here to here, we didn't lose that many peptide ions. So by applying somewhat more stringent mass filters, we didn't lose that many ions, but if you look at the number of fragments, we have far fewer, only something like 35 or 38% of the fragments, and that's because now we only have up to 6 fragments per peptide. And actually, I've gotten ahead of myself here. Confusingly, I've put it down in this section: this is the number of stripped peptides, as we would say, just the number of distinct peptide sequences, and again these two numbers are pretty close. But if you compare 149,000 to 211,000, the extra entries are peptides we saw, say, both oxidized and unoxidized, or in multiple charge states of the same peptide. That's actually pretty common, especially plus 2 and plus 3.
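The reduction just described, keeping only the top 6 fragments and dropping any fragment that falls inside the precursor's own isolation window, can be sketched as follows (function names and numbers are hypothetical):

```python
def filter_fragments(fragments, window, top_n=6):
    """Keep the top_n most intense fragments outside the precursor window.

    fragments: list of (mz, intensity) pairs for one peptide ion.
    window:    (lower, upper) bounds of the precursor's SWATH window;
               fragments inside it would sit on unfragmented precursor signal.
    """
    lo, hi = window
    kept = [(mz, inten) for mz, inten in fragments if not (lo <= mz < hi)]
    kept.sort(key=lambda f: f[1], reverse=True)  # most intense first
    return kept[:top_n]

# illustrative fragment list; the 650.2 ion falls in the 648-656 window
fragments = [(300.1, 50), (650.2, 900), (820.4, 700), (455.7, 300),
             (910.8, 650), (1020.3, 120), (501.9, 80), (777.1, 400)]
top6 = filter_fragments(fragments, (648.0, 656.0))
```

This shows why the fragment count drops to roughly a third while the peptide-ion count barely changes: each peptide keeps at most 6 of its (on average ~13) fragments.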
So again, the peptide characteristics are how many peptides we saw, what sorts of modifications, what percentage of the peptides are modified, things like that. This is a big one: the number of fragments. As I told you, in the PHL the average number of fragments is about 13, whereas in the new library it's exactly 6. That's because we were comparing these different software tools and we wanted to ensure that they all used the same peaks in their analysis; most of them use 6 fragments by default, so we limited our library to 6 so that they would all be on equal footing. You can also see the percentage in charge 2 and charge 3, and sometimes, going from the full library to the modified library, you do see a swing in those percentages. I don't really know why that would be, but you might expect to see fewer charge 3 ions, because the m/z might now fall below the lowest m/z value that we're looking for, since something divided by 3 is obviously smaller than something divided by 2. It's a useful sanity check to compare against what you're expecting: are the mass ranges in my library what I would expect? Here's a difference: you can see that in their analysis they had Q3s, fragment ions, up to 2,000, whereas we only took fragment ions up to 1,500, because that's what we look at in our SWATH. There's really not much above it, so we don't bother with it; we just look up to 1,500 because, again, we want to keep our cycle time short so that we can sample across the peak. So there are all these different measures, and I'm sure those of you in the back can't even read them, but if you're trying to assess the validity of your library this is actually very helpful, and we have had a number of occasions where it has helped us a lot. So again, retention time is very important for analysis.
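The mass-range sanity check mentioned here can be reduced to a one-liner: count library fragment m/z values the acquisition method will never record. The 350–1500 bounds below are illustrative, matching the trimmed library discussed above:

```python
def out_of_range_fraction(q3_mzs, mz_min=350.0, mz_max=1500.0):
    """Fraction of fragment m/z values outside the acquired Q3 range."""
    bad = sum(1 for mz in q3_mzs if mz < mz_min or mz > mz_max)
    return bad / len(q3_mzs)

# hypothetical fragment m/z values; two fall outside 350-1500
q3 = [420.2, 980.5, 1480.9, 1750.3, 300.1]
frac = out_of_range_fraction(q3)
```

A non-zero fraction here means the library promises fragments the instrument never measured, exactly the kind of red flag the QC report surfaces.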
We're using something called iRT peptides. iRT peptides are reference peptides that you're meant to spike into pretty much all of your runs, and they help you register the retention times between runs. It just so happens that their scale goes from negative to positive, so the reason these values range from negative 60 to 180 is that they've been converted into what we call iRT space. That's a little unusual, but it's a dead giveaway: if you're looking at a library and these ranges don't make sense, it's a red flag to investigate. One other thing: there's a measure of consistency. As I mentioned, sometimes you see multiple charge states of the same peptide, and it turns out that when you make a spectral library, the software treats every entry independently, so you can end up with different retention times for the plus two and plus three of the same peptide. Now, that doesn't make physical sense, because when your peptide is eluting, it's not like your plus three elutes here and your plus two elutes there. Really, the peptide elutes, and then ionization produces either a plus two or a plus three. So you would hope that your plus twos and plus threes would be very close to each other. But it turns out they're not always, and in some cases we have found libraries where these numbers are very different. So here are two libraries that are in SWATH Atlas, and what we're looking at is plus two versus plus three charges. This one has a very good correlation: having used the iRT peptides, everything is pretty much right along the diagonal. Basically, we take all the peptide ions for which you have both the plus two and the plus three of exactly the same modified peptide and plot them against each other. In a perfect world, they would lie right on the diagonal.
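The conversion into iRT space is, at its core, a linear fit: the spiked-in reference peptides have defined iRT values, you fit measured RT against those, and then map every library RT through the fitted line. A minimal sketch, with hypothetical reference values:

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# measured RT (minutes) of reference peptides vs. their defined iRT values
# (both columns are made-up numbers for illustration)
measured = [12.0, 20.0, 35.0, 50.0, 68.0]
irt_ref  = [-24.9, 0.0, 42.3, 70.5, 100.0]
a, b = fit_linear(measured, irt_ref)
irt_of_30min = a * 30.0 + b  # any peptide's RT can now be mapped to iRT
```

Because the map is gradient-dependent, an iRT-converted library can legitimately contain negative retention values, which is why a negative-to-positive RT range is a giveaway rather than an error.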
But you see that there are some outliers, and the width of this band is fairly big. So you could look at this and say, well, that's actually plus or minus five minutes, or plus or minus three minutes, something like that, and that can help you decide how wide a window to use when you're trying to look for these peptides. This is a yeast library, and it actually has a pretty strong correlation as well; everything lies a little more tightly along the diagonal, but there are some more pronounced outliers. So this is just an example of how iRT can vary, and it illustrates why we look at this in our libraries. Here's another thing: in addition to storing the retention time, these libraries store what they consider to be the best spectrum. We've taken between one and a hundred different spectra and made the best consensus that we could. Alternatively, SpectraST will pick the one spectrum that either has the highest signal or the best signal-to-noise, depending on how you have it set up. And all this shows is that if you pick pairs of peptide ions, in blue, where the difference between the minimum retention time seen for that ion and the maximum retention time seen for it is small, then you get a much better correlation between the median retention time and the best-replicate spectrum. Library completeness is basically comparing against a proteome. Most of you, if you're working with some organism, have a proteome or a genome that you want to compare to. This check just says: okay, of the peptides I might expect to see, of the proteins I might expect to see, how many of them do I actually see in this library? It gives you an idea of how complete the library is. And for most of these measures, the two libraries, even though we've cut one down, don't really vary that much.
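The plus-two versus plus-three consistency check is straightforward to implement: pair up charge states of the same modified sequence and flag pairs whose retention times disagree. A toy version (all sequences and thresholds are hypothetical):

```python
def charge_pair_deltas(library):
    """library: dict mapping (modified sequence, charge) -> retention time.

    Returns, per sequence, the absolute RT difference between its
    +2 and +3 entries, for sequences that have both.
    """
    deltas = {}
    for (seq, z), rt in library.items():
        if z == 2 and (seq, 3) in library:
            deltas[seq] = abs(rt - library[(seq, 3)])
    return deltas

lib = {("PEPTIDEK", 2): 34.1, ("PEPTIDEK", 3): 34.6,
       ("ELVISLIVESK", 2): 58.0, ("ELVISLIVESK", 3): 71.2,
       ("SINGLEK", 2): 12.5}  # only one charge state: not comparable
deltas = charge_pair_deltas(lib)
outliers = [s for s, d in deltas.items() if d > 5.0]  # e.g. > 5 iRT units
```

Since both charge states come from the same eluting molecule, a large delta points at the library build, not the chemistry.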
One thing that's interesting to note is that the median number of peptides per protein is seven, but the average is more like 17. It turns out that some proteins are heavily over-represented in the PHL: you might have up to, say, 100 peptides for a single protein, including some that are semi-tryptic or have missed cleavages. Normally, for SRM, you wouldn't want to use that type of peptide; you'd want the best representative peptides for each protein. So that's one thing we might want to do with the PHL: narrow it down a little. And I'll show some more evidence for that later. One caveat: this is the number of tryptic peptides, that is, the number of perfectly tryptic peptides in the library that match the perfectly tryptic peptides of the proteome. So about 20% of the peptide-level proteome is covered in this library. We know that about 12,000 out of 20,000 proteins are covered, so we've only covered 60% of the proteins, and for those proteins we're not getting complete peptide coverage. You might think this number is pretty small compared to the 500,000 possible tryptic peptides, and you may well be right; it may be a function of the quality criteria that were applied here. But it's a very rare mass spec experiment, or set of experiments, that can see 100,000 different peptides from human. Then there are things that are actually wrong with the library. You might not think it's possible, but you can actually have improper m/z values. And because we're accepting submissions at SWATH Atlas, where people submit a library and we then provide it to the community, we want to make sure there's nothing wrong with it. By and large, these numbers are all okay. These last items compare the SWATH file that you provided with the actual library.
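The completeness check amounts to an in-silico tryptic digest of the proteome followed by a lookup against the library. A simplified sketch (cleave after K/R unless followed by P; length bounds and sequences are illustrative):

```python
def tryptic_digest(sequence, min_len=7, max_len=30):
    """Fully tryptic peptides: cleave after K or R, not before P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR" and (i + 1 == len(sequence) or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])          # C-terminal tail
    return [p for p in peptides if min_len <= len(p) <= max_len]

def protein_coverage(proteome, library_peptides):
    """Fraction of proteins with at least one tryptic peptide in the library."""
    hit = sum(1 for seq in proteome.values()
              if any(p in library_peptides for p in tryptic_digest(seq)))
    return hit / len(proteome)

proteome = {"P1": "MKTAYIAKQRPEPTIDESAMPLEK",   # toy sequences
            "P2": "GGGGGGGKAAAAAAAR"}
cov = protein_coverage(proteome, {"QRPEPTIDESAMPLEK"})
```

Note the R followed by P in P1 is not cleaved, which is exactly the kind of rule that makes "expected peptides" differ between tools.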
Generally speaking, you look at all these measures and there are four or five of them that stand out, and as somebody who works with these libraries quite a bit, I have used this a number of times. Anyway, I know that's pretty dry, so here's a bit of a description of why we want good libraries. As a test, I took the library we've been talking about, which is a subset of the PHL, and these are its minimum, maximum, and median retention times. I did two different things. First, I randomly added or subtracted up to ten minutes from every peptide's retention time: maybe I added one here, subtracted three there, added ten, subtracted two, so it's randomized. You can see that the minimum goes down by about ten, the maximum doesn't really go up, and the median stays the same, because I've added about as much as I've subtracted. In the second case, I applied a fixed RT shift: every peptide in the library gets ten minutes added. Then I used each library for a SWATH analysis to see how the software handles it. So between these two, this is the baseline library; in the random case each individual change averages maybe five minutes in magnitude but goes both ways, and in the fixed case everything has moved by ten minutes. Which one do you think gave the worse results: the random plus-or-minus shifts, or the fixed ten-minute shift? Who thinks the random one would be worse? Who thinks the fixed one would be worse? Who has no idea? Okay, no one's going to play. The fixed one is the problem; that is correct. So, this is a so-called box-and-whisker plot, and in a box-and-whisker plot this line is the median of your data.
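The intuition behind the answer can be checked with a tiny simulation. Assume the search software extracts each peptide inside a retention-time window around the library value (here plus or minus 5 minutes; all numbers are invented): a fixed 10-minute shift moves every peptide outside its window, while random shifts of up to 10 minutes leave roughly half recoverable.

```python
import random

random.seed(0)  # deterministic toy data
true_rts = [random.uniform(10, 110) for _ in range(1000)]  # minutes

def recovered(library_rts, window=5.0):
    """Count peptides whose true RT still falls in the extraction window."""
    return sum(1 for true, lib in zip(true_rts, library_rts)
               if abs(true - lib) <= window)

random_shift = [rt + random.uniform(-10, 10) for rt in true_rts]
fixed_shift  = [rt + 10.0 for rt in true_rts]

n_random = recovered(random_shift)  # roughly half survive
n_fixed  = recovered(fixed_shift)   # none: every RT is off by 10 > 5
```

The summary statistics are misleading here: the random library looks more corrupted (its min and max move), yet the fixed shift, invisible to the median, is what destroys the analysis.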
This is the first quartile, this is the third quartile, and these are the minimum and maximum, although there is some outlier detection done, so sometimes you can get points below the minimum whisker or above the maximum. Basically, it shows that the library we started with gives very tight results. Across 15 different samples, we're counting how many HeLa peptides we see, and each of our analyses came up with about 18,100, very tightly grouped. When we applied the two retention time perturbations, we saw a significant decrease, not only in the number of peptides we see, but also in the consistency across replicates. In a similar vein, we took the Q1 and Q3 values and modified the Q1s by 40 ppm, or modified the Q3s by 40 ppm. Which would have more of an effect: perturbing the Q1s, where we're taking these big Q1 chunks and fragmenting everything in them, or perturbing the Q3s, which are what we're actually reading out? Who thinks the Q1 will be worse? The Q3? And that is exactly right. With the Q1s, you're not really doing much; maybe for a couple of them you'll move a precursor from one SWATH to another, but most of them stay in the same SWATH, so it doesn't really matter: you're still looking at the same ions in the same SWATH. But in the case of the Q3s, a change of 40 ppm causes a drastic decrease in the number of peptides observed, although it didn't affect the reproducibility as much. Now I'm going to talk about one real-world example. We took HeLa cells. Do you know what HeLa cells are? It's one of the oldest cell lines; it was derived back in the early 1950s from a woman's cervical cancer.
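The Q1-versus-Q3 asymmetry follows directly from the numbers, and a short sketch makes it concrete. With wide precursor windows (8 m/z here, hypothetical), a 40 ppm Q1 error of roughly 0.03 m/z rarely crosses a window edge, but fragment extraction uses a tight ppm tolerance (20 ppm assumed below) that a 40 ppm error always exceeds:

```python
def shifted(mz, ppm):
    """Apply a mass error of the given size in parts per million."""
    return mz * (1 + ppm / 1e6)

def still_same_window(mz, ppm, width=8.0, mz_start=400.0):
    """Does a shifted precursor stay in its fixed-width SWATH window?"""
    return (int((mz - mz_start) // width)
            == int((shifted(mz, ppm) - mz_start) // width))

def still_extracted(mz, ppm, tolerance_ppm=20.0):
    """Does a shifted fragment stay inside the extraction tolerance?"""
    return abs(shifted(mz, ppm) - mz) / mz * 1e6 <= tolerance_ppm

q1_ok = still_same_window(652.3, 40.0)  # 40 ppm of 652 m/z is ~0.026 m/z
q3_ok = still_extracted(652.3, 40.0)    # 40 ppm exceeds a 20 ppm tolerance
```

So the Q1 perturbation mostly leaves peptides in the same window, while the Q3 perturbation pushes every fragment trace outside its extraction tolerance.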
It was essentially the first human cell line to become available, and it's still in use today. We then spiked in these Halo peptides at different concentrations. There's nothing special about the Halo peptides; they're just from a different organism than human. What we wanted to see is: with this complicated background, can we detect the Halo peptides that we spiked in? There were five different dilutions, and in the data I'm going to show, they go from four femtomoles to one nanomole, with three replicates of each, tested on nano-flow and micro-flow LC, on two different instruments, interpreted with four different software tools. So if you count all the replicates, machines, and so on, there are actually 360 different measurements for each point. We ran this against the PHL, and this is what we saw. These are the two-fold dilution ratios, log transformed, because when you have ratio data you always take the log, and ideally this would be right at two. What we've done is take the ratio of everything to the most concentrated sample: this E sample has the most in it, so pretty much anything in D is also going to be in E, and anything in C is also going to be in E. So this is the most concentrated, then less and less concentrated down here. By the time we get to the lowest femtomole sample, we only see four or five; the number we actually see is shown here. Out of the 475 peptides we spiked in, we're only seeing 204, and counting peptide ions there are in excess of 500 possible things, so we're only seeing about half of them, really 40%. We looked at that and thought, well, that's not too good. The quantitation was reasonable, everything was lining up where it should on those axes, but we had poor agreement between the replicates.
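The readout described here is just a log ratio of each sample's measured intensity against the most concentrated sample. A minimal sketch, with entirely made-up intensities for one spiked peptide across the five dilutions:

```python
import math

def log2_ratio_to_top(intensities, top="E"):
    """intensities: dict sample -> measured intensity for one peptide.

    Returns log2 of each sample's ratio to the most concentrated sample;
    the log is taken because ratio data is multiplicative, not additive.
    """
    return {s: math.log2(v / intensities[top])
            for s, v in intensities.items()}

# hypothetical intensities for samples A (most dilute) through E (top)
obs = {"A": 41.0, "B": 160.0, "C": 610.0, "D": 2500.0, "E": 10000.0}
ratios = log2_ratio_to_top(obs)
```

For a perfect 4-fold dilution series, each step down would land at a log2 ratio of exactly minus two relative to the step above, so deviations from those flat lines directly visualize quantitation error.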
So what we did is look across all the different experiments and pick the peptides that were seen in all three replicates of any one technique or software, and we created a subset library from that list. The PHL had about 147,000 peptides; our new library had 40,000 peptides. So we've taken a repository library and focused it on our sample. The reason for this relates to the way people do big data nowadays. You take all these measurements and you don't try to threshold by hand; you don't open Excel and pick a cutoff. What you do is build some sort of statistical model and try to separate the true positives from the negatives. As you push in one direction, you get more and more sensitive and pick up more things you might otherwise have missed, but as you go the other way, you lose specificity, because you start letting negatives in with your positives. So you try to pick a point where you have an acceptable FDR. The thing is, with a big SWATH library, you don't expect to see everything in it. All these tools use decoys, and to model the statistics properly you expect roughly a one-to-one ratio of decoys to detectable targets. But if you don't see 60% of your SWATH library and you make decoys for 100% of it, then among the entries that can actually match you have only about 20% true forwards and about 80% decoys. Because you don't see everything in your library, having a big library in effect inflates your number of decoys, and that is hard to model. So by cutting the library down, we hoped we'd get a better result, and it turned out that we did. This is the same data analyzed with the focused library, and you can see that now we see in excess of 500 things, and they're still pretty close: this is close to minus two, and this is close to minus four.
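The decoy argument is worth making explicit with arithmetic. Assuming one decoy per target and that only a fraction of the library's targets are truly present in the sample (both assumptions, and all numbers, are illustrative):

```python
def decoy_fraction_among_matchable(library_size, present_fraction):
    """Share of 'matchable' entries that are decoys.

    Decoys are generated for the whole library, but only the targets
    actually present in the sample can produce true matches.
    """
    targets_present = library_size * present_fraction
    decoys = library_size  # one decoy per library target
    return decoys / (decoys + targets_present)

full    = decoy_fraction_among_matchable(147_000, 0.25)  # big repository library
focused = decoy_fraction_among_matchable(40_000, 0.90)   # sample-focused subset
```

With only a quarter of a large library present, roughly 80% of the matchable entries are decoys, the lopsided ratio described above; a focused library where most targets are truly present restores something close to the balance the error model expects.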
And we now see 23 of these lowest-concentration ones; recall we went from here to here. These lines along the bottom are a density plot: a kernel density estimate gives you an idea of how the points are spread out, because if you just look at this cloud of points, it's hard to tell. You can see that these distributions are a little misshapen, which is an indication that something is slightly off in your data; this one has a pretty pronounced shoulder. Sometimes what we do is cut the data off at a coefficient of variation of, say, 20%, so that we exclude problematic data. But even doing that, we get a much cleaner peak, the density plot becomes more nicely shaped, and we still have far more identifications. So by using a smaller library, we've achieved better quantitation on more peptides. And this is another way to look at the same thing, looking at just the HeLa background. The previous plots were the spiked-in peptides; this is only the HeLa background. Even in the HeLa peptides, here we saw on average about 14,000, but here we saw over 16,000 on average. As I said, the whiskers are usually the max and min, but because of outlier handling you sometimes have outliers out here. So this is an indication that it pays to focus your library. We think library resources are good; it's good because you don't have to do all your own DDA, but it does require some customization, and I think that feeds in well with the proteogenomic theme. SWATH is an important MS technique. The data can be re-analyzed forever, you get better depth than DDA, and since you can re-analyze forever, as our knowledge of proteome space and our libraries improve, you'll be able to re-analyze the same data and get better results. Library quality does matter.
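The CV filter mentioned here is simple to state precisely: drop any peptide whose replicate coefficient of variation exceeds 20% before looking at the ratio distributions. A sketch with invented replicate intensities:

```python
import statistics

def cv(values):
    """Coefficient of variation across replicates, as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# hypothetical intensities for three peptides across three replicates
replicates = {"pepA": [100.0, 104.0, 98.0],
              "pepB": [50.0, 90.0, 20.0],      # wildly inconsistent
              "pepC": [1000.0, 1010.0, 995.0]}
kept = {p: v for p, v in replicates.items() if cv(v) <= 20.0}
```

This is a deliberate trade-off: you discard some real measurements, but the peptides that survive quantify consistently, which is what tightens the density plot.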
I really think that Docker is a useful thing, especially for somebody who's not that technically savvy; it's pretty easy to install, and then you just browse around for things and run them. Comprehensive resource libraries save time, since you don't have to do your own DDA, but they should be tailored before use. And again, I think that leads into proteogenomics: what if you had some genomic data, and you took one of these resource libraries and said, okay, my genomic data tells me that I should see these particular things? Now you can make a much more focused proteomics library, and you have a better chance of actually seeing what you're looking for. In today's lecture, you were introduced to the library assessment features of SWATH Atlas, which provide recommendations to improve DIA experiments. The DIA library QC workflow considers features such as library complexity, precursor and fragment peptide characteristics, retention times, and library completeness while assessing an ion library. We hope you can now appreciate the use of resources like SRMAtlas, PeptideAtlas, and SWATH Atlas in carrying out high-quality mass spectrometry-based proteomics research. Thank you.