Welcome to the MOOC course on Introduction to Proteogenomics. We are now moving on to mass spectrometry based data analysis, and in this light a newer method which involves data independent acquisition, or DIA, will be discussed today. DIA is an MS technique which involves fragmentation of all precursor ions within a selected m/z range. It differs from data dependent acquisition, or DDA, where only selected precursor ions, chosen based on their abundance, are fragmented. In this lecture Dr. David Campbell will introduce you to the concepts of SWATH MS, a type of DIA method, and the various tools available for analysis of such data. So let us now welcome Dr. David Campbell from the Institute for Systems Biology, Seattle.

So I am going to talk basically about SWATH MS, and in particular some tools that we use to ensure that the libraries we use for SWATH MS are robust. That one tool was initially going to be the focus, but as I sat through the talks and as the schedule changed a little bit, I broadened it to include more general SWATH information. Before I begin I would like to thank some of the people involved. Mukul has done most of the mass spec work that I will talk about, Dr. Robert Moritz is the lab head, and both of them helped prepare some of the slides that I am going to show. So I would like to thank them in advance.

This is an overview of the talk. First of all I am going to try to tell you what SWATH, or DIA, is and why you would want to use it. I am going to talk a little bit about SWATH Atlas and PeptideAtlas and some related resources that are out there that you can use. You have already heard about some resources; there are a lot of people working to make this data available, and somebody who can take what is already available and leverage it for their own research is going to be ahead of their competitors. Then I am going to talk about the specific tool, and a little bit about what in an ion library we would want to make sure is proper, so that we can be confident in our SWATH analysis. Finally I am going to give you a short example of a SWATH analysis to try to illustrate some of the points that I have discussed, and along the way I am going to take a detour that occurred to me during Dr. Mani's talk yesterday.

So this is a depiction of SWATH MS, and each of these cylinders represents a proteome. The top of the cylinder is the more concentrated proteins, say in the micromolar range, and down at the bottom are the very dilute proteins, say in the femtomolar or attomolar range. These are the different techniques for mass spec proteomics, and I would just like to go through them briefly and illustrate some of the pros and cons. The one that we are most familiar with is called DDA. Basically, as the peptides are eluting from the column, you pick, generally speaking, the most intense one, you fragment it, and then you interpret those fragments; that is a spectrum. That is typical proteomics, and it is very good at seeing pretty much everything that is abundant. The vertical axis runs from most concentrated to least concentrated, and side to side basically means how many of the ions at that level you have been able to see. So DDA sees everything that is very concentrated.
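To make that selection rule concrete, here is a minimal Python sketch of DDA-style "top-N" precursor picking with dynamic exclusion; the function name, tolerances, and peak values are illustrative assumptions, not any instrument vendor's implementation.

    # Minimal sketch of DDA "top-N" precursor selection with dynamic
    # exclusion; all names and numbers are illustrative assumptions.
    def select_topn_precursors(ms1_peaks, n=10, excluded=(), tol=0.02):
        """Pick the n most intense precursors from an MS1 scan, skipping
        m/z values on the dynamic-exclusion list (recently fragmented)."""
        candidates = [
            (mz, inten) for mz, inten in ms1_peaks
            if not any(abs(mz - ex) <= tol for ex in excluded)
        ]
        # DDA chooses precursors by abundance: most intense first
        candidates.sort(key=lambda peak: peak[1], reverse=True)
        return [mz for mz, _ in candidates[:n]]

    ms1 = [(445.12, 9.0e5), (512.30, 4.2e6), (633.87, 1.1e6), (702.41, 3.0e5)]
    print(select_topn_precursors(ms1, n=2))  # -> [512.3, 633.87]

An inclusion list, discussed next, simply replaces the abundance ranking with a predefined set of target m/z values.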
If instead of selecting the most abundant precursor you have an inclusion list, a set of precursors that you are looking for to begin with, you can actually get deeper into the proteome. But it comes at the cost of losing some of the coverage at the higher concentrations. SRM, which was explained well this morning, carries that trend further still. It is very sensitive, but because you have a limited number of precursors you are trying to select, the mass spectrometer necessarily has to spend more time finding the fragment ions of interest. So you don't see nearly as many things, but you see them much more sensitively.

Almost all of these techniques that require precursor information rely, at least originally, on DDA experiments. There are a number of ways you can get deeper into a DDA experiment: by doing fractionation, by doing longer gradients, by doing different cell types, by doing subcellular fractionation. Basically, by really beating this technique to death you come up with inclusion lists for targeted methods, for SRM, and, it turns out, for SWATH. SWATH too, in the main way it is analyzed, depends on previous information.

SWATH does not get quite as sensitive as SRM, but you can see that it analyzes everything, and that is because whereas the other three techniques select one peptide as it elutes, fragment it, and interpret it, with SWATH you are taking chunks, and I will go through how we decide to do that, but you are taking big chunks of m/z range which contain hundreds and even potentially thousands of peptides, and you fragment them all. So you have fragmentation information on pretty much everything in your sample; right now you are only limited by the sensitivity of the instrument. Obviously in the future we would like to be able to analyze every peptide in the entire proteome, plus any post-translational modifications, but it is not clear how we are going to get there and we are certainly not there yet.

So the advantages of SWATH are that you get greater depth than DDA methods, you get more consistent breadth, meaning the broadness across all the proteins at a given level, and you are able to reanalyze the data pretty much forever. Because you are essentially fragmenting everything, you have information about everything. With the other techniques you have selected an ion and fragmented it, so you only have partial information; with SWATH you have complete information that you can reanalyze forever.

This is a depiction of what was gone over well this morning: in SRM you essentially have many chemical species eluting at a given time. You use one of the quadrupoles as a filter to isolate one particular ionic species, you collide it with a collision gas, and then you sequentially zero in on individual Q3, or fragment, ions to make an MRM assay. In contrast, with SWATH everything is eluting just the same, but you fragment multiple species at a time, actually many more than are depicted here, and then you read out all of the fragments, so you get a very convoluted, complicated spectrum at the end. The next slide is a depiction of just how complicated it is, and it also shows how the SWATHs work. The distance between here and here is called one cycle.
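As a concrete picture of those chunks, here is a toy Python sketch that generates the kind of fixed-width isolation window schedule described next; the start, stop, width, and overlap values are assumptions matching the 32-window layout on the slide.

    # Toy generator for a fixed-width DIA window schedule; adjacent
    # windows overlap by 1 m/z so precursors at the edges are not lost.
    def fixed_swath_windows(start=400.0, stop=1200.0, width=25.0, overlap=1.0):
        windows, lo = [], start
        while lo < stop:
            windows.append((max(start, lo - overlap), min(stop, lo + width)))
            lo += width
        return windows

    schedule = fixed_swath_windows()
    print(len(schedule), schedule[:2])  # 32 [(400.0, 425.0), (424.0, 450.0)]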
So initially you do one precursor scan to see what MS1 species you have, and then you step across a predefined mass range as quickly as you can; in this case we have depicted 25 Da SWATH windows, fixed width, all the way across, and then you go back. That's one cycle. But you have to consider how peptides elute. Peptides are coming off of the column: they start eluting, there's a maximum, and then they come back down to baseline. Ideally you would get multiple points across this peak, and because you're spending all of your time stepping through all these different SWATHs, you run the risk of not getting data points all the way across every peak.

This next plot depicts, in time versus m/z space, the fragment ions coming from just this one SWATH. These are not the precursors, even though the upper end of the range is the same; these are actually the fragment ions, and the fragment ions in any one vertical slice are likely to be related to each other. You see this smear across here: its width corresponds to the precursor isolation width, and it is bleed-over of precursor into your fragmentation spectra. For this reason, when you make a library you actually exclude any Q3s that fall within the SWATH window that you initially used, simply because there's so much background there.

This is a depiction of the actual SWATHs that were used. You see it goes from 400 to 425, then 424 to 450, so there's a little bit of overlap in each case, and there are 32 identical windows, and then you go back again. That was the initial way things worked, but the instrumentation companies came up with improvements in the software and in the way the instruments worked, and it turns out that if you look at normalized m/z space there are a lot more chemical species with m/z values between 400 and 1,000 than out beyond that. So really what you want are variable SWATH windows: down here you have a width of about six or seven units, the minimum width is right at the top of this peak, and when you get out here the width might be a hundred or so. The idea is to make sure that there's the same number of precursors in every window, so that you give your machine the best chance to analyze everything. Our SWATH setup uses a hundred different windows, and you can imagine that the machine has to be pretty fast to step over that: at, say, 25 milliseconds per window, 100 windows already take about 2.5 seconds per cycle, so a peak eluting over 30 seconds is sampled only about a dozen times.

Because the output is so complicated, you need a high resolution mass spectrometer, you need high resolution MS/MS, or else you'll have no way to really distinguish what you're seeing. You also need a fast cycle time, because you're stepping over all these SWATHs and you need to get up and down quickly enough to get reasonable sampling over any one peak. This is an idealized peptide elution profile; of course in reality these things all overlap, and at any one time many things are eluting. If you take human cells there's something like 800,000, or something on that order, possible peptides, and you're eluting them over 90 minutes off this column; even if you figure half of them are not there, that's a lot of things eluting at any one time. A rational selection of these Q1 windows, the variable-width windows I described, also goes a long way towards a successful outcome.
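Here is a minimal sketch of that variable-window idea, not the vendor's algorithm: split an observed precursor m/z distribution into equal-count quantile bins, so dense regions get narrow windows and sparse regions get wide ones (the simulated data and parameters are assumptions).

    import numpy as np

    def variable_swath_windows(precursor_mzs, n_windows=100, overlap=1.0):
        """Equal-count windows: quantile edges of the observed precursor
        m/z distribution put roughly the same number of precursors in
        every window; a small overlap is then added between neighbors."""
        mzs = np.asarray(precursor_mzs, dtype=float)
        edges = np.quantile(mzs, np.linspace(0.0, 1.0, n_windows + 1))
        return [(lo - overlap / 2, hi + overlap / 2)
                for lo, hi in zip(edges[:-1], edges[1:])]

    # Toy data: precursors concentrated between roughly 400 and 1000 m/z
    rng = np.random.default_rng(0)
    sim = rng.normal(650, 120, 5000).clip(400, 1250)
    for lo, hi in variable_swath_windows(sim, n_windows=5):
        print(f"{lo:7.1f} - {hi:7.1f}  (width {hi - lo:5.1f})")

With real data the dense middle of the distribution comes out as narrow windows and the sparse high-m/z tail as wide ones, which is exactly the behavior described above.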
So how do we analyze SWATH data? Most data is analyzed as depicted in the top row, and that's library based analysis: we take DDA data, we make a spectral library, and then we put the spectral library into a format usable by all these different software tools. Skyline has its own methodology for doing this, as does PeakView, so a number of tools can do this conversion for you. But as was pointed out, that requires that you do DDA runs. Ideally, when you want to do a SWATH experiment, you wouldn't have to go and run 50 or 100 DDA runs just so that you can make a library to analyze your samples. What if you're looking at a thousand human samples, or a whole bunch of different animals perturbed in different ways? It really wouldn't be reasonable to go and make all these libraries just for one-time use, and so we advocate the use of repository libraries that can be used to do these analyses. Our software can read and write any of these formats.

There's also another way, a library free method, and this one is from ISB. It's in the Trans-Proteomic Pipeline and it's called DISCO, the Data Independent Signal Correlator. Basically, at any given time it simply looks at the fragmentation pattern and looks for signals that are rising and falling in concert, in unison, and it assumes that those belong to the same chemical species, the same peptide. It also uses the precursor scans to watch the precursors come up and down, and it essentially makes a pseudo-mzML. mzML, I should explain, is a common file format used for encoding proteomics data: everybody has different instrumentation, some people have Thermo instruments, some have SCIEX, and mzML is an attempt to take all these independent vendor formats and put them in a single format that people developing software can use. So what DISCO does is basically mimic a DDA run, and then you do a normal search and come up with results.
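As a cartoon of that signal-correlation idea, and emphatically not the actual DISCO implementation, here is a Python sketch that groups fragment extracted-ion traces which rise and fall together; the correlation threshold and all data are assumptions.

    import numpy as np

    def group_coeluting_fragments(traces, min_corr=0.95):
        """Fragments whose intensity-over-time traces are highly
        correlated are assumed to come from the same precursor.
        `traces` maps a fragment m/z to a numpy intensity array."""
        groups = []
        for mz, trace in traces.items():
            for group in groups:
                seed = traces[group[0]]
                if np.corrcoef(trace, seed)[0, 1] >= min_corr:
                    group.append(mz)
                    break
            else:
                groups.append([mz])  # starts a new co-elution group
        return groups

    # Two fragments share one elution profile; a third elutes later
    t = np.linspace(0, 10, 50)
    early, late = np.exp(-(t - 3) ** 2), np.exp(-(t - 7) ** 2)
    traces = {244.17: 5 * early, 405.25: 2 * early, 518.33: 3 * late}
    print(group_coeluting_fragments(traces))  # [[244.17, 405.25], [518.33]]

Groups like these, tied back to a co-varying MS1 precursor trace, are the sort of thing from which a pseudo-DDA spectrum can then be assembled.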
So now I'm going to talk a little bit about PeptideAtlas and SWATH Atlas. PeptideAtlas is hosted at the ISB, and basically what it does is take mass spectrometry experiments from all over the world, import them, and analyze them in a consistent way, and it comes up with highly curated, highly confident sets of peptides. We strive for a very low error rate, and as a consequence we throw out a lot of data, but the data that is in PeptideAtlas is very good. It's broken down by tissue type and by organism, it's searchable, and it's a reasonable first place to start if you're embarking on a proteomics experiment. If you want to do proteogenomics and you don't have the know-how to do the analysis of certain cell types yourself, you can actually get some expression data directly from PeptideAtlas that may be useful.

PeptideAtlas is sort of the umbrella service here at ISB. There's also something called SRM Atlas, where in a similar vein we've taken SRM data from various labs, compiled it, made it available for people to use, and built an interface so that people can look at it. And SWATH Atlas is yet another data type: PeptideAtlas is DDA data, SRM Atlas is SRM data, and SWATH Atlas is SWATH data. PeptideAtlas is by far the most developed. SRM Atlas and SWATH Atlas provide a searching interface so that you can get the data that you want, and hopefully render it in some way that's useful for doing an experiment down the road. Both of these essentially generate the inputs for your own experimentation, and we use a common code base and a common back-end database to build them.

I would be remiss if I didn't mention the Human SRM Atlas. This was an effort in Robert's lab to cover all the proteins in the then-known proteome. To do that we basically took UniProt and made assays for at least five peptides per protein, based on peptides that had been seen in the past, and we actually used synthetic peptides, so that we had high confidence in what we were looking at; for longer proteins we took up to ten. This gives us the ability to look, in theory, across the entire proteome. In addition we did two other things that I think are useful in the context of proteogenomics: we looked at all the different alternatively spliced protein forms and picked peptides for those proteoforms, and we also looked at about 5,000 well conserved SNPs. So we have peptides for a lot of different things, and it's possible that would be useful in the context of proteogenomics.

This is SRM Atlas, and it represents what you would see if you had done a search for, say, this protein. Each one of these lines represents an SRM or MRM assay. An assay is basically a Q1 value and a Q3 value, that is, a precursor m/z and a fragment m/z, plus a retention time, as well as a relative intensity so that you can rank it relative to its neighbors. We have all these links; you can look at spectra and chromatograms. For this QTOF we ran all these different collision energies, because for any given peptide fragmentation there's an optimal collision energy, and you can see how the behavior of the fragment ions changes as the collision energy changes. This is an example of a chromatogram, following one of those links, and of course you can see spectra in here as well.
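In code terms, each of those assay rows boils down to a small record like the following Python sketch; the field names are illustrative rather than the SRM Atlas schema, and the values are approximate numbers for a common serum albumin peptide.

    from dataclasses import dataclass

    @dataclass
    class SRMAssay:
        """One SRM/MRM transition: precursor (Q1), fragment (Q3),
        retention time, and a relative intensity used for ranking.
        Field names are illustrative, not the SRM Atlas schema."""
        protein: str
        peptide: str
        q1_mz: float            # precursor m/z selected in Q1
        q3_mz: float            # fragment m/z monitored in Q3
        rt_minutes: float
        rel_intensity: float    # for ranking against sibling transitions

    assays = [
        SRMAssay("P02768", "LVNEVTEFAK", 575.31, 937.46, 38.2, 100.0),
        SRMAssay("P02768", "LVNEVTEFAK", 575.31, 694.38, 38.2, 61.5),
    ]
    # Rank transitions for a peptide by relative intensity, best first
    assays.sort(key=lambda a: a.rel_intensity, reverse=True)
    print(assays[0].q3_mz)  # -> 937.46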
SWATH Atlas right now has really three functionalities. First of all it's a repository: we have several different libraries from different organisms, rat, human, a number of bacterial organisms. The biggest one we have is the so-called pan-human library. It was an exhaustive effort done in Ruedi Aebersold's lab to sequence some 60 different samples, a number of cell lines, a number of primary tissues, a number of fractionation techniques, and it came up with assays for well in excess of 10,000 human proteins. Second, we allow people to work from one of these libraries: different people have different experimental setups, so you can provide certain information, such as which proteins you want or which mass ranges you want to use, and make a subset library that is more amenable to your research. And third, we have this tool, which I'll talk a little bit about, which is an assessment tool for an ion library. You can make your own ion library, in Skyline for example, and maybe you're not getting the results you wanted; well, you can upload it here and it will give you some feedback on exactly what's in the library, and you can see if there's something that indicates why it isn't working properly. Once you do this you get a very complicated output, which I'll break down a little bit so it doesn't look so daunting.

So I'll talk about this assessment tool. This is the workflow of the creation of libraries and then SWATH analysis. The sample prep and digestion here is what was talked about at length yesterday: you have to get the samples, you have to figure out how much protein is there, you have to know what kinds of contaminants you have, you may do fractionation. It's easy to put it on a flowchart, but there's an awful lot of work that goes into this. Then at some point somebody does DDA, uses a search engine, and you come up with a validated set of identified spectra. Those spectra are then combined into a spectral library, a consensus spectral library. What you do is take all the observations of a particular peptide ion, where a peptide ion is a peptide plus any modifications plus a charge, so if you have the same peptide at plus two and at plus three, those would be two entries in the spectral library. It turns out that when you're doing mass spec you often have noise in your peaks, and by combining all the different occurrences of a given spectrum in your data you actually attenuate the noise, because peaks that are real tend to add up while peaks that are just background noise tend to cancel out. This consensus library is really the basis for all of the inclusion-list, library based techniques.

The consensus library can be turned into an SRM library directly with SpectraST, or it can be made into an ion library, which is what we call the libraries used with SWATH MS. Generally we make the ion libraries as complete as possible and then apply our own particular criteria: every time you do mass spectrometry you have a mass range you're looking at, and you may have a preferred b or y ion composition, so you can apply these kinds of filters and make an appropriate subset for your data, along the lines of the sketch below.
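Here is a minimal sketch of that kind of subsetting, not the SWATH Atlas tool itself: an ion library held as a list of dicts is filtered by precursor m/z range and preferred ion series, then capped at a few transitions per peptide ion (all column names and thresholds are assumptions).

    def subset_ion_library(rows, mz_min=400.0, mz_max=1200.0,
                           ion_types=("b", "y"), max_transitions=6):
        """Apply instrument-specific criteria to an ion library: keep
        preferred ion series within the usable precursor m/z range,
        then keep the most intense transitions per peptide ion."""
        kept = {}
        for row in rows:
            if not (mz_min <= row["precursor_mz"] <= mz_max):
                continue
            if row["fragment_type"] not in ion_types:
                continue
            key = (row["peptide"], row["charge"])
            kept.setdefault(key, []).append(row)
        subset = []
        for group in kept.values():
            group.sort(key=lambda r: r["rel_intensity"], reverse=True)
            subset.extend(group[:max_transitions])
        return subset

    library = [
        {"peptide": "LVNEVTEFAK", "charge": 2, "precursor_mz": 575.31,
         "fragment_type": "y", "rel_intensity": 100.0},
        {"peptide": "LVNEVTEFAK", "charge": 2, "precursor_mz": 575.31,
         "fragment_type": "a", "rel_intensity": 12.0},  # dropped: not b/y
    ]
    print(len(subset_ion_library(library)))  # -> 1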
Then of course you do your DIA analysis with one of the tools that I mentioned.

Now, one thing I decided during Dr. Mani's talk, because it seemed like a lot of people were a little confused, and I think it is relevant for this audience: it turns out that SpectraST, the TPP mass spec proteomics tools, as well as this data set, are all available on Docker. For people that don't have a lot of software background: by all means, if there is a platform like Galaxy that has everything built in, use it. But if you want to download something onto your own computer and run it, Docker is an excellent way to do it, because basically all of the maintenance is handled for you and all you have to do is run it. So I thought I'd explain a little bit more what it is. You can see that this picture is meant to be a container ship and these are containers, but really the whale is a computer. It could be your computer, it could be a server in your department, but it's just a computer, and what it's doing is running little mini computers on it, and these running mini computers are called containers. So it runs one or more guests, each almost like a little computer, on a host computer.

There are two concepts in Docker. The first is images. An image is like a hard drive: it has an operating system like Linux and it has some software, so it knows how to do one thing really well, but it's just a file; it's handled by the Docker software, but it's just a file. When you actually start it, that's when you get a container, and that's what Dr. Mani was talking about. So you could have a little hard drive icon for each of these things: on your computer you could have a Protigy image, a TPP image, and a FireCloud image, and the neat thing is that on any given computer you can actually run many of these things at once. The image is just a set of instructions, like a little computer that you can run on your computer, but you can also distribute it. We were talking about FireCloud: they can take their software, this little container, and put it on as many nodes as they want, or you could run multiple jobs on your own computer.

But why is it so neat? From a non-software-expert perspective, I think the neatest thing about it is that the developer takes care of it. You all tried to install RStudio and Protigy and you had issues with the versions of R. I'm actually responsible for maintaining different versions of R and Python back at our lab, and it's really a headache, because some really smart postdoc in your lab makes some software that you want to use, but he or she is gone now, and it depends on Python 2.6, and then somebody else comes along and says, oh, I have this new software I want to run, they download it, they have to upgrade Python, and now all of a sudden the old thing doesn't work. With Docker, everything is separate, so there's no contamination of the environment. It may seem like a trivial thing, but conflicting dependencies really are one of the biggest headaches, and by having the developer maintain the environment it becomes much easier: there are no conflicting versions, and each Docker image knows what it does and does it well.

The updates are also generally lightweight. The first time you download something, maybe it's a gigabyte and it takes you five minutes, but the next time, when somebody has just made a little tweak, you ask for the latest thing, and Docker works by what are called layers: it says, I already have that, I already have that, oh, I have to download this one new thing; it takes 20 seconds and boom, you're running the latest and greatest. So it's a really convenient way to keep your software updated. It can run virtually anywhere: on your computer, on a department server, or, as we heard yesterday, on Google Cloud, Microsoft Azure, or Amazon Web Services. What was also asked yesterday was, can I modify it? It turns out yes: you can download almost any Docker image, start it up interactively, make changes, put your own software on, add a little pipeline, and then save it as an image; the next time you go in you can run that image directly, and you don't have to update it again. The one thing that I should point out is that you can run these on your computer, but it's not a miracle: if your computer is slow, the Docker container on your computer is going to be slow.
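As a small illustration of pulling and running an image, here is a sketch using the Docker SDK for Python (pip install docker); it assumes a running Docker daemon, and the image is just the standard Ubuntu image rather than one of the proteomics containers mentioned above.

    # Sketch: drive Docker from Python with the Docker SDK
    # (pip install docker). Assumes the Docker daemon is running.
    import docker

    client = docker.from_env()

    # The first pull downloads every layer; later pulls fetch only the
    # layers that changed, which is why updates are usually quick.
    client.images.pull("ubuntu", tag="22.04")

    # Each run starts an isolated container from the image;
    # remove=True cleans it up when the command exits.
    output = client.containers.run(
        "ubuntu:22.04", "echo hello from a container", remove=True)
    print(output.decode().strip())

The command line equivalents are simply docker pull and docker run.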
It doesn't make you suddenly have a supercomputer, but if you have a reasonable computer you can run some pretty complicated software, and there's not a lot of overhead in running the container, so it runs just about as fast as if you had installed the software yourself and taken care of the headache.

So anyway, what runs on Docker? Well, as we heard yesterday, FireCloud does. You also heard a presentation about BaseSpace; it turns out that each of those little BaseSpace apps is a Docker image that runs on Amazon Web Services, and in fact Illumina has a thing for proteomics called OneOmics. We've developed a couple of apps for it, and they're basically Docker containers; the reason they do it that way is that it's very easy, everything is self-contained, and they don't have to worry about collisions between all the different software packages they offer. The TPP runs on Docker. OpenMS runs on Docker, and that's one of the things we used to analyze SWATH data. Python and R run on Docker; you were having trouble with R, and you can download any version of R you want and boom, it's running. There's also this thing called BioContainers, and the BioContainers folks maintain all these different tools. SAMtools, for example, is a big genomics tool for doing alignments, and it's one of the tools available in Galaxy. This is an easy way to get bio software that is maintained for you: you don't have to go searching forums, you can just go to this one place and see whether it has anything that you want. Anyway, that's the end of the sermon; I think Docker is neat and worth checking out.

In today's lecture you were introduced to the greater depth and consistency of the data independent acquisition (DIA) method over the data dependent acquisition (DDA) method. Software tools like Skyline, Spectronaut, PeakView, and OpenSWATH require a library for DIA data analysis, while tools like DISCO and DIA-Umpire do not. The DISCO software from ISB makes a DIA run mimic a DDA run; it then performs a normal search and interprets the results. Resources like PeptideAtlas, SRM Atlas, and SWATH Atlas contain information from DDA experiments, SRM experiments, and DIA experiments respectively; these resources are freely accessible to the research community and contain a lot of valuable information. Docker containers let users analyze data in many different places and offer an easy interface and workflow for data analysis. In the next lecture we will look at some key features of SWATH Atlas and how it can be used to accelerate DIA data analysis. Thank you.