So, as I was saying, we collaborated with the VGP project on the second version of the pipeline, based on PacBio HiFi reads. On the left you can see three separate tracks for the initial assembly from HiFi reads, depending on what data are available: a regular mode using only the HiFi reads, a Hi-C mode using Hi-C data for the initial phasing of the assembly, and a trio mode, because for some projects we have sequencing data from the parents of the sequenced individual, and the parental data are used for phasing. In every case, the first workflow is a k-mer profiling workflow using Meryl and GenomeScope for evaluation. The k-mer profile provides the size of the genome, the percentage of heterozygosity, and other parameters that are used in the downstream analyses. The next step is contig assembly, using hifiasm in all three tracks. This workflow includes three tools for quality control: Merqury, BUSCO and gfastats. gfastats provides simple statistics about the genome assembly, such as N50 and similar metrics; BUSCO evaluates the presence of orthologs and the completeness of the genome; and Merqury evaluates the level of duplication, among other things. Depending on the Merqury results, a purging step can be applied after the contigging tracks; a further scaffolding step is optional, because many projects do not necessarily have the required data. The third step is scaffolding with Hi-C data using SALSA; for quality control you can use Pretext and gfastats, where Pretext provides a Hi-C contact map to evaluate the completeness of the scaffolds.

This pipeline became available six months ago, and in that time it has been used on 21 genomes: ten birds, two amphibians, two fish, six mammals and one reptile, with four more currently being assembled. The largest genome so far, around four gigabases, was assembled for an eastern species using HiFi and Hi-C data. We also have a Hawaiian bird shown here, and the zebra finch, one of the VGP reference genomes, for which we generated three new assemblies; it has a high level of heterozygosity. We ran the three different tracks, HiFi only, HiFi plus Hi-C, and HiFi plus trio, and, unsurprisingly, the HiFi plus trio track gives the most complete assembly and the highest level of phasing. With the full pipeline we obtain nearly fully phased chromosomes. On the right you can see the Pretext contact maps before (top) and after (bottom) Hi-C scaffolding.

We also evaluated decontamination on 19 of these genomes and compared it with manual decontamination, in which contaminants and the mitochondrial genome are removed by hand. Twelve of them gave exactly the same result as manual decontamination. Among the 321 scaffolds identified as contaminants, we had four false positives and six false negatives, although for some of the false positives later evidence indicated that they might actually be true positives. Overall, the results are very similar to manual decontamination.

So this is the 2.1 version of the workflow. While it is being used to assemble genomes, we are also working on the 2.2 version, which includes, among other things, a GFA-based workflow with GFA/FASTA reconciliation to preserve the graph information produced by the initial hifiasm assembly through to the end of scaffolding.
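Before continuing with the 2.2 additions, a rough feel for the k-mer profiling step mentioned at the start of the pipeline description. This is a toy sketch only: Meryl and GenomeScope work on full k-mer histograms of real read sets and fit proper models, whereas the reads and the estimation rule below are purely illustrative.

```python
# Toy illustration of k-mer profiling: count k-mers in reads and estimate genome size
# as total k-mers / coverage peak. Reads and numbers are invented; real pipelines
# (Meryl + GenomeScope) fit models to the whole k-mer histogram.
from collections import Counter

reads = ["ACGTACGTGGA", "CGTACGTGGAT", "GTACGTGGATC"]  # toy reads
k = 5

counts = Counter()
for read in reads:
    for i in range(len(read) - k + 1):
        counts[read[i:i + k]] += 1

# The most frequent k-mer multiplicity stands in for the coverage peak GenomeScope would fit.
coverage_peak = Counter(counts.values()).most_common(1)[0][0]
genome_size_estimate = sum(counts.values()) / coverage_peak
print(f"coverage peak ~{coverage_peak}x, genome size estimate ~{genome_size_estimate:.0f} bp")
```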
We are also implementing YaHS, Yet Another Hi-C Scaffolding tool, alongside SALSA for the last step of scaffolding, and we are working on a MultiQC add-on to merge the quality control across the whole analysis into a final unified report.

In addition to implementing the pipeline in Galaxy, this project has led to specific developments in Galaxy aimed at big workflows and big analyses. One of these is AWS data retrieval: you can now access the genome data from the GenomeArk AWS repository, including the genomic data and the completed assemblies, directly in Galaxy. Another development is an export tool and workflow. It is a small workflow, but the larger box on the far right is a tool that takes as inputs a destination and a list of files; in this case it is dedicated to exporting back to the AWS repository for the VGP. It takes a full path, including folders and subfolders, so you can build a directory structure to send your files to, and the export workflow creates the right path for the outputs of the analysis. And since the histories are gigantic, with dozens of inputs and dozens of datasets in the history, we added filtering by tag: on the bottom right you can see a tag filter, "scaffolded s1" in this example. These tags are applied during the analysis workflow, and we select as inputs of the export workflow the datasets carrying the tag. With the tag filter, when you run the workflow only the datasets with this tag are offered as available inputs, so when the people running the VGP analyses want to export everything, it is already pre-filtered and they do not have to scroll manually through all the datasets in the history.

All of these workflows are available in the IWC. Some of them are already merged into the main repository; others are still in pull requests while we finish the last round of testing, but you can download them from the IWC whenever you want. The tools are available on usegalaxy.org and usegalaxy.eu, and there is assembly.usegalaxy.eu, which is dedicated to the VGP project, where all the workflows are pre-installed and available. The workflows are on usegalaxy.org and usegalaxy.eu and running there; it is still a work in progress with Galaxy Australia to get all the tools up and running. Thank you for giving me the opportunity to talk today, and thanks also to all the people who collaborated on this project in the VGP team and the Galaxy team, and to the people who did not participate directly in the project but who contribute to the developments that make all of this possible. If you want to know more about the pipeline, we have developed two tutorials that are available in the GTN assembly section.

So the question is whether there was a reason for choosing this data type rather than, for example, Nanopore long reads. The answer is that this is the data type that has been financed through the VGP project; there is also work on an initial assembly workflow using other data, which is still very early, but eventually we want to address Nanopore as well. The next question is whether, as in the COVID project, there is a possibility to trigger the workflow through an API instead of through the interface.
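For context on the API question just raised, this is roughly what triggering a Galaxy workflow programmatically looks like with BioBlend, combined with the tag-based input selection described above. It is a generic, hypothetical sketch, not the VGP automation itself: the server URL, history ID, tag, workflow name and input label are all placeholders.

```python
# Generic sketch: select the datasets in a history that carry a given tag and hand one
# to a workflow invocation through the Galaxy API. All identifiers are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://assembly.usegalaxy.eu", key="YOUR_API_KEY")
history_id = "HISTORY_ID"
wanted_tag = "scaffolded_s1"          # illustrative tag, as in the slide example

# Collect only the datasets tagged for export.
tagged = []
for item in gi.histories.show_history(history_id, contents=True):
    if item.get("history_content_type") != "dataset":
        continue
    details = gi.datasets.show_dataset(item["id"])
    if wanted_tag in details.get("tags", []):
        tagged.append(item["id"])

# Invoke an export-style workflow on the first tagged dataset (names are hypothetical).
workflow = gi.workflows.get_workflows(name="VGP AWS export")[0]
invocation = gi.workflows.invoke_workflow(
    workflow["id"],
    inputs={"File to export": {"id": tagged[0], "src": "hda"}},
    history_id=history_id,
    inputs_by="name",
)
print("Invocation state:", invocation["state"])
```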
It has been discussed; it is a goal to be able to run it through the API rather than through the interface, but we have not started working on it yet. It will come eventually.

Sorry, the next question is what the end point of this project is and whether it will be available in several other places. This is phase one of the project, and the goal is to have all the assemblies available as references at the end for everybody. The second phase of the VGP project is to build annotations and run analyses, so eventually the goal is to have this resource completely available: once a genome has been run automatically through the pipeline, it is also sent for annotation, and eventually the goal is to have it available everywhere.

Awesome, thank you very much. Our next speaker is Timothy Griffin.

Great, I will get right to it, thank you very much. Hi everyone, I am here to present some work from our team here at the University of Minnesota; the campus where this work is going on is a short walk over the river from here. It is very much a team effort, with people from the Supercomputing Institute as well as my laboratory in biochemistry, and also some collaborators from the Lurie Children's Hospital in Chicago. We are working on cystic fibrosis analysis using proteomics and Galaxy tools.

Quickly, some background. The overall subject is cystic fibrosis, obviously a pretty well-known disease, but there are a lot of questions about the bacterial contributions to it: whether the bacteria are a consequence, or actually play some sort of causal role in some of the negative clinical aspects of cystic fibrosis. We are working with a team that has access to bronchoalveolar lavage fluid (BALF) samples: a special scope is used to go into the lungs and basically flush cells and other molecules off the lungs and bring them out. We are analyzing those samples using mass spectrometry to try to identify proteins that are coming from bacteria, which may give us a clue, potentially even functional signatures, but also serve as a marker and a quantitative way to assay the bacteria, as well as the host proteins that are part of these samples. That is the long and short of it: using mass spectrometry, and Galaxy, to analyze this data.

This is the workflow; I am not going to go into every piece of it, other than to say we have access to cystic fibrosis samples as well as disease controls, which are samples from other lung conditions that are not cystic fibrosis. A couple of key aspects are shown in this slide. What we have done at the start is take these BALF samples, pool them into different patient groups, and analyze them, trying to do a really deep dive with the mass spectrometer to understand what we can detect in terms of bacterial and human proteins. We also have some 16S data that gives us at least an idea of which bacteria are there; that is important because it gives us a sense of the protein space, the proteomes, we should be looking at for these samples. We have then taken them through a pretty rigorous bioinformatic analysis, mostly based in Galaxy. This is just another look at that; it is really these two phases that we have used.
This slide shows the overall design: we have disease controls and cystic fibrosis, and we also have their microbial diversity, whether it is high or low, based on the 16S status, so we grouped the samples into these pools and pushed them through the mass spectrometer. It gets a little dicey when you do what we call metaproteomics, because even with the 16S rRNA data we have lots of proteomes and we get these huge databases. We use a tool called MetaNovo that gives us a sort of first pass with the mass spec data to say, okay, here are the organisms we think are actually present based on the initial signatures in the mass spec data, so it reduces our database. We then take what is, I guess, a sort of kitchen-sink approach of using different search tools on that reduced database to understand what proteins are there, given that these tools have somewhat complementary identifications.

This plot shows the proteins and the peptides: blue shows the proteins and peptides that matched bacterial proteins, and the rest are human, so, no surprise, we are detecting primarily human proteins, but we are trying to tease out those bacterial proteins as well. From there, we take the bacterial proteins through a pretty rigorous validation step. We use a tool called PepQuery to make sure these are truly confident, real matches to bacteria and not to human proteins, so there is a pretty rigorous step that happens there; Catherine Doe is the one who has really put together these figures and done a lot of this part of the analysis.

Once we have what we think are our peptides from microbial proteins of interest, we can do some further analysis. This is from the Unipept tool, which shows a phylogenetic tree, so to speak, of where these proteins are coming from. Highlighted here is Pseudomonas, a very well-known pathogen within cystic fibrosis, so we are definitely seeing things that make sense, but there is also a profile of other organisms of interest. From there, we have done a lot of manual checking as well as looking at the quantitative response of some of these bacterial peptides, and we came up with a panel of 87 very stringently filtered peptides. This shows that in the mass spectrometer there are ways to go back in and actually reconstruct the signal from a peptide: these colorful plots are all the fragment ions from a given peptide that were used to detect that sequence, and you can build those back up to confirm that it is a real sequence and also quantify the data using the area under the curve. So we can finally see quantities of these different peptides from different bacteria that seem to show some interesting possibilities. For example, Moraxella is a very common upper respiratory tract bacterium, so it is not a surprise to see it, but it is not one that is necessarily CF-associated, and indeed it shows up in the controls but not as much in the CF samples. And really, where we are going with this is not just the bacterial part; what we think is really novel here is that we can, in the same assay, also look at the human proteins.
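To make the database-reduction idea mentioned above concrete, here is a toy sketch. It is not MetaNovo itself, which works directly from the spectra; it only illustrates shrinking a large combined FASTA down to the organisms flagged in a first pass, assuming UniProt-style "OS=" headers.

```python
# Illustrative only: keep FASTA records whose organism was flagged in a first-pass search,
# producing a smaller database for the main search engines.
import re

first_pass_organisms = {"Pseudomonas aeruginosa", "Homo sapiens"}  # example hits

def organism_of(header: str) -> str:
    # Pull the organism from a UniProt-style "OS=..." field.
    m = re.search(r"OS=(.+?)(?: [A-Z]{2}=|$)", header)
    return m.group(1) if m else ""

keep_current = False
with open("combined.fasta") as fin, open("reduced.fasta", "w") as fout:
    for line in fin:
        if line.startswith(">"):
            keep_current = organism_of(line) in first_pass_organisms
        if keep_current:
            fout.write(line)
```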
I know this is a busy slide, but for the human proteins, which we have a lot more data on, we have gone in and tried to do some analysis of the sub-pathways and functions that are enriched, either in CF or in the disease controls, at the human protein level. Where we are trying to go with this is to construct a panel of peptides that we could target, really create an assay that you could take back into clinical samples to watch the dynamics of bacteria and of human proteins from selected pathways, and how they change under certain disease conditions. We are almost there; we are trying to validate these and develop these targeted assays, and, fingers crossed, we will be able to show some results in clinical samples. A last little piece: we have a metaproteomics workshop today at 1:20, in which Galaxy is heavily used, so come and check that out if you are interested. I have time for maybe a question or two.

One question: full-scan acquisition, or also targeted? Great question; the question was whether this is full-scan acquisition or targeted. All of the data initially was untargeted, just open-ended proteomics, trying to select everything we could and identify what is there. We are now starting to follow this up with PRM, so targeted acquisition: now that we know our marker peptides, let us see if we can detect those and confirm they are real. So hopefully it will all become targeted.

Up next is Dr Jeremy Goecks, from actually the same institution as me, which I will use as my opportunity to give him more time: Galaxy and machine learning.

Excellent, I am looking forward to our time. My name is Jeremy Goecks, I am a faculty member at Oregon Health & Science University, and it is my pleasure to talk to you today about advances in machine learning for Galaxy. I have the pleasure of talking about work from several individuals across the community, and I will acknowledge them along the way. The key points I want to cover today are roughly three. First, I want to share an overview of machine learning and deep learning in Galaxy from past to present, a quick hop across where we started and where we are going. Second, I want to highlight two recent advances with respect to machine learning that will give you some insight into how we are thinking about this process, and how we think Galaxy can contribute to and enable machine learning analyses. And finally, I will end with future directions for machine learning.

To level-set and make sure we are on the same page: as many of you probably know, machine learning is a subfield of artificial intelligence where you use data to understand patterns and identify trends that you can then use to predict outcomes in the future. Deep learning is a further subset of machine learning that uses neural networks, and it is called deep because instead of only a couple of layers of neurons you have many layers. You can see with this example here, on this (maybe) super-high-quality projector, that as you move across the layers you get more and more advanced features detected, so you start out with things like edges. Okay, so if we think about where machine learning is applicable.
This is a slide we put together a while back thinking about machine learning for biomedicine: the idea that you can collect data in the clinic and apply machine learning models to many of these different data sets. But of course machine learning is useful not only in biomedicine but in other places like ecology as well; I saw a really interesting study recently where somebody used machine learning to identify the different trees present in a landscape. So we are looking at wide applications, and we want to enable them in Galaxy.

What we have done over the last couple of years is develop a toolkit called Galaxy-ML. It takes the best-practice libraries that are already out there in the community and wraps them in Galaxy: things you have probably heard of such as scikit-learn; other tools such as XGBoost and imbalanced-learn, if you are familiar with the challenge of one class being underrepresented and other classes being overrepresented; and also some deep learning libraries. We wrap all of these in Galaxy to provide an end-to-end approach where you can take your data sets in Galaxy, do the training, do the testing, do the evaluation, and actually do the visualization as well. And when you take these machine learning toolkits and stick them inside the Galaxy framework, what I think you get are some very nice abilities to do things that are hard to do otherwise; in particular, you can do scalable and reproducible machine learning. We take advantage of the scalability and reproducibility that Galaxy provides, combined with these machine learning tools. Here you see some heat maps where we ran and created thousands of machine learning models and compared them to each other across many data sets, to try to identify which models were most appropriate and best performing on different data sets. And, sorry, I mixed up my right and left: the heat maps are here, and on the far left you see examples of deep learning applications that we applied as well; those curves are simply different architectures and different data sets for which we mapped out predictions. So we think the synergy of these machine learning libraries coupled with Galaxy's powerful infrastructure is quite valuable.

Okay, so that is what we have done. What is new? I want to talk about two things that are new with respect to machine learning in Galaxy. The first one comes from the University of Freiburg, led by Anup Kumar, and it is Galaxy JupyterLab for AI. The idea here is that Galaxy can run interactive tools; we will hear more about this, hopefully, as the conference goes on. The idea of these interactive tools is that you can spin up these kinds of web environments inside of Galaxy, which is of course quite powerful. What has been done in this particular instance is that they have taken this idea of an interactive tool and loaded it up with lots of things you want to be able to do for machine learning in the context of JupyterLab. JupyterLab is of course the online, interactive notebook where you can type in code and execute it in real time. In making this interactive tool, they have put in core aspects of machine learning, in particular deep learning approaches, to make it possible to run these calculations on GPUs.
It comes with many pre-installed packages, again so you do not have to spend your time installing things and getting set up, Git integration, and integration with different model formats so that you can import and export models. One thing that is pretty important, and I will go a little off script here, is the idea that you can use models that are widely available in the community. Rather than building your own models, what you often really want to be able to do is take models that other people have created and apply them to your own data. This requires the ability to import those models and recognition of the different formats, and all of this is baked very nicely into this JupyterLab application.

Of particular interest is something we have seen already, where you start to connect these notebook environments back to Galaxy and say: I am in a notebook environment, but I want to use some of the capabilities that Galaxy offers. Here you see an example where you can walk through, take your machine learning model, and go back and use Galaxy to train it. You are sitting inside this interactive tool, you write a script, and then you go back and talk to a Galaxy server and say, okay, I want you to execute and train this model. Rather than training on the small Docker container where my notebook is running, I want to use the power of Galaxy. So again we are taking these aspects of machine learning and letting them take advantage of the framework Galaxy provides. If you want to learn more about Galaxy JupyterLab for AI, there are some links here, and I will make sure I tweet out my slides so you do not have to copy these links down; there are tutorials, there is a preprint, and there is a live instance you can actually try out right now.

The second advance I want to talk about is something called declarative deep learning. Declarative programming is probably familiar to many of you, either because you have used it explicitly in the context of Galaxy or because you have seen it elsewhere. It is the idea that you do not necessarily tell the computer how to do something; you just say what to do. The particular problem this is trying to overcome, when you think about declarative programming with respect to deep learning, is that it is often difficult to build these models: you have many, many layers that you are trying to define and connect together, and there is complex syntax involved. This is a hard problem to solve, even in the context of Galaxy, where we provide nice UIs for building up deep learning models. The solution is a simpler approach where you take something like YAML, or another format, and declare what you want to do. What you see here on the left is a simple set of commands that say: these are the input features I want you to encapsulate, here are the outputs I want, go off and build a model for me. There is a cool toolkit out there called Ludwig; it started out at Uber but is now hosted at Stanford, and it allows you to set up these YAML files and then build up your deep learning models. The key advantage, if you take full advantage of this framework, is that it turns minimal and simple code into pretty complex models.
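As an illustration of the declarative style, here is a minimal sketch of what such a configuration looks like when expressed as data rather than code. The feature names are invented; a description like this, whether written as YAML or as the equivalent Python dict, is what a declarative library consumes in place of hand-written model-building code.

```python
# The entire model is described as data; this mirrors the YAML shown on the slide.
# Feature names below are invented for illustration.
config = {
    "input_features": [
        {"name": "age", "type": "number"},
        {"name": "tissue", "type": "category"},
        {"name": "marker_intensity", "type": "number"},
    ],
    "output_features": [
        {"name": "response", "type": "binary"},
    ],
}
# Handing this description to a declarative library (for example ludwig.api.LudwigModel(config))
# is all that is needed; layer construction and wiring happen behind the scenes.
```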
We can discuss how much visibility we need inside these models to make sure they are useful and doing what we want them to do, but as a first step this really lowers the barrier to taking these models and using them in your own research. So what we have done, as you would expect since I mentioned Ludwig and we are at the Galaxy conference, is take Ludwig and expose its functionality inside Galaxy. In particular, the Galaxy tools we have mirror the Ludwig concepts of encoders, combiners and decoders. This is a pretty simple concept: I have a bunch of input features, I encode them into my network, I combine them in a particular way, and finally I decode to get the outputs I want to predict. We can do this using custom approaches, or we can use the predefined models I talked about before, so things like ResNet for image analysis are possible inside Ludwig with a single line, and we can do visualizations on top of it. Ludwig is quite good about saying: you are going to run many iterations of your model, so let me visualize them side by side so you can evaluate the performance.

Here is an example application we have done so far. Ki67 is a common biomarker, in cancer in particular; you care about it because it is often a sign that a tumor is growing or proliferating rapidly. In this particular application there was a data set from a public publication where the goal was to predict Ki67-positive cells versus Ki67-negative cells. This is a common task: you get a slide from a tumor and you say, for example, that 20% of the cells are positive. Pathologists are pretty good at this; it would be nice if we could automate it to some degree. So we took this example data set, and you can see some examples of positive and negative cells right here, we put it into Galaxy, and we used Ludwig. I am showing the rough configuration here: we have the input features, in this case just a set of images; we have the output features we want, in particular the label, positive or negative; and we ran this through Galaxy. This was not a lot of code to bring up pretty quickly and to reproduce something that was out there in the community, so it is a nice demonstration of the power of what we can do, and we got pretty accurate results, as you can see, that mirror what was found in the publication.

Looking forward now to some of the future directions. Again, I emphasize the point of being able to use what is out there in the community and build on top of it, so we are looking at model zoos; these zoos are places that hold the different models people have created, and we would like to integrate more tightly with the zoos that are out there. Ludwig is a good first step, Galaxy JupyterLab for AI is a good first step, but we can do better, and we are looking at how to do that. Additional tools for visualizing and understanding performance are always needed when it comes to machine learning; these models are complex, and it is sometimes hard to understand what they are doing, so we can advance that by introducing additional visualizations for understanding performance. And of course we want to push the state of the art forward in terms of scientific applications as well.
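Returning to the Ki67 example for a moment, here is a rough, hypothetical sketch of what such a run looks like through Ludwig's Python API; the CSV file and column names are assumptions, not the actual configuration used, and the Galaxy tools described in the talk expose this same kind of configuration through the tool form rather than code.

```python
# Hypothetical sketch of an image-in, binary-label-out Ludwig run (not the actual Ki67 config).
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "image_path", "type": "image"}],
    "output_features": [{"name": "ki67_status", "type": "binary"}],
    "trainer": {"epochs": 10},  # the key is "training" in older Ludwig releases
}

model = LudwigModel(config)
train_stats, _, _ = model.train(dataset="ki67_cells.csv")   # CSV listing image paths + labels
predictions, _ = model.predict(dataset="ki67_cells.csv")
print(predictions.head())
```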
I have talked mostly about infrastructure today, but of course we want to partner with our scientific collaborators and say: let us actually do new analyses in Galaxy with these machine learning toolkits. COVID and cancer are two examples that I know are going on in the Galaxy community right now. And finally, at more of the infrastructure level, there is a long-standing interest in using machine learning to improve how Galaxy works and orchestrates its tools. In particular, when Galaxy chooses to run an analysis job, it has to figure out what resources are likely to be needed and then assign it to the proper queue. What we are working on, in terms of the core Galaxy community and team, is to predict job resource needs based on historical data and then tie that back into Galaxy: a really nice application that is not scientific, but will help Galaxy function much better.

Okay, the final thing I want to mention today is that there is a machine learning workshop tomorrow afternoon at 1:20, led by Kaivan. So if you are interested in machine learning, and in learning about what you can do right now with respect to Galaxy, you can stop in tomorrow at that workshop. I had the pleasure today of talking about Galaxy JupyterLab for AI, done by Anup Kumar and Björn Grüning in Freiburg; at Penn State University, Kaivan Kamali and Anton Nekrutenko are doing a bunch of machine learning work; and at Oregon Health & Science University, Qiang Gu and colleagues have helped out tremendously with bringing up these tools as well. I also want to thank our funders; we are fortunate to have funding from the NIH and a variety of places. Thank you very much.

I need to repeat this question for the people online, right. So, a very exciting question came up about 2,000 turkey pictures: what could be done with those 2,000 turkey pictures, and whether there are processing and normalization tools built into this framework to do that kind of preparation before you do the machine learning. Oftentimes machine learning, as we have talked about today, is this concept of predicting different labels, so maybe you want to predict whether a turkey is going to grow very big, or is going to be very friendly, for instance; you could label your turkeys and then train on that, and if you need to do pre-processing up front, that is possible. There are also unsupervised machine learning tools in Galaxy, where you could cluster your turkeys, for instance, based on particular attributes, whether it is color, size or something else. That is a great question. Any other questions? Yes.

Okay, so the first question is whether model zoos are the same thing as model registries, and I did not quite catch the context of the second question, which is about the features. Sure. Let me address the first question; I think that is a pretty simple answer: I would say model registries are the same thing as model zoos. Registry actually sounds better, so maybe I will adopt that term in future talks. The second question was about pre-processing features, and here it depends a little bit on the model and the approach you want to use. In traditional machine learning you would often separate your pre-processing from your model building, so you might normalize your data and scale it, and then you would run it through, say, a logistic regression model.
So you would create a Galaxy pipeline that does these different steps and hand that off to someone so they can run it. Deep learning models, by contrast, oftentimes do all of this internally, in a black box, so you just give your data to the deep learning model and it does the feature pre-processing along the way as well as the classification. So I would say there are absolutely ways to do the feature engineering inside this toolkit if you want to, but it depends on your particular approach whether you need one tool or a more traditional workflow with multiple steps. Okay, great question.

I think that is all the time we have for questions right now, so any more questions about how to image-process turkeys, please take them to Slack; I think we will have a very interesting discussion there. Okay, next we have Andrew.

Good morning. My name is Andrew Rajczewski, I am a PhD candidate at the University of Minnesota, and I want to talk to you a bit about the project I have been working on: bottom-up versus FIRE, comparing bottom-up proteomics and a modified Edman degradation methodology for automated untargeted protein adduct analysis in the Galaxy platform.

You and I are constantly being exposed to reactive chemicals, every day of our lives. This collection is called the exposome, and it comes from a number of different sources, from exogenous or endogenous byproducts to specific exposures, occupational hazards and environmental pollutants. These are of interest because of their direct toxicity as well as their mutagenic and oncogenic potential. Because the exposome consists of highly reactive electrophiles, they can readily form adducts on a variety of biomolecules. This collection of modifications is termed the adductome, and it serves as an enduring record of the exposome as well as a potential source of exposure biomarkers to tailor individual patient care.

Classically, sorry, classically, DNA adducts have been highly studied because of their direct impact on DNA replication and gene transcription. However, they tend to be of very low abundance and have a very short life due to the presence of DNA repair enzymes. By contrast, protein adducts tend to be very long-lived, as they are not repaired, just recycled along with the proteins, so high-abundance, long-lived proteins such as human serum albumin and hemoglobin make ideal candidates for the study of the long-term exposome. One strategy that has been successfully employed in this regard is the analysis of adducts at the N-termini of hemoglobin molecules. The Törnqvist lab at Stockholm University has pioneered a methodology to study these adducts called the adduct FIRE procedure; the acronym roughly fits "measurement of adducts by a modified Edman procedure", and it does not work any better in Swedish. Basically, if you remember your elementary biochemistry, it is essentially modified Edman degradation: fluorescein isothiocyanate is added to hemoglobin, it forms a cyclic molecule with the N-terminal valine, which then comes off the rest of the protein, and that product can be cleaned up by SPE and analyzed via LC-MS. While this has been shown to be very successful in identifying a number of N-terminal adducts, the problem is that the FIRE method can have undesirable side reactions, and it has also been shown to be somewhat ineffectual towards larger electrophiles. To this end, we hypothesized that a simple bottom-up proteomics approach could complement, if not replace, FIRE.
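To make the adduct idea concrete before the experiments: an adduct shows up as a characteristic mass shift on a peptide relative to its unmodified form. Below is a toy sketch of that logic with purely illustrative numbers; it shows the general principle, not the FIRE chemistry or the adductomicsR implementation discussed later.

```python
# Toy sketch: flag observed peptide masses whose shift from the unmodified control peptide
# matches a candidate adduct mass. All values are illustrative.
control_mass = 1274.72                    # unmodified control peptide, monoisotopic (invented)
observed = [1274.72, 1345.76, 1440.72]    # observed peptide masses from the run (invented)

known_adducts = {                         # candidate mass additions in Da (illustrative)
    "acrylamide-like (+71.037)": 71.037,
    "DNCB-like (+166.00)": 166.00,
}
tolerance = 0.02                          # Da

for mass in observed:
    delta = mass - control_mass
    if abs(delta) < tolerance:
        continue                          # unmodified peptide, nothing to report
    hits = [name for name, m in known_adducts.items() if abs(delta - m) <= tolerance]
    print(f"mass {mass:.3f}: shift {delta:+.3f} Da -> {hits or 'unassigned'}")
```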
In these sorts of experiments, human hemoglobin is isolated through lysis of red blood cells with pure water, then digested with trypsin, cleaned up, and run in a fairly standard bottom-up proteomics workflow. To test our hypothesis, we took donor blood and exposed it to a series of six electrophilic compounds; these were incubated with the blood at various dosages and incubation times, after which the samples were divided to undergo FIRE and bottom-up proteomics. With FIRE we found that only some of the expected N-terminal adducts were observed: the top four, in green, had a nice linear dose response; the fifth was only barely detectable, and in a non-linear fashion; and this one in red, DNCB, was not seen at all. By contrast, with bottom-up proteomics we could detect all six of these adducts at the N-terminal valine. Here is a tandem mass spectrum of the N-terminal beta-chain peptide with the DNCB adduct, and with bottom-up proteomics you can also differentiate the alpha-chain and beta-chain N-termini where these adducts are forming. What is more, bottom-up proteomics allows for the detection of side-chain adducts, because many side chains are highly nucleophilic; these were detected as well in our hemoglobin samples, and this beta-chain cysteine in particular is an especially good reservoir for electrophilic adducts.

So what does this have to do with Galaxy? The experiments we did were done with known electrophiles, but in a real sample we would be fully agnostic; we would not know what we were looking for. To handle that, we have adapted a software package from the Rappaport lab called adductomicsR, which essentially takes a designated control peptide and looks for mass differences from that control peptide as evidence of potential electrophile adducts. This program was originally developed for human serum albumin, and we have adapted it to handle human hemoglobin as well as rat hemoglobin, given their prominence in my work. The way this workflow runs in Galaxy is that raw MS data comes in and is converted to a more amenable file type; adductomicsR then gives you a list of N-terminal adducts, which are searched against a database of known protein adducts, and then statistical validation is done. This is currently in process, and once it is done we will be doing a long-term study of the unknown adductome of smokers and non-smokers. With that, I would like to thank all the people who worked on this with me, and if you have some questions I will try to answer them. Thank you.

Question: what are you trying to look for — known specific adducts that would indicate certain exposures, or something else? Yes, well, yes and no. With this we are trying to keep it fully agnostic, where we do not know what adducts are there. But there have been studies where certain adducts, like the acrylamide adduct I showed earlier, have been shown to be biomarkers of human exposure to various chemicals, say in people who work in certain industries, and this has been correlated with certain cancers. So the idea is: can we go back and find novel adducts that might be useful as biomarkers for different diseases. Any other questions? Thank you very much. Thank you.

The last talk of this session — no introduction needed. Hello, everyone. I am on the Galaxy-P team at the University of Minnesota. Am I too low, or just perfect? All right, thank you. So basically I am not going to go over this long title.
It just says that we developed Galaxy workflows that can detect variant-specific peptides from clinical samples. So, COVID-19. I do not think I want to say much about COVID-19 anymore; everybody wants to be rid of it, and everybody here practically has a PhD in COVID-19 by now, so I will not dwell on it. We have two kinds of tests, well, more than two, but one of the basic assays is antibody and antigen testing; you all have kits for that. But there is also mass-spectrometry-based testing for COVID-19. Last year, as Björn said, okay, why don't you go ahead and do some COVID-19 proteomics, we said, okay, let's start with COVID-19 proteomics, and we published papers on it. The first was basically to detect COVID-19 peptides in clinical samples and also to create a peptide panel that could be used for targeted COVID-19 detection. The other paper was to create metaproteomics workflows for COVID-19 analysis and to detect any potential co-infecting microorganisms in COVID-19 samples, that is, any pathogen in COVID-19-positive patients that could be dangerous for them. That is what we were doing with our workflows.

While this was going on, there were waves and waves of COVID-19 with different variants, and we thought we would end up with just the two papers, but that was not it. We decided: if there are variants coming up, why not develop workflows that can detect these different variants in clinical COVID samples. But which variants? We are not trying to detect everything, all the Pango lineages; we were concerned with the main causes of the waves, which are Alpha, Beta, Gamma, Delta and Omicron. So we asked: can we detect peptides from all these different waves? We had a plan in mind: select datasets from different timelines, from 2020 to 2022, from across the globe, covering different demographics and different types of clinical sample, nasopharyngeal, oral, BALF and urine samples; see whether we can detect COVID and then, specifically, the different WHO-designated variants; and produce a peptide target list, a panel of peptides that could be used for targeted assays later on.

We ended up developing two workflows, a discovery workflow and a verification workflow. The first was to identify any COVID peptides present in the samples we had, and the second was to verify whether the peptides we detected were actually present, and then to assign them to the various variants. So, the first workflow: as I mentioned, it is for identification. In our previously published papers we just used SearchGUI/PeptideShaker as our search engine, but in this workflow we added more, on the principle that more search engines identify more. The basic input is the MS/MS data, the mass spectrometry files, and then we used a COVID-19 sequence database, a contaminant database and a human protein sequence database as a combined database. The COVID-19 sequence database consists of not just the Wuhan strain, which has all the proteins, but also additional SARS-CoV-2 structural protein sequences, everything we could gather.
The contaminant database is there to catch handling and experimental contaminants, and because these were human samples we also have the human protein database. We used SearchGUI/PeptideShaker as well as MaxQuant for searching. These search engines produce tons and tons of output, but we just took the peptide output, and from all the identifications we extracted only the COVID-19-confident peptides: we filtered out the human peptides, the contaminants, and anything that matched other known proteins, and kept only COVID-19-confident peptides. Another output is the mzToSQLite output, which we later used to look at the spectra of all these peptides and annotate them manually, to judge whether the spectra supporting these peptides are of good quality or not. This is how the Lorikeet view for one of the sample peptides looks: if there are continuous blue and continuous red ion series, you say, okay, that is a good spectrum, and all the annotated peaks here should be about three-fold higher than the grey ones. That is how we look at the spectra, but that belongs to the next workflow.

When we ran the discovery workflow on all of our 12 datasets, we got about 103 peptides that were unique to COVID-19. We added these 103 peptides to the 803 peptides previously published in our paper and in other published papers, creating a peptide panel of 906 peptides. We then used the PepQuery tool, a validation tool that tells you whether these peptides are present in your sample or not. We re-analyzed our 12 datasets using these 906 peptides and got, I think, 114 peptides that validated, and we made sure these did not match any human or contaminant proteins. This is how PepQuery results look: it is not very clear here, but there is a peptide column, it tells you which file the peptide came from, and it gives a p-value and a confidence associated with it; that is how we filter our confident peptides. Once we did that, we also looked at the spectra, as I showed before; we went manually through all of these peptides and ended up with approximately 82 peptides that were of good quality and that we were confident are COVID peptides.

Obviously, our next question was: which of these COVID peptides are really variant peptides? We did BLASTP. When I first ran BLASTP on this, I realized that the BLAST NR database on Galaxy-P was out of date, from 2018, so it did not contain any COVID-19 sequences. So I had to go outside of Galaxy, run the BLASTP searches against NR there, and align the peptides to the Wuhan strain so that I could find any variation from the wild type; the BLASTP NR search was also to make sure they did not come from any other virus. So it was mostly to validate that these are indeed COVID-19 and variant peptides. Once that was done, we were confident that these might be variant peptides, because they did not have 100% coverage or a 100% match to the wild type.
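As an aside, the "COVID-19-confident" filtering described earlier boils down to keeping peptides whose protein matches are exclusively SARS-CoV-2. Here is a hedged sketch of that idea; the column name and accession prefix are assumptions, and this is not the actual Galaxy tool chain, just the logic it applies.

```python
# Sketch: from a tabular peptide report, keep only peptides whose protein accessions are
# exclusively SARS-CoV-2 (no human or contaminant matches). Names are assumptions.
import csv

def is_covid_only(protein_field: str) -> bool:
    accessions = [p.strip() for p in protein_field.split(";") if p.strip()]
    return bool(accessions) and all(acc.startswith("SARS_CoV_2") for acc in accessions)

with open("peptide_report.tsv") as fin, open("covid_peptides.tsv", "w", newline="") as fout:
    reader = csv.DictReader(fin, delimiter="\t")
    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, delimiter="\t")
    writer.writeheader()
    for row in reader:
        if is_covid_only(row["Proteins"]):   # "Proteins" column name is an assumption
            writer.writerow(row)
```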
We took these peptides and annotated them by looking at the peptide reports from our previous searches, and we also asked a collaborator to write an R script that could take our peptides, compare them to the Pango lineages, and report whichever Pango lineages each peptide matched. Some peptides had multiple Pango lineages associated with them, but some peptides were specific to a certain Pango lineage. We then looked up the Pango lineages on the cov-lineages.org website, which lists the mutations present in them for the WHO-designated variants. So, as you can see here, these are a few of the peptides we detected: the wild type is on top, the variant-specific peptides are on the bottom, and the residues in red show the mutations relative to the wild type. We have Gamma, Delta and Omicron strains here, from the different waves and from different samples, which showed that our workflows did detect these variant peptides.

A few things we also did manually: we aligned these peptides to the wild-type strain and looked at the coverage. These are the Alpha, Beta, Gamma, Delta waves that came, colour-coded, and these are all the strains that passed our confidence thresholds; we were probably being quite stringent. Most of the peptides were nucleocapsid peptides; we did not get a lot of spike peptides, maybe because of the digestion enzymes used and because these were clinical samples, but the nucleocapsid peptides we did get carried informative mutations in the newer waves compared to the wild type. Those were a few of the results we got. And here are the six peptides we were confident about that came from the different strains: Delta, Gamma, Omicron. Alpha is covered in the panel, but we did not find it in our samples, probably because we did not have many samples; it was just 12 datasets, and these were published datasets. If we had more data we could do more work using our two workflows.

With that, I would like to conclude. First, our workflows are flexible enough to detect emerging strains from clinical samples, though hopefully we do not have to run them again and this is the last variant that comes up. Second, the peptide panels could be used to help develop vaccines, and the variant peptides can actually be used for targeted assays: now that we have these variant peptides, we could run PRM or SRM assays to detect them. And lastly, our workflows can be adapted to detect other pathogenic organisms, so they are not confined to COVID-19: you can just change the database and the parameters and use them for other organisms too. I would like to thank all the collaborators and the Galaxy-P team behind this project, and I am happy to answer any questions.

Questions? I wonder if wastewater is a possible source? So, the question is whether wastewater data could be used with these workflows to detect COVID-19. Yes, we could; we have not tried it because we have not got the data yet, but it could be used.
The only issue is that we would have a lot of data and a lot of other microorganisms, because it is wastewater, so the database to search would have to be very large, but yes.

So it seems like this could be used as an alternative way to validate some of the variants that come from sequencing; did you do any comparison between what you get from your data and what is in the sequence databases, whether they are congruent? So, the question is whether I compared our results with the sequences that are already published, whether they are congruent. I did compare with what was already published: cov-lineages, for example, lists the amino acid mutations, so I have checked against that, but I have not compared with other variant data; the data are out there somewhere.

Could you help someone who is not a proteomics expert understand: I think COVID has something like 30 proteins, but you are finding about a thousand peptides; are those peptides produced from the 30 proteins during digestion? Sorry, the question was: there are only about 30 proteins in COVID-19 and we are finding thousands of peptides, and are they produced during digestion? Yes, the peptides are cut at different sites; it depends on the digestion enzyme used, but trypsin, for example, cuts each protein at many sites into different peptides, so we end up with a lot more peptides than proteins.

Any last questions before the break? Okay, thank you very much.