Hello everyone, this is Engy Nasr, a PhD student at the University of Freiburg in the Galaxy team, and today I'll be guiding you through one of the metagenomics trainings, where we are going to detect and track pathogens in food samples. Throughout the training you will learn about a workflow created within Galaxy, which is openly available: you can use it anytime, and you can adapt it. The only data you will deal with here are Nanopore datasets, but you'll also learn that you can adapt the workflow to any other sequencing technique you have. By the time you reach this point in the Galaxy training path, you will usually have already learned about quality control, mapping, and the introduction to the Galaxy platform: how to use the tools and how to create a history. But we will do those things together today, as a short revision. So, shall we start? To reach our training material, you simply click here on the graduation-cap icon. This opens the Galaxy Training Network, and our training today is in the metagenomics part: you scroll down to metagenomics, and it's the last one, pathogen detection from direct Nanopore sequencing data using Galaxy, foodborne edition. So, today we are dealing with samples from food. The idea behind that is that food contamination causes a lot of hospitalizations: in Germany, around 137 cases in 2019, and globally 600 million people a year, mostly from Salmonella and other kinds of pathogens. To deal with contamination quickly, you have to track it and detect it quickly. A lot of techniques were used before, like isolating the pathogen, targeting specific genes, real-time methods, and so on.
With those techniques you waste a lot of time, and sometimes you target such a small portion of the genome that it may not be a good indication of how much pathogen you can find. As time went by, more technologies arose and more techniques were used, until, for example, whole genome sequencing came along, which gives you an overview of the whole genome of the cells in your sample; however, it requires isolation first, which also takes a lot of time and is not guaranteed to succeed. The alternative is to directly sequence all the DNA present in the sample, which is shotgun metagenomic sequencing. This gives you an overview of all the cells, and at the same time it does not require isolating anything before sequencing. For the sequencing technique you can definitely use Illumina, but for this training, some of the tools in the workflow you are going to use today are designed for Nanopore datasets. Choosing Nanopore is easier and more practical: it's faster, so if, for example, you're in a factory and you want to check quickly whether there is a pathogen, and where it is found, you have to be really quick, and you need a small device that you can use to get your sequencing data directly, put it into the workflow, and find out right away whether there is a pathogen, which one it is, how many pathogens you found, and so on. For this reason we have chosen Oxford Nanopore technology for our sequencing. The datasets we are going to use today are from a company called Biolytics, where they spiked chicken with a known pathogen, mainly Salmonella in two different strains, then did DNA extraction and Nanopore sequencing directly.
We got the sequencing files, the FASTQ files, from them, and for our training today we will be using two of them. Both are from Salmonella, but each one is from a different strain. The main goal of the full workflow is to detect blindly, agnostically, which pathogens are there and where they are found, and maybe the time slot, i.e. in which sampling time you found the pathogen, and so on. But here, with this data, we know beforehand that we have Salmonella, so we can test our workflow and see at the end exactly which strain and which pathogens we found in our data, along with a lot of other analyses. Okay, so let's start with preparing our Galaxy and the history: creating a history and importing the test datasets that you're going to start with. As a first step, you click on the plus sign here to create a new history, and you name it whatever you want, for example pathogen detection, and you click save. The next step is to import your datasets. You will see two links for the datasets here. All you have to do is press copy, go back to the upload-data panel in the far left corner of Galaxy, click "Paste/Fetch data", paste the links, then click Start. An important step here, and actually one of the benefits of Galaxy, is to tag your datasets, so that you can see them along your analysis and identify them directly: the tag keeps propagating upwards through all the analyses and tool outputs as you go. To add one, all you have to do is type a hashtag and call it whatever you want, for example barcode10 spike2.
Barcode 10 is the name, an ID to mark a specific species with a specific strain, and spike2 is spike number two. The datasets we got from Biolytics were spiked at different time points: spike 1 on a specific date, spike 2 on another date, and similarly for spikes 2b, 3, 4, 5, and 6. Each spike is a specific time point, a different date, so that at the end you can see one big visualization for all the barcodes, i.e. all the different pathogen strains or species, and for each of them you know at which spike, so at which sampling, it was found. In a real-world example you would take samples at different hours or different days, hundreds of samples, or say 10 different samples, and you would see one big visualization: for each sampling time, whether or not we find pathogens, what these pathogens are, whether they are related, and so on. Then you'd be able to stop the contamination directly, or at least detect it and know where exactly it is coming from. So, after pressing hashtag and then barcode10 spike2, you press enter. It's the same for the other dataset, barcode11 spike2b, and you also press enter. An important thing to note about this specific training is that it was designed in two versions, short and long. The short version is what we are going to be doing together today, where you run the pre-created workflows. As a full picture, there is one big workflow that is divided into five smaller workflows, each for a specific step. So what you're going to be doing today with the short version is to run each workflow as a whole to do a specific task for you. Here, after uploading the datasets, you'll be asked to choose between the short version and the long version of the tutorial, so make sure you click one of them.
For today I will be clicking the short version; the long version is designed to be run tool by tool, step by step, as if you were creating these workflows by hand. So today, with the short version, we are going to run the complete workflows as a whole, and at the same time you will learn about the tools used in these workflows, why exactly we chose them, and the difference between them and other tools that we did not use. So, for today, make sure you clicked on the short version. After uploading your datasets, the first step, as you may have learned from previous trainings in the Galaxy Training Network, is preprocessing, where we quality-control our reads, trim, and maybe do some other filtration, in order to quality-retain the reads and make them ready for the analysis. In our case, what we have are Nanopore datasets. So the first step is quality control; we will be using some of the well-known tools like FastQC and MultiQC, which you may have used before with any other sequencing technique. We will also be using NanoPlot; this tool is specific to Nanopore datasets, so it's new for you today. We'll be using both MultiQC and NanoPlot. Then, after checking the quality, we'll do some trimming and filtration using Porechop and fastp, which also work well for Nanopore datasets. Finally, we want to remove all the host sequences. For example, we have samples from food, and what you are interested in now are the pathogens, whether or not there are any; you are not at all interested in the sequences of the host itself.
So if, for example, you have a sample from cow, chicken, or milk, you are not interested in the milk, chicken, or cow sequences; all you are interested in are the other sequences, the reads that are unknown to you. You take those and then search for pathogens, if there are any. So all you have to do is remove the host and proceed with the other sequences you still have in the sample, in order to analyze them and test whether they are pathogens or not. Part of this preprocessing that we'll be running now is that filtration step, where we use a tool called Kraken2. It's designed for taxonomic profiling, so it arranges the sequences you have on the tree of life: it shows you, from the kingdom level down to the species level, how many reads are assigned to this kingdom or this species, and so on. In this tool you can choose one of the standard built-in databases, like Standard or Standard PlusPF, and you can also add other databases through the Galaxy platform. What we did is add another database called Kalamari; you can click on it here to learn about it, and as the training goes on you can always click on any link to read more about any tool or database that we use. The Kalamari database is one of the databases used recently for removing hosts, and it was shown in other studies to do better at identifying the hosts. After you identify the host reads in this step, you start to remove them, and we can do this with some tools we have in Galaxy that do table manipulation, like Filter Tabular, plus Filter sequences by ID. After this step, we are ready with all our sequences that are not host, and we take those sequences and move forward to the other steps of the training.
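The host-removal step just described boils down to a simple rule: collect the IDs of reads whose assigned taxon matches a host name, then keep everything else. Here is a minimal Python sketch of that rule on toy data; the line layout mimics Kraken2's per-read output with `--use-names`, and all read IDs and numbers below are made up for illustration, not taken from the tutorial's files.

```python
# Toy sketch of the host-removal rule: classify each read (here, fake
# Kraken2 per-read output with --use-names), collect IDs assigned to a
# host taxon, keep the rest. All reads and taxids below are made up.

HOST_PATTERNS = ("Gallus", "Homo", "Bos")  # chicken, human, cow

kraken_lines = """\
C\tread1\tGallus gallus (taxid 9031)\t1200
C\tread2\tSalmonella enterica (taxid 28901)\t950
U\tread3\tunclassified (taxid 0)\t400
C\tread4\tHomo sapiens (taxid 9606)\t800
""".splitlines()

host_ids = set()
for line in kraken_lines:
    status, read_id, taxon = line.split("\t")[:3]
    if any(pat in taxon for pat in HOST_PATTERNS):
        host_ids.add(read_id)

# Keep only reads whose ID is not in the host set
kept = [line.split("\t")[1] for line in kraken_lines
        if line.split("\t")[1] not in host_ids]
print(kept)  # ['read2', 'read3']: Salmonella and unclassified survive
```

In Galaxy, the same matching is expressed as the "hosts to remove" pattern applied to the Kraken2 table by Filter Tabular, followed by Filter sequences by ID on the FASTQ.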
So, what you have to do now in the short version is download the workflow from here, just by clicking on this workflow link. Okay, it will start downloading. Then you go into Galaxy again; before, we imported through a URL, so now you press on Workflow and you press Import. You can either give a URL, or browse to where you downloaded the workflow and upload it directly from there. After you import it, you'll find it here in the list of workflows, and all you have to do is run it. So I will go to the workflow that I have here; you will find its name, Nanopore datasets preprocessing, which is the first one, and all you have to do is press the run button. It is one workflow, and here you have two samples, so you have to choose both: you click on "multiple datasets", then you choose one of them, and by holding Ctrl while clicking you choose the other one. Finally, there is the "hosts to remove" filter. That is an expression you can write to specify which species, or part of a species name, you want removed. As a default we have set Gallus, which is chicken, Homo sapiens, which is human, and Bos, which is cow. If you are testing your own samples and you know, for example, that they are from milk, you add another bar (|) and then the other species name, or part of the species name; and if you're sure there is no Homo, for example, you can simply remove it, and so on. So here I will check for Gallus, Homo, and Bos, and in the last step the Filter Tabular manipulation removes all the hosts specified here.
So, what I will be doing here is run the workflow. One of the Galaxy benefits, actually, is that you can run on a collection: instead of running sample by sample, you can add all the samples you have into a collection, and run the workflows on that collection. Currently, though, the workflows associated with this training material are set to run file by file; we will be adding a collection version soon, maybe as another training material or another version, so short, long, and collection versions, if you want to try that as well. It would take a collection as a whole, instead of file by file or multiple files, and then do the same analysis. It's also one of the nice things you can do in Galaxy: you select multiple files, like this, choose "Build dataset list", maybe call it samples, press Create, and you will find a collection with all the datasets inside. Now the workflow has started running, and as you can see, the tag we wrote at the beginning has propagated through all the analyses this workflow does. If I scroll all the way down, you'll see the two input samples that we have, and the collection we just built to play around and show you how to create one. And here are the results of running the workflow. The workflow has run twice, once on each sample, and the output results are shown in this right-hand panel. When a dataset is orange, it means the tool is still running, so the results are not ready yet. When it's gray, it means its time has not come yet: it's still in the waiting list because it depends on previous tools that are still running, so it has to wait a while before it actually starts.
This step will take some time, so for the sake of time I will go on and show you the results directly in a history I created before for the preprocessing, so we can move on and check the results together. Okay. So, here I had the same things, the same two datasets, barcode10 spike2 and barcode11 spike2b, and as you can see, all the tools have finished running, so their color is now green. That means all the running parts, all the tools, have finished. Going back to the training: we are now in the preprocessing. We have already run the workflow, and we set the hosts to remove to Gallus, Homo, and Bos. We know beforehand that they spiked chickens, so supposedly the only host you should find, and want to remove, is chicken; after running this workflow you should remain with all the sequences that are not chicken. For the quality-control part of the preprocessing, as we said before, the tools we used are MultiQC and NanoPlot, and for trimming and filtering we used Porechop and fastp. A quick word about trimming, which you may have seen before: you remove the adapter parts of your sequences, since those parts, for example, have a very low Phred quality score; you remove very short reads, below a specific threshold that you know based on your sequencing protocol; you remove wrong bases; and most importantly, you trim the start and the end of the reads themselves when they have very low quality. In the end you remain with all the reads that are good enough for the analysis.
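Per read, the filtering rules above come down to a couple of comparisons: is the read long enough, and is its quality high enough? A small Python sketch of that decision, with illustrative thresholds and toy reads, not the actual Porechop/fastp settings used in the workflow:

```python
# Per-read filtering decision, fastp-style: drop a read if it is too
# short or its mean Phred quality is too low. Thresholds and reads here
# are illustrative only, not the tutorial's settings.

MIN_LEN = 5       # assumed minimal read length
MIN_MEAN_Q = 10   # assumed minimal mean quality

def mean_quality(qual):
    # FASTQ encodes Phred scores as ASCII code minus 33
    return sum(ord(c) - 33 for c in qual) / len(qual)

reads = [
    ("read1", "ACGTACGT", "IIIIIIII"),  # 'I' = Q40: long + clean, kept
    ("read2", "ACG", "III"),            # shorter than MIN_LEN: dropped
    ("read3", "ACGTACGT", '""""""""'),  # '"' = Q1: too noisy, dropped
]

kept = [name for name, seq, qual in reads
        if len(seq) >= MIN_LEN and mean_quality(qual) >= MIN_MEAN_Q]
print(kept)  # ['read1']
```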
If you want to learn more, the link to the quality-control tutorial is there, where you do all the quality controlling, the trimming, and everything, so you can learn more about that. MultiQC is a tool that can group the outputs of all the quality-control tools together and gives you very nice HTML or tabular reports, where you can see what happened to your reads: what the quality was before, what it is after, and so on. So let's check together and answer some of the questions. The question here asks us to inspect the quality-control MultiQC output for barcode 10: how many sequences did barcode 10 have before and after trimming, i.e. how many reads were removed by trimming? You can check not only the MultiQC output but also the fastp output, because it also has a very nice report about that. So what I'll do here, for example, is open the HTML report; this one is for barcode 11, but we wanted barcode 10. Yeah, here is a nice HTML report where you can see how many reads there were before filtering, 114,000, and how many remained after, which is 91,000. You can also check the MultiQC output and its HTML for before and after. This figure here is for the sequence counts, the number of reads: here is before the trimming, and here is after, and as you can see, before it's also almost 114,000 and after it's 91,000. The report from fastp has shown, for example, that the beginning and the end of the reads were in the red zone, and after the trimming it's all in the yellow zone. It's still not a poor dataset; Nanopore reads on average sit in the yellow area of the quality-score scheme, but they are still good enough to continue.
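Those red/yellow/green zones in the reports are simply bands of Phred quality scores, and a Phred score Q maps to base-call accuracy as 1 - 10^(-Q/10). A quick sketch of that relation:

```python
# Phred score Q -> base-call accuracy: error probability is 10^(-Q/10)
def accuracy(q):
    return 1 - 10 ** (-q / 10)

print(accuracy(10))  # 0.9   -> 90% accurate
print(accuracy(20))  # 0.99  -> 99% accurate
print(accuracy(30))  # 0.999 -> 99.9% accurate
```

So a mean quality above 20, the threshold mentioned in the answers below, corresponds to more than 99% expected base-call accuracy.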
Nanopore data keep improving in terms of quality scoring: you can now trust a base call at more than 99%, which is much better than before, when it was really 70-80%; it's getting closer to other techniques like Illumina sequencing. You can keep reading through the figures and learn more about your reads and your sample, but for me now the goal is to answer the questions here. You know now the solution: before trimming you have 114K sequences, and after, 91K. And also here: what is the quality score of the reads before and after trimming? It was below 20 before, and now it's above 20, which is better for us to continue with. To learn more generally about quality control and Phred scores and why they are important to check before doing the analysis, I recommend going again through the quality-control training that you may have had before, or, if not, you can find it through this link within the Galaxy Training Network as well. For the host filtering that we also did in this workflow, we are using Kraken2, the tool that assigns every read to a specific taxon, using the database we added in Galaxy, which is Kalamari, the one we talked about before. The things you know beforehand are that you should see a lot of chicken, because the host is chicken, and that both samples are spiked with Salmonella, so Salmonella should be one of the species you can find. So let's check the Kraken2 outputs for barcode 10 to answer the following questions: what is the species of the host, and how many sequences of the host were found? Let's open the Kraken2 output; as you can see here, look at the number of lines.
Each line here is actually one read, so you can answer the question just by reading this: 836 reads were found as Gallus, i.e. found as chicken. If you keep scrolling down, all you will find is Gallus, which is the chicken; and even though we also had Homo sapiens and Bos in the filter, there were no human or cow in the host, so none were found; it's all chicken. To answer the question again: the species is Gallus, which is chicken, and the number of sequences is the number of lines, which is 836 reads. Now, if the output was not ready for you and you want to go directly to the next step, you can copy the output of the preprocessing workflow directly and fetch it in Galaxy, as we did before with the input datasets, and proceed with those datasets to the next step, which is the taxonomic profiling. As we said, what is the output of the preprocessing? It is the reads without the host, quality-retained by doing the quality control and the trimming with fastp and Porechop, and with the host removed, as we said. So all we need to proceed with are the Nanopore processed sequence reads for barcode 11, and the same for barcode 10. The coming step is taxonomic profiling. It is similar to what we did in the preprocessing: there, to remove the host, we used the Kraken2 tool to assign every read to a specific taxon, and then, knowing these taxa and the assigned reads, we said that we don't want, for example, chicken, Homo sapiens, or Bos.
What we have now are the filtered, preprocessed reads, and we actually want to dig deeper and learn what the other reads are, so that maybe we know which bacteria we have in our samples that might or might not be a pathogen. What we'll do in the taxonomic profiling is run the Kraken2 tool again, and you can learn more about Kraken2, which approach it uses to do the assignment, what taxonomy is in general, what the different taxa might be, and what the eight main levels of the taxonomic hierarchy are. You can learn more by clicking on the details to read about the tool and about taxonomic profiling. The database we used here is one of the standard databases of the tool, Standard PlusPF; it is one of the built-in Kraken2 databases that you can select directly in the tool's parameters. Like Kalamari, it contains a lot of sequences, i.e. known taxa, that the reads can be assigned to. Both databases are used in our bigger workflow: Kalamari was used in the preprocessing, and now we are using Standard PlusPF. Actually, we chose this database after trying a lot of others, and the difference from Kalamari is that we found Standard PlusPF was able to assign more reads to specific taxa. We got more assigned, more known reads at the taxonomy levels, which we found much better; Kalamari, on the other hand, is better at saying with confidence whether or not a read is Homo sapiens or chicken, so it's more specific in terms of host filtering. In the output of the full workflow you'll have both the Kalamari and the Standard PlusPF Kraken2 outputs, so you can go and see the differences yourself.
So here we're going to choose Standard PlusPF from the built-in Kraken2 databases for the taxonomic profiling part. Again, what you're going to do is download this workflow, which is written specifically for the taxonomic profiling: you click on the workflow link here, choose any place to download it to, and press save. In Galaxy, you go into Workflow, import from your local drive, and choose the workflow you just downloaded; for me it was downloaded before, so I will go directly and run it. Again, after importing your workflow you will find it here, along with the whole list of workflows you have created or imported before. So I pressed run on the workflow, and now I will give it the filtered, preprocessed reads from both barcodes; their name is Nanopore processed sequence reads, and these are what you're going to be using throughout all the other analyses. A small chart that will help you better understand the workflows we have been running today is this one. What we did before was run a workflow called preprocessing, whose outputs are called Nanopore processed sequence reads, for both barcodes, barcode 10 and barcode 11. If you have 10 samples, you'll have Nanopore processed sequence reads once per sample, once per file. In the next version of this training, when you have a collection, you will get a collection of the Nanopore processed sequence reads, each named with the sample name. After having these filtered, trimmed, ready-to-process reads, you will run three different workflows, actually in parallel.
And that's one of the very nice things we did in creating these workflows: you don't have to wait for one of them to finish in order to run the next. All you have to wait for, at the beginning, is the preprocessing to finish, and then you run all three in parallel, or any one of the parts alone, based on your application. What we're running now is the taxonomic profiling, where we will learn more about the reads we have, and maybe learn what other species are there. We already got an overview from running Kraken2 in the preprocessing, but now we get more classified reads by using Standard PlusPF. It's actually a step toward identifying the bacteria, the species that might or might not be a pathogen. So let's go back to Galaxy and finish setting up the taxonomic profiling workflow. We choose this one first for barcode 11, and by pressing Ctrl you also choose dataset number 26, exactly the same as before. Here, to visualize the taxonomy, instead of looking at it in the tabular form from Kraken2, we use one of the interactive tools in Galaxy; that tool is called Phinch. This tool is really nice at visualizing with different figures and graphs; you can export them all, and by sharing your history you can share all these analyses and all these nice figures, which you can play around with interactively, learning about your reads visually through this tool, your history, your output, and so on. Phinch can also take metadata for your samples: if you have metadata like the time of sequencing, maybe the host, if you have a rough idea of it, anything you know about your datasets beforehand, you can have it as a tabular or CSV file and upload it here as well when running this workflow.
It will then be part of your report in the Phinch visualization tool. I will not add it here, since it's optional; I will only choose the Nanopore processed sequence reads for barcode 10 and barcode 11 and proceed by running the workflow. One thing about Phinch is that it is best viewed using Chrome, so if you are using Firefox or something else and you want to view the Phinch output, it's preferable to do that in Chrome. Another tool you can use to view the taxonomic profiling is the Krona pie chart, and it's always good to visualize your results with Krona too. However, because our samples have a very large number of assigned reads and a lot of assignments, the dataset is actually too big for the Krona pie chart to be drawn nicely in one figure, and that's also why we used the Phinch visualization tool. Another tool that can replace Phinch is Pavian, which you can also read about here; they both do essentially the same job and can be used interchangeably. The Phinch visualization tool takes a BIOM file, so part of this workflow is to take the output of the Kraken2 tool, which is mainly tabular, and convert it into a BIOM file so that Phinch can read it. To do so, we installed another tool in Galaxy called kraken-biom, and kraken-biom converts the output, together with the metadata file if provided, into input for the Phinch visualization tool. So what you did now is start the workflow; you can wait for it, but I will now go directly to the output so we can see the Phinch visualization, though as I told you I will switch quickly to Chrome. I have switched to Chrome in order to visualize the Phinch output, and I've also opened a history where the output from the taxonomic profiling is.
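Conceptually, what kraken-biom does is pivot the per-sample Kraken2 reports into one table with taxa as rows and samples as columns, which Phinch then reads from a BIOM file. The schematic below builds such a tiny table as plain JSON just to show the shape of the data; it is NOT the real BIOM format, and the barcode 11 counts are invented for illustration.

```python
import json

# Schematic of the kraken-biom idea: one table, taxa as rows, samples
# as columns. Plain JSON for illustration only, NOT the real BIOM
# format; the barcode11 counts here are invented.
counts = {
    "Escherichia coli": {"barcode10": 10243, "barcode11": 9120},
    "Salmonella enterica": {"barcode10": 778, "barcode11": 845},
}
samples = ["barcode10", "barcode11"]
taxa = sorted(counts)

table = {
    "rows": taxa,
    "columns": samples,
    "data": [[counts[t][s] for s in samples] for t in taxa],
}
print(json.dumps(table, indent=2))
```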
After you wait for your history to finish, you will find the same things. As you can see here, the Phinch tool is in the yellow state, and it will stay yellow as long as it's still running: it's an interactive tool, and I will show you how to open it from within Galaxy. It stays yellow while running, and when the time is up (there is a time frame in Galaxy after which interactive tools are stopped), it will turn red or green depending on the state. If you want to rerun it to look at the figures again, all you have to do is click this rerun button, which runs exactly the same tool again. Let's go back to the questions to see what we should answer first, before we look at the Phinch output. One of the questions in the taxonomic profiling part is to check the output of the Kraken2 report for barcode 10, and the questions are: what is the most commonly found species, what is the second most commonly found species, and how many sequences were found for each? So let's go to the Kraken2 report output for barcode 10. As you can see here, the most commonly found domain is actually Bacteria, and if you go down to the rows marked S, which is species, the most commonly found species is Escherichia coli, with a total number of sequences of 10,243. And if you keep going down until you meet the second species (not strain, but species), you'll find the Salmonella group, and then the species Salmonella enterica, with 778 sequences in your reads that can be assigned to the Salmonella species with the Standard PlusPF database. That's for barcode 10. So now you have the answer; let's look at the solution and see whether we were correct or not.
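Reading those answers off the report means scanning its rank-code column for S (species) rows and comparing the clade read counts. If you had the report outside Galaxy, the same lookup is a few lines of Python; the snippet below is a toy with the column layout of a Kraken2 report (percent, clade reads, direct reads, rank code, taxid, indented name), and only the numbers shaped like the barcode 10 answers are real.

```python
# Toy Kraken2-style report: percent, clade reads, direct reads, rank
# code, taxid, name. Only the E. coli / Salmonella counts mirror the
# barcode10 results; the rest of the file is made up.
report = """\
 55.69\t50455\t50455\tU\t0\tunclassified
 44.31\t40143\t120\tR\t1\troot
 11.31\t10243\t10243\tS\t562\t        Escherichia coli
  0.86\t778\t778\tS\t28901\t        Salmonella enterica
""".splitlines()

species = []
for line in report:
    pct, clade_reads, direct, rank, taxid, name = line.split("\t")
    if rank == "S":  # keep species-level rows only
        species.append((int(clade_reads), name.strip()))

species.sort(reverse=True)
print(species[0])  # most commonly found species
print(species[1])  # second most commonly found species
```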
And indeed the most commonly found species is E. coli with the same number of sequences we found, and the same holds for Salmonella, the second most commonly found species within the reads of the barcode 10 sample. Now let's answer the third question: how many sequences are classified and how many are unclassified? You can actually answer this from here: when you click on the dataset itself, here on the report, and scroll, you can see the percentages, an overview or summary of the report itself. As you can see, 40,143 sequences were classified, almost 44% of the reads in the sample, and 50,455 sequences were unclassified, which is about 55% of the reads. Let's check the answer; yes, it is exactly what we found in our solution. The next question is to compare the two Kraken2 runs: the one you did in the pre-processing using the Kalamari database, and this run in the taxonomy profiling where you used Standard PlusPF. To do so you can go to the history where you ran the pre-processing, check the Kraken2 report for barcode 10 with Kalamari, and answer the same questions. Here you see that with the Kalamari database the most commonly found species is also E. coli, but with a different number of sequences: in this case more sequences were classified as E. coli by Kalamari than by Standard PlusPF, and the same for Salmonella, the second most commonly found species in the barcode 10 sample, again with more assigned sequences than Standard PlusPF found. Now to the classified versus unclassified part: the number of sequences classified with the Kalamari database is around 32,000, which is much less than the roughly 40,000 classified by Standard PlusPF.
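The questions above can all be answered mechanically from the report file. As a sketch, here is how you might parse a Kraken2-style report in Python to pull out the species lines and the unclassified total. The numbers in the toy report are illustrative, but the six tab-separated columns (percentage, clade reads, direct reads, rank code, taxid, indented name) follow the real report layout.

```python
# Toy Kraken2-style report (tab-separated; values are made up).
REPORT = """\
 55.68\t50455\t50455\tU\t0\tunclassified
 44.32\t40143\t97\tR\t1\troot
 30.10\t27274\t120\tD\t2\t  Bacteria
 11.30\t10243\t9800\tS\t562\t    Escherichia coli
  0.86\t778\t700\tS\t28901\t    Salmonella enterica
"""

def parse_report(text):
    """Parse each report line into a dict of the fields we need."""
    rows = []
    for line in text.splitlines():
        pct, clade, direct, rank, taxid, name = line.split("\t")
        rows.append({
            "pct": float(pct),
            "clade_reads": int(clade),
            "rank": rank,
            "name": name.strip(),
        })
    return rows

rows = parse_report(REPORT)
# Species lines carry rank code "S"; sort by clade read count.
species = sorted((r for r in rows if r["rank"] == "S"),
                 key=lambda r: r["clade_reads"], reverse=True)
unclassified = next(r for r in rows if r["rank"] == "U")
print(species[0]["name"], species[0]["clade_reads"])
print(unclassified["clade_reads"], "reads unclassified")
```

This is only a hand-rolled sketch; within Galaxy you read these values straight off the report, as we just did.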
And actually that's the reason why we decided to proceed in the taxonomy profiling with Standard PlusPF: it was able to assign more sequences. In this part specifically, all we are trying to do is learn, down to the species level, which taxa we have, for example which bacteria might be there. You can choose any of the built-in databases that already exist in Kraken2, or you can upload your own database that classifies into taxa, like we did with Kalamari. So feel free to adapt this part with whatever database you find most suitable for your samples. Now to the most interesting part, the Phinch visualization tool, and how to open it within Galaxy. What you have to do is go to User, and at the very bottom there is Active InteractiveTools; press on it, and there you will find all the interactive tools that are running, or that were running but have already stopped, within the time you were working in Galaxy. Here I have it twice, once for each sample: once for barcode 10 and once for barcode 11. I will click on one of them, and it opens another page, which is Phinch. As you can see on the very far left, there are all the tabular columns: if you uploaded a metadata tabular or CSV file with, for example, an ID, a time, a location, or other metadata for the sample, along with the sample itself, it will appear here, and you can choose whether or not to view these fields along with the figures afterwards. So if, for example, you have time and location and you clicked on them, and you proceed to the gallery, you will find them along with the figures.
So you can export them at any time, you can share them, and all the metadata that you have for the sample will be shown with the visualization as well. Let's proceed to the gallery to see the figures you can view. All of these visualizations you can use, share, and play around with, to share with your professors, your research group, or your work in general. They are interactive: in the taxonomy bar charts, for example, you can go from the kingdom down to the species level, you can present values or percentages, and you can browse by name or by ID. There is also a top sequences part here where you can view what we have already read in the report: the most commonly found species and with how many sequences, exactly as in the report, then the second most common, and so on. If we do the same for barcode 11, we press here and proceed to the gallery. We have our chart, we go down to the species level, and the top sequences are shown here: the top sequence is the bacterium Salmonella, then the second, and so on, with the number of sequences assigned to each taxon. So this is how cool the Phinch tool is to view, and it gives a very good overview of what might or might not be a pathogen. For the next steps, you'll see that we start identifying what the pathogens really are, whether or not the bacteria we have are pathogens; but until now it is good enough to see and visualize which bacteria are there and what might or might not be a pathogen. So, back to Firefox, or you can continue in Chrome if you want, it's up to you, and I will be back again.
So, back from the taxonomy profiling, where we saw all these nice figures representing the different taxa in our samples from the kingdom down to the species level, and learned how to check the number of sequences and the different types of databases. Now we come to the core of our training today, where we will identify the pathogens agnostically. In the taxonomy profiling we saw that we have a lot of bacteria at the species level; now let's identify whether these bacteria are pathogens or not. So, back to the nice figure to see where we are. We are done with the pre-processing, and as we said, you will be running three different workflows in parallel; we have finished the taxonomy profiling, and now we will run the gene-based pathogen identification. In this workflow we will identify pathogens and do other types of analysis that I will describe now, and the output of this workflow will then be taken to visualize the pathogens of all the samples together. The gene-based pathogen identification tells you whether or not there is a pathogen, and where exactly this pathogen is found in the sequence, for every sample separately, since we run it separately on every sample. Then, to visualize it for all samples, compare them all together, and know at which time point the pathogen took place and in which sample it started to appear, across all the samples you might take in real life, we run the final workflow at the very end of the training today, which compares and visualizes the pathogens found in all samples together. The one before the last is a different analysis that we will talk about later, the SNP-based pathogen identification. So now, in the gene-based pathogen identification, let's see how we will identify the pathogens.
To do so, we decided to identify them by detecting whether or not we have virulence factors. Virulence factors are gene products, usually proteins, and they are always involved in pathogenicity. So when we find a virulence factor at a position in the sequence, we say that we have a pathogen; this is the way we decided to identify pathogens in this workflow. The analysis we do here is not only identifying the pathogens but also some other analyses, like identifying the antimicrobial resistance genes found in our samples, and identifying the sequence typing schema by searching the multilocus sequence typing (MLST) database. In order to do all of that, we first do an assembly from the filtered, processed reads: the output of the pre-processing, as we said, are the reads filtered from the host material (chicken, milk, and meat). Now we have these FASTQ files of reads, very short reads, and what we are going to do is build contigs. Contigs are reads concatenated together, forming part of a sequence, or what might be a sequence. As you may have learned in an earlier training about mapping: there you have a reference genome, you map your reads to it, and at the end the reads are arranged in such a way that you can say, this is the genome sequence of our sample. But here we are working agnostically: we don't want to use any reference genome, because we don't know which pathogens we might have; we know nothing, and we work agnostically to discover everything we might find in the sample.
So what we do now, instead of using a reference genome, is assembly, or contig building, where we concatenate the reads together, using algorithms that are readily available, to build up larger pieces from the reads, which we call contigs. There are a lot of tools for this, and the one we are going to use here is Metaflye. Then, to polish the assembly, that is, to improve its quality, we decided to use the Medaka consensus pipeline. We can visualize the output of the assembly using one of the tools in Galaxy called Bandage Image. For the MLST step, searching the multilocus sequence typing database, there is a tool in Galaxy called MLST. Then, second to last, to identify the antimicrobial resistance genes we use a tool called ABRicate. This tool ships with several databases you can use for analysis, and one of them is the AMR database that we used to identify the AMR genes. And finally, the main core of our training today: identifying the virulence factors, again using ABRicate, but now with another of its databases, the virulence factor database. We also use some tabular manipulation tools, like the ones in the pre-processing: we worked on the tabular reports of all the samples and prepared them for the final workflow that we will run at the end of the training, which visualizes all the samples together. So let's start. Before we start, the first thing we need to upload into our history is a tabular file for the MLST tool: the output of MLST has no header, so we decided to create a header for it to make any report we create more readable.
So what we are going to do is copy the link, go to Upload Data, choose Paste/Fetch data, paste it here, make sure to change its type to tabular, and press Start. The next step is to download the workflow and import it, as we did before, in order to run it. We go to this workflow; it downloads here. Then we go to the Workflow section and press Import; we import it from the local location where we saved the workflow (for me it is already there, so I will not do it), and finally we press Import workflow. It will open up the workflow area again; if not, just go to the workflow area, and you will find it at the very top of the list, together with all the imported workflows and the workflows you made yourself in your Galaxy account. Here I will run it; in your case it is called Nanopore datasets gene-based pathogen identification, imported from uploaded file. We press Run workflow. I am going to give it one file at a time; I will not choose multiple files as we did before, because I want to give every sample a sample ID, which can be used afterwards in the joint visualization of the samples. However, in the new version of the workflows that I told you about, which will be added soon to the training material, you will be able to run a complete collection as a whole: the workflow will automatically detect the sample ID, so you will just work with a collection, without running the workflow multiple times and without writing a sample ID for every sample. So stay tuned for the update of the workflows and the newest version of the training material very soon.
So for this case I will just choose the Nanopore processed sequence reads, which is dataset number 26, that is barcode 10, and I will give it an ID, for example "barcode 10 spiked"; then I will choose the uploaded MLST report header, and I will run the workflow. It will run and take its time. Meanwhile, I will run it again for the second sample, barcode 11: I run it again, choose dataset number 50, I guess, which is exactly the same Nanopore processed sequence reads but for barcode 11, set an ID again, "barcode 11 spiked", choose the same MLST report header tabular file, and run the workflow. For the sake of time, I will go through a history that I prepared beforehand, as at the beginning of the training, and let's check the output of the gene-based pathogen detection together. The first part of this workflow, the gene-based pathogen identification, is the assembly, where we created the contigs. As we explained at the beginning, the tool we use is Metaflye (or, under another name, Flye); to visualize the output of this tool we use Bandage Image; and before the visualization we can do polishing to correct the long, error-prone Nanopore reads, so it is actually a step specific to Nanopore reads. So let's answer the questions together. How many different contigs did Flye produce for barcode 10? Let's check: we go down to barcode 10 and check its Flye output FASTA file, and as you will see here, we have 139 sequences. That means 139 different contigs were created in this Flye run. The workflows in the training are now updated, and we get the same number here, 139; however, the written answer, I guess, is slightly different. Yeah, it is an older version of the answer, saying 137 contigs; I will update that soon.
So, however, we do have 139; that is the correct answer, so you should find 139 contigs created by Flye. The next question is how many were left after running the polishing with Medaka. We go again and check it for barcode 10: it is here, the polished contigs where we ran Medaka, and as you can see we find the same number of sequences, 139. Medaka did not remove anything. The Medaka tool has a lot of arguments where you can set thresholds to keep or remove contigs based on your goals, and apparently Medaka found the Flye output good enough that it did not remove any of the contigs created by Flye in the first place. The same thing is also written here. And finally, let's look together at the graph from Bandage Image, which drew the assembly, the contigs, for barcode 10. It is very nice for reports, and for people who study contigs, to view such figures at the end. The next part of this workflow is that we wanted to search the MLST database for a schema and see which schema our sample belongs to. So let's check the output of MLST for barcode 11. Let me show you first without the header, and then with the header. Without the header, we have a report with the sample ID and the schema itself: it is Salmonella enterica, with this schema name that you can go and search for; I have put a link for that. And now let's see the report after adding the header that we uploaded together: the tabular file we uploaded is the header of the MLST output, so that when you read it, it is easier to understand what each column refers to.
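Counting contigs by hand, as we just did, amounts to counting the header lines in the FASTA file, the lines that start with ">". A minimal sketch, with a made-up three-contig toy assembly:

```python
# Toy FASTA assembly; contig names and lengths are illustrative.
FASTA = """\
>contig_1 length=52310
ACGTACGTACGT
>contig_2 length=10994
TTGACCATGCAA
>contig_3 length=784
GGCATTAC
"""

def count_fasta_records(text):
    """Each '>' header line marks one sequence record (one contig)."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))

print(count_fasta_records(FASTA))  # prints 3
```

In Galaxy you get this number for free from the dataset preview, which reports the number of sequences in the FASTA file.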
As you can see here, there is a link so that you can read more about the schema that barcode 11 was found to be related to in the MLST database. The last but not least part of this section of the workflow is to identify the antimicrobial resistance genes. Here we ran the ABRicate tool using the NCBI AMRFinderPlus database that comes with ABRicate. Let's see whether we had AMR genes, and how many we found, by checking the ABRicate output for both barcode 10 and barcode 11. We go to barcode 11 first: as you can see, there are no rows, which means no AMR genes were found by ABRicate for barcode 11. Now let's check the same output for barcode 10, the AMR genes identified with the NCBI database. We have a bigger table now, with rows, which means antimicrobial resistance genes were found; here we found five. From this table you can learn more about each found gene: in which contig it is found in the sequence, from which start position to which end position, how good the coverage of the gene is and the coverage percentage, and the accession ID, which is global, so you can search for it and find exactly which gene and which product we are talking about, and how strong the resistance is, and so on. So this is a table you can use to learn more about the genes that were found, and people who are experts in AMR genes will understand it better. Let's go to the next step and see if we answered correctly. Yes: for barcode 10 we found five different rows, that is, genes found in different locations, and no AMR genes were found for barcode 11.
Again, to the main core of the training, and the main core of this part of the workflow, where we identify the virulence factors in order to identify the pathogen in the end. We run ABRicate again, this time using the virulence factor database, so let's check the output. For the virulence factors it is the same tool, so you have the same column names: the file, in which contig, the start position, the end position, the short gene name, the coverage percentages, and the accession ID, which again is global; by searching for this ID you will find the name and exact species, actually identified down to the strain level. Here we found Salmonella enterica, and the strain name is LT2. So you have all the information you need to identify the pathogen: since we found virulence factors for barcode 10, the barcode 10 sample contains the pathogen, and here is the strain name of this pathogen. If there were more than one pathogen, you would find them here; but since we already knew beforehand which strain and which species it was, it is all one species, perhaps with different virulence factors, or different levels of severity, in different locations of our sequence. So that is a nice table, but it will be nicer with the visualization in the last workflow. Let's also check barcode 11; maybe we found something, maybe not. Yes, we also found pathogens. How many? We can check the number of lines: we found 97 lines, so 97 different virulence factors for barcode 11. Let's check barcode 10 again, because I forgot to check the number: it is 134 different virulence factors found, identifying that these two samples indeed contain pathogens.
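Counting hits per sample in an ABRicate-style report is just counting non-header rows grouped by the first column. A sketch with a made-up tab-separated report (the column set and values here are illustrative, not a real ABRicate run):

```python
from collections import Counter

# Toy ABRicate-style report: '#'-prefixed header, then one row per hit.
TSV = """\
#FILE\tSEQUENCE\tSTART\tEND\tGENE\t%COVERAGE\tDATABASE\tACCESSION
barcode10.fa\tcontig_1\t1200\t2100\tinvA\t99.8\tvfdb\tVFG000001
barcode10.fa\tcontig_3\t300\t950\tsseL\t97.2\tvfdb\tVFG000002
barcode11.fa\tcontig_2\t10\t760\tinvA\t98.4\tvfdb\tVFG000001
"""

def hits_per_sample(text):
    """Count virulence-factor hits per input file (first column)."""
    counts = Counter()
    for line in text.splitlines():
        if line.startswith("#"):  # skip the header row
            continue
        sample = line.split("\t")[0]
        counts[sample] += 1
    return counts

print(hits_per_sample(TSV))
```

This mirrors what we did by eye in Galaxy: the line count of each sample's table, minus the header, is the number of virulence-factor hits.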
So yes, the numbers need to be updated, but this one at least, 97, is correct, and the other is almost the same. Now we are done identifying the pathogen: from the taxonomy profiling we knew that we have bacteria, and now we know for sure that one of the bacteria we found, Salmonella, is definitely a pathogen, down to the strain. So we have identified the pathogens agnostically, and now comes another type of analysis that we will do together in the next workflow. SNP-based pathogen identification is the next part of our analysis, the small workflow that we will run together now. As you remember, it is actually the third one that you can run in parallel after having the pre-processing output, the filtered reads you had at the end. You have these three parallel workflows: we have already run the taxonomy profiling, identifying the bacteria and all the other taxa and how many reads are assigned to every taxon; then the gene-based pathogen identification, where we learned that we have pathogens, in which contig and at which location, in every sample separately, and where you can also add whatever tool you want for other analyses. Now comes the third workflow to run after the pre-processing, where we will find the variants in our sequences. This part is useful, for example, with coronavirus, SARS-CoV-2, datasets: when you have a lot of samples, by running this workflow you can find novel alleles and new variants of the data you already have. From this part of the workflow onwards we are no longer agnostic; we now have a guess. For example, since we are guessing that we have Salmonella, the first thing you have to do is get the Salmonella reference genome.
And then we do a mapping, the one you may have already taken in a previous training; if you have passed through that training, you have already passed through mapping. What we do in the mapping is this: you have a reference genome and your sequence reads, filtered of course, and you start to compare them. After this comparison, by taking the BAM file output of the mapping, you detect the variants: you see which nucleotides differ from the reference genome and are repeated among all the reads, and then you can call them variants, or single-nucleotide polymorphisms (SNPs). This, as I told you, is important for finding novel alleles and new variants. It is also important if we are trying to find the pathogens, but not agnostically: for example, if we have the reference genome of the pathogenic sequence, a Salmonella reference genome that we are sure is pathogenic, and we map our sequences to it, then wherever our reads or contigs map completely, this will also be a pathogen in our sample. The first step for our SNP calling is to upload a reference genome that we took from public databases, and we have it here for you: all you have to do is copy the link and put it into your history. You choose Paste/Fetch data, paste it here, then press Start. Afterwards, you download your workflow as we did before: you click on this blue workflow button, your workflow gets downloaded, you choose a local location, and you press Save.
For me, I had it before, so I will not do it again. Then you go to your Galaxy interface, into the workflow area, and press Import; there you can browse and choose your workflow from the location you saved it in. For me, I will press Cancel, since I already have it; for you, after it is selected, you press Import workflow. It will open the workflow panel again; if not, just press on the workflow area, which lists all the imported workflows as well as the ones you created yourself, so all the workflows you have in your user account you will find there. Now I will go to the SNP-based pathogen identification, imported from uploaded file; you will probably find it at the top of your list because you just imported it. Press on it and run the workflow. Here we are back to multiple-file datasets: you choose your pre-processed reads, and if you do not find them pre-selected, you can click here and choose them yourself, the Nanopore processed sequence reads for barcode 10 and for barcode 11, both of them; you can select both by holding Control on your keyboard, then press OK. You choose the reference genome that you just uploaded, dataset number 3774, and you run your workflow. For the sake of time, I will again go to the history I pre-created for the SNP calling so we can check the results together. Let's see what our questions are, speaking first about the tools we use at every step. We said at the beginning that in order to do the SNP or variant calling we first need a mapping, and since we have Nanopore datasets, we use a tool called Minimap2; this tool does for Nanopore datasets exactly the same mapping that other tools would do for Illumina sequencing.
For Illumina datasets, all you have to do is replace this Minimap2 tool, used for mapping Nanopore datasets, with the tools used for Illumina sequencing datasets, and then you are all set to run exactly the same things on Illumina sequencing data instead of Nanopore. Then, after having the BAM file from running the Minimap2 tool, you do the variant calling step, where we used a tool called Clair3. This tool is the one used nowadays in current studies; it is up to date and continuously updated, and we prefer to use something whose developers are still reachable. We also have it on Galaxy, which is another reason we use it. Other tools you can use are the Medaka consensus tool and the Medaka variant tool: these serve the same purpose, variant calling, and they will give you the same results as Clair3. However, they are very much slower than Clair3; the runtime is dramatically different between the tools, so we kept Clair3 instead of the Medaka consensus and Medaka variant tools. After having the results from the variant calling, we do normalization using bcftools, and then we run some filters in order to keep only the best-quality variants. The output of Clair3, the variant calling, is a big table indicating the position in the reference and the position in our sequence, saying that they differ, and giving the quality of the call identifying them as a variant; we should keep only the ones with good, or at least passing, quality. For filtering we can use either the SnpSift filter or the lofreq filter, and both of them behave much the same in runtime and results.
So you can choose whichever you want; here we have chosen the SnpSift filter. And finally, for your reports, you don't have to export the whole table with all the details about every single nucleotide or every found variant; you don't have to report the full rows with all their columns. You can create a smaller table from this big one, keeping only, for example, the reference, your sequence, and the quality, just three columns out of all of them; to do so you can use a tool called SnpSift Extract Fields. So let's look at our results and tool outputs together and see if we can answer the questions correctly. The first question is: how many variants were found by Clair3 for barcode 10? All we have to do is go to the Clair3 output for barcode 10 and check it. What you will find here is 2,651 lines and 15 comments; the 15 comment lines are mainly the explanation of the header fields, what CHROM is, and so on, so every one of the titles is explained there, and they are just 15 rows of explanation. The important number is the 2,651 lines, which are the variants found, so the answer is 2,651. Let's see: yeah, it is a little bit different; that's because the answers in the material are slightly out of date. If you run it with the current workflows that you have now in the training, you will find the numbers we showed, with just a difference of one variant. Next: how many variants were found after quality filtering? Remember, it was 2,651 before; after running the filtering it is 2,488, once we removed the low-quality calls that came out of Clair3.
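Both questions come down to simple operations on the VCF: variant lines are the ones not starting with "#", and quality filtering keeps rows whose QUAL column passes a threshold. A toy sketch (positions, qualities, and the contig name below are made up, but the column layout follows VCF):

```python
# Toy VCF: '##' meta lines and the '#CHROM' header are comments;
# every other line is one called variant.
VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
NC_003197.2\t1001\t.\tA\tG\t24.1\tPASS\t.
NC_003197.2\t2050\t.\tT\tC\t3.2\tLowQual\t.
NC_003197.2\t7777\t.\tG\tA\t31.7\tPASS\t.
"""

def variants(text):
    """Return the data rows (split into columns), skipping '#' lines."""
    return [l.split("\t") for l in text.splitlines() if not l.startswith("#")]

def pass_qual(rows, min_qual=20.0):
    """Keep rows whose QUAL (column 6) meets the threshold."""
    return [r for r in rows if float(r[5]) >= min_qual]

rows = variants(VCF)
print(len(rows), "variants,", len(pass_qual(rows)), "after filtering")
```

The threshold of 20 here is only an example; in the workflow the filtering expression is what the SnpSift filter step applies.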
We now have all the passing-quality variants from the filter: instead of roughly 2,600 we have roughly 2,400. Let's check; yeah, it is about the same. And can you detect the strain from this table? We already knew it from the gene-based pathogen identification workflow, but you can also see it from here: if you take any one of these accession IDs, the same ones we had from the gene-based pathogen identification, and search for it, you will find LT2, and by pressing the link you have in the training you can go on and read more about it. Now, one of the things this workflow also does is build the full genome of the samples that you have. Why is that important? It is important if you want to compare the full sequences, the full genomes, of the samples together in one of the coming steps, by growing phylogenetic trees, for example, of all the samples' full genomes together. This is really useful when the sequence is short, as in the SARS-CoV-2 case; but for Salmonella the reference genome is really big, and it would not be that efficient to draw the full thing for all the samples together. In the case of SARS-CoV-2, for example, when you have the full genome sequences you can draw this phylogenetic tree and relate variants together: see how they share common ancestors, which other variants they come from, which are the parents of these variants, until you reach the final nodes. It is really nice. In our case we will also draw phylogenetic trees, but instead of the full genome we will do it for the contigs, the contig locations containing the virulence factors; and this will be in the last step, which we will do right now.
So, to check the solution for the last part here, or to do it in the first place: we have run a tool called bcftools consensus, which takes the variants called from the Minimap2 mapping against the Salmonella reference genome and applies them to that reference to build the full genome sequence of the sample. So let's check it for barcode 11. We go up to the bcftools consensus step and look at its output, and we find that it gives us two sequences. The question here is: why do we have two sequences? The answer is that the reference genome we used for the mapping, the official, publicly available one for Salmonella, itself contains two sequences, exactly as here: one of them is the complete genome and the other is the plasmid genome. Let me show you. It's a very long file; I will search for the '>' sign to move through it quickly. Yeah. As you can see, the second sequence in the file is the plasmid genome. That's why the consensus genome built for our sample, barcode 11, also has two sequences: one for the complete genome and the other for the plasmid genome. So we have reached the final part of your training today, which is the last workflow we'll be running together: pathogen tracking among all samples. If you remember this nice figure, you can actually run this one directly after the gene-based pathogen identification. It takes the tabular outputs of the gene-based pathogen identification for all the samples, groups all of them into collections, manipulates the tables in different ways, and then draws a heat map and a phylogenetic tree for every identified virulence factor, that is, every identified pathogenic gene product. That's how this workflow works.
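To illustrate what the consensus-building step does conceptually, here is a minimal Python sketch. It is not bcftools consensus: it handles only simple substitutions (real tools also apply indels, respect filters, and work over multiple reference sequences such as chromosome plus plasmid), and the reference string and SNP list are toy values.

```python
# Minimal sketch of consensus building: apply called SNPs to the
# reference sequence to obtain the sample's consensus genome.
def apply_snps(reference, snps):
    """snps: list of (pos, ref_base, alt_base) with 1-based positions."""
    seq = list(reference)
    for pos, ref_base, alt_base in snps:
        # Sanity-check that the reference base matches before substituting.
        assert seq[pos - 1] == ref_base, "reference mismatch at %d" % pos
        seq[pos - 1] = alt_base
    return "".join(seq)

reference = "ACGTACGT"                   # toy reference sequence
snps = [(2, "C", "T"), (7, "G", "A")]    # two toy substitutions
consensus = apply_snps(reference, snps)  # "ATGTACAT"
```

Run per reference sequence, this is why our output mirrors the reference's structure: one consensus for the chromosome and one for the plasmid.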
So, basically, at the start: if your gene-based pathogen identification run hasn't finished yet, we can all now go and copy these outputs from the gene-based pathogen identification, which we will be using in the phylogenetic tracking among all samples; they will be named exactly like this once the gene-based pathogen identification workflow finishes. Another thing: if you remember, we already said there will be another version of the training material where we'll be working with collections. For now these are separate files, and what we're going to do with them is upload them into our history (or use them directly if the gene-based pathogen identification has finished) and then group the corresponding files from all samples into one collection per output type. Once the new version of the training (the short, long, and collection versions) is finished, and it will be finished very soon, the collections will be there for you directly: you'll use the collections output by the gene-based pathogen identification, and you'll be able to run the workflow without this grouping step; in the collection version of the training, this section will not exist. So let's now copy the files to our history. I will just go and copy all the tabular files together, go to data upload, choose Paste/Fetch data, paste all the tabular files, make sure I set their datatype to tabular, and press Start. Then I'll go on and copy the FASTA files; remember, these are the contigs created for every sample after running the Flye tool in the gene-based pathogen identification. I will again upload them as datasets from here, and this time I don't have to choose a datatype; it will be detected automatically.
I will go back to the top. As I told you, we will be creating collections. If we have two samples, like now, every two files will be grouped into a collection; if we had ten samples, it would be a grouping of ten files. That's why the collection concept in Galaxy is really useful: you don't have to regroup every time, and you don't have to run the workflow separately for every sample. We tag each dataset with a sample ID, like in the gene-based pathogen identification. It's even nicer this way, but you still have the option to run on separate files if you really want to do it separately. So now let's create the collections; to do that, let's group every two files together. Let's go up and see if they're uploaded. The two contigs files we group into a list with "Build dataset list", and we call it "contigs". The two VF accessions files, one for each barcode, we group into a list and call it "VF accession IDs"; these are the accession IDs of all the virulence factors found in both samples. The next collection is the VF accessions with sample IDs, so let's group those into a list too; I mistyped a letter here, no problem, the collection name is just "VF accessions with sample IDs". And finally, the collection for the VFs themselves for both barcodes: again select them, build a dataset list, and call it "VFs". Now we are ready to run our workflow and visualize all the samples and all the pathogens found together in nice figures. So let's see how we can do that. The first thing is to download the workflow, because we are doing the short version: I click on the link here, it downloads to a location locally, and then you import it by going to the Workflow section in the top panel of your Galaxy.
Then you press Import, browse to the location where you saved it, and click "Import workflow". The workflow panel will open again, and you'll find it at the very top of your list. You can see I already have it, so I just go and choose it: Nanopore datasets... no, not this one; the longer one, "Nanopore datasets reports of all samples along with full genomes and VF gene phylogenetic trees", imported from the uploaded file. So we go on and run this workflow, very simply filling in every collection in its correct place: the contigs collection goes into the contigs input, the VFs into the VFs input, the VF accessions into the VF accessions input, and finally the VF accessions with IDs into the VF accessions with IDs input. Then we run the workflow. What does the workflow do? The first thing is that it draws a heat map of all the samples next to each other, using a ggplot2-based heat map tool that exists in Galaxy. What's important about the heat map itself: for example, when you have ten samples collected at ten different time points, say every hour you collect a sample from the same location, then using this heat map you can see exactly when the pathogen appeared over the course of the day. The heat map plots all the samples on the x-axis and all the found virulence factors, the pathogenic genes, on the y-axis, and you can see exactly where each specific gene was detected. In our case we have two samples, both of them pathogenic, and both of them Salmonella. So let's check what this heat map looks like. In my case I will not wait for it to run; I'll just go to the history that we created before, in advance, to see it. Let's go for it and see how things look. Yeah.
Okay, I ran the same workflow in the same history again, so let's just go and check its results. No, we're in the SNP-calling history, that's why. So let's switch history: I will go to the history where I had done this before, which in my case is called "all samples". Let's look at the outputs together and check the heat map. The heat map is drawn with ggplot2; let's open it and see what we have. On the x-axis we have the samples with the sample IDs we assigned to barcode 10 and barcode 11, and on the y-axis we have all the accession IDs from the tables we got as output from the aggregation tool, if you remember, in the gene-based pathogen identification. So: red and white spots. Both samples are Salmonella species, and that explains why so many of the red parts are shared. If a spot is red in both of them, it means that the pathogenic gene with this accession ID is found in both samples, and that's mainly because both are from the same species, so it makes sense that they share a lot of gene products. Some other parts are white, so they differ: some genes are not found in barcode 10 but are found in barcode 11, and that may be because they are different strains, which causes these differences. Some gene products are found in one of the samples and not in the other, but because both of them are pathogenic you still see plenty of red in the heat map results. In another example, say ten samples from different locations or different time points, you might find it all white from sample one until sample ten.
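The data behind such a heat map can be sketched as a presence/absence matrix. The following Python illustration (the sample names and accession IDs are invented) builds one from per-sample sets of detected virulence-factor accessions, where a 1 corresponds to a red cell and a 0 to a white one:

```python
# Minimal sketch of the heat map's underlying data: samples on the x-axis,
# virulence-factor accession IDs on the y-axis, 1 = detected (red),
# 0 = not detected (white). Names and IDs here are made up.
def presence_matrix(hits_per_sample):
    """hits_per_sample: {sample: set of accession IDs found in it}."""
    samples = sorted(hits_per_sample)
    accessions = sorted(set().union(*hits_per_sample.values()))
    return {
        acc: {s: int(acc in hits_per_sample[s]) for s in samples}
        for acc in accessions
    }

hits = {
    "barcode10": {"VF0001", "VF0003"},
    "barcode11": {"VF0001", "VF0002", "VF0003"},
}
matrix = presence_matrix(hits)
```

Rows that are all 1s are the genes shared by both samples (same species); a row like VF0002's, present in one sample only, is one of the white-and-red rows that hint at a strain difference.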
Sorry, from sample one to sample five you find it all white, not a single red dot, and starting from sample six the red starts to appear. That means that at the location where we collected sample six, or at the time when we collected sample six, the pathogen started to take hold and appear. So that's the first of the visualization techniques we have here. The next one is the phylogenetic tree. To draw the phylogenetic tree we use ClustalW to do the multiple sequence alignment: we have more than one contig from the assembly, the FASTA file we had before, so there are a lot of contigs, or sequences, that we want to align in order to draw the phylogenetic tree, relate them to common ancestors, and relate the samples together. Then, once we have this multiple sequence alignment, we build the tree with FastTree and visualize it using Newick Display. So what is the output of the phylogenetic tree step? It's actually a collection, where we drew a tree for every found pathogen: for every pathogenic gene product, that is, for every accession ID, we build a separate tree. For example, when I clicked on this accession ID, I found that it's only present in barcode 11. It's found in contig number 8, from this start position to this end position, and again from this other start position to this other end position. So this pathogenic gene product is found only in barcode 11, in contig 8, at different locations on that contig. What else do we have in the other questions? We are also asked to check 461819, so let's see what that one is. 461819... yeah, almost here.
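To give a feel for the alignment-to-tree idea, here is a toy Python sketch. It is not ClustalW or FastTree: it just counts pairwise differences between already-aligned, made-up sequences, joins the most similar pair as sisters, and emits a tiny Newick-style string of the kind that Newick Display visualizes.

```python
# Toy illustration of tree building from an alignment: count pairwise
# differences, then join the closest pair first. Real tools (FastTree)
# use proper evolutionary models; sequences below are invented.
from itertools import combinations

def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def closest_pair_newick(aligned):
    """aligned: {name: aligned sequence}, all of equal length."""
    a, b = min(combinations(aligned, 2),
               key=lambda pair: hamming(aligned[pair[0]], aligned[pair[1]]))
    rest = [n for n in aligned if n not in (a, b)]
    # Join the two most similar sequences as sisters; attach the rest.
    return "((%s,%s),%s);" % (a, b, ",".join(rest))

seqs = {
    "barcode10_contig1": "ACGTACGT",
    "barcode11_contig1": "ACGTACGA",   # 1 difference from barcode10_contig1
    "outgroup":          "TTTTTCGT",   # several differences from both
}
tree = closest_pair_newick(seqs)
```

The two near-identical contigs end up as sisters, mirroring how contigs carrying the same virulence gene in related strains cluster together in the real trees.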
As you can see, we can find it in barcode 11 as well as in barcode 10: in barcode 11 it's in contig 1 and in barcode 10 it's in contig 118, at different locations, and if you look further you can find it in other contigs too, or at other locations on the same contig, and so on. So some of these are the red spots: this one, for example, would be a red location on the heat map, common to both samples, while the one we checked at the beginning was only found in barcode 11, so it was red for the second sample and white for the first. This is also good for comparing the samples together, seeing how some parts of the contigs are related and how the samples are related; here you can find a lot of things in common, and that's because both of them are from exactly the same species, but not the same strain. That was the last type of visualization you have today, and here are the answers to the questions we've gone through together, so you can check the same for any of the other accession IDs. If you want to know what an accession ID stands for, you can just copy and paste it into Google; it's a worldwide accession ID used openly everywhere, so you will find it directly once you search for it. Okay, so finally we have reached the last part of today's training, so let's sum up what we did. This figure is actually the more detailed version of the short one we have here. Together we ran the preprocessing, to improve the quality of the reads by trimming the bad-quality parts at their beginnings and ends, removing the adapters and so on, and by removing the host (chicken, milk, meat, or any other host we are not interested in).
Then we take its output, the processed reads, and run three different workflows. We ran them one after another, but you can really run them in parallel; you don't have to wait for one of them in order to run the other. First comes the taxonomic profiling, assigning the reads to different taxa and getting closer and closer to what we actually have: we learned which bacteria are present in our samples and how many of the reads are assigned to each species, from the kingdom down to the species level. Then we identified the pathogens in the gene-based pathogen identification, for every sample separately, and we produced a lot of tables and other analyses, and you can add more for your own personalized version of the workflow. Then we did variant and SNP calling; everything up to this point was done agnostically, but here we worked non-agnostically, using the reference genome, in order to find the variants and perhaps discover novel alleles or other variation in the sequence. And finally, the last workflow, the fifth one, comes after the gene-based pathogen identification where we identified the pathogens agnostically: we take all the tables from every sample, group them together, manipulate these tabular files, and visualize them with a heat map and phylogenetic trees. And as a reminder, in the next version of the training you won't be running every sample on its own; you will just upload a collection of all the samples to the preprocessing, and then you will get collections as results from the gene-based pathogen identification, which you will use directly in the pathogen tracking, without having to group the results into collections by hand yourself. So, thank you so much for listening.
If you have any questions, feel free to write them here, or to ask me personally. Also feel free to read all the references, and if you are interested in any of the tools we used, you can click on them and you'll find a direct link to their documentation or explanation; I made sure all the links are there in the training material. So thank you so much, and see you in the coming trainings.