Ladies and gentlemen! Hi everyone, I'm Yvon Le Bras from the French National Museum of Natural History, and Scientific and Technical Coordinator of the French biodiversity e-infrastructure, the Pôle National de Données de Biodiversité. Today we will follow the "Metabarcoding with OBITools" tutorial, so let's go for the show. The first thing to do is simply to go to Galaxy Ecology through the URL ecology.usegalaxy.eu, and to make sure you are logged in, with the user menu there. You are not obliged to be logged in to use Galaxy tools, but there can be particular restrictions depending on the Galaxy instance you are using, for example on sharing or reusing workflows, or on the possibility to create more than one history. So it's better if you create an account, and often you just have to use an existing federated identification system like ORCID, for example. The Galaxy interface is composed of four parts. There is the header banner here, where you can go from workflows to data and to visualized items; you can also log in from there. On the left, you can find all the tools available in this Galaxy Ecology instance, which is a subdomain of usegalaxy.eu, the European Galaxy instance, so you will find all these tools on the main usegalaxy.eu instance too. You can search for tools by typing terms, for example "obitools"; I don't know if it's a relevant one... yes, this is a relevant one, and you have all the OBITools listed there. On the right, you have your current history; here I have a test history, with some tools we ran with colleagues from the Galaxy Climate initiative, around xarray for example, and the analysis of GIS data. You can create a new history, something like that, and you see that you start with one history and that you can have several ones. After that, each time you apply a tool to an existing dataset, you create a new dataset or new datasets in this history, so you can track all the tools you are using, their parameters, input data files, and things like that. Finally, in the center here, you can see that I am just viewing a particular file, a GeoJSON file, through OpenLayers; in fact this part serves to look at the content of files, like text files, or to display the tool form, where you can choose the data file you want to work on, set some parameters, and execute the tool. What I have to add is that you can expand this central part by reducing the toolbox or the history panel, for example like that: the history panel, for example, with the arrows in the lower corners. To start training, you can click on the little graduation hat in the top banner of this amazing usegalaxy.eu instance (I did not make the right selection, I guess... this one, okay, sorry for that) and access the trainings. You can simply open the training.galaxyproject.org training material website on another page if you want, but it's really easier if you stay on the same screen, just looking at the tutorial; when you have to go to your Galaxy instance, you just click on the screen, then go back by clicking on the hat, and you will follow your tutorial like that. I think it's the best manner to do it; it's really convenient, I think. So now we are ready to start, so let's go for the show. We can just click on this hat, go to the Ecology section, and then follow the "Metabarcoding/environmental DNA with OBITools" hands-on tutorial.
So we will do this tutorial: we can look at the entire tutorial and take a quick look at the overview. Here the question is how to analyze DNA metabarcoding, and also environmental DNA, data produced on Illumina sequencers, using OBITools. We will learn to use these tools to deal with paired-end data, to create consensus sequences, and to clean, filter, and analyze data to obtain some interesting results. Here we are searching for the diet of some wolves: these wolf scat data are available, in fact, on the official OBITools tutorial website, and with this dataset we will try to establish the carnivore diet of four wolves. To do our wonderful analysis, we will use a workflow made of six OBITools and other classical Galaxy genomics tools. This workflow is dedicated to ecological analysis, so we can obtain a list of species from an environmental sample. The details of each tool are written in the text just below, and in this video we will go through these details further along the tutorial, to avoid too many repetitions. So let's dig into this hands-on. We can start by creating a new history: maybe you already have a fresh history there, or you just click the plus button and rename your history, for example "obitools tutorial"; then you can, for example, check the tutorial one more time, and we can go further. You can see that you can add tags or annotations on your history if you want. Now the idea is to upload data, to start working with it. There is a brief introduction about OBITools and OBITools in Galaxy in this tutorial, and I invite you to look at it more deeply when you read the tutorial. We have the datasets on Zenodo; to facilitate your work, we created a zip archive, which you can fetch through the Paste/Fetch data upload there. Just paste the link and click Start, and Galaxy will go fetch the remotely accessible archive. Now we just have to wait while the data is imported into your Galaxy history: it's yellow now, which means that the archive is being sent to your Galaxy history from Zenodo. I won't detail it; there is more to say around data importation, but if you want more information I invite you to go to other hands-on tutorials, notably the introductory ones like "Galaxy 101" or "Galaxy 101 for everyone". So we have our data in our history, in one archive containing all the datasets needed for the tutorial. Managing a zip archive is not the easiest way to work in Galaxy, but here we thought it better to have such a compressed file, so we don't take too much storage on Zenodo notably, and we don't use too much bandwidth for data movements; this is just to take care of the carbon footprint. Also, in real life, when you have such an archive to deal with, downloading the archive to your computer and then uploading it to Galaxy is really less efficient than using the Paste/Fetch option, with which you can directly upload data from a web service to Galaxy. So now we unzip the content into the history, and after that we can use each individual file independently. To do so, we search for the Unzip Galaxy tool in the tool panel, select the zip archive, choose to extract all files, and execute. Once the execution is done, if the step is successful the resulting datasets are colored green, and if something went wrong the color is red.
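By the way, if you prefer the command line for this kind of fetch-and-extract step, it boils down to two shell commands. A minimal sketch; the archive URL below is a placeholder, not the real Zenodo record, which you will find in the tutorial:

    # Fetch the tutorial archive from Zenodo (placeholder URL, see the tutorial for the real link)
    wget https://zenodo.org/record/XXXXXXX/files/wolf_tutorial.zip
    # Extract the individual datasets (forward/reverse FASTQ files, NGSfilter table, reference database)
    unzip wolf_tutorial.zip -d wolf_tutorial/
    ls wolf_tutorial/

In Galaxy, the Paste/Fetch upload plus the Unzip tool do the same job server-side, without the data transiting through your computer.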
So here it's green, and clicking on this data collection we can see the individual datasets coming from the original archive. Okay. Groovy, baby. Data collections are a really wonderful functionality of Galaxy to deal with bunches of datasets, notably huge amounts of sequence files, moreover when you have paired datasets. Here we have heterogeneous datasets, as you can see there, so we don't really want to consider these files as a homogeneous group of files. We can instead give access to each dataset independently for further tool executions, and to do so we need to unhide the individual datasets of the collection in the history. Why do we propose to act like this (and excuse my French, sorry for that)? It's because, when a data collection is created in the Galaxy history, each dataset of the collection is in fact an individual history dataset, automatically hidden from the history panel and accessible from the data collection view. Important point: if some datasets or pieces of information seem not to be updated in the history panel, don't hesitate to refresh the history panel; we are on a user-friendly graphical interface, and this can have some drawbacks. So once the execution is done, you have new information at the top of your history panel: you have seven datasets in your history. You can click on "hidden" and unhide the datasets. A last data preprocessing step may be needed here if wolf_diet_ngsfilter is recognized as a simple text file rather than tabular. Here it's tabular, so you don't need to modify it; but if it's plain text in your case, you have to manually change the datatype and specify that this is not just a text file but in fact a tabular one, by acting like this... but yeah, it's already tabular. And there are two sequencing files in FASTQ format, wolf_R and wolf_F, corresponding to the forward and reverse sequences of a 12S mitochondrial locus. So here we have overlapping forward and reverse reads, and the first step is to do the paired-end assembly of these sequences, using the OBITools illuminapairedend tool; sequence records corresponding to the same read pair must be in the same order in the two files. So, first of all, I search for the OBITools illuminapairedend tool. Here we assume the two files have the same order of sequences. We specify the minimum alignment score for keeping alignments, here 40, and we execute it. If the alignment score is below this defined score of 40, the forward and reverse reads are not aligned but concatenated, and the value of the mode attribute in the sequence header is set to "joined" instead of "alignment". We can check the result by looking at the value of the mode attribute of the first sequences; we just have to wait until the execution is done. So, we can look at the resulting dataset and at the attribute there: we see we have "alignment" statements. Yeah, so apparently illuminapairedend aligned the forward and reverse sequences successfully, creating the assembly of the two reads with the minimum score of 40, so it's cool. This one too, this one too... so it seems we didn't mix up our files, which is quite cool. Okay. Now we can continue: the next step, in fact, is to use the obigrep tool to discard the sequences flagged as "joined". Where is this step... yeah, this one. So we can use the obigrep tool, taking the sequences.
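Before we look at the obigrep form, here is, for reference, what this assembly step corresponds to with the OBITools command-line suite that the Galaxy tool wraps. A minimal sketch, assuming the file names from the tutorial archive:

    # Align each forward/reverse read pair; pairs whose alignment score is
    # below 40 are concatenated instead and flagged with mode=joined
    illuminapairedend --score-min=40 -r wolf_R.fastq wolf_F.fastq > wolf.fastq
    # Inspect the mode attribute on the first record
    head -4 wolf.fastq

The --score-min threshold is the same "minimum score" parameter we just set in the Galaxy form.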
But what happened when I opened the obigrep tool form? I can't select the illuminapairedend assembled sequences as the input sequences file. Do you know why? In fact, it's because we need to take care about the FASTQ format, notably the way the quality scores are encoded. To do so, we can use FastQC, for example, a classical tool you often have to use at the starting point of your genomics workflows, and execute it on the result of illuminapairedend with the common parameters. FastQC will give us information about the quality of our sequences, and also tell us how the quality scores are encoded; then FASTQ Groomer can help you (you and me, and me, and you) to convert the format into fastqsanger. So FastQC is finished: if you look at the FastQC web page, you will find some information, like the encoding information, and we are in Illumina 1.5 score encoding. So now we can use FASTQ Groomer, a classical tool to convert between various quality score encodings: it's Illumina 1.5, so from this one towards fastqsanger, and we execute FASTQ Groomer. Okay, so the result looks okay, and now if you are looking at OBITools, and particularly at the obigrep tool, you will find the good file here. So now we use obigrep, and what is the thing we have to do with obigrep? The input is the illuminapairedend then FASTQ Groomer file, okay, and we have to specify the predicate mode not equal to "joined": we will keep only the sequences where the mode attribute is not "joined". Now we can look at the resulting file; one manner to look at it is to look at its content, but as this is a big file, we can only see its first megabyte, clicking on the eye icon. Okay, I just search rapidly for the "joined" term in this obigrep result, and if I look at the input dataset, you see that, just within the first megabyte of the file, there were some mode=joined records. So just like this, we can see that apparently we effectively filtered the dataset: we now have a file smaller than the input file, and we don't see any mode=joined searching in the file, at least in the first megabyte. You can also use a Galaxy tool like "Select lines that match an expression": you can select lines of both the input and output files matching an expression, just to see. This is a manner to verify that something was effectively filtered during the obigrep step, besides looking at the first megabyte of the file or at the difference in sizes. Searching for lines that match the expression "joined", you can see that we have 186 sequences with mode=joined in the first file, and no sequence with mode=joined in the obigrep file. If you want to know more about the number of sequences that have been filtered, you can also use a Galaxy tool like "Line/Word/Character count of a dataset" to count the number of lines of each dataset; as these are FASTQ files, you can divide by four and obtain the resulting number of sequences in both the input and output files of the obigrep step. This is to answer other questions there are in the tutorial. Now we can assign each merged sequence record to the corresponding sample/marker combination; to do so we use the NGSfilter tool, providing marker and sample information from a tabular file, which is here called the parameter file.
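On the command line, this filtering and counting would look like the following minimal sketch (file names carried over from the previous sketch):

    # Keep only the read pairs that were truly aligned (mode != joined)
    obigrep -p 'mode!="joined"' wolf.fastq > wolf.ali.fastq
    # Count occurrences of the joined flag before and after filtering
    grep -c 'mode=joined' wolf.fastq wolf.ali.fastq
    # Number of sequences: each FASTQ record is four lines
    echo $(( $(wc -l < wolf.ali.fastq) / 4 ))

The -p option of obigrep takes a Python-like predicate evaluated on each sequence record, exactly like the predicate field of the Galaxy form.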
Here you need to pay attention to the datatype: the tool is asking for a tabular file. For example, we may need to change the datatype of the wolf_diet_ngsfilter dataset if it's recognized as a raw text file, so we can first look at it. Okay, it's tabular, so we don't have to change the datatype; but if you have to, you can act here and assign the tabular datatype manually. Pay attention to only use this manual modification of the datatype if you are sure that the dataset is in this format, because Galaxy will not look at the content of the file: it just assumes you know what you are doing, in fact. So you can use NGSfilter, normally found among the OBITools tools. The parameter file is wolf_diet_ngsfilter, and the reads are the obigrep output. Considering the number of errors you want to allow, you can allow two errors for matching primers, and specify fastq as the output datatype. Like that; you also want to generate a file with only the unidentified sequences. Yes. So we see that we have 1,384 sequences not assigned out of the original 44,717 sequences. The same DNA molecule can be sequenced several times, so it is convenient to work with unique sequences, in order to reduce both file size and computation time: this is the purpose of the obiuniq tool. obiuniq works through three steps: it compares the reads of a dataset to each other, groups strictly identical reads together, and outputs the sequence of each group with its count in the original dataset; in this way, all duplicated reads are removed. obiuniq is used here on the trimmed and annotated file obtained by NGSfilter, with "sample" as the attribute to merge, used to keep the information on the sample of origin for each unique sequence, specifying the merge-specific option. So, "sample" and "merge" for obiuniq; "merge", I don't know if in English this is the same as in French. So I take the NGSfilter output, attribute to merge: we say "sample", use the specific option "merge", and execute. Oh, cool: while waiting for the job to be executed, we can play with a citizen science project here, trying to recognize whether this animal is a male or a female by looking at its eyes. It's not a beautiful picture there, so it's quite complicated. What do you think about that? Easy? My screen is too small, sorry for that. Oh, yes, this is a female... I feel it's just because the picture is not beautiful. Yeah, clearly a female. I think it's a female, but I'm not sure. Clearly... oh, I clearly cannot see. But wait: yeah, I can see the obiuniq results, with 4,308 sequences. So here we clearly reduced the number of sequences, and we see that obiuniq added key=value entries in the headers of the sequences. The first one is called merged_sample, and another one is count. We have to look at it: this is the manner OBITools writes a lot of information attributes in the name of each sequence. Yeah, so here you see count=1, and merged_sample... I don't see it, but I think it is written somewhere... ah, I have it, here. Okay. So now we can use the obiannotate tool, specifying to keep only this information: we don't want too much information for each sequence, only count and merged_sample. So we have obiannotate; we will specify the obiuniq output as input and the attributes we want to keep. Only... okay, so obiannotate is supposed to take this input, with the "keep only attribute with key" option, which we can add. Okay. Okay, sorry.
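Again for reference, here is a minimal command-line sketch of these two steps with the OBITools programs the Galaxy tools wrap (the primer-error tolerance we set is exposed by the Galaxy form):

    # Demultiplex: assign each read to its sample and marker using the tag
    # table, writing the unassignable reads to a separate file
    ngsfilter -t wolf_diet_ngsfilter.txt -u unidentified.fastq wolf.ali.fastq > wolf.ali.assigned.fastq
    # Dereplicate strictly identical reads; -m sample keeps per-sample
    # counts in a merged_sample attribute, plus a global count attribute
    obiuniq -m sample wolf.ali.assigned.fastq > wolf.ali.assigned.uniq.fasta

Note that obiuniq outputs FASTA, since per-read qualities no longer make sense after dereplication.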
"Keep only attribute with key": the first key I am saying is merged_sample, then count. Okay, count and merged_sample. Okay, but for something like this I prefer to copy-paste, to be sure. And execute. Okay. So, looking at the resulting file, we see that now we only have the merged_sample and count information, so it's easier to read, isn't it? The fact that this set of sequences is assigned to their corresponding samples does not mean that all sequences are biologically meaningful: some of these sequences can contain PCR and sequencing errors, or chimeras. To remove such sequences as much as possible, we first discard rare sequences, those sequence variants that likely correspond to artifacts. In that case, we first use obistat to get counting statistics on the count attribute, specifying the category attribute key. So we have to look at obistat: statistics computed simply on a key of an attribute, count if I am not wrong, and no specific option. Alright, it's okay. So here we can see we have one sequence with more than 10,000 counts, and the different partitions. Okay. Then we want to keep only the sequences having a minimum count and length. As in previous studies, we set the cutoff for keeping sequences for further analysis to a count of 10, and we also remove sequences (oh, sequences... wow, beautiful), we also remove sequences with a length shorter than 80 base pairs, as we know in fact that the amplified 12S V5 barcode reads must have a length around 100 base pairs. To do so we use obigrep (obigrep, sorry), a first time specifying count greater than or equal to 10, and then a second time specifying 80 for the lmin option. Okay. So we take obigrep (no, not this one, obigrep), with the obiannotate output and a predicate... and here: count >= 10, which I copy-paste. Execute. And then I will execute another one, with a predicate... I mean, a minimum length of 80 base pairs, on the result of... oh, but it's not executed yet, so I have to wait. Okay, so now the first obigrep is green, super green, and I can execute this new obigrep on the result of the first obigrep execution. Okay. So we see that with the count filter (I can add a tag, for example) we keep only 172 sequences out of the more than 4,000 input sequences, and the second one removes only three sequences; three more sequences. Okay. So finally we want to clean our data a little bit, notably for PCR and sequencing errors. To do so, we use obiclean, specifying that we want to keep sequences with no variant having a count greater than 5% of their own count. So with these parameters: threshold ratio 5%; attribute containing the sample definition, which here is merged_sample; and we want to keep only sequences with "head" status in at least one sample. Okay, let's go for obiclean. So here we see that, with the cleaning step, we now have 26 sequences. So once this denoising has been done, the next step in diet analysis is, of course, to assign the barcodes to the corresponding species, in order to get the complete list of species associated with each sample. Taxonomic assignment of sequences requires a reference database compiling all possible species to be identified in the sample; assignment is then done by sequence comparison between sample sequences and reference sequences. We propose here to use BLAST+, blastn. To do so, we search for the blastn tool, to specify the obiclean output as the input dataset.
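Before configuring BLAST, here is the command-line counterpart of the denoising chain we just ran, as a minimal sketch (file names carried over from the sketches above):

    # Keep only the count and merged_sample attributes in the headers
    obiannotate -k count -k merged_sample wolf.ali.assigned.uniq.fasta > wolf.ann.fasta
    # Distribution of the count attribute, to choose a sensible cutoff
    obistat -c count wolf.ann.fasta | sort -nk1 | head -20
    # Discard rare variants (count < 10), then short ones (< 80 bp)
    obigrep -p 'count>=10' wolf.ann.fasta > wolf.c10.fasta
    obigrep -l 80 wolf.c10.fasta > wolf.c10.l80.fasta
    # Denoise PCR/sequencing errors: keep 'head' sequences only (-H), with
    # variants tolerated up to 5% of a sequence's own count (-r 0.05)
    obiclean -s merged_sample -r 0.05 -H wolf.c10.l80.fasta > wolf.clean.fasta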
Careful, this first one is tblastn, which is meant for protein queries. So blastn; yeah, blastn, sorry. So: the nucleotide query sequence, this is the obiclean output. For the reference sequences, we don't use a locally installed database, but a FASTA file from our history: db_v05_r117. We will run megablast, to find very, very similar sequences. Yes. The e-value cutoff can be, for example, a little bit smaller than the default, and in the advanced options we can specify we only want to keep one hit. Okay. So here is the result of the BLAST: we see that blastn found 26 reference sequences really highly similar to our 26 corresponding query sequences. Cool, so everything is fine: this means that for each of our sequences we have a hit with a quite low e-value. We can verify we really have all our 26 sequences using a tool such as "Unique occurrences of each record", applied to the first column. This is a tool we often use, and... I don't know how to search for it... "unique"... yes, "Unique occurrences of each record". I don't know why "occurrences" does not match the tool, anyway; this can happen sometimes, you can have some difficulties finding the tools even if the word you are searching for is in the description, so don't hesitate to go deeper in the search, or go directly to the tool sections in the tool panel. So you can look: we only have 26 lines here, but if you have more lines, it can be convenient to use it, just to know if there is no redundancy in the first column. Okay, so let's verify we have 26 unique sequences... sorted, that's it. Okay, so just to verify it. To know the meaning of each column of the output file, you can refer to the help of the megablast task of the blastn tool. So here, for example, we have this kind of column: the third one is the percentage of identical matches, for example, the fifth the number of mismatches, the sixth the number of gap openings, and then column 11 the expectation value, the e-value. These are common columns you can use to filter and get more information on whether your sequences really match the reference perfectly or not. We now want to re-associate the reference sequence information, notably the species name, so you can see which species are potentially seen in the sample. To do so, we will use a tool called "Filter sequences by ID", with the second and the first columns of the megablast-on-obiclean output file, and we only want positive matches. Okay. So, this one, and after, we will make the other. So you can start by filtering sequences by ID, from the db_v05_r117 reference file and from the obiclean output. I will maybe do it like in the tutorial, starting with db_v05_r117 and the megablast output: "Filter sequences by ID", the db_v05_r117 sequence file, the megablast tabular file, and column 2, because column 2 holds the names of the reference sequences we are searching for; and just positive matches. Okay. And we want the same, but with obiclean. Yeah, obiclean. Maybe I'm wrong, but now this is to search the names of the sequences of the query file, so (blah, blah, blah) yeah, it's column 1 of the megablast file; just positive matches. Okay. So we just searched for the corresponding sequences from here and from there, and took all the information from here and from there. Okay, that's it; we have the results for the reference database file.
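A sketch of this assignment step on the command line, with illustrative cutoffs; the exact e-value and the one-hit limit are choices, not fixed values from the tutorial:

    # Build a BLAST database from the 12S reference FASTA, then run
    # megablast on the 26 cleaned sequences; -outfmt 6 gives the tabular
    # output (qseqid sseqid pident ... evalue bitscore) discussed above
    makeblastdb -in db_v05_r117.fasta -dbtype nucl
    blastn -task megablast -query wolf.clean.fasta -db db_v05_r117.fasta \
           -outfmt 6 -evalue 1e-20 -max_target_seqs 1 -out megablast.tab
    # Check that the query IDs (column 1) are indeed all unique
    cut -f1 megablast.tab | sort -u | wc -l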
Okay, so we see that we have the species names here, and here we have the queries. Okay. Do you know why we only have 15 sequences there? I suppose this is because we have redundancy: here and here, for example, it's the same reference sequence. So I caught it with my wonderful eagle eye, but you can also use something like "Unique occurrences" on the column, like that: you specify you want to look at unique occurrences on the second column, and you will see that normally you will find 15 sequences, because among the 26 query sequences, some are matching the same references. Okay. Yeah, 15 lines. So now, what we have to do is to convert these two FASTA files into tabular format, so we can more easily deal with the information in the files and create something like a summary table of results. To do so, there is a specific OBITools program called obitab; you could also use generic Galaxy conversion tools. So I think we are done like this, and now we can create a final synthesis tabular file joining these files. Okay, we have two tabular files, right; wow, a lot of information. So now we want to join these files, and we can use the "Join two Datasets side by side on a specified field" Galaxy tool. First we join the query obitab file (this is the query one, with its 26 sequences) to the megablast result. Which field can you use? You can use column 1: the obitab-on-obiclean column 1 and the megablast column 1. This is it. Fill empty columns: yes, with a single fill value, NA, and fill all columns with this single value. Okay: this is because we have a header on our file, with the id, definition, count and so on, on the obitab file, and we don't have one on the megablast results; choosing this normally will help you and me (and me, and you) to conserve the header of this obitab file. Okay, so here I think I made a mistake, because if I look at it, yeah, I have no header; I think I forgot to specify that I want to keep the header line. Yeah, now we have the header here for the obitab-on-obiclean results. Okay, and here it duplicated the first line; okay, why not. Then we join the resulting file to the reference obitab table: this column of the resulting file is apparently the ID of the reference sequences, and on the reference table it's column 1. So now we have this file presenting the join of the three files, and we normally have all the information we need from the reference sequences and the query sequences. I can deactivate the Scratchbook. And now, to have something easier to read and understand, we can create a final tabular file containing only the columns with important information. To do so, we can use the "Cut columns from a table" tool. Pay attention: there are different tools called "Cut columns from a table". There is an advanced cut one, but there you can't choose the order of the columns you want to get; you see that the ordering follows the increasing column numbers. And there is this "Cut columns from a table" where you can specify each column you want to keep, so you can propose for example c1, c3, then c2, or something like that: you can rearrange the columns if you want. Here I will propose something like this, and I hope this is a good one. So, cut columns: c1 is the query sequence ID, okay; then c3 to c7 is the information on the query counts, okay.
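While we finish this column selection (it continues just below), here is what these table-building steps look like with obitab and standard Unix tools; the column numbers are illustrative, not the exact ones from my history:

    # Convert the two filtered FASTA files (query side, reference side)
    obitab query_filtered.fasta > query.tab
    obitab refdb_filtered.fasta > refdb.tab
    # Join the query table with the BLAST hits on the sequence ID; the
    # Unix join command needs both inputs sorted on the join field
    join -t $'\t' <(sort -k1,1 query.tab) <(sort -k1,1 megablast.tab) > joined.tab
    # Keep and reorder only the interesting columns (numbers illustrative)
    cut -f1,3-7 joined.tab > summary.tab

The Galaxy join tool does the sorting and the NA filling for you, which is why we didn't have to think about it.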
And after that, normally, the following columns hold the reference sequence information: family, genus, and the reference annotation. I will see if it works, or if I made a mistake... So we see the information: id, family, id, genus, name, and definition, okay. So that's it: now we have all the information, only the important information. And we see that this final tabular file can be filtered, for example looking only at samples with a large count, for example greater than 1,000, on this column. So, to filter the data, I can search for this amazing tool I often use: "Filter data on any column using simple expressions", searching with "if" conditions. We could work on the total count, but maybe the best here is to filter depending on the sample, okay, on columns 3, 4, 5 or 6: c3 for the first sample, and the same for c4, c5, c6. So we only want to keep a line if there are more than 1,000 sequence counts on the line for one of the four samples. Yes, there is a header, okay. Oh, only four lines! So we removed all the sequences where there is no substantial count. Now we can say that for this wolf we find this diet type; for this one... this one is the same; and this one is this. And to know what this wolf ate, we can look here: oh, Marmota! I think it can be really good with some cheese or something like that. So this sample, this wolf, it's a marmot. This one, some Cervidae: Cervus elaphus, if I'm not mistaken. Okay, beautiful animals, a deer. The other one was Capreolus capreolus, the roe deer, if I am not making a mistake. So here you just did an ecological analysis, finding the diet from four wolf feces; so, not "faces" but "feces". So now you know how to preprocess metabarcoding data in Galaxy, producing quantitative information with quality checks, and filtering the results to interpret them and obtain a summary table you can share broadly. Not that easy to get immediately, so I hope I didn't bore you too much with technicalities, and that you enjoyed this training with me. Don't hesitate to ask questions on the event chat or by email at this address, yvon.le-bras@mnhn.fr, and if you have any suggestions on how to make this training better, don't hesitate to contact me as well, or, better, to propose them on the Galaxy Training Material: you can for example go to the hands-on, and you see that you have in the menu the possibility to propose edits to this tutorial directly on GitHub. A last point, related to this tutorial and all the tutorials you can follow during this event: don't hesitate to give feedback, using the form at the end of each tutorial. Thank you very much for watching, and see you soon!