Hello everyone, I'm Coline Royaux. I work as a PhD student at the French Natural History Museum, in both the BOREA research unit and the French Biodiversity Data Hub. This video tutorial is related to the written tutorial available on the Galaxy Training Network, "Compute and analyze biodiversity metrics with PAMPA toolsuite". So I'm going to share my screen with you.

The first thing to do is simply to go to Galaxy Ecology through the URL ecology.usegalaxy.eu, so I'm going to type it. Make sure you are logged in with the user menu just here. So I'm logged in, that's okay. You are obviously not obliged to be logged in to use Galaxy tools, but you will have some restrictions and your data won't be saved. So it's better if you create an account and log in for this tutorial.

The Galaxy interface is composed of four parts: the head banner, where you can log in, find data and workflows and so on. Then on the left, you can find all the tools available on the instance, separated into several sections. You can search keywords as well to find tools quicker; for example, if you want to manipulate FASTA files, you'll get all the tools containing FASTA in their names or descriptions. Then on the right, you have your current history, containing all your data and every output of every tool you use. And finally, the center of the screen, where you currently have the welcome page, is also where you get the interfaces of the tools, the visualisation of the data, etc. You can expand this part by reducing the history panel and the tool panel on the sides, using the arrows in the bottom corners.

Then, to start the training, you just have to click on the little graduation cap on the top banner to access the training; then you are on the welcome page of the Galaxy Training. You can also open the Galaxy Training in another page just by clicking here and opening it in another tab, but it's really, really handy to do it this way, so I think it's better here. Now we can open the Ecology section, and our tutorial is the first one here, "Compute and analyse biodiversity metrics with PAMPA toolsuite". You can open the hands-on just by clicking on this computer logo. And here it is, you have your training ready.

So let's take a quick look at the tutorial's overview. In this tutorial, we'll learn how to properly evaluate the biological state of populations and communities from abundance data, through computing and analysing biodiversity metrics on Galaxy. We'll use trawl survey data that are available on the DATRAS portal, carried by the International Council for the Exploration of the Sea. We'll try to assess how exploited populations of fishes are doing over time in the Baltic Sea, the Southern Atlantic and Scotland. As we use raw data from the data portal, we'll start with pre-processing to make the data fit the proper format and avoid mistakes in the analysis. Then we'll compute metrics at population and community levels and analyse them through a common statistical model called the generalised linear model, also known as GLM, to answer a classical ecological question, and finally generate interesting visualisations from it.

To do our wonderful analysis, we'll use a workflow called PAMPA, which is made of five tools. This workflow is dedicated to ecological analysis, and it is divided depending on the biodiversity level you want to study, between community level and population level. To clarify these terms, you can expand this frame here, which contains the definitions of population and community.
A population is a group of individuals of the same species that are interacting with each other, whereas a community is a group of several species, so several populations, interacting with each other. Here on this first figure you can see the organisation of the workflow graphically: each part of the workflow starts with the computation of metrics, then the modelling part, and finally the graphical representations. The details on each tool are written in the text just below. For this video, we'll go through the details while we're using each tool, to avoid too many repetitions in this recording. There are also some technical details about the models we'll use in this tutorial, if you expand this frame; here are the technicalities on the statistical models. I'll explain all this quickly when we go through that part of the tutorial.

So let's dig into the hands-on. We can now start by creating a new history; here is the step we are currently doing. You can quit this pop-up just by clicking anywhere around it on the grey area. I already created a new history myself, but if you have a history already in use, you just have to click on the plus cross at the top right of the history panel to create a new one. Then you can simply rename your history: just click on the name of the history, "Unnamed history". Name it as you want, no obligations here, it's just for you to find it easily later. I'll name it as proposed in the example here, so I just copy-paste this part of the tutorial. I opened the tutorial back just by clicking on the logo and I got back to exactly the same place, so this is a really, really useful feature. So I just copy the name from the tutorial and paste it right here; I have my wonderful name. You can also add tags and annotations to your history, to find it easily or to add background on your analysis, anything you think is relevant to add here.

Now, to import your trawl survey data into this history, start by retrieving the Zenodo links provided in the training. So you go back to the training; here you have the links provided. There are other ways to retrieve the data, explained just below, but we'll do it the easy way for this video. You can either click on "Upload Data" at the top of the tool panel, just under the search bar, or click on "load your own data", which appears in every new history on Galaxy. A window just popped up, so you can click on "Paste/Fetch data", and you get an entry you can write in. Just paste your three Zenodo links in the frame that appeared and click "Start". Here you have your three datasets appearing. First they are grey, as they are waiting to be loaded onto Galaxy; then they'll probably become yellowish, which means they're loading; and hopefully they become green, meaning the upload worked. Yes, a yellowish one here. If it goes okay, it's green, but if there is any error, it will become red. Don't panic if anything goes red, it happens a lot. So don't worry, just try again, or try another way to upload your data among the ones explained in the tutorial; no panic here. If you want more information on how to upload data in other ways, you can go to the tutorial "Galaxy 101 for everyone". So I'll just quit this one.
Go back to the welcome page, then to the "Introduction to Galaxy Analyses" topic, and you can find many, many useful tutorials to learn how to use Galaxy. I advise you to go through them if you want to see other ways to upload data and the other options there are in Galaxy as well.

So we're waiting for our data to be downloaded; it can take a while. If it seems to take a really long time, it may be better to refresh your history. And there it is: it was loaded, but the interface wasn't following on this one, so just refreshing the history works here. Now we have our data properly loaded, but we have to check it, for example by checking the headers, which must be Survey, Year, Quarter, Area, AphiaID, Species, LngtClass (length class) and CPUE_number_per_hour. We'll check the other files too. Sorry, I didn't say it, but you can just click once on each dataset and you'll have more information and an overview of it; it's a really handy way to see if everything looks alright. Let's look at the other one: everything looks okay. Here as well. The format is CSV each time, which is what we need.

We'll also check that the separators are commas. To do so, you'll have to click on "Visualize this data", the chart logo you have here. There are many visualisations, but I like to use the editor to see the separators, in case they are semicolons, for example. Our separators are right on this one. Let's check the others; I have many visualisations here to do bar charts and everything, and you can search for the editor by typing just the first letters of the word. So you go to the editor and you see everything looks right here. Also the last one: that's okay, so our data is properly formatted, everything is right.

You can see the three files have been retrieved from surveys with different methodologies, so we have to keep our analyses of the three datasets separate, as said in the tutorial. I'll have to go back there, sorry. And there we are. So, as said in the tutorial, the first file corresponds to the EVHOE survey, a French survey of Southern Atlantic bottom trawls from 1997 to 2018. We'll just add tags to each file to make sure we don't mix them up. To add tags, you just have to click on this logo here, then here, and write a hashtag plus the tag you want to add; it has to be one word. So here we have #evhoe. Then the second file corresponds to SWC-IBTS, a Scottish survey of Scotland West Coast bottom trawls from 1985 to 2010. And the last file is the BITS survey, an international survey of Baltic Sea trawls from 1991 to 2020. I added a hashtag each time, because I want the tags to propagate through the history; if you don't want that, you can just add a tag without a hashtag. For this particular case, I advise you to use hashtags, as a lot of datasets are going to be generated, and things will get mixed up pretty quickly.

In Galaxy, it's best to have your datasets in tabular format to avoid any problems. You can convert your datatypes by clicking on the pencil here, at the top right of the dataset box. Then you click on "Convert", and here you can choose a target datatype; there is only tabular. Then you click on "Create dataset", and a new box appears in your history, with the tag that propagated from our data file.
It's waiting to be launched here: it's grey, then it will become yellowish and hopefully green finally. Now we have to convert the two other datasets. You can do exactly the same manipulation, but you can also click on the newly converted dataset that just appeared to expand it, and click on the round arrow, which means "run this job again". Here, if you click on this icon to select multiple datasets, you can choose your two other datasets by holding your click and sliding down, so that in one execution of the tool you get the two others converted. It's quicker, and it's another way to run tools on Galaxy, so a good thing to learn. Now we just have to wait for them to be converted. You may be a little concerned that the tags don't appear here, but you just have to refresh the history, and they will appear again on the newly converted data.

So now that the data is properly uploaded into your history, we have to prepare it to make sure the workflow goes well. Let's go to the tools we want to use and look at their help sections. The tools we'll be using are in the "Species abundance" section. Here we have "Compute GLM on population data" and "Compute GLM on community data"; here we have "Calculate community metrics" and "Calculate presence absence table"; and finally "Create a plot from GLM data". As we are going to use the metrics calculation tools first, I'll just go to "Calculate community metrics"; you see it opens in the center of the screen. You can also open your training, go up there, and click on the tool names highlighted in blue, and it goes right to the tool you want. It's also a very handy way to use the training in the pop-up screen.

So let's go to the help section to see what kind of input we need. We see we need a tabular file with observation data. It must contain at least three columns: a first one named "observation.unit", which may seem a bit mysterious right now. An observation unit actually identifies a unique observation at a particular geographical and time point; it's just one event of observation. You can get more information on this field by going back to the training, going to the "Prepare the data" section, and expanding the "What is the observation unit field?" and "Observation unit nomenclature" frames. All the information you need is there. But we won't bother with this column for now; we'll just use the year and location columns since, as you can see, there is the possibility to use these columns instead. Then we need a "species.code" column, which identifies the observed species, and a "number" column for the abundances of the species. Let's check the other tool, "Calculate presence absence table", to see if there is any difference, and we see there isn't.

So let's start the preparation of the data. As said in the training, we'll start by concatenating the datasets with the tool "Concatenate datasets tail-to-head (cat)". Here we just have to click on this and select the three datasets: with the multiple datasets option, you can either click and slide, or hold your Ctrl key and click on each of your datasets. Then you just execute the tool. The new dataset will normally have the three data files reunited into one, as confirmed here by the three tags attached to the new dataset. We'll just wait for it to be done. Don't hesitate to take breaks during the computation of tools; it can sometimes take a long time, so don't hesitate to do so.
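While we wait: if you'd like to double-check the files and do the conversion outside Galaxy, the same verification can be sketched in a few lines of Python with pandas. This is only an illustration; the file names are placeholders for the three downloaded CSVs, not names from the tutorial.

```python
import pandas as pd

# Placeholder names for the three DATRAS extracts downloaded from Zenodo
files = ["evhoe.csv", "swc_ibts.csv", "bits.csv"]

expected = ["Survey", "Year", "Quarter", "Area", "AphiaID",
            "Species", "LngtClass", "CPUE_number_per_hour"]

for name in files:
    df = pd.read_csv(name)  # comma-separated, as we checked in the editor
    assert list(df.columns) == expected, f"{name}: unexpected headers"
    # Same effect as Galaxy's datatype conversion: rewrite with tab separators
    df.to_csv(name.replace(".csv", ".tsv"), sep="\t", index=False)
```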
So now our new data file is ready, and you can have a preview of it. It seems to have the same format as the previous ones, just with many, many more lines than we had in the first ones, as you can check here. So that's a good thing. We can see the three tags here; the best is to suppress these tags and just add a new one, #concatenated, to identify this concatenated file and not mix it up with the other ones again.

Now, as we did a naive concatenation, there is a good chance we have the three header lines of the three original data files in our concatenated file. We'll check it by using the tool "Count occurrences of each record", as said in the tutorial. Just click on the highlighted tool, select your last generated dataset (it must be already selected here), and then choose the first column, as the header will be detectable in any column. Your file is delimited by tabulations, as we converted it just before, and we want to sort the results with the most common values first. So let's execute the tool.

Now our count file is created; we just have to see if we were right and view the data by clicking on the eye icon here. And indeed, there are three occurrences of "Survey", the header of the first column. So we'll have to delete the two extra header lines, which would cause mistakes in our analysis if we kept them. We'll use "Filter data on any column using simple expressions" on the concatenated dataset, the seventh one in mine (not the count of course, as it is just a report we won't be using further in this training), with the following condition: c1!='Survey'. So c1 means first column, then the exclamation mark and equal sign mean "different from", and 'Survey' is our character string. Here we ask to keep only the lines that do not have "Survey" in the first column. Then we indicate there is one header line to skip, as we still want to keep one header line in our file; if we don't do so, we won't have a header line at all, and that's not what we wish for. Then we just have to execute this tool.

We can refresh the history to make the #concatenated tag appear. Here it's maybe better to remove it from the count file, as it isn't a file we'll use in the future. And we'll check again, on this last filtered file, that we have really suppressed the lines we wanted to suppress. Here it says it kept 100% of the lines, which may seem weird, but we have a lot of lines, so two removed lines out of so many can disappear in the rounding. We'll just open the Count tool, run the job again with the round arrow, change the input to the new filtered file, and do the exact same computation to see if the headers were really suppressed.

In the meantime, let's get back to our training. Here we are doing this step, checking that there is really just one header line; then we'll start formatting the data files. We'll have to look into our filtered file and see if we have material to create the year, location, species.code and number columns from our current columns, which are not in the proper format for now. So our count just finished, and we have just one "Survey" in the first column, so the extra header lines have been removed, yay.
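As an aside, the condition we typed is an ordinary Python comparison: Galaxy's Filter tool evaluates it line by line. A minimal sketch of this concatenate-then-filter step, reusing the placeholder file names from the previous snippet:

```python
# Naive concatenation, like the 'Concatenate datasets' tool:
with open("concatenated.tsv", "w") as out:
    for name in ("evhoe.tsv", "swc_ibts.tsv", "bits.tsv"):
        with open(name) as part:
            out.write(part.read())

# Equivalent of 'Filter data on any column' with c1 != 'Survey',
# skipping one header line so a single header is kept in the output:
with open("concatenated.tsv") as src, open("filtered.tsv", "w") as out:
    out.write(next(src))                     # the one header line we keep
    for line in src:
        if line.split("\t")[0] != "Survey":  # drops the two leftover headers
            out.write(line)
```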
In our new data file, the year and species columns already seem to be in the right format; we just need to change the column names into the names expected by the PAMPA workflow. For the number column, the abundance seems to be the CPUE_number_per_hour column here, so we'll just change the name of that column as well. The only particular case will be the location column. We could just use the Area column of our current data file, but there is a risk of mixing up areas from several different surveys, as this concatenated file is composed of several surveys: they could unfortunately use the same area names without us knowing it. So we'll create the location field by adding a dash at the end of the survey column, and then merging the survey and area columns together, to avoid any mix-up.

At this point of the tutorial, there is another subtlety: we'll apply these modifications not only to the concatenated file, the one I have under my eyes here (let me make the tags appear so it's clearer: the concatenated file we'll use is this one), but also to the three files with the surveys alone. We do so because, while the community analysis has the option of analysing the surveys separately from each other, this option doesn't exist for the population analysis. So we need to keep the data files separate, to avoid analysing several populations of the same species as if they were one single interacting population.

First, we'll use the tool "Column Regex Find And Replace", which is just here in the hands-on. If you haven't understood what I said here, I know I don't have the best English and maybe some things are unclear, so please just pause the video and go look into the written tutorial; maybe it will be clearer for you. Don't hesitate to do so.

So we get back to the "Column Regex Find And Replace" tool, to add the dash at the end of the survey column. To do so, we'll use regular expressions, which is a little tricky as well. I'll try to explain it right here, but there is a lot of documentation on regex online, so don't hesitate to go there. When we get here, we start by switching to the multiple datasets option, as we want to do the computation on several datasets: first the filtered file, then we hold our Ctrl key and select the three converted data files, so here we have four datasets selected. We choose to use the first column, as it is the survey column, and then we click on "Insert Check" to do a character string manipulation with a regular expression, also known as regex. I hope this won't be confusing; if you are interested, I advise you to type "cheat sheet regex" in your browser, and there is a lot of effective documentation you can find this way. I will try to explain what I'm doing clearly, but it's not that easy to get how regex works. (I just erase what I wrote in the find entry and in the replacement entry, to come back clean.) If you don't want to bother learning how regex works, I totally understand; you can just copy and paste what's proposed in the tutorial. You don't have to understand this, it's not the center of this tutorial, but I think it's still really interesting, so I'll try to explain how I constructed this regular expression.

First, what you have to do is look into the structure of the survey names we have in the column. Luckily, we did our count on this column, so we know a bit about how it is constructed; we know what everything in it is. We can see our survey names are composed of uppercase letters.
Only for SWC-IBTS there is a dash, but we can just ignore this dash and use only the uppercase part to construct our regex, because we want the dash to be added at the end of the name; the pattern doesn't have to match the whole character string. We can also see that our survey names have a minimum of four characters, with BITS and IBTS.

So we start by writing what stands for any uppercase letter in regex, which is uppercase A, dash, uppercase Z, between square brackets (I first wrote it in the replacement entry, which is really not the right place to do it, sorry). So here we have [A-Z], the expression standing for any single character that is an uppercase letter. If I want to indicate that I need three uppercase letters, I put a 3 between curly brackets just next to it: [A-Z]{3} stands for exactly three uppercase letter characters. Then, as we need to select character strings of four or more uppercase letters, I copy-paste the uppercase letter expression, paste it after the curly brackets, and write a plus sign after it, which means "one or more" of the prior expression. So [A-Z]{3}[A-Z]+ is the proper regex to point out every character string of four or more uppercase letters; it will match this one, this one and this one.

Okay, so we simply want to add a dash after the string. We have to indicate that we want to keep the original character string, by putting the whole expression between parentheses. This way, it can be called back in the replacement entry with the backslash-one expression, \1, which means "the first expression between parentheses in the find entry", and we just add a dash after it. So it will add a dash in our four datasets. We'll just take a look to check that everything seems alright: it's the right four datasets, the first column, the survey column, and our regex is good. You can check back in the training if you want to be sure, and then you can execute it.

So here we have the four new data files appearing, which is a first good point. We'll just wait for the tool to add the dash on every line. Here we have one done, everything is right, so we can check, on the first column, whether our replacement worked, by viewing the data with the eye icon here. Here is the wonderful dash. Here it is again. Wonderful. And finally, yes. So it worked just fine.
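If you want to convince yourself the pattern does what I described, you can try it in any language with regular expressions. Here is the check in Python, where re.sub behaves like one Check of the Column Regex Find And Replace tool:

```python
import re

# [A-Z]  : any single uppercase letter
# {3}    : exactly three of them
# [A-Z]+ : one or more further uppercase letters
# (...)  : capture the whole match so \1 can call it back
pattern = r"([A-Z]{3}[A-Z]+)"

for survey in ("EVHOE", "BITS", "SWC-IBTS"):
    print(re.sub(pattern, r"\1-", survey))
# -> EVHOE-  BITS-  SWC-IBTS-
# ('SWC' alone is only three letters, so in SWC-IBTS the dash lands after IBTS)
```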
Now we have to merge the Survey and Area columns, simply using, as said in the tutorial, "Merge Columns together". Just refresh the history again for the tags to appear, so it's easier to see. You'll have to select the multiple datasets option as well; again, just click and slide on your four datasets, and select to merge column one with column four. It's really handy to have a preview here: as you can see, we can check each time that we selected the right columns. So that's amazing; everything seems right here, first column, fourth column, just execute the tool. Again, we have four data files that pop up in the history. Okay, refresh again. Everything works fine, that's wonderful. If you have any issue at this point, don't worry, everything will be fine; just try again, maybe check all your settings. I swear I have a ton of histories, and there are more errors in them than good parts, so don't worry if anything goes red. It's normal.

So now, if we check our four newly merged files, we can see there is a nice new column formed, Survey_Area, with our survey names, a dash, and the name of the area. That's wonderful. Same here. Same here. And same here. Good job. Now we just need to change the column names to have our data files ready. We'll use the "Regex Find And Replace" tool. So, regex again; I'm really sorry for those who don't like it, but this one is actually easier, because we don't have any expression, just character strings, as simple as it can be. Just click here. You can do the copy-paste thing again with the tutorial; I'll do the first one this way, and then I'll explain how I did it. Again, you select the multiple datasets option, click and slide on the four datasets you just created, and insert a first check. You can copy and paste each argument here, it's okay to do so, but you can also just look into your column names. You'll have to add a check each time you want to change another character string, with the "Insert Check" button here. So we have "Species", which we want to change into "species.code"; then "CPUE_number_per_hour" into simply "number"; and finally "Survey_Area" into "location". So here we have our four columns that are going to be renamed, that's amazing, and the four data files here that we selected; execute.

After this, we'll just have to check if the columns are okay, so we'll have to wait for it. If you want to use this workflow on other data, or if you just want to preprocess these data in other ways, there are a million ways to do it; this is just one path, and you can do it any way you want. I added in the tutorial a frame that you can expand, with tool names that can be really useful for preprocessing data, so don't hesitate to look into it. There are many tools available in Galaxy to do this preprocessing part very efficiently; you just have to search a bit, and when you find the right one, it's great.

So now our four data files seem ready: we have year, species.code, number and location. Here I used the large view, but again, you can just use the preview: year, species.code, number, location. Okay. Same here, all right. And this one is okay too. So everything seems right; again, refresh your history for the tags to appear. We're done with the preprocessing of the data.

I know this part can seem very long and laborious, but we wanted to give you a real-life tutorial on this one. Preprocessing is always a large part of an analysis in science, very often even the longest part, and it is very important to get to know your dataset before analysing it: dig into it to know its biases and flaws, so you can analyse it with care and extract the best and least biased conclusions out of it. So be careful with that, to do good science, really; that's the final objective.

So now we can do the more scientific part of the tutorial: we can compute our community and population metrics. We'll start with the "Calculate community metrics" tool, by clicking on the tool name highlighted in blue here. Here it is. You can choose the concatenated data file: as we already said before, we can do the community metrics on the concatenated file, since this tool can separate the analyses by survey, so it's okay to do it this way.
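Before selecting the metrics, it may help to see what they actually compute. Here is a minimal sketch on a made-up toy community; the species counts are invented for the example, and the Hill formula shown is one common ratio convention, so check the tool's help for the exact definitions it uses:

```python
import math

# Toy community: abundance of each species at one observation unit
counts = {"Gadus morhua": 12, "Scomber scombrus": 5, "Clupea harengus": 3}

n = sum(counts.values())               # total abundance
richness = len(counts)                 # species richness: a simple count
p = [c / n for c in counts.values()]   # relative abundances

shannon = -sum(pi * math.log(pi) for pi in p)   # Shannon index H'
simpson = sum(pi ** 2 for pi in p)              # Simpson index (lambda)
inv_simpson = 1 / simpson                       # inverse Simpson index
pielou = shannon / math.log(richness)           # Pielou evenness J'
hill = inv_simpson / math.exp(shannon)          # one Hill ratio convention
```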
For the community metrics we want to compute, we'll select "all", because presence/absence, species richness, the Simpson and Shannon indices, and also the Pielou and Hill indices are all of interest, so we'll just compute them all. So it seems alright, we have our community metrics calculated; let's just check that we have everything. We have the total abundance of the community at each year and location, the species richness, the Simpson and inverse Simpson indices, the Shannon index, the Pielou and Hill indices, and a new column named observation.unit, which we'll talk about just later. If you refresh your history, you can see the #concatenated tag propagated here, which is completely normal; we just add a #community tag to identify our metrics file.

Then we can do the same to compute our population metrics. Let's go back to the training and click on the "Calculate presence absence table" tool. We select multiple datasets, and we select our three survey-alone files, so the 21, 20 and 19, and execute. We just have to wait for the metrics to be calculated.

The next step of the workflow is to compute the statistical model, with the Compute GLM tools, so let's have a look into them. We see we need an input metrics file, so the metrics we are currently creating; that's nearly done. Then we see we need a "unitobs information file". For now, we don't have such a file, so let's see in the help section what this file is about. Here in the inputs, we see the metrics file we need, with the population data, and then a second tabular file with observation unit data, which contains at least as many columns as the explanatory variables used, in addition to the observation unit column. This file gives information on each observation: where it has been made, when, what type of habitat it was, etc. It can contain any important information that has been noted in the field when the observation was made, or how the observation has been made, by whom, and so on.

As I said before, an observation unit is a single event of observation at one time in one place. The observation unit field is in the metrics files, as we can see here, and we will need it in the observation file as well; we can check in the population metrics, and it is there too. So, yes, it has been created in each one. We obviously have to have the same format of observation unit everywhere, as it actually represents the key linking the metrics to the observations made in the field, which is what we need. So if we have to build this observation unit file, we need an observation unit field formatted this way: the year of sampling, an underscore, and the location name just next to it. Here in the tutorial, the nomenclature is detailed; again, if it's not clear in this video, don't hesitate to pause and go look into the written tutorial, which says quite the same thing but in another way.

To format this new file, we'll just have to add an underscore at the end of every year value, with the "Column Regex Find And Replace" tool again. We'll do just as we did to add the dash at the end of the survey column, except it's an underscore and it's the year column. We'll do so on the concatenated data file to have the most information: it's the 22 here, the data file we actually used to calculate the community metrics, which still contains all the information we have on the observation areas.
So the column we want to modify is the year column, the second one. We insert a check, and in the find regex part we write the expression that points out any digit from zero to nine: 0-9 between square brackets, the same form as for the uppercase letters. Then, to indicate we want four of these digits, the format of a year, we add a 4 between curly brackets. And finally, as we did for the location column, we put this expression between parentheses, ([0-9]{4}), to call it back in the replacement entry with \1, and we just add the underscore after it. Now we can execute this tool.

Then, as we did before, we just have to use "Merge Columns together" on this last data file that is waiting to be generated. We'll just check that it worked: yes, there are underscores after each year here. We just have to say we want to merge column two with the location column, column nine. Execute, refresh to get the tags back, and just wait for the columns to be merged. Once the columns are merged, we have our observation unit field ready; let's see, at the end we have our observation units, properly formatted. That's amazing.

But to get a proper observation unit file, we now need to remove the underscores from the year column, with approximately the same manipulation we used to add them. So again, go to "Column Regex Find And Replace", select the last data file we used, column two, insert a check. This time the find expression is 0-9 between square brackets, four times, with the underscore just after, and we select only the digits part to put between parentheses: ([0-9]{4})_ , so only this part is between parentheses, not the underscore. We call it back with \1 in the replacement part, and the underscore should be gone. And indeed, if you look at your dataset, the underscores after each year value are gone.

Now we have to cut out the columns that give details on the species, the length class of the species, and its number, as our observation unit file needs to have information only on the time and place of sampling, not on the species or the observed biological individuals. So, on the last generated file, the one without the underscores, we go back to the training and use the tool "Cut columns from a table". We select the last generated file, then ask to keep the columns we add in the list of fields. It is still a tabular file, so it's delimited by tabs, and we cut by fields: we'll keep our first column, Survey; the second, Year; the third, Quarter; the fourth, Area; and then just location and year_location, so c9 and c10. Here we go. I do exactly as in the hands-on, so you can just refer to it there if you're a bit lost; don't hesitate to do so. I just don't read everything in the tutorial, because it would be a bit boring, I think. So, yeah, again, don't hesitate to put this video on pause, relax a bit, and see if you missed anything in the written tutorial.

So now we have our file properly cut, and we have the observations we want. But we likely have a lot of repeated lines, as we suppressed the information on species, so we'll have to suppress the duplicated lines and keep unique values in the output.
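As an aside, this three-step dance (add the underscore, merge, remove the underscore) is needed because the merge tool simply glues columns together. Outside Galaxy, the whole construction, including the deduplication we're about to do, collapses to a few pandas lines. A sketch with a placeholder input name, using the column names we set earlier and naming the key observation.unit directly, which is where the renaming below will land anyway:

```python
import pandas as pd

df = pd.read_csv("renamed.tsv", sep="\t")  # placeholder for our prepared file

# Build the 'year_location' key in one line, no underscore dance needed
df["observation.unit"] = df["Year"].astype(str) + "_" + df["location"]

# Keep only the sampling context, then drop duplicated lines: the same
# outcome as Cut columns + Sort with 'output unique values'
unitobs = (df[["Survey", "Year", "Quarter", "Area",
               "location", "observation.unit"]]
           .drop_duplicates()
           .sort_values("observation.unit"))
unitobs.to_csv("unitobs.tsv", sep="\t", index=False)
```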
To do so, as said in the tutorial, we'll use the "Sort data in ascending or descending order" tool. This tool doesn't only permit sorting your data, as you'll see a bit further. So just select the last data file, and indicate there is one header line. Then, the column to sort on: I'll use the sixth column, the last one, representing the observation units, as it's supposed to be the core of this file. Then we select ascending order and alphabetical sort; this part isn't really that important. The most important part of using this tool is to check this particular option, to output unique values. That's the outcome we want for our file, to shorten it to the maximum, so please check the "Yes" box here; everything else we keep as it is. Just execute this tool, and hopefully there will be fewer lines in our new file. So our data is finally sorted and we have unique lines: we see there are far fewer lines, around 100,000 before, and in our unique-lines file we have just around 600. So we were right about the fact that we had a lot of repeated lines.

We just have to change the column names now, with the "Regex Find And Replace" tool. It's the same as the Column Regex tool, but it just isn't focused on one column; each of the two really has its own use, and one is not always proper for the same manipulation, so choose according to the expression you're looking for. Let's choose our last sorted and unique data table, insert a new check, find "year_location" and replace it by "observation.unit". Make sure you start by modifying year_location, so that the first check is on this particular expression. If you run it after the second check, the one that replaces "location" with "site", you'll have no more "year_location": of course, the tool looks for "location", finds it inside "year_location", changes it into "year_site", and then "year_location" can't be found anymore. So just be sure to use it as the first check. Then we check we have the right data file, and execute. If these steps work, we have our observation unit file ready, and we can start to compute some models. So, we did a pretty good part here; it was, I hope, not too heavy. Let's see: we have "site" and "observation.unit", it's perfect.

We just add a "unitobs" tag to identify it. I don't put a hashtag here, because it's the last version of our observation unit file, so it's not necessary to make it propagate. The problem is that the tag won't appear here when the dataset is not expanded, which is problematic for visualisation; but when you enter the dataset in a tool, you can see the tag anyway, and that's where it is the most important to see it, I think.
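One last note on that renaming step before we move on: the ordering trap is easy to see in two lines of Python, since the checks are plain string replacements applied in order:

```python
header = "year_location\tlocation"

# Replacing 'location' first corrupts the other column name:
print(header.replace("location", "site"))
# -> 'year_site\tsite' : 'year_location' can no longer be found

# Replacing the longer, more specific name first is safe:
print(header.replace("year_location", "observation.unit")
            .replace("location", "site"))
# -> 'observation.unit\tsite'
```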
Okay, let's go back to our training. We'll start with the community analysis. And that's when I have to talk a bit about generalised linear models. In these tools, you'll be able to perform either a classical generalised linear model, which will test the effects of year, site and/or habitat on any metric you choose, or a generalised linear mixed model, which can handle pseudo-replication of year and/or site values by taking temporal and/or spatial sampling replicates into account. I hope this is understandable explained this way; it's really a statistical technicality, so it's normal if it's hard to comprehend, don't worry. To achieve this, we have to set random effects, and that's why the model is said to be "mixed": we can set random effects either on year and/or on site.

In the data we currently have, there is no information on the habitat, so we'll test the effects of year and site only in our models. There is often geographic pseudo-replication in this type of sampling (there are often replicates, maybe the same individual sampled in two areas), so to take this into account, we'll set only site as a random effect. You have to know that if you put a random effect on year as well, you won't have usable results from your model: you can't have all the effects of a GLM set as random. So we have to test year as a fixed effect here.

For the community analysis, we use "Compute GLM on community data", which is here in the first hands-on. We use our community metrics file, the 23, tagged #concatenated and community. The unitobs file is the last one we made; of course, it's just here, we just did it. Then, for the response variable from the metrics file, we have a big choice: total abundance, species richness, Simpson index, inverse Simpson index, Shannon index, Pielou index and Hill index. The species richness is the easiest to understand, because it is interpreted as the quantity of different species at a given time and location, so it's just a count. It's located in the first column, so that's the one we choose. Then, to avoid mixing up the analyses of the different surveys, the analyses are of course separated by the survey column, so we'll have one analysis per survey; it is located in the first column of the unitobs data file. Then, as we said before, we set year and site only, so I suppress the habitat effect here. If you want to add it back, you just have to click on the entry and click back on habitat, and to suppress an effect, you just suppress it this way, with the cross; very easy. And we keep a random effect on site, as in the default parameters. You can specify advanced parameters in this tool, but we won't do it this time; you can get more information on these advanced parameters in the tutorial, in the frame here: if you expand it, you'll have details on them, so don't hesitate to go through it. Then we just have to click on execute and wait for the models to be ready.

So here we have our results: there are three outputs. The first file is the GLM results file. It often has a weird display; yes, it does, it's quite hard to read in Galaxy. There are some ways to avoid this type of problem. You can enable the Scratchbook with this logo here in the top banner, and if you click again on view, it should be right: yes, you can now see it as a well-formatted data file. There is another way to see it better: as said in the tutorial, as an optional step, you can use "Transpose rows/columns in a tabular file". Just select the GLM results and execute, and it will display nicely, for you to look at it better and make it easier to visualise. Both options are good, just choose the one you prefer. There are many, many details on the produced outputs in the written tutorial, so if you want to know what each little part of the files means, you can go to the tutorial, read everything, open the frames with details, and you should know everything about what's really happening in these files. So here is our transposed file; I just disable the Scratchbook so you can see the transposed data. We have a lot of analysis results and indicators in this results file.
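To make the model a bit more concrete before we read the outputs: the PAMPA tools fit these models in R behind the scenes, but the fixed-effects version can be sketched with Python's statsmodels, purely as an illustration (file and column names here are placeholders, not the tool's actual ones). The mixed version, with site as a random effect, would for example be written glmer(richness ~ year + (1|site), family = poisson) in R's lme4.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per observation unit, with columns 'richness', 'year', 'site'
metrics = pd.read_csv("community_metrics.tsv", sep="\t")  # placeholder

# Poisson GLM: the distribution the tool picks automatically for count data;
# C(...) treats year and site as categorical (factor) variables
fit = smf.glm("richness ~ C(year) + C(site)",
              data=metrics,
              family=sm.families.Poisson()).fit()
print(fit.summary())  # coefficients and p-values, like the 'signif' entries

# Rough overdispersion check, akin to one of the rating criteria:
# a ratio far above 1 suggests the Poisson assumption doesn't hold
print(fit.pearson_chi2 / fit.df_resid)
```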
On this first output, the GLM results from your community analysis, you can see there have been four models produced: one global, one on BITS, one on EVHOE, one on SWC-IBTS. As for the first one, on global data, we won't look into it, as it's not relevant for reasons we already explained. We can see that the distribution set is the Poisson distribution. It has been set automatically by the tool, and it is the proper one for species richness, as it is count data, so the Poisson distribution seems just fine.

Let's look only at the significant parts of the results. I know it's not readable this way, don't worry; the representation we'll make next will be far more efficient to look at, but I want to have a look at these raw results with you, for you to get to know them a bit. So if you look into each "signif" line on the transposed file, or each "signif" column on the classical GLM results (which isn't readable anywhere else than in the Scratchbook), you can see there is always a "not significant" result, all the way to the end of the file. It means that the community hasn't really changed over time in these places: no significant results, no significant effect of time on the species richness.

Now, usually we'd be a bit disappointed, but we forgot something that is a major issue with our data file: it contains only part of the species found in the samples. All the rarest species have been removed from the data, to get a smaller dataset to use in this tutorial and avoid a too-long runtime of the tools. So the dataset we used isn't proper for community analysis, as it contains data on only part of the community. Here it's for the example, so it's not a big deal, as we only wanted to show you how to use these tools, and now you know how to do a proper community analysis. That's pretty good; it's just that the data isn't fitted to do so.

We can still move on to the second output, simple statistics on the chosen variables. This doesn't have much information on the model itself; it's more about the response variable we used. We have statistics such as the maximum value of species richness in our datasets: here we see the maximum is seven, which is very, very low for a trawl fishing survey. It only confirms that this dataset isn't suited for a community analysis.

The third output is actually my favourite. It gives you more keys to evaluate your model, and gives an indicative rating of your model. Here we have a global rating of 3.5 out of 10, which is pretty bad actually, but we have to see the details of every analysis. I'm just going to zoom out a bit so we can have it on a single line; I hope you're still seeing it. So let's look at the details of every analysis, because the global rating is quite an indicative rating, not to be trusted too much, not to make real conclusions from. The best rating we've got is 4 out of 8, on the EVHOE analysis. This model has a complete sampling plan; each criterion is set here and defined just below, and if you go to the tutorial, you'll have details on the analysis rating file, in this frame just under "read and interpret raw GLM outputs". So if you want to know more about all these criteria, just go there. So here we have a complete sampling plan, which is very good, but it's not balanced, which is a bit problematic. It has few non-attributed values, which is good.
But it seems the dispersion and the uniformity of the residuals aren't okay, and these two criteria, if they are not checked, are pretty bad: it actually means that the Poisson distribution doesn't fit our data, so it's really not good for the model itself. Then, in the last part of this file, we have red flags and advice: the major red flags you have in your analysis, and some advice on how to make your analysis better. For example, you can try to improve your model by trying another distribution law. But as we stated, the data wasn't proper for a community analysis, so it probably won't help much in our case. This advice is not always to be taken; it's just a little help if you need one, it doesn't actually solve data problems.

So we can still try to visualise our results, even if they probably won't be trustworthy; it can be interesting to see what the plot of a bad model looks like. Let's go back to our tutorial, just down there: there are all the details you need on how to interpret every single part of these three output files. I won't do it here, it's too long, and I'm too afraid I would kill you with it; but just look into it. It can also be useful if you want to interpret statistical models outside of these tools, even if you do it in R or elsewhere; it can be really useful, I think.

So let's create the plots, with the "Create a plot from GLM data" tool. The input is the GLM results file, so note, not the transposed one; just be careful about that. Then, the metrics data table used for the GLM. Here is a really handy way to remember which metrics table you used: in the GLM file name, you can see the data you used to perform your analysis. So the 32 is the unitobs; here, the metrics data we need is the 23, and the 32 for the unitobs table. A really handy way to make sure you pair the right GLM with the right metrics and unitobs tables. So let's execute and see what it looks like.

It creates a data collection containing four PNG files. First, I'll zoom out, because it's in high definition. Here we have the representation of the global analysis; as I said, it's not relevant to look into these results, as it mixes data from several communities together. Let's look at the other ones. The SWC-IBTS isn't really interesting either: we see every line is pretty flat, and we have very big confidence intervals. To give you more keys: the blue line represents the species richness variation estimated by the model through time, and there is not much variation from one year to the other. If any value were significantly different from the first year of the temporal series, it would be marked right here; obviously, nothing appears. Then the lower plot represents the mean species richness for each year, without the modelling, and we still observe not much variation. You can get a more detailed interpretation of this representation in the written tutorial, but I won't talk more about it here, as we already stated these analyses are of poor quality. So I'm going to zoom back. Okay. Here we can see we got the same results as before, so we interpret it the same way. Under the plots in the tutorial, you can find the detailed interpretation I just talked to you about.
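If you ever want to reproduce something like that lower panel outside Galaxy (the unmodelled mean of the metric for each year), a matplotlib sketch could look like this, reusing the placeholder names from the previous snippet:

```python
import pandas as pd
import matplotlib.pyplot as plt

metrics = pd.read_csv("community_metrics.tsv", sep="\t")  # placeholder

# Mean and standard deviation of the metric per year, no model involved
per_year = metrics.groupby("year")["richness"].agg(["mean", "std"])

plt.errorbar(per_year.index, per_year["mean"], yerr=per_year["std"],
             fmt="o-", capsize=3)
plt.xlabel("Year")
plt.ylabel("Mean species richness")
plt.title("Raw yearly means (no modelling)")
plt.show()
```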
Now we can go back to the population metrics; maybe we'll get lucky with the population analysis this time. We can go directly to "Compute GLM on population data". Same inputs as before, except we have to run the tool on several metrics files, as we have several surveys: here are the three files we need, BITS, SWC-IBTS and EVHOE. Then the unitobs file is the same as for the community analysis, of course. For the response variable from the metrics file, we'll choose the abundance, so the "number" column, the fourth one. Just as before, we set year and site as explanatory variables, site with a random effect, and likewise we won't specify any advanced parameters. So let's execute these models right away.

Now we have a lot of outputs, three per survey, so nine in total; we get the same outputs as for the community analysis. You can go through them and try to analyse them as we just did for the community analysis; here it is computed for each species individually. I'll just go through the SWC-IBTS analysis in this video tutorial; you can check the interpretation of the BITS analyses in the written tutorial, and for the EVHOE analysis, I'll let you go through it if you want and try to make your own interpretation of it. So here we have all our analyses; to know which one is from which survey, just refresh your history, and here you have all the tags you need to see.

As I find the plots easier to interpret, and almost as complete as the raw results, I'll do the interpretation directly on the plots. So I just open the "Create a plot from GLM data" tool again, with multiple datasets, of course, for the GLM results as well as for the metrics data tables. They are in the same order, so it won't be a problem at all; then the unitobs entry, and just execute.

While I wait for the plots to be produced, I'll just take a look at the rating file of the SWC-IBTS analysis, to see if there is anything I should preferably look into. We can see all the single-species analyses have the same rating, 4 out of 8, which is a medium rating; but of course, we have to look into the criteria, and we see the same criteria are validated or not each time. So we can just do a meta-analysis of all our single-species analyses here: we can see the sampling plan is never complete or balanced, and the residuals aren't uniform; however, the dispersion is regular. So it's hard to make a conclusion on the distribution, but it doesn't seem to fit very well.

Let's see how the graphical representations look now; it's easier to know. So I'll have to zoom out again. We see the representations are the same as for the communities, so we'll just look for any significant analysis. Here we see, as we don't have significance stars on the global trends, that there isn't a significant effect of time on the abundance of Gadus morhua. Same here, we don't have significant effects; no significant effect of time. Again, no significant effect. And finally, here we have a little star: we have a significant effect for Scomber scombrus, which is simply the Atlantic mackerel. It shows a significant increase according to the model. However, when we look further into the graph, we see an increase of more than a million percent in individuals from 1985 to 2010, which is pretty doubtful actually. We see there is quite an outstanding take-off in raw abundances between 2004 and 2007, before coming back to average values. So it seems some bias has been introduced, or something happened.
Let's take a look at the other ones to see if we find the same thing. There is also an event like that with Gadus morhua between 2006 and 2010. Here we don't see such an event. It seems there is a problem with this species too. And what about the last one? No, we don't see it. So there are three species that seem to have issues, with strange take-offs in raw abundances, really. This kind of pattern often appears in biased datasets, with a shift in sampling methods, for example, or mistakes while counting the individuals or typing in the data. So maybe we'd get better results if we cut the data from 2004 to 2010. You can try it if you want after this tutorial; we've already seen tools to do so, so you can do it, I'm sure, and I hope the models will be better.

For the three other species, the observed variations are smoother; the three variations just seem okay, even if the models aren't significant. But we can observe some strange behaviour sometimes in these plots as well: like here, we have a major increase, then a major drop back; these two points seem quite doubtful. So maybe there were mistakes from 1992 to 1993, we don't know. It happens a lot in this kind of long-term survey; that's why you have to get to know your data before making any conclusions. Many shifts can have occurred in the material, the methodology, the people working on the project, the way we count; many things can happen. So make sure you take all these modifications into account when you do such an analysis on a data file that you don't know a lot about.

So here it is, we just finished this tutorial. I'm just going to zoom back, because you must not be seeing much. Yes, we did everything we had to do today on ecological analysis, on both community and population levels. So now you know how to preprocess abundance data on Galaxy, compute ecological metrics, construct a generalised linear mixed model and interpret it. It's not that easy to get immediately, so I hope I didn't bore you too much with technicalities, and that you enjoyed this training with me. Don't hesitate to ask questions on the event's chat or by email at this address, my professional address, coline.royaux@mnhn.fr. If you have any suggestions on how to make this training better, do not hesitate to contact me as well. And also, don't hesitate to use the feedback form at the end of the training to give me your feedback. Thank you very much for watching.