Recording for the last hour. Before the break I already showed this slide: by using the 3D matrix it is really easy to make these kinds of pie charts, because we can just slice along one dimension if we want to look at the fish dimension, and along the second dimension if we want to look at the lake dimension. Making the pie charts teaches us that there are significant differences between the experiments done in the different years, and this comes from the fact that the different lakes have been fished very differently. The contribution of each lake to the total amount of fish caught varies a lot, and that also causes the different species to have very different influences on the total catch.

So the observations we made were: there are significant differences in fish composition per year, the contributions differ between years, and that means we are probably not allowed to draw conclusions across years and across lakes. We probably have to limit ourselves to individual lakes, and then we run into the issue that the sample size from a single lake is relatively low compared to the total number of fish caught. Is there a solution? Of course: group similar lakes together, which is the first idea that comes to mind. We just pretend that all of these fish were caught in one big mega-lake. That is not something we can do blindly, but if the lakes are not too different we might be able to group them. If we group them we increase the number of fish, and with that the statistical power, and that is what we want. So out of all of these piles of Legos, all of these different lakes, we want to see which lakes are similar and put those together.
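As a rough sketch of that slicing (the `fishy` array, its dimension sizes and all numbers here are invented stand-ins for the course data, not the real data set), fixing one year of a species × lake × year array and summing over a margin gives exactly the vectors a pie chart needs:

```r
# Hypothetical species x lake x year count array; names and numbers are made up
set.seed(1)
fishy <- array(rpois(9 * 5 * 3, lambda = 20),
               dim = c(9, 5, 3),
               dimnames = list(paste0("species", 1:9),
                               paste0("lake", 1:5),
                               c("2017", "2018", "2020")))

# Fix the year, sum over lakes -> catch composition per species in 2017
per.species <- apply(fishy[, , "2017"], 1, sum)
pie(per.species, main = "Catch composition, 2017")

# Fix the year, sum over species -> contribution of each lake in 2017
per.lake <- apply(fishy[, , "2017"], 2, sum)
pie(per.lake, main = "Contribution per lake, 2017")
```

The same `fishy[, , year]` slice is reused below for the correlations, which is the convenience the lecture is pointing at.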
[Question from the audience:] If all the lakes had been fished in a standardized way, would that change the picture? Yes, definitely. If you do an experiment and you then repeat it, you have to repeat it in exactly the same way to get data that you can merge. And standardization is the issue here: it could be that from one year to the next the dominant fish species in a lake changes, and the year after that another species happens to dominate. These lakes are maintained by people, some of them are stocked with fish; there are organizations managing these different lakes, and they have different ideas of how the lakes should be managed. If you wanted to set up an experiment like this, you would probably say: we look at three lakes, so we have three more or less biological replicates. Calling them biological replicates is a bit odd, but you have three replicates. If someone came to me writing a project proposal, the first thing I would tell them is: limit the number of groups you create, because the more groups you create, the smaller the sample size in each group, the less statistical power, and the more likely you are to miss a very significant effect. So I would say: look at three lakes.

It could be that there is a massive nutrient input leading to a massive increase in fish abundance, yes, that might happen. But if you standardized it and said: in every lake, every year, we fish for one hour at one specific spot, we do electro fishing for one hour, then the first year we might catch 200 fish; if there then were a massive nutrient increase, the next year we would catch 400 fish in the same lake, and those are numbers we can more or less compare, because the increase is measured against a standardized fishing time. If I fish for an hour and catch a hundred fish, and the next year I fish for an hour and catch 200, I can more or less conclude that there was an increase. But in the data set we are looking at it does not seem like anything like that happened. It seems like we fished a little bit here, a little bit there; there does not appear to be a standardized protocol like "we fish for one hour, we use this technique", and that creates some issues. Of course populations increase or decrease, but it is very strange to see, in one of these lakes, like the Lomar lake we looked at, that in one year 90% of the fish we caught were of type X, and a year later we fish again and catch almost none of type X. Then you have to start wondering what happened: did all these fish of type X die? You can only really see that when you standardize your fishing methodology.

So if you have different sampling efforts between years and lakes, you cannot compare between years and lakes. It would be like doing a laboratory experiment using method A the first time, method B the second, method C the third, and using one kilogram as input the first time and 50 kilograms the second: the data you gather is not consistent, and you cannot easily group it. But let's try to figure out whether some of these lakes are similar enough that we can actually group them. So when are two things considered to be equal? Well, we can use statistics: we can test whether two lakes are different, and if they are not different, then they should be equal. The big issue here is that we have only nine fish
species per lake. We had more before we reduced the data, but there we run into the issue that some of these fish species are very rare: we count five fish in one year, and in the next year we suddenly have 200 of them. It is really difficult to use statistics in this case, because the distributions vary so much. But we can use a trick: as I showed you, we can rely on the correlation. Normally I would say that if you compute a correlation and it is about 0.8 or higher, you can treat those things as equal. If I measure a whole bunch of data and measure again later, and the correlation between yesterday's measurements and today's is high, that means I have probably measured more or less the same thing.

[Question:] Couldn't you use a factor that incorporates the different efforts, a catch per unit effort? Yes, that would be a standardization as well: if you take, for example, the number of spots you fished for a given amount of time, you could define that and calculate fish per unit of effort. But remember that if you only catch one or two fish, standardizing by effort will artificially inflate those numbers massively compared to when you catch 200. So it is not easy to combine these things, and every step in your analysis is something you will have to defend. If you want to publish a scientific paper, a reviewer will scrutinize every step you did, and the reviewer will have their own views on how to do this; if those do not align with what you did, you end up in conflict with the reviewer. With three different reviewers holding three different opinions on the same technique it gets even harder, because you cannot satisfy all of them: if one reviewer says you should use a t-test, the second says a Welch test, and the third says a linear model, any approach you choose will be rejected by at least two out of the three.

The way I thought about it is this: we have three different years, and in my mind everything correlated above 0.8 is more or less equal. So if I sum the correlations from each of the years, then two lakes are "equal" when the summed correlation is above 2.4. I am just saying that 0.8 per year is good enough, although that leaves a little bit of room, because you can also sum to 2.4 with a correlation of one, a correlation of one, and a correlation of 0.4. On average, if the summed correlation between two lakes across the three years is higher than 2.4, it means that from year to year the correlation between these lakes tended to be relatively good, so we can assume that roughly the same effort was put in and that we measured more or less the same composition of fish in those two lakes. Again, using the 3D matrix the correlation becomes really easy, because we can just take the 2017 slice of fishy and use the Spearman method, to make sure single outliers do not influence our results too much. Of course there is still the issue that we only have nine or ten species of fish, so the correlation estimates come with a relatively large margin. But since we are doing it three times, I just compute the first, second and third correlation for the three different years, sum them all up, and ask which sums are higher than 2.4.
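A minimal sketch of that summed-correlation idea, reusing the same invented `fishy` array (the real lake names, counts and which pairs clear the threshold all differ in the actual course data):

```r
# Invented species x lake x year count array standing in for the course data
set.seed(2)
fishy <- array(rpois(9 * 5 * 3, lambda = 20),
               dim = c(9, 5, 3),
               dimnames = list(paste0("species", 1:9),
                               paste0("lake", 1:5),
                               c("2017", "2018", "2020")))

# Spearman correlation between lakes, once per year, then summed:
# cor() on a matrix correlates its columns (here: the lakes) pairwise
csum <- cor(fishy[, , "2017"], method = "spearman") +
        cor(fishy[, , "2018"], method = "spearman") +
        cor(fishy[, , "2020"], method = "spearman")

# Which lake pairs average at least 0.8 per year (sum above 2.4)?
csum > 2.4
```

Note the diagonal always sums to 3, which is the "every lake is identical to itself" observation from the lecture.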
What I see then is that the Cothamster Colk is a unique lake: there is no other lake it is comparable to. You of course see that every lake is identical to itself, which is logical, because the correlation of something with itself is always one. But we see that, for example, the Steadorfer-Bachersee, the Weidekamsee and Salzdorf are relatively well correlated, the three of them, so these three lakes we could probably group into a single mega-lake: they have relatively good correlation with each other across the three years that were measured, so conditions might have been very similar. We could also look at the map: if lakes are relatively close to each other, that might support the grouping a little as well. So the conclusion here is that we can probably merge the Steadorfer-Bachersee, the Weidekamsee and Salzdorf and treat them as a single lake.

But then the question also becomes: how stable is a lake from one year to the next? If there are massive differences from one year to the other, it becomes really hard to say anything about stability over time. If we want to compare what we caught in 2017 with what we caught in 2018, the composition in 2017 should be kind of similar to the composition in 2018, regardless of how many fish we caught: we would assume that, say, 75% would be Rotauge and 25% would be Barsch in both years. The stability of a lake from one year to another we can again check relatively easily. What I do is: beforehand I create a new two-dimensional matrix which is filled with NAs. I take the column names from the fishy matrix, which are the lakes we have, and the lakes go into the rows; the three pairwise year comparisons go into the columns, and we add the row names and the column names. There are three different comparisons, because we can compare 2017 to 2018, 2018 to 2020, and 2017 to 2020 as well; I just do all the years pairwise. Then we compute the correlations between the different years: we go, for L in the names of the lakes, compute the correlation between 2017 and 2018 and put it in our matrix, and do the same for the other comparisons, so we just fill the matrix up and look at the correlations.

So how does this look? Here we see the correlation coefficients, and there are some lakes which are very consistent: the fish pulled out of the lake are very similar from 2017 to 2018, very similar from 2018 to 2020, and again very similar from 2017 to 2020. The Cothamster Colk is a lake that is really stable across time. But we see that Hopples is definitely not stable, and even a little bit weird: from 2017 to 2018 the correlation is relatively low, from 2018 to 2020 it is also low, but from 2017 to 2020 there is a perfect correlation. So the distribution of the fish caught in 2017 is exactly identical to the distribution of fish caught in 2020, while in the year in the middle the distribution was more or less completely different. Overall, Hopples has relatively low correlation from year to year. There are lakes with medium correlation as well: the Visu de Mir, the Weidekamsee, the Steadorfer-Bachersee, Salzdorf and the Blockhorst are lakes with only medium correlation.
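The matrix construction just described could be sketched like this (again with the invented `fishy` array; the real lake names and coefficients differ):

```r
# Invented species x lake x year count array standing in for the course data
set.seed(3)
fishy <- array(rpois(9 * 5 * 3, lambda = 20),
               dim = c(9, 5, 3),
               dimnames = list(paste0("species", 1:9),
                               paste0("lake", 1:5),
                               c("2017", "2018", "2020")))

lakes <- dimnames(fishy)[[2]]
comparisons <- c("2017 vs 2018", "2018 vs 2020", "2017 vs 2020")

# Empty matrix filled with NAs: lakes in the rows, comparisons in the columns
stability <- matrix(NA, nrow = length(lakes), ncol = length(comparisons),
                    dimnames = list(lakes, comparisons))

# Fill it up: for each lake, correlate the species counts of two years
for (l in lakes) {
  stability[l, 1] <- cor(fishy[, l, "2017"], fishy[, l, "2018"],
                         method = "spearman")
  stability[l, 2] <- cor(fishy[, l, "2018"], fishy[, l, "2020"],
                         method = "spearman")
  stability[l, 3] <- cor(fishy[, l, "2017"], fishy[, l, "2020"],
                         method = "spearman")
}
round(stability, 2)
```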
All the other lakes seem very stable across time. This is just something to keep in the back of our minds: when we want to compare two years with each other, we can do that for most of the lakes, but for Hopples we have to make sure we never compare 2018 with 2017. And Salzdorf here is a case in between, not that high but not that low either: comparing 2017 with 2018 gives a correlation of 0.52, which is not the best correlation in the world. The squared correlation is known as the shared variance, and 0.52 squared is about 0.27, so the shared variance from year to year is relatively low, only about a quarter.

All right. So far we have looked at our data a lot, just to get a basic understanding of what is going on and which lakes we can look at. But of course this experiment is much bigger, because apparently all kinds of treatments were applied. So I loaded in the data set called Rotauge, where the size of the fish was measured. This is just for the Rotauge, only for that one fish species: every fish of that species that was caught had its length measured, in millimetres or centimetres, that is still undetermined, probably millimetres, but you never know. And different treatments were applied. The first treatment is Totholz. I am not a native German speaker, so the first thing I do when I see a word I do not know is throw it into Google, and Google told me (translating the German): under the term Totholz one understands standing and fallen trees, or parts of them, that have died off; it is the last developmental stage in the life of a tree and one of the most important structural elements of our forests. From that I figured out that they probably took dead branches and wood and put them into the water.

Then we have a treatment called Flachwasser, and one called Flachwasser plus Totholz, where I have no idea what exactly happened; we have a control group, which probably means that nothing happened; then we have Fischbesatz, which someone told me is when you add fish to the water, so you put in more fish; and then we have an unbewirtschaftete Kontrolle. That really confused me: how can there be two control groups? As a statistician, when someone tells me "we have two control groups in our experiment", I think: okay, what exactly is the difference between them? It suggests there is some uncertainty about what your default group really is. But I looked into it further, and these are the definitions I found for the Kontrolle and the unbewirtschaftete Kontrolle: the one is fished, the other is not fished. But you are doing an experiment in which you fish lakes, so do you mean the unbewirtschaftete Kontrolle was never fished at all, or just not fished by anyone other than you? That makes a big difference. "No activity by recreational anglers", all right, but does that mean you put a security guard around the lake so no one ever fished it illegally? In any case, nothing was put into the water there. I struggled with this, because I would have chosen a completely different control group: for me, the control group is before you put something into the water. [Comment:] Illegal fisheries never show up in data anyway. That is not entirely true: if all of a sudden all of the fish in your lake are missing, if you fish it one year and nothing comes up, you can be fairly sure someone fished it, even if it is your unbewirtschaftete Kontrolle. But for me it's a
little bit strange. [Comment:] Natural predation? Well, it would be really strange for all of the fish to be gone: predators are good, but not so good that they decimate a whole lake until nothing is left. For me, the control group works like this: if I set up an experiment, I go to a lake, I fish it for an hour, and I see what I get out of the lake. Then I put in my treatment, say I throw a barrel of benzoic acid into the lake, or I dump nuclear waste into it; then I wait some time, and then I fish it again. Now my control group is the fish I caught before any effect of the treatment could be seen. That, to me, makes the most sense as a control group. Having other people fish there is of course part of reality, you are dealing with a dynamic system, but a control group, at least in statistical terms, is about before versus after the experiment: before the experiment begins you fish, that is your baseline, and you compare everything after your treatment to that baseline level. There can then be different situations, like Kontrolle, meaning people fished there, and unbewirtschaftete Kontrolle, meaning no one fished there; those are different treatments. But your real control group, your null-hypothesis group, is what you had before you started doing anything.

But let's just look at some basic statistics. The Rotauge data, when I look at it, contains a puzzle: we have a lake, a date and a point ID; then we have a time point, before or after; and then we have the treatment column, with the treatments we just talked about. (I had totally missed one detail at first, but it doesn't matter; I would rather not know anything beforehand, because then I can look at the data with an unbiased eye and say "this is weird" or "this makes sense". That attitude comes from my time working in human genetics: there you get no metadata, all patients are anonymized and all treatments are double-blinded, and that is the best research, because the researcher cannot favour any hypothesis and the data speaks for itself.) The lake, the date and the point ID are an interesting puzzle, because this Rotauge matrix does not contain the length of the fish: it just has the treatment that was applied, where it was applied, some other data about when it was done, and this end-to-end column which I initially thought was probably very important, but which I did not touch because it was strange.

If we just look at the treatment column of the Rotauge data, it looks really good: we have a massive sample size, 17,000 fish that saw the treatment Fischbesatz, and the other treatments also have really high sample sizes, which is great, because that means a lot of statistical power. But then I thought: let's include this time point column, before and after. The nice thing about the table function is that you can give it a vector and it will tabulate everything in the vector; you can also use the table function on a matrix with two columns, and then it gives you a split across the treatments and across the time points. So the table function is a really useful, really easy function if you want to make two-way tables. And then I saw something that looked really good as well: we have about as many fish caught before the treatment as after it. If I look just at Totholz, I have 821 Rotauge caught before anything was done.
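In sketch form, with invented vectors standing in for two columns of the Rotauge data set (levels and counts are made up):

```r
set.seed(4)
# Invented stand-ins for the treatment and time point columns
treatment <- sample(c("Totholz", "Flachwasser", "Kontrolle", "Fischbesatz"),
                    200, replace = TRUE)
timepoint <- sample(c("before", "after"), 200, replace = TRUE)

# One vector -> simple counts per level
table(treatment)

# Two vectors (or a two-column subset) -> a two-way table,
# splitting the counts across treatments and time points
table(treatment, timepoint)
```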
After the treatment there are about a thousand, so that would make a really good statistical test: if I just did a t-test between the two groups, I would be testing an n of about 800 against an n of about a thousand, which would give me massive statistical power, so I could pick up really, really small effects. Of course I also split it out per year, because the year is our third big dimension, and when I did that I saw something strange: in 2017 there is only "before treatment" and no "after treatment". That means a t-test for Totholz there would be 821 fish before against zero fish after, meaning no statistical power at all; in 2018 it is the other way around, and in 2020 as well. So this is a bit of a pickle, because we just concluded that it is very hard to compare across lakes and across years, given the different effort put in and the different contributions of each lake. Within a single year I would be very comfortable testing before against after, but we cannot: we have no power to test within a year, only across years. Thinking about what happened: in 2017 the different lakes were fished, then the treatment was put in, then they came back in 2018 and fished the lakes again, and the same in 2020; both of those are after the treatment, because the treatment was applied in 2017.

That is a little bit difficult, but we can still do some statistics. We cannot group all of the lakes together across time, because that is something we are probably not allowed to do statistically, but we can look at a single lake across time, as long as that lake is not Hopples, because Hopples is a very weird lake, and a few of the other lakes are also relatively weird. Any of the lakes that are stable across time, though, we can use for the analysis. So we have to do the analysis within one lake, comparing 2017 to 2018 and then 2017 to 2020, to see whether the treatment had any effect on the size of the fish.

Now, in 2017 they fished before they applied the treatment, so what is the control group then? In my mind, statistically speaking, the control group is all of the fish in 2017, because none of the fish caught in 2017 ever saw any treatment. The treatment column for 2017 is actually a nonsensical column: of course those fish were living under Kontrolle or unbewirtschaftete Kontrolle conditions, but in theory we are interested in the effect of the wood that was put into the lake, and before any wood went in, the fish were simply living in their different circumstances, because a lake is either fished or not fished. The treatment itself, the adding of the wood to the water, was only done after the fishing in 2017. So the control group in my mind, the base we have to compare to, is all of the fish caught in 2017, because they never saw any of the experimental influence; and then we compare 2017 against the other years.

Now that we have figured that out, what is the difference? First let's look at the data and solve this lake-date-point puzzle, because we have two different matrices and they are coupled together by this lake, date and point ID. The point ID says where the fish was caught, and then we have the lake, which is the lake
where the fish was caught. But they are in two different matrices, so we have to merge them to be able to say: the fish we caught at this point in time had these treatments applied to it. So let's look at the Rotauge. Across all of the different lakes, we see that some lakes are actually useless: we cannot use Lomar for anything, because no Rotauge was ever caught there, and the same holds for Xela. If we apply the sum across the columns of the Rotauge matrix, we see that 149 Rotauge were caught in 2017, 274 in 2018 and 330 in 2020; summed together, 753 Rotauge were caught and measured lengthwise. But the conclusion is that three lakes do not have enough information on the size of the Rotauge before the start of the experiment: the Steadorfer-Bachersee, for example, only had five fish caught before the wood was added to the water, and the same holds for the Mitzler See, where only three were caught. We could be lucky and catch three big ones, or unlucky and catch three little ones, but we cannot compare the 89 caught afterwards to the three caught earlier, because we can only compare across time within a lake, not between lakes, which we cannot easily group. In two more lakes we have limited information, the Cothamster Colk and Salzdorf; these are also strange, because in the Cothamster Colk in 2018 we only caught one fish, so we cannot say anything about the effect of the treatment from 2017 to 2018. So two lakes need to be discarded anyway, and in the end we are left with three lakes which have a relatively solid before-treatment group. These three lakes we can do something with, so let's look at the Rotauge in them.

First I make a new variable called lakes, which holds the names of the lakes; then I take the Rotauge counts for those lakes from fishy, apply the sum, and see how many fish I get: 93 in 2017, 53 in 2018 and 152 in 2020, so in total about 300 fish across the three different years. I make myself a little helper function which gets as input the fish data set of that year; this F data set is all of the fish caught, together with their lengths, and the helper tells me which rows contain Rotauge. I need that because the Rotauge matrix, the one with the treatments, does not have the lengths in it, so I have to fetch the lengths from the other data set. And of course I only want to look at these three lakes, so I use my helper function to determine which fish in F2017 were caught in a given lake and are Rotauge, and take those out. I just make subsets; I could have used the subset function, but I generally use this kind of logical selection from the matrix. So I made three subsets, Visu de Mir, Kiessteig and Salzdorf, for 2017, because I wanted to look at my control group first. You see that we started off with 753 fish measured, but from a statistical point of view only about 300 of those can be used for any kind of statistics, because for the others there could be a massive lake effect: it could be that in lake number one a Rotauge is on average 500 millimetres long, that that lake makes all of the fish bigger compared to another lake. So we start with a sample size of 753 and immediately have to reduce it to about 300, simply because not every lake has a reasonable number of Rotauge caught before the treatment.
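The helper-plus-subset pattern could be sketched as follows; the data frame, its column names, the lake names and all lengths here are invented stand-ins (the real data set has a different layout):

```r
set.seed(5)
# Invented stand-in for one year of the length data set:
# lake, species, and length in (presumably) millimetres
f2017 <- data.frame(lake    = sample(c("VisuDeMir", "Kiessteig", "Salzdorf"),
                                     300, replace = TRUE),
                    species = sample(c("Rotauge", "Barsch"), 300, replace = TRUE),
                    length  = round(rnorm(300, mean = 52, sd = 8), 1))

# Little helper: which rows of a year's data set are Rotauge from this lake?
rotaugeInLake <- function(fdata, lake) {
  fdata$species == "Rotauge" & fdata$lake == lake
}

# Logical selection from the data, the same structure used in the lecture
kies2017 <- f2017[rotaugeInLake(f2017, "Kiessteig"), ]
nrow(kies2017)
```

`subset(f2017, species == "Rotauge" & lake == "Kiessteig")` would do the same; the logical-index form is just the style the lecture prefers.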
Then I need to subset per lake as well, because I need to know for each lake how many fish there are. My first question was: how big is a Rotauge? So I created histograms; the code will be online, it was just too big to fit on a slide and still look good. What we see here is a basic histogram, not in counts but as a density, with the size on the x-axis. Some thoughts when I saw this: I initially assumed the size was measured in centimetres, but at 100-plus those would be some pretty hefty fish, so we figured out it is millimetres, and that is fine. But my real question is: why is the Visu de Mir so different from the other lakes? Why are a bunch of the fish in the Visu de Mir much smaller, and why do big fish, above 70 millimetres, occur only in the Visu de Mir and not in the other two lakes? That makes no sense: if you randomly sample fish from a lake, you would not expect two lakes to differ wildly in range; the smallest fish in lake one should be comparable to the smallest fish in lake two, and likewise the biggest. So something really weird is going on, and I do not even know what a Rotauge looks like, but I would definitely look into this and find out how big a Rotauge really is, because the average Rotauge in the Visu de Mir is much bigger than in the other two lakes that were fished. That was striking to me. When I zoomed in a little to compare, all three lakes show very different distributions. [Question:] Could those be fish added by the Fischbesatz treatment? That cannot be the case here, since this is 2017, but something like that could in principle be going on.

For statistical power we would like to use these 300 fish together, but we definitely cannot group these lakes, because all three have very different size distributions; we cannot just merge them, that is statistically not allowed, because if two things are different you cannot merge them. My other big question: why is there no Gaussian distribution? Why does this look like a two-part distribution, with relatively big fish and relatively small fish? Remember, we are only looking at 2017, this is the control group, nothing has happened yet, no treatments were applied; that is what I found really strange. Of course things happened to the lakes before the experiment started as well, it is a lake, people can go there, but I would definitely compare the Visu de Mir with, for example, Salzdorf and figure out what the difference between the lakes was before the experiment began, because for some reason all the big fish in Salzdorf were simply not there, and that is strange. Why there is no Gaussian distribution in any of the lakes also puzzled me a lot: fish growing in a lake with no real pressure on them, no sluice they have to get through to reach the lake, which is not an issue here because these lakes have no inflow or outflow, should give a relatively normal distribution. Especially with the number of fish we are looking at, I would have expected a really beautiful Gaussian distribution.
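The density histograms described above could be sketched like this, with invented two-humped length data mimicking the big-fish/small-fish shape in the slides:

```r
set.seed(6)
# Invented lengths with two humps, mimicking the two-part shape seen above
len <- c(rnorm(60, mean = 40, sd = 5), rnorm(60, mean = 75, sd = 5))

# freq = FALSE draws densities instead of counts, so lakes with very
# different sample sizes can still be compared on one scale
h <- hist(len, freq = FALSE, breaks = 20,
          main = "Rotauge length (mm), hypothetical lake",
          xlab = "length (mm)")
```

The density bars integrate to one, which is what makes the per-lake panels comparable despite the unequal catch sizes.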
A student remarks that there are massive morphological and productivity differences between the lakes. Yes, that's what I thought, and it's also what you see in the data: you can't really group lakes together, because every lake is more or less a unique environment. That makes the whole experiment a lot harder, because a lake is then not a replicate of the previous experiment; the lakes are all individual little experiments, and that makes it really hard. So let's instead take one lake and compare the three years that we have. I went for Kiesteich Brechlinger, because it had 32 Rotauge in 2017, 53 in 2018, and 37 in 2020. That is a relatively okay sample size: you can compare 32 with 53 with 37. It also didn't show the weird pattern from before, so I figured it was a good lake for the comparison. I took out the three yearly subsets and made new histograms, and they look reasonably normal. At first glance not so much, but when you zoom in, it looks like three different normal distributions, which is okay; only these two peaks here are a little strange. Comparing them by eye, something seems to be going on: 2017 sits more or less in the middle, the fish seem a little smaller in 2018 and a little bigger in 2020. Let's see if that's true. When we do statistics, we first want to check for normality, so I ran the Shapiro test, which we discussed in lecture six, on all three distributions. All three pass: they are not significantly different from normal, so we can use parametric statistics, which is really good.
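The per-year normality check can be sketched like this; the sample sizes 32, 53, and 37 come from the lecture, but the simulated lengths are stand-ins for the real subsets:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Simulated stand-ins for the three yearly subsets of one lake
# (sample sizes as mentioned in the lecture; the length values are made up).
years = {
    "2017": rng.normal(52.0, 5.0, 32),
    "2018": rng.normal(49.3, 5.0, 53),
    "2020": rng.normal(55.3, 5.0, 37),
}

# Shapiro-Wilk per year: a p-value above 0.05 means we cannot reject
# normality, so parametric tests (t-tests, linear models) are defensible.
pvals = {}
for year, lengths in years.items():
    stat, p = shapiro(lengths)
    pvals[year] = p
    print(f"{year}: W = {stat:.3f}, p = {p:.3f}, normal-ish = {p > 0.05}")
```

This mirrors `shapiro.test()` in R; passing the test does not prove normality, it only says the data are consistent with it, which is the license needed for the t-tests below.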
So let's pairwise t-test them. Pairwise t-testing teaches us that between 2017 and 2018 there is a significant difference in fish size: the mean length in 2017 was 52 millimeters, and in 2018 it was 49.3 millimeters, so the fish did become a little smaller in 2018; and there seems to be roughly a 3 millimeter increase in 2020 compared to 2017. I only compare against 2017 because, from a statistical perspective, 2017 is my control group: it is before anything was done, so we compare all the years after treatment to the year before treatment, and we see significant differences. I do these t-tests to look for a year effect: in some years the weather is warmer, so fish grow quicker; in other years there may be a lot of rain, so there is more water in the lake, and so on. The model I am working towards is a linear model saying that the size of a fish is determined by the year it was caught plus the treatment that was applied to it; but first I use t-tests to check whether there really is a year effect, and there is, so we should include a year term in the linear model. The outcome of the t-tests: relative to 2017, fish in 2018 were 2.6 millimeters smaller, which is statistically significant, and I also wrote down the confidence interval, so the true value should be somewhere between 0.6 and 4.8 millimeters; relative to 2017, fish in 2020 were about 3.3 millimeters bigger, again with a confidence interval. So there is a significant year effect, and we can move on to the linear model. For a linear model we need our lengths in one vector and our years in another, so from the 2017 subset I take the lengths of the fish.
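The pairwise comparison against the 2017 control group can be sketched as a two-sample t-test with a confidence interval for the mean difference; the simulated lengths are assumptions, with means loosely matching the values quoted above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

before = rng.normal(52.0, 4.0, 32)  # stand-in for the 2017 control lengths (mm)
after = rng.normal(49.4, 4.0, 53)   # stand-in for the 2018 lengths (mm)

# Welch two-sample t-test, a sketch of one pairwise comparison.
t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)

# 95% confidence interval for the difference in means, built by hand.
diff = before.mean() - after.mean()
va = before.var(ddof=1) / len(before)
vb = after.var(ddof=1) / len(after)
se = np.sqrt(va + vb)
# Welch-Satterthwaite degrees of freedom:
dof = (va + vb) ** 2 / (va**2 / (len(before) - 1) + vb**2 / (len(after) - 1))
t_crit = stats.t.ppf(0.975, dof)
ci = (diff - t_crit * se, diff + t_crit * se)
print(f"mean difference = {diff:.2f} mm, p = {p_value:.4f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval plays the same role as the 0.6-4.8 mm interval quoted above: a range of plausible values for the true year-to-year shift, not just a yes/no significance verdict.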
From the 2018 subset I take the lengths, and from 2020 as well, and I combine them into a single lengths vector. Then I build my covariate: the character string "2017" repeated for the number of rows in 2017, "2018" repeated for the number of rows in 2018, and the same for 2020. So I am just creating two vectors. Then I fit my model: a linear model of fish length against the year of capture. And indeed we see exactly the same as with the t-tests: the intercept says that in 2017 the fish were about 52 millimeters long, in 2018 they were smaller, and in 2020 they were bigger. The linear model and the t-tests give the same answers, which is of course because the assumption of normality more or less holds, so that's good. Now we get to the treatments: can we say anything about the effect of the treatment on the size of the fish? I wrote a little function, get_treatments, which takes the lake and the point in the lake that was fished. It takes the Rotauge data, looks at the lake column, and keeps only the rows where the lake is the lake we are interested in and the point ID is the given point ID; I call this selection i. From those rows I take the unique of the treatment column. I need the unique because the same lake and point appear in the Rotauge data set multiple times; I think always three times, once each for 2017, 2018, and 2020. But the length of the treatment after taking the unique should never be larger than one.
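The lengths-vector, year-covariate, and linear-model construction just described (`lm(lengths ~ years)` in R, with 2017 as the baseline) can be sketched in Python with an explicit dummy-coded design matrix; the simulated lengths are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the three yearly length subsets of one lake (mm).
len_2017 = rng.normal(52.0, 4.0, 32)
len_2018 = rng.normal(49.3, 4.0, 53)
len_2020 = rng.normal(55.3, 4.0, 37)

# The lengths vector and the year covariate, as described in the lecture.
lengths = np.concatenate([len_2017, len_2018, len_2020])
years = np.array(["2017"] * len(len_2017) + ["2018"] * len(len_2018) + ["2020"] * len(len_2020))

# Dummy (treatment) coding: the intercept is the 2017 mean,
# the other coefficients are offsets relative to 2017.
X = np.column_stack([
    np.ones(len(lengths)),            # intercept -> 2017 baseline
    (years == "2018").astype(float),  # offset of 2018 vs 2017
    (years == "2020").astype(float),  # offset of 2020 vs 2017
])
coef, *_ = np.linalg.lstsq(X, lengths, rcond=None)
print(f"intercept (2017 mean) = {coef[0]:.2f}")
print(f"2018 effect = {coef[1]:.2f}, 2020 effect = {coef[2]:.2f}")
```

With this coding the least-squares fit reproduces the group means exactly, which is why the model and the pairwise t-tests tell the same story.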
If the unique treatment has length greater than one, I throw a stop error, because that would mean something is really wrong: two different treatments would have been applied at the same position in the lake, and I don't think that is possible, at least not for the Totholz, the dead wood, which is what I am interested in here, since that is something you put into the water, and the question is whether it has an influence. If only one treatment was applied at this point in this lake, I simply return it. Now, 2017 is our before group, so I make a vector called treatment which, for each fish in 2017, just contains "before treatment", with the length of the number of rows in 2017. Then I loop over the 2018 measurements: for each fish I take the lake we are investigating and the point at which the fish was caught, use my get_treatments function to get the treatment applied to that fish, and append it to the treatment vector. So in the end I am building a new vector, the same way I built the year vector: for 2017 every fish gets "before treatment", and for the rest of the fish the vector contains the treatment that was applied. I do the same for 2020. Then I made a table of the treatments to see which ones were applied, and I ran my linear model again, because I now have, for each fish, its length, the year it was caught, and the treatment that was applied.
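A Python sketch of the get_treatments helper described above; the column names and the tiny example data set are assumptions, not the real layout:

```python
# Sketch of the lecture's get_treatments helper (the original is in R);
# rows are assumed to be dicts with "lake", "point_id", and "treatment" keys.
def get_treatments(rotauge, lake, point_id):
    """Return the single treatment applied at (lake, point_id), or raise."""
    treatments = {
        row["treatment"]
        for row in rotauge
        if row["lake"] == lake and row["point_id"] == point_id
    }
    if len(treatments) > 1:
        # Two different treatments at the same spot would mean corrupt data,
        # mirroring the stop() error in the R version.
        raise ValueError(f"multiple treatments at {lake}/{point_id}: {treatments}")
    if not treatments:
        raise KeyError(f"no rows for {lake}/{point_id}")
    return treatments.pop()

# Tiny stand-in data set: the same lake/point appears once per year,
# which is why the duplicates must be collapsed to a unique value.
rows = [
    {"lake": "L1", "point_id": 7, "year": y, "treatment": "totholz"}
    for y in (2017, 2018, 2020)
]
print(get_treatments(rows, "L1", 7))  # -> totholz
```

The defensive uniqueness check is the important part: it turns a silent data inconsistency into a loud error instead of a wrong analysis.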
And now something strange happened: the model said that the standard fish caught in 2017 was 52 millimeters, that fish in 2018 were smaller and in 2020 bigger, but the effect of the Totholz treatment came out as NA. NA? What is going on here? It turned out, and let me switch to R and load this part of the analysis so I can show you; let all the code run through, just to make sure everything is loaded correctly: filling in the fish, counting them, showing some totals, recounting, making the plots, figuring out which lakes we can merge, some figures we already saw, some of the histograms, our helper function, and then our treatment vector, which starts as "before", after which we work out the treatments for 2018 and 2020. When I now make a table of the treatment vector, I see this: the 32 fish caught in 2017 fall into the before group, but every fish caught in 2018 saw the Totholz treatment, and every fish caught in 2020 saw it as well. So when I cbind the lengths, the years, and the treatments and look at my model data, each row has the length of a fish, the year it was caught, and the treatment applied to it; the 2017 fish are my baseline group, but all fish from 2018 and 2020 saw the Totholz treatment. This is a clear sign of collinearity: there was not a single fish after 2017 that did not see the Totholz treatment. That means we cannot investigate the effect of the treatment, because all of the variance that could be explained by Totholz is equally explained by the year effect, which we saw is highly significant. We cannot leave year out of the model; it has to be in there. So when I fit the linear model, it simply says it cannot estimate all of these effects.
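The collinearity can be made visible directly from the design matrix: with every post-2017 fish treated, the treatment column is exactly the sum of the year dummies, so the matrix loses a rank and one coefficient becomes inestimable (which R reports as NA). A sketch with the group sizes from the lecture:

```python
import numpy as np

# Stand-in design: every fish after 2017 got the Totholz treatment,
# mirroring the situation described in the lecture.
n17, n18, n20 = 32, 53, 37
year_2018 = np.array([0] * n17 + [1] * n18 + [0] * n20, dtype=float)
year_2020 = np.array([0] * n17 + [0] * n18 + [1] * n20, dtype=float)
totholz   = np.array([0] * n17 + [1] * n18 + [1] * n20, dtype=float)

X = np.column_stack([np.ones(n17 + n18 + n20), year_2018, year_2020, totholz])

# The treatment column is an exact linear combination of the year dummies
# (totholz = year_2018 + year_2020), so the design matrix is rank deficient.
rank = np.linalg.matrix_rank(X)
print(f"columns = {X.shape[1]}, rank = {rank}")  # rank < columns -> collinear
```

Whenever the rank is smaller than the number of columns, one effect cannot be separated from the others; that is precisely why `lm()` leaves the treatment coefficient as NA here.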
It can estimate the effect of 2018 and the effect of 2020, but not the treatment effect. Collinearity is one of those issues you can easily run into. Take a case-control study in which all of your cases are male and all of your controls are female: the drug you gave them no longer matters, because every difference you see can equally be explained by the difference between males and females. That is why you randomize the treatment with respect to sex: 50 males and 50 females get the treatment, and in the control group 50 males and 50 females do not. You never randomize in such a way that the treatment is given only to males while the females get no treatment, because then your treatment is collinear with another effect, and every difference you observe could come from comparing males to females rather than from the drug. That is exactly what happened here: all fish in 2018 and 2020 saw the Totholz treatment, the before group saw none of it, and we have no fish from this lake that did not see the treatment. So, going back to the slides: unfortunately, the basic rules of experimental design were broken, because every fishing spot was treated. Treatment is therefore collinear with the year effect, and we can make no statement about the treatment, since everything could just as well be the year effect: it could simply be that the weather in 2018 made fish smaller and the weather in 2020 made them bigger. About the treatment we can say nothing, because all fish in 2018 and all fish in 2020 saw it. We cannot assign the variance to the treatment; we have to assign it to the year.
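The randomization scheme described above, with treatment balanced within each block (here, sex), can be sketched as follows; the group sizes are illustrative:

```python
import random

random.seed(0)

# Blocked randomization: within each block, assign half the units to
# treatment and half to control, so treatment is never collinear with
# the blocking variable.
def randomize_within_blocks(units_per_block, blocks=("male", "female")):
    assignment = {}
    for block in blocks:
        arms = (["treatment"] * (units_per_block // 2)
                + ["control"] * (units_per_block // 2))
        random.shuffle(arms)  # random order, but counts stay balanced
        assignment[block] = arms
    return assignment

plan = randomize_within_blocks(100)
for block, arms in plan.items():
    print(block, "treatment:", arms.count("treatment"),
          "control:", arms.count("control"))
```

Because every block contains both arms, a sex difference can no longer masquerade as a drug effect; the fishing design above fails exactly this property, with "year" playing the role of the block that got only one arm.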
All right, that was it for today. My take-home message: experimental design is hard. As the slide I showed you before says: always consult a statistician before setting up an experiment, and use blocking and randomization techniques so that you have a control group and a treatment group with none of the other covariates aligned with them. I only looked at a single lake and a single fish species; the treatments in the other lakes, or for the other species, might be better randomized or better blocked. But the distribution in 2017 is very different from those in 2018 and 2020, and we cannot group lakes to obtain another group that saw no treatment. Comparing across lakes would also mean the lake effect has to enter the model: fish size would depend on the year of capture plus the lake plus the treatment. But that won't resolve the collinearity between treatment and year that we just saw within this one lake; even if another lake shows different patterns, that doesn't help us say anything about Kiesteich Brechlinger. So the data gathered in Kiesteich Brechlinger cannot be used to investigate the effect of Totholz on the Rotauge, because of the collinearity, and I found that a bit of a shame, and a little worrying as well. Anyway, that is my take-home message and the analysis I did. Thank you very much for the data set; I enjoyed working on it. Like I said, I wrote 245 lines of code, which I will send you, and we should discuss later whether to put the data set on Moodle as well, so that other people can play with it too.
We should also put the code on Moodle, and discuss whether you want today's lecture on YouTube; I don't know, but we should definitely talk about that. I really enjoyed working on the data, and I thank you for letting me look at it; I hope you learned something about how to use a three-dimensional matrix. I think we should definitely put the code on Moodle so that people can at least read it, because I didn't show the code for the histograms and such. So, that was it for today: are there any questions, or any discussion you would like to have? By the way, most of the time didn't actually go into writing the code; it went into writing the slides. Someone asks: if not YouTube, can the lecture still go on Moodle? Yes, definitely; the lecture should be on Moodle. I am just wondering whether putting it on YouTube is a good idea or not. The thing is, it's not my data, and I used the names of the lakes, and I know some researchers can be quite protective of the lakes, cows, or whatever else they work on; for example, when we publish data about cows, since those are production animals that live on farms. Someone says they have a lot of questions and will write them up and send an email: definitely, and you are more than welcome to make an appointment and come by the office, so we can sit together, look at the data, and discuss your questions; sending an email is fine too, even if you need to rewatch the recording first. Let me know what you want, or just come by the office and we can sit down, look at the data, play with it, and talk. I am not a fish specialist; I am just a statistician, so when people give me data I generally don't much care what it is from.
All right, perfect; thank you all very much. One more thing, because I really love it: I now have buttons to show and hide the R window. Show, hide, show; that is much easier than what I used to do, clicking around to find the right window. I hope you liked it, and I hope you like the new slide style as well; any comments on it compared with the old style, the one with the wave, which I know a lot of my colleagues don't like? Colleagues say that with my usual design I lose about 30% of the slide to the header, and I say: yes, that is a deliberate choice, to avoid overloading slides with text and pictures. Thanks for the birthday wishes! I should have put in a birthday audio clip, but I didn't have much time this morning to prepare, I spent almost too much time playing with the setup anyway, and I had other things to do. Yesterday was also madness, because we had to do an emergency mouse dissection, as one of the mice was sick, and set everything up for that. But I will have a good birthday; I already had cake this morning, which is a good way to start the day. Thanks, of course, to my moderator and girlfriend; I got some really nice gifts, so it is going to be a good birthday. All right, if there are no other comments, this was the last slide I made. The stream just ended; see you next time, or if you are seeing this, you are too late. Thanks again for the birthday wishes. It is a shame we can't have an in-person lecture; we would have had cake and that kind of thing, but unfortunately we can't, due to corona. All right, thank you all for watching.
Someone says they'll see me next year: you are very optimistic. I think it is still going to take two years; from a virology perspective the second wave is just starting, and the third wave comes after that, so I think that around July 2023 we will be fine again. All right then, have a very good evening, and good luck on the exam next week; I hope everyone has already signed up, because if you didn't, you are probably too late. Thank you for watching, and I will see you sometime; we don't have an exact plan yet, so it is going to be tricky. I might play a bit of online games when I have some free time, but see you soon, and good luck on the exam.