 Okay, so welcome back. If you're watching this on Twitch or if you're watching this on YouTube, um, I forgot to start the recording, so we're gonna fall in halfway. But the data that we're analyzing is data that has been collected by Seahum, one of my colleagues. So of course, like when I create an R script, right, I first give it a little header. So this is the basic gulp analysis for Seahum, basic gulp, because that's the conference that she wants to go through. And of course, I start by setting a working directory. So this is where I want to write my files and where I want to load the files from, at least for the thing. So what we did is and let me actually open up the file here so that you guys can see. So for each goat that or for each goat read that we are looking at, we define something which was a goat type. So these are the 32 different breeds of goat that we have from Africa. So and these goats, we classified into two types. Well, actually, we have two type definitions, but we're only going to use the first one. And that is we are saying that a goat breed can be primarily used for milk or for meat. And then we have a classification dual, which is a dual purpose. So then there's no real selection or there has not been any selection for the goat being a milk type goat or a meat type goat. Right. So we have all of these different goats like the Gumes, the Guera, the Cameroon goat, the Oasis goat and Saidi goat. So these are all goats that are living on different parts of the African continent and where we have some genotype data from. So what Seahum did is she went through the literature, she classified every goat as being a milk type goat or a meat type goat or a dual type goat. So that's the first thing that I'm going to load into R. Right. So that's the little table that Seahum made from literature research. So let's just read it in. Let me show you guys the R window. And I have to see the R window as well. So when we load it in, it's just a very small table, which has types. So we can see that the Abergela is a milk type goat, Ambo is a meat type goat, and so on. So then the next step is to source some functions. So this is just a little R script that I made before, which allows me to do some pre-processing because we don't have only African goats. We also have data from the 1000 goat project, which is called ADAPTMAP. And we also have data from our Swiss collaborator. So the first thing that I need to do is load in all of the data that we have. And this is going to take a while because it's a big, big data set. So let me just start loading that in and then we can just chat a little bit in the meantime. So my computer is having a hard time because it's loading it in. So I can show you guys the dimensions of the data that we're working with. But as you can see, it's not instant, right? So that means that this file is big. Let me actually check on the hard drive how big it is. So the file that we're loading in is 745 megabytes. So that's a pretty, pretty big file. And you're streaming. How do you mean you're streaming? Yeah, I'm streaming that. Why would I not be streaming? It fits into the lecture very well. It fits into me using BioMart and showing you guys how I do these kinds of things. And just to show you guys that it slows it down. No, not so much, I think. I hope. I hope. I think that what's slowing it down is that I've already loaded some data sets in this R window previously. But makes the computer work harder. Yeah. And gives me more free time. That's one of these nice things about being a bioinformatician. Like you can always claim that it takes a long time. Okay, so data.num is the data file, the numeric data file. I don't even think I can open that one up in Notepad++. I will actually open it up in Notepad++ as well, just to make the computer work a little bit harder. So this is how the genotype data looks like when we got it. Well, after I pre-processed it. So here we see SNP1s. So these are the SNPs that we are looking at. Data file. A person who really loves data. So here we have all of the different SNPs in the rows. So if we scroll all the way down, we can see that we have 43,000 locations on the genome, which were assessed to see if the animal has an AA, if it has an AB, or a BB, right? Because you have two chromosomes. So if we look at a SNP, then one animal can, for example, be AA, while another animal can have a TT in there. So it's a big data file, so it contains 48,000, 43,000 rows. And if we scroll all of this way, then we can see that it contains like a massive, massive amount of animals as well. So every column here is one animal, right? So the first animal, the et abr 0001 animal, had a zero, which means that it was a heterozygote. So it had an AT SNP. Then the next one here is also a zero, also an AT. The next SNP is a one, so that might be a GG. The next SNP also one is a GG, right? So these numeric codings are just codings that actually allow me to have a numeric value for the genotype. It actually finished loading in. So we can also now ask the question, how much dim data.num, right? So we can see that we have 3,401 SNPs measured on 5,741 animals. I also have my sample.annot file, which is the sample annotation. I can just roll it a little bit over the screen, right? And then we see that this animal, et abr 0001, comes from the other map project. So that's the Big Go 1000 Genome project. It was an abergale. It lived in Africa and it came from Ethiopia, right? So if we want to want to see some statistics, then we can say, for example, continent, oh, continent, and then we just do a table, right? And then we can see that in total we have 2,402 goats from Africa, 1,800 from Europe, only 91 from North America, and so on. And then if we want to look at the countries, we can also do that. And then we get an overview of all of the countries. And we can see that some goats from Algeria, some goats from Cameroon, Malawi, and these kinds of things. So it's a lot of data that we've collected over the last years and not just us, because most of this data is collected by the other map project, which is this big collaboration between dozens of universities. All right, so the first thing that we want to do, and let me switch back here, is fix some annotation errors. This always happens because in some cases people wrote Small East Africa, but the goat name is actually Small East African, right? So that's the first error that I want to fix. And then the next step would be is to do clustering to find if there are any outliers. So let me actually do the fix, right? And then do the clustering. So the clusters are here. And this will take a while because there's 6000 animals. And then I could just plot these clusters, right? Oh, plot clusters. All right, and then this will take a while because there's 6000 animals. And you can see that this is the big kind of tree of relatedness between all of these animals. So the big issue here is that, and you can't really see it in this plot, I also made a plot, which is an SVG, which you can zoom in and out of. Then you can zoom in more. But the big issue which we have in the data is that we want to remove outliers, right? So we want to remove outliers, which are animals, which are way too far out of their cluster based on what their type is or what their breed is, right? So I can have like 50 animals, which all nicely clustered together. And then animal 51 is all of a sudden all the way on the other side of the tree. And I don't want that. Some animals do not have any breed information, so I can't use them either. And then I want to cluster them again because I also want to retrieve the mixed animals because some animals are not from a single breed, but they are mixed between two breeds. And then I have this function called too closely related, which takes the distance matrix, which I just loaded in. And that removes animals, which are more or less the same animal, where the animals are way too closely related to be individual animals. So this might just be the same sample, which unfortunately was done twice on the SNP chip, or it's just some other errors. I combine them all together, right? So the outliers, the missing breeds, the mixes, and the related animals. And then I just throw them away from the sample annotation, right? So when I do this and I already did the cluster, so let me just show you guys. So it just calculates the animals which are too closely related. And then if I ask for the dimensions of the sample.nod, we are now left with 5268 goats instead of the 5,741. So just getting rid of like 500 animals, which are not proper. All right. So then the next step that I did was say, well, take breeds. So which animals are living in Africa, right? So and then take the breeds. And then I require at least 20 animals, right? So it's a big thing, which does a, so which animals are living in Africa, then take those from the sample annotation, then look at the breed column, make a table, and then ask which ones are above 20, and then give me the names. So in the end, this whole statement does nothing but select animal breeds, which have more than 20 individuals belonging to the breed living in Africa. So if I do that, then I end up with EX. And then we see that there are 45 different breeds, which are living in Africa, which have more than 20 animals. Okay, that's nice. We learned something. And I do a table. Wait, that's, oh, yeah, then I actually do the same, then I select them, right? So which animals are there and living in Africa? So I'm just making my sample annotation smaller by just selecting these. And then I want to see how many animals I am left with for each of the breeds. And this is just a check, because I want every number here to be larger than 20, which it is. So I didn't make a mistake. So here you see the 45 breeds that we have. And then for each breed, you see the number of animals that we have. All right. And then the next step is to classify them into meat and milk, right? So I'm just going to say, well, give me the breeds which are classified as milk, and store this in the variable milk, and then give me the breeds which are classified as meat breeds and store this into meat. So meat and milk. All right, so now I can see milk. So we have nine milk breeds, and we have classified 12 meat breeds. So that's, that's what we learn. So again, because we also had in our table, right in the goat types, we all said dual, dual purpose breeds. But initially, I'm not going to analyze them. So I'm just going to say, well, again, make a subset and only take the animals which belong to a meeked or milk breed, and then make this a subset. Right. So again, when I do this, I can see that we end up with the dim of the sample annotation, which is 1098 animals. So again, we filter down from having a lot of data all the way down to having a very manageable amount of data. It's still a lot of data still 1100 animals measured on 43,000 snips or something. So the next step is, of course, is now we have the animals that we're interested in. So now I'm just going to use the names of the sample annotation because every animal has a name, and I'm only going to take these animals out of my big matrix, right, because the big matrix that I loaded in has all of the animals in there. So I'm only going to take the ones which are a meek or milk breed living in Africa, having more than 20 individuals in the breed or annotated to the breed. And then I have to do a little trick because I have to make the whole thing numeric, because when I loaded it in, I didn't specify that it should be numeric. So this is just an R to make sure that the whole matrix that we have is a numerical matrix. And then, yeah, I'm just going to override my data num, and this will also flip it around. So it will also transpose the matrix. So let me guys show you how that looks. So if data.num, right, let's just look at the first 10 snips, first 10 animals, right. And now when I do the subset selection, then it looks very numeric, but it R didn't understand the type, because the R type system is difficult. So what this does, it just goes through every row, makes it a numeric value, and then does it again. So now, because I did my subset, I can now look at the first 10, and you see now that the animals are in the rows, and the snips are in the columns, right. So it also transposed the matrix. It's a minor thing. But now all of these animals are the animals that I want to look at. So the idea was to use principal component analysis. We wrote a paper a while back where we developed a new method on looking or using principal component analysis to split animals into to find snips, which are highly contributing to the first principal component. And for that, because principal component analysis does not allow me to have any missing data. So the first thing that I'm going to do is I'm going to remove animals. So MPI is missing per animal, missing per individual. So this is just a function where I just ask for each individual how many NAs are there in the data. Because if an individual has a large amount of missing data, I want to get rid of the individual because principal component analysis needs complete data. So the first step is to remove individuals which have more than 5% missing data or keep individuals which have less than 5% missing data. So that's what I'm doing here, right. So I have a function of x, just give me the amount of the NAs, so the sum of the NAs, and then divide the missing per individual divided by the number of columns. And then if this is smaller than 0.05, then I'm going to keep this animal, right. And I'm just going to remove them from the sample annotation. So just sub-setting my matrix a little bit. So let me go back to R. So the dim of the data.num, it used to be 1098 animals. It takes a little while. All right, there we are. So I can of course plot all of these variables, right. So if I plot the missing per individual, I get this figure which shows me how much is missing for each of the animals, right. So this animal actually had 14,000 snips out of the 41,000 snips which were missing. But you see that in general most of the animals have very low amounts of missing data, but some of them really have too much missing data. So we're just going to remove them. All right, so then the next step is to make the data complete, right. Because in my data.num, there's still missing values, because some animal, and if an animal had like 1% missing data, it's still in there. So in the next step, I'm going to do the exact same thing, but now I'm going to say MPM, so missing per marker. And then I'm going to just say, well, do the same thing. So count the number of NAs for each of the markers. So I'm going through the columns in this case. And then what do I do? Well, I select only the columns where the missing per marker is exactly zero. So no missing data at each marker. All right, so again, another little bit of a subset. So in the end, we end up with less animals. So here if we look at data.num and we look at the dimensions of this, then we see that we now have lost eight animals, which had too much missing data, because instead of having 1098 animals, we have 1090 animals. And we go from having 43,000 SNPs to having 21,000 SNPs, 22,000, which have complete data. So I could do principal component analysis on these. But this is not going to work, right? Because all of these animals come from all different places in Africa. And the way that principal component analysis works is that it reduces the dimension of the data. So it tries to catch as much variation in the first principal component and then in the second principal component and so on. So we had a little discussion and I had a little idea. And this idea led me to writing all of this code. And I'm not going to explain you guys this code, I'm just going to run it. And it will make some nice figures. And it will try and figure out if we if when we compare. So what it kind of does is I'm going to explain a little bit. So we have breed one in milk, breed two in meat, right? So I have a double four loop. So I'm going to compare each milk breed to all of the meat breeds. And then I'm going to switch to the next milk breed and then do that do it against all of the other meat breeds. Right? So what I'm going to do is I'm going to select the animals which belong to the milk breed, the animals which belong to the meat breed. So this is S1 and S2. I'm going to get the sample names, take from data num these animals and then say this is my subset that I'm going to use for principal component analysis. I do principal component analysis, make a plot and then I look to see if the first principal component separates the milk breeds from separates the milk animals from the meat animals. And if it does, right? So initially it doesn't. So here I'm going to see if I can separate them. And if I can separate them, then this comparison will could allow to teach us something about the snips which are involved in milk or in meat production. But of course, we are only comparing a single breed to a single other breed, right? So everything which is different between the two goat breeds will be in principal component one. Um, not just the, is it the meat or is the milk, but also the type of the air, the shape of the eyes, the body weight, the body length, the size of the tail and these kinds of things. But I'm doing this not just for one breed versus one breed. I'm doing this for nine milk breeds versus 12 meat breeds, right? So every time that I do a comparison, I expect, so after this, when I, when I did all of the possible comparisons, I expect that if a snip is really involved in meat or in milk production, that this one comes up more often than snips which are involved in like the eye shape, right? Because the meat breeds have all different eye shapes and the milk breeds also have all different eye shapes. So by comparing one versus the other, if the eyes are different, the snip will be coming up. But if the eyes are not different, the snip will not be coming up. And of course, since I kind of rely on the random randomness here in a way. So I use kind of the idea that the only thing which is consistently different between the animals is the meat and milk classification that we assign to them. And all of the other phenotypes are more or less random distribute. So that's the idea behind it. And then this is the code that we already wrote for our previous paper where we figure out the contribution of each snip to the first principal component. But let me just run the code because it takes a little while to build up this MM matrix. So if you want to know more, then just let me know. Let me actually show you guys the graph and we can talk a little bit about principal components. So we'll do the subset, do the principal component, make a figure. So in this case, in red, we see the milk breeds. And in black, we see the meat breeds. And you see that in many cases, there's a very clear separation between the two. So and then it will tell me here in the command line, it will tell me at which point it actually was able to separate them or not. It's a little bit heavy. And I'm streaming, so it takes a little bit of a while as well. So it will draw a line and then try to draw the line. And sometimes it can't figure out and sometimes it can. Sometimes there's a black animal, which is all the way over here. And then it cannot figure out the separation like here. It goes relatively quick and I can't really stop it. But it does all of these comparisons. So it compares the Burundi goat versus the Long Eared Somali. Then it compares the Burundi goat versus the Mubenda. And there is no clear separation between these two breeds. And it just goes and does all of the milk breed versus all of the milk, all of the meat breeds. So let me just sit here a little bit and just show you guys what's happening. Because I have to go through all of the possible comparisons. And fortunately, it's not that much. It's 912 faculty. So we're almost there because we're already at the G. So we're already at the third milk breed. So here you can see one of these examples where there's a black dot in the red dots. And then because the breeds overlap, we cannot be sure that the first principle component actually catches the difference between the milk and the meat breed. Because there might be other things that are at play here. And of course we have the issue of course is that in Africa, people are not very good at naming their goat breeds. So you can have the same goat breed being called differently in one country to the other one. An example of this is for example the Tunisian goat, which is called Tunisian goat in Tunisia. And it's called Oasis goat in the country surrounding Tunisia. But it's the same breed. So all of the animals would overlap in the principle component, which would tell us that, oh, it's the same breed. It's just that people give it a different name based on the area where they're from. All right, so we switch to the fifth one, I think, which is the Kefa breed. So again, just comparing every milk type breed to every meat type breed. And every time I do this principle component analysis, figure out if they split. And if they split, I'm remembering if the SNP was having an effect on this split. Because not all SNPs are contributing. All right, Kefa almost done, small East Africa. That's actually interesting because small East Africa should actually be small East Africa. All right, that's a little bug fixed. That's good. That will help us. I clicked the thing and it went like I'm going to crash soon. Because I'm really at the limits of what you can do with R. And of course, if I would have been smart, I would have wrote the first part into R SQL. So I didn't have to load in this like 700 megabyte file. I would have just did do the sub query, right? The subsetting this data set from like 9,000 different goats down to 1100 goats. The SQL query could have done that for me. So but I was lazy and I just loaded everything in R and used R to do that. All right. Nubian goats versus all of the other ones. So and here we can see that there's no clear separation between Nubian and Long Eared Somali. Here there is kind of a separation. So we're only looking at the first principle component. The second principle component is just there to make the plot look a little bit. Here, this is one of these examples where the Nubian and the Nailotic are actually the same breed. So Nubian is, it's called Nubian in some areas of Africa and Nailotic in other. Here it's not, but you can see that we can separate them out. All right. So while we wait right here again, one of these examples where there's no separation between the two. Okay, almost there. We're at pool. All right. But let's just not wait for that. Well, we have to wait for it. I can't, well, I could write the code without having the data or let's analyze, but let's just wait. We're not in a rush. At least I'm not in a rush. So I work till five, like some people. All right. So these are the what's kind of smaller breeds. So almost there. We're already at the S, Sophia. And of course, all of these breeds look slightly different. And if you're really interested in goats and stuff, then it's really nice to look at them. But like for me, it doesn't matter. All right. So we're done. Right. So now we have this MM matrix. So if I look at the MM matrix, I just say give me the first 10 comparisons versus the first five snips. That looks a little bit bad. So let's do this again. Then it looks like this, right? So I see that when I compare the Abu Ghalla versus the Angora, the first snip got a p value of one, meaning that it is not involved in separating these breeds. Where does it become interesting? Not in the first 10 snips, actually. So let's look at the, not in the first five snips. So let's look at the first 10 snips. So here we see a snip, which is more interesting, right? So here, when we compare the Abu Ghalla goat versus the Angora goat, we see that we get a p value, which is significant. And of course, I did multiple testing correction as well. So here we can see that this, this snip is important when you want to separate Abu Ghalla from Angora goats. But it is also an important snip when you want to separate Abu Ghalla versus Boer goats. It is also an important snip when you separate. So it comes up kind of continuously. So this might be one of the snips, which is really important for milk or meat production, right? Because that was the hypothesis or the idea behind the whole analysis, is that if you find that a certain snip only is responsible for like two or three comparisons, then it's probably not related to the meat or the milk production, but it's probably related to some other difference, some other phenotypic difference. All right, so what we then do, or what I then did is to kind of visualize this a little bit. And I want to load in the snip map. So the snip map is the map, which has the ordering of the snips, because the snips here are in the order on which they occur on the microarray. But we always want to look at the snips in the order in which they occur in the genome. So we want to say, well, load the snip info, only take the snips, which are left in mm, because we went from 43,000 snips to 22,000 snips. And then reorder the map, put in chromosome one first and chromosome 29 at the end. So go from one to 29. So let's just do this. And then here I'm saying is diff. So I just for each of these, for each of these columns in the mm matrix, I want to know if the p value was below 0.05 and then sum it up, and then give me a plot. And the plot will show me something like this. So like this, and then we'll order the map. And then we get a picture which looks like this. So here we see the 29 chromosomes of code, so chromosome one, chromosome two, three, four, five, six, and so on. And here we see our cutoff, right? So I say that if a snip comes up more than 25% of the comparisons, it is probably an interesting snip, right? So probably something is going on at that snip or near this snip, which might explain why some goats are meat goats and other goats are milk goats. So we see that on chromosome two, we get a peak. We get a little peak here on chromosome three, chromosome four, chromosome five, then a whole bunch of chromosomes which are below my threshold. And then here, again, and here, this is a really interesting one, because we see two snips very close to each other, both of them are being implicated as being responsible for meat and milk production. Alright, so then the next step is to kind of take the interesting right. So I say which ones are interesting, and then make a little table. So I'm just going to make a little table saying that take my interesting ones, take from the snip map, the chromosome, the position and add a column with the percentage and then that looks like this. So here we see just the summary of the snips which have a very high percentage of which come up in a lot of these comparison, where they are located and what the position is and what the name of the snip is. And of course, I can set the threshold different and all of these things, but these are more or less the seven snips that I'm interested in and these seven snips are snips which according to my idea or not so much my idea, but according to this method that we already published before, have a very high likelihood to be involved in milk production or in meat production. We don't know which way it goes, right? We don't know if the snip is for meat or for milk, but we know it has to be one of the two. Good, and now we start coming to the part why I wanted to show you this and this is where I ended yesterday because yesterday let me just move all of the way down here. I'm now going to, because I want to now know the genes, right? Because we now just identified single nucleotide polymorphisms, but these single nucleotide polymorphisms are markers. So they don't have to be exactly in the gene that is the gene that is causing the difference in meat and milk production, right? So the thing that I'm going to do is just say, well, I'm now going to use biomark to use the caprahericus gene ensemble, right? So I'm going to use this caprahericus gene ensemble. So I'm going to connect to ensemble, take the genes and do kind of the selection because now I want to know which genes are located near this chromosome in this position, right? So let me connect to biomark. I'm just going to paste that in because it won't show you anything. It's just going to take a little while. So now, of course, we need to define something which is, so we need to define three things, right? So we have to have my filters, we have to have my attributes, and we have to have something which is called my values, right? That's the whole idea about biomark is that you have, sorry, I mean, get a sip of water, is that we have filters, attributes and values. And actually, biomark is being a pain in the ass because I'm just looking at R and it's like the, it actually tried to move me to the Asia mirror, which is far away and non responsive because I'm in Europe and not in Asia. I don't know why it actually tried to connect me to Asia. All right, but of course, we have to do this, right? So my filters. So what do we want to filter for? Well, in our case, we are interested again in a chromosomal position, right? So let me get back to PowerPoint and show you guys the PowerPoint. And I'm just going to use my PowerPoint and I'm just going to say chromosomal position, I'm just going to copy paste it out and just be lazy because otherwise I have to type it in. I don't want to get the sequence. I want to, which slide was that? The filter is obvious, right? So the region here is, we need to define it like this. So let me get that out. And let me show you notepad again. All right, so the filter that I want is chromosomal region. So let me copy paste that out of the slide as well. So the filter is my chromosomal region. And in this case, I might also want to filter for protein coding genes. We'll have to see. I'm first going to get everything in the region. All right, so what do I want to get? Well, I want to get, of course, the ensemble gene ID, ensemble gene ID. Let me see if that's correctly written or not. Yeah, this. We want to have the chromosome name. We want to have the start position of the gene. And we want to have the end position of the gene, right? And the values, I can figure out because the values are actually stored in this variable here. So I'm just going to make this my interesting results. So MIR. And I'm just going to put that in. So fortunately, actually, are actually connected me to biomark now. So here's the my interesting results. So I'm just going to go and now define regions around these snips. So I'm just going to take a relatively big region, right? Because I don't know for sure how far away the gene might be from the snips. So I'm just going to do and I'm going to say, well, half a megabase before half a megabase in the end, right? That's kind of what we what we love to do. So I'm just going to say, so I want to now. So I'm just going to say here, I want to have my values. So I'm just going to say, initially is empty. So for x in one to the number of rows of the my interesting regions, I want to say my values, values are C. Okay, so I need to get the chromosome. So I have to say MIR, take x row number x, take the chromosome, then I have to paste all of this together. So paste zero. I have to do a double point because everything is separated by double points, because the value looks like this, right? This, so it's chromosome start and so I'm going to take the chromosome that I found, then I'm going to take the position position, all right, and minus 500,000. Very good. And then I'm going to say double point, my end position is going to be this, but now plus 500,000. And I want to have everything on the forward strand. All right, and then I'm just going to combine this with the my values that I already had. Right, so that's kind of how I'm building it up. So I'm in my mind, I'm just saying I want to have a string which looks like this. So normally, if I would do this, I might do this in R and I would say, well, x equals one. So take my interesting region, take one comma chromosome, and then do it like, okay, so that's two, then I want to paste this together with my start position, which is MIR comma one comma and then position, right, minus 500,000. And then close it and then okay, so now note that's wrong because there has to be a double point in between. So let me do like that, double point and then do the same thing. And then copy paste. And yeah, so that's kind of how I build it up when I look in an R. Good. So but here, if I would now have my little code, so let's go back to the notepad plus plus window, right? So this for loop does exactly the same thing. It's just that I kind of wrote it in one go. Because I know what I'm doing. Well, don't know what I'm doing. But okay, so I actually made an error somewhere. And here is where bracket highlighting is going to save me. I just copy pasted it into R and I got an error, right? So because I get an error, it means that there's probably something wrong with the brackets. So this one's okay. This one is closed. That's okay. This one is closed. Right. So what's going on here? Unexpected symbol in here, comma, near here. There's a comma missing. All right. Let's try this again. All right. So now I have my values. Right. And now for each of these region, I now have the way that I should give it to Biomart. It doesn't seem like 500 or is this 123123. Okay, so this is at 105 megabases from the number that I was looking at. I thought it was 10 megabases, but it's 105. So that's perfectly fine. So it's from 105.9 to 106.9. Snip is exactly in the middle. And then we see that here we have more or less the same region, right? Because they overlap quite quite big. So I'm just going to throw away the last one or nah, just keep it in. Why not? So now I have everything defined for Biomart. So I have my filter, right? So my filter is going to be filter is going to be chromosomal region. My values are now defined. And then the next one is my attributes, which I can just copy in. And that's like this. So for each region, I want to get the ensemble gene ID, chromosome name, star position and position. All right, and I'm done. And I can do the get Biomart call. So I'm just going to again, copy paste this from the slide. So I'm just going to take the slide copy paste this into notepad plus plus, which looks like this. Why write it when it's already written and go to values, which looks like this. All right, so I'm saying attributes, my attributes, filter, mart is called mart, everything okay. Good. So let's use Biomart to retrieve the data that I want. My region not found. Oh, yeah, because I called it my values. That's, that's okay. So let's just change the code a little bit and rerun it now with my values instead of my region. Good, good, good. So this will take a while because Biomart is relatively overloaded. I think that's why it sent me to the Asian server and not the European one, hands on bioinformatics that I like it. I generally like programming in pairs. So just sitting with two people behind a computer so that you can just talk about what you're doing. And I'm doing this and I'm doing this for what I'm doing it for this reason. And I think that it will do this. And then hey, you have two eyes on code. So it's much easier. All right, so now we have all of the genes located in these regions that we actually found. And one of the things that you can see is that the chromosome is not in the order that I asked it, right? So it starts off with the first gene on chromosome 12, and then chromosome 18, and then two and three and four. And this is one of these things which happens when you use databases or when you use Biomart, right? So the thing is with databases in Biomart is that they are generally parallel. So chromosome 12 just happened to be queried quicker than chromosome 18, although it was behind in the list. So interestingly enough, there's only one gene on chromosome two underneath this snip, right? So within 500 megabase, or within one megabase of this snip, so 500 and 500 kb in the front, 500 kb in the back, there's only one gene. So this is then directly a pretty good candidate gene. So let me rerun this Biomart query again, but now get also the name of the gene. But I don't know how it's called. So list attribute smart. It's generally in the first, like 20 or let's do first 40, right? So I also want to have the gene name. So ensemble gene ID, we have peptide ID, let's get the description as well. So I'm just going to copy paste it out and just add it to the list of attributes, right? So gene ID, then give me the description column back as well. External gene name, that's how it's called. I'm just looking into the window here and I saw that there's something called external gene name. So this external gene name is something that I want to get as well. Because of course the name of the gene, if the name of the gene is like milk gene one, then we're set, right? Then we're golden, then we already have found the milk gene that we're interested in. It's generally not that easy. So let's copy paste this into our again, redo the query. And now also ask for the gene name and the description of the gene, so that we can look at that, right? So that's quick. All right, so let's go to gene IDs. All right, so it came back surprisingly in the same order as before. We can't really read this, right? Not that good. So I'm just going to write dot table the results and call it fresh genes one dot txt separator is slash tab, right? Then I can say get working directory where am I? All right, I'm going to go there. So I'm just going to open up this file in notepad so that we can actually look at it a little bit. So it's better to do it like this. All right, so this is what we got back. So for some reason, these genes here have no name, no description. The only thing that's known about them is that they're on chromosome 12 at these positions. Then we have a gene called erx five, which is located on chromosome 18. Erx six is also there's a spliceosomal RNA. There's an other new kinase. All right, so let's look at that one gene on chromosome two, which is the u six spliceosomal RNA. That doesn't really seem like a gene, which would be interesting for milk or meat production, although could be, but it's generally got something to do with like splicing of proteins and stuff, which of course might influence your or if you produce good meat or if you produce good. Okay, of course, I only looked at the positive direction in this case. So let's also look at the negative direction to make sure that we get all of the genes in both directions. So values minus one, right? So I'm just going to change positive strand to negative strand and then rerun the whole thing. And then we will have the negative strand genes as well, because like, I don't have any assumption to assume that the gene of interest will be in the positive gene or on the positive strand of the genome. All right, finished quite quickly. So again, we can look at gene IDs. And now we see that there are other genes, which is good. So let's just save this as RS minus one, right? So negative strand. Oh, you guys can see this. You should tell me when I'm not showing you the R window when I'm doing stuff in R. But here we see that if we look at the negative strand, we see that other genes start coming up, right? Because like it's the it's the other strand of the genome. So I'm just going to save this rest genes minus one. And then I'm going to open up this other file as well. Lots of lots. And then we see go back to here. So that's the wrong one. So here we have the genes on the positive strand. Here we have the genes on the negative strand. So here we see some interesting candidates, right? Sodium-folded gated channels are things which control sodium, which are generally important in milk production. And these are SLCs, which are solute carriers on the left. Thank you for following. Thank you for following a lot. Welcome to the stream. And welcome. Thanks for following. Good. So okay. So now we have again like a whole bunch of genes, which have no name, which is a little bit strange. So let's just go to ensemble, right? And see why this gene doesn't have a name, because I'm under the impression, but that might just be I'm really confused why this gene would not have a name. Yeah, that's true. Novel gene, multi-drug resistance. Okay, so it does have a kind of name. It's kind of a description, but that's interesting. There's not a lot of things. It's a pretty big gene though. That's an interesting one. Why the hell doesn't not have a name then? It's so weird. Anyway, that's something that happens in goat genomes, because the goat genome is not as well annotated as the human genome or the mouse genome. So sometimes you just end up with, well, no genes. All right, but that's like I showed you now that I can use biomark to retrieve my genes. And then now for me, the next step would be to just fill in the table. Right, so let's just open up a little Excel window. Take a new Excel file, show you guys my Excel window. There we go. Right, so I take the genes on the positive strand. So I'm just going to paste them in. Description, chromosome name. Oh, I'm seeing that something goes wrong with the header. So, and then I just want to say strand. So these are positive stranded. Just click, click. Come on, Excel, you can do this. Oh, yeah, plus is not allowed. So plus forward strand. And then I can just do click, click. Okay. And then we're just going to remove these ones. Right, this name, start position and position strand. Good. And now I'm going to do the same thing for the negative genes. So just put them in, right, then delete this part here. We don't need to header twice because we already have the header. And here we say ref for reverse strand, just do click, click. And we just add these two together, we do control X, we go all the way to the top and we say put them here. Very good. Then we say, okay, Excel be a good guy and actually just sort and filter. So allow me to filter here. And now I'm just going to say, well, this is a little trick that you can do, right, we can actually sort these two from smallest to largest position wise. And then I can then say sort smallest to largest for this. And now all of the genes should be sorted in both directions. Right. So now we end up with around 40, no, not 44. We end up with 42 candidate genes. And we can see that most of these genes are very, very poorly annotated in the in the in the genome. But this would be then the first step into the further analysis. So this is how I would generally use biomark. So let me format the table a little bit more. Actually, the filter can now go. So just clear the filter. Make these ones nice and big. So take the thick borders, and then make sure that we can kind of see the difference between the different chromosomes. And like this. And then chromosome four, and then we have chromosome five. Very good. And then we should have chromosome 12 here. Very good. And then we have chromosome 18. And the nice thing is, is actually that because we have two regions or two snips on chromosome 18, but since we use biomarked, it saw that the regions overlapped and it would just take the the whole region. Good. So on 12, there's actually nothing annotated for some wonky reason. That doesn't matter. In theory, you could just say, well, you know what, I'm not really interested in genes that don't have a name. But that would be a little bit of a shame. But the next step would be is to now go through these list of genes one by one, and go to something like wiki gene, and look at the description of these genes, see if they are having anything to do with milk production. For this one, I actually know because this soil, soil, soil, so you'd carriers are very important for milk production, because they, they, they transport stuff from the inside of the cell to the outside of the cell. And it's a soil, you'd so it's, it's probably got something to do with milk. So if we do and then milk, and then we just search. Okay, so I was just throwing this one gene into, right, encodes a plasma membrane magnesium transporter uptake. Okay, so that's uptake. I was actually hoping that it would export magnesium, because exporting of magnesium is something that like happens in the, in the, okay, so that's plasma membrane. So it's actually not in the, I was actually hoping that it's one of these protein carriers, which would profiling so in key metabolic tissues during the postpartum evolution of memory grants from non-secretary to secretary. All right, so what do they say about this gene? All right, is it a, all right, so heat map different in memory glands, so liver and adipose tissue. All right. Oh, see, this is really good. So this gene is actually expressed in memory glands and in the liver. And it is expressed during lactation. So this makes it a pretty good candidate gene actually for milk production, or at least having something to do with milk production and the production. Right, if it would not be expressed in memory glands, like if it would have been like this one, where it's white, right? Interesting. So see milk production, milk production during lactation, changes in gene expression, glucose, transporter, milk fat synthesis, milk production, indian dairy. So it seems to be like a relatively okay candidate. So I'm just going to actually just add a little note for my colleague that when she looks at all of the genes, then this one might be interesting, might be interesting. And then has she can interest thing, because then she can, she can read it, I will actually put the toy of the paper there as well. So then she can actually figure out if it's something that is interesting enough. I'm just gonna take the toy and add it to her as well. Yeah, so it's one or something. Good. So that's what the bioinformatician does, right? I developed your novel method. So the whole kind of principal component method was already developed in the past. And then how we look to see if there are snips that come up as being possibly interesting, then when the snips that come up, we say, well, okay, so these might be possibly interesting, then let's look at like a region around them. And let's look at the genes which are located here, and then see if any of these gene might have something to do. And one of the other steps that we could do is, for example, do a overrepresentation analysis, right? We could take all of the genes that we found and see if any of the metabolic processes is overrepresented. So if you would throw them into an overrepresentation analysis, and the overrepresentation analysis would say like, oh, this has something to do with memory gland, and then of course that would be a relatively good indication that what we are doing is not complete nonsensical. But that's kind of what we do. So with that, it's 427, 430. So thank you guys for watching. If you're watching it on Moodle or on, no, not on Moodle, because I'm not uploading them to Moodle anymore. So if you're watching it on Twitch, thank you for watching. I'm just going to finish off my table. I'm going to go through ensemble and make sure that for all of these genes which don't have a name, I at least get like the kind of crappy name that ensemble gives them. This is so weird that none of these genes on chromosome 12 are actually annotated. So if they are not annotated, right, and I do want to figure out what they do, I would just get their protein sequence and then throw them in the protein database to see if I can find a human gene which is annotated, which is kind of synonymous or relatively compatible, right? Because if the protein structure of the genes are more or less similar, you can assume that it's also got kind of a similar function. But that's kind of the way that you would want to go and fill in these gaps in the table, which unfortunately it happens more and more. The stranger the genome you're looking at. And of course goats are not the weirdest animals in the world, but they're also not the most commonly used model animals either. So that's kind of a shame. So let me actually form up these as well, because we want to make these numbers no decimal places at a thousand separator, make it look like this. So yeah, that's how I use biomark to kind of automatically annotate part of my data and have someone else search through the database later. But like this is something that we can actually highlight that this one might be interesting. All the heat, just looking at the list of genes, like sometimes like a gene pops out where you like remember, oh, I read a paper like a couple of weeks ago, which was implicated in milk production or meat production. Good. So thank you guys for watching and see you next week. Same time, same place. Thanks for all of the guys that stayed until the end. If there's any questions, just just throw it in chat, right? I'm not looking that much at chat, but so no questions. If there's no questions that I'm just going to go home and eat a pizza. So thanks for the class. See you next week. Yeah, see you next week. That's it. Good. So thank you all for being here. Thank you all for taking an interest and I will stop the recording.