Good, so welcome everyone. If you're watching on Twitch, or watching later on YouTube: welcome to the lecture, lecture number five already. I think we have 13 lectures, so we're about one third of the way through the whole course.

Today we will be talking about proteins. But like always, we first have to do the answers. So the overview for today is: first the answers, then I will start talking about the history of proteins and how people do protein separation and prediction. We will be talking about structure a lot, so about the differences between the primary, secondary and tertiary structure of proteins. There will be a part about how to purify proteins and how to identify the different proteins that you are working with. We will do function prediction and protein domains: how do I determine which part of the protein does what? Furthermore, we will talk about protein families, and I will try to explain to you what the difference is between an ortholog, a paralog and a xenolog. And there's going to be a little bit about the structure of phylogenetic trees, although phylogenetic trees will come back in a later lecture.

Good. So first, let's get the answers to the previous assignments. I hope I have everything set up properly. These are my answers, and unfortunately I forgot to open up the questions, so let me do that as well. They are in Doc X, Bioinformatics, and then lecture number four, assignments.

The first thing I asked of you was to download the SARS-CoV-2 spike protein RNA molecule, so the RNA description of how the spike protein is made, from NCBI. I showed you that last time on stream for the envelope. It is the same procedure as for the envelope, except for which gene you download: last time that was the E gene.
This time we download the S gene, which is a slightly longer sequence because the spike has more amino acids than the envelope. The next step is then to predict the secondary structure.

So let me move to Firefox. Let's first go to NCBI and go to the Gene database, because we're interested in the gene and not the protein. We search for SARS-CoV-2 spike protein. Searching, searching; the website is very slow. Here we see the S surface glycoprotein. This is a gene, and you can see the structure of the whole SARS RNA molecule here. We have ORF1a and ORF1ab, which are the replicase, so that's the thing that copies. Then we have S, which is the spike protein, which causes the entry of the virus into the cell. We have E and M, which make up the envelope and the membrane, so those are the proteins that make the capsid. And then we have N, which is the nucleocapsid, the thing that surrounds the RNA.

But we're interested in the S protein, so let's just click on it. It's loading a little bit, and when we scroll down we can see some information, like where it's located. In this case we're only interested in the sequence, so we just click the little FASTA button here. When we click the FASTA button we get an overview of the protein. Sorry, not the protein: we get an overview of the RNA code.

Very good. So we just copy and paste, and then we go to the RNAfold web server to predict whether there is any secondary structure. There it is. We paste the sequence in; we could set some options, but the default options are fine, and we click proceed. This will take some time because the sequence is much bigger than the other one, so we'll go through the other answers in the meantime. Let's see if it finishes. No, so let's just wait and go to the microarray analysis.
I told you that microarrays are these little glass plates which you can use to measure 20,000 genes at the same time, and I gave you an example of something that we did here in our group. So let's go through the assignment.

First things first: if you want to create a little R script, you have to create a new file. In anything that you do, be it working in a lab or being a bioinformatician sitting behind the computer, we have something like a lab book. When you're working in a lab it is an actual lab book, but as a computer scientist or bioinformatician the lab book is the code that you create. Generally we put these files under version control. In the first lecture we discussed the version control system and I had you set it up. So normally a file like this would, for me, go under version control.

I would first make an empty script. The empty script of course contains a header. Make sure that you always include a header, so that you know what the file is for. In this case it clearly says that these are the answers to the assignments belonging to lecture four. I always put my own name there, just so that people know that I made it. It's also good to have a last-modified date or something like that, so that people can see how out of date something is.

Any script that we write in R starts with setting the working directory, and this is kind of important. It's the way that I structure all my scripts: first set the working directory, then load the data file. The data file was available on Moodle.
You could also get it from my website. So I'm just saying: go to this folder where I stored the data. We will have more data files in upcoming lectures; in this case the data file is called GSE7765 GPL96 series matrix.

All right, so let's go to R and copy and paste in the first part to load it. Let me show you the R window. It loads quite quickly, because it's not the biggest data set in the world.

The first question was: how many samples are in the file, and how many mRNAs have been measured using probes? The question also had a hint: samples are in the columns and mRNAs are in the rows. So we can use ncol if we want to know how many columns there are. In this case, when I call ncol it tells me that there are six columns. If I want to see how many rows there are, I can use the nrow function and give it the microarray data set, and it tells me that 22,283 probes on the array have been measured. So that's almost one probe for every gene in the genome.

The next step was, I think, also explained in the assignment.
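As a minimal sketch of the counting step, assuming a small made-up matrix in place of the real series matrix file (the object name `toy` and its values are illustrative, not from the lecture):

```r
# Toy stand-in for the microarray data: 5 probes (rows) x 3 samples (columns).
set.seed(1)
toy <- matrix(runif(15, min = 10, max = 20000), nrow = 5, ncol = 3,
              dimnames = list(paste0("probe_", 1:5), paste0("sample_", 1:3)))

n_samples <- ncol(toy)  # samples are stored in the columns
n_probes <- nrow(toy)   # probes (mRNAs) are stored in the rows

cat("samples:", n_samples, " probes:", n_probes, "\n")
```

On the real data the same two calls would report the six samples and 22,283 probes mentioned above.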
So let me go back to the assignment very quickly So in the assignment using the box plot function We can plot the column by me the matrix by column every column will be used as the input for a single box plot This allows us to compare the different samples We net we need to set two additional parameters so that the labels end up being readable on the plot because if you don't do that Then it will take the column names and put them horizontal But if you put them horizontal You they are overlapping so you can't read them So the way that this is done in our and I think I gave you the whole code is just say do a box plot the first Parameter that we give is the data parameter So this is the data which it will use so it will look at this data frame and then for every column start making a box plot Las equals two means that instead of plotting the things horizontal Can you explain your micro array definition? Yeah. Yeah. Yeah, sure sure So here we see the read CSV function, right? So the read CSV function loads in a CSV file or a table So we give it the name of the file that we want to load in So in this case it's in this folder and then I have a sub folder called data So it's in data and then this is the file name Then I say skip is 66 because we have to skip the initial lines And actually let me open up the file for you guys. I think that should be possible in notepad plus plus It's not the the biggest one So we go there we go to data and then we go open up the Micro array data. So this is how it looks like, right? So we first see that there's a whole bunch of Additional information and so this is dioxin induced gene expression changes in human breast cancer cells And then it gives you the Geo Siri that it belongs to And so on and so on and it gives you like a whole bunch of additional information about For example, who collected the data? Who do you need to email? 
It even lists their phone numbers, in case you want to call them, and all of these things. For example, there's also supplementary data. But then you see that the matrix, the data itself, starts at line 67. That is why I'm saying skip is 66 here: the first 66 lines are not a matrix, they are just additional information about the file that you're currently looking at.

The next parameter is header is true. Header is true means there is a header, so every column has a name. In this case, for example, the first column is called ID_REF and the second column is called GSM188013. If there had not been a header, if the file had just started with the data and had no names for the columns, then I would have said header is false. But since there is a name for every column, I'm saying header is true.

Separator is tab. This is something you see when you look into the file with Notepad++: in my case tabs are these little arrows, while spaces are these little dots. They are probably very hard to see on stream, but there is a difference between a space and a tab. In Notepad++, on the top you have this bar with File, Edit, Search. If you go to View, there's an option Show Symbol, and underneath it says Show White Space and TAB. You can also say Show All Characters, and now you see that it also gives me these LF things. These are line feeds.
Those are the enter characters in the file, but those are not that interesting. So I go to View, then Show Symbol, and I enable Show White Space and TAB. Now I know that this is a tab, because it's a little arrow, and if I type some spaces here you see that those are these little yellow dots. They're really clear when you're looking at it yourself. So that's the separator parameter.

Then I say row.names equals one, and that is because every row has a name. This is a probe, and this probe is called 1007_s_at.

This takes a little bit of practice. Loading files into R is not easy, because there are a great many parameters that you can use. Fortunately this file was actually encoded in the correct way: for the decimal separator it uses a dot and not a comma. That can be an issue if you're using, for example, a German version of Excel, because the German version of Excel generally prefers the comma as the decimal separator. So we have here the number 15630.2, which a German Excel would write with a comma, but R requires the decimal commas to be dots to load it in.

There are many more parameters to the read.table function. If you're interested: I think on YouTube I actually chapterized the reading-in-data lecture already. That's one of the things I've been working on in my free time, making sure that every YouTube video has these bookmarks, so that you can just click on "reading in data" and it goes directly to that part of the video.

So that is the description of line number eight.
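The loading call described here can be sketched as follows. This is an illustration only: it writes a tiny tab-separated file with two made-up metadata lines to a temporary location, so `skip = 2` plays the role that `skip = 66` plays for the real file, and all sample names and values are invented.

```r
# Write a small file that mimics the layout of a series matrix file:
# a few metadata lines first, then a header line, then tab-separated data.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "!Series_title\t\"toy example\"",
  "!Series_contact\t\"nobody\"",
  "ID_REF\tGSM_A\tGSM_B",
  "1007_s_at\t15630.2\t17048.8",
  "1053_at\t665.3\t647.1"
), tmp)

toy <- read.csv(tmp,
                skip = 2,        # number of metadata lines before the matrix
                header = TRUE,   # the first remaining line holds column names
                sep = "\t",      # columns are separated by tabs
                row.names = 1)   # the first column holds the probe names

print(toy)
print(dim(toy))  # rows first, then columns
```

If a file used decimal commas instead of dots, `read.csv` also accepts a `dec = ","` argument.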
To summarize: I give it the data file to load. I say skip 66 lines, because that part is not a matrix, it's just additional information like telephone numbers and who collected the data. Header is true, because we see that the file has column names. I say row.names is one to tell it that there are row names: there is numeric data in the matrix, but every row also has its own name. Furthermore we say separator equals tab, and the tab character is specified by backslash t.

Here I use the dim function instead of the nrow and ncol that I just showed you. dim does more or less the same thing. Let me show you how that looks in R. When you use the dim function it gives you the dimensions, first the rows and then the columns. So it's 22,283 rows and six columns.

All right, let's use the boxplot function to make a little box plot, so that you can see what is happening. Here we see the first plot that we made. What we can see is that there is a very weird distribution. The box plot is squeezed all the way to the bottom, and there are literally hundreds or thousands of outliers on the top.

If we want to look at this in a little more detail, we can of course make a histogram. To do that we use the hist function, and for example I want to look at the first column of the microarray. The way I do that is to say: give me all the rows.
I don't fill in anything for the rows, so it doesn't select any rows and takes all of them, and then I say comma one for the first column. If I then press enter, it gives you this histogram. It doesn't show you much, but it shows you that almost 20,000 measurements were in the range of zero to something like five or ten thousand, and then you see that there is this very long tail.

We can use breaks to add more breaks. Breaks means how many of these bars you get. By default I get something like ten to twenty bars. But if I say breaks is a hundred, you will see that it makes the individual bins much smaller, so I can see more of what's happening. I see that there's a whole bunch of measurements which are kind of off, where there's no intensity, and then you see that the higher the intensity, the fewer values you get.

But of course this is just raw data. Raw microarray data is the intensity of the probe, and the intensity of the probe is not something that we directly want to model. So the next step in analyzing microarray data is to do some transformation. The first transformation that you always need to do when you are looking at microarray data is a log2 transformation, and that is because in a microarray the light works in kind of an exponential fashion.
If you have two molecules of RNA binding, you have double the amount of light compared to when you have one molecule, and if you have four molecules you have roughly a four times increase in intensity. A log2 will kind of undo this exponential curve.

So what I do is take the microarray data, and I'm using some shorthand here. The apply function is a very, very powerful function. It allows you to take another function and apply it to either the rows or the columns of a matrix. It's basically a little for loop which goes through each of the columns of the microarray and takes the log2 of the column. You start off with a single vector of values, you take the log2, and it does that for each column, so you get a matrix back which has the exact same dimensions.

I can show how that is done. In R it happens almost instantaneously. I say apply to the microarray, where two means the columns, so for each column apply the log2 function, and give me back a new matrix which stores this data.

If I want to look at it a little bit, I can say: give me the first ten lines of the microarray. These are the original values, which you also see in the data set. If I then look at the microarray log, it shows that the values have been transformed into their log2 counterparts. So if I compute the log2 of this number it should end up being 13.93, and if I just do log2 of this one, you can indeed see that it matches.

Good. The next thing is to see what the distribution looks like now.
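The column-wise transformation can be sketched on a toy matrix like this; the object names `toy` and `toy_log` are illustrative stand-ins for the lecture's microarray objects:

```r
# Toy intensities: 3 probes (rows) x 2 samples (columns), all powers of two
# so the log2 values are easy to check by eye.
toy <- matrix(c(1, 2, 4, 8, 16, 32), nrow = 3,
              dimnames = list(c("p1", "p2", "p3"), c("s1", "s2")))

# MARGIN = 2 means "go through the columns"; log2 is applied to each column
# and the results are glued back together into a matrix of the same shape.
toy_log <- apply(toy, 2, log2)

print(toy_log)
```

Because `log2` is vectorized, calling `log2(toy)` directly on the matrix gives the same numbers; the `apply` form is shown because it generalizes to functions that are not vectorized.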
We want to see whether our distribution has become more normal, because we kind of got rid of this exponential step. So we do the same thing again. The nice thing in R is that you can use the up and down arrows to switch between the previous lines that you already typed. So if I go up to my boxplot call, I can just say: instead of using microarray, do a box plot of the microarray which has been log-transformed. Again set las to 2 so that the names are vertical and not horizontal, and cex.axis is 0.7 means that the text gets a little bit smaller; otherwise the text would fall outside of the margin.

If we do it like that, it already looks a lot better. You can see that the average of the microarrays is around 10. Some microarrays still have some outliers on the top and some on the bottom, but it already looks much more like a normal distribution: the median is in the middle, 50% of the data is in the box, and the whiskers reach out to the most extreme points that are not drawn as outliers, so nearly all of the data lies between the two whiskers. That's what a box plot means.

If we want to look at it in a histogram, we just take a histogram of the first one. So we take the microarray log, take the first column, and again ask for a hundred breaks. Then the histogram starts looking like this. You see that it starts to be kind of a normal distribution. You still see that there's a little hump here, so it doesn't look like a perfect normal distribution.
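The two plotting calls can be sketched as below. The data are simulated (normal values around 10), not the lecture's data set, and the plots are sent to a null device so the sketch also runs without a screen; in an interactive session the `pdf(NULL)` and `dev.off()` lines can simply be left out.

```r
# Simulated log2-scale expression values: 100 probes x 6 samples.
set.seed(42)
toy_log <- matrix(rnorm(600, mean = 10, sd = 2), ncol = 6,
                  dimnames = list(NULL, paste0("GSM_toy_", 1:6)))

pdf(NULL)  # null graphics device: nothing is written to disk

# One box per column; las = 2 turns the axis labels perpendicular to the
# axis (vertical on the x-axis), cex.axis = 0.7 shrinks the label text.
bp <- boxplot(toy_log, las = 2, cex.axis = 0.7)

# Histogram of the first sample only; breaks = 100 asks for roughly 100 bins
# (R treats the number as a suggestion and picks "pretty" bin edges).
h <- hist(toy_log[, 1], breaks = 100)

invisible(dev.off())
```

Both functions return their computed values invisibly, so `bp$stats` holds the whisker, box and median positions and `h$counts` the bin counts.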
It's not perfect. It's kind of a dual normal if you look at it like this: there's one normal distribution here, overlapping with another normal distribution here. The first normal distribution seems to have a mean of about seven and a half, and the second one seems to have a mean of around ten or just above ten.

But it already looks a lot better. If you want to do statistics, and you want to use powerful statistics like Pearson's correlation or linear models, then one of the assumptions in these kinds of models is that you have a normal distribution. So you have to get rid of this weird, heavily skewed distribution and transform it. For microarrays you always do this: microarrays are always log-transformed, and then you get a more normal distribution.

All right, so the question here was, let me look it up quickly: make a box plot with and without. What does the las parameter do? I already said it, but let me show you the box plot again. The las parameter flips the axis labels from being horizontal to being vertical. If we say las is one, we can see that the names are horizontal; las is two means that they are vertical.

The next question was: what does the cex.axis parameter mean? That just scales the text to make it bigger or smaller. If we want to check that, we can say: make it 2.7. And now you see that the names on the axis are really, really big. That's what cex.axis does.

Before we can start normalization we need to log2 transform, and we already did that. So redo the box plot on the log-transformed data: what is the difference? Let me show you my answer. When I do my assignments, I always do it kind of like this.
I give the code, I make the box plots, and then I just write the answers as comments. Every line in R which starts with a hash sign is ignored by the R interpreter, because it's a comment line.

The answer to question 3a is that the range of the data changed a lot, and the distribution now looks better in the plot and is also more normal. We can see that from the box plot, but we can also see it from the histogram.

Why do we log-transform microarray data? This is because the PCR step is there to amplify the original signal exponentially. So it has to do with the way that probes work, and it's just a common thing to do with microarrays.

All right, so now we can start normalizing. To normalize we are going to use an external package called preprocessCore. Unfortunately, and I tried this today, you can't use the standard install.packages function for it anymore. The preprocessCore package is no longer available from the standard repository that R uses, so you have to get it from Bioconductor. Let me show you the code for that. To install this package you have to use this kind of magic incantation: if you have not got the package installed, it first checks whether you have the Bioconductor manager installed. If you do not, it installs that from the standard repository.

The reason is that CRAN, the standard repository for R, has thousands of packages in there, but it's really hard to get your packages on CRAN, because they have very strict demands on packages, help files and these kinds of things.
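The exact code shown on screen is not part of this transcript. The commonly used form of this installation pattern, given here as a sketch rather than as the lecturer's own script, looks like this (it needs an internet connection the first time it runs):

```r
# Install the Bioconductor package manager from CRAN if it is missing.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Use the Bioconductor manager to install preprocessCore if it is missing.
if (!requireNamespace("preprocessCore", quietly = TRUE)) {
  BiocManager::install("preprocessCore")
}

# Installing only puts the code on disk; library() makes its functions
# available in the current R session.
library(preprocessCore)
```

Wrapping each install in a `requireNamespace` check means the script can be re-run without reinstalling anything.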
Bioconductor is another repository that you can use for packages. They are just more relaxed about their packages, and they also allow you to update packages more often. So a lot of people switch their package from CRAN to Bioconductor, especially packages which are updated very frequently. The whole process of getting a package accepted on CRAN takes around two weeks, while putting it on Bioconductor is just sending an email with a new version of the package, and then they update it within about 24 hours.

So if you did not have the package installed, this is the magic incantation. You use the standard install.packages function to install the Bioconductor manager, and then you use the Bioconductor manager to install the preprocessCore package.

Then I do library preprocessCore. Installing only puts the code that the package uses on your hard drive, but you still have to make it active. You still have to activate the package, so that all of the functions inside this package become available to you. That is what the library call does: it opens up the library and gives you access to the functions in there.

So let's do that. Let's go to R and load the library, and if everything goes well, you don't get an error or anything.

The next step is to take the log2 microarray data and normalize the quantiles. Normalizing quantiles means that I want to get rid of some unwanted variation, like we could see in the box plot. Let me make the box plot bigger. What we can see in the box plot is that not every microarray has the exact same distribution.
For some microarrays the average is just a little bit lower. If you just started testing, so if you did a t-test of these three samples versus these three samples, you would find a lot of genes which are differentially expressed, and this differential expression doesn't come from biology. It just comes from the fact that the first two microarrays had a little bit more RNA put on them compared to the last two microarrays.

The mean of the microarray tells you more or less how much sample was put on, and the variance is also determined by other factors, like the temperature or the air humidity when you do the microarray experiment. That's one of the things that we want to get rid of. We want to make sure that every microarray has the exact same average when we look across all of the 22,000 probes. To do that we can use the normalize.quantiles function.

I was looking at R again and didn't show it to you. That's bad. So here again, if we look at the first two microarrays, we can see that the mean is slightly above 10; for these ones the mean is slightly below 10. If we just did a t-test between the first three and the last three, we would find literally thousands of genes being differentially expressed. To prevent this we need to normalize, making sure that the average of each microarray is similar between all of the different microarrays that we're using, and that the variance is also the same. The normalize.quantiles function will normalize the quantiles for us.

Good, so let's do that.
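To make the idea concrete, here is a base-R sketch of quantile normalization in its textbook form. It is an illustration with a hypothetical helper, `quantile_normalize`, and simulated data; it is not the preprocessCore implementation and may treat ties and missing values differently from `normalize.quantiles`.

```r
# Hypothetical helper: textbook quantile normalization of the columns of x.
quantile_normalize <- function(x) {
  # Rank of every value within its own column (ties broken by position).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort every column, then average across columns at each rank position.
  sorted <- apply(x, 2, sort)
  reference <- unname(rowMeans(sorted))
  # Replace each value by the reference value that belongs to its rank.
  out <- apply(ranks, 2, function(r) reference[r])
  # Keep the original probe and sample names on the result.
  dimnames(out) <- dimnames(x)
  out
}

# Simulated log2 data where the samples differ in overall level and spread.
set.seed(3)
toy_log <- cbind(rnorm(50, mean = 10.3, sd = 1.2),
                 rnorm(50, mean = 9.8, sd = 0.9),
                 rnorm(50, mean = 10.0, sd = 1.0))
dimnames(toy_log) <- list(paste0("probe_", 1:50), paste0("array_", 1:3))

toy_qnorm <- quantile_normalize(toy_log)

print(round(colMeans(toy_log), 3))    # different before normalization
print(round(colMeans(toy_qnorm), 3))  # identical after normalization
```

After this step every column contains exactly the same set of values, only in a different order, which is why the means, the variances and the box plots all line up.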
Again, I define a new variable. I call normalize.quantiles and put in the microarray log data structure, so the matrix holding the log-transformed microarray data, and I store the result in a new variable called microarray log qnorm, for quantile normalization.

The thing is, and let me show you in R: if we look at the first 10 rows of the original matrix, the microarray log, we can see that the original data has row names and column names. But if we look at the microarray log qnorm, we see that it doesn't. In the process of normalizing, it removed the column names and the row names. We don't so much prevent that from happening; we just fix it afterwards. Fixing it afterwards is just saying: the column names of the microarray log qnorm are the column names of the microarray log, and the same thing for the row names. So we're just putting the row names and the column names back on the object. We have to do this because the normalize.quantiles function doesn't deal with the row and column names. Generally, to avoid this, you would put the result back into the matrix that it came from, overwriting the original values. But since I want to keep the original object, I fix it by setting the row names and the column names back.

Then we make a box plot to check that the normalization went okay and that everything we expected to happen happened. After log2 transformation and quantile normalization, we now see that every microarray has the exact same average expression. We also see that the maximum value is similar for each of the arrays, and the lower values are also the same. So it removed the unwanted differences between the arrays. This is what it did for each microarray.
It calculated the mean and kind of subtracted the mean out of the data, and did the same for the variance. (Strictly speaking, quantile normalization goes one step further and gives every array the same distribution of values, which is why the means and variances end up equal.) I think in the R course I actually do it live, so I write my own normalize quantiles function. It's not that hard to do: you calculate the means and standard deviations and then you use those to transform your data, just like a log2 transformation.

So what was the question? The question was: what do we observe after log2 transformation and after normalization, and what is the major difference? Well, the major difference is that the range of the data is now similar for all samples. Every sample has the same average and every sample has the same variance, which means that we can now more or less start analyzing our matrix.

One of the first things that we might want to do, and these were some of the additional questions, is to define some clusters. The clusters are the thing that we're interested in, because we want to know how different these samples are from each other. If we go to R, we see that we have several microarrays, so we have 8013, 8014 and so on, and we want to know how different these measurements are from the other ones. It might be that there is already some structure that we can easily see. You can imagine that if the first three were done on brain and the last three were done on, for example, fat tissue, there should be a very big difference between brain and fat tissue, but within the fat tissue the samples should look very similar to each other. So making a simple cluster plot allows you to reason about the data and about what's going on. I gave you the code for this in the assignment.
The code was actually a little bit difficult, because what we want to do is calculate the distance, and the dist function is for some reason different from the boxplot function. The boxplot function works on the columns, but the dist function works on the rows. Because we want to know the distance between the microarrays, we first have to transpose the whole matrix. If you transpose a matrix you more or less put it on its side, so the columns become the rows and the rows become the columns. That is what the t function does: it takes the matrix and puts it on its side. Then the dist function calculates, for each row, the distance to the other rows.

Then we use the hclust function to do a hierarchical clustering. This is an unsupervised clustering method, and I save the result of the unsupervised clustering into a new variable called clusters. Then of course we can use the plot function to plot it.

So let's see what happens after we have normalized our data. Let's go back to R. What we see is something like this. We can see that there are kind of two groups in our data. Two of the samples, 8013 and 8014, are relatively similar to each other, and the other four microarrays are also relatively similar to each other, but they are not as similar as these two are to each other. You can see that from the height, so from where, for example, this sample joins the tree with the other three. This means that this sample is around 140 distance units away from them. And here we see the same thing.
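A sketch of the clustering step on simulated data, assuming two made-up groups of arrays (two arrays around one level, four around another). The final `cutree` call is an addition for illustration and was not part of the lecture; it turns the tree into explicit group labels.

```r
# Simulated normalized data: 100 probes x 6 arrays in two groups.
set.seed(7)
group_a <- matrix(rnorm(200, mean = 8), ncol = 2)
group_b <- matrix(rnorm(400, mean = 11), ncol = 4)
toy <- cbind(group_a, group_b)
colnames(toy) <- paste0("array_", 1:6)

# dist() compares rows, so transpose first: arrays become the rows.
distances <- dist(t(toy))

# Hierarchical (unsupervised) clustering on the array-to-array distances.
clusters <- hclust(distances)

pdf(NULL)       # null device so the sketch runs without a screen
plot(clusters)  # draws the tree; the height shows where branches join
invisible(dev.off())

# Not from the lecture: cut the tree into two groups and list the members.
groups <- cutree(clusters, k = 2)
print(groups)
```

By default `dist` computes Euclidean distances and `hclust` uses complete linkage; both can be changed through their `method` arguments.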
These two microarrays are very similar, and these four microarrays are also very similar to each other, but there is a big difference between these four and the other two. So that's something that we learned: there are two groups in our data. Two microarrays are very similar, and the other four are also similar, but less similar to each other than the first two.

All right, let's go back to Notepad++. The answer was that there are two major groups in our data: one contains two microarrays and the other one contains four.

Then I asked you to do something complex. Since we don't know the grouping in this data, so we don't know if it's a three-versus-three experiment or two versus four, we can't use any statistical testing. We can't do a t-test or build a linear model. The only thing that we can do is look at the genes which are highly variable.

So how do I define highly variable? I use the apply function again, and this time I use the microarray log qnorm data set and go through each of the rows. Then I say function x, so this function will be executed for every row of the matrix, and inside this function the row will be called x. That is something I define myself. I could also have called it y, or row, or mrow. I can choose what I want to call the row inside this function, but I call it x.

What I want to know is how variable a gene is. But I also need to take into account the average of the gene.
When I have measurements that are 10, 11 and 9, then that is less variable than something which goes from 3 to 4 to 2, right? Because the bigger the numbers, the higher the variance becomes anyway, and that's just because of measurement error. So I just say: give me the scaled variation. Scaled variation is defined as the variance of x divided by the mean of x, with x each time being one of the rows of the microarray data that I normalized. All right, so let's run this. It should be relatively quick, because it's not the biggest data set. When we now look at scaled variation, so the object that we just created, and I'm just going to look at the first 10, so one to ten, then it looks like this. It tells me that the scaled variance of the first probe is 0.006. The next probe has a higher scaled variance, which is 0.009, then the third one has an even higher scaled variance, and the fourth one is lower again. So I can just say: okay, I'm interested in all of them, so just do a plot, just give me something graphical. And then you see something like this, right? Across the array we have our 22,000 probes on the x-axis, and on the y-axis we see the scaled variance. The scaled variance is relatively low for many probes, which means that these probes are targeting something which is expressed equally in all six of the arrays. But you also see that some probes are targeting things which are very different between the different arrays that we have, and that is shown by a high scaled variance. All right, so what we wanted to do then is get all of the genes where the scaled variance is higher than one. The way that we can do that is by just saying: okay, we take our vector called scaled variance, right?
Those are just the values, so for each probe on the array there's one value, and then I just say: give me the ones which are higher than one. This will give you a true/false vector, right? For each element in the vector it will test if the value is higher than one. If it is, then it's true; if it's lower than or equal to one, it will say false. I can show you guys how this looks in R. If I just execute this little part in the middle, so scaled variation higher than one, then you see that for each probe it will say if it was false or if it was true. These ones were all lower than one. Can I find one which is higher than one? Not really, right? But since I'm not really interested in knowing for each element whether this thing is true or false, I use the which function. The which function just gives me the index, so which rows actually have a scaled variance higher than one. So I can just say which, and then scaled variance higher than one, and now it will tell me that there's a very limited number of probes that actually had a scaled variance above one. And that is true, because when we look at the plot we see that there's only a couple which have a scaled variance above one. So in this case row number 4231 has a scaled variance higher than one, and this is the name of the row. If I don't want to know the row numbers, but I just want to get the names of the probes, then I can just say names. So there are in total 21 probes which have a scaled variance higher than one, and these are their names. All right, and then of course I want to store this, so I store it in something called highly variable, or something like that, right? You can pick the names of the variables yourself. That's up to you.
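The selection of highly variable probes walked through above can be sketched in a few lines of R. This is a minimal illustration under assumptions, not the assignment solution itself: the matrix `expr`, its probe names and the two hand-made variable probes are invented, while `apply`, `var`, `mean`, `which` and `names` are the base R functions mentioned in the lecture.

```r
# Toy stand-in for the normalized matrix (hypothetical data): 20 probes, 6 arrays
set.seed(42)
expr <- matrix(rnorm(120, mean = 10, sd = 0.2), nrow = 20, ncol = 6,
               dimnames = list(paste0("probe", 1:20), paste0("array", 1:6)))
# Make two probes differ strongly between the arrays
expr["probe5", ]  <- c(1, 9, 1, 9, 1, 9)
expr["probe12", ] <- c(2, 12, 2, 12, 2, 12)

# For every row (MARGIN = 1): variance divided by the mean
scaled_variation <- apply(expr, 1, function(x) var(x) / mean(x))

# Logical (TRUE/FALSE) vector, one element per probe
is_high <- scaled_variation > 1

# which() turns the TRUE elements into row numbers and keeps the probe names
high_index <- which(is_high)

# names() extracts just the probe names
highly_variable <- names(high_index)
print(highly_variable)
```

Because `which` keeps the names of the vector, the same object gives both the row numbers and the probe names.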
So you can also just call it hv, but I would always advise people to use descriptive names, right? Have the name of the variable mean something. Temperature in Celsius is a much better variable name than just temp, because temp can mean this is a temporary value or it's a temperature value, and then you don't know. And long names are not bad, right? Long names are more meaningful to other people looking at your code, but they are also good for yourself, to remember what you were doing. And in R, if I type "High" and I press Tab, then it actually autocompletes the name for me, so it doesn't matter that the name is long, because you don't have to type the whole name anyway. All right, so let's go back to Notepad++. What I wanted to do then is just make a heat map. So from the microarray log qnorm, so from the normalized, log-transformed data, I take the genes, or the probes, which are highly variable, and then just ask R to make a heat map. Let's see how this looks. Let's go to R. Now we see here the genes, and we see their expression across the different arrays. Here on the x-axis we see the different array names, here we see the probes that we have selected, and here we see a clustering. What we can see is that these two arrays, which we previously identified as being very close to each other, are also relatively close in the highly variable genes: a gene which is high in these two tends to be low, or lower, in the other ones. So the more intense the red color, the higher the similarity, and the lighter the color, so the more white the color, the more different they are. All right, so what was then the answer to the question?
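The heat map step can be sketched as follows. Again this is only an illustration on invented data: `expr` and the probe names are assumptions, and the base R `heatmap` function is used with its default settings, which may differ from the options used on stream.

```r
# Toy stand-in for the normalized matrix (hypothetical data): 10 probes, 6 arrays
set.seed(7)
expr <- matrix(rnorm(60, mean = 8, sd = 0.3), nrow = 10, ncol = 6,
               dimnames = list(paste0("probe", 1:10), paste0("array", 1:6)))
# Four probes are much higher in the first two arrays
expr[1:4, 1:2] <- expr[1:4, 1:2] + 8

# Select the probes with a scaled variation above one
scaled_variation <- apply(expr, 1, function(x) var(x) / mean(x))
highly_variable <- names(which(scaled_variation > 1))

# Heat map of only the highly variable probes; rows and columns are
# reordered by hierarchical clustering, shown as trees on the sides
hm <- heatmap(expr[highly_variable, ])

# The returned list holds the clustered row and column order
print(colnames(expr)[hm$colInd])
```

Subsetting the matrix by probe name before plotting keeps the heat map small enough to read, which is the reason for selecting the highly variable probes first.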
So, do we observe the groups that we saw before? If we look at the R window, we can indeed see that there are two arrays which are very similar to each other, and then we see that there are four arrays which are also relatively similar to each other. But now we also see some more structure here, right? We see that there are two microarrays which are similar, then another two microarrays which are very similar, and another two microarrays which are again very similar to each other. All right, so that was the initial look into some of the microarray expression data. It's a first look. There will be more examples like this, but it's just for you guys to get a little bit of experience in R, because I do think that it's important that I get across the importance of being able to program as a bioinformatician. As a bioinformatician you need to be able to program, so you need to learn a programming language, be it R, be it Python, C or C++. In the end it doesn't matter in which language you learn how to program, but learning how to program is more or less essential. But since the course is an introduction to bioinformatics, there's no real time to teach you guys how to be programmers, so that's why we look at the individual databases. All right, are there any more questions about the assignments? If not, then I think we should start with the lecture. I'm just going to wait a little bit, so if someone has a question, just throw it in chat. I will show you guys the outline for now. So the overview for the rest of the day will be me talking a little bit about history. We will talk a little bit about the structure of proteins, about purification and identification, and about function prediction, like protein domains: how do we figure out which part of the protein does what?
And then we will talk about protein families and terms like ortholog, paralog, xenolog and these kinds of things. Good, I see no questions in chat, so then we will start with the history. Although it's 1:46. What do you guys think, should we do a break now or should I do two more slides? For me, it doesn't matter what we do. But if you guys have very strong opinions, like "oh, I want to have a break now" or "two more slides"... I can probably do one or two more slides. genie 88: "for you, I'm okay to continue." All right. "How's the folding going?" Oh, right. Yes, very good. Let me show you guys that. It actually finished, very good. Thank you for reminding me, Micha. So the result of the folding is here. Here we see the sequence that we gave it. Then here we see a kind of more graphical overview, using opening and closing brackets, where it tells you the structure. So the structure, for a computer, looks like this. That means that here you open up, and you open up, and you open up, and then at a certain point parts of the RNA are closed again. The results for the thermodynamic prediction look like this. Can I zoom in a little bit? Yeah, I can zoom in a little bit. So here we see indeed that there is some structure to the RNA molecule of the spike. So not only the protein has a very distinct structure, but the RNA has a very distinct structure as well.
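To make the bracket string a bit more concrete, here is a small sketch in R, assuming the text output is the dot-bracket notation that RNAfold prints, where "(" opens a base pair, ")" closes the most recently opened one, and "." is an unpaired base. The function name `dot_bracket_pairs` and the short example string are invented for illustration; this is not part of the RNAfold web server.

```r
# Hypothetical helper: list the base pairs encoded in a dot-bracket string
dot_bracket_pairs <- function(structure) {
  chars <- strsplit(structure, "")[[1]]
  open_stack <- integer(0)                 # positions of still-open brackets
  pairs <- matrix(integer(0), ncol = 2)    # one row per base pair
  for (i in seq_along(chars)) {
    if (chars[i] == "(") {
      open_stack <- c(open_stack, i)       # remember where we opened
    } else if (chars[i] == ")") {
      if (length(open_stack) == 0) stop("extra ')' at position ", i)
      partner <- open_stack[length(open_stack)]      # most recent open bracket
      open_stack <- open_stack[-length(open_stack)]  # and remove it again
      pairs <- rbind(pairs, c(partner, i))
    }
  }
  if (length(open_stack) > 0) stop("unclosed '(' in structure")
  colnames(pairs) <- c("open", "close")
  pairs
}

# A made-up 15-base structure: a stem, a small bulge and a hairpin loop
pairs <- dot_bracket_pairs("((..((...))..))")
print(pairs)
```

Each row gives the positions of two bases that are predicted to pair, so the innermost pair of a hairpin is listed first and the outermost pair last.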
You can actually see that there's a lot of blue and green, so it was actually pretty good at predicting the structure inside of the RNA. And you can see here that in biology things always revolve around structure, right? Sequence is just sequence, but sequence folds in a certain way. These are real molecules, and the structure of the molecule carries the function. You can see here that the RNA coding for the spike protein has a very distinct structure itself as well, which of course makes it able to be translated by the ribosome. We see a slightly different structure, well, it's relatively similar, when we look at the other prediction that it does, which is the centroid structure prediction. But you can see of course that there are differences between the two prediction methods. So first it uses the MFE prediction, this is the MFE prediction, and then here we see the centroid secondary structure prediction, which predicts a little bit more circular structure and less of these kind of hands on the side. So very good. We can actually download them if we want to. And here we see a representation of the difference between the predictions: in blue you see the centroid prediction, in red you see the MFE prediction, and in green you see how well the two predictions fit each other. So the higher the green line, the more the two predictions agree with each other. So it's an interesting tool. It works really well for very small RNA molecules, but you can see that it also works for bigger RNA molecules. And of course structure is everything, so structure determines function. It's kind of, instead of having form follows function.
It's more like function follows form when we talk about biology. All right, good. Let's do a history slide. When we talk about proteins, the history of proteins goes back way, way further than the history of DNA and RNA. DNA and RNA are relatively novel molecules in the world of biological molecules: we knew about proteins something like a hundred years before we even knew that DNA and RNA existed. So around 1800 it was first determined that proteins are a distinct class of biological molecules. When people started analyzing samples like chicken eggs and looked at these things chemically, they saw that indeed there were different fractions. When you do different chemical tests, you can see that there are different fractions, or different parts, which make up the cellular material. In 1838 we have the first description of proteins. The first description of proteins is of course a description where they say: well, there's this egg, and if we take the substance which is in the egg, we have something which is not calcium, and we have something which is kind of fatty, so we have a fat fraction. But besides the fat fraction we also have a fraction which is different. It's not fat, it's not sugar, but it is a kind of protein-like substance. And the term protein was also coined back then, in 1838. So people knew that proteins were a different type of biomolecule, right? They knew it was not fat. At the end of the 1800s we also discovered nuclein, so the nucleic acids like DNA and RNA. But people couldn't really do a lot with proteins, because they are massive molecules, right?
They are sometimes hundreds and hundreds of amino acids big, or even thousands of amino acids, and separating out those different amino acids from each other to kind of hint at the structure was very difficult. So people knew that proteins existed and they knew that they had biological functions, so that they functioned as enzymes, but they couldn't really purify them properly. Insulin, for example, was only purified in the 1920s, and they found that very difficult. So for almost a hundred years nothing really happened. The big discovery came in 1895, when X-rays were discovered. Wilhelm Röntgen discovered X-rays, and by using X-rays on protein mixtures people figured out that X-rays might be the key to unlocking proteins and learning more about the structure of proteins, and not just their structure, but also their function. In 1912 we see the first X-ray diffraction experiments starting to happen. So what happens in an X-ray diffraction experiment? We will talk about this more, but very basically, what you do is you take a protein, you purify it as best as you can, and then you make a crystal out of it. You put this little crystal on a kind of pedestal, then you shoot X-rays at it, and you have a photosensitive plate behind it. When you do this, you start seeing very specific patterns. If you take different proteins and make crystals of these proteins, you see that when you shoot an X-ray at them you get a diffraction pattern which looks completely different for every protein that you look at. After these first X-ray diffraction experiments nothing really happened for a long time. Well, not a long time, a couple of years: in 1926 people discovered that enzymes are actually proteins, so that proteins can catalyze reactions, and that they are involved in things like glucose homeostasis, so that if you are diabetic,
then you are missing a certain protein. And people kind of realized that proteins are the workhorses of the cell, so that they are enzymes catalyzing chemical reactions: you can have a chemical reaction, and if you add a protein to it, this chemical reaction runs much quicker, or it runs in a different direction. In 1933 we have the theory of the secondary structure of proteins. The X-ray diffraction experiments told us that proteins are made up of very simple building blocks which are more or less put together in different configurations, right? So it kind of led to the discovery of the amino acid, and of the fact that a protein is more or less made up of different amino acids that are chained together. That's what you see from the X-ray diffraction experiments, because when you do an X-ray diffraction of a protein crystal, and the protein is not too big, then you see very specific patterns, and these patterns come back between different proteins. So if you have the X-ray diffraction pattern of an alanine amino acid, then of course this diffraction pattern also occurs in other proteins, but not in all of them, because only the proteins that contain alanine have this very specific pattern. From that they deduced that there are around 26 or 27 different patterns which continuously occur. So they got the idea that a protein is a very big molecule, but that it's made out of 20 to 26 very distinct subunits, and these subunits they called amino acids.
So in 1933 we have the theory of the secondary structure of proteins. People then started to realize that the way that these amino acids are chained together allows a protein to have a particular secondary structure, so that proteins are more or less folded back onto themselves. You see that a protein has a very distinct secondary structure, and that this structure is related to the function that the protein has. In 1946 we have the development of nuclear magnetic resonance, and that actually allows you to see proteins function. Nowadays, using NMR, we can make very detailed scans of proteins, and we can actually see proteins more or less moving while they are catalyzing their chemical reaction. In 1949 we have the synthesis of insulin. Insulin is of course the magic substance that helped diabetic children to not die. Before about 1920, when you were diabetic, there was nothing that people could do. They would just put you into a hospital bed, and over the course of six to seven weeks you would just wither away, not being able to take up food, not being able to get glucose into your cells. And of course this was horrible, because a lot of children died before the 1920s, since there was no way of treating diabetes.
So if you have type 1 diabetes, you can't produce insulin at all. There's a lot of literature about it, so if you're interested in the history of insulin, which is one of the miracle drugs of the beginning of the 20th century, then do read up on it, because it's a very interesting story, with three different scientists who were all kind of fighting with each other. In the end it's good that they never patented their extraction method, because otherwise a lot more people would have died. But the synthesis of insulin is a big step forward. Before 1949, if you needed insulin, the only way to get it was to find someone who could chemically extract or purify the insulin protein, and it would be extracted from bovine material from slaughterhouses. So someone would go to a slaughterhouse where they slaughtered a lot of cows, they would collect all of these pancreases, they would more or less squeeze all of the protein out of the pancreas, and then there would be a purification step, purifying the insulin from the bovine pancreas. But in 1949 they developed a method to chemically synthesize insulin, which means that you didn't have to go to the slaughterhouse anymore. In 1958 we have the first protein structures being unraveled using NMR and X-ray diffraction. At that point we started to get an idea of what proteins really look like, so their quaternary structure. Not the secondary structure or the tertiary structure, but really the quaternary structure, to see how a protein works. In 1964 we have crystallographic electron microscopy.
So it's very similar to X-ray crystallography, but here you use an electron microscope. In 1967 there's the first protein structure which has been determined by X-ray, and then in the 1970s we see the protein database, the Protein Data Bank, being established. The protein database is nowadays the database where, if you're interested in proteins and protein structures, you can find literally all of the information collected in the last 50 years regarding proteins and protein structure. In 1975 we got a new method, and this new method is called 2D gel electrophoresis. We will go into that in much more detail. Basically, it allows you to take a mixture of different proteins, put it on a gel, and see which proteins are there. It allows you to separate proteins based on their mass, but it also allows you to separate proteins based on their charge, so on the pH at which they carry no net charge. And this allows you to take, for example, 100 people, take blood from them, separate out the proteins, and see if people who are sick have a certain protein while others don't have it, or if there's a difference in the abundance of certain proteins. It's still one of the most commonly used methods when you want to look at proteins and proteomics. So 1975 was the discovery of 2D gel electrophoresis. In 1976 we have the first visualization of a protein structure on a computer, using of course the protein database. This is the first time that someone created a 3D rendering of a protein. And in 1981 we have ribbon diagrams being invented. Ribbon diagrams are a way to draw a protein without having to go into too much detail, but still having an idea of what it does. And then of course in 1999 we have the ribosome structure.
So one of the biggest proteins in a human cell, or in more or less any cell. At that point we were able to take something which is 15,000 amino acids long and make a structure of it. And of course in the last lecture we talked a lot about the structure of the ribosome, and this has only been known for the last 20 to 30 years. So all of this knowledge that we have nowadays, and take for granted in a way, is relatively new, and is something that we only discovered 20 to 22 years ago. We knew that the ribosome existed and we knew exactly what it did, but what it looked like was only discovered in 1999. All right, so that's it for the first hour. I will stop the recording.