 For the last stretch today, we'll switch the mode a little bit. I walk you through the code that you can find in the code SNPs file, which has been updated and uploaded to the repository. So if you click on your version control icon here and then click on pull branches, you should be getting the latest version of code SNPs. And then that's, you can find it in your files pane and access it. So what we've done, so basically going along this here will help us remain in the same state and it will be easier to download this. Any problem with pulling that data? Yeah. What's the error? It's asking to merge. Okay. Allow it to do that. Can you? No. Okay, so do the following. This is in the R folder, right? R folder. Click on the read fast A version that you find there, which is probably the one that you wrote. Click on rename and call it my read fast A dot R. So click on rename, call it my read fast A dot R. Once you've done that, then pulling the branch should work. Yeah, now it's there. No, that was there. Okay, does it work now? Yeah. Good. Okay. So what's your error message? Yeah, local change to which one? Code SNPs dot R. Okay, so apparently you modified code SNPs dot R and saved that. So there will be local changes. Do the same thing. Either delete your local version of code SNPs dot R, or if you have changes that you want to keep, then rename it to my code SNPs dot R. Once the file doesn't exist anymore, or it has to copy to a different name, then this can be downloaded from the repository. Everybody there now? Awesome. All right. So we're in code SNPs dot R. The first thing we did was we downloaded the GRCH gene model data from Biomart and put that into a data frame called gene models. Then we extracted the information for ENST, whatever 371905, and selected only that. Then the transcripts we found weren't ordered in ascending coordinates, so we fixed that by getting a vector of the correct ordering and then reordering the rows in our GNS2 gene model in the correct ordering. Now the next task is to build a data frame for GNS2 protein annotations. So we can do this in very, very many different ways. The one that I thought would be useful to have looks like this. Actual coordinates will have the actual nucleotides, codon positions, amino acids, and the position of where we are in the codon. This is the common information, and the row is the first nucleotide, the second nucleotide, the third nucleotide, the fourth nucleotide, so going through this nucleotide by nucleotide. So we'll have something like, I know, 1200 rows in that, each with this information. This is the number of the genomic coordinates, so it'll start from the start of the first exon, and it'll just increment in one until the end of the first exon, then it'll skip it. A lot of numbers, so it'll then skip to the start of the second exon, and so on, and that makes it very easy since we basically then have our coordinates defined. We can just use that as an index into mySeq37 and pull out the nucleotides. So we don't need to do additional calculations, so basically translating our gene model into a list of indices for the coordinates allows us to pull this out directly from mySeq. Of course, applying the offset. Then we can translate that into codon positions simply by dividing it by three and taking the ceiling of that, the floor of that. We can then take these nucleotides and translate that into amino acids by applying the genetic code that we find in the Seq in our package, and then we'll put in a vector for convenience that we might need later, of simply 123123123, which gives us the codon positions. So that's the goal here. So the first thing is we'll do a little function that translates a gene model to a vector of coordinates. So we initialize a vector, then we'll go in and for each row of the gene model, we'll take the number we find in the start, we'll take the number that we find in the end, we'll make a little, a sequence of integers from start to end, and we'll concatenate that to the vector that we already have. So that's a little function I call mod2coordinates, just a small utility. The offset we need is 57,410,000, and now our first column in the genus to protein data frame is simply called coordinates, and mod2coordinate, the little function we have up here, of the gene model which we've defined before. So we execute this line 148, we have genus to protein, right? So this is where it starts 57,466, and it goes on and on and on until it gets to the next exon and so on. So this is simply a list of integers which represent the genomic coordinates for the nucleotides in the coding sequence. So once again, this column contains the genomic coordinates for the nucleotides in the coding sequence as determined by the gene model. So this has 1143 nucleotides. So very quick sanity check, 1143 divided by 3 is an integer. That's reassuring. The coding sequence ought to be divisible by 3. So after we've put everything together from our coding sequence from the gene model, we actually get a number that's divisible by 3. That makes us happy. All right. So next, the nucleotides for that. Well that's just the first column which we've just defined, the coordinates minus the offset taken out of mySeq37, which we've already downloaded. So we defined the second column here, ATGGGCTGCTC. What do we expect in the first three nucleotides? ATG, the well-known start codon. So we get that. That's also very reassuring. We're kind of on the right track here, it seems. What do we expect as the last three nucleotides? Which is TAA, TAG, TGA, one of these three. So do we have that? Let's see. We have a TAA here as the last three. Yeah, again, that looks good. The third column which we want is codon positions. And we simply make a little vector, 123, and we assign that to the column dollar codon positions. Now the column dollar codon positions is much longer than 123, right? It's a data frame with 1143 rows. So that's a much longer vector than this here. Now if we do anything to a longer vector with a shorter vector, our implicitly uses vector recycling. So it takes the shorter vector and just repeats it as often as it needs until whatever we need it to do with the longer vector is satisfied. This can be dangerous because sometimes we're doing things with two vectors of unequal length because we're making a mistake and this will just silently pass. So you have to be careful. If vectors are of unequal length, if you're doing something from A to B, if B is longer it will be truncated, if B is shorter it will be recycled until A is satisfied. Here we exploit that and simply giving it the vector 123 into these codon positions will repeat 123, 123, 123 over and over again until the whole column is filled. The header of course 123, 123, 123. The tail of that is 123 and it ends as expected, still identifying the correct codon here. So there wasn't any frame shift or deletion in there. All right. Now to translate this, we use a function from library seek in R. You have already installed seek in R on your computer. If you haven't, if you haven't done the pre-work tutorial and you haven't installed seek in R, the first command that you need to emit here is install packages seek in R. Note that when you install packages, you put the package you need into quotation marks. If you load it as a library into your current session of R, you don't use the quotation marks. At that point seek in R is an object, an R object which is known to the system. So issuing the command install packages seek in R, accesses cram the comprehensive R archive network where many, many of the R packages are stored, it downloads it, it puts it into the right location on your computer and it becomes available. Moreover there is a team of volunteers working to keep cram updated and to make sure all the packages work and that everything that these packages do is as described and no malicious code is inserted. It really can't be overstated how useful the infrastructure of maintaining and distributing packages is for R and how well done this is. No software project comes close to what R is doing, no open source software project comes close to what R is doing with cram and with bioconductor. I've just recently heard of problems they are having with Node.js with JavaScript virtual libraries silently installing Bitcoin miners on people's machines. I am not surprised. JavaScript is, software packages is a mess and when we download software packages from Python projects or from JavaScript projects or whatever, we trust that they do what they want and they don't install ransomware on our computers. When you download things from GitHub, that's not guaranteed. So when you run your installation from my project directory, there is no guarantee that that didn't install ransomware actually and suggested that to me. We should be doing that. She's joking and I'm not doing that. So if you actually do find ransomware on your computer after this workshop, it's not me. I promise it's not me. Maybe you installed something with JavaScript instead. Anyway, so installing packages from cram just works and it can't be overstated how useful this is in the bioinformatics landscape. So install packages, seek in our library, seek in our, of course, you only need to install the package once or once every few months or so when there are updates. But you need to run library whatever in every single function. Having seek in our gives us a function called translate that translates nucleotides into amino acid sequence by default in frame zero, i.e. starting with an ATG in the first position. In the forward sense, you can tell it to translate backwards and you can tell it to translate in different frames, whatever, it's flexible about that. So if we take the nucleotides column that we just extracted from my seek 37 and translate that into a vector called x, it starts with mg, cl, g and sk and so on. Is this correct? How would we know? Well, for example, we can look back where we looked at the gene in uniprot that we accessed, which has the sequence somewhere, mg, cl, g and sk. Is that what we got? mg, cl, g and sk. Again, thankfully, we're getting the correct information. So the nucleotide sequences which we picked up with our gene model from genomic coordinates actually seem to be giving us CDS coding sequence that produces the right amino acid translation. Okay, so let's put the amino acids into our protein annotations. Here, this is the amino acid sequence. We can paste it. It looks correct to me. One thing that would be suspicious is if we have stop codons in that amino acid sequence, it probably means that we're looking at a frame shifted sequence or an incomplete sequence or something like that. But this goes on for quite some time. And it has no stop codons except the one at the end, which is good. We'll build a column, $AA, to hold amino acid information. But we only want that in the first positions. We'll put blank characters into all the other positions. So we do a similar thing to which we did before with a short vector here. Our short vector is just a single scalar, i.e., a blank. Now, we have a column, $AA, but every single element in here is just a blank. And we'll overwrite these blanks for position one with the amino acid sequence. So that's easy to do. We build a selection, which gs2 protein codon position is one. So this equals gives me a logical vector, the function which gives me a vector of indices for every position in which my logical vector is true. So if I build this selection here, I have 381 positions of one, four, seven, ten, thirteen, and so on. Again, that's what I wanted. One is the first codon position of the first codon. Four is the first codon position of the second codon. Seven is the first codon position of the third codon, and so on. And they're 381. And I can now fill these 381 positions with the result of translating my nucleotide code, which then adds m, g, c, l, g, and s, k, and so on into the correct slots of our little protein annotation table. So we have coordinates, nucleotides, codon positions, amino acids. What else could we want? Right, so we'll also add the actual codon index from one to the end. And the way we do that is we simply take the index going from one to the end of the file and divide that by three. And then we take the first position and subtracting one, dividing by three, and adding one again. So, here we go. So this gives me a resulting vector which looks something like this. Three ones, then three twos, then three threes, then three fours, and so on, and so on for a total of 1143 observations. We could have done something similar with paste and repeat, but just calculating this in this way is easy. So this gives us the codon positions here. So first codon, second codon, third codon. And that's nice because now we can basically take a nucleotide coordinate and we can immediately tell what the codon is that this belongs to in the sequence. Even if the nucleotide, the genomic coordinate is split across exons and there's thousands of bases in between. Again, looking at the end of this all. The last codon, 381, is actually a stock codon and that tells me there's 380 translated codons in the sequence here. All right. So that's protein information. If you've executed that, you now have this object genus 2 protein with that protein information. Now let's pull out the protein mutations according to some specifications. First of all, we want a mutation object called genus 2 mutation and of that we want all the rows from these gene mutations where the positions fall into the genus 2 coding sequence. So this is where in comes in handy because we can say which of the mutations start codons is in the genus 2 protein coordinates. So I could have iterated over that and just taking every single codon and then checked throughout the entire file of my protein annotations, whether that matches. But since I have the protein coordinates in my little file here and my little r object here, I can just do it with a single in statement. And this is why I decided to put in a column of nucleotide of genomic coordinates in the first place to make this selection easy. So we select rows in the intergen data, GNAS mutations for which the position is in the protein coordinates. And then we select mutations. We create an object which contains only the rows of these mutations, i.e. GNAS mutations with that selection as a row selection. So we have 270 mutations which fit the bill, which are somewhere in the protein positions and where the start coordinate actually is part of our coding sequence. Now I've written a little loop that confirms just by visually plotting them or printing them out that they're all correct. So each and every single one of these positions has the same annotation in the mutation data that we find in our little amino acid data. And then we can answer things like, how many of each do we have at each position? Are there codons that are affected more than once and so on? So this is the data we need. Now we have mutation data that is annotated to coordinates and we have protein data where we can match the annotated coordinates to actual codon positions. And from that, we can start building our little lolly plot. So what categories of effects do we have? So if we look for the unique here, we've removed the splice things. So we have missense variants, stop gained, frame shift variants, and synonymous variant. So if we plot that, we would like to have our plot show them with different colors. But since we're smart about choosing colors, we're not going to make them red, green, yellow, magenta, and cyan. We're going to choose colors that kind of make sense, where frame shift and stop gained kind of has a similar color because it has the same effect of a truncated mutation. Missense variant maybe has another color which is not too dissimilar because it's also usually something that affects the protein adversely. And synonymous variant should be a very cool and calming color because we have a variant there, but nothing much actually happened. Even though you can't guarantee it, sometimes synonymous mutations do have effects. And we also need a color. So that brings us to the question of how do we actually choose smart color palettes? There's different ways of doing this. One site that I found quite useful is color.adobe.com. And this allows us to find different colors. We see color patches here, and we can estimate how these colors work with each other in different ways. There's monochromatic choices, triadic choices, where we have three very distinctly different colors. Complementary choices, compounds, shades, and other customs. And the nice thing is we can start selecting here. So what I actually want is four different colors. I forget how to restrict it to show me only four here. Oh, I need a custom. And I can individually change the colors by moving the color around on the color picker and changing the hues, the saturation, or sorry, the red, green, and blue channel, darkness, and so on. For each of these, then, I can simply click into this field and copy and paste the value into my script where I can interpret this as a hex code. So things that I've tried out that kind of work for me give a frameshift variant, missense variant. But of course, you could have different choices for that. Now, this is a completely free choice in which you can get very creative. There are other ways to choose colors. And the data science community often uses something called Color Brewer. And that's actually a very well done site. It's intended for maps, but it also works quite well for the kind of analysis that we're doing. The important thing is that we also have choices that we often don't consider very well. So let's pick a diverging color scheme of something like that. So now we can see very easily assess large scale color differences. So diverging color schemes are useful for things like hot and cold or high and low. So where we have two things that are on and screams values. Qualitative colors are for things that aren't quantitatively related. But we can do colorblind safe color schemes here. So this is a color scheme that would be useful for people who are colorblind, something that's not sufficiently usually addressed in the data science community. Print friendly, so things that work well if we printed on a laser printer or that work well on a photocopier. So because the photocopier here would still translate this into an intermediate value, this one into a dark value and this into a light value, the differences between these three regions could still be seen on a photocopy. So a lot of practical intelligence has gone into building these color brewer values. You can choose the number of data classes up to 11 or 12. So things can become very colorful. And you get these pellets and you can download them. So for our case here, we might want to choose something like a four-value table of a qualitative strain, but we wanted to actually have three of them similar and one of them dissimilar, so that would be an argument for actually using a custom color frame. So there are several choices for choosing colors smartly, getting the hex codes and then bringing them in. One is coloradobe.com. Another one is the color brewer site. And incidentally, there's also color brewer R package from which you can just access and download the pellets. And we need another color to define our protein. So to work with that, on a lolly plot, I built a little data frame which I call EFF for effect. And that has three columns. One is the actual label of the effect, i.e. Frameshift variant, Missense variant, Stopgained and synonymous variant. The colors that I chose for that, the heights where I want my little circles to appear on the block, on the plot later on and we'll get to that. And of course, the labels. And often when I need to select things from categories from a table, from a data frame, it's very useful to have whatever ID I use here, i.e. in this case, the string of the name, to have that as row names because that then allows me to choose things by row name, which is good for the code. So we define that. Let's verify our colors. Incidentally, I have them not as raw color values, but I've defined an alpha channel to have them slightly transparent. And to verify the colors, we can just plot them as a bar plot, add the labels there. Yeah, so we can check. So these are the colors that are going to appear on our plot. You could easily change them. Now to actually plot them, we want to figure out which mutation effect we see on which amino acid. And that's something we simply do by building a matrix of mutation effects. So the matrix has the number of positions in our amino acids as rows and the number of effects that we have, i.e. four different effects as columns. So our positions in which we observe mutations are 49, 60, 83, 95, 127, 141, and so on. So not all amino acids are affected, so we don't need to test all of them. Only some of them are affected, and we select them here. We define numeric values as the length of our positions vectors, i.e. how many different amino acids are affected, times the number of effects, and this gives us a matrix. This is what the matrix looks like. So initially, everything is zero. We have a column for frameshift variant, missense variant, stopgain and synonymous variant. We have a row for amino acid 49, amino acid 60, and so on for all of the affected amino acids in our protein. Now, we can iterate over our mutations. For every single mutation, we just determine what is its effect and which amino acid is affected, what is its effect and what amino acid is affected. And then we can increment the value which we find in this matrix by one, whenever we have one. And after we've done, we're done with that, this mutation table contains all of the information that we want to plot. So here we go. For each row in the GNAS mutations, we select which protein coordinate this start position falls into. We translate the position into a character, we get the effect, and then we increment the correct value in our mutation table by one. So then we have this table here. 12 missense variants in position 49, 12 frameshift variants in position 60, 14 missense variants in position 187, 8 synonymous variants in position 170, 7 missense variants here and here and here and here. So I'm not actually entirely sure how valid this data is. It seems to me that there's a lot of double counting there. So in a way that transcripts were double counted because many of these are multiples of 10, even though not all of them. Nevertheless, this gives us at least a relative sense of how often each position is affected by the mutation. So then we can plot this data as a lollipot. So remember the task of the lollipot is to put a circle somewhere above a bar that symbolizes the protein and we kind of define three levels at which we plot circles, which we've already defined here in our effects. So the first, the frameshift variants should be at a height of 1.7. The missense variants should be at a height of 2.3 and so on. And the synonymous variants should be at a height of 1.2, i.e. closest to the protein sequence. So how do we plot circles? Well, how do we draw circles on a plot? Now base R does not actually have a plot circle function even though you can easily plot a circle by just defining a number of points and using sine and cosine functions to figure out what the coordinates should be. There's packages that have a lot of functions for auxiliary things to add to plots. One of them is the plotrix function and plotrix has a good function draw circle which you could use in principle. So we could use a little draw circle for every single one of the mutations that we see. But there's a caveat to that. There's a little snack. Draw circle actually only shows circle if we define the aspect ratio of our plot to be 1, i.e. if the x-axis is shown exactly as long as the y-axis, i.e. the plot is a square, then draw circle will draw circles. If the x-axis is longer, it will draw a nice ellipse which has the circuit coordinates but is not shown as a circle. Now what we want instead is to show circles on the plot regardless of what the axes are because sometimes we want to draw, pull our plots out to be very long or very tall and that shouldn't be deforming our plot. So we can't go with actually drawing the correct drawing an actual circle on the plot without doing a lot of coding in the background to compensate for the distortion which is induced here. So what we'll do instead is we'll use one of our plotting characters. So when we do scatter plots with R, there are a number of plotting characters available, something like this here. So there are plot symbols. In plots they're specified with the parameter pch. pch1 is the default which is just a small circle here. pch2 is a triangle, pch3 is a cross and so on. 16 is a filled circle, 20 is a filled small circle but importantly we have 21, 22, 23, 24 and 25. These are special because plotting characters 21 to 25 allow you to specify the fill color separately. So these can be filled with different colors and that's what we want here for our lollipot. We want circles that we fill with different colors. If we use plotting, circular plotting characters, they always appear circular regardless of what the aspect ratios of the axis are. So let's have a look at that. We just plot the numbers 1, 2, 4 on the x-axis on the same level on the y-axis using plotting character 21, defining the limits of the plots, showing no axis here and scaling the size with the parameter cx from 1 to 8, giving the symbol a gray border and coloring the symbol by the colors which we've defined previously for our effects. So here we have scaled circles. So if we distort the frame, as you can see, initially they change their shape but then when the plot is recomputed after every change of the aspect ratio they become circles again. So these things remain circular with different aspect ratios and we can scale them with giving them different values for cx. So in this case here the values are 1, 3.3, 5.67, and 8 and we can color them with values that we take from somewhere. We can either compute the values or we can take them from wherever we are stored. We have them stored or we can use single values that we take from the list of colors. So something like peachpuff, papaya whip. What was another one? Oh, we looked at misty rose, firebrick, so we can define different colors in that vector. Now the size that our circles should have should depend on the number of mutations we see. So we write a function to calculate plot symbol sizes. Number of mutations to cx values where we enter the number of mutations and some nmax value which is the largest circle that we want to see in our plot. And with that we simply return the value of mutations. We saw, say, 14 or 38 mutations and we need to rescale that to something in a reasonable value to appear on the plot which we do by this little rescaling function. So we're all set to go. We know how to work with circles. We've counted our mutations and now we just need to put everything on a plot. So how do we do that in principle? Well, in principle we draw an empty plot which has the correct size and then add elements to the plot. So usually when we plot, we plot a scatter plot or we plot some kind of graphic. But here we just use the plot function to set up an empty canvas and then we use auxiliary functions like lines, points, rectangle, polygon, segments or text to draw on that canvas whatever we want to see. Then we should also add axes and we should add a legend and a title to the plot. So the first task to plot is an empty frame simply to set up our coordinates. So we plot just two numbers, zero, zero, it doesn't matter what it is. We define the plot type to be n, none. So nothing is actually plotted when we do that. We define the x-axis limits to go from zero to the largest protein codon index. We define the y-axis limits to go from zero to the number of effects plus two. We define an x-label, genus two sequence position. We define a y-label of nothing. We define a main title and we switch axes off. So this is our empty canvas. That's what it looks like. Just a title this here and invisibly here we have a frame of coordinates that cover the length of our protein in x-axis and give us enough space for our lollipot in the y-axis. Then we can draw a rectangle for the protein. Simply rectangle going from one at a lower coordinate of zero to the maximum codon to a higher coordinate of 0.15. And we give it a color that we've defined before as a kind of a bluish kinge for our protein. So here's where our protein goes. So this now defines the coordinates from one to the end of our protein. In numbers here this corresponds to the amino acid positions of our codons. And we can put an axis on the bottom. So these are our amino acid positions. Our protein is 380 amino acids long. So you can see how that matches to these coordinates. And then we plot the mutations. Something that we need to scale them correctly is we need to know the maximum number of mutations which we see, which is just the maximum number in our mutation table. And then for each row in our mutation table and for each column of our mutation table, if the number of mutations is greater than zero, i.e. we've actually observed mutations in that position, i, j of our mutation table, then we draw something. What we draw is we define which row name we're going to be using. So what the position is, the X position, we define the Y position of whatever we're plotting from the effect table. We define the color that we're using. Then we draw a line along the X position of the effected amino acid going up. And then we draw the circle with the points command. So points draws individual points like on the scatter plot, but draws them individually. So the points command is here, put it on X, Y, which we've just calculated where it should need to go. Use plotting character 21. Scale it to the result of our N2CX function and use as a background color the color which we have defined before in our effects table. And running that finally gives us our logic plot. So for every single mutation we've now plotted the position, the scale, the color which corresponds to the effects of, for example, frame shift. I think this is frame shift. No, stop. We'll see. Only occur at this one particular position here. Silent mutations are here, and most of the mutations we observe in our coding sequence are missense mutation. And then finally, to top it off, we plot a legend. What these colors actually mean? Plots aren't complete without a legend. So there we go. Yeah, Pascaline. Let's say that you are interested in breaking down the line section of the Y axis into a different portion, but then by parallels. How do you go about that? This here? Yes. So I would like different, so a red line here, a blue line here. The blue line. The blue line here. Yes, and then breaking down into like sections. Oh, on the X axis. Like annotating domains on the protein, right? So that's very easy. Instead of just plotting one rectangle at the beginning for the protein, I would plot two rectangles, but with different colors. So for example, well, I can just do that. We plot an empty frame. So I could... Well, let's do it as an overlay. So here we have this one here, and let's assume we have a domain which goes from position 100 to position 200. So we draw a second rectangle. We go from position 100, slightly below zero, and we go on the right-hand side to position 200. Also make it slightly larger. And the color should be whatever, fire break in this way. So we can simply put rectangles over rectangles over rectangles, and we can add circles wherever we want. We only have to calculate the coordinates for them, and then we can use the function for rectangle and for line and for points to put everything where we need it. We can also add textual annotations with text. So for example, if this is a domain and we want to annotate the name of the domain, let's assume that we put the text at 150.5, and we add the actual text labels, and we can put text there and scale it and move it around. So we're very flexible in actually adjusting our plots, customizing them. We don't need to take plots and then import, print them to a PDF and then import the PDF into Windows draw and then start scratching around in that. We can easily find that. The advantage of putting this all into a script is that it's easily changed again. If you don't like the colors and you have this all in a JPEG that you constructed in Photoshop, it's going to be very hard to change. Here you just change your hex codes and recalculate your plots. All right. Any other questions about data integration? Now, there's so much more which we could have done. Actually, one thing that I was planning to do with you is to work a little more with numeric data. So right now, we've only used essentially text data, sequences and translating sequences and pulling sequences from databases, and that's a large part of our daily bread. But something that we also work with, of course, is data that contains numbers. And I've constructed a little example project. If you open your version, which is an updated version of init.r and source it and then type init. This has constructed a new file, MyNumericData.r, which contains a scenario of working with protein structure data and doing some annotations on the protein structure data. So this code is not as open-ended for you to explore things. It's code like the thing we've walked through in the code snip. It's written for you to read it and experiment with executing this and to code things and to try them out based on the templates which I provide here. So what we do in this code is we load a library by 3D. We load a file, 6AU6, which is the protein structure that was determined for GNAS2. So what you can do when you download that, you can look at it in your favorite molecular reviewer and you can ask yourself, where are these mutations actually found? Which parts of the protein are mutated? But what we do then here is we explore that as a piece of protein structure and we compute a few Ramachandran plots, i.e. plots of backbone angle coordinates. Who here knows what a Ramachandran plot is? Great. Not everyone, I think. Okay, so I'll just demonstrate this briefly. My favorite molecular reviewer is Chimera. I look at protein structures in Chimera. This is a model of 6AU6 in Chimera. It's a guanosine binding protein, so the guanosine molecule is actually bound in this cavity here. And the Ramachandran plot defines the torsional angle of the backbone. So a protein peptide backbone kind of looks like this, N-C-alpha-C-O, N-C-alpha-C-O. That's a single peptide unit for an amino acid. And the peptide bond, the bond between the C and the N with carbonyl oxygen sticking out from the side has a pseudo double bond character. So there's electrons injected into this bond here from the double bond here. So it kind of looks like a pseudo bond, and this means even though it's formally a single bond, it's very hard to rotate it. So this is usually planar in either plus or minus 180 degrees. So it's not freely rotatable. However, the bonds around N and the N-C bond is freely rotatable, and the C-alpha-C bond is also freely rotatable. So to a certain approximation, the three-dimensional structure of a protein is basically defined from these rotatable bonds before and after the C-alpha. And these rotation angles are called phi and psi. So this is the phi and the psi rotational angle. And what's kind of important about them is that due to steric constraints, you can't have all combinations of phi and psi values. Because at some point when we twist this in odd ways, the atoms in the peptide chain start bumping into each other. There's a lot of steric forces that make this energetically very unfavorable. So only certain values of phi and psi values have favorable energies. And something that's really highly non-trivial about that is that if something has a favorable energy in a protein structure, you will observe it more frequently. So proteins tend towards low energy conformations. So if we look at a so-called Ramachandran plot, which plots the values of phi angles against the values of psi angles, the combinations that we actually find are constrained to very defined regions, phi and psi is here. And there's a cluster in this region here which corresponds to... Anybody know that? Pyrochemistry 210. There's a cluster here which corresponds to alpha helical conformation. And there's a cluster here which corresponds to parallel and anti-parallel beta strands. There's a region over here that's actually disallowed for most amino acids because you have steric conflicts with the C-beta atom that bumps into the carbonyl oxygen of the next amino acid. However, if the amino acid has no C-beta because it's a glycine, then this region is allowed and you usually only find glycine residues in this region here. This protein is somewhat different. We have a number of residues here. Some of them are glycines, some of them are not glycines. And once you calculate in our little script which ones are glycines and which are not glycines, you can identify them on this plot here and you will find some of the non-glycines are in loops that presumably have very flexible conformations. Who knows? Maybe this is an experimental error. And the other ones are close to the binding site of the guanine molecule. So there's a strained conformation close to the guanine molecule which has implications for the mechanism of the protein. You actually find that relatively frequently. There's a kind of a bipartite, a bipartite, a division of function, a division of labor in protein structure. Some of the amino acids conspire together to build the structure and some of the amino acids need to adopt particular perhaps strained conformations to fulfill the function. And in this case you will see amino acids that are close to the guanine are in a strained conformation because they just need to somehow bind the molecule very tightly and that comes at the cost of constraining them through the other amino acids that basically provide the matrix in which such a strained high energy conformation is possible. So these are numeric values, Phi values and Psi values just to illustrate what these numbers are. We can analyze them and plot them in different ways and what I've set up here is a plot that's kind of nice. It's a three-dimensional contour plot of the frequencies and essentially this script will teach you how to do something like that. Of course it's driven by Phi and Psi angles but I hope that the code is self-explanatory enough so that you can readily adapt this to something that you might want to do at home. So that's the numeric data which is in your project folder. Now I would kind of normally conclude in saying this is all I wanted to say and thank you very much. This is certainly not all I wanted to say. I would have wanted to say a lot and a lot more. I think though I've said enough you've done a great job in soldering through this working with selections of the most various types which can be intimidating at first. I hope it became a lot clearer as we went along. The one key thing that you'll need to do is to practice this. If you don't pick up coding R in the next two weeks at all again probably most of this will be lost on you. So define little projects to be working on and start coding. Do a little bit, you know, 10, 15 minutes every day. Think of little projects. Try to break them down step by step. Try to solve them into scripts. Try to make your script do something amusing, inspiring, perhaps even useful. Perhaps something that will lead to a project which will later be published somewhere. But practice, do things, and play with it. If you're stuck, unfortunately we won't see your red post-its anymore. But you're free to shoot me an email and I'll see how I can help you with advice. Some of you, the very intrepid ones are still going to be around tomorrow. The other ones I won't see. I think you've been a tremendous class. I've seen amazing progress here from where we first started. And with that I'll officially close this part. Don't run away yet. Anne has certificates and surveys and other things that you still need to do. But as far as I'm concerned, thank you very much.