So, is that all there is? The whole idea of the Human Genome Project was that once I know the sequence, I am more or less home. It is like knowing the fundamental laws of physics: everything else is a matter of calculating things out, because all the information there is to know is, in principle, encoded in the sequence. But as often happens, whenever we think we have cracked a problem and have all the ingredients we need to understand something, it turns out that we were wrong and had severely underestimated the complexity of the problem. Once the sequencing was done, we found out that genetics, the sequence information, is not all there is. There is something above and beyond genetics, and it has been given a fancy name: epigenetics. It comprises a few different aspects, which I will explain a little. Basically, how do you control the genome? That control sits at a level higher than the genome: the epigenome. So, I will talk a little bit about what comprises this epigenome.
So, here is the thing that we were talking about. If you think about the number of base pairs in the human genome, we have some 6 billion base pairs, so 6 × 10^9 base pairs. If you stretched it out, it comes to the order of meters, around 2 meters, and you need to package it into a nucleus whose size is of the order of microns, say 6 to 10 microns, so about 10^-6 meters. So you take this meter-long object and you package it into a volume whose radius is of the order of 10^-6 meters. And remember that you cannot do it randomly. Let us say my nucleus is a sphere. Here is my sphere, and I need to put my DNA into it. I could say: well, I do not care, I will squish it in as much as I can.
I will squish it until all of this meter-long string is inside this nucleus of size 10^-6 meters, and I do not really care how I am squishing it in. All I need to ensure is that it goes in and stays in. But that does not really work for the cell, because the cell needs to read information from this DNA. Inside the DNA there are genes, and these genes code for proteins. So when a cell needs to manufacture a certain protein, it needs to know where the corresponding gene is, so that it can go there, read that information, and produce the appropriate protein. If you package your DNA into a million different cells in a million different ways, the cell has no way of knowing where exactly it should go in order to read a particular piece of information. So this arrangement cannot be random; there needs to be some sort of structure to how you take this object and package it inside the nucleus.
So, here is where the physical nature of this string becomes important. It is not just the information content, which is the sequence; it is also the fact that this information content is embedded in a string, which is a physical object, and that physical object is packaged inside a physical volume. How you do that packaging also has implications for the regulation of the properties of the cell itself. If the gene that you want to read is hidden somewhere very tightly packed, then the cell might have difficulty accessing it. Let me say this a little better. We do not know the full answer to this: we do not know how the cell packages the genome. We know that there is some algorithm, but we do not know what it is. But let us say the cell has somehow done it, and here is my whole genome.
This is how it looks: some parts are a little loose, some parts are a little tight. So here it is coiled up a lot, and here it is coiled up loosely. Let us say this is some cell type, say a heart cell. It turns out that if you take a different cell, let us say a liver cell, its packaging is different from that of the heart cell. For example, this part may be packaged very tightly, then you may have very floppy bits, and then you may have a very tight bit over here again. What is the difference between these two? The difference is that the genes for the proteins that the heart cell requires in order to function would be loosely packed, so that the cell can access them more easily, while the genes for the proteins that the heart cell does not require would be tightly packed. Remember that the DNA as a whole codes for all the possible proteins that your body can produce, but not all proteins are going to be required by all the different cell types. Some cell types will require some proteins, and other cell types will require different proteins. So a heart cell might require a protein which is encoded here and a protein which is encoded here, but none of the proteins which are encoded here. Similarly, a liver cell might require proteins which are encoded here, here, and here, but not over here. It turns out that the packaging depends on the particular type of cell you are talking about. In fact, even for the same type of cell, you can have different sorts of packaging at different times. So it is not a fixed algorithm; it is a dynamic algorithm that changes depending on the type of cell and on the state of the cell cycle, whether the cell is dividing or resting, and so on. And we do not know how the cell does it. So it is a very important open question.
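To get a feel for the packaging numbers quoted in the lecture, here is a rough back-of-the-envelope sketch. The 6 billion base pairs and the micron-scale nucleus are the lecture's figures; the rise of about 0.34 nanometers per base pair is a textbook value for B-form DNA that I am assuming here, and the function name is just for illustration.

```python
# Figures quoted in the lecture
BASE_PAIRS = 6e9             # base pairs in the human genome
NUCLEUS_DIAMETER_M = 6e-6    # lower end of the quoted 6 to 10 micron range

# Assumption (not stated in the lecture): rise per base pair of B-form DNA
RISE_PER_BP_M = 0.34e-9


def contour_length_m(base_pairs, rise_per_bp_m=RISE_PER_BP_M):
    """Length of the DNA if it were stretched out end to end, in meters."""
    return base_pairs * rise_per_bp_m


length_m = contour_length_m(BASE_PAIRS)
ratio = length_m / NUCLEUS_DIAMETER_M

print(f"stretched-out length: {length_m:.2f} m")
print(f"length / nucleus diameter: {ratio:.1e}")
```

Under these assumptions the stretched-out length comes to the order of meters, and it is several hundred thousand times longer than the nucleus is wide, which is why the packaging cannot be left to chance.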
So, this is one level of epigenetic regulation, which is the packaging of the chromosome itself. There are other levels, such as histone modifications like acetylation, which I will talk about as we go on. So the sequence is not all there is; there is other information. There is an information-theoretic content, which is the sequence, but there is also a physical content, which is how this physical, stringy object is itself packaged.
Here is how that goes. I have my DNA double helix. That is a very bad double helix, but whatever. This double helix wraps itself around a protein complex which is called a histone complex, or histone octamer. It contains 8 histone proteins, and the DNA comes in and wraps itself around it. Let me see if I can draw it. It wraps itself around the histone complex to form a sort of beads-on-a-string structure. So if you zoom out and look at the DNA, you will see that it is wrapped around these histone octamers. This complex as a whole, the DNA plus the histones, is what is called a nucleosome.
In this animation we will see the remarkable way our DNA is tightly packed up to fit into the nucleus. Here is my DNA double helix; it wraps around this protein complex, the histone complex, to form this beads-on-a-string structure, and this structure, the DNA wrapped around the histone complex, is called the nucleosome. Up till here is what we know; after this it is all conjecture. The idea is that these nucleosomes coil on top of each other to form a very thick fiber, called the 30 nanometer chromatin fiber, which we now actually think is not correct, but anyway. So from here on everything is conjecture, and then this fiber somehow folds further.
So till there we at least have a conjecture; beyond that we do not even have a conjecture for how exactly it happens. But somehow, at least that was the idea, this 30 nanometer object coils on itself to form the final folded structure of the chromosome, and then during cell division it organizes into the familiar X shapes that you might remember from your biology books. When the cell is not dividing, this is not how the chromosomes look; they then look like a sort of random noodle soup inside the nucleus. So it is an open question how these chromosomes are actually packaged inside the nucleus. We understand it up to the level of nucleosomes, but not beyond that.
So, moving on from DNA. What the DNA sequence does, as we know, is code for proteins. You have this whole process of transcription and translation, what is called the central dogma of biology: from DNA you get RNA, and from RNA you get proteins. These proteins are in turn written in another sort of language, which is the language of amino acids. Just as the genetic code is written in an alphabet of 4 letters, A, T, G, and C, the protein alphabet, which is the amino acid alphabet, is made up of how many amino acids? 20. So it is a slightly more complicated language, though still not as complicated as English; it has only 20 letters.
There is a very nice database called the Protein Data Bank, where you can actually go and look. By sequencing a protein I mean, exactly like sequencing the chromosome, finding out the sequence of amino acids that make up a particular protein: tryptophan here, methionine there, the exact amino acids that make up the protein.
So, whenever people manage to do that, and manage to find the structure of a protein, they upload it to that database, the Protein Data Bank, or PDB. It is an open repository, which means that anybody can go and play around with it. Whenever you find a new structure, you deposit it there, and anyone can then download it and see what the structure of that folded protein is. For example, here is human deoxyhemoglobin, which is in the blood; in the second panel is human insulin, which regulates your blood sugar; and the final one is human kinesin, a motor protein that transports stuff inside cells. I am showing these because, as we go along in the course, we will talk a little bit about all of these proteins and how to model them using different techniques. Till now, or at least till last year, which is when I downloaded this slide, people had deposited 41407 distinct proteins; I am sure that number is even larger if you go and look now. So this is what I mean when I say that there is a huge amount of data that you can play around with.
When you have a protein, often the structure of the protein determines its function, and the sequence of the protein determines its structure. So if you have a sequence, you might be able to take a guess at what the three-dimensional folded structure of the protein will look like, and from there you can see how the protein performs whatever function it is supposed to perform.
So, here are the 20 amino acids. When we say that the DNA codes for RNA and the RNA codes for proteins, you can ask: how many bases do I need to code for each individual amino acid? I have four nucleotides.
Somehow I must combine these four nucleotides in order to specify 20 amino acids, and then from these 20 amino acids I will build whatever protein I want to build. So what should the unit be? That unit is called a codon. How many nucleotides should a codon contain so that an alphabet of four letters can encode an alphabet of 20 letters? Is the question clear? Suppose a codon were one nucleotide, so that each nucleotide coded for a specific amino acid: if I read an A in the sequence, that means I take some amino acid, let us say methionine. If my codon were one unit, how many amino acids could I make, given that I have four nucleotides? Four. So that does not work. If my codon were two units, so that A and T together coded for an amino acid, how many amino acids could I make? 4 × 4 = 16. That also does not work. If I had three in a unit, then I could in principle code for 4 × 4 × 4 = 64 amino acids. That is at least larger than 20. So three is the minimum number of nucleotides you should have in a codon in order to encode an alphabet of 20 amino acids. And indeed that is what biology does: the codons that code for these amino acids are made up of three bases.
So, here is the way you should read this chart. Take glycine, which is at the top over there (I will make sure I have a pointer next time). You start from the inner circle, which is a G, then the middle circle, which is again a G, and then the outer circle, which has four different options, U, C, A, and G, which means that all of these code for glycine. So the inner circle gives a G, then again a G, and the four codons are GGU, GGC, GGA, and GGG.
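The counting argument can be checked in a few lines. This is only a sketch of the arithmetic; the function name `min_codon_length` is my own and not anything standard.

```python
from itertools import product

NUCLEOTIDES = "UCAG"      # the four-letter RNA alphabet
N_AMINO_ACIDS = 20        # size of the amino acid alphabet


def min_codon_length(n_letters, n_targets):
    """Smallest k such that n_letters**k >= n_targets."""
    if n_letters < 2:
        raise ValueError("need at least two letters to encode anything")
    k = 1
    while n_letters ** k < n_targets:
        k += 1
    return k


# Number of distinct codons for codon lengths 1, 2, and 3
for k in (1, 2, 3):
    print(k, len(NUCLEOTIDES) ** k)

# Enumerate every possible three-letter codon explicitly
codons = ["".join(c) for c in product(NUCLEOTIDES, repeat=3)]
print(len(codons), min_codon_length(len(NUCLEOTIDES), N_AMINO_ACIDS))
```

One and two letters give 4 and 16 combinations, which fall short of 20, and three letters give 64, so three is the smallest codon length that works.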
So, because you have 64 possible combinations and you are only making 20 amino acids, you have redundancies built into the system. It is not a uniform redundancy: you will see that some amino acids have many codons and some have fewer. For glycine you have these four codons, and whenever the ribosome sees any of them it is going to produce a glycine. Similarly, the next one on top over there, glutamic acid, has only two codons. There are some which have a single codon. For example, this one at the bottom, methionine, has a single one, which is AUG. So methionine does not have any redundancy; there is a unique codon which codes for it. AUG is often called the start codon, in that protein translation starts whenever you see this sequence. You can have some arbitrary sequence, GGC, then AUG, then CC, whatever. Whenever translation happens, the machinery keeps on reading until it finds this AUG sequence, which tells it that this is the position from which it should actually start recording information, and then it starts reading the sequence and producing the appropriate protein. I think there is one more which has a single codon, tryptophan over there; everything else has at least two codons that code for it.
And just as there is a start codon, there are also stop codons, three of them in fact: UGA is a stop over there, and then there are UAA and UAG. Whenever you see one of these codons you stop reading and say that this is the end of my protein; I am not going to read any more, and whatever comes next is going to code for a new protein. So you have an AUG, then a lot of sequence in between, and then something like UAA. So this is my start and this is my stop.
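The uneven redundancy can be tallied directly from the standard genetic code. The sketch below builds the codon table from the usual compact 64-letter string (bases ordered U, C, A, G, one-letter amino acid symbols, `*` for stop); the variable and function names are mine, chosen for illustration.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """All codons that map to the given one-letter amino acid symbol."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


# How many codons map to each symbol (amino acid or stop)
degeneracy = Counter(CODON_TABLE.values())

print("glycine (G):", codons_for("G"))
print("glutamic acid (E):", codons_for("E"))
print("methionine (M):", codons_for("M"))
print("tryptophan (W):", codons_for("W"))
print("stop (*):", codons_for("*"))

single = sorted(aa for aa, n in degeneracy.items() if n == 1)
print("amino acids with a single codon:", single)
```

Glycine comes out with four codons, glutamic acid with two, and methionine and tryptophan are the only amino acids with exactly one codon each.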
So the machinery is going to continue reading until it reads the stop. Then there will be many things in between, and then the next coding region will start with another start codon, AUG, and end at another stop codon.
Yes? What is U? Thank you. You will see that instead of A, T, C, and G, in this chart I have A, U, C, and G. That is because in the DNA it is A, T, G, and C, but when the DNA gets read into RNA, thymine is replaced by a close analog called uracil. It is like thymine, except that it occurs in RNA. So wherever the DNA sequence has a T, the transcription machinery puts a U in the RNA. So what is the central dogma? You have DNA going to RNA going to proteins. The DNA has A, T, G, C; the RNA has A, U, G, C; and the RNA gets read in three-letter codons to produce the appropriate amino acids, and from there the protein.
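The whole chain, DNA to RNA to protein, can be sketched as a toy program. This is a deliberately simplified picture under stated assumptions: the input is taken to be the coding strand, so transcription is just replacing T with U; reading starts at the first AUG and stops at the first in-frame stop codon; and the example sequence and function names are made up for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to RNA: every T becomes a U."""
    return dna.upper().replace("T", "U")


def translate_first_orf(rna):
    """Read three-letter codons from the first AUG up to the first stop.

    Assumes the RNA contains only A, U, G, and C.
    Returns one-letter amino acid symbols, or "" if there is no AUG.
    """
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":      # stop codon: end of this protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example: GGCC | ATG GGT GAA TGG | TAA | GC
dna = "GGCCATGGGTGAATGGTAAGC"
rna = transcribe(dna)
print(rna)
print(translate_first_orf(rna))
```

In this example the leading GGCC is skipped, reading begins at AUG (methionine), continues through glycine, glutamic acid, and tryptophan, and ends at UAA, so the result is the four-letter string MGEW.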