Welcome to the MOOC course on Introduction to Proteogenomics. In the last lectures, Dr. Kelly Ruggles gave you a very detailed overview of genomics, transcriptomics and epigenomics. Continuing on the same theme, let us talk about SNPs. SNPs are the most common type of genetic polymorphism, and when they are located in the promoter regions of genes they can bring about changes in gene expression. Today's lecture will be given by Dr. Bing Zhang, who is a professor of molecular and human genetics at Baylor College of Medicine in the USA. Dr. Zhang will introduce you to the concepts of DNA polymorphism and how it brings about variability in a given population. The lecture also aims to provide an understanding of how genotype can influence a phenotypic trait. So let us welcome Dr. Bing Zhang for today's lecture.
So I will start with a simple definition of genotype. We know that genotype refers to the unique genetic makeup of an individual organism, and it is encoded in the DNA sequence, we know that, right. So the difference in the DNA sequence between individuals is called DNA polymorphism. If we look at this example, let us say this is a sequence you can download from NCBI, the human genome reference sequence, and it is just a short fragment, right. We can see some individuals, like this one, have exactly the same sequence as the reference, but this one, for example, has a different nucleotide at this position. And if we look at this position, there are two possible alleles: one is G, the other is T, right. So this is a biallelic locus, meaning there are two different alleles. The major allele is G, because it accounts for 80 percent of this very small population, and the minor allele is T, which is only 20 percent, 1 out of 5.
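The allele frequencies in this five-person example can be worked out with a short sketch. It assumes, as the example does, that each individual is represented by a single sequence, so one allele is counted per person. The allele lists and function names are made up for illustration.

```python
from collections import Counter

def allele_frequencies(alleles):
    """Return {allele: frequency} for the alleles observed at one position."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def major_minor(alleles):
    """Return (major_allele, minor_allele) for a biallelic locus."""
    freqs = allele_frequencies(alleles)
    if len(freqs) != 2:
        raise ValueError("expected exactly two alleles at a biallelic locus")
    ordered = sorted(freqs, key=freqs.get, reverse=True)
    return ordered[0], ordered[1]

# Hypothetical toy data: one allele per individual.
# First locus: four individuals carry G, one carries T.
locus_1 = ["G", "G", "G", "G", "T"]
# Second locus: three individuals carry A, two carry G.
locus_2 = ["A", "G", "A", "G", "A"]

freqs_1 = allele_frequencies(locus_1)
freqs_2 = allele_frequencies(locus_2)
print(major_minor(locus_1), freqs_1)
print(major_minor(locus_2), freqs_2)
```

In real diploid data each person contributes two alleles per locus, so the counts would be taken over 2N alleles rather than N.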
And there is another biallelic locus here, where we found two individuals have G while all the others have A. This is also a biallelic position, with a major allele frequency of 60 percent and a minor allele frequency of 40 percent. So this is a very simple type of DNA polymorphism, and because the difference occurs in only a single nucleotide, it is called a single nucleotide polymorphism, or SNP. This is a very common type of DNA polymorphism in the genome and it is very frequent: for example, in the human genome there are probably around 10 million SNPs. And there are other types of DNA polymorphism, like deletions, insertions, or even copy number alterations of large fragments. The phenotype is the observable traits or characteristics of an individual. Depending on the nature of the phenotype, it could be binary, meaning there are only two possible values, like a disease, where you are either diabetic or non-diabetic, or it could be continuous in nature, like height or weight, and these are continuous or quantitative traits. The phenotype, and we are especially interested in disease phenotypes, usually results from the interaction between the genotype and the environment, and the genotype plays a very important role in determining one's phenotype. That is why a major goal in biomedical research is to understand how the genotype determines the phenotype, and a very simple but powerful way to do this is through association analysis. In association analysis, if we are interested in a phenotype, we want to test whether there is an association between the genotype and the phenotype. And depending on whether the phenotype is binary or quantitative, we need different types of statistical tests in order to establish that relationship.
So let us first look at binary traits. This is usually referred to as a case-control study: for example, disease is case and non-disease is control, right. If you have a population or a sample of size N, some of them are disease samples, or case samples, and some of them are control samples, and then some of them have this genotype, some this genotype and some this genotype. So basically you can look at your individuals and put the numbers into this 2 by 3 table, and, for example, these are the cases with the aa genotype, right. You can fill in this table very easily if you know the genotype and the phenotype for all the individuals. Then we can use a very simple chi-square test to test the association between the binary phenotype and the genotype, and this is the formula showing how the chi-square test is done, right. But sometimes, for example, the major allele might have a dominant effect. That means whether you have this heterozygous genotype or this homozygous genotype, you are going to have the same phenotype. In that case we can combine these two into the same column in the table, and the 2 by 3 table becomes a 2 by 2 table, right. By doing this we actually gain power, because we reduce the degrees of freedom from 2 to 1, so if the major allele really has a dominant effect on the phenotype, we gain power by doing that. On the other side, under a recessive model, we would expect these two genotypes to have the same effect and only this one to differ, and then again we can combine those two columns into one and do the chi-square test. But sometimes there may be an additive effect, where it matters whether you have this genotype or this genotype.
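To make the table-based test concrete, here is a minimal sketch of a Pearson chi-square test on a 2 by 3 genotype table and on the 2 by 2 table obtained by merging two genotype columns. The counts are invented, the function names are hypothetical, no continuity correction is applied, and the p-values use the closed forms that exist for 1 and 2 degrees of freedom.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def chi_square_p_value(stat, df):
    """Upper-tail p-value; only df = 1 and df = 2 are handled here."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")

def collapse(table, groups):
    """Merge genotype columns, e.g. groups=[(0, 1), (2,)] merges the first two."""
    return [[sum(row[j] for j in group) for group in groups] for row in table]

# Hypothetical counts. Rows: cases, controls. Columns: AA, Aa, aa (A = major allele).
genotype_table = [[30, 50, 20],
                  [50, 40, 10]]

# Full genotype test: 2 by 3 table, 2 degrees of freedom.
stat_full = chi_square_statistic(genotype_table)
p_full = chi_square_p_value(stat_full, df=2)

# Major allele assumed dominant: AA and Aa behave alike, so merge them.
merged_table = collapse(genotype_table, [(0, 1), (2,)])
stat_merged = chi_square_statistic(merged_table)
p_merged = chi_square_p_value(stat_merged, df=1)

print(stat_full, p_full)
print(merged_table, stat_merged, p_merged)
```

Merging only helps when the assumed model matches the data. With these made-up counts the merged test is actually weaker than the full one, because the cases differ from the controls mainly between AA and Aa.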
So, let us say the count of minor alleles is roughly linearly associated with the proportion of cases, the disease proportion in the population, like this. If we simply do the chi-square test, we will not be able to capture that relationship, right. So, what can we do? People have come up with what is called the Cochran-Armitage trend test. With this test, you play with the weights, the w's in the formula. For example, if you set the weights to 0, 1, 2, you will be able to test this additive effect with the formula. And the interesting thing about this test is that you can change the weights to 0, 1, 1 or 0, 0, 1, and then you will also be able to test the dominant model and the recessive model. Usually we do not really know which model a SNP or phenotype follows, right. So one thing people do is test all the possible models, take the most significant one and use that for the report. Another way to deal with this additive effect is to convert the genotype count table into an allele count table. Basically each person contributes two alleles to your count table, so your sample size actually increases from n to 2 times n, right. And if one individual is heterozygous at a locus, then it contributes to both of the cells in this table, right. That way you can still do the chi-square test, and the degrees of freedom is still 1, but this also assumes the additive effect. So, those are for the binary traits. But for the continuous or quantitative traits, for example blood pressure or cholesterol levels, this type of thing, what we can do is fit a linear regression model to the data. Basically the covariate will be the number of minor alleles in each individual, and then you correlate that with the continuous measurement of the phenotype.
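The trend test and the allele count table described above can be sketched as follows. The lecture's formula is on a slide and is not reproduced in the transcript, so this uses one standard textbook form of the Cochran-Armitage statistic; the counts and function names are hypothetical. Genotype columns are ordered by the number of copies of the allele of interest (0, 1, 2).

```python
import math

def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Return (z, two_sided_p) for the Cochran-Armitage trend test.

    cases/controls hold genotype counts ordered by copies of the allele
    of interest; weights encode the genetic model being tested.
    """
    k = len(weights)
    r1, r2 = sum(cases), sum(controls)
    n = r1 + r2
    cols = [a + b for a, b in zip(cases, controls)]
    # Trend statistic: weighted difference between case and control counts.
    t = sum(w * (a * r2 - b * r1) for w, a, b in zip(weights, cases, controls))
    # Variance of the statistic under the null hypothesis of no association.
    s1 = sum(w * w * c * (n - c) for w, c in zip(weights, cols))
    s2 = sum(weights[i] * weights[j] * cols[i] * cols[j]
             for i in range(k) for j in range(i + 1, k))
    variance = (r1 * r2 / n) * (s1 - 2 * s2)
    z = t / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p

def allele_count_table(cases, controls):
    """Convert genotype counts (0, 1, 2 copies) into allele counts per group."""
    def alleles(g):
        # Each person contributes two alleles; heterozygotes contribute one of each.
        return [2 * g[0] + g[1], g[1] + 2 * g[2]]
    return [alleles(cases), alleles(controls)]

# Hypothetical genotype counts: 0, 1 and 2 copies of the minor allele.
cases = [30, 50, 20]
controls = [50, 40, 10]

z_add, p_add = cochran_armitage(cases, controls, (0, 1, 2))  # additive
z_dom, p_dom = cochran_armitage(cases, controls, (0, 1, 1))  # dominant
z_rec, p_rec = cochran_armitage(cases, controls, (0, 0, 1))  # recessive
allele_table = allele_count_table(cases, controls)

print(z_add, p_add)
print(z_dom, p_dom)
print(z_rec, p_rec)
print(allele_table)
```

With the 0, 1, 1 or 0, 0, 1 weights, the squared z value equals the Pearson chi-square of the corresponding merged 2 by 2 table, which is one way to check the implementation. If all models are tested and the best one is reported, the extra tests count toward the multiple-testing burden.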
For example, with blood pressure, you can then test the goodness of fit of that linear regression. If you find a locus that is associated with a quantitative trait, we call that locus a quantitative trait locus, or QTL, and you have probably heard about QTLs many times. So far we have been talking about the case where you are interested in one SNP and one phenotype and you try to establish the relationship. But oftentimes you start with a disease or phenotype of interest, and you do not really know which SNP, or which position in the DNA sequence, is associated with that phenotype, right. In order to find out, you have to scan the whole genome, the whole DNA sequence and all the SNPs, to find which one is actually associated with that phenotype, right. To do that we have to do a genome-wide association study, which is GWAS, and again you have probably heard this term already. In order to do a genome-wide association study, we first need a relatively low-cost assay or platform to do the whole-genome genotyping. This can be done with array-based platforms, and nowadays arrays can go up to one million SNPs. So, basically you can scan one million SNPs at a time. You can also go with a next-generation sequencing based approach, because we are talking about ten million SNPs in the human genome, but the array can only cover one million, right. If you do next-generation sequencing with very good depth, this is a truly unbiased assay, but it is still more expensive than the array-based technology. And then you also need a large population with both cases and controls, or with sufficient variation for a quantitative trait.
And then, of course, we need statistical analysis to analyze the data, but the basic statistics is simple. It is what we just talked about: the chi-square test on the table, the trend test, or the linear regression test for a quantitative trait. One thing we want to remember is that now you are testing a million SNPs, for example, against one phenotype, so you are doing a lot of tests. This is the multiple testing problem that Dr. Mani talked about yesterday. We need to be careful to adjust for multiple testing, because you did so many tests that by random chance you are going to get some positive hits, right. We want to make sure that what we get is not because of that. When you get the results from the GWAS analysis, they can be visualized in this Manhattan plot. In this plot, on the x axis are the genomic locations of the loci, and each chromosome is indicated by a different color. So, here is chromosome 1 and here is chromosome 22, and we do not include the sex chromosomes here. On the y axis are the p values on the minus log scale, because if you take the minus log, a smaller p value will have a bigger value, right, and it will appear at the top of the plot. This is a very effective way of visualizing the data: immediately you can see that on this chromosome 3, this chromosome and this chromosome, on these three chromosomes at these positions, there are SNPs associated with the phenotype you are interested in. So, that is a very effective way to visualize and understand your data. The GWAS Catalog is a resource that captures all the SNPs associated with different kinds of diseases or phenotypes, categorized into, I think, 17 categories, and you have all the associations in that resource. They actually used a p value of 5 multiplied by 10 to the minus 8 to determine significance.
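As a small illustration of the y axis of a Manhattan plot, the sketch below converts p values to the minus log10 scale and picks out the SNPs that rise above the 5 times 10 to the minus 8 line. The result tuples are invented for illustration; a real plot would also need cumulative chromosome offsets for the x axis and a plotting library.

```python
import math

GENOME_WIDE_P = 5e-8  # significance cutoff quoted in the lecture

def minus_log10(p):
    """Smaller p values map to larger heights on the plot."""
    return -math.log10(p)

# Hypothetical association results: (chromosome, position, p value).
results = [
    (1, 1_200_000, 0.3),
    (3, 45_000_000, 2e-9),
    (7, 9_000_000, 1e-4),
    (12, 300_000, 4e-8),
]

# Height of each SNP on the y axis.
points = [(chrom, pos, minus_log10(p)) for chrom, pos, p in results]

# Height of the horizontal genome-wide significance line.
threshold_height = minus_log10(GENOME_WIDE_P)

# SNPs that appear above the line.
hits = [(chrom, pos) for chrom, pos, height in points if height > threshold_height]

print(threshold_height)
print(hits)
```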
So, can anyone tell me why they picked this cutoff rather than just 0.01 or other values? Yes, 5 times 10 to the minus 8, why is that used as a cutoff? If you are going to select significant results from a test, what p value are you going to use? 0.05, right. But why did they use such a stringent one? Because this is 0.05 divided by 1 million, right, because there are too many findings and you want to get the most significant ones. Any other answers? Okay, it is because of the variation, you think. Yes, exactly. So, it is the multiple testing we were talking about. The reason they picked this value is that most of these data were generated on SNP arrays, which measure a million SNPs. That means you have a million tests. So, they just did a very simple Bonferroni correction, as Dr. Mani mentioned yesterday. Basically, the cutoff of 0.05 that you would typically use if you only had one SNP has to be divided by one million. That is why you get this number. So far we have talked about basic genetics and GWAS, or association studies, and I think most of you are probably already familiar with those. In the next part I am going to talk about molecular traits as quantitative traits, which is probably more relevant to this audience. Although we are interested in the association between DNA and the phenotype, we also know that there are a lot of other things going on between the genotype and the phenotype. The DNA has to be transcribed into RNA, the RNA has to be translated into proteins, and there is a lot of regulation going on, right. When you have the same DNA, you do not necessarily end up with the same protein, and you do not necessarily end up with the same phenotype. So, in order to understand this whole process, we can also consider gene expression measurements as quantitative traits.
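The arithmetic behind the cutoff can be spelled out in a short sketch. It uses the figures given in the discussion (0.05 and one million SNPs); the function names are hypothetical. The second function shows why the adjustment matters by estimating how many SNPs would pass an unadjusted 0.05 cutoff by chance if none were truly associated.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cutoff after Bonferroni correction."""
    return alpha / n_tests

def expected_false_positives(alpha, n_tests):
    """Expected number of tests passing alpha by chance under the null."""
    return alpha * n_tests

n_snps = 1_000_000  # number of SNPs on the array, as stated in the lecture

cutoff = bonferroni_cutoff(0.05, n_snps)
chance_hits = expected_false_positives(0.05, n_snps)

print(cutoff)       # the genome-wide cutoff, about 5 times 10 to the minus 8
print(chance_hits)  # about fifty thousand chance hits without adjustment
```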
Think about it: if you do an RNA-seq experiment for 100 patients, you are going to get a gene expression measurement, for example TP53 abundance, 100 times, and you can think of that as a quantitative trait, right. If you also have the genotype data from the same cohort, you can actually correlate those. If you do that, it is called mRNA abundance as a quantitative trait, or expression QTL, or eQTL. So, basically you are trying to establish the relationship between a genotype and the gene expression, not the final disease phenotype. I think yesterday somebody asked about Ribo-seq, right. You can also do Ribo-seq in order to get the actively translated regions of the RNA. This indicates the ribosome occupancy, and if you associate those measurements with the genotype and find a QTL, it is called an rQTL, or ribosome occupancy QTL. Similarly, we can do protein abundance measurements and associate the genotype with protein expression, and if we get an association it is called a pQTL. Usually when you do a disease study you have one phenotype of interest and you scan many SNPs, but this time you have maybe 10,000 genes, each with its own measurements. So it further increases the number of tests you can do. It is very powerful, because you can test so many hypotheses at the same time, but just be careful about your multiple testing adjustment. The beauty is that you can understand the regulatory mechanism for all the SNPs and all the gene or protein expression, and that is pretty cool. But the basics are very simple, because you treat each gene expression or protein expression as a quantitative trait, and then you can do the same linear regression to find the relationship.
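A minimal sketch of that regression for one SNP and one gene is shown below. The data are invented, the function name is hypothetical, and only the slope, intercept and R squared are computed; a real eQTL analysis would also compute a p value for the slope, usually adjust for covariates, and repeat the fit across many SNP and gene pairs.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical data for six individuals:
# number of minor alleles at one SNP (0, 1 or 2) ...
minor_allele_count = [0, 0, 1, 1, 2, 2]
# ... and the expression level of one gene in the same individuals.
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]

slope, intercept, r_squared = fit_line(minor_allele_count, expression)

# The slope is the estimated change in expression per extra minor allele.
print(slope, intercept, r_squared)
```

The same function applies unchanged to protein abundance or ribosome occupancy values, since each is treated as just another quantitative trait.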
So, before we go to some examples, I want to mention that there is a difference between the typical disease phenotype and the gene expression, protein expression or ribosome occupancy type of phenotype. Both the genes and the SNPs have locations on the genome, so there is a positional relationship between them. Sometimes a SNP can be very close to the gene. A lot of the time you find SNPs in the promoter region, because there are a lot of regulatory sequences in that region, and then you can find a relationship between the genotype in that region and the expression of the gene right next to that SNP. These are called cis-eQTLs, or similarly, if it is protein expression, cis-pQTLs. As you can see in this example, if you have a SNP here, especially in the promoter region, it may affect the expression of this gene, and that is very understandable, right. It has been shown that cis-eQTLs are very important, because disease-associated SNPs are enriched in these cis-eQTLs. Now let us assume this gene encodes a protein which is a transcription factor. Then it could have an additional effect on other genes, not in the same region but farther away from the SNP, or even on another chromosome, and you can still find that relationship. In this case it is called a trans-eQTL, meaning the gene expression or protein expression you are interested in is farther away from the SNP you identified. Because this is a relatively indirect effect, it is usually weaker. So you usually need a larger sample size to identify those, and that is why fewer trans-QTLs have been reported so far. But if you do find some of them it is very nice, because it may show you how one SNP alters a pathway or network: multiple genes might be controlled by the same SNP, and that can tell you how the regulatory network works.
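The cis versus trans distinction is positional, so it can be sketched as a simple classification rule. The lecture does not give a distance cutoff; the 1 Mb window below is an assumption chosen for illustration, and the function name and coordinates are hypothetical.

```python
# Assumed window: SNPs within 1 Mb of the gene count as cis.
# This value is an illustrative assumption, not taken from the lecture.
CIS_WINDOW_BP = 1_000_000

def classify_qtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                 window=CIS_WINDOW_BP):
    """Label a SNP-gene association as 'cis' or 'trans' by genomic position."""
    if snp_chrom != gene_chrom:
        # A different chromosome is always treated as trans.
        return "trans"
    if gene_start - window <= snp_pos <= gene_end + window:
        # Same chromosome and close to the gene, e.g. in its promoter region.
        return "cis"
    # Same chromosome but far away from the gene.
    return "trans"

# Hypothetical examples.
promoter_snp = classify_qtl("chr3", 49_999_500, "chr3", 50_000_000, 50_020_000)
distant_snp = classify_qtl("chr3", 10_000_000, "chr3", 50_000_000, 50_020_000)
other_chrom_snp = classify_qtl("chr3", 49_999_500, "chr17", 7_600_000, 7_700_000)

print(promoter_snp, distant_snp, other_chrom_snp)
```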
In today's lecture you were introduced to association analysis, which is used to understand the relationship between genotype and phenotype. Various statistical tests can be used to understand the relationship, depending on whether the traits are quantitative or binary. Manhattan plots can be used to understand the results when GWAS, or genome-wide association studies, are performed. Depending on the positions of the single nucleotide polymorphisms (SNPs) and the genes, QTLs can be either cis-QTLs or trans-QTLs, and those can be analyzed using these tools. In the next lecture you will be introduced to the power of integrative GWAS and eQTL analysis using various examples from the literature. Thank you.