Before we get started today, I'm going to come back to the learning objectives and briefly revisit the motivation for PCA, which we covered yesterday. Late in the afternoon there were questions: when do you do PCA? Why are you doing this? Is this useful? And I'm here to tell you it is very useful.

Here's a major application in human genomics. When you have a data set with people of different ethnicities, and I don't just mean people with different continental recent genetic ancestry, even fine-grained ancestry, say a population study from one of the Nordic countries, you are going to have biases because the genetic structure differs between these populations. This phenomenon is called population stratification. If you do not control for it, you are going to be reporting a genotype effect that may be related to the population makeup of your cases versus controls.

So what you're seeing here is a PC plot, PC1 versus PC2, for SNP data, genetic polymorphisms, from a sampling of global populations: individuals of recent African ancestry, American ancestry, European ancestry, and so forth. You can see they're completely spread out: all the yellow dots are together, all the pink dots, the purple dots, and these represent different populations. What would happen if I wanted to do a disease study for whatever my disease of interest is and didn't control for this? Say there's a slight difference: my cases have slightly more people of American ancestry, and my controls slightly more people of Asian ancestry. I'm going to be reporting population differences, not disease differences.

So it is very common in genetic studies to take, say, the first 10 principal components and add them to your model. In the model we talked about yesterday, where disease is a function of SNP genotype, you add terms for the 10 principal components. The plot on the bottom shows the PCA for individuals from just the Han population in China, and even there you can see that individuals from different regions cluster together in the PC plot. So whenever you're working with a complex population, one with complex genetic ancestry effects, you're going to want to do something like PCA; the tools you use will have functions to extract the principal components, and then you take them and build them into your model.

[Audience question, partly inaudible, about how small the variance explained by these components is.] So yes, they are low, but how low is your signal compared to that? It depends on what you're looking for. I would take the same data and color code the samples. Suppose you're looking at the effect of some environmental perturbation, say people exposed to air pollution versus people who weren't; air pollution is on my mind these days. If you color coded them, what would they look like? Would they be mixed in here? Is the population structure going to swamp your signal or not?
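Here's a minimal sketch of that PC adjustment in R, jumping ahead to the glm() function we'll meet later today. The object names (genotype_matrix, dat, snp, disease) are hypothetical; in practice a dedicated GWAS tool would compute the components for you:

pcs <- prcomp(genotype_matrix, scale. = TRUE)  # samples in rows, SNPs in columns
dat <- cbind(dat, pcs$x[, 1:10])               # append PC1..PC10 to the phenotype table

# Unadjusted model: risks reporting ancestry, not disease
fit_raw <- glm(disease ~ snp, data = dat, family = binomial)

# Adjusted model: the first 10 PCs absorb the population structure
fit_adj <- glm(disease ~ snp + PC1 + PC2 + PC3 + PC4 + PC5 +
                 PC6 + PC7 + PC8 + PC9 + PC10,
               data = dat, family = binomial)

summary(fit_adj)  # compare the SNP coefficient across the two fits

Comparing the SNP effect between the two fits is exactly the with-versus-without robustness check I'll mention in a moment.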
So I would make a judgment about the relative magnitude of the signal you're looking for. And for population stratification, I can tell you, you really, really want to control for it. In practice, you might run your model with those terms and without them and see whether your effect is robust to that. Even though a component explains 0.2% of the variance or whatever, there is structure there. Yes, that number is small, but look at the clusters: they're separating out. So if you have a very slight bias, just because of the nature of data collection, your controls tend to be slightly more from the south and your cases slightly more from the north, is that going to influence your results? You have to do the controls: add those terms in and show that it doesn't.

Here's another example where we used PCA. This is a case-control study in which we were looking for biomarkers in the DNA methylation layer in schizophrenia and bipolar disorder, from postmortem brain samples. What you're looking at are the PC plots of DNA methylation data from a microarray. Each of those points is a sample, and each has a long text label because the sample ID is being plotted there instead of the circles we used yesterday. This is principal component one, principal component two, principal component three, and you can see there is a separation; there are two blocks here. We've color coded the samples by which microarray slide they were on, by case-control status, by whether they're male or female (look what's happening there with PC2), and by which tissue bank we got the samples from.

And notice there's just one sample that's sticking out. The thing I haven't mentioned is that we took the brain tissue and separated it into neuronal and non-neuronal cells. The first principal component showed that this one outlier sample was mislabeled as a neuronal sample when it was actually non-neuronal, and the PC plot picked up on that. Males and females get mislabeled as well; you pick up on that. Duplicate samples you can pick up on too, because you see that sample A has perfect correlation with sample B, and then you look into it. This happens a lot more often than you'd think, so these tools are important for making sure your data are what you think they are, and they additionally inform you about which variables you need to build into the model.

In addition to methylation, we ran SNP arrays on these samples, took the first three principal components, and built them into our methylation model. This is how you use PCA: for data exploration, and to find the extra terms you need to build into your model. Then when people ask, "What have you controlled for? Lifetime exposure to medication?" and so on, you have those models.
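A minimal sketch of that kind of QC in R, assuming a hypothetical matrix meth (samples in rows, probes in columns) and an annotation data frame pheno; the column names are made up:

pca <- prcomp(meth, scale. = TRUE)

# Color the PC plot by each technical or biological variable in turn:
# slide, case-control status, sex, tissue bank, cell fraction...
plot(pca$x[, 1], pca$x[, 2],
     col = factor(pheno$cell_fraction),
     xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = pheno$sample_id, pos = 3, cex = 0.6)

# Duplicate or swapped samples: look for suspiciously high
# correlation between pairs of samples
sample_cor <- cor(t(meth))
which(sample_cor > 0.99 & upper.tri(sample_cor), arr.ind = TRUE)

Any off-diagonal pair with near-perfect correlation is the "sample A matches sample B" situation just described.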
And that's what your model looks like, for example: DNA methylation is a function of disease status, these three genetic principal components, postmortem interval to tissue harvesting, et cetera, et cetera. The model grows, and your sample size stays the same.

Today we're going to talk about generalized linear models, which are an extension of linear models. Did you folks cover linear models in the Intro to R course? Okay. So linear models are used when you're trying to fit two continuous-valued variables and linear regression seems to make sense; you would use the lm function, and I'm going to go over the R notation for models in a moment. Here you've got data that I've simulated, y versus x, plus noise, which is the variation that is always there in real-world data. The assumption is that this noise has a normal distribution, a bell-shaped distribution, about each point on the line.

But sometimes you have data with a binary outcome, like survived or did not survive, has diabetes or does not have diabetes, and some continuous-valued explanatory variable. If you try fitting a line to this, it's probably not going to be the best way to explain these data. Rather, you're looking to fit some kind of step function with an S shape. This is called a sigmoid; sigmoid means S-like, and at some point there's a gradual switch from the off position to the on position. For example, you've got diabetes status, negative versus positive, and you're measuring blood glucose levels. So when you want to fit some other kind of shape, you don't necessarily reach for a linear model.

So this is how it goes. We talk about linear models because they're very simple, well defined, and they actually fit the data well in a number of cases, but you cannot take a linear model and apply it to all kinds of data. So you've got linear models; you've got a broader class of models called generalized linear models, which means that the data are not linear, but if you apply a transform to them, you can make them linear; and then you have a third category called nonlinear. When I say you can make them linear, I mean you can make them fit the assumptions of a linear model.

Why are we introducing this here? Because it's a very common experimental design: we have binary outcomes, and we have continuous-valued explanatory variables. For this, we tend to fit what's called a logistic regression. In R, instead of saying lm, you use another function, glm, for generalized linear model, and in this case you tell it what flavor of generalized linear model you're using. For this particular scenario, the binary yes/no, it's a coin-flip model: the higher the explanatory variable, the higher the chance that the coin is going to come up heads. So we say it's the binomial family of GLM. Another very common one you're going to see in genomics is for count-based data, such as that from a next-generation sequencer. In RNA sequencing, what we generate is counts, and then we want to compare the counts for all the genes between groups A and B. For that, you need to use a different flavor of model called a negative binomial model.
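Here's a minimal sketch of the count flavor, using simulated data and glm.nb from the MASS package; the group sizes and means are invented, and real RNA-seq analyses typically use tools like DESeq2 or edgeR, which build on this model:

library(MASS)

set.seed(1)
group  <- rep(c("A", "B"), each = 10)                                # two groups of samples
counts <- rnbinom(20, mu = ifelse(group == "A", 50, 100), size = 5)  # simulated counts for one gene

fit <- glm.nb(counts ~ group)  # negative binomial GLM: counts as a function of group
summary(fit)                   # the 'groupB' coefficient is the log fold change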
You don't need to know all the math. The key thing is that the nature of your data informs the choice of model, and you cannot fit a line to everything. Different kinds of generalized linear models go with different formulas: this logistic regression formula, this negative binomial one. Model selection must be data driven.

Now, a bit of notation for model fitting in R. For example, say we want to model income as a function of percent literacy, and we think this is going to fit a line. You use lm, and you write your response variable, your y value, then this tilde, meaning "as a function of", then the explanatory variable. We don't write in the intercept term; R understands that it's going to fit an intercept. And then we say: use this table and fit this line. The y is income, and the x is percent literacy. That's just the notation.

When you have something like a logistic regression model, where you've got a response variable, here diabetes yes/no, as a function of blood glucose levels and, say, pregnancy status, we said this is a generalized linear model, so we're going to use glm, but the notation is similar: y, tilde, and the explanatory variables. We don't have to put in the beta terms, the coefficients, because R does that for you. Then you give it the data as before, and because glm lets you fit different flavors of generalized linear models, you need to specifically say, for this case, family = binomial. So that's how it works.

Okay. So now we are going to go to the online example. Just as we did yesterday with the dimensionality reduction class, we're going to follow along with the worked examples, and everything is just going to work.
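Before we jump into the worked examples, here's a minimal sketch of the two calls just described, assuming hypothetical data frames countries and patients with these column names:

# Linear model: income as a function of percent literacy.
# R adds the intercept term automatically.
fit_lm <- lm(income ~ percent_literacy, data = countries)

# Logistic regression: diabetes (yes/no, coded as a factor or 0/1)
# as a function of blood glucose and pregnancy status;
# family = binomial picks the flavor of GLM.
fit_glm <- glm(diabetes ~ glucose + pregnant,
               data = patients, family = binomial)

summary(fit_lm)
summary(fit_glm)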