So, in the last session we discussed analysis of variance, which we use when we are deciding which factors to include in the final design of experiments. We check one factor at different levels, and analysis of variance is the appropriate technique for understanding whether the mean value of y changes significantly when I change the level of the factor. We took some examples, and we also saw how to check model adequacy: whether the errors follow a normal distribution, whether heteroscedasticity exists, and what is to be done in those situations. We also discussed the Durbin-Watson statistic, which can be used to test for autocorrelation between the errors, that is, whether the errors are independent or not. Today's lecture concentrates on an extended version of this. In analysis of variance, x is discrete, and I want to see the influence on y when I change the level of x. But in certain situations there can be variables which I cannot control and which take continuous values, like temperature. Such a variable does not have discrete levels, but it can influence the output of the process. Let me show you what I mean. This is the overall process diagram that we have been concentrating on. There are control variables over here, there are uncontrollable or uneconomical-to-control factors which also change, and this is the CTQ which comes out of the process. So, a component enters the process and goes out, we measure the CTQs, and we check whether everything is going fine so that we can monitor and control the process.
So, these are the settings which we change, x1, x2 up to xp, and these are the variables which are uncontrollable or difficult to control. There can also be inputs that keep on changing because they come from an earlier process. These variables I cannot control, but they influence my characteristic Y, and the term used for them is covariates. Covariates can take different values and we do not have any control over them, but we know that they influence the final CTQ or Y. I want to set x at different levels, and I am basically interested not in the covariates but in whether the factor x influences Y or not. Later on I can take care of the covariates by different means in design of experiments. For this there is another analysis, an extension of analysis of variance, known as ANCOVA, that is, analysis of covariance, and it requires a general linear model. Minitab offers this as one of its analysis options: the covariates can be included in the model along with the factor. We can develop this analysis only with the general linear model. One-way ANOVA is not feasible here because one of the variables we want to include is continuous. The general linear model is built on regression, which is another important concept that we will discuss right after this. Regression is used extensively in design of experiments to understand the relationship between Y and X, and also to screen factors, that is, to decide which factors to select when X is continuous.
So, in that case we use regression techniques, and here we are talking about linear regression models. We will not go into the details of the linear regression model at this point. We only want to know how to use and interpret the ANCOVA results: there is a continuous variable, there is a factor, and I want to understand whether the factor influences my outcome, that is, the CTQ, or not. We will take an example so that it is easy to understand. The covariate influences my final outcome, but I am testing one factor at different levels. The factor can have level 1, level 2 and level 3, which are discrete levels, but whenever I change the level, the covariate values are also changing. I want to see whether, in the presence of the covariate, I can detect the influence of X on the expected value of Y. One of the examples is already with us, so I will go to the file where it exists. Here, columns C14, C15 and C16 hold the data. Different types of ink are used, and percentage transmission is Y, the CTQ which I am monitoring. I want to see the influence of the ink categories A, B and C, that is, whether the expected value of percentage transmission changes when I change the ink. Ink is a discrete variable, we are checking it at discrete levels, and I want to see whether it influences percentage transmission.
There is another important variable here, which is a covariate. We cannot control it, but its value keeps changing: when I change the level of ink, it also takes different values. Intuition, previous experience and perhaps theory tell us that it can influence percentage transmission, but since we cannot control it at present, we treat it as uncontrollable. What we basically want to see is whether the factor is important or not. So, let us look at C14 and C16, ignore C15, and do a one-way analysis of variance to see what results we get. The response is percentage transmission and the factor is ink. I leave out all the other options such as comparison tests, because I only want to see whether the analysis of variance is significant or not. I then look at the analysis of variance table, copy it as a picture, paste it into Excel and enlarge it. What I see is that the p-value is about 0.5. So, ink does not appear to be a factor that influences percentage transmission. When we do not consider the covariate and do a one-way analysis of variance, the p-value does not come out to be significant. Now, what happens if I consider this variable as a covariate and repeat the analysis? That is what we have to see.
One-way analysis of variance does not have any option for this. When you go to one-way analysis of variance, no option for dealing with covariates is given, so we cannot do the analysis there. We have to go to the general linear model that is available here and choose to fit a general linear model. This uses regression as the underlying method to develop the analysis. We will discuss regression just after this, and it will become clearer how regression is used to develop such models. For now, let us assume that we want to see only the analysis of variance table and interpret it from the p-value. I want to see whether the factor has an influence or not, and for this the general linear model gives an option for covariates. I can specify the response, the factors and the covariates. My response is percentage transmission, the factor is ink, which is what I want to understand, and the covariate is wavelength. There are many more options here: for example, under storage we can store the residuals and analyze them to check model adequacy. Nothing else is required at this stage, because this uses linear regression and we have not covered regression yet. I know that the variable wavelength has an influence, but I only want to understand whether the factor is important or not, for screening. So, I click OK, and I want to see only the analysis of variance table.
So, once I have done this there is an analysis of variance table, and I copy and paste it to see what happens. Earlier the p-value for ink was about 0.5, but when I incorporate wavelength, the factor ink has a p-value which is less than 0.05. That did not appear when I analyzed only ink against percentage transmission, because there the p-value was about 0.5, which is more than 0.05. If I consider the first row, wavelength also has a significant p-value. That means the wavelength values are changing and that is also influencing my outcome, the CTQ. Ink was changed at three levels, so its degrees of freedom are 3 minus 1, that is 2, and wavelength is treated as a regression variable. So, when I consider the covariate, ink comes out to be a very prominent factor, which it did not in the one-way analysis. Whenever I have information about a covariate, I should include it in order to understand whether the factor which I am actually changing influences Y or not. If I ignore the covariate, ink does not come out to be a prominent factor. When I include the covariate, the total variability is segregated into the parts due to ink and due to wavelength, so less of it is left in the error, and ink then comes out to be prominent. That is the way we should interpret analysis of covariance.
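To make the comparison concrete, here is a minimal Python sketch of the same idea. It is not the Minitab procedure and it does not use the lecture's ink data: the numbers below are invented for illustration, and the function names `anova_f` and `ancova_f` are hypothetical. The sketch assumes one covariate with a common slope in every group, and it reports only the F statistic for the factor, with and without the covariate, not the p-value.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic. groups: dict {level: [y values]}."""
    all_y = [y for ys in groups.values() for y in ys]
    grand = mean(all_y)
    k, n = len(groups), len(all_y)
    ss_between = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in groups.values())
    ss_within = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def _sums(pairs):
    """Corrected sums of squares and cross-products for (x, y) pairs."""
    xb = mean([x for x, _ in pairs])
    yb = mean([y for _, y in pairs])
    sxx = sum((x - xb) ** 2 for x, _ in pairs)
    sxy = sum((x - xb) * (y - yb) for x, y in pairs)
    syy = sum((y - yb) ** 2 for _, y in pairs)
    return sxx, sxy, syy


def ancova_f(groups_xy):
    """F statistic for the factor after adjusting for one covariate.

    groups_xy: dict {level: [(x, y), ...]}. Assumes a common slope.
    """
    pairs = [p for g in groups_xy.values() for p in g]
    k, n = len(groups_xy), len(pairs)
    # Error left after fitting the covariate only (factor ignored).
    sxx, sxy, syy = _sums(pairs)
    sse_reduced = syy - sxy ** 2 / sxx
    # Error left after fitting the factor and the covariate together.
    exx = exy = eyy = 0.0
    for g in groups_xy.values():
        a, b, c = _sums(g)
        exx, exy, eyy = exx + a, exy + b, eyy + c
    sse_full = eyy - exy ** 2 / exx
    return ((sse_reduced - sse_full) / (k - 1)) / (sse_full / (n - k - 1))


# Hypothetical data: (covariate, response) pairs for three levels of a factor.
data = {
    "A": [(1, 2.1), (2, 3.9), (3, 6.0), (4, 8.1), (5, 9.9)],
    "B": [(1, 3.1), (2, 4.9), (3, 7.0), (4, 9.1), (5, 10.9)],
    "C": [(1, 4.1), (2, 5.9), (3, 8.0), (4, 10.1), (5, 11.9)],
}
f_plain = anova_f({k: [y for _, y in v] for k, v in data.items()})
f_adjusted = ancova_f(data)
print("F ignoring the covariate:", round(f_plain, 3))
print("F adjusting for the covariate:", round(f_adjusted, 1))
```

In this invented data set the covariate explains most of the spread within each group, so the factor looks unimportant until the covariate is included, which is the same pattern as in the ink example.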
So, that is one important thing that I wanted to say. We also have to understand that the assumptions checked under model adequacy may not hold in certain scenarios. Analysis of variance is very robust, and researchers claim that it is very robust, but real-life scenarios are not ideal and there can be deviations. Small or moderate deviations do not influence the results as such, and my conclusion will be more or less correct. But there are non-parametric options that we can adopt in case the assumptions are not valid, we are not able to satisfy them, or a transformation is not working. When everything fails and I cannot satisfy assumptions such as normality or equal variances, there is an alternative test known as the Kruskal-Wallis test, although we cannot be as fully assured of its results as with classical techniques such as ANOVA or Welch's test. The Kruskal-Wallis test is an alternative to one-way analysis of variance which can be used when the groups have similar distributions. Groups here means categories: if I have three levels A, B and C, then in every category the Y characteristic follows a distribution of the same shape. In that case we can use this test, and we have done that in previous cases. If the shapes of the distributions are different, we can go for Mood's median test, which is also a non-parametric test.
So, if the groups have similar distributions I can use the Kruskal-Wallis test, and if not, I can use Mood's median test. In the earlier example on marketing strategy and card sales, the data were found to be non-normal. We did a transformation, checked on the transformed Y whether the factor has an influence, and made a judgment based on that. Now let us assume I want to go for a non-parametric test. The three strategies are quality, design and price, and these columns hold the sales information. We can test whether each of the groups is normally distributed or not. For quality, the Anderson-Darling test gives a p-value of approximately 0.08, so we can assume that the normal distribution assumption is fulfilled. Similarly, we do the normality test for the next variable, design flexibility, and the p-value we get is 0.939, which is also satisfactory. So, the first two groups, quality and design flexibility, are more or less normal. For the last group, price, I do the test again, and the Anderson-Darling test gives a p-value of 0.117. So, here also it is satisfactory.
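For readers who want to see what sits behind the normality check, the following is a minimal Python sketch of the Anderson-Darling statistic for a normal distribution whose mean and standard deviation are estimated from the sample. The function name `anderson_darling_normal` and the sample values are hypothetical, not the lecture's sales data. The sketch returns only the statistic A-squared; converting it to a p-value, as Minitab does, is left to the software.

```python
from math import log
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(sample):
    """Anderson-Darling A^2 statistic against a fitted normal distribution."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    std_normal = NormalDist()  # standard normal, mean 0 and sd 1
    # Fitted normal CDF evaluated at the ordered, standardized observations.
    cdf = [std_normal.cdf((v - m) / s) for v in sorted(sample)]
    total = sum(
        (2 * i - 1) * (log(cdf[i - 1]) + log(1.0 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


# Hypothetical sales figures for one strategy group (illustration only).
group = [12.1, 14.3, 13.8, 15.2, 11.9, 13.1, 14.8, 12.7, 13.5, 14.0]
print("A-squared:", round(anderson_darling_normal(group), 3))
```

A small A-squared means the ordered data sit close to the fitted normal curve, and a large value signals a departure from normality, especially in the tails.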
So, within the groups the distribution is more or less the same. We can verify whether it is normal or some other distribution, and Minitab has an option to check which distribution the data follow. To keep it simple, I am assuming that the shapes are similar and that transformation and everything else has failed, so I go to the non-parametric tests and choose the Kruskal-Wallis test. I give card sales as the response and marketing strategy as the factor, run the Kruskal-Wallis test and click OK. This is the table that we are concerned with, so I copy it and place it here. What you see is that there are p-values, and the test uses median information, not mean information. Non-parametric tests compare medians, and the comparison is based on rank information. The question is whether the medians are the same or at least one median is different. Two p-values are reported, adjusted for ties and not adjusted for ties. We will go by the p-value not adjusted for ties, which is more conservative. The value 0.581 means that the medians are not significantly different. We can compare this with the classical approach, assuming that everything is OK. One-way analysis of variance is also possible here. The data are in different columns, in this case quality, design flexibility and price.
So, when we do it the classical way, assuming everything is fine, and I copy the result and place it alongside, what we see is that the p-value is 0.514, whereas the Kruskal-Wallis p-value not adjusted for ties is 0.581. Both analyses, Kruskal-Wallis and one-way ANOVA, give more or less the same result, even if the distributional assumption is not satisfied. What we are saying is that analysis of variance is so robust that the conclusion from the non-parametric test is the same as the conclusion from ANOVA. Even if the final model adequacy checks are not satisfactory, the interpretation comes out to be the same. So, we should always try the classical approach first. There can be deviations, but most of the time we can expect it to be robust enough to adopt. There is always a chance that I can go wrong, and that is also possible with a non-parametric test. That is the way we should adopt these techniques, and this is all I wanted to discuss about one-way analysis of variance.
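The rank-based comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Minitab's implementation: the function names and the sales numbers are hypothetical, the H statistic is the version not adjusted for ties, and the p-value helper uses the chi-square approximation for exactly three groups (2 degrees of freedom), where the upper-tail probability is exp(-H/2).

```python
from math import exp


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic, not adjusted for ties.

    groups: list of lists of observations, one list per factor level.
    """
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Tied values share the average of the ranks they occupy.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def p_value_three_groups(h):
    """Chi-square upper tail with 2 degrees of freedom (three groups only)."""
    return exp(-h / 2.0)


# Hypothetical sales under three marketing strategies (illustration only).
quality = [23, 41, 35, 28, 30]
design = [27, 33, 38, 25, 31]
price = [22, 36, 29, 34, 26]
h = kruskal_wallis_h([quality, design, price])
print("H:", round(h, 3), "approximate p-value:", round(p_value_three_groups(h), 3))
```

Because the test works on ranks rather than on the raw values, it does not need the normality assumption, which is why it serves as the fallback when the ANOVA assumptions cannot be satisfied.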
So, we have discussed the model adequacy checks and the scenarios in which we adopt these techniques to screen factors, for example the factors coming out of a cause-and-effect diagram. There is another way of screening variables. It cannot guarantee whether a factor influences the CTQ or not, but some preliminary analysis can be done to isolate the potential factors. It is useful because you may not have done a statistically designed experiment, but you may have previous, historical data from which you try to interpret whether the factor influences the outcomes or CTQs. The important technique for this, which we will discuss briefly, is regression. We will talk only about linear regression, which is the primary form adopted in design of experiments and is used extensively there. Even after doing experimentation we will use regression to develop the function between y and x, and from there we will go to the global optimal solution. So, this primary idea is required, and I think it is necessary to illustrate it here. We are at the transition from the control phase to the improvement phase, just on the border line, and we are trying to see the potential factors. Once I have screened the potential factors, I will go for a full experimentation, identify which factor influences Y and by how much, and based on that optimize the system or process. So, the next important topic that we want to discuss is regression.
So, our overall objective is to develop this mathematical function, because if I can develop the function I can optimize it. The regression technique is used whenever a physical or mechanistic mathematical model does not exist. This is often the case because machines have been working for many years, and due to wear and tear you will not find the scenario that existed when the machine was installed. So, we try to develop new models. The relationship is stochastic: we want to establish the mathematical relationship at a given point in time, and based on that we adopt optimization. What we call regression analysis is empirical modeling: it is an empirical relationship that we want to establish between y and x. These types of models are used extensively, not only in design of experiments or in processes. You can find applications in football matches, for example how many tickets we can sell based on the conditions of the match: who is playing, the weather, where the match is being held. All these things dictate how many tickets will be sold. So, there will be some predictors and there will be some outcomes or CTQs that we want to predict. This is basically a prediction model, and it is also used in design of experiments. Similarly, we can look at the change in height or weight per unit time.
So, these things can be predicted. In operations we want to forecast demand for the next period, and for that we also adopt regression techniques to extrapolate: I have information up to time t and I want to see what will happen at t plus 1. Room booking in the service industry is another case. Whenever we are trying to predict something, a regression equation can be used. Similarly, for the CTQ of a process, which is y, we want to see how it is influenced by the different factors, which are x. Here x can be the x variables up to xp, or the z variables that we talked about, and it can also be covariates that influence my process, as we have just discussed. So, there can be variables which I can control, variables which I cannot control, and covariates or input conditions which keep on changing and can influence the CTQs. Regression can be used to understand this relationship. Why do I want this relationship? Because I want to optimize the total process, and for that I will use some optimization technique to find the x conditions that optimize my y. One of the easiest techniques that we will learn is linear regression. The simplest case is one predictor and one predicted variable, that is, one x and one y, and it is known as simple linear regression. That is what we are trying to understand here.
I will not go into the many complexities of regression, but I will describe the scenarios and how we can apply regression in the Minitab interface, the easiest way of doing regression in Minitab. I am assuming linear regression here. We will try to understand some of the theory behind it, then apply it in Minitab and interpret the results. So, this is the y variable and these are the x variables, and there can be n x variables. x is known as the independent variable and y is known as the dependent variable. y can be called the output and x the input. In cause-and-effect terms, y is the effect and x is the cause; y is the symptom and x is the problem. y is what we monitor, and x is what we control. Minitab calls y the response, and the x variables are the input conditions or process settings. y is also known as the predicted variable and x as the predictor. So, there are different names for y and different names for x. You can think in terms of cause and effect: what I can set or control is x, and what I cannot set directly but only observe is y. That is the interpretation we can make. So, this is y and x. As a preliminary, we have discussed the correlation coefficient and the scatter diagram under visualization of data, which we used to see whether the relationship is linear or not and whether it is positive or not.
So, for that we have used the correlation coefficient, which is important. To develop a regression equation we first have to look at the scatter plot, figure out whether the relationship can be linear, and only then adopt the regression model. How this model is developed is also important, and there are some theoretical concepts behind it. Many books and videos explain the theory of how these models are developed in more detail. We will take the simplest model here, one single x and one single y. The scatter plot can show a positive relationship or a negative relationship. There can also be a non-linear relationship, which is the curvature that you are seeing here. For that we can think of a polynomial equation or some other non-linear relationship between x and y. There can be different types of relationship, and the scatter plot is important for understanding them. If we have a single x and a single y, I can plot them in a scatter plot, see the relationship, and judge whether a linear model, a non-linear model or a polynomial equation will work, and then adopt that specific type of model. If the points slope upward, we say that a positive relationship exists, and if they slope downward, it is a negative relationship. The strength depends on the variability: you can see that the width of this scatter is much less than the width of that one. More width means the relationship is weak. When the width is very small, all the data are confined to a narrow band.
So, in that case the relationship is stronger, and these others show a weak negative relationship and a weak positive relationship. The measure that we use to understand this is the correlation coefficient. Correlation is based on the covariance between x and y, which all software will give you. Because covariance is not bounded, we take the correlation coefficient, where we divide the covariance by the sample standard deviations: the covariance of x and y divided by the standard deviation of x times the standard deviation of y gives the sample correlation r. We also need to check it through the p-value. A t-test is used for that, and Minitab gives you the p-value, which tells you whether the correlation is significant or not using the hypothesis testing concept. There are different types of regression models. I am trying to understand simple linear regression first, so from the general regression model I go to the simplest model, which is the linear model. We can similarly see how multiple linear regression models are developed. We will skip the non-linear relationship here. Whenever I have a single predictor x and I want to develop a single y as a function of that single x, this is simple linear regression. If we have more than one x, say up to n variables to generalize, it is known as multiple regression.
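The calculation of r described above, covariance divided by the two sample standard deviations, can be sketched as follows. This is a minimal illustration with invented numbers. The function names are hypothetical, and the t statistic shown is the usual one for testing zero correlation, with n minus 2 degrees of freedom, which is an assumption about the test the software applies.

```python
from math import sqrt
from statistics import mean, stdev


def sample_correlation(x, y):
    """Sample correlation r = cov(x, y) / (s_x * s_y)."""
    n = len(x)
    xb, yb = mean(x), mean(y)
    cov = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))


def correlation_t_stat(r, n):
    """t statistic for H0: no correlation; undefined when |r| equals 1."""
    return r * sqrt((n - 2) / (1.0 - r * r))


# Hypothetical paired observations (illustration only).
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
r = sample_correlation(x, y)
t = correlation_t_stat(r, len(x))
print("r:", round(r, 3), "t:", round(t, 3))
```

Unlike the covariance, r always lies between minus 1 and plus 1, so it can be compared across data sets measured in different units.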
So, we want to understand simple regression and multiple regression, both of which are linear. How does regression work? The method of estimating the coefficients was given by Gauss. It was developed very long back, around 1800 approximately, and we can check that. It is a highly useful technique. What it says is that I can develop a mathematical function between the expected response and the independent variable. When I have historical data and draw a scatter plot, I have multiple points, where x is one variable and I have multiple observations. For observation x1 the observed response is y1, and similarly I have an observation of y for each given x. Based on these many observations, how can I draw a linear functional relationship? That is the important question. I know that to develop a line equation only two points are sufficient. If I have two points I can work out the line equation, calculating the slope and the intercept, that is, y equals mx plus c. A similar approach is taken here, but the difference is that there are not two points but multiple points, so I can have multiple lines. We have n points, and I can take any two of them, so there are nC2 combinations that I can think of. Out of these I have to adopt the one best line. So, which is the best line that I should adopt to explain the relationship between y and x?
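The two-point argument can be tried out directly. The sketch below uses invented points and a hypothetical helper `line_through`, and it assumes that no two points share the same x value. It computes the slope m and intercept c of the line through every pair of points and counts how many candidate lines there are.

```python
from itertools import combinations
from math import comb


def line_through(p, q):
    """Slope m and intercept c of the line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)  # requires x1 != x2
    c = y1 - m * x1
    return m, c


# Hypothetical (x, y) observations (illustration only).
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]

# One candidate line for every pair of points: nC2 lines in total.
candidate_lines = [line_through(p, q) for p, q in combinations(points, 2)]
print("number of candidate lines:", len(candidate_lines))
for m, c in candidate_lines:
    print("m =", round(m, 3), "c =", round(c, 3))
```

With n points the pairs already give nC2 different lines, each with a slightly different slope and intercept, which is why a criterion for choosing one best line is needed.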
So, I want to develop a line equation which is the best fit, and for that some theory is used, which Gauss suggested. It is stated in terms of the expected value. Whenever I have a regression of y on x, at a given value of x I can have multiple values of y. As we have seen, if I set a condition, run the process, and reset it to the same condition again, I can get multiple values of y. So, there will be some mean value and some variation at that point, and similarly there will be variation at other values of x. The idea of analysis of variance can therefore be extended to regression. The only difference is that in regression, at this stage, x is also continuous. Earlier x was discrete and y was continuous; now both x and y are continuous. How this line equation is developed is what matters to us. It can be expressed with an intercept and a slope, like c and m in the line equation, so the interpretation remains the same as for a line. The intercept is known as beta 0 and the slope is known as beta 1. These are the two parameters that we want to estimate from all the x and y points that we have. So, f(x) is equal to beta 0 plus beta 1 x. That is the approximation, the function of x that we want to generate from the given data set, which can have n observations. We will continue the discussion from here. So, this is the basic idea.
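The lecture has not yet stated the criterion for the best line. As a preview, and as an assumption about where the discussion is heading, the sketch below uses ordinary least squares, the method usually credited to Gauss, to estimate beta 0 and beta 1 by minimizing the sum of squared errors. The data and the function names are hypothetical.

```python
from statistics import mean


def fit_simple_linear(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    xb, yb = mean(x), mean(y)
    sxx = sum((a - xb) ** 2 for a in x)
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    b1 = sxy / sxx      # estimated slope
    b0 = yb - b1 * xb   # estimated intercept
    return b0, b1


def sum_squared_error(x, y, b0, b1):
    """Sum of squared differences between observed y and the line."""
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))


# Hypothetical historical observations (illustration only).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_simple_linear(x, y)
print("intercept b0:", round(b0, 3), "slope b1:", round(b1, 3))
print("error of fitted line:", round(sum_squared_error(x, y, b0, b1), 4))

# A line through just the first and last points, for comparison.
m = (y[-1] - y[0]) / (x[-1] - x[0])
c = y[0] - m * x[0]
print("error of two-point line:", round(sum_squared_error(x, y, c, m), 4))
```

The fitted line uses all n observations at once, and by construction no other straight line, including any of the two-point lines, gives a smaller sum of squared errors on the same data.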
So, what we are talking about is developing a simple linear regression model with one y, the CTQ, and one x, from historical data points with no experimentation. I want to identify whether x influences y. A causal relationship cannot be established by regression, but we can at least get some hints about whether x is a potential factor that I should consider for further experimentation. For that we use regression when we have some previous data. Here we are considering x as continuous and y as continuous. It is an extension of ANOVA, and you can think of it as more general: ANOVA works at discrete levels of x, whereas here x can take any value. So, y for a given x is equal to beta 0 plus beta 1 multiplied by x, plus some error. Every model will have some error. I cannot model y exactly for a given x, and I cannot have a perfect function. Every function will have some error, which means I will go wrong to some extent, but I want to minimize that error. I want a function which is very close to reality so that I commit minimum error. Some error cannot be avoided, because this is an empirical relationship and it depends on the condition of the machines or of the process. It cannot be modeled exactly because I will miss out some factors. So, there will always be some error when I develop this mathematical function using regression. We will continue from here in our next session. Thank you for listening.