My name is Andrei Turinsky. I work at the SickKids Hospital Centre for Computational Medicine, the same centre as Michael Brudno, whom you saw yesterday. My specialty is primarily data analysis, and my application areas are epigenetics, epigenomics, and some of the developmental and neurodevelopmental diseases. That has been my trajectory for the last, I would say, nine to ten years. Today we will continue the epigenetic and epigenomic topic started in the previous module.

In this Module 6 we will look at DNA methylation and at how disease-specific DNA methylation data sets can be processed. We will explore the methylation data, especially in the lab component. We will look at some interesting primary questions, such as: what are the differentially methylated patterns, loci and regions, which ones are associated with diseases, and to what extent? We will then see how to classify new methylomes and new epigenetic data sets into benign and pathogenic cases, or into different types of diseases, and what methods exist to do that. We will also look at some of the technicalities that come up very often in epigenetic data analysis, such as the effect of cell type composition, especially when you are talking about blood samples, which are very common, and how to deal with batch effects. And of course we will do some visualization, especially in the lab section, so we will see how to present our results in publication-ready figures.

All right. Those were the learning objectives, and I will go straight into what epigenetics is. You already heard in the previous module about the nature of it, so I just want to say that it is becoming more and more popular in general culture as well. It is becoming part of mainstream knowledge that epigenetics is an interesting part of science in general.
So this is an example of an article from five years ago in Scientific American, and the fascinating part here is that you have two genetically identical mice that obviously look very different. The color of their hair is different, and they also have different susceptibility to diseases. The mouse on the left is not only more yellow and bigger, it is also more susceptible to diabetes, to cardiovascular diseases, and to certain types of cancers. This is interesting because these mice, like I said, are genetically identical, but something makes a difference, and that difference is due to epigenetic effects. Of course, epigenetic effects occur not only in mice but also in humans, and that is our primary area of interest in medicine.

In terms of what types of studies are out there, the main one by far is the case-control study. The reason is that it is, I guess, the easiest to collect data for. It is easy enough to collect a number of cases for a certain type of disease, which could be cancer or some developmental disease, versus matched controls. It is harder to do it with trios or with families, because it is hard to find appropriate families. It is even harder to do it for monozygotic twins, although there are some such studies, and they usually suffer from low power and a small number of samples. There are also some prospective studies, where you follow the evolution of epigenetic patterns over time. But we will primarily focus on the first type, which is case versus control, or different kinds of diseases compared to each other.

To do this we have a plethora of packages and methods already developed over the last, I would say, decade. This is an example of several packages that address typical data processing and analysis steps for methylation data sets. Specifically, here we are talking about the 450K analysis pipeline, for methylation microarrays that have roughly 450,000 probes.
In fact it is roughly 480,000, but it is known as the Illumina 450K microarray. For this microarray there are typical analysis steps. We import the raw data files, the actual imaging scans of the microarrays; then we process them, filter them, correct for the background, and adjust for probe type, of which there are two. Then we apply a few more technical steps, such as accounting for cell type composition and batch effects. And finally we come to the juicy part of the analysis, the detection of differentially methylated positions and differentially methylated regions. That is, I would say, the primary outcome of all this analysis. From there we find the genes that are potentially associated with the disease we are studying, and then basically speculate. The black box at the very bottom says "biological interpretation"; that is primarily what goes into the discussion section of the paper, and all the steps before it are usually the methods and results.

Now, on the left there are colored boxes, and they indicate which R and Bioconductor packages are available to do these steps. If you notice the second one, minfi, the orange box, it can be applied to most of these steps except the last two. The last one, biological interpretation, is just interpretation; there are no packages that actually do that. But minfi is one of the primary packages that we like using, because it is powerful, versatile and well maintained. Of course there are other packages, and it is sometimes useful to run a few of them and see whether the results match and what the differences are. Today we will be working with the minfi package during the tutorial, so it is worth examining.

Now, in terms of how we view the data, switching gears a little bit to data representation: primarily we are talking about tables.
Ultimately everything is a table in which rows are genes or probes, and columns are subjects or tissues or anything to do with the actual experimental conditions. We can interpret this as data points living in a multidimensional space. In this setup the columns define the data points. In this figure, which is a mock figure of course, we have three subjects living in a two-dimensional space, and the two dimensions are defined by gene X and gene Y. In the real world we live in a highly multidimensional space, typically on the order of half a million probes, 800,000 probes, a million probes and so forth. That is where our data sets live, so to speak, and that is where we have to visualize them, at least mentally, and see how to analyze them.

Of course we can do the reverse. In some steps of the analysis we can turn the tables, so to speak, and view the tissues or samples as the features that define the data space, and analyze the genes as points living in that space. Then we see what the connections are between the genes, or which probes on the microarray are co-expressed or co-methylated. That can be useful once in a while: we can define clusters of co-methylated probes and then examine why those genes are co-methylated, or co-expressed if we are talking about expression microarrays. But for the most part we are dealing with a space defined by probes and genes, where probes means the probes of the microarray, or cytosine positions on the genome if you are talking about bisulfite sequencing, and the tissues or samples or subjects are the primary objects for which we try to find associations, clusterings, classifications and so forth.
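A minimal sketch of the two views just described, written in Python with NumPy purely for illustration (the tutorial itself works with R packages). The probe names, subject names and methylation values are made up, and the helper `pairwise_distances` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Hypothetical toy table: 4 probes (rows) x 3 subjects (columns),
# each entry a methylation level between 0 and 1.
probes = ["probe_1", "probe_2", "probe_3", "probe_4"]
subjects = ["subject_A", "subject_B", "subject_C"]
methylation = np.array([
    [0.10, 0.15, 0.80],
    [0.20, 0.25, 0.90],
    [0.70, 0.65, 0.30],
    [0.50, 0.52, 0.49],
])


def pairwise_distances(points):
    """Euclidean distance between every pair of rows in `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# View 1: each subject is a point whose coordinates are its probe values.
subject_points = methylation.T                  # shape (3 subjects, 4 probes)
subject_dist = pairwise_distances(subject_points)

# View 2 ("turning the tables"): each probe is a point whose coordinates
# are the subjects, so correlated rows are candidate co-methylated probes.
probe_corr = np.corrcoef(methylation)           # shape (4 probes, 4 probes)

print(np.round(subject_dist, 3))
print(np.round(probe_corr, 3))
```

In this toy table, subjects A and B sit close together while C is far away, and probes 1 and 2 move together across subjects while probe 3 moves in the opposite direction. With a real array the same two views apply, just with hundreds of thousands of rows.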
Now, of course, we have to deal with both supervised and unsupervised scenarios. In the unsupervised scenario we simply consider the geometry of the space: we look at the data points and try to find clusters among them. In the top left mock image we see that there is one cluster on the bottom left and a few more points to the right, so we define two clusters. As we collect more data points, perhaps we will refine our cluster definitions, or perhaps we will find a third cluster somewhere at the top. So of course collecting more points is always better.

In the classification setting, as opposed to clustering, our primary concern is not only to see the geometry of the data set but also to find classes in it. These classes may or may not correspond well to the clusters. If they do, that's great: it means the clusters are actually defined by, let's say, the diseases or conditions that we are studying. If not, then we have some trouble separating those diseases and seeing how to draw the correct boundary between them. Again the same applies: the more points you have, the better. In the bottom left scenario you see two small clusters, and it is easy enough to draw a linear separation between them. In the bottom right, if we accumulate more data points and discover that the third cluster that appears actually belongs to the disease rather than to the control, it violates the previous border between the clusters. So we have to deal with reclassification and redrawing the boundary, and perhaps rebuilding the models or choosing a different, nonlinear type of model to separate them, or, you know, deal with the problems as they occur.

Of course, another popular method is principal component analysis. That is usually one of the first ones to quickly apply to explore a data set. I'm sure you are familiar with it, but just to recap: we are looking at, again, a
highly multidimensional data set, and we want to reduce it to something we can visualize, something that is humanly understandable. The essence of it is to find the dimensions of the largest variation and then skip the rest. We reposition the coordinate space so that the direction of the largest variation is aligned with our coordinate axis, and then skip the rest of the axes. In the small example here we go from two dimensions to essentially one. We consider this data set and say, okay, the biggest variation is along the diagonal between the top left and the bottom right cluster; that will be our first principal component, and we ignore the other principal component. In real life, of course, we have many, many genes and probes, and from that we reduce our space to just a few principal components and look at how the data is visualized in that scenario.

So here is an example from a paper that we published several years ago. It has to do with methylation data sets for several mutation cases in KDM5C, which is a histone demethylase gene. We have mutations, those are the data points in red, and we have controls and some benign cases, those are the data points in green. We plotted the principal component plot and saw that there was a nice separation between the mutations and the controls, and this is a good graph to actually publish and show. Of course, there is a secret: we pre-positioned our space accordingly. This was not at the initial analysis stage, where you are looking at the entire set of, in this case, roughly 27,000 probes. This is after we selected the so-called disease signature: we selected the specific CpG sites, or cytosines, for which there was a difference in methylation. Once you select those sites, it is no wonder that you then discover that the two clouds of points are distinct and that there is a clear boundary between them, because the space was positioned that way to begin with. The original
data set looked something like this. Before we looked for differentially methylated positions, differentially methylated cytosines, there was some overlap between the points. There was some promise as well: you see that most of the mutant points, the red ones, are on the left of principal component number one, and most of the benign cases and controls are more to the right. So there is a promise of separation, a visible distinction between the two clouds, but it needs to be clarified, which was the point of the analysis in the paper.

Now, in terms of how to find these differentially methylated regions, loci, positions, there are multiple approaches, and I will point to at least three of them. We can run a regression analysis; a very popular package for that is limma. On the top you see a basic regression model where the methylation at each probe is regressed on the disease status, let's say disease or control; the sex of the person is also included, and age is included, so you can then separate out the contribution of the disease to the difference in methylation. There are other alternatives. You can look for non-normal effects using non-parametric methods like the Mann-Whitney U test or the Wilcoxon test, which account for differences between normal distributions and typical real-life distributions that have thicker tails, where more extreme values are more common.

Another approach to filter down the differentially methylated positions is by effect size. You can have a really good p-value, but the actual effect size, the difference in methylation, could be very small. Those are not very interesting sites for us; we want to see loci and regions where the difference is really large. Maybe 10% is good, 20% is even better. For global diseases like cancer, which really wrecks your whole methylome, the differences would be really large, like a 30, 40, 50% difference in methylation
and so forth. For something more subtle like autism, a 5% or a 10% difference is good enough, because autism is more subtle and heterogeneous in that sense.

Now, there is a whole industry of methylation software, of course, and I will just point to this example, the Qlucore Omics Explorer software. It has lots of features, but essentially what it does well is present principal component analysis dynamically. As you drag your slider, you are choosing your signature more or less stringently, and that changes the position of your principal component axes and of the points within them. It is like going from the previous principal component plot one to principal component plot two that I showed before, in real time: you slide between different positions of your principal components, changing the geometry of your points depending on the stringency of your signature.

Principal component analysis is perhaps the first approach in exploratory analysis of methylation data sets. Another very common one is hierarchical clustering. This is a clustering heat map from a previous paper of ours, from Nature Communications a couple of years ago, where we looked at a Sotos syndrome data set, that is the one in pink, and matching controls, those are the blue ones on the heat map. All of this was done with blood samples, so whole blood. Again, the idea here is to take your entire methylome and filter it down to a signature, which is the differentially methylated positions, or it could be regions. Those positions are chosen so that, hopefully, the heat map will show you the difference between the two clusters, the cluster of disease and the cluster of controls, simply by the nature of the geometry that we chose: we chose the dimensions that highlight the difference between our groups, then we visualize it, and voila, it is there. So we find the difference indeed. If the clusters are
not well separable, then there is a problem: perhaps the groups are not as well separated, and you will have some mixture between, in this case, pink and blue points, so you will see some of the points switching regions, so to speak, between the groups. All those interesting results can be found once you have the signature, once you have the differentially methylated positions or regions for your data set.

From the same paper, we can go a step further and use these signatures for the purpose of classification. In the previous hierarchical clustering model, as you keep adding new points, the model itself changes, right? If you start with your 19 cases and 20 controls and then keep adding new samples and new controls, say a thousand of them, they will cloud and perhaps destroy your whole clustering structure. In the classification models that I am showing here, we are trying to preserve the model.

Let's walk through what was done here step by step. We found the methylome profile for the disease, in this case Sotos syndrome. It says "NSD1 +/-" here on the y-axis; that is the gene that is causative for Sotos syndrome, which is caused by a loss-of-function mutation in NSD1. So we found the profile for Sotos syndrome, we found the corresponding profile for the controls, and then the question we ask is: if we have a new methylome, a new patient or subject, does it look like the disease profile or the control profile? It is a fairly simple, really basic classification model. We take the new person, we take the differentially methylated positions that we found, there were about 7,000 of them, and simply compute the correlation to the disease profile and the correlation to the control profile, and see which correlation is higher. In this case we are looking at, let's say, Weaver syndrome; those are the orange triangular points here. What we did is we took
several patients that have this other but related syndrome, Weaver syndrome. It is related to, and sometimes clinically overlapping with, Sotos syndrome, and we tried to see whether we can classify Weaver samples as Sotos syndrome or not. We took each Weaver patient, reduced the methylome to only the 7,000 positions that we found to be interesting, quote unquote, and then simply computed the correlation to the controls. As you see, those were in the high 90s, roughly 95 percent correlation, something like that. We also computed the correlations to the Sotos syndrome profile, and those happened to be lower, actually in the low 80s or high 70s. So clearly there was a distinction between Sotos and controls, and the Weaver samples fell into the control profile. There was a molecular distinction between the two clinically overlapping syndromes, which was interesting: we can molecularly classify two different syndromes into two different piles, so to speak.

Just recently, last week in fact, we published another paper comparing two different syndromes. These are two neurodevelopmental syndromes, CHARGE and Kabuki. CHARGE syndrome is called CHARGE because it is an acronym, as you see there: coloboma of the eye, heart defects, atresia of the choanae, retardation of growth, genital defects, ear abnormalities. Kabuki syndrome is so called because of a specific facial gestalt; you can see that the definition of the face is not exactly right. The interesting part, perhaps, is that Kabuki syndrome manifests itself slightly later in age, so the child should be about three or four years old before you actually see the clinical manifestations appearing. Before that it is hard to distinguish these two syndromes clinically. Molecularly there is a difference: CHARGE syndrome is often caused by a mutation in the histone-modifying gene called CHD7, and Kabuki syndrome is caused by mutations in two different genes, so you see there
KMT2D and KDM6A: one is a histone methyltransferase and one is a histone lysine demethylase. So molecularly there is this difference. Clinically, though, there are papers that essentially say there is a link between them, and there was a patient who was first misclassified as CHARGE, but later, during development, it became apparent to the clinicians that it was actually Kabuki syndrome, not CHARGE syndrome. Of course there are implications for treatment, prognosis and so forth. So our goal is to detect these differences as early as possible, hopefully at the molecular level, before the actual clinical manifestation is apparent.

What we did is we built two different models, machine learning models, not as simple as the one I showed before with basic correlations and clusters. Essentially there was a model that predicts the CHARGE score, or a loss-of-function mutation in the gene CHD7, which corresponds to CHARGE syndrome, and that is on the x-axis. So on the x-axis you are essentially trying to predict whether a new sample, a new patient, is pathogenic in terms of CHARGE, or benign. The same thing was done for Kabuki syndrome: we predicted the pathogenicity of Kabuki syndrome, or a loss-of-function mutation in the KMT2D gene. Then we compared all our points with respect to these two scores. You take a new patient, you score it with the Kabuki model, you score it with the CHARGE model, and you see where the point belongs.

The critical thing here is to avoid placing new patients into the upper right quadrant, which is where the legend is. Hopefully no new sample will land there, and in fact none did. What that means is that none of our cases or controls or other external data sets were classified as both Kabuki and CHARGE. None of them had high scores for both syndromes, which means that there was no overlap and we
were able to separate the two syndromes fairly well molecularly. Pathogenic cases for CHARGE went into the lower right corner, where all the red points are, and pathogenic cases for Kabuki went into the upper left corner, where all the blue points are. The interesting part is that, for Kabuki syndrome, the model was built on the KMT2D gene, which is the primary gene that causes Kabuki syndrome, but we also had one case for the other gene that causes that syndrome, KDM6A, the histone lysine demethylase. That is the little blue diamond that sits in the upper left corner. So a model built on one gene for this disease was actually able to score well the sample with the other gene mutation: that other mutation was also scored as pathogenic, as perhaps having Kabuki syndrome, and that was an interesting connection between these two cases.

Yes? Ah, right. So the question was: what is all that mass in the bottom left corner? Most of them are controls, and there are also some benign mutations. We tried to score benign mutations and some of the variants of unknown significance, and they also went into the bottom left corner with the controls. For them the prediction was benign, no pathogenicity, so they naturally appeared among the controls.

We also did the same scoring on an external data set. We took hundreds of normal blood profiles from GEO, the Gene Expression Omnibus; that is actually a data set we will be looking at later today in the tutorial. We tried to see what can be said about them, just to check ourselves: will any of the normal bloods from external third-party data sets be scored as either CHARGE pathogenic or Kabuki pathogenic? No, they were all scored as benign. Those are the green crosses that are shown, again, in the bottom left corner among the controls, and that made us happy.

Right. Now, previously,
all these analyses were based on differentially methylated positions. As the microarrays become bigger, there are more and more probes, of course, and more and more concerns about multiple testing correction. The more probes you have, the more stringent your correction becomes, and the more results you lose, so to speak. And as we move into sequencing, you will have millions of positions, perhaps tens of millions. How do we score them all at once, and what do we do about the multiple testing issue? Because applying it properly might destroy all your results.

An approach that is growing in popularity is to look for consistent regions of methylation. Methylation usually works in regions or blocks: it is not a single position that is important, it is perhaps the combination of several nearby positions that gives you the methylation pattern. Again, this comes from the same paper we just published on Kabuki and CHARGE. We look for a consistent pattern of methylation within, perhaps, the promoter of a gene, or in some other contiguous region near an interesting gene. Here I am showing, on the top panel, the methylation profiles near the promoter of the HOXA5 gene, the homeobox A5 gene. There is an average line corresponding to the control profile, and you see that the methylation for the controls stays roughly around 60 percent, something like that; the scale is on the y-axis. For CHARGE syndrome, that is, for the CHD7 loss-of-function mutation, there is a clear gain of methylation there. Each collection of points that you see is where a cytosine is, that is, a probe from the microarray. For each microarray probe we know its position on the genome, and at that position we have certain values for our group of controls and certain values for our group of cases. So we plot those values, and in this
case we see a clear difference. Take the first position there: the red circles represent all our cases of CHARGE syndrome and the green crosses represent all our controls. There is some overlap, but clearly there is a difference, and there is also a difference between the averages. It is quite easy to see that the average difference is maintained consistently throughout the whole region, and the region is as long as the scale at the bottom shows.

The situation with Kabuki syndrome is actually similar. We take the same methylation profile for the controls, the green line, which is exactly the same as the top green line, and the blue line is the average methylation profile for the Kabuki patients. Again, each cytosine from the microarray is scored, and we see a consistent difference, which goes in about the same direction as the CHARGE syndrome methylation pattern. So we look at this gene and see that perhaps the behavior of these two cohorts near this HOXA5 homeobox gene is similar enough to indicate that it may drive the clinical similarity between these syndromes. The discussion section of the paper elaborates on that.

In some other genes, such as SLITRK5, the pattern is different. The panel on the top shows a loss of methylation in CHARGE: you see a consistent region, bounded by certain other cytosines, where methylation is lost in CHARGE, and the average line shows the magnitude of this difference. For Kabuki syndrome the situation is the opposite: there is a gain. The green line indicates the controls, and the Kabuki syndrome patients are on average always higher in that region near the SLITRK5 gene. So perhaps this is a gene that causes the clinical differences, or differences in the
progression of these two syndromes.

Okay, switching gears a little bit to technical discussions. All this was good, but there are some problems that need to be addressed. One of the recurring questions is: what is the right cell type, how do we deal with cell type composition, and which subtypes contribute to the methylation effects? There are all kinds of problems that may appear. This is one of the papers that highlighted at least two of them. In the top case, think of it this way: you have two kinds of cells, and it is not a difference in methylation that is driving the difference between cases and controls, it is the difference in cell type composition. Cell type A and cell type B are methylated differently but stably; there is no change in methylation. It is just that more, let's say, highly methylated cells are coming in and more unmethylated cells are moving out, so to speak. So the difference that we are finding is not a difference in methylation, and it would be a wild goose chase to look for what drives the methylation difference. It is rather a difference simply in the composition of the things that are methylated or unmethylated. If you are dealing with a complex sample such as blood, which consists of multiple cell types, these issues need to be addressed, and we need to somehow account for the difference in composition. Of course this applies to other cell types as well, cancers and so on.

Another interesting case is that you can have the same composition of cells, but some of them could be active and some could be quiescent. So again the question is how you measure the cell type composition versus what methylation you actually see in your sample.

So there are all these problems with composition. Then there are problems with choosing the right cell type as well. Just recently, in November, there was a whole collection of epigenetic papers published, 41 papers, I think, at the same time.
One of them was on autism. The authors took three different cell types in the brain, prefrontal cortex, temporal cortex and cerebellum, and they were looking, in this case, not for differentially methylated regions but for differential acetylation; they were looking at histone acetylation. They found massive differences between those cell types, so depending on which cell type you start with, your results will differ really drastically. Table B at the top right shows you what the p-values are and what number of differentially acetylated peaks you could find, and there really are orders of magnitude of difference between the cell types. They were lucky, or smart, to take three cell types and compare them. If you take only, let's say, cerebellum, you will be losing out: you will not find the differentially acetylated regions that other cell types may possess.

A very interesting paper came out several years ago by Rafael Irizarry and colleague. Here the concern is about aging and blood cell type composition. They took a data set produced by another group and essentially looked at blood cell type composition as aging goes on. If I look at, let's say, panel C here, panel C is sorted not by age but by the proportion of granulocytes; that is the gray bar at the bottom of the heat map. As the proportion of granulocytes increases, you can predict fairly well which age group the person belongs to. The age groups are indicated in orange, gray and green at the top, and they align fairly well with the proportion of granulocytes. You can see the age shown there: the granulocyte proportion rises and the age also rises, or vice versa. So the concern, again, is that if you simply take blood, you don't know what you are dealing with; you need to know what the blood composition is. So if a high proportion of granulocytes will
cause a difference in methylation for you, it will not be due to any disease; it will simply be due to age, perhaps. There are small arrows that point to the first and the last samples in heat map C. In panel D, what these authors did is take standard methylation profiles for six different cell types, you see them there: natural killer cells, CD8, CD4, granulocytes, B cells, monocytes. They were able to essentially reconstruct what proportion of each of those cell types every sample has, and then the methylation of that sample, of that person, is simply a linear combination of the six profiles. In panel D the profiles at the top are exactly the same; it is the proportions, the numbers at the bottom, that are different. The different mixtures give you the difference between the very first person in the heat map and the very last person, who obviously has a very different methylation profile. So that was an interesting paper, and these authors actually released a package, it is part of minfi, where you can estimate the proportions of your cell types and take it from there.

How they did this was based on another study. As I mentioned, there is a study by Reinius et al., 2012 I believe; it should be on the slide somewhere. What these people did is collect purified cell subtypes from six individuals, middle-aged individuals actually, so we have profiles for each of those cell types. This is an important study because, well, everybody uses this data set. It is used by many other packages and studies that try to report cell type composition for blood, or build regression models or more sophisticated models for cell type deconvolution. Here we have roughly 10 different cell types: whole blood or peripheral blood, and mostly different purified cell types like natural killer cells,
B cells and so on, from which we can extract profiles, compare them to your data sets, and so on.
An example of, well, actually several studies that use this data set and other data sets is by Houseman et al. Houseman and this group developed methods to predict cell type composition, and this is one of the most recent ones. Basically the details are as follows. You take your methylation matrix, that is the one on the right, where again the rows are your CpGs, meaning probes on your microarray if you want, and Y is the matrix, so the columns are your specimens: your observations, samples, individuals. You try to see how to decompose this matrix into a bunch of profiles for your cell types, that is the matrix M, and the proportions of those profiles that each of the subjects has. So essentially, from the study of different methylation patterns in samples or subjects or perhaps tissues, we are moving to the study of different methylation profiles: we're trying to extract the methylation profiles for your cell types. There was a big question of how many cell types you actually expect to have there. We don't know in advance; sometimes you estimate a certain number and it's not exactly right. But the authors actually have a method of guesstimating, or roughly estimating, the number of appropriate cell types, and from then on you continue your study, essentially comparing the profiles of your cell types rather than the profiles of your original patients.
Now, this is an interesting, fairly recent review paper where they tested different ways to predict cell type composition. They tested the reference-based method, such as the one developed by Rafael Irizarry, the paper mentioned previously; they tested the reference-free method, the one I just spoke about by Houseman; they tested several other methods such as surrogate variable analysis and
others, and all the tests are presented; you can examine the figures and see the results. The conclusion is that perhaps the best way to estimate these cell type compositions and resolve them is to apply something called surrogate variable analysis, SVA. This SVA decomposition is fairly generic; it doesn't stop at cell type composition only, you can find some other hidden dimensions, so to speak, in your data sets. From that moment on, once you estimate the hidden variables, or surrogate variables, you can use them as confounders in your further analysis. You can use them in your regression models; you can say that perhaps they correspond to batch effects, or cell type composition effects, or to some other interesting effects that are confusing your disease signature. Like I said, the authors present several tests and studies. There is reference-based, meaning you take the existing profiles for known cell subtypes, such as the one by Reinius that I mentioned, the frequently used data set; or reference-free, where you basically estimate things using regression models and hope for the best, so to speak; surrogate variable analysis, various kinds of it; and some other methods that have come up recently. And the comparison shows that, depending on which method you use, you could be estimating the cell types more or less correctly, based on simulation data.
Now, in our studies, what happened is that we were frequently asked, because we use blood: did you find a signature that is confounded by cell subtypes in blood? There are several ways to answer this question. One way is to go and estimate the blood cell types directly: you can actually take the blood sample and use your centrifuges. I'm a mathematician, I'm not a biologist, so this is something out of my hands; I don't actually know how to do it, I know it only theoretically. So
the problems there: because I'm from SickKids hospital, you may not have enough blood. Simply for ethical reasons, or for other technical reasons, you may not have enough pediatric blood samples to do other steps in the analysis. If your study is based on blood collected five years ago, let's say, you cannot just go back and recollect the blood, because it's been used already, and then do the purification and so forth. Another approach, of course, is to do analytical estimation. We can use the methods listed on the previous slide, anything from minfi to Houseman to some of the others.
An approach that we found effective, and persuasive actually, is the following. We would take principal component analysis, just like the one I showed you before, between the disease samples and the controls. We take the disease signature, we reduce our space to only those cytosines, the differentially methylated positions, and then we add the purified cell type samples into that space. So instead of only comparing cases to controls, which are separate, we also compare the purified cell types, taken usually from this Reinius et al. 2012 data set. The question here is: if you have your, let's say, disease versus control, and there's a clear separation, is that separation perhaps due to some cell type switching between disease and control? Perhaps your cases have more prevalence of granulocytes, or natural killer cells, or B cells, or one of those things. So is the composition in any way affecting the positioning of your clouds? Well, in this paper, this was again the Choufani et al. Nature Communications paper, the Sotos syndrome paper, ours, we added the purified cell types, or subtypes, and we saw that all of them clustered with controls. So there was no issue of our Sotos syndrome samples being affected in any way by the composition. That was quick, persuasive enough, and just several hours of
work, instead of repeating your experiments or going back to the wet bench and so on. We found it persuasive, and so did the reviewers, and the readers, I hope.
Yes? That holds in particular for this disease, but in other disease states, where you would imagine there is an inflammatory component, yes, you expect different methylation patterns, and you can't be as comfortable: you can create these clouds, but you don't know what cells they are. Very true, very true. Because they're not going to match up to the reference, especially if it's reference-free. Very true, that's very true. We'll actually look at Down syndrome later on in the tutorial, as one of the examples where immunocompromised samples can actually have a different composition of cell subtypes. So yes, very true. Yes, I'm sorry?
Okay, so the question is why we are talking about blood: is it mostly an issue in blood or not? There are other issues, of course, as well. Think about a cancer sample: a cancer sample may not be pure. Let's say if you have metastatic cancer, you can have normal cells appearing in the sample, depending on how the sample was extracted, how the biopsy was done, and so on. The reason we're talking about blood so much is that blood, in the epigenetic context, is one of the primary tissues to go for. It is one of the easiest to access: you would go for blood, or buccal, or saliva, and those are the easiest ones. Beyond that you will need, well, autopsies or biopsies, so that is hard. So many, many studies that deal with diseases try to look at blood, because it's just the lowest-hanging fruit, and you have an abundance of blood samples compared to other kinds of samples. So in blood there was, I would say, more effort to actually look into this thing. And also there are lots of applications: immunology was mentioned, cancer, leukemia, things like that. A lot of things deal with blood, so this is good to
kind of good to figure out. Yes? Yes, very true, and there was a paper by Michael Kobor from UBC. That's a sample, right? Yeah, yeah, okay. Yes, in Down syndrome and everything, exactly. So there was a contamination of the buccal samples, and then they figured out it's because the more severely affected patients perhaps had trouble producing the buccal sample, and blood contaminated the samples, and it was clear from the data itself. Yes, so, valid point. Yes. Okay.
Another interesting, well, thing to worry about is batch effects. These are prevalent, these are common, more common than you think, certainly more common than I thought when I began. This is an example from, I believe, HapMap, and they looked at the sequencing coverage over a certain year. On the, I think, y-axis it's the day on which certain samples were scanned, and there is a region in the genome, the genome location is shown, I think chromosome 3 it is. Basically it should be random, or it should be more or less consistent, but there's a streak of orange somewhere between day 243 and 254 which basically shouldn't be there, and there's another little streak. So depending on which day you're processing the data set, things may be different. During the tutorial we will actually simulate a scenario where there's a batch effect contamination, and we'll try to resolve it.
All right, so an interesting paper here is by Leek et al. Leek et al. is one of the good papers to read about batch effects; they are one of the pioneers in developing these batch effect correction methods. What they are suggesting here is that batch effects are persistent: they don't necessarily go away after normalization. They might, but often they don't. Here they took some samples based on processing dates, that is the day green and the day orange, so to speak. They took the original data set, this is gene expression, and they normalized it using quantile normalization, so panel B shows
you nice, clean, quantile-normalized, well, expression profiles. Okay. But then when they looked at specific genes, it turns out that for some of these genes, and there are hundreds of such genes, you still see a clear pattern between batch one on day one and batch two on day two. The green versus orange expression probably has nothing to do with actually days one and two, but the differences are there. So normalization, quantile normalization, by itself does not necessarily remove these batch effects; they are persistent, and subsequent clustering can still reveal them.
Now, there are different methods of batch effect normalization. The basic one is mean centering: you find the mean for each batch, you subtract that mean from every sample in that batch, and now your batches are all lined up. Okay, that's one way to do things. The next level of complexity, perhaps, is standardization: not only do you normalize by mean centering, you also normalize the variance, so you try to bring the overall profiles together so that your batches match each other. Then there's the possibility of a reference-based approach. If you have multiple batches and they all have, let's say, controls, you know that these controls should look more or less the same, right? So if they look different among the batches, then perhaps what you should look for is the difference from that control. Instead of looking at your, well, it could be expression, it could be methylation, it could be some other kind of measurement, you would look at, let's say, the deviation from your reference, or deviation from control. You would know that batches may differ, but the deviations within each batch should be the same. So once you remove this sort of background reference, things should be comparable from now on. Then there are methods such as regression
modeling. Essentially you would add your batch as one of the independent variables in your regression model. So your regression for, well, it could be gene expression, it could be methylation, but in any case your methylation now depends not only on disease status and sex and age and perhaps some other covariates, but also on the batch as a covariate. You would add batch, if you know what your batch is, as a covariate, and then hopefully your regression model is good enough to separate the contribution of that batch from the contribution of the disease status, and you crystallize your contribution from the disease status and take it from there.
The industry-standard and commonly used way is to apply ComBat. I will not go into the details of the empirical Bayes method here, but I will just say that ComBat is frequently used, it's the go-to method, and it's available: there's an R package called sva, for surrogate variable analysis, and the ComBat function is provided. We will actually play with it in the tutorial later on: we will apply it to some batch-contaminated samples and we'll see if it's any good in correcting for batch. Yes?
So the question is what is the source of the batch effect. It could be almost anything. It could be your dog runs through and spits in the sample; it could be that you hired a new technician; it could be almost anything. The date perhaps is one of the clear things to go for. If you scan your samples in October 2013 and then you scan your next batch in September 2016, look for batches; I would guarantee there is some kind of batch effect, simply because, yes, there's humidity, there are different reagents that were bought, there was a different batch of chemicals that was used, a different person perhaps processed it. So all kinds of influences go into this funnel that eventually comes to the differences
between that and that. The better way to look at them, perhaps, is again principal component analysis or some of those visualization methods. They are often good at separating batches, or at least at pointing out that there's something very suspicious. When you're plotting all your data points and you suddenly see that batch one is all here, and batch two and batch three and batch four are kind of patchy, all over the place, clumped together or not very well mixed, there is trouble. So perhaps this is the time to look for either ComBat, or for what other problems and covariates exist there. It's hard to answer what exactly is there, but it's there.
Right, so there are several other papers that examine which batch correction methods are good or bad, or how they work. There's this paper by Johnson and Li, adjusting batch effects in microarray expression data using empirical Bayes methods; that's the one that describes ComBat. Here they would say: you take your initial data and you look at the clustering, and batches are all over the place. You take your standardization, that's the one I mentioned previously, where you mean-center and you also standardize your variance, and the batch effect is still not entirely removed. But then you apply our wonderful ComBat method and, voila, things are much better.
Okay, as I said, the sva, or surrogate variable analysis, package has been released. There's a paper by Leek and Storey, capturing heterogeneity in gene expression studies by surrogate variable analysis. It has a lot of other things as well: it doesn't only do batch effect correction, it looks for other hidden dimensions and hidden variables in your data. But among other things, it provides nice, easy access to the ComBat function in R. It used to be a little more cryptic; it used to be a standalone function several years ago, I want to say five years ago, give or take,
but then they brought it into this sva package, and it's there, so it's available.
I think I'll quickly wrap up; I think it's 11:28 already, right? So I'll skip this. What I will point out is the following: some of the batches come naturally out of your pre-processing and the positioning of your samples on the slides. Here, this is an example of 450K methylation microarrays. If you process your slides in a certain way, it's better than certain other ways. For example, if you place all your disease samples on four slides and all your control samples on the other four slides, without mixing them, well, then you don't know if your differences are due to disease versus control, or due to slides one, two, three, four being different from slides five, six, seven, eight. So here the study basically says: if you position your samples on the microarray chips in the first way, without mixing them, then you will find 94,000 differentially methylated positions, and that is something to write about, obviously. But if you position your samples properly, mixing them, you'll find nothing, absolutely zero. So all those 94,000 differentially methylated positions were actually spurious, and ComBat did not help: they were found after the ComBat correction. So ComBat is good, but not that good. It cannot fix your experiment if it's ruined to begin with. That is a good take-home message.
Another paper says, well, basically the same thing: if you're not mixing your samples on the slides, look for trouble. If you're mixing them, ComBat will help you to clear some of the associations between your slides and the principal components in your data sets. If you don't take care of mixing the samples on your experimental chips, ComBat will not help you. So the best methods are powerless if you are not designing your experiment well. And simply because this paper says batch effect and pathway analysis, two potential perils, I'll just say that they also point out that pathway analysis, this
enrichment analysis that we love so much, sometimes gives spurious results as well. They took random CpGs from some random simulations, and they tried to score them through the Ingenuity pathway enrichment analysis, and they found all kinds of interesting patterns with apparently very significant p-values. So beware, beware of Ingenuity, beware of pathway enrichment analysis in general, because you'll find cancer, you'll find neurodegenerative diseases, you'll find some other cellular development diseases and so on, and they may very well be spurious.
I think I am about to wrap up. I'll just point you to an existing YouTube video. Listening to me is wonderful, but there are better people who know more about batch effects, and one of them is Rafael Irizarry. Just recently he provided a YouTube tutorial, or lecture; it's about an hour, overcoming bias and batch effects in high-throughput data. I encourage you to actually spend an hour; it's a good investment of your time, listening to a true expert in the field on how to correct for batch effects.
And I'll just wrap up by saying that, again, I'm from the Center for Computational Medicine at SickKids. Michael Brudno is the head of the center; we collaborate, of course, and there are other people in the team. On the application side, I must acknowledge the Rosanna Weksberg lab from SickKids, who provide me with a lot of data and, well, stimulating conversations, and we collaborate a lot on all these papers I mentioned before. Funding: SickKids, Genome Canada, Ontario Genomics. And we are part of the Canadian Centre for Computational Genomics; that's between Michael Brudno's lab here and Guillaume, who is sitting over there, in Montreal. So we actually have one center distributed across two different facilities. Well, thank you very much.
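As a small illustration of the design point made earlier in the talk, here is a minimal sketch, not taken from any of the papers discussed, of the regression approach with batch as a covariate. It uses Python with NumPy rather than the R packages mentioned in the talk, and every name and number in it (fit_disease_effect, the eight samples, the 0.2 slide shift) is a made-up assumption for a single simulated probe with no true disease effect. It fits methylation ~ intercept + disease + batch by ordinary least squares, once with cases and controls mixed across two slides and once with all controls on one slide and all cases on the other.

```python
import numpy as np


def fit_disease_effect(values, disease, batch):
    """Fit values ~ intercept + disease + batch by ordinary least squares.

    Returns the disease coefficient and the rank of the design matrix.
    A rank below 3 means disease and batch cannot be told apart.
    """
    design = np.column_stack([np.ones(len(values)), disease, batch])
    coef, _, rank, _ = np.linalg.lstsq(design, values, rcond=None)
    return coef[1], rank


rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.01, size=8)

baseline = 0.5      # hypothetical methylation level of one probe
batch_shift = 0.2   # hypothetical technical shift on the second slide
# No true disease effect is simulated: any nonzero estimate is spurious.

batch = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
values = baseline + batch_shift * batch + noise

# Mixed design: cases (1) and controls (0) are spread over both slides.
disease_mixed = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)
effect_mixed, rank_mixed = fit_disease_effect(values, disease_mixed, batch)

# Confounded design: all controls on slide one, all cases on slide two.
disease_confounded = batch.copy()
effect_confounded, rank_confounded = fit_disease_effect(
    values, disease_confounded, batch
)

print("mixed design:      rank", rank_mixed,
      "disease effect", round(effect_mixed, 3))
print("confounded design: rank", rank_confounded,
      "disease effect", round(effect_confounded, 3))
```

With the mixed design the matrix has full rank and the estimated disease effect stays close to zero, as it should. With the confounded design the rank drops to 2, and the least-squares routine simply returns one of infinitely many equally good fits (the minimum-norm one), which here assigns about half of the slide shift to disease. In this toy setting the data alone cannot say which part is which.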