Hello, everybody. My name is Andrei Turinsky. I'm also from SickKids, like some of the presenters from yesterday, and we will continue with the topic of epigenetics in this session as well. As for the learning objectives, I'll try to hammer in again the idea that epigenetics is very important in addition to genetics. We will then do the practical, where we will explore some DNA methylation data sets. We'll try to classify new samples, see which ones could be classified as pathogenic or benign, and see how to visualize the results.
So, let's start with this topic. I have to say that epigenetics has been exploding over the last, I would say, 15 years or so, even in the popular imagination. People who are far removed from biology or genetics or bioinformatics are realizing that, as the Time magazine cover from 2010 says, DNA isn't your destiny. Not everything is packed into the genome itself as people understand it, the double helix with the nucleic acids. If you think about it, my body, for example, has cells that are virtually identical genetically, but obviously my brain cells are different from my liver cells, different from my blood cells, and so on. And the same thing happens across diseases. On the right side of the slide you see two mice. They are genetically identical, but they obviously look very different. That is due to just one gene and one promoter of that gene, the agouti gene, which controls, well, these things, but also the hair color, the diabetes predisposition, and some of the cancers that are more prevalent in the yellow mice compared to the brown mice. And again, these are genetically identical mice. Now, there's this common theme that genetics loads the gun and environment pulls the trigger, so people can have the same genome and, depending on their lifestyle, things can go horribly wrong. Epigenetics provides the actual mechanism for that in many cases.
So the way that actually occurs is through epigenetic mechanisms that impact how the body develops and how the phenotypes come about. Now, when we say epigenetics, people mean different things. For some people, epigenetics is all about DNA methylation. I will try to point my mouse now, maybe not so easy. Here, right: you can have little methyl groups attached to your DNA cytosines, and that will impact transcription, because some of the methylated cytosines may physically block the transcriptional machinery. For some other people, epigenetics is all about histone marks. There are all kinds of histone marks, and you saw the ENCODE project before: it's mostly about ChIP-seq experiments, and a lot of that is histone marks of all kinds. There is histone methylation, dimethylation, trimethylation, acetylation, phosphorylation, ubiquitination, and all kinds of things. Histones are the proteins that make up the nucleosomes, and their little tails can be marked with various chemical tags. For other people, epigenetics is about nucleosome positioning. Nucleosomes are positioned dynamically, so they can move around a little bit. And bigger and newer studies are looking at more general, higher-level looping of DNA. These DNA conformation studies look for big blocks where pieces of DNA come in contact with each other: pieces that are linearly very distant may have a major impact upon each other, because in three dimensions the DNA loops and they become very close.
Now, in genetic studies, the ones we talked about yesterday, the mechanics are more or less understood, in the sense that you can have a genomic variant that physically stops the protein. You have a nonsense mutation, the protein is too short, and disease follows.
Or there could be a SNP that affects some gene down the line, and you can have linkage disequilibrium, and studies like that have their own mechanisms. Now, in epigenetics the mechanism is slightly different, but the theme could be similar. You can have some exposure acting on the promoter of a gene, for example, and the gene promoter could become heavily methylated, and that physically stops the transcriptional machinery from coming in contact and doing the transcription. So the gene could be silenced that way. Or, in the opposite case, you can have low methylation, so the transcriptional machinery activates and the gene gets transcribed. Now, environment affects those outcomes as well. Environmental agents can act either on the proteins that actually change the methylation or some other marks, or on the modulator proteins that in turn cascade into changing other proteins, and so on. So there can be quite complicated epigenetic machinery from the environment down the line, all the way to transcription and on to the phenotype.
And transcription can be affected by both genetic and epigenetic components. You can very well have a SNP that affects the methylation, and that is actually one of the good filters when looking for the significance of SNPs. Some SNP may not be obviously related to the mechanics of the gene that is at fault, but it acts through an intermediate layer of methylation. So the SNP can affect the methylation of a certain promoter, or it could be a distal enhancer, and then that enhancer in turn either activates or somehow represses the transcription of the gene. And again, environment comes into play as well, so you can have the same promoter acted upon by a SNP and also by environmental agents. So altogether I have to say that methylation is complicated. Epigenetics is complicated. Histone marks are very complicated. There are lots of them and they all do their various things.
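To make the idea of a SNP acting through methylation a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the data are simulated, the baseline (0.30), the per-allele shift (0.10), and the noise level are made-up numbers, and the function names are hypothetical. It simply checks, with an ordinary least-squares slope and a Pearson correlation, whether methylation at one promoter CpG tracks the number of allele copies. It is not the method used in any of the studies mentioned in this talk.

```python
import random


def simulate_snp_and_methylation(n_samples=200, shift_per_allele=0.10,
                                 noise_sd=0.05, seed=0):
    """Simulate one SNP and one promoter CpG (all values are made up).

    Genotype is the number of alternate-allele copies (0, 1 or 2).
    Methylation is a fraction in [0, 1]; each allele copy adds
    `shift_per_allele` on top of a baseline of 0.30, plus Gaussian noise.
    """
    rng = random.Random(seed)
    genotypes, methylation = [], []
    for _ in range(n_samples):
        g = rng.choice([0, 1, 2])
        m = 0.30 + shift_per_allele * g + rng.gauss(0.0, noise_sd)
        genotypes.append(g)
        methylation.append(min(1.0, max(0.0, m)))  # keep within [0, 1]
    return genotypes, methylation


def slope_and_correlation(x, y):
    """Least-squares slope of y on x and the Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


if __name__ == "__main__":
    genotypes, methylation = simulate_snp_and_methylation()
    slope, r = slope_and_correlation(genotypes, methylation)
    print(f"methylation change per allele copy: {slope:.3f}")
    print(f"Pearson correlation: {r:.3f}")
```

In a real analysis one would of course also adjust for covariates and for multiple testing across many SNP and CpG pairs; the sketch only shows the shape of the question being asked.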
So, not to say that genetics is simple, but epigenetics is more complex. It has more data types, it has more agents acting upon each other, and it's not entirely understood yet, which is why we are working on it.
Now, in terms of diseases, some diseases, famously cancer, are very much epigenetically driven, though not at the initial stages. You can have driver mutations that initiate cancers, but the progression of cancer is apparently, in many cases, epigenetically driven. Cancer typically kills through metastatic progression, so once you're at the stage where there are metastases, that's where things become very, very bad for the patient. And the development of metastatic cancer is primarily through epigenetic, well, not mutations necessarily, but some kind of variability. So things are going horribly wrong on the epigenetic side, even though you may not find the mutations that are responsible for progressing from stage three to stage four. The same thing applies to diseases that have a very strong environmental component, such as autoimmune diseases, various allergies, and also metabolic diseases. You can think of diabetes, you can think of other problems, body mass index, and so on. So if you're thinking of a disease that has an especially strong environmental component, or it's a kind of cancer, think about epigenetics. It should be impressed upon you that epigenetics plays a major role, especially in some of these diseases and especially in their later stages. It can also affect neurodegenerative diseases and neurodevelopmental diseases. Here is a paper that describes Alzheimer's and Parkinson's and Huntington's and amyotrophic lateral sclerosis.
Obviously there are physical manifestations, but if you look at epigenetic studies, they will also tell you that there is modified DNA methylation, and that there are modified agents that act upon DNA through epigenetic mechanisms, such as methyltransferases in the case of the last one, sclerosis. So there is usually complicated machinery acting upon different aspects of the methylome, and the changes can be either major or subtle. In cancer the changes are usually very major. In neurodevelopmental diseases like autism the differences in methylation can be minor and harder to find, and the diseases themselves can be quite heterogeneous, because different genes act upon similar pathways and so the mutations may come from different kinds of genes. In autism there would be over 100 genes acting upon the disease, but ultimately they all somehow end up, through some kind of funnel effect, resulting in various degrees of autism spectrum disorder.
Now, that said, I will progress to DNA methylation. Why methylation? For several reasons. Number one, methylation of DNA is a very stable mark. You can keep the same sample frozen for decades and the DNA methylation is preserved, so you can study it much later, compare cohorts, and so on and so forth. There are also logistical reasons behind it. There's a company, Illumina, which has created wonderful microarrays that are cheap enough and can scan, well, almost the full genome. So you can have a genome-wide methylation study of your patient cohort for cheap, something around $400 per sample. Here is a snapshot of the Gene Expression Omnibus resource from a couple of weeks ago, and these are the famous 450k methylation arrays. The reason they're called 450k is that they have roughly 450,000 probes; actually more than that, it's about 485,000.
They became vastly popular and, as you see, let me point with my mouse right here, there are over 1,000 studies using these particular arrays, for all sorts of diseases. You can see various kinds of cancers, there's esophageal carcinoma here, and there are interesting things like humanized mice, even, scanned on the same microarray. I've seen primates and gorillas scanned on a human DNA methylation array, but for the most part, I mean, here I chose Homo sapiens as the species. All sorts of diseases can be scanned on these arrays, and the number of samples also varies quite dramatically. Some studies would show you six, or two, or is it nine or five, I can't quite see from here, so just very few samples, and some studies are into hundreds or potentially even thousands of samples. So you can have a lot of samples scanned on the same technology. It's widely available, it's comparable across studies, and there are protocols that make things easy enough to compare to each other, so people have been using these arrays quite extensively.
There was a previous array, the 27k, which was small, and mostly the probes were in the promoter regions, but ever since the 450k array came out it became very, very popular. Now, in the last few years, it's been superseded by the next big thing, the EPIC array. It's double the size, with 850,000 probes or so, and the labs are switching to EPIC now. The number of studies already published is not so great yet: when I looked a couple of weeks ago and made the slides, there were only 49 studies available at the Gene Expression Omnibus. Of course they will come out, because it will take people a year or two or three to actually publish a study if it's a big one. So, not there yet, but growing. And the good thing about the EPIC and the 450k is that they overlap to a very large extent, so over 90% of the probes that were available on the 450 platform are
also available on the EPIC platform. So in principle they are very much comparable, and studies done on the 450k can be checked easily enough on the EPIC and vice versa: you can reduce your EPIC array to the previous 450k if you want to compare a study done on EPIC to a previous study done on the 450k. It is the same company, the same technology behind it, and a large overlap between the two sets of probes.
Now, for all sorts of reasons, which we can list, people are also doing sequencing studies. Here is a list of whole-genome bisulfite sequencing studies. Of course there are multiple advantages to sequencing. You have a much wider coverage: you're not stuck with only the 850,000 probes that Illumina defines for you, and you can potentially have millions and millions of cytosines scanned for their methylation status. It's useful for single-cell studies, useful for all kinds of things. Of course it's more complicated in terms of coverage. You have to worry about whether your cytosine in this one sample is well covered, and if you have a hundred samples you want to know whether that cytosine is covered in all of them, whether you have 10 reads or 30 reads or 50 reads available for that one cytosine across the spectrum of your samples. So if you are trying to overlap those things, the number of cytosines could go from millions to perhaps hundreds of thousands or even tens of thousands, depending on your coverage across the samples, which is why people switch to regions and the analysis of regions. Now, as Guillaume was showing you, the IHEC consortium is compiling primarily sequencing-based studies, so there is whole-genome bisulfite sequencing, let me point my mouse to it here: WGB-Seq, that is the bisulfite sequencing component for DNA methylation. So resources are available.
Now, different technologies of course require different methodologies, and the 450k array has been very popular, so there is a whole spectrum of pipelines and tools available to study it. So here's a paper from 2015
that lists the top five, and they're still fairly popular. The pipelines are in Bioconductor, in R, so those are the famous ones such as minfi and RnBeads. If you look at the colors as well, minfi, which is the kind of light brownish one, is available for all steps of your pipeline, almost up to interpretation. The only thing they're missing is that they don't do the copy number variation analysis. And of course no pipeline would do the biological interpretation for you. That's why we will never be replaced by artificial intelligence, for example: artificial intelligence could do everything up to the interpretation, perhaps. The same goes for some of these other pipelines. RnBeads would probably do most of your steps up to the interpretation and perhaps copy number variation. Some other pipelines are more specific to pre-processing, and they leave the juicy parts for other pipelines or for analysis that is done manually or through some other means.
I have to say that the things people look for in most cases are these two: calculation of differentially methylated positions and identification of differentially methylated regions. That is the juice of the analysis. Those are the pieces which will tell you which genes or which promoters or which enhancers are responsible for the differences between, let me say, a cancer cohort versus controls, or disease versus unaffected siblings, or something like that. So the essential outcome of the pipelines typically comes as a listing of your differentially methylated patterns of one kind or another. For arrays it's typical to report positions and then maybe regions. For sequencing studies the regions are more essential, because first of all you have enough cytosines to merge into regions, and secondly, coverage issues may be an obstacle for specific cytosines, but you will have enough coverage for a region, for a tiling window, for example. So the analysis
across regions and tiling windows becomes more of a driving force behind all these differentially methylated patterns.
All right, so as we go into data representations, what do we see, essentially? Regardless of whether it's an array study, a microarray study, or a sequencing study, we try to think of a table. In a typical table you would have your subjects or samples or tissues as your columns, and your genes or probes or cytosines as your rows. And, to the dismay of many data analysts, the number of features, which is to say probes, is vastly, vastly larger than the number of subjects that you're working with. The number of data samples is small and the number of features is huge, so you have to somehow reduce your dimensionality, work with highly dimensional spaces, and try to make sense of them in lower dimensions as you go. Based on this mock table, your two genes X and Y would define a space with axes X and Y, and depending on what methylation or some other measure you have, you would position your samples accordingly and then try to see what patterns exist across the samples.
Okay, so a typical question that comes up is the following: are you doing a supervised analysis or an unsupervised analysis? Again in a mock study, unsupervised would be the one where you are trying to find out the groupings among your samples. Perhaps you have one kind of cohort, or you're not quite sure which labels apply to which sample, so you're just looking for some clustering. In this case you'd see two clusters, and you're happy with them until you discover more data. Then you see that, first of all, your clusters were not ideally positioned, so you'll have to rethink the boundaries and recalculate them, and perhaps there's a new cluster at the top which was not evident initially because you just didn't have enough data. So there's a new cluster of samples there. In the unsupervised scenario that is all you can do, so
you're looking for grouping. In a supervised scenario you actually have labels. So you would be asking questions like: here's my disease sample, let me point to it at the bottom here, and there are my two controls or two normal samples, so where is my decision boundary? The essential question you want to ask is, how do I decide for a new sample whether it's normal or disease? If you place your new sample somewhere here, how do you decide whether it's closer to normal or closer to disease, and therefore what is the classification for your new cases? And again the same issues apply. If you collect more data, you might discover that you were previously missing two of the disease samples, like that, so they define perhaps another group of the disease. It could be a heterogeneous disease, so your decision boundary can become more complex than you thought initially, or you have to redraw it some other way, depending on which model you're actually using for the classification.
Right, so typically this would be an example of unsupervised classification. This is another approach which everybody loves: principal component analysis. Everybody knows what it is, right? Raise your hand if you don't, or if you have used it. You've used it, yeah, okay. So you're looking for the largest variation between your groups, or just generally within your data set. And this is an example; this is actually a figure from a study we published in 2013. It's very nice to draw a decision boundary between two clouds like this. On the left side you have a bunch of diseases shown in red, and on the right side you have a bunch of controls shown in green. So it's amazingly easy to say that there's a clear boundary in the first principal component, where things on the left are diseased and things on the right are controls. Now, the big secret: this was done after the coordinate space was
prepared in a special way. We first extracted only the most discriminating, the most telling, the most predictive probes. So we spent some time finding those differentially methylated probes, and then worked in the space defined by those probes, of which there were 53, down from whatever array that was; that was actually the 27k array. So from 27,000 dimensions you went down to 53 dimensions, in which you see this pattern fairly clearly, and now you can draw the decision boundary between them, and so on. The initial data set was something like this, based on the entire 27k array. Promising, but not as great, right? You can clearly see that there is a difference along the first principal component between diseases and controls, so diseases tend to be on the left and controls tend to be on the right, but the separation is not as clear and there are some borderline cases. So you have to spend a little bit more time to actually find a very good separation, on which we can build classification models or which we can use downstream in some other analyses.
All right, and there are some tools that do, well, almost entirely this. There is, for example, the Qlucore Omics Explorer tool, which you can get for yourself at qlucore.com, and it is a differentially methylated signature slash principal component analysis tool. I don't want to minimize what they do, of course; they have lots and lots of options and it's a very useful tool. But the essential pipeline of how they think about analysis is: take a data set, find differentially methylated regions, and as you keep shrinking them or making them more stringent, your PCA pattern becomes more and more apparent. So you start with a cloud that is not very well separated, but then, as you find differentially methylated positions, your yellow, red and orange, or blue points will start separating into proper clouds. So, a useful tool; people
love to use it.
Now, there are more complicated analyses. This is a paper that came out just in March in Nature. They used not PCA but something slightly more complex, t-SNE, which is stochastic neighbor embedding, so again a dimensionality reduction tool, just a different one, on 91 types of brain cancers all thrown together. The picture looks something like this, and you can now define your different clusters, and then based on these clusters you can build your special models and use them for classification. So, a good paper, similar theme.
Now, we of course tried differential methylation patterns in all kinds of scenarios. This is another paper, which we published in Nature Communications a couple of years ago, three years ago, where you can show patterns of differential methylation by standard methods, hierarchical clustering. Again you shrink your space from 450,000 probes down to, in this case, 7,000 probes or so, and in that newly defined space things are nicely separable. Your diseases, in this case Sotos syndrome, one of the neurodevelopmental diseases, are very nicely separated from controls. Not all probes define a good separation, but the ones we chose do, and that was easy to see, and clearly there is a pattern there. Now, the problem with hierarchical clustering: it's a wonderful tool, everybody loves using it, but if you add new cohorts, the clustering will change, right? Even if you add one new point to your clustering, the dendrogram may actually change its shape. So here we tried to do, well, I'm even ashamed to call this machine learning, it's too simple for that, but it's some kind of decision boundary. What we try to say is the following. We look at our disease cohort, we look at our controls, and we define two profiles, and then for every new case, whether it's a control or a disease or a similar disease or an unknown sample, we try to see whether it's closer to disease or
control. That's it. So we define some kind of correlation metric, and for every new point we place it according to whether it's closer to the disease profile, in which case it will be in the upper triangle of the diagonal plot here, or closer to the control profile, in which case it will be in the lower triangle of this diagonal plot. And it turns out that if you take over a thousand controls from the Gene Expression Omnibus and throw them on this plot, all of them very, very nicely go and cluster here. So there is a giant cloud of controls, all of them properly classified. If you throw in some other cases, or clinically similar syndromes, some of them will go to the proper position. Weaver syndrome, in orange, was clinically similar, and it was all classified as controls, so we can molecularly distinguish the two diseases. But if you have some unknown cases, or missense mutations, it depends. Here you would have the causative gene, nonsense mutations of which cause the disease, but for a missense mutation it is unclear. Some missense cases will go and cluster with the disease cases and some will cluster with the control cases, so you should then be able to tell which missense mutation is pathogenic and which is benign just using this tool, in addition, of course, to some other predictive mechanisms.
Now, all the cool kids now do machine learning and artificial intelligence, and we are not averse to that. We published another study recently, where we looked at two different clinically overlapping syndromes, CHARGE syndrome and Kabuki syndrome. They result in clinically similar but still different phenotypes. There is facial abnormality, there is intellectual disability, but the problem is that sometimes they are hard to tell apart, especially in younger children. Now, the molecular genetic side of things is very different. So in CHARGE
this is primarily from the CHD7 gene; in Kabuki it's from two genes, primarily KMT2D but also KDM6A. KMT2D is a histone methyltransferase gene and the other one is a histone demethylase gene. So all of them are some kind of histone-modifying or chromatin-modifying genes, but they are different genes. The actual genetic mutations are quite different, but apparently they end up working on the same pathways and the same chromatin-modifying machinery, which is why you have papers like this one, previously published in 2014. They say things like: we report a patient that was initially diagnosed as CHARGE, but eventually it turned out that it was a Kabuki case. And they say that these two causative genes apparently work on the same machinery, which is why it was hard to tell initially.
So what we ended up doing is we developed a DNA methylation based predictive system. In this case we did support vector machines, and tried to plant a forest, and a few other things were what we tried here. Basically we built, well, two classifiers, if you think about it, so we can assign a score for CHARGE syndrome and a score for Kabuki syndrome based on DNA methylomes alone and nothing else. And the good thing is that the cases were all classified correctly, and so were the replication cohorts and validation cohorts, and nothing was classified here in the upper right, which means no cases or controls or samples of unknown significance scored highly for both diseases, which means there was no ambiguity. So even though doctors may have trouble distinguishing the phenotypic manifestations of these clinically overlapping diseases, especially in small children under three, where things are not developed enough yet to tell clearly that yes, it's Kabuki or it's CHARGE, on the molecular side of things, so the DNA and other things, the mechanics are quite different, and we were able to build models
that are not ambiguous and that distinguish the diseases early enough. So that was good. We also threw in all kinds of GEO controls, hundreds and hundreds from publicly available data sets, and they were all classified as controls, so things were properly validated in that sense.
And we were also able to say things like this. For a gene like HOXA5 here, you have the background methylation level across a region; this is a promoter region, as you see, and HOXA5 goes that way. The background level is in green, and both syndromes are gaining methylation there. HOXA5 in CHARGE syndrome has a higher methylation level, shown here in red, and in Kabuki syndrome it also has a higher methylation level, shown in blue, which is perhaps why these two syndromes show similar phenotypes: they affect genes like HOXA5 in a similar fashion. But there are also genes like, which one is it, SLITRK5, where in one syndrome you have a loss of methylation, so there's a red line below the background, and in the other syndrome you have a gain of methylation, so there's a blue line for Kabuki that is above the background. So perhaps these are the genes that eventually distinguish these two syndromes at the molecular level. And again, you have a region where methylation is going up or down, as the case may be.
All right, so these are the high-level themes of how we go about finding what's going on. Now, there are things to be aware of, and I will mention two of them. One has to do with cell-type heterogeneity. If you think about it, an analogy is this: you have people in the room who are poor students, and then you replace them with rich professors or rich lawyers or whoever, and then you start measuring the wealth of the people in the room and you discover that people are getting richer. Well, that would be a wrong
conclusion. People are not getting richer; you're just replacing poor people with rich people. You kicked out the poor ones and replaced them with totally new ones, but the income level stayed exactly the same for all of them. The same thing may happen in these DNA methylation studies, where you would discover that there's a change in methylation and you say, oh, we have a gain in methylation. No, you didn't have a gain of methylation; you just replaced your cells with ones that had higher methylation to begin with.
So, a typical example, and this is again taken from a paper: you would have these two cell types, and let's pretend that the blue one has low methylation and the red one has higher methylation. Initially the higher methylation was dominant and prevalent, and then in the cases you suddenly observe that the level of methylation is lower. You would then start asking yourself: is it a consequence of the disease, so the disease causes the drop in methylation, or is it the opposite, that the drop in methylation causes the disease? Well, none of the above. All that happened is that different populations of cells walked in, and your methylation changes were only due to that. A lot of studies have to take that into account. There was a nice paper by two prominent researchers, Jaffe and Irizarry, Rafael Irizarry, who looked at changes in blood composition with age. They looked at various studies and decided that you really have to pay attention to the blood composition, because as people age, I'm pointing here, the amounts of the different subtypes of blood cells change a lot, so eventually you have nothing but granulocytes for the most part. So as you're studying your methylation levels, especially when comparing pediatric cohorts to older people, to adults, the apparent changes in methylation are not due to an actual gain or loss of methylation, but primarily due to the replacement of
different cell types. So this is one of the confounders that very much has to be accounted for. And there are studies that actually take a deeper look into the different cell subtypes. This is one of them, by Reinius et al., which looks at specific subtypes of blood cells. They found various patterns among them, and basically they decomposed the blood into different profiles, and these profiles can be used as references.
So the general theme of how we should understand methylation is this, let me point. There are initial cell types, think of them as purified cell types, and they have certain methylation levels, so here are the profiles of your favorite cell types. Then there are different people who have various combinations of those cell types; these are your patients, your controls, your cohorts. And the weighted combination of those cell types, where the weights depend on the person, produces the actual methylation that you then see through the microarrays or through a sequencing study or whatever the case may be, unless it's single-cell sequencing, where things are easier. So in general, or up until now, the methylation that we see is typically the result of some kind of weighted combination of profiles, and we need to keep track of what those are. These profiles can be based on existing references, or you have to discover them through deconvolution methods that have been produced, such as the one I'm quoting here, a reference-free deconvolution method from a recent enough paper. Now, there is an evaluation of all these deconvolution methods, so this paper studies them, if you want to take a look. Some methods are reference-based, as I mentioned: you first produce your methylation profiles, and then you see what your patients and cohorts have in terms of those references, so you essentially find the weights. And there are reference-free methods, of which there are several, so
There's quite a few of them, and the number was growing. The recommended one, apparently, is the surrogate variable analysis package, which seems to do well across the board. Some other ones are famous as well. Now, we found it easier sometimes to claim our results through some kind of shortcut. So you can do the full deconvolution and then use the proportions of your cells in your study, that's one way of doing things. Or, if the question is simply, is your disease pattern due to cell profiles or not, is it robust to changes in cell composition or not, you can do the following: you can take all your initial profiles and see if they cluster with controls or not. So here again, this is from the Nature Communications paper we published some time ago. We found a bunch of diseases, we found a bunch of controls, and on the PCA they look like this. So you take your principal component analysis, and diseases are nicely separable from controls. And then you throw in all these individual cell profiles and see if any of them end up with disease. If any of them end up with disease, you, or rather your critic, may claim that, oh, this disease pattern was entirely, or perhaps in part, due to that cell type being present in the disease population but not in the control population, so your study is confounded, the results are garbage, and go back and redo things. Well, in our case we were good. So our cell subtypes all cluster with controls, and of course any linear combinations of them will also be there with controls. So that's a presentable enough case to say that our disease patterns are actually not dependent on the cell heterogeneity or cell composition effects. And this is not only true for blood. Blood is a complex mixture of all kinds of cells, granulocytes, natural killer cells, all kinds of things, but so is saliva, so is buccal. Buccal is the cheek swabs that people use in all kinds of studies, so you can take a brush and swab
someone's cheek tissue, and that would be DNA for analysis. So again, cheek swabs and saliva have heterogeneous mixtures of cells, and you have to be aware of the effects of that.
Okay, so another important aspect, a technical aspect, of all these methylation studies, and not only them, is batch effects. So I'll just spend a little bit more time on this. Everybody okay? Who has worked with batch effects, or has heard of batch effects before? All right, not too many. All right, so typically batch effects is something like this: you have one technician working on Monday and another technician working next Tuesday, and they have different techniques and skills, and one of them spits into the sample just for good measure, and the results are different across the technicians, even though your data seems to be from the same distribution. So results are dependent on the technical facility, the skills of the person working with the samples, the country in which the samples were processed, the year and date on which it was done, and so on and so forth. So if you do your processing even in the same facility in January versus July, you may very well end up with different results. Not only that: if you spread your big cohort across multiple chips, like microarray chips, a chip can only hold so many samples. So the 450K array holds 12 samples on the chip, so if you have more than 12 samples, of course they will be spread across the chips, and different chips can be physically processed in a different way, right, they are handled differently, they are dried differently, and so on and so forth. So what you may discover could be differences due to the microarray chip processing rather than due to your actual disease versus control study. So these batch effects are prevalent across lots and lots of studies, and more and more people should be, not alarmed, but concerned. This is an important issue that should not be overlooked. So here's a paper from 2010 where
people looked at what was 1000 Genomes at the time. So they saw, well, let me see what it is. So across the bottom, the x-axis, this is a particular genome location, all right, and the y-axis, these are dates on which the chips were processed, so the days are numbered. And what you see here is the coverage, the genome coverage, and orange is high and blue is low. So for some reason there was a period of blues where coverage was lower, and then there was a period of highs where coverage was higher, and then periods of lows again. So things very much depended not on anything else but on the date of processing. So if you happen to have all your diseases processed on blue days and all your controls processed on orange days, then you will see some differences perhaps, but they will be due to different dates of processing, and maybe different people or different facilities or other things like that on which your analysis depends. And these confounding factors will potentially ruin your study, as we are about to see.
So normalization doesn't always help. People say, well, I normalized my data. Well, it's not good enough. So even if you normalize your entire data set, and again, a methylation data set can have, as I said, half a million probes or 800,000 probes, your total methylation profile may very well look well aligned and normalized, but you can have a hundred genes that behave like this, where on day one they're all low methylation and on day two they're all high methylation, or expression, or whatever it is you're studying, right? And so even though the overall picture genome-wide seems to be okay, because you did your quantile normalization and things like that, you would still find enough genes with a difference, and then you say, oh, these are differentially methylated or differentially expressed genes. Well, no, they were just processed in different ways, and the batch effect was not accounted for, and things should not be published.
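The point that a data set can look well normalized genome-wide while a subset of probes still carries a batch shift can be checked on simulated numbers. This is a hedged sketch: the probe counts, the size of the shift, and the helper `quantile_normalize` are all made up for illustration, and there is no biological difference anywhere in the simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_probes, n_per_batch = 2000, 6
batch = np.array([0] * n_per_batch + [1] * n_per_batch)  # day 1 vs day 2

# Simulated measurements with no biological signal at all.
data = rng.normal(0.0, 1.0, size=(n_probes, 2 * n_per_batch))

# Hypothetical batch effect: the first 100 probes read higher on day 2.
affected = np.arange(100)
data[np.ix_(affected, np.where(batch == 1)[0])] += 2.0

def quantile_normalize(x):
    """Force every column (sample) to share the same value distribution."""
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

norm = quantile_normalize(data)

# Genome-wide, the samples now look perfectly aligned...
sample_medians = np.median(norm, axis=0)
# ...but the affected probes still differ between the two days.
gap = norm[:, batch == 1].mean(axis=1) - norm[:, batch == 0].mean(axis=1)

print("spread of sample medians:", np.ptp(sample_medians))
print("mean day-2 minus day-1 gap, affected probes:", gap[affected].mean())
print("mean day-2 minus day-1 gap, other probes:", gap[100:].mean())
```

A per-probe test between the two days would flag the affected probes as "differential" even though the only thing that differs is the processing day.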
So, unfortunately. Now, there are various ways of correcting batches, from simple ones to more complex ones. The simple ones simply take the batches, find what the average levels were or something like that, and then try to equalize them and merge them, up to more complex ways. So the standard, the state of the art, is called empirical Bayes methods, and a typical example is the ComBat function. The ComBat function has been widely popular, and everybody says, we used ComBat, don't worry. So ComBat was used, it corrected for the batch effects, and so on. And it's a powerful enough function. It was initially published as a standalone R package in 2006, so this is the paper where people announced it, Johnson et al., and it was an R package, not even a package, it was like an R function, for many years, and then more recently it became a standard part of the surrogate variable analysis package, sva. Right, so now through the sva package, available through R and Bioconductor, you can just have the power of ComBat with you and you can use it, and we'll do that today as well.
Now, the main thing to remember is that bioinformatics cannot fix bad design. So if your study was poorly designed, a bioinformatician cannot necessarily fix it. It depends. So batch correction may not actually help, and this is a paper, again fairly recent, this is March of this year, where, well, the title almost says it all, where's my mouse, here: adjusting for batch effects in DNA methylation studies, a lesson learned. And that was a hard lesson. So as you can imagine, people did something, they realized things are not great, and then they had to redesign the whole analysis. So the problem here was they were looking at two variants and a reference, so references are, what color is that, yellow, let's say, and variants are in blue and orange, spread across several chips. So here seven different microarrays are shown. As I mentioned before, a
microarray can hold only 12 samples. The problem was that some of these chips, like this second one, only had one reference and two samples from that other variant, but no orange variants. Here, they only had the orange variants on chip number four, but no references and no blue variants, and so on and so forth. So the problem was that their groups were not equally represented in all batches. So a batch here was defined by a chip, let's say, and not all batches contained equal numbers of each group, and that became a big problem. So they discovered that, yes, they went through the motions, and after they applied the ComBat correction, which they thought would fix all their problems, suddenly they saw a giant spike in the results. So they had tons of differentially methylated regions, and instead of publishing them, they were very smart, they were actually quite alarmed and went back to study what was going on, and they discovered that all of that was fake. So you can artificially spike your results with false discoveries, and then hopefully you will catch it in time.
And another paper, this is a good one: methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. So it's a good statistical paper, it shows you the mechanics of what can happen. So here's the first panel, panel A, in which there are three groups, red, blue, and green, and let's just pretend that we confound group A with batch one, so batch one only contains this one group and batch two only contains these other groups, but nothing from that first group. And then for some technical reasons batch one may have slightly higher methylation, or slightly higher something, some kind of measure you're studying, gene expression maybe, than batch two. So for whatever reason you observe that batch one is different from batch two, and you try to correct for that. So you try to center them, and of course everything goes slightly wrong
because now batch one goes all the way down to make its center equal to the center of this whole batch two, right? And therefore your pink group, your red group, drops down below the blue group, for example. The true values were exactly the same, but now you see artificially introduced differences, simply due to the fact that your batch correction fails to correct. So the batch was bad enough, the correction was applied, and the correction overcorrected. So problems like that do exist, and we actually saw a number of our collaborators doing this. So they would confound their study because a technician is not aware of the dangers of batch effects, and they would place all their controls on chip after chip after chip, and then they go to their disease cohort and place them on the sequentially next group of chips. And this is a study: stratified randomization controls better for batch effects. So they did this, in which case the cases and controls were severely confounded with batch effects, with different chips. They found 94,000 differentially methylated positions at a five percent false discovery rate, so this was awesome, after ComBat correction. So ComBat correction was applied, this is a state-of-the-art batch-correcting method, and all of that was garbage, all of that was fake discoveries. So it was not five percent false discoveries, it was a hundred percent false discoveries. So zero remained after the chips were properly structured, after the design was improved, and after they randomized, or in this case did the checkered placement, right? So, the study of another group, same result. This is a different paper, in which you place all of one cohort in one batch and all of your other cohort in the other batch, and then you discover things, and then you realize that it was all due to placement in different batches and the unbalanced design of every batch. So, long story short, batches
should be populated with equal proportions of your groups. So make sure your chips are populated with equal proportions of your groups if you're doing microarrays, or your labs are structured in such a way that you do half of your experiments on Monday and half of your experiments on Tuesday, but it should be a mixture of your groups on Monday and also a mixture of your groups on Tuesday, so as not to discover that the difference is due to Monday versus Tuesday rather than disease versus control.
All right, and while we're at it, simply because this paper was available: they said there are two potential perils in cancer studies involving these things. So batch effect was number one, and they also discovered that what everybody loves, which is pathway analysis, can also go horribly wrong. You can actually discover a lot of interesting enrichments in totally random data. So they took random sets of probes and they discovered enrichment in cancer, enrichment in developmental diseases, enrichments in all kinds of ways. So be aware that some of the pathway analysis software may present issues. If you're not careful, you may discover enrichments in pathways that are just always there: cancer, development, and beyond.
All right, so I think this is more or less what I was trying to convey today. So I'll just say that there are luminaries in this world much better than I am, and you can listen to them. So there are people like Rafael Irizarry, who leads a lot of analysis and biostatistical package development, and they know a lot and they have seen a lot about batch effects and all kinds of issues like that related to DNA methylation, also expression and other types of bioinformatics analysis. So it's worth your time to go through a good selection of lectures available on YouTube, from the Broad Institute for example. This is just one of them, this is from 2016, but it's still there and still good, and for about an hour you can listen to a
prominent expert warning you about various pitfalls of your analysis. So it's worth being aware of all these things.
All right, so to summarize: again, as Giam told you in the previous lecture, in the previous module, epigenetics plays a key role in many diseases. So again, let me hammer down this idea that if you're looking at cancer, especially at the stages of progression, not the initiation but progression, it's more likely to be due to epigenetic disruptions than anything to do with genetic disruptions. There's an interplay, you will probably discover some mutations as well, sure, but it's unlikely for you to discover too many driver mutations for your later stages of cancer. So once it's already there, it's progressing through epigenetic dysregulation. Same with any disease that you may think has an environmental component: asthma, allergies, rheumatoid arthritis, things like that, psoriasis perhaps. Also any disease that has a metabolic component, anything to do with your diet, anything to do with diabetes, things of that nature, epigenetic effects are present. Also neurodevelopmental and neurodegenerative effects. The embryonic development is heavily affected by the epigenetic state in the womb, in the surrounding placenta, and so on and so forth. So there are a lot of factors in play beyond simply genetics. Genetics is also there, but epigenetics is very heavily in play.
Now, epigenetic machinery is complicated, as I mentioned. So again, not to simplify and make fun of geneticists, they do a wonderful job and there's a lot to do, but epigenetics just has so much more to offer. Think of the difference between coding and non-coding genes. So of course no one will want to throw away 98 percent of your genome because it's what used to be called junk DNA, right? It's not junk DNA anymore. So people used to think, there are genes and there is junk DNA, which is like 98 percent of your genome. So, like, raise your hand
if you want to throw away your junk DNA. I don't, right? I'd better keep it. And people are realizing in the last five to ten years or so that non-coding pieces of your DNA are very, very important to your transcriptional machinery and how things are actually progressing. In the same fashion you can think of epigenetics. Genetics is good, and then recently we're discovering more and more that diseases could be caused, and strange phenotypes could be caused, by variants in the non-coding part of the genome, non-coding regions and enhancers in some other distant locations, because DNA loops in a certain way, and physically close regions could be far away in terms of the genomic coordinates. Same thing with epigenetics: there's a lot going on, not everything is understood, far from it, and there are lots and lots of different agents acting upon the DNA. So there's DNA methylation, there are all the various histone modifications, of which there are too many, and the combinations are affecting transcription and eventually translation, and so on. Then there are all these issues related to transcription factor binding sites, so some binding sites could be physically blocked by all these marks and some other binding sites are open, and so the transcriptional machinery can or cannot access them. So there are too many of these agents acting, and for us, the bioinformaticians, that presents a kind of combinatorial challenge. There are too many things, meaning there are now multiple data types, multiple data formats, multiple sources from which these data pieces could be fished out, from ENCODE, from IHEC, from GEO, from all kinds of other places. They're all different formats, which means also you have to develop multiple pipelines, or think how to put all these multi-omic things together and make sure all the formats match and the methods apply and so on. And statistics is there as well, and of course the next big
thing, machine learning, everybody's doing it. So lots of things will probably be explosively developed in this area, simply because the area is complicated and lots of things are going on, and this will probably be enough for us to do for many decades to come.
All right, so on the technical level: those are nice motivational things, but beware of certain technicalities. So beware of confounders such as cell heterogeneity. As I mentioned, your results could be almost entirely, or to some extent, due to the fact that you forgot that the cells are not a uniform mixture, it could be a heterogeneous mixture, and marks such as DNA methylation act as an averaging agent. So unless you're doing a single-cell analysis, you have to be aware of the complexities of your actual DNA population. And also beware of batch effects. That applies well beyond epigenetics, you can think of batch effects as applicable to almost every area of quantitative studies in biochemistry and so on. So gene expression is not immune to batch effects, and of course DNA methylation, histone modification, and so on could be very much impacted by batches. And just beware of how you design your study, because if you spend your hundred thousand dollars on a study and then all of that is wrong, a bioinformatics student cannot fix it for you in two weeks. So you have to go back and redesign your whole study and pay the money again, unfortunately.
Okay, so on this very positive note, thank you very much. And right, so I'll just mention that I am from SickKids, I'm from the Centre for Computational Medicine. Michael Brudno, who presented yesterday, is the leader of the centre, and we have a big and nice team, Sergei sitting there, and also I work with the Rosanna Weksberg lab, maybe some of you know her, and some clinical geneticists at SickKids as well. So we do some interesting studies on developmental diseases and so on. Thank you.
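The warning about study design and batch correction can be made concrete with a small simulation. This is a sketch under stated assumptions: the correction is naive per-batch mean-centering, a simplified stand-in and not ComBat itself, and the group means, batch shift, and function names are all hypothetical. Groups A and B are truly identical; only the placement of groups into batches changes between the two runs.

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_MEAN = {"A": 0.0, "B": 0.0, "C": -2.0}  # A and B are truly identical
BATCH_SHIFT = {1: 1.0, 2: 0.0}               # batch 1 reads higher for technical reasons

def simulate(layout, n=200, noise=0.1):
    """layout maps batch id -> list of groups placed in that batch."""
    groups, batches, values = [], [], []
    for b, members in layout.items():
        for g in members:
            groups += [g] * n
            batches += [b] * n
            values.append(TRUE_MEAN[g] + BATCH_SHIFT[b] + rng.normal(0, noise, n))
    return np.array(groups), np.array(batches), np.concatenate(values)

def center_batches(batches, values):
    """Naive correction: subtract each batch's own mean."""
    out = values.copy()
    for b in np.unique(batches):
        out[batches == b] -= values[batches == b].mean()
    return out

def a_minus_b(layout):
    """Apparent A vs B difference after the naive batch correction."""
    groups, batches, values = simulate(layout)
    fixed = center_batches(batches, values)
    return fixed[groups == "A"].mean() - fixed[groups == "B"].mean()

# Confounded: group A alone in batch 1, groups B and C in batch 2.
confounded = a_minus_b({1: ["A"], 2: ["B", "C"]})
# Balanced: every batch holds equal proportions of all groups.
balanced = a_minus_b({1: ["A", "B", "C"], 2: ["A", "B", "C"]})

print(f"A - B after correction, confounded design: {confounded:+.2f}")
print(f"A - B after correction, balanced design:   {balanced:+.2f}")
```

In the confounded layout the correction itself creates an A versus B difference that does not exist in the true values, while in the balanced layout the same correction removes only the batch shift.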