Okay, then I think it's time to get started with our second presentation, which will be given by Dr. Gabriele Schweikert. Gabriele received her PhD at the Max Planck Institute in Tübingen, where she focused on developing machine learning techniques for computational gene finding. From 2011 to 2012, she was a research associate at the Wellcome Trust Centre for Cell Biology in Edinburgh, and from 2012 to 2015 she was a Marie Curie Research Fellow and an EMBO Fellow at the University of Edinburgh. Since 2018, she has led her own research group with a focus on computational epigenomics at the University of Dundee. And since 2019, she has also headed the Computational Epigenomics Research Group in Cyber Valley's Division of Computational Biology. Gabriele is one of the recipients of the very prestigious UKRI Future Leaders Fellowship. She will give a talk about machine learning for cancer epigenomics, and we're very excited to have you as a speaker. Please take it away from here.

As you notice, I'm probably a bit nervous. Usually, I think, I would say what a great pleasure it is to be with you and to present at the summer school, but everything is a little bit different. The truth is I was very excited when I got the email from Katharina to join the summer school, but I also have mixed feelings, and they have grown over the last couple of weeks, because when I got the schedule, I noticed that there were some really, really big beasts on that schedule. For you students, that's absolutely amazing. But I have to tell you, it has been quite terrifying for me in the last couple of weeks. So I was thinking about what to do about it, and I did what I usually do in these situations: I went back and read, and I tried to prepare myself as well as I could. But that didn't quite help this time either.
And eventually I came to a feeling which is very nicely depicted in this picture by Caspar David Friedrich, The Monk by the Sea. The problem is that I chose three topics which are vast, like very, very deep oceans. Epigenomics is a very big field. It's fast moving, and it seems bottomless, to be honest. Cancer biology has changed massively over the last decade or so; we have heard a lot about it already. Also a big, big area. And then machine learning, of course: very exciting, lots of developments. There's no way that one can master all these areas. So what do you do when you feel like a monk looking out at the sea? And this is, of course, even more the case now that we are in lockdown with COVID, when in many cases you can't go to the lab. But I was helped very much, I must say, by my team. I've been independent for two years now, and my team joined me a year ago. These are my three fantastic PhD students and my intern, and I want to say a big thank you to them first of all. I also want to say that science is a team effort, and I think the summer school is showing this even more. With the help of this team, I can plunge myself into the waves, enjoy the summer school very much, and see that we are working together to achieve great things and to learn from each other. I also have some big beasts in my corner, I think. We are obviously not working on our own. Because our work is very interdisciplinary, we of course have experts who support us, advise us and criticize us. And there are, again, more people from the summer school, from the network, who are making us stronger and better in what we are trying to do. With that, I think I can end my little prologue here and start with the talk. But I do want to change the title of the talk, I have realized.
So instead of saying I'm doing machine learning for cancer epigenomics, I think what I can offer you is to take you on a leisurely walk through epigenomic landscapes. This is where I've been the last couple of years, as a postdoc in Edinburgh and then continuing in my own lab. I think we are starting to have, occasionally, quite nice views onto other areas: neuroscience, initially developmental biology, and more and more tumorigenesis. So maybe some of the lessons that we are learning from this epigenomic walk are useful in this respect as well. To be honest, I'm doing a lot of bioinformatics. We collaborate a lot with biologists, as I've already said. At times we also use machine learning, and we are trying to develop some tools. But it's not always about machine learning; it's about the science, I guess. The other thing I want to mention is that a lot of the projects I'm going to talk about are in very early stages. They are work in progress; as I said, my students joined me a year ago. What I also think is quite interesting, and we have seen this in some of the talks already, is that sometimes when you read a very good paper, you have the feeling that the authors walked down a very straight alley, straight to the light. Neither as a PhD student nor as a postdoc did I ever feel like that, even in very good labs. So I think science, and biological science in particular, takes convoluted paths. There are also negative results sometimes. Sometimes you really don't know where you are going. This view of a straight path towards the light is very often a view from hindsight. And I think we have to allow ourselves to get lost as well, while keeping an eye on our target. So, we have already learned quite a lot. We had one talk, for example, on genome-wide association studies. So we have a lot of data available now.
Sequencing has become very cheap, in a sense, so there's lots of data available, and this has driven a lot of advances and discoveries. Genome-wide association studies, for example, look at individual diseases and try to find genetic changes or mutations, SNPs, single-nucleotide variants, that are associated with the disease. We've heard about that. We also heard that one of the targets is to go the opposite way, from genetic changes to predictions about a particular disease: outcomes, risks for diseases, drug responses and so on. Again, this is powered to a large extent by the huge amount of sequencing data that we are now seeing and by the advances in machine learning technologies. For example, in a paper by Palazzo, they used a large collection of somatic mutation data. They trained an autoencoder model to find a lower-dimensional representation of these somatic mutation data. Then they used kernel learning and hierarchical clustering to assess the quality of these mutation embeddings. And eventually they used the embeddings to classify tumor subtypes. That's quite interesting and useful. A lot of these data are obviously taken from population data, from patients, and that has, of course, a lot of advantages. It means that the disease is studied in the organism of interest: it's studied in humans, and it's studied in the disease context. And there's obviously a lot of data available in this respect. But it also has disadvantages, or problems associated with it. In many cases, cause and effect may be years, maybe decades, apart from each other. Somatic mutations may accumulate over a long period of time before they manifest themselves in, for example, cancer. And you also have a lot of inherent biases, which are actually not causal for your mechanism.
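The autoencoder step mentioned above, learning a lower-dimensional representation of a samples-by-genes mutation matrix, can be sketched roughly as follows. This is only a toy illustration and not the architecture or data of the paper referred to in the talk: the synthetic binary matrix, the single tanh layer, the two latent dimensions and the names `make_toy_mutations` and `train_autoencoder` are all assumptions made for the example.

```python
import numpy as np


def make_toy_mutations(n_per_group=20, n_genes=20, seed=0):
    """Synthetic binary samples-by-genes matrix with two made-up 'subtypes'."""
    rng = np.random.default_rng(seed)
    half = n_genes // 2
    # Subtype A is mostly mutated in the first half of the genes, subtype B in the second.
    probs_a = np.r_[np.full(half, 0.8), np.full(n_genes - half, 0.05)]
    probs_b = probs_a[::-1]
    group_a = rng.random((n_per_group, n_genes)) < probs_a
    group_b = rng.random((n_per_group, n_genes)) < probs_b
    return np.vstack([group_a, group_b]).astype(float)


def train_autoencoder(X, n_latent=2, lr=0.2, epochs=3000, seed=0):
    """Train a one-hidden-layer autoencoder with plain gradient descent.

    Returns an encode function (samples -> latent embedding) and the loss history.
    """
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, (n_genes, n_latent))
    b_enc = np.zeros(n_latent)
    W_dec = rng.normal(0.0, 0.1, (n_latent, n_genes))
    b_dec = np.zeros(n_genes)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W_enc + b_enc)   # encoder: low-dimensional embedding
        X_hat = H @ W_dec + b_dec        # linear decoder: reconstruction
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_H = g_out @ W_dec.T
        g_pre = g_H * (1.0 - H ** 2)     # derivative of tanh
        W_dec -= lr * (H.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)

    def encode(X_new):
        return np.tanh(X_new @ W_enc + b_enc)

    return encode, losses


if __name__ == "__main__":
    X = make_toy_mutations()
    encode, losses = train_autoencoder(X)
    Z = encode(X)  # one 2-dimensional embedding per sample
    print("embedding shape:", Z.shape)
    print("reconstruction loss, first vs. last epoch:", losses[0], losses[-1])
```

The embedding `Z` is what a downstream step such as clustering or subtype classification would consume; a real analysis would use a proper deep learning library, held-out data and far more samples and genes.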
So it's not trivial to disentangle the true causes that are responsible for the observations that we see. On the other hand, we can look at the molecular level, and that's more where I have been working over the last years. We can ask how cells actually work. So we can go from the genome to the cell level, to a cellular state. To be honest, I wish, and sometimes I cheat myself into believing, that I'm working on the cellular state. This is a beautiful image, a cryo-electron tomography of a human fibroblast. We see the cell nucleus in blue, the nuclear pore complex and a lot of proteins. It's really dense and beautiful. But actually, that's not where I'm operating most of the time. When I'm talking about the cellular state, I have to admit to myself that I'm simplifying even further. In many cases, I'm actually thinking about the transcriptome; I'm not thinking about the cellular phenotype, and I'm not thinking about proteins. So I'm just trying to connect what we see in the genes, in the genome, to the cellular state. And somewhere in between comes epigenomics. How do we get from the genome to a particular cellular state in the sense of a transcriptome? I think that's where epigenomics plays an important role. This state of the cell is, of course, still extremely complex. I've said I'm going down from an individual human to the cellular level, and still this is massively complex and complicated. We heard some of this in the previous talk as well. However, what I'm interested in when I work with biologists, and why this is a useful system, is that we can actually do interventions and controlled experiments and see what happens, in order to get an idea of what individual genes are actually doing, of their function.
So if we are looking at the molecular level, we can do very well controlled experiments, and we can observe the effects of the disturbance within days in the case of knockouts. We can also use fast degradation, and then we see the effects within hours. That can be very useful to identify mediators for certain diseases and so on, and to understand the mechanisms very well. The problem is, of course, that we can't do that in humans. So very often we have a model system, be it mouse or cell culture cells. And it's often not in the right context, the tissue context or the disease context; it's in cell cultures or sometimes in human organoids. These are the downsides to this approach. And again, I think in the previous talk we have already seen how we should go back and forth between these two approaches and how they can inform each other. So how does epigenomics fall in there? Again through GWAS-type studies, for example: in this paper by Bailey, a big paper in Cell, there was a comprehensive characterization of cancer driver genes and mutations. About 10,000 tumor samples were analyzed for 33 cancer types, and in this data set they found 299 cancer driver genes. These genes and mutations are actually shared across anatomical origins and cell types, and 57 percent of these tumors have potentially actionable oncogenic events. So this is quite interesting to know. It's also very interesting to see that if we look at what kind of genes are actually found to be drivers for cancer, many of them happen to be associated with epigenomic mechanisms. So here we have epigenetic DNA modifiers, chromatin (other), chromatin histone modifiers, histone modification and so on.
And what you can see is that they are red in many different cancer types. So epigenetic mechanisms seem to be disturbed across cancer types. Why is this the case? Here are some of the epigenetic mechanisms: they seem to alter the chromatin structure, which then leads to aberrant gene expression and therefore introduces problems in terms of differentiation, metabolism and the stemness of the cells. So here we see how gaining a deeper understanding of these epigenetic players can really help to better understand what comes out of these population studies and human data studies. What I want to do now is think about what epigenetics is. I want to give you a very high-level introduction to epigenetics, maybe too high-level, just because I really enjoy it, I must say; I still always enjoy talking about this aspect of cells. And then I'm going to dig a little deeper into three different parts of this epigenetic machinery: DNA methylation, chromatin remodelers and histone modifications. Right, so let's get started. As you all know, my genetic information is stored in an about two-meter-long DNA molecule in every single cell in my body. It's not only long, two meters is a long bit of molecule, but it also stores a huge amount of information: three billion letters. So it needs a lot of letters to encode what I am. Charles Darwin needed a lot less to write his Origin of Species; my DNA is about 4,000 copies of Charles Darwin's Origin of Species, which is quite remarkable. What is also remarkable is that you and I are made up of about 40 trillion cells. That's a whole universe; that's more than an order of magnitude more than the stars in our galaxy.
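The back-of-the-envelope figures above can be checked with a few lines of arithmetic. The genome size, number of book copies, cell count and DNA length are the numbers quoted in the talk; the star count for our galaxy is not from the talk and is taken here as an assumed range of roughly 100 to 400 billion.

```python
# Figures quoted in the talk.
GENOME_LETTERS = 3_000_000_000        # "three billion letters"
BOOK_COPIES = 4_000                   # "about 4,000 copies" of Origin of Species
CELLS_IN_BODY = 40_000_000_000_000    # "about 40 trillion cells"
DNA_PER_CELL_M = 2                    # "two meter long DNA molecule" per cell

# Assumption (not from the talk): a commonly quoted range for stars in the Milky Way.
STARS_LOW = 100_000_000_000
STARS_HIGH = 400_000_000_000

# Letters per book copy implied by the comparison.
letters_per_copy = GENOME_LETTERS // BOOK_COPIES

# Total DNA length if every cell carries the full two meters.
total_dna_m = CELLS_IN_BODY * DNA_PER_CELL_M

# How many cells there are per star, for both ends of the assumed range.
cells_per_star_min = CELLS_IN_BODY // STARS_HIGH
cells_per_star_max = CELLS_IN_BODY // STARS_LOW

print("letters per book copy:", letters_per_copy)
print("total DNA length in meters:", total_dna_m)
print("cells per star:", cells_per_star_min, "to", cells_per_star_max)
```

Under the assumed star count, the cell number exceeds the number of stars by about two orders of magnitude, which is consistent with the "more than an order of magnitude" in the talk.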
So lots of stars, lots of cells in our body, and each of these cells obviously contains this long DNA molecule, the two-meter DNA molecule, and still they are all very different. We have nerve cells, Purkinje cells, with a beautiful structure; we have heart muscle cells, with very different functions from nerve cells; we have T cells; we have hair cells in our inner ear that help us hear. So they are very specialized in their phenotype and in their function, and yet they all share the same DNA. On the other hand, I've already said that we can learn a lot by studying model organisms. When we look, for example, at a Purkinje cell of a mouse, it looks surprisingly similar to a human Purkinje cell. So in this case we have a similar phenotype but a different DNA. So what is it that gives the cell its phenotype and function? Well, it depends on the specific genetic programs that are actually read or executed by the cell. Every cell type obviously reads a different set of genes and the corresponding genetic programs. This brings me to epigenetic mechanisms. As you can see here on the right side, this is again an electron-tomographic rendering of a cell nucleus, and the DNA in the cell nucleus is incredibly densely packed. It doesn't look anything like the beautiful library that I've shown you in the earlier images; it looks a little bit more like under my daughters' bed. So it's quite a mess. How do you efficiently and quickly access the information that you need? How do you find it if you don't want to read the whole molecule in order to find individual words?
And so the cell has a couple of tricks up its sleeve. It can chemically modify the DNA, thereby changing its physical properties, and use these changes, well, I think of it, like bookmarks. When I think of DNA methylation, I think of it as putting a bit of sticky tape around your DNA molecule: even if you have it in a ball in your hand, you will easily find the piece where the sticky tape is, and you don't have to read the whole two meters. So that's the way I think about DNA methylation. But before I continue telling you about these individual mechanisms, I want to take a little break. Epigenetics is actually a term that has been used very differently over time, and there are lots of misconceptions involved in it. There was a PNAS paper a couple of years back, for example, that talked about epigenomics in a similar sense to the one I'm going to talk to you about now, saying that it has a lot to do with the regulation of gene expression. And Mark Ptashne disputed this concept and said that there is a core misconcept. In his paper he said: finally, to that dreaded word, epigenetic. And he says that histone modifications in particular are often called epigenetic, and one can only wonder why.
So these are the three things that I'm talking about: DNA methylation, chromatin remodeling and histone modifications, three physical changes to the DNA and the chromatin. They are commonly called epigenetic, but at the origin of the word epigenetics was maybe the epigenetic landscape, which was coined by Waddington in Edinburgh. He was thinking about development and differentiation before genes were even discovered, and he saw that cells move from the pluripotent state to a differentiated cell state in a similar fashion as a ball travels downhill in a landscape, and that the ball can commit to certain lineages based on the shape of this landscape. The ball has to travel downhill; he was assuming that this is a process that is not reversible. So you start out with stem cells, with pluripotent cells, you might have primed cells, and then you commit to a certain lineage and continue to differentiate to the end. If you follow a different route in this landscape, that leads to a different lineage. And very remarkably, he also understood that these developmental processes are not driven by changes in the DNA.
What was then a very remarkable experiment, which led to the Nobel Prize in 2012, is that you can actually force the ball to travel upwards again: you can reprogram fibroblasts to become pluripotent again, and all you need to do for this is force the expression of four very specific DNA-binding proteins. This was an experiment by Takahashi and Yamanaka in 2006. You need these four transcription factors to be expressed in the cell, and that forces the ball to travel uphill, to go back from a differentiated state to a pluripotent state. So really it seems that the transcription factors are doing all the work there is, and I think that's the misconception that Ptashne was considering. When you look at the definition of epigenetics by Waddington, he says that the process by which the genotype brings the phenotype into being, that is epigenetics. It was also said that it is the system that regulates the expression of the library of specificity, that is, the genetic material, which is meant to be DNA and RNA sequence. So it's really about regulation and control, and it seems that potentially the driving force behind that is transcription factors and not the three mechanisms that I was referring to just now, DNA methylation, histone modification and chromatin remodeling. So transcription factors are doing the heavy lifting. So are the three epigenetic mechanisms epigenetic in the Waddington sense? Do they cause gene expression? I think this is still, to some extent, an ongoing debate: how much are they cause and how much are they effect? There was also an interesting paper by Guido Sanguinetti from Edinburgh, where he showed that histone modifications, for example, can actually be predicted very well from the binding of transcription factors. So it's possible that transcription factors are causing both the expression of genes and histone modifications. And
then there's also a different definition of epigenetics, which puts not the control into the focus but the memory, or inheritance. This would be these two definitions, for example, where mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in the DNA sequence are what is epigenetic. So the idea would be that a gene is expressed, and because it is expressed, the cell remembers the state of active transcription and deposits an epigenetic mark to remember that this gene should be expressed. If the cell is to maintain its identity, it needs to remember which genes have to be expressed even in the absence of the initial signal, the transcription factor binding. This is a graphical illustration of this idea: epigenetic components act as barriers against accidental reprogramming. You start from a pluripotent cell, and you have certain precursor cells with very specific sets of transcription factors. The cells then have developmentally induced, specific epigenomic programs that prevent them from accidentally reprogramming into the old state. In the case of disease, for example in cancer, your epigenome might be changed, and therefore the cell can lose its identity or take on an alternative identity. So I guess the definition of epigenetics which I go with now is that all of these mechanisms are possible. Adrian Bird has defined it in the sense that epigenetics is the study of structural adaptations of chromosomal regions so as to register, signal or perpetuate altered activity states. So in this case, what is epigenetic could either cause a change in expression, it could register a state of expression, or it could perpetuate it throughout the
cell cycle to the next generation of cells. So this is the definition of epigenetics which I'm settling on. I think the precise causal structure depends very much on the context and on the locus and so on, and it's an ongoing study, which makes this so difficult and exciting. Right, so with that I'm now digging a little bit deeper into DNA methylation. This was the first epigenetic modification that I started to study, and it's potentially the best known one. What it means is that a methyl group can be added to the cytosine on your DNA strand. It so happens that cytosine methylation mainly occurs when a C is followed by a G, so we say in a CpG context. This is quite interesting, because it means the CG is symmetric: if you have CG on the forward strand, you also have CG on the reverse strand, so on the Watson and Crick strands. Therefore this offers a possibility for memory, for inheritance. When the cell divides and one strand is given to one daughter cell and the other is given to the other daughter cell, both get the information of the methylated CpG, and they can copy it. What is also interesting is that in the mammalian genome there's a certain DNA methylation landscape that is very commonly observable. In particular, we find that CGs are actually depleted in the bulk genome, so we find a lot fewer CGs than we would expect by chance, and it turns out that these CpGs in the bulk genome are very often methylated. However, there are certain regions in the genome which have a very high density of CpGs. They are called CpG islands, and it turns out that these CpG islands are very often unmethylated. I should have
said that this is a very common depiction of methylation: this would be the DNA strand, you have certain genes here, and then every CpG is shown as one of these lollipops. The methylated ones are the black ones, and the unmethylated ones are the white ones. What is also remarkable is that CpG islands are very often found to overlap with promoter regions, also with enhancers, but very often with promoters, and there's a correlation there between methylation and active gene transcription. When we look at whole-genome methylation data sets, we can find patterns like this. We have several genes down here, and these are the tracks for different samples: the top ones are blood, then we have brain samples, and then we have spleen samples. The yellow tracks are the methylation tracks, between zero and one. In this case it's not single-cell methylation but bulk methylation, so one means that all cells that were sequenced are methylated at a given CpG, and zero means that all of the cells are unmethylated. We find that there's a global pattern, I would say: most of these cell types have the bulk genome here, which is highly methylated, and then there's a region here where there's a lot more structure. It's generally unmethylated, but you have a fine structure here which is tissue specific. These methylation signals are read out by epigenetic readers; I will use that term again. One of the first readers for DNA methylation, actually the first epigenetic reader that was discovered, I think, by Adrian Bird in Edinburgh, was MeCP2, methyl-CpG-binding protein 2. It specifically recognizes and binds to methylated CpGs, and interestingly this gene is very
highly expressed in neurons, and it is linked to a disease called Rett syndrome. Adrian has been studying this disease for a very long time, and when I was in his lab I was also very much interested in MeCP2 and how it binds to the DNA and to methylated CpGs. Rett syndrome is an autism spectrum disorder which affects little girls; it's X-linked. The girls develop almost normally for the first couple of months, up to maybe six or nine months of age, and then they gradually lose abilities that they have already started to gain. In a severe case, if they have started to talk, they lose that ability; if they have started to walk, they lose some of this ability; and they get very bad seizures. Someone has described Rett syndrome like this: imagine the symptoms of autism, cerebral palsy, Parkinson's, epilepsy and anxiety disorders, all in one little girl. So it's a very complex disorder, but it has a surprisingly simple root: mutations in this one gene. However, despite us knowing that mutations in MeCP2 are the cause of this disease, it's not entirely clear what MeCP2 actually does in the neurons. We have looked at conditional MeCP2 knockouts in mouse neurons, and what we definitely learned is that the correct interpretation of the methylation readout is absolutely essential for neuronal operation after a certain developmental stage. Adrian has shown, again in engineered mouse models, that the defect is really amazingly reversible: if you engineer MeCP2 in such a way that you can turn it on again, then the mice have a normal survival chance and the symptoms actually go away. So it's not a degenerative disease, but it's really
important to be able to read out these methylation signals. We were having a hard time with this project to some extent, because we were trying, first of all, to characterize where precisely MeCP2 was binding on the genome. This is quite difficult because it's very different from the transcription factors. Transcription factors have well-defined motifs that they bind to, but methylated CpG is still very common in the genome, and it appears that MeCP2 is binding almost everywhere in the neurons except for CpG islands. We showed that it's not only binding to methylated CpG but also to methylated CAC. But as I said, it's difficult to characterize the binding behavior because it's quite universal. So, as with many of the epigenetic mechanisms, the specificity is still quite puzzling, and so is the function. The idea was that MeCP2 is a repressor, because it binds to methylated CpGs and methylated CpGs correlate with silenced transcription. But when we disturb the system, it turns out that the correlation with expression changes is actually not so great. So I think the last word has not been said in this respect. Of course, we are also seeing a lot of indirect effects: the cells are obviously adjusting to the loss of MeCP2, and therefore we see both upregulated and downregulated genes. But this is only a little detour to show how important methylation is in the brain and how it is read out. We are now actually even more interested in the writing of this mark. We have been talking about the reader, MeCP2, which is one of the readers; now we are interested in the writing. I've already said that the symmetry of the CGs provides a mechanism for cellular heritability, and we want to understand that a little bit better. It turns out that there are several enzymes that are
writers. There's DNMT3A and 3B, which are called de novo methyltransferases, and then there's DNMT1, which is the maintenance methyltransferase. DNMT1 recognizes hemimethylated CpGs, so CpGs which are methylated on only one strand, and adds the methylation on the other, so it's basically copying the methylation from one strand to the other. There's also active demethylation going on through TET enzymes, which actively remove methyl groups from the CpGs. This is again showing the function of DNMT1 behind the replication fork. When the DNA is replicated during cell division, you can see here the parent strand; the parent strand has the methyl group in both daughter cells in this case, and the newly synthesized strands are unmethylated. That creates hemimethylated CpGs, which are recognized by DNMT1. DNMT1 is recruited to the replication fork, and this then creates fully methylated CpGs, so CpGs that are methylated on both strands. This provides a mechanism for the cell to remember the methylation state even when it's proliferating, when it's dividing. So this is the DNA methylation cycle. We have unmethylated CpGs at the top. We have the de novo methyltransferases DNMT3A and 3B, which can establish hemimethylated CpGs. Then DNMT1 copies the methylation state from one strand to the other, creating fully methylated CpGs. There are a number of enzymes, the TET enzymes, which can turn the methylated CpGs back to the unmethylated form through some intermediate states. And in addition we have passive loss of methylation through cell division: every time a cell divides, the newly synthesized strand is unmethylated in the beginning, so that's also a way of creating unmethylated CpGs. So if you have unmethylated
CpGs and the cell divides, then you will have two unmethylated CpGs. If you have fully methylated CpGs, then you will create two hemimethylated CpGs, and you need DNMT1 to get back to the fully methylated state. So DNMT1 is really what provides an epigenomic memory to maintain the cell identity, and plasticity is provided to these patterns by enzymes like DNMT3A and 3B and the TET enzymes, which allow dynamic changes during differentiation. So now the question for us was: what happens if we knock out DNMT1 and interrupt this epigenomic memory system? Again, we do that because we have already seen what an important role methylation and MeCP2 play in the mouse brain. We have been doing that with a conditional knockout in embryonic mouse brain, and this is done by Sabine Lagger. We both worked in Adrian's lab before: she was a postdoc working on the wet benches, and I was the computational biologist. She has her own group now at the Vetmed in Vienna, and I've got my group in Tübingen and Dundee, and we continue to work together and enjoy that greatly. In this case, Sabine has created a knockout strategy where DNMT1 is deleted specifically in neurons in embryonic brains. This model is quite exciting because her mice actually survive for quite some time postnatally, and this allows us to study this very crucial time after birth. There are other mice where people have equally aimed to understand the function of DNMT1, and usually these mice are not viable; they die at birth. What we are seeing here is the DNMT1 protein: at embryonic day 16 there's not very much DNMT1 expressed in the wild type, and it's not there in the knockout, but at birth there's a lot of DNMT1 in the wild
type, in the knockout none, as expected, and then it goes down in the wild type again up to day 13 post birth. When we look at the phenotype of the mice, we see that at birth there's practically no phenotype; however, the knockouts die within 15 days. It appears that their development is practically interrupted; in particular, their brain doesn't seem to fully differentiate. It stays in an almost birth-like state, and while the wild-type mouse grows and the brain fully develops, this is not happening in the knockout.
When we look at the methylation patterns, we see indeed that in the knockout we have globally lost a lot of methylation. When we look at individual CpGs and compare individual mice, we see that the methylation levels are very highly correlated, and this is whole-brain data actually, so it's interesting that the state of the methylation is really well correlated between individuals. In the knockouts we observe indeed that we don't see any fully methylated CpGs, because DNMT1 is obviously missing, and the correlation is to some extent lost. However, what is interesting is that when we look at triplets of CpGs which are very close together, so neighboring CpGs, there's a large correlation between these sites, and this correlation is actually conserved between wild type and knockout. So some functions of the DNA methylation pattern don't seem to be destroyed totally through the DNMT1 knockout, but there are clear deficiencies.
Now, we are also interested in how the transcriptome is changing, so how this cellular state is changing upon this intervention. Again, unfortunately we only have bulk data and not single-cell data, but luckily, in recent times, single-cell data for mouse brain has
become available, and we can use this data to deconvolute our bulk analysis. We obviously have a number of different cell types in our mixture, and we have only measured the average gene expression of the mixture. But thanks to this huge collection from 2018 of single-cell profiling of the developing mouse brain, we can use the transcriptomes of 73 clusters, cell types, to make sense of our bulk data, and we are using a deconvolution strategy for that. These are the neural subtypes which were all found in the single-cell data, and we are using non-negative matrix factorization to take advantage of the single-cell data set in order to make sense of our bulk RNA-seq data set. The idea is that the bulk RNA-seq data is a mixture of the individual cell types, the proportion of each cell type is unknown, and we are trying to estimate these unknown proportions. We can also use ensemble methods, if we have several reference data sets, in order to improve our estimation of the individual cell types. These tools have been suggested by Wang et al. and by Dong et al.
When we tried to apply them, we saw that there are quite some challenges, because the new single-cell technology allows for way more nuclei, and these deconvolution methods were initially not readily capable of dealing with so many nuclei. Also, the deconvolution methods have been built with only relatively small data sets and tested with five clusters; here we have 73, and we also don't have a fully matching data set. So there was quite a lot of work by my student Chris, who was looking into that. I think one of the biggest challenges was the selection of the marker genes, so the selection of marker
genes for the individual cell types, that has a huge effect on the results of the deconvolution. But I think he did a good job and he managed that. We are now looking in particular at oligodendrocytes, which are one particular cell type in the central nervous system. They produce the myelin, which is the insulating layer that forms around nerves, and this allows the electrical impulses to transmit quickly and efficiently along the nerve cells. The oligodendrocyte differentiation goes through several steps, from neural stem cells to oligodendrocyte precursor cells, committed oligodendrocyte precursors and so on, up to the mature oligodendrocytes, and some of this differentiation happens postnatally.
So Chris has now re-embedded the single-cell data and then used that to understand our own bulk data. Here we start with data simulated out of the single-cell data at postnatal day 2 and 11, and what you can see is that at the later stage, at day 11, the further differentiated cell types are more highly represented than at P2. Our own data is taken at P5, and we are showing here the wild-type data. It all sits pretty nicely in between the P2 and P11 data, suggesting that we are getting the proportions quite right, and the replicates are also fairly similar. But if we look at the DNMT1 knockout, we see that the increase of the differentiated cell types is not happening as it should at day 5. So it seems that we have a differentiation problem, and we can now identify the exact stage where this happens: from the cell type which is identified here as 58 towards 56, there's clearly a problem in the differentiation. And we can also look at individual genes
which are downregulated in the knockout and see where they are expressed in the pseudotime of the single-cell data, and that's quite interesting for us. I'm not going to go further into detail, but what I find quite interesting from a high-level understanding is that if we disturb this memory system, the DNMT1, then we actually see problems in differentiation. It means that we need to remember the state that we are coming from, we need to integrate the changes, in order to become something new, and that is quite interesting, I think.
So here are some very high-level lessons from this experiment. The first one is that DNA methylation is a composite signal. There are several writers that establish it, and I think there are also different scales on which it is read. We have already seen this plot of different tissues, and we see that at certain scales these tissues are very similar: the bulk genome seems to be similar, and then this area of high CpG density is a bit different. So the landscape of the bulk genome, which is fully methylated, and the CpG islands, which are unmethylated, which depends largely on the density of CpGs, is one level, one part of the signal. Then we have a fine structure, which needs to be cell-type specific and dynamic. I think when we want to integrate the DNA methylation, we have to remember that it is established in several processes, with competing enzymes writing and erasing it, taking information from different sources: for example from the parent strand, or from transcription factors which might recruit DNMT1 or DNMT3 for de novo methylation, and also from the density of CpGs. So it's a composite signal and it's probably
interpreted on different levels. I think it's also interesting to see that the phenotype really hits after birth. This is a moment when the brain really has to start integrating a lot of information: it has to start processing gravity, it has to start breathing. So there's a lot of information coming into the brain, and this information needs to be processed, and there needs to be a mediator to integrate the external stimuli. There's a lot of evidence that DNA methylation could be one of these mediators, and that's why DNA methylation plays such an important role also in learning and in memory formation. It's occurring again in old age, in neurodegenerative diseases, in Alzheimer's for example. So I think DNA methylation might give the cell the chance to process external information, and, as I said I would give you a view on the cancer landscape, I think this also makes the cell vulnerable to changes from the outside world, which are then also integrated and somewhat stored in their methylation patterns. I think what we're also seeing is that the cells need a memory, and DNMT1 can provide one of these memory functions, and they need the memory not only to stay the same but also to become something different: you need to be aware of where you are coming from and where you are going in order to achieve that. As I said, those were high-level ideas, lessons that I thought interesting about this project, and which I think could help to understand DNMT1 also in other contexts.
So now I'm also working with Gerda Egger, who happens to be Sabine's sister, and she's also working on DNMT1, in this case in NPM-ALK-positive anaplastic large cell lymphoma. We have a similar setup: we are looking at a rare and very aggressive non-Hodgkin lymphoma of T-cell origin. It is driven by
constitutive activation of the oncogenic anaplastic lymphoma kinase, ALK. Just by introducing this gene, ALK, we can deterministically create a model of this lymphoma: NPM-ALK transgenic mice develop tumors and they die at the age of 15 to 13 weeks. What is interesting in these NPM-ALK transgenic mice is that they seem to express DNMT1 to a high extent. Here's a control cell line, and you see that DNMT1 is not very strongly expressed, but in the ALK cells DNMT1 becomes expressed. What Gerda has now done is conditionally delete DNMT1 in T cells in the background of the NPM-ALK transgenic mice, and what she found, amazingly, is that this conditional deletion of DNMT1 impairs the tumor formation. Here you see the survival times. For the control mice and for the DNMT1 knockout: if you only knock out DNMT1, it doesn't hurt these mice very much in this case. If we look at the oncogene, ALK, the mice die after less than about 30 days. And when there's a double mutant, the ALK knockout, this one, the purple line, it's back to the control. You can see that here as well: we have the thymus of a control mouse and of the DNMT1 knockout mouse, where nothing much happens; then of the ALK mouse, so this is a tumor formation; and in the ALK knockout it's almost back to the control.
Again we are studying the DNA methylation, in this case with reduced representation bisulfite sequencing. We have wild-type mice, we have tumor mice, we have three knockout mice, and then we have the double mutant. The biggest phenotype we see is in the tumor samples, but overall the tumor samples have not lost an awful lot of DNA methylation. The double mutants, on the other hand, have lost a lot of methylation, more than if you just knock out DNMT1, where it is not so much. So it seems that,
so, the strong phenotype in the tumor samples does not come with much overall loss of DNA methylation, and there seems to be a bigger effect in the ALK knockout. If we look at differentially methylated CpGs, we find again that the ALK knockout has the highest number of hypermethylated CpGs. There are both hypo- and hypermethylated CpGs in the tumor, and there's not so much happening in the knockout. I think one thing that we have to keep in mind here is that the T cells are not proliferating so much anymore, so they might not need DNMT1 for that reason. The tumor cells start proliferating a lot more again, so they see a lot of changes, and the changes are propagated. And then in the ALK knockout we also see changes, but we don't see the tumor formation. What we are seeing here is that the changes to the CpGs are most dramatic actually in the tumor: there might not be so many CpGs that lose methylation, but when they lose methylation, they lose a lot of it, so they go from 100 percent methylated to zero.
I'm looking again at the correlation between neighboring sites. In this case the coverage is not quite as good, but we see again that there's a high correlation between neighboring sites, which in the tumor cells now is somehow broken. So the DNMT1 knockout doesn't seem to affect this cooperativity between neighboring sites, but the tumor does, and in the double mutant, again, the correlation seems to be maintained. This brings us to a typical hallmark of cancer, which is that this methylation landscape that I've been describing earlier, where we have methylated bulk genomes and unmethylated CpG islands, is kind of destroyed. We see now that there are hypermethylated CpG islands and
the bulk genome is losing some of the methylation, so the contrast between these functional elements, the CGIs, and the rest of the genome is kind of destroyed. But this is not a function of DNMT1; it's rather a function of the other enzymes, DNMT3A and B and the TET enzymes.
Again we are also trying to look at the transcriptome states, and again we were lucky. We only had bulk data at the time, but there's a cell atlas of human thymic development that defines the T-cell repertoire. Here we have additional challenges, or Chris had these challenges, I must say, the student, because in the tumor sample we of course find tumor cells that are different and heterogeneous compared to the reference data set that we are using for the deconvolution, and that is quite tricky, I think, to quantify and to make sense of. But we are now quite positive that the tumor cells are originating from relatively early cells, the quiescent double-negative cells. And if we deconvolute the bulk data with different single-cell data sets from different developmental stages, we find again that the tumor seems to originate from a relatively early time point, and not, as we would expect, from the later time point where we actually looked at it.
So again, what are the global lessons of this project, of the DNMT1 knockout in this large cell lymphoma? What we found is that the expression of a single oncogene is able to induce large alterations in the DNA methylation landscape during tumorigenesis. ALK-positive tumors lose their ability for collaborative DNA methylation, so the correlation between nearby CpG sites seems to be lost, and therefore the general landscape is destroyed. So if we destroy,
if we delete DNMT1 in this genetic background, tumorigenesis is abrogated despite the activity of the oncogene ALK. From gene expression and epigenomic data we also find that the induction of MYC is necessary to induce the tumor reprogramming, but the complete reprogramming requires an epigenomic reprogramming mediated by DNMT1. So if we can't do the epigenomic reprogramming, then we can't change the transcriptome towards the tumor transcriptome as we observe it with the oncogene. We also seem to be able to show that the origin of ALK lymphoma might be a small subset of immature T cells, but there's still a little bit of work to be done.
So this was the biggest bit, I think, about DNA methylation. I'm going very briefly over other epigenetic mechanisms. One of them is chromatin remodellers. What you are seeing here is that the DNA doesn't only carry methylation, but it's also wrapped around these proteins called histones, which organize the packaging of the DNA in the cell nucleus. There are so-called chromatin remodellers which can eject dimers from these proteins, they can eject complete nucleosomes, and they can also lead to a sliding of the nucleosomes. What you can imagine is that transcription factor binding sites, for example, that are in this bit of the DNA where the nucleosome sits become more or less accessible through the action of these ATP-dependent chromatin remodelling complexes. Interestingly, these chromatin remodelling complexes play an enormous role in many different cancer types, and we are just at the beginning of understanding their importance. In humans there are actually three remodelling complexes, and, as you can see, they
consist of many subunit proteins. For example ARID1A: it's one of the genes that pops up very, very often in all kinds of different cancer types. It is interesting that these different components all have different specificities for tissue-specific cancers and neurological disorders. So again, what we are seeing is that DNA methylation and chromatin remodelling seem to be working across the genome, everywhere, but still there is some specificity built in, and we are only starting to understand where the specificity is coming from.
This is another collaboration, with a colleague from Dundee, Professor Tom Owen-Hughes, and he's looking at ARID1A. In this case it's done in ES cell lines, and what is quite exciting is that it's not a knockout experiment but rather a very fast degradation of ARID1A. Therefore what we can observe are the induced changes over time, which then help us to distinguish primary targets from secondary and indirect effects. So these are the pathways for cancer, and you can now actually see how, over time, the changes are propagating through these different cells. Another student of mine, Tanmaji, is studying that, and we are having help as well from Dominik Janzing, who's in Tübingen and working for Amazon Research. Here we want to understand time-course data and we want to infer causality: we want to distinguish between direct and indirect effects using these time-course data sets. Currently we are working with Granger causality. We are using a tool called BETS that has already implemented Granger causality for time-course data; I think it's implementing a stability selection criterion to come up with not too many false positives. But I think at the current state this is still a very challenging project
mainly because we have very few time points and very few replicates, something that is very common in biology, of course. So we are thinking very hard at the moment about how to move from causal discovery to including knowledge that we have already got, and to direct the discovery in this way. But our initial results seem to at least pick up some of the pathways, some of the genes that are also parental genes, and in the right order in known pathways. As I said, this is very much ongoing work. There are a lot of open questions, I think, with regard to chromatin remodellers. It's very unclear how the cell-type-specific specificity is achieved by individual components, and it's also not clear how changes can be propagated during the cell cycle, and if they can be propagated, so is there a memory involved? This would be quite important for an epigenetic mechanism as well.
I'm very quickly moving on to the last of the epigenetic mechanisms, namely histone modifications. Not only can these nucleosomes be positioned at certain places, but they can also be decorated with chemical modifications, and again we have epigenetic writers that establish certain modifications. In this case it's H3K4 trimethylation. That just means that, in these histones, which are octamers, histone H3 has a lysine at position four, and this is decorated with a trimethyl group. In humans there are actually six writers which all write this one epigenetic mark. There's a number of different epigenetic readers that specifically recognize these histone modifications, and H3K4 trimethylation is only one of several: you also have, for example, H3K27 methylation, which has a number of different writers which establish it. And we can measure all of these histone modifications in, well, it's difficult to measure them all in the same cell, but
we can do that, of course, in bulk data. So this would be, in this case, different genes in the yeast genome, and each of these tracks is one histone modification. What we are finding is that some histone modifications are correlated with silencing while others are correlated with active gene expression. Again we observe that there are dynamic changes of these epigenomic patterns: during development and differentiation, and during memory formation and consolidation. We also see that there are abnormal changes of epigenomic patterns, for example during tumorigenesis or during aging.
So there are a number of challenges where machine learning comes into play, I would say, for histone modifications. One of them is of course that the data is extremely high dimensional. It's not independent: you can see that some of the histone modifications down here look very similar, so it's ripe for manifold detection, dimensionality reduction methods and so on. The data is actually really sparse: we have lots of data, but each cell has a different epigenome. Certainly every cell type is expected to have a different epigenome, but potentially also every cell. So there has been an imputation challenge by the ENCODE consortium. The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, but performing all of these assays is really expensive, and in some cell types it's really challenging. So there is potential for computational methods that are capable of predicting the outcomes of the assays, and therefore the ENCODE Imputation Challenge was initiated to compare methods for imputing data. I think I've run a bit out of time, so I'm not going to say very much about that, but Avocado has provided a multi-scale deep tensor factorization
method. There, basically, we have factors along the assay dimension, the cell type dimension and also a genomic position dimension. These embeddings are learned simultaneously, and then a neural network is used to combine them and form a final prediction for a given new, unobserved combination of cell type, assay and genomic position. Alex has worked significantly on a new method, and he has done particularly well in this challenge and was the second runner-up in the preliminary results; I think he's now number three or so, but he did really well in this.
I want to say again that we have to consider that this is based on correlation, and eventually I think we want to move further towards causality. What we are seeing is that if we remove parts of the histone modifications, then sometimes we don't actually see an effect on gene expression. So we don't quite understand yet how these histone modifications are interpreted. For a long time there has been in the room the idea of a histone code: are histone modifications instructions for gene expression? You have writers and readers, and they are correlated with expression states. But then of course a code could also be a system of symbols that represents a message or records information, like this code in a library which records who has actually read the book. In this case the expression would cause the writing rather than the other way around. And then of course there's the possibility, which I think is the most likely, that it is both instructing gene expression and recording previous transcription, and therefore we need to understand the local causality structure better. Again, it's hugely important for all kinds of diseases. If you're only looking at a single
mark, H3K4 trimethylation, and we look at the protein family of the writers, we find that they are involved in a huge array of neurodevelopmental syndromes, and they are also all involved in different cancer types. So understanding what these epigenomic patterns mean is hugely important. I think this is the bigger idea: the identification of a local causality structure.
With that I want to come to a close. I will give you the very last definition of epigenetics. Adrian Bird in 2007 said: epigenetics is a useful word if you don't know what's going on; if you do, use something else. So I think that's quite useful to know. It's useful that epigenetics is still around: it means that there are a lot of things that we don't know and a lot of room for you, potentially, to discover. I want to go back to the beginning of my talk and also reflect on what I have learned through my research, maybe for life as a scientist as well. Of course it's okay to feel out of place; even the big beasts do, I know that. I think it's important that we work as a team, that we ask for help, that we help each other out. I think it's crucial to know the past, because that allows us to change. I think we have to allow ourselves to be puzzled. It feels very uncomfortable sometimes; we want to know, and we want to know the causes, but I think this feeling of puzzlement is very important, and it's where the discoveries come from; that's what Adrian's definition said. I think at the end of the day we just need to swim; we need to plunge ourselves into the water, and we also need to be thoughtful of the bigger picture, where our science is changing and interacting with society, I guess. With that I want to thank foremost my team, my little team, Tanmaji, Giovanni and Alex, and I want to thank my collaborators, most of whom I have mentioned, and I want to
also appreciate the funding that helps to do this research. And that's me.
Great, thanks a lot, Gabriele, for this very interesting talk. We already have one question from Christel. Christel, please.
Hi, excellent talk, I've learned a lot. I'm not so much into epigenomics, but if you think of it in terms of Waddington, so as processes by which the genotype brings the phenotype into being, I was just wondering: have these processes themselves changed from an evolutionary perspective as well?
Yes, they have, massively, and I think that's another very vast area. Methylation, for example, is not observed in C. elegans or in Drosophila, if I'm not mistaken. Methylation in plants is completely different. I'm calling epigenetics, but that's kind of for myself, to sort things, the machinery, the readers and writers, and I'm calling epigenomics the patterns. I think a lot of the epigenetic machinery, the proteins, is conserved between mouse and humans, for example, so that's why it makes sense to use these model organisms. For the histone modifications, the KMT2 family for example, there are only three in Drosophila, so they have doubled. So I think there's a lot of evolution going on. Also, if you look at the epigenomic patterns, promoters for example seem to have very similar epigenomic patterns, but enhancers seem to be not so well shared across organisms.
Because I was also trying to put it in the context of population genetics. I'm primarily looking at genetics in relationship to transcriptomics etc., where of course population structure is a very important aspect, and I was just wondering where this epigenetics or epigenomics comes in.
So do you also see population differences? I mean, when we stick to one organism, humans, is there an evolutionary aspect?
I think absolutely, but it depends on the time scale again. Part of the function of the epigenomic system is that it's dynamic, that it's different. I haven't talked about it, but for example we can see that methylation is changing with age, so within the same individual it's changing over time. And if part of its function is to integrate external stimuli, then you would expect that a lot of it is changing. But I think again we have to remember that it's a composite signal. So, okay, I'm not sure I can answer that correctly, but I think parts are determined by developmental processes and parts are determined by integrating additional signals.
Thank you. It's going to be very complex, right, to take all that into account. It's like two different time frames that you need to take into account in the end, in the analysis.
Yes, I think it's even more than that, because I suppose that transcription factors could work on a completely different timescale than histone modifications, and histone modifications work on a different timescale from DNA methylation, and they are interacting. So if you're recruiting histone modification writers, you would also potentially change the methylation, and vice versa. So it's potentially a way to write from a short-term memory into a long-term memory.
Thank you very much. There are some questions on Slido; I'm just going to read them out. One is: does methylation in promoters silence gene expression, or is the lack of gene expression allowing for more methylation?
I think there are examples for both. I think that we tend to want global
answers, and I think that this is kind of problematic. We can find enough evidence that we can actually have epigenetic switches, where we can switch on transcription through changes to the epigenomic patterns, but we also see cases where changes of epigenomic patterns don't have an effect on transcription. So I think it is locus specific.
Thank you. And context specific, right? The second question is: are CpG methylation patterns observed in the DNMT1 knockouts at birth inherited? Do they die sooner because of the impossibility to maintain them?
Can you repeat the question?
Yep. Are CpG methylation patterns observed in the DNMT1 knockouts at birth inherited? Do they die sooner because of the impossibility to maintain them?
Okay, so we are lacking a lot of data, unfortunately. Ideally we would have to look at the DNA methylation in the DNMT1 knockout at different time points: right after birth, then after a couple of days, and then shortly before death. There are some ethical considerations, because all these are animal experiments, so we cannot do all these experiments, unfortunately. We have so far only observed DNA methylation at day 12, I think, after birth, so I cannot fully answer the question. I think that there's kind of a default pattern, which is governed, for example, by the density of CpGs and of Cs and Gs. What we are observing, for example, is that when we engineer artificial DNA into cells, we can predict the methylation states of this artificial DNA based on the density of the CpGs and the density of Cs and Gs. So this is one piece of information that needs to be read out to establish the patterns. In addition, what the cell needs is transcription factors, I think, that recruit DNMT1 at some point, and then when they divide they also copy
the information from the parent strand. And I suppose when one of these mechanisms is missing, when DNMT1 is gone, what is lost, I think, is the copying of the changed states. You need to remember: when you go through differentiation, you first change some CpGs, and then the next, and then the next, and so you move away from the default state. You need to remember what you did, and you copy every step to the next generation. If you can't do that, I think you can't differentiate. But that's kind of a rough idea. Okay, thank you. And so we have one last question, from Sashti Kumayoti. First of all, he says great work and wonderful presentation. And then: you mentioned imputation to handle missing data, but it would be helpful if you could expand on it in more detail. He's mostly talking about the sparse and NA values in some of the data you were talking about, and how you use this with machine learning. Right. So the idea is that you can learn an embedding for the cell type, one for the assay, and also one for the local position. Once you are learning these embeddings, you can combine them: you feed them into a neural network and learn a function that combines them and creates a prediction for a cell in this tensor that is missing. I think the problem is that you need to have the right combination: every time you predict, you have a different set of assays, a different vector of assays and cell types, from which you can learn. So that's one thing. The other thing is that, instead of using the embeddings of the genomic position and the cells, but in particular the genomic position, you can actually use the observed measurements, and you can have an embedding for those as well. So I think this would be a
subject of a whole talk. Okay, great, thanks a lot. I think there are no further questions, so you also get a round of virtual applause. Thanks a lot for your very nice presentation.
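
The imputation idea from the last answer (one embedding each for cell type, assay, and genomic position, combined by a neural network to predict a missing entry of the data tensor) can be illustrated with a small sketch. This is not the speaker's implementation. The synthetic signal, the tensor sizes, the single hidden layer, and all names (`E_cell`, `train_step`, `MISSING`, and so on) are assumptions made up for illustration, and the network is trained with plain stochastic gradient descent written out by hand so that the example needs only the Python standard library.

```python
import math
import random

random.seed(0)

N_CELLS, N_ASSAYS, N_POS = 3, 3, 20       # toy tensor: cell type x assay x position
DIM, HIDDEN, LR, EPOCHS = 2, 8, 0.05, 50  # hypothetical hyperparameters

def rand_matrix(rows, cols, scale):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

# Synthetic "signal" (invented for this sketch, not real epigenomic data).
cell_f = [0.5, 1.0, 1.5]
assay_f = [0.2, 0.6, 1.0]

def truth(c, a, p):
    return cell_f[c] + assay_f[a] * math.sin(p / 3.0)

# Pretend one whole experiment (cell type 2, assay 1) was never measured.
MISSING = (2, 1)
observed = [(c, a, p) for c in range(N_CELLS) for a in range(N_ASSAYS)
            for p in range(N_POS) if (c, a) != MISSING]

# One embedding table per tensor axis, plus a one-hidden-layer network.
E_cell = rand_matrix(N_CELLS, DIM, 0.1)
E_assay = rand_matrix(N_ASSAYS, DIM, 0.1)
E_pos = rand_matrix(N_POS, DIM, 0.1)
W1 = rand_matrix(HIDDEN, 3 * DIM, 0.5)
b1 = [0.0] * HIDDEN
w2 = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN)]
b2 = [0.0]  # one-element list so train_step can update it in place

def forward(c, a, p):
    x = E_cell[c] + E_assay[a] + E_pos[p]  # concatenate the three embeddings
    h = [math.tanh(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
         for j in range(HIDDEN)]
    y = sum(w2[j] * h[j] for j in range(HIDDEN)) + b2[0]
    return x, h, y

def mse(entries):
    return sum((forward(c, a, p)[2] - truth(c, a, p)) ** 2
               for c, a, p in entries) / len(entries)

def train_step(c, a, p):
    x, h, y = forward(c, a, p)
    err = y - truth(c, a, p)                    # gradient of 0.5 * err**2 w.r.t. y
    grad_x = [0.0] * (3 * DIM)
    for j in range(HIDDEN):
        dpre = err * w2[j] * (1.0 - h[j] ** 2)  # gradient at hidden pre-activation
        w2[j] -= LR * err * h[j]
        b1[j] -= LR * dpre
        for i in range(3 * DIM):
            grad_x[i] += dpre * W1[j][i]        # accumulate before updating W1
            W1[j][i] -= LR * dpre * x[i]
    b2[0] -= LR * err
    for i in range(DIM):                        # route gradient to each embedding
        E_cell[c][i] -= LR * grad_x[i]
        E_assay[a][i] -= LR * grad_x[DIM + i]
        E_pos[p][i] -= LR * grad_x[2 * DIM + i]

initial_loss = mse(observed)
for _ in range(EPOCHS):
    random.shuffle(observed)
    for c, a, p in observed:
        train_step(c, a, p)
final_loss = mse(observed)

# Impute the experiment that was held out entirely.
imputed = [forward(MISSING[0], MISSING[1], p)[2] for p in range(N_POS)]
print("training MSE before/after:", initial_loss, final_loss)
print("first imputed values:", imputed[:3])
```

The held-out experiment never appears in training, yet its cell-type embedding is learned from the other assays in that cell type and its assay embedding from the other cell types, which is what lets the network produce a prediction for it. The second variant mentioned in the answer, replacing the position embedding with an embedding of the observed measurements at that position, is not covered by this sketch.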