Giovanni, there's Giovanni in the middle, who took over from Alex, and Tanmaji and Chris, who also helped with the project. And this picture was indeed still taken at the University of Tübingen; I think I was announced with my affiliation being Tübingen, but I have since moved to Dundee in Scotland, and some of them have come with me. Okay, so histone modifications, that's the first part, that's what we are talking about first, and just a short reminder again. You have your DNA, and the DNA is wrapped around these histone proteins, forming this beads-on-a-string-like structure. If you look in detail at each of these histones, they are actually complexes made up of several proteins, and these proteins have tails. The proteins are called H2A, H2B, H3 and H4, and they have amino acid tails, which you can see here, and some of the residues in these tails can be modified. So for example H3K4, that's the lysine at position 4 of histone H3, can be methylated or acetylated, and you see the different chemical modifications that are possible here. This is the nomenclature that I have already been using yesterday: I have been talking about H3K4 trimethylation, or you could think about H2AK119 ubiquitination, and as you can see there is a whole circus of different modifications at different positions, and this is what creates this whole landscape of epigenomic patterns. So here is another view of the same thing: you have the DNA, which is wrapped around the histone proteins, and then you have the different histone modifications that are all potentially present at a given nucleosome, and of course you have to measure that across the whole of this very long string to get the information in a genome-wide way. So the imputation challenge was to fill in missing data. Epigenomics is of course very data intensive, and that is because, for one, there are a lot of histone modifications that you can measure (they are shown here on the x-axis) and, as I said, in contrast to the genome, the epigenome is really different between cell types and tissues. So even for a given individual you would have to measure it across different tissues, and for the reference epigenomes of the Roadmap project you can see that even there a lot of data is missing. The blue squares are measured data and the white squares are missing, and what we want to do here is fill in the data for each of these squares. Importantly, in the case of these reference epigenomes the tissues do not stem from the same individual or the same tissue culture; they are measured in different labs, and they are not related in the sense of being derived from the same individual. So a reference epigenome is a mix of data from different labs which do not share the same underlying genome. And so the ENCODE consortium, which is an international collaboration funded by the National Human Genome Research Institute, is trying to compile a comprehensive parts list of all functional elements in the human genome, including the regulatory elements that control the cells and circumstances in which a gene is active. And a major activity of this consortium is to generate new datasets that enable the characterization of various types of biochemical activities.
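To make that setup concrete, here is a tiny sketch (with made-up sizes and values, not the actual Roadmap/ENCODE dimensions) of how one can think of the data: a cell-type-by-assay collection of genome-wide tracks where whole tracks are missing, and imputation means filling in those white squares.

```python
import numpy as np

# toy epigenome "data cube": cell types x assays x genomic bins
# (hypothetical sizes; real data has many more cell types and bins)
n_celltypes, n_assays, n_bins = 5, 4, 1000
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=1.0, size=(n_celltypes, n_assays, n_bins))

# whole (cell type, assay) tracks are either measured or missing
measured = rng.random((n_celltypes, n_assays)) < 0.6   # True = blue square
data[~measured, :] = np.nan                            # white squares: to be imputed

print(f"{measured.sum()} measured tracks, {(~measured).sum()} tracks to impute")
```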
However, performing these assays is of course still somewhat expensive, and there are technical challenges, for example some tissue types or cell types are not easily accessible, and this really prevents the completion of the matrix. Therefore they called an international challenge, which was mainly led from Stanford, with Manolis Kellis, and Bill Noble at the University of Washington, and they provided us with training datasets. This is the same kind of matrix that you have seen before; in this case you have the histone modifications on the y-axis and the cell types on the x-axis, and the blue dots are the tracks provided during the training phase of the challenge. In the second phase we got the validation datasets, and eventually, for the final prediction, they actually generated new datasets, and that is where they assessed the different competitors. As you can see, the sampling of the training and test sets is potentially not ideal; these test sets are not really randomly sampled. So initially, as I said, the challenge closed somewhere around August or September 2019, and we had not done imputation before, but we had a prototype which we called IMP, and you can see that it performed among the best methods in this international competition, and then we went on from there and improved the tool further. But let's first think about why imputation is possible in the first place and what assumptions we can make. The assumption is that there are partial correlations between different datasets, and that is because the individual histone modifications are not independent of each other. We discussed readers and writers yesterday, and we discussed that the writers are very often very big complexes that also contain reader domains or reader proteins, and therefore they read and write at the same time. I have also said a number of times that, for example, H3K4 trimethylation is found at the start of genes, which means that these complexes clearly recognize signals on the DNA, for example the start of genes or enhancer motifs, and these DNA signatures are of course the same across cell types, and therefore to some extent these histone modifications will be correlated across cell types. And this is the setting, as I explained before: we have different cell types, one and two, and we have different histone modifications, one and two, and then we want to infer a histone modification in another cell type. There were already some tools out there. The pioneers were really Jason Ernst and Manolis Kellis, I would say, who published a method in 2015 which they called ChromImpute, and it is still very widely used. It is based on an ensemble of regression trees: you have your different histone modifications here, you have your target mark here, and, across the different cell types, you build regression trees for your target mark, you do the same for your target cell, and then you combine everything as an ensemble of regression trees. The big disadvantage of this tool, however, is that for each new target cell type and assay combination, ChromImpute requires the training of a new ensemble of regression trees.
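Just to illustrate the flavour of that regression-tree idea, here is a minimal, hypothetical sketch in Python; it is not ChromImpute itself (which uses a much richer, multi-scale feature set), only the general pattern of training one tree ensemble per target track.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n_bins = 5000

# hypothetical feature tracks: other assays in the target cell type, plus the
# target assay measured in other cell types (ChromImpute engineers many more
# features at several scales; this is only the general pattern)
features = rng.gamma(2.0, 1.0, size=(n_bins, 6))
target_track = features[:, :3].mean(axis=1) + rng.normal(0, 0.1, n_bins)

# one ensemble of regression trees per target (cell type, assay) combination,
# which is exactly why every new target requires retraining
model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
model.fit(features[:4000], target_track[:4000])
imputed = model.predict(features[4000:])
print("held-out correlation:", np.corrcoef(imputed, target_track[4000:])[0, 1])
```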
So every time you have a new tissue and a new histone modification, or new training data or new marks become available, you would have to retrain, and that's a huge bottleneck. An alternative approach was suggested in 2018 and then again in 2020 by Timothy Durham and Jacob Schreiber, who treated the whole problem as a tensor factorization problem. In this case you have exactly the data matrix that you have seen before, but of course with a genomic axis as well: you have the genomic position along one axis, the assay (the different histone modifications) along another axis, and then the different cell types. If you can learn factors for the assays, the cell types and the genomic positions, then during prediction you can very simply feed a coordinate into this tensor and, by combining the learned factors in a clever way, make a prediction. What you can see here, for example, is that they use genomic factors at three different scales, so at different bin sizes: a very small bin size and then increasingly larger bin sizes. So it's a massive model, but it worked fairly well; and Avocado additionally had a neural network to combine the different factors, but otherwise it was very similar to PREDICTD, and it also worked quite well. It had the nice advantage that it generalizes very well to new, unseen combinations of assay and cell type. The drawback, however, is that it has to learn factors for every genomic position, and of course this is massive, because, as I've told you before, the human genome has three billion letters, so this genomic factor matrix is really very long. So we came up with eDICE, where we changed that to some extent. This is our model, and I'll go through it in detail in a second, but primarily we are also using slices through the genome, and we are looking at each of these slices independently. The assumption is that there is a lot of redundancy across the genome: if you are looking at a promoter region, then this should be captured in the data that you have for this position, and it will look, to some extent, similar to other promoter regions in the genome. So we are getting rid of the huge number of parameters that you need for the genomic axis. You could also imagine this as looking at a movie where certain pixels are missing and you want to complete the missing pixels, and you do that one frame at a time. The reason why we do it one frame at a time, and don't learn across the genome, is that for a given assay you either get measurements for the whole of the genome or for none of it: if you're missing that assay, you're missing it for all genomic bins, so if you want to predict this square, it is missing across the whole genome, and learning along the genome is not really helpful in that sense. It's like looking at a movie where your screen is broken in one corner, so it's always the same pixels that are missing, and we are trying to complete it by looking at the remaining pixels that are available for training and testing.
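For contrast with this slice-wise view, here is a minimal sketch of the tensor-factorization idea behind PREDICTD and Avocado described a moment ago: every cell type, assay and genomic bin gets a learned latent factor, and a prediction is obtained by combining them. The factors here are random placeholders, just to show the mechanics and why the genomic factor matrix grows with genome length.

```python
import numpy as np

rng = np.random.default_rng(2)
n_celltypes, n_assays, n_bins, k = 10, 8, 1000, 16

# "learned" latent factors (random here, just to show the mechanics); note the
# genomic factor matrix scales with genome length, which is the drawback
# mentioned above (Avocado additionally feeds the factors into a small neural net)
cell_factors  = rng.normal(size=(n_celltypes, k))
assay_factors = rng.normal(size=(n_assays, k))
bin_factors   = rng.normal(size=(n_bins, k))

def predict(cell, assay, genomic_bin):
    """CP-style tensor factorization: elementwise product of factors, summed."""
    return float(np.sum(cell_factors[cell] * assay_factors[assay] * bin_factors[genomic_bin]))

# predict a missing (cell type, assay) track one bin at a time
missing_track = [predict(cell=3, assay=5, genomic_bin=b) for b in range(n_bins)]
print(len(missing_track), "imputed values")
```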
Okay, so the way we do it: first we use a signal embedder. For a given cell type, we take the data from all the assays at a given genomic position, we combine that with a global cell type embedding, and we embed this signal with the signal embedder to capture the information of this genomic bin together with the global cell type embedding. Then we use self-attention blocks; this is coming from natural language processing, basically a transformer-style architecture, where we now look at all the other cell types and try to identify which of the other cell types in this slice is really most informative for making a prediction for this cell type. So we create contextualized cell type embeddings with this self-attention block, and with that we can really capture the interactions between different cell types in the model. We take a similar approach across the assays: we take the data for one assay across all the cell types in one slice, we combine that with a global assay embedding, and we again use self-attention to obtain contextualized assay embeddings that also look at the other assays at the same time. Then we combine these two embeddings in a multi-layer perceptron to make a prediction for the square that is missing, exactly here. What this allows us to do is build a model that is orders of magnitude smaller than the previous models; instead of billions of parameters we only have about six million parameters, and it also generalizes really well to unseen assay and cell type combinations without retraining. We completely got rid of the genomic embeddings, so the model doesn't scale with genome size, which for the human genome is very good news, and it turns out that the architecture is also very suitable for domain adaptation, in order to apply it to new datasets. If we look at some performance metrics and at the generalization efficiency: with this strategy we can use different random sets of genomic bins for training. As a performance metric we use the area under the precision-recall curve for enrichment, for peaks, and on the x-axis here I'm showing the number of training bins used for genome-wide prediction. The blue dots are eDICE; the higher this area under precision-recall, the better, and the x-axis is in log scale. And this purple dot is the competitor, which is ChromImpute... no, I'm sorry, this is actually Avocado. For Avocado, because it needs the genomic factors, the training has to be done on the whole genome; we can't do the same thing of just using parts of the genome for training, it has to have seen the whole of the genome during training in order to make predictions. So it really uses all of the genome for training, but even when we use four orders of magnitude less training data we already get better performance, and as you can see the performance increases with more training data, as it should. We also looked at mean squared error, basically how similar our predicted signal is to the measured signal, where smaller is better, and you see again that with considerably less training data we get really good results; and if you look at genome-wide correlation we also do really well. So that was quite nice.
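Here is a rough sketch, in PyTorch, of the attention scheme just described; this is my own simplified reconstruction, not the authors' eDICE code, and the layer sizes and names are arbitrary.

```python
import torch
import torch.nn as nn

class SliceImputer(nn.Module):
    """Sketch of the idea described above: per-genomic-bin 'slices' are embedded
    along the cell-type and assay axes, contextualised with self-attention, and
    fused by an MLP to predict one missing entry."""

    def __init__(self, n_celltypes, n_assays, d=32):
        super().__init__()
        self.cell_emb = nn.Embedding(n_celltypes, d)   # global cell-type embeddings
        self.assay_emb = nn.Embedding(n_assays, d)     # global assay embeddings
        self.cell_sig = nn.Linear(n_assays, d)         # signal embedder: all assays of one cell type
        self.assay_sig = nn.Linear(n_celltypes, d)     # signal embedder: all cell types of one assay
        self.cell_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.assay_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, slice_, target_cell, target_assay):
        # slice_: (batch, n_celltypes, n_assays) signal values at one genomic bin
        # (in the real setting the entry being predicted is missing/masked here)
        cells = self.cell_sig(slice_) + self.cell_emb.weight              # (batch, n_celltypes, d)
        assays = self.assay_sig(slice_.transpose(1, 2)) + self.assay_emb.weight
        cells, _ = self.cell_attn(cells, cells, cells)                    # contextualised cell types
        assays, _ = self.assay_attn(assays, assays, assays)               # contextualised assays
        fused = torch.cat([cells[:, target_cell], assays[:, target_assay]], dim=-1)
        return self.mlp(fused).squeeze(-1)                                # predicted signal value

# toy usage with hypothetical sizes
model = SliceImputer(n_celltypes=20, n_assays=12)
x = torch.rand(8, 20, 12)                  # 8 genomic bins, 20 cell types, 12 assays
pred = model(x, target_cell=3, target_assay=7)
print(pred.shape)                          # torch.Size([8])
```

The point is simply that all parameters live in the embeddings, the attention blocks and the MLP, and none of them is indexed by genomic position, which is why this kind of model does not grow with genome size.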
So now let's look at some specific results for our predictions. This is a small part of the genome, a very small part, with base pairs on the x-axis; it's part of a chromosome, I think chromosome 11. You see the observed measurement in blue and our imputed one in red on top, and then there are so-called peak callers; the one used by the official ENCODE pipeline is MACS2, and they call the regions which are enriched above the background. You can see that the peaks overlap quite well when you run this on the observed signal versus the imputed signal, and the signals themselves also look fairly similar, so it does look like the eDICE imputed tracks can quite accurately capture the measured signal. So again, here are some of the performance numbers: mean squared error globally and genome-wide correlation, and eDICE is doing better than its competitors across a large number of different tracks and histone modifications, which is why you get these box plots. So it does look like eDICE is doing quite a good job. However, the problem here is that it is really difficult to assess the quality of the prediction, and the reason is that we are actually not so much interested in recapitulating the precise signal across the whole genome; rather, we want to find very subtle differences between, for example, tissues. So here is a needle in a haystack, this is of course a painting, and if you were able to very precisely reproduce this painting, to predict this painting, but you did not paint the needle, then overall you would seem to be doing a good job, but you would be missing what we are looking for in biology. And the problem is that we don't even know precisely where the needles are, or what a needle looks like. Really, we want to assess our predictions mainly on the needles and not on the rest of the haystack; we are not interested in painting the haystack very accurately, we really want to find the differences between cell types, for example. Do we achieve that? What would tissue-specific differences look like? Here is, for example, a peak, a signal, again with the genome on the x-axis, which is the same in two different cell types: a cell type called E025 and a cell type called E052. The measured signal is the solid line, in blue and red, and you see that these look fairly similar. When we look at the imputed data, the dashed lines, they recapture this shape quite well, and they look very similar for the two different tissues. If we look, however, at a peak which is different between the two cell types, again E025 and E052, you see from the solid lines that the signal is only present in one cell type and not the other, and we have captured that really well too. So we can see that the two cell types are related in some regions and differ in others, and we are able to predict very well both the regions where they are similar and those where they are different. We have tried to quantify that to some extent, using the Wasserstein distance as a metric for how similar the shapes of these peaks are. You can see that for this peak, the red triangle, you can find it here: the Wasserstein distance between the observed marks in the two different tissues is large, because this is a cell type specific peak.
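The Wasserstein comparison mentioned above is easy to reproduce in miniature; here is a small sketch with two made-up peak profiles (the exact normalisation used in the actual analysis may differ).

```python
import numpy as np
from scipy.stats import wasserstein_distance

bins = np.arange(100)                                   # genomic bins across a peak region
shared_peak = np.exp(-0.5 * ((bins - 50) / 5) ** 2)     # similar shape in both cell types
specific_peak = np.exp(-0.5 * ((bins - 30) / 5) ** 2)   # peak present in only one cell type

def peak_distance(a, b):
    # treat the signal profiles as distributions over genomic position
    return wasserstein_distance(bins, bins, u_weights=a, v_weights=b)

print("shared vs shared:   ", peak_distance(shared_peak, shared_peak))
print("shared vs specific: ", peak_distance(shared_peak, specific_peak))
```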
When we look at the imputed tracks, the distance between the two tissues is also large, so again we are doing quite well at identifying the differences as well as the similarities, and that was reassuring. Now, I see I took out some slides here. I have been working with biologists and experimentalists for quite a long time, and as bioinformaticians what we have been trying to tell them is that they have to do replicates: it's not enough to do the measurement once, you have to do it at least three times in every cell type, and then you do statistical testing to see whether these more subtle differences, maybe here for example, really relate to differences between cell types, or whether it is just noise, random differences, just the variance of the signal itself. For the imputations this is not being done: we basically just pool the replicates and predict the mean signal, and that, I think, is a bit of a drawback. We tried to estimate the variance, then create replicates based on the imputed means, and then do statistical testing on the different peaks as well, and that worked quite well too; but in general the imputation field looks at mean signals and then tries to identify differences between tissues without doing proper statistical testing, and I think that is still a bit of a drawback and potentially something that needs to be investigated a little more. So the next thing we tried: is it possible to do this in an individual-specific way, can we use our tool to predict across individuals? But let me pause for questions first, because we're moving to a different sub-project. Are there any questions in the room or in the virtual room about transformers and the imputation challenge in general? I have to say that for very specific technical questions I will eventually refer you to my students, who did this very sophisticated implementation, but I'm very happy to answer questions. Are there references that you can share about this, is there a bioRxiv preprint? Yes, the paper is on bioRxiv; it has been submitted and we received major revisions, so at the moment it is a bioRxiv paper. After this part I will also talk a little more about the competition; the competition paper is also a bioRxiv paper, and I can add the references to the slides. Okay, thank you. Any questions that have matured in these minutes? Okay, carry on. Okay, so the next project, following up on these reference epigenomes: we were thinking it would be really cool if you could do personalized epigenomic imputation. As I've said before, the reason is that epigenomic patterns seem to change very early on, for example during tumorigenesis, so they seem to be an early diagnostic marker; they are also markers for neurodevelopmental disorders and neurodegenerative changes, and they are quite relevant for aging as well. However, a lot of tissues are not easily accessible; blood, for example, is easily accessible, but you might want to know something about your liver tissue.
So would it be possible, eventually, if we had some tissues measured in a target individual, and we have learned the relationships between these tissues on a different individual, to make predictions for a missing tissue for certain epigenomic marks? Very recently a very good dataset came out and was actually added to the ENCODE dataset. It is the same data matrix you have seen before, with the different histone modifications along one axis and the different cell types along the other, and in addition we now have four individuals: these little circles indicate that for this assay and this tissue we have data from four different individuals. Two of the individuals are female and two are male, and they have different ages as well. In this case we chose a small subset of these datasets, and we first tried to predict across individuals. Here we first had to really understand the data again. This is a dataset for one tissue type and one histone modification, tissue 106, and the assay is H3K4 trimethylation again, and in the different rows you see the different individuals. What you can see is that globally these signals look very, very similar, and the imputation and the observation are also very, very similar; that's not surprising, because these regions are actually very well conserved. Just to give you an idea (I'm sorry, I don't know why the slide is building up like that): what we are trying to do is, for this peak and for this individual, this row of the data matrix, with these different tissues and these different assays, we use all the data available and try to predict this one pixel, the value for this individual pixel, and we do that for each of the individuals. When we look at the imputed epigenomes, the top row is the observed data and the bottom row is the imputed one; again, this is a leave-one-out setting, so we used all the other pixels to predict one pixel, and then we iterate around the pixels. We are doing a pretty good job predicting this peak, but it's also a fairly easy signal, because it's shared across all the individuals. This is a much more difficult setting: it's a different tissue, also H3K4 trimethylation, and in this case only one of the individuals has the peak and the others don't. Okay, that's a bit silly, I'm sorry, something went wrong with the slide here. If we take this individual, this row, and we now look across the different tissues in that individual, it looks like this: it's a peak which varies a little, not only between individuals but also between tissues, and so it becomes really difficult to predict. The data matrix looks like this in this case, and the question would be: is this peak there or not? It is there, and our predictions for this peak are also relatively good compared to the observed data. Okay, I hope this makes sense. Is the strategy clear here? Not entirely, to me. So you're not treating, well, the obvious thing would be to treat different individuals as if they were just different tissues, right?
But you want to somewhat retain the fact that you might have more tissues from the same individual, is that it? No: we first have all the data available for one individual, and then for the target individual we have missing data. In this case we start by removing just one single pixel in the target individual, and we want to predict the value for this assay and tissue combination in the target individual. So we have the full matrix for another individual, this one, and we then do transfer learning to infer the missing data in another individual, given the other data that we have for that individual. So the E numbers are different types of tissues, right? Yes, these are the cell types, the tissues. So the assumption here is that we have missing data for one tissue in one assay, which is a fairly easy setting, because we still have a lot of data available. In the next step, what we did is that we would not have any data at all from one tissue for the target individual, and we would try to predict the whole row for the new individual, and that turns out to be quite hard; we are not doing very well on that yet, so we still have to work on it. But this is the first step, where we have learned the relationships between the different tissues in a different individual, and in the target individual we try to infer missing data; and we are particularly interested in the differences between the individuals as well. There was a request to repeat from the virtual room, but it's not clear what exactly; is it clear now after this explanation, Tasnim? Well, okay, let's assume you cleared it up. Good. Oh, there's a question from the room; the microphone is being carried to the questioner. Just one question about the individuals: are they random healthy individuals, or are they differentiated by some specific illness? So, you can't take these samples from living individuals, so they are deceased, post mortem. But still the question remains whether they were, you know, healthy or diseased when they died, right? Well, they were considered to be relatively healthy. Alright, okay, thank you. Yes, they were considered to be healthy. Good. Okay, so I guess what you are seeing here is a slice through one bit of the genome; these images are just one slice, one position on the genome, and you see the differences between the individuals. You have two males and two females here, again with different tissues and different assays, and what you want to do is capture these differences between the individuals; this female looks quite distinct from this male, for example. And in the imputation you see that we are doing that fairly well, although it's a hard problem. Okay, so why should that work in the first place? If the underlying reason for a changed epigenomic peak is in the DNA sequence itself, so if you have a mutation in the DNA sequence, then that would be individual specific, and if it's a germline variant you would see it in all the tissues, and it could potentially affect more than one assay, and therefore you should be able to infer these differences across the assays.
It could also be that you had a systemic response, for example to pain: a certain gene might be switched on because you are repeatedly exposed to severe pain, and this is remembered in your epigenome, not in a tissue-specific way but in a systemic way, and then it is possible that you could predict how it manifests itself in a certain tissue based on the other tissues that you are seeing, and then you could potentially make predictions about what these changed epigenomic states in that tissue might cause. So that's the motivation for trying to predict these individual-specific changes. But again, it's really difficult to know how to validate our predictions, and also how to compare to baseline methods. Because, as you have seen before, in the H3K4 trimethylation setting a lot of the peaks look very, very similar across individuals, and you see that here: if we look at one male and one tissue, and at all the peaks that we find there for this assay, we see that most of these peaks are shared across all four individuals and all tissues. This could be because H3K4 trimethylation is very often found at promoters, and the promoters of genes are very important, so they are shared across individuals and also across tissues. So H3K4 trimethylation is easy to predict in a sense, because there is not much variation between individuals. But then you have something like H3K9 trimethylation, and in this case most of the enrichments are found in only one individual and only in a subset of the tissues, and then it gets really difficult to predict them. So these are the shared and unique epigenomic patterns, and as a baseline we compared our method to a very simple way of making predictions, which is just averaging. In one case we average, for a given assay, across all the individuals: if we just use the average of all individuals, how does the signal look then? The other baseline method is to take the other three measured individuals and use the average of those as the prediction for a given cell type and assay combination; that's the track average. And in our case, the eDICE leave-one-out scenario, we use all the available data for the given individual except for the one track that we are trying to predict; in the next step we tried to do it without any data at all from that tissue, but that is increasingly hard. In the case of the track average it is of course very easy to predict these peaks here, which are shared across individuals: because they are shared, you do quite well on these peaks. And if you look at the average across individuals, it is these peaks which will be easy to predict. Then, if you look at performance measures computed over all enriched regions of the genome, for H3K4 trimethylation the averaging already does quite well, and that is based on the way the data is shared between individuals and tissues.
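The two averaging baselines just described can be written down in a couple of lines; here is a sketch with a hypothetical individuals x tissues x assays x bins array (my own rough reading of the baselines, so the exact definitions used in the analysis may differ slightly).

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical data: 4 individuals x 6 tissues x 5 assays x 500 genomic bins
data = rng.gamma(2.0, 1.0, size=(4, 6, 5, 500))

target_ind, target_tissue, target_assay = 0, 2, 1

# baseline 1 (roughly the "assay average"): average the given assay over all
# individuals and tissues where it was measured
assay_avg = data[:, :, target_assay].mean(axis=(0, 1))

# baseline 2 ("track average"): average the same (tissue, assay) track over
# the other three measured individuals
others = [i for i in range(4) if i != target_ind]
track_avg = data[others, target_tissue, target_assay].mean(axis=0)

truth = data[target_ind, target_tissue, target_assay]
for name, pred in [("assay average", assay_avg), ("track average", track_avg)]:
    print(name, "correlation:", np.corrcoef(pred, truth)[0, 1])
```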
You can see that here we are looking at the area under the precision-recall curve; the blue line is eDICE, and the averages work quite well for some of the marks, while for others they don't work quite as well. So we make a little bit of progress, but to be honest it is a really hard prediction problem. The ideal target would be for a patient to come into the clinic, you take a blood sample, you run the assays for all these different epigenomic marks, and then you make a prediction across many different tissues that you can't sample, and use that for diagnostic purposes. I think we are quite far away from that, but that's our motivation, that's where we want to go. Okay, can I actually ask a question? These numbers are much, much lower than the auPRCs that we were seeing before, right? Exactly. And the reason is that you're focusing on individual-specific enrichment? Yes. Okay, so you've taken a subset, you're looking at the hard cases, basically? Yes, exactly: we are subsampling the peaks and only looking at peaks which are not shared across individuals. I see, okay, right, thanks. Okay. In the meantime, while we were working on the individuals, the competition organizers were also looking at the imputed datasets that had been provided, and they came up with a ranking. The first ranking that we got was in June 2020, and we had the third rank there. Again, the problem here is that they also used averages as a baseline. This is going back to understanding epigenomic patterns in the reference epigenomes, and even there, if you use the average for a given assay across all the cell types, then according to their metrics the average did remarkably well compared to all these very sophisticated, partly deep-learning tools, which was quite frustrating, to be honest, and I think it was also quite frustrating for the organizers. They then revalidated the whole dataset and came up with a different ranking; in that case the average performed much worse, and we were still doing relatively well, up here. And in the meantime, in June 2021, they produced a third ranking, in which the average did intermediately well and we did much worse. So the problem here is that the assessment of the predictions is really, really difficult. We could talk a lot about different types of models and improving the transformer model and so on, but the real problem is that we do not yet understand how to assess the data, and that is a really, really big problem, also for the interpretation of these histone modification marks: we do not understand what a good mark is, what the needle in the haystack is. And I think if you are doing machine learning in the context of biology, this is the take-home message that I would like to give you: it really depends on identifying the needle in the haystack and getting the metrics for your validation correct. And that brings us to the next project, which is trying to understand what these signals actually mean, and whether we can do something slightly better in the preprocessing, in order to get some of the confounders out of the measured datasets in the first place and be better able to interpret them.
And I think here I'm taking a break for more questions. Yes, so one of the things that I noticed was that Avocado was consistently the worst, no matter how you renormalize the data, right? And this was the method that was stated to be, you know, the best thing by Bill Noble and the organizers at the beginning, right? Yes, that's actually true. And so, I mean, it was quite high impact, or at least it was well published in a good journal and so on; how did they manage that, since it is consistently worse than the average, whatever you do? Yes, I suppose that the competition is flawed to some extent in the way the data, and in particular the test data, was collected. But I think we are potentially in a similar process to what happened for protein structure prediction: that took, I think, twenty years or longer to actually get the right assessment scores and to understand how the competition should be run. So this is the first time for this imputation challenge, or the second, I think there was one before, a slightly different one, so I think it is still early in understanding how a competition like this should be designed. And to some extent it was created such that you would improve upon Avocado, which was the state of the art at the beginning of the competition. What is really worrying is that just averaging across the samples gives such a good result, and I think part of the problem is that the training and test data were produced at different times in different labs, so you have massive confounders and massive batch effects: the tracks in the test set share these confounders, but the confounders were different from the training set. So if you just average across the tracks from the test set, you more easily make a correct prediction; you are almost predicting the technical differences. There's a question in the chat: could you explain what rank means here, what are the metrics really, could you expand a little more? So, we have a large number of different tracks to predict, and then a large number of different metrics, like mean squared error over the whole genome, area under the precision-recall curve for the MACS-called peaks, area under the ROC curve, background correlation, foreground correlation (foreground being within the peaks, so whether you predict the peak height correctly) and so on. They computed all these different metrics for all the different tracks and then created a combined rank for each of the predictions. Some of the metrics were highly redundant: genome-wide correlation is very much redundant with background correlation, because the majority of the genome is background for many of the marks. So I think the way they constructed the ranking is problematic, and the way they constructed the training and test sets is problematic, but it is also, yes, not very clear at the moment, because we do not know what the needle in the haystack is. How do you compare two pictures of a haystack if what really matters is the needle, but you do not know what the needle is, or there are several needles in that haystack, and the question is whether you identify all of these needles, and whether you identify needles that are different between different tissues or different individuals, if that makes sense? So the validation is really difficult.
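To make the "combined rank" from that chat question a little more concrete, here is a toy sketch of how such a ranking is typically assembled (hypothetical team labels and numbers, and a much smaller metric set than the organisers actually used): score each submission per metric, rank within each metric, then average the ranks.

```python
import pandas as pd

# hypothetical per-track scores for three submissions on two metrics
# (higher is better for auPRC, lower is better for MSE)
scores = pd.DataFrame({
    "team":  ["team_A", "team_B", "team_C"],
    "auPRC": [0.62, 0.48, 0.55],
    "MSE":   [1.1, 1.6, 1.3],
})

ranks = pd.DataFrame({
    "team": scores["team"],
    "auPRC_rank": scores["auPRC"].rank(ascending=False),  # best auPRC -> rank 1
    "MSE_rank": scores["MSE"].rank(ascending=True),       # lowest MSE -> rank 1
})
ranks["combined_rank"] = ranks[["auPRC_rank", "MSE_rank"]].mean(axis=1)
print(ranks.sort_values("combined_rank"))
```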
Any questions from the room? I mean, I would actually have another question on these needles: suppose you had a method that performed amazingly well, you know, drove your average error almost to zero; how would you know you had actually found the needle, and what is it then, exactly? That's exactly the question, and it would kind of invalidate a purely prediction-based approach. Well, the problem is bigger than just the imputation and the prediction, because if you have a dataset generated in one lab and the same dataset generated in another lab, and you try to identify the needles in these two datasets, they will be different sets of needles, and quite significantly different sets of needles, because the technical biases are quite significantly different. So how do you cope with the variation in the measurement itself, how do you disentangle the true hidden signal from the signal that you actually measure? It's not just about imputation, it's also about the technique, the measurement: how do you say that this experiment worked and this experiment did not work, or how do you say this experiment worked better than another one? Okay, if I see no further questions either from the actual room or the virtual room, let's go to your multi-histone ChIP analysis. Okay, yes. So with that we started to think a little bit more about how these signals are generated in the first place. What you see here is a binned version: every little graph here is one histone modification, and within it the dots are bins on the genome, and we are looking at the coverage, the height of the signal, and comparing it with a control sample, a background control. And it looks like, in many cases, for many signals, for example H3K4 trimethylation, there is really a very high correlation between the background control and the actual signal that you are interested in. So even if you are not looking for this mark but just at the background, you get a similarly high signal as when you are actually measuring the mark, and that is of course problematic. So let's look at the experiment and how these signals are actually generated. This is ChIP-seq, chromatin immunoprecipitation followed by sequencing, and the protocol is as follows. You have the DNA in black here, and I'm showing five nucleosomes in a row, three of which carry the mark you are interested in, shown as a red ball. What you do first is sonication: you try to break this very long DNA string into smaller bits, you do fragmentation such that these bits can be easily sequenced, and then you can identify specific locations on the genome. The problem with the fragmentation is that it is not random, because the bits which are actually wrapped around the histones are not easily breakable, so most break points happen between histones.
And if you have very densely packaged chromatin, then it is very difficult to break it up, so you would get longer bits of DNA in this region and shorter bits in this region. Then you do the ChIP part, which is the immunoprecipitation: you take a specific antibody that recognizes your histone modification of interest, and you use the antibody to pull out the DNA which is bound to the histones carrying the modification; eventually you are only interested in sequencing the bits of DNA which carry the modification and which are pulled out with the antibody. So this bit is pulled out particularly strongly, but you also have a size selection step, because you can't sequence fragments which are too long very well, so this bit would actually not be sequenced very well, and this bit would not be sequenced very well either, because it doesn't carry the mark. So you end up with reads which are then mapped to the genome, where you have a very high peak here but lower peaks here, because this region is just too densely packed for fragmentation, and lower reads here. So instead of identifying these two regions, you are potentially only able to identify this region, this modification; and also the resolution is much lower here, because this nucleosome is not independent of the other two, which actually carry the mark while this one doesn't. To some extent you can correct for these problems by removing the ChIP step: you just do the fragmentation and the size selection and then you map everything, and with that you have a control or background sample that does not enrich for your modification of interest, but it can tell you something about which areas of the DNA are actually measurable in the first place. This would be like the bright side of the moon and the dark side of the moon: here it is much more difficult to make measurements in the first place, and here it is easier to make measurements, and that is your control sample. So your control sample, at any given genomic bin, is really determined by the accessibility of your chromatin, but also by the mappability: in some cases the DNA is very repetitive, and then you cannot easily identify the specific locus a read is coming from, and this is determined by the DNA sequence itself. So you have accessibility and mappability, which determine your control sample. For the treatment samples you have the same problem, accessibility and mappability determine your measured mark, and in addition you have the specificity of your antibody: if your antibody is not very specific, you will also pull down fragments that do not carry the desired mark, so you pull down noise, unspecific reads. So in the treatment you have the sum of the unspecific noise and your signal, and both of them depend on accessibility and mappability. And the idea we have with DecoDen is to first try to separate, in the treatment sample, the signal-specific part from the unspecific part, and then to remove the accessibility and mappability, which enter as a multiplicative factor.
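Written out roughly (my own shorthand, pieced together from the description above and below; the exact DecoDen formulation may differ), the picture for a genomic bin $i$ is something like:

$$
c_i \;\approx\; a_i\, m_i\, u_i,
\qquad
y_{h,i} \;\approx\; a_i\, m_i \,\big(k_h\, s_{h,i} + u_i\big),
$$

where $c_i$ is the control track, $y_{h,i}$ the measured track for mark $h$, $u_i$ the unspecific background shared by all tracks, $s_{h,i}$ the true enrichment we are after, $k_h$ the antibody specificity, and $a_i\, m_i$ the multiplicative accessibility-times-mappability factor. The factorization step described next targets the shared $u$, and the regression step afterwards targets the multiplicative $a\, m$.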
In order to separate the unspecific from the specific parts, we use non-negative matrix factorization. For example, we have two signals: this is our background, our control sample, which in biology is also sometimes called the input sample, and this is the first histone modification that we are interested in here. By using non-negative matrix factorization we can find an unspecific component which is shared between the control sample and the treatment sample, and we can remove this unspecific component from our observed signal; and if we use a different histone modification, a different signal, we would still assume that the unspecific part is shared, so we can remove it there as well. If you look at the original data, this is now a heat map with the original data: the control measurement, the first histone modification and the second histone modification. After the non-negative matrix factorization, what you can see is that we remove quite a lot of the unspecific noise and pack it into the unspecific track, and we also have the mixing factors: for the control reads we know that they should be a mixture of only the unspecific track, which is why we have this entry here, and for the other two histone modifications we know that there should be an unspecific component and a specific one. So we get a much cleaner way of looking at the signal itself, with the unspecific reads removed from that signal. In the next step we use a technique to remove the multiplicative confounder. Does that make sense so far? Just to clarify completely: these indices are h, and also a and m; are those matrices? So, I have to say this is the first time I'm presenting this, so we are still working on the notation, but you are looking at a given genomic bin, for a given histone modification h and a given replicate; you can see that here are different replicates for the same histone modification and for the input, so we have two replicates each for the control sample, for histone modification one and for histone modification two. The idea is that most of the time the control samples are actually not very deeply sequenced, so we have a lot of missing information there; but because the antibodies are never one hundred percent specific, and in some cases they are actually not very specific at all, the assumption is that all the tracks carry quite a lot of unspecific signal, so we learn the blue curve, the unspecific track, jointly across all histone modifications and then remove it. That is the novel thing in this approach, because usually you just take one replicate at a time and divide by the input, by the control sample, rather than learning the unspecific track. And a and m would be different for every genomic bin? Yes: m is the mappability, which really depends on the DNA sequence of the genomic bin, how likely you are to find the right spot on the genome, and a is the accessibility, which depends on how the DNA is packaged in that specific cell type, and which determines the fragmentation and size selection process.
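Here is a minimal sketch of that NMF step on simulated tracks (hypothetical data; plain scikit-learn NMF, whereas DecoDen additionally constrains the control replicates to load only on the unspecific component).

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
n_bins = 2000
background = rng.gamma(2.0, 1.0, size=n_bins)        # unspecific part, shared by all tracks
signal = np.zeros(n_bins)
signal[800:820] = 5.0                                 # a true enrichment for the mark

# toy tracks (rows) x genomic bins (columns): two control replicates and
# two treatment replicates; treatment = specific signal + unspecific background
tracks = np.vstack([
    background * rng.uniform(0.8, 1.2, n_bins),
    background * rng.uniform(0.8, 1.2, n_bins),
    (signal + background) * rng.uniform(0.8, 1.2, n_bins),
    (signal + background) * rng.uniform(0.8, 1.2, n_bins),
])

# plain 2-component NMF; DecoDen's constrained version keeps the control rows
# purely unspecific, which this sketch does not enforce
nmf = NMF(n_components=2, init="nndsvda", max_iter=500)
mixing = nmf.fit_transform(tracks)       # per-track mixing factors (4 x 2)
components = nmf.components_             # component tracks: unspecific + specific (2 x n_bins)
print("mixing factors:\n", np.round(mixing, 2))
```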
Okay. So a and m are shared between the control sample and the treatment sample, but in the treatment sample you additionally have the signal, and k is basically the mixing factor, how much signal there is relative to the unspecific reads, so that is the specificity. So now we have got rid of this bit, the unspecific bit, and what we are interested in is the true signal s_h. With the non-negative matrix factorization we got rid of the unspecific reads, but the result is still confounded by a and m, which depend very strongly on the position. Even after we have removed the blue line from these two tracks, you can see that in this area, for example, it is really, really difficult to get any signal whatsoever, so if we do get signal here we want to boost it, while in this region it is actually quite easy to get signal, because even without an antibody you get more signal than you would expect. So we want to rescale this peak relative to this peak, because we already know that here it is difficult to make the measurement in the first place. To correct for that, to remove this multiplicative factor, we use a technique from causal inference which is called half-sibling regression. The idea is that what is hidden are the true signals s1 and s2, for two different histone modifications, and also a times m, the confounder, and what we measure is just x, y1 and y2; by writing down this causal model we can remove the effect of a times m from y and infer what the true signal should look like. So this is the red signal now, with the blue signal removed, but it is still confounded, and by applying the half-sibling regression we learn that this peak should actually get a bigger weight than this peak, because here it is much more difficult to make the measurement in the first place. So we start with a measurement which depends very much on your specific instrument, on the sonication for example, and on the antibody you are using, its specificity, so whenever you measure this signal in one lab and then go to another experiment it will look very different, and we are trying to correct for that in order to get the true hidden histone modification enrichment, this kind of signal. And I think if you use that for the prediction, then also in eDICE later on it will be more predictable and we will be able to make better predictions as well. Okay, and that is another peak which is corrected. I have shown you this plot before: you have the control samples here, and I showed you that there is a high correlation between the background and the different histone modifications that we are measuring. We are correcting for that with the non-negative matrix factorization and this half-sibling regression, so we are removing this effect, which is really a confounder, namely that different locations on the genome are more or less easily measurable, and this correlation is completely removed in the bottom plots.
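And here is a minimal sketch of that second step, the half-sibling-regression-style correction, again on simulated tracks; this is a simplified stand-in (a plain linear regression of the log treatment on the log control, keeping the residual), not the exact DecoDen procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n_bins = 2000
confounder = rng.gamma(3.0, 1.0, size=n_bins)              # accessibility x mappability (a*m)
true_signal = np.ones(n_bins)
true_signal[500:520] = 8.0                                  # hidden enrichment we want back

control = confounder * rng.lognormal(0, 0.1, n_bins)                   # x: background track
treatment = confounder * true_signal * rng.lognormal(0, 0.1, n_bins)   # y: mark after NMF step

# half-sibling regression, roughly: regress the (log) treatment on the (log)
# control and keep the residual as an estimate of the hidden (log) signal
X = np.log(control).reshape(-1, 1)
y = np.log(treatment)
residual = y - LinearRegression().fit(X, y).predict(X)
estimate = np.exp(residual)                        # corrected signal, up to a constant factor

print("correlation with confounder before:", np.corrcoef(treatment, confounder)[0, 1].round(2))
print("correlation with confounder after: ", np.corrcoef(estimate, confounder)[0, 1].round(2))
```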
We also did a couple of simulations first. In this case we created a peak and a dataset: this is the true peak, this is our control sample, and this is the histone modification sample. When we use MACS, the usual preprocessing tool, you would identify these two peaks, but after using DecoDen we identify only this peak, which is the true peak that we put in, and this one is almost completely reduced. Here is another example: this is the true peak, this is the signal after DecoDen, and this is the simulated ChIP-seq sample; if you take the ChIP-seq sample and apply MACS, the usual tool, this is the signal, and you see you would get an additional peak in this area, where you also have a high control signal. So again, it looks quite good for us. Now, this is a real dataset from ENCODE which we have corrected. The top is the control dataset, the background control, then this one is H3K4 trimethylation and this is H3K27 methylation, and what you can see is that here there is a big peak only in H3K4 trimethylation; this is after NMF and this is after the half-sibling regression, and you can see that the shape of the peak is not very much affected. Here, on the other hand, is a region where you have a high signal also in the control sample and also in H3K27 methylation, and this peak is basically removed already after NMF, and then again after the half-sibling regression, and the same holds for the H3K4 trimethylation signal. So we are conserving the big peak here, which is specific to H3K4 trimethylation, but we are removing peaks here that are probably due to a confounding factor. We have quite a lot more examples where we can really see that what we are doing makes sense, and we are now trying to validate that in different ways, but that is difficult too, because we still do not know for sure what the needle in the haystack is, and really what we want to do here is identify the needles in the haystack. Okay, any questions? I do have another question: if you use multiple tissues here, you may also be able to estimate, because the mappability is the same across tissues, presumably? Yes. But the accessibility is not, so you might actually be able to estimate tissue-specific accessibility directly from ChIP-seq, and you could compare that with, say, ATAC-seq or something like that, right? Exactly, yes, we are thinking about a way to do that as well. It is not entirely clear; I was talking to some of the biologists and they were not sure about that. I am calling it accessibility for lack of a better word; I think it tells you something about how the DNA is packaged, how densely it is packed, because if it is very densely packed you would get longer fragments. ATAC-seq, again, is quite a different technique with different biases, so it will be interesting to see how that compares, but I do think that the unspecific track in a tissue-specific context would be valuable in its own right. Any more questions from the room or the virtual room? You know, you still have three minutes. Okay, I think I will leave out the last part; this took a little bit longer than I had expected, I just went more slowly and tried to explain things.
I hope it was understandable and I didn't lose everyone. Anyway, last call for questions before lunch. Well, if not, let's thank Gabriele, and I'll stop the recording. Right, Antonio, I started a few minutes late; fortunately someone reminded me.