Yes, it's 3pm in Switzerland and time to start our second talk of the afternoon. It's my great pleasure to welcome Kathryn Roeder here for this talk. She's an outstanding statistician and a very accomplished researcher in this field. After her PhD in statistics at Pennsylvania State University, she was a professor at Yale and then moved to Carnegie Mellon University, where she has been a full professor since 1998. I remember that I first heard about your work when I was at the Broad Institute and was interested in the interface between machine learning and the life sciences, and people said, look at Professor Roeder's work, it's outstanding, and I read your papers back then. I'm now really happy to finally meet you and host you for a talk, almost 10 years later. Over these 10 years, even though we hadn't met yet, I realized that the high opinion the colleagues at the Broad Institute had was shared by the whole community. Kathryn has been elected to the most prestigious societies: she's a fellow of the Institute of Mathematical Statistics, of the American Statistical Association, the National Academy of Sciences, and the American Association for the Advancement of Science, and she has won numerous awards from the Committee of Presidents of Statistical Societies. All of these reflect the high standing of her work. We are very excited to have you here and to learn more about selective inference in GWAS. Welcome, and we're looking forward to your talk.

Thank you very much. You know, when we started emailing, I thought, now I know that name, that's an unusual name, and I'm glad you reminded me of when we had emailed back when you were at the Broad. I'm glad to get to meet you, at least on Zoom. So thank you for inviting me, and I look forward to sharing my work with you, this afternoon for you and this morning for me.
I want to talk about how to improve power by using extra information. There's so much multi-omic data now that we can bring extra information into various tests. We started off with GWAS, not because we were most interested in GWAS, but because there was a lot of data and we felt we could really test things out there. Ultimately we want to push these methods further; we want to try them on whole-genome sequencing for rare variants, but we're not finished with that yet, so I'm going to share with you the fairly successful investigation we made of GWAS. Just a very quick review: what's the idea? We've tested many, many SNPs across the genome, and we do a marginal test for every SNP to see if it's associated with the phenotype, say schizophrenia. These are the results we would get, and that's a happy result, with a lot of signal above the significance threshold, but that's not always the case, and if it's not, we want to try to use extra information to improve the power.
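As a concrete reference point for the thresholding rules compared next, here is a minimal NumPy sketch (illustrative only, not the speaker's code) of the two rules: a fixed genome-wide significance cutoff, and the Benjamini-Hochberg step-up procedure for FDR control:

```python
import numpy as np

def genomewide_reject(pvals, cutoff=5e-8):
    """Family-wise style control: reject only p-values below a fixed,
    very stringent genome-wide threshold."""
    return np.asarray(pvals) <= cutoff

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: control the expected fraction of false discoveries.
    Reject the k smallest p-values, where k is the largest index with
    p_(k) <= (k/m) * alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

On the same p-values, BH typically rejects more tests than the fixed cutoff, which is the power gain the talk refers to.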
So we can summarize those data with our p-values, one p-value for each of many, many tests. For each test the truth is either null or non-null, and we either reject or fail to reject, and we're of course worried about the false positives, where we rejected and we shouldn't have. R is the number of discoveries, the number of rejections we've made. Typically people control very carefully: they want the probability of any false positive to be less than alpha, which leads us to reject only if the p-value is less than five times ten to the negative eight. That's quite stringent, but it works well for well-powered studies like schizophrenia; you can see that even in 2014 we had a lot of discoveries. But for autism, look at the colored bars here: even in 2019, after millions and millions of dollars, we have hardly any p-values crossing that threshold. We would like to do better and learn more about autism. So we might instead control a false discovery rate: rather than guarding against any false positive, we control so that the expected fraction of false positives among the rejections is less than alpha. That increases the power quite a bit, but still not enough with autism to have very many discoveries, so we want to see if we can do more. Here are our p-values lined up, and those are the genes they're associated with; of course most SNPs wouldn't be in a gene, but we might be able to map them to a gene based on which gene they're nearby and which gene's expression they tend to drive. And then we could have X, which is what we call the side information: any kind of information that might make us more likely to reject that test. For instance, if we were studying schizophrenia, bipolar disorder is very highly correlated with schizophrenia, the same SNPs tend to drive the signal, so we could use the big
studies of bipolar disorder to increase the power of our schizophrenia test. More interesting, in a multi-omic way, we might look to see which SNPs drive which genes, because those are more likely causal SNPs. If a SNP drives a gene, in other words controls its expression, with a very high impact, a big beta slope in that relationship, then we would consider that extra information suggesting the SNP is associated. Another, more interesting kind of side information: if we looked at correlated genes based on gene expression and created communities of genes, which are called modules, then we might say that if a module has a lot of small p-values associated with it, other p-values in that module that are on the borderline of rejection should maybe be rejected too, because this is obviously a functional set of genes important to the phenotype. That's what we want to do; we want to gain that information. Now, quite a long time ago, in 2006, we wrote a paper aiming to solve this problem. It was a very simple idea where you take a weight based on the side information, and in theory you can improve the power quite a bit. That paper was very popular, but the problem was there was hardly any side information 15 years ago; multi-omics really didn't exist. Now the problem is the opposite: we have an excess of information, high-dimensional, many, many things to include. Before, we couldn't determine the weights because there wasn't much information; now there's a lot, some of it good, some of it not so good, and we can't snoop in the data to decide. So we need new methods to build on those ideas. The main idea, though, is always going to be to use the side information to be more lenient with some p-values and less lenient with others, so
that all together you spend your alpha wisely. The desired properties are these: the covariates need to be independent of the p-value, so you have to collect things like gene expression or other GWAS that are independent of your p-value; and then we need to be able to explore these high-dimensional spaces without snooping, so that we can up-weight the right features without taking too severe a penalty. We must take a penalty for snooping, but we want a minimal penalty that still allows greater power while maintaining false discovery control. So this is the idea, and this figure is extremely important to understand. On the vertical axis is minus log 10 p-value, so the bigger the value, the more significant the test, and every point is the p-value for a particular SNP. On the horizontal axis is possible side information, a scalar side information x. The idea is that if you did the usual Bonferroni or Benjamini-Hochberg FDR control, you would reject any of the tests above the dashed line. You can see that x is obviously quite informative: the bigger x gets, the more small p-values we have, a lot more up there with significant results, and hardly any when x is small; there are no small p-values there. So we should use that and draw that curved blue line. We can tilt and reject more on the right, where there are a lot of points near the BH threshold, tilting the line down a little, at the expense of not rejecting on the left, where x is small and nothing interesting is going on. If we could learn that blue line, we could have a much more powerful test. Now I'm showing exactly the same thing, but I've changed the vertical axis to the zero-to-one p-values, so all of the significant ones are the blue points down there. Actually this is a
simulation, so all of the blue points are the actual signals, the truth, and you can see a lot of them have small p-values, but they only tend to have small p-values when x is big. We want to be able to find a lot of those, and this is the test we're going to use. Oh yes, I forgot to acknowledge that this is work due to Lei and Fithian; it's called the AdaPT procedure, for adaptive p-value thresholding. I don't want to take credit for this work, which is theirs; my work, the work of my team, is bringing it to fruition in genetics. Okay, let me explain how the AdaPT algorithm works. This is the same figure as before, but now around the p-value of 0.5 I've drawn a boundary: a blue boundary just below, at 0.45, and a red boundary that's its mirror image, at 0.55, and I'm going to reveal the p-values in between, the black ones. Now this is kind of extreme; I could make that much lower, more like 0.05, but let's just suppose we started off with this huge rejection region: any p-value less than 0.45 we consider rejecting. The point is that the mirror image, the red points, are pretty much all following the null hypothesis. So we're going to adjust the blue line, the mirror-image one will follow naturally, and we'll know we're ready to quit when the estimated false discovery proportion based on the red points is less than our threshold: we count how many points are above the red line and keep adjusting until that estimate is less than, say, 0.05. It's a clever idea. We're going to adjust the line using mostly the black points, but we'll also use the other data cleverly. Okay, so this is what it looks like. I'll first show you what the algorithm does, and then I'll show you how it does it.
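That counting rule, mirror points above the red line versus candidate rejections below the blue one, can be sketched with a scalar threshold. This is a toy version under simplifying assumptions: real AdaPT shrinks a covariate-dependent threshold s(x) chosen by a fitted model, while here s is just a number and the function names are my own:

```python
import numpy as np

def fdp_hat(pvals, s):
    """AdaPT-style estimate of the false discovery proportion for the
    candidate rejection region {p <= s}: the mirror region {p >= 1 - s}
    contains (approximately) only nulls, so its count, plus 1, estimates
    the number of false rejections."""
    p = np.asarray(pvals)
    n_reject = max(1, int(np.sum(p <= s)))
    n_mirror = int(np.sum(p >= 1 - s))
    return (1 + n_mirror) / n_reject

def shrink_until_controlled(pvals, alpha=0.05, s0=0.45, step=0.001):
    """Shrink the threshold until the estimated FDP drops below alpha,
    mimicking the stopping rule described in the talk."""
    s = s0
    while s > 0 and fdp_hat(pvals, s) > alpha:
        s -= step
    return s
```

Starting from the generous region p <= 0.45, the estimate is far above alpha, and the loop walks the boundary down until the red count is small relative to the rejections.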
So we start there and say, well, that's pretty terrible, so I start rejecting a little less, and I reject less by pulling the line in over those low x's, which aren't very promising, and then a little more, and a little more, and more and more, and I'm getting close to having the fraction in the red less than 0.05, and a little more, and I'm done. Now I have found the rejection region where I'm mostly rejecting for big x's and small p-values, and I've learned that from the red points. And now I've just flipped it back: that's exactly the same figure, except shown with the axis being minus log 10 p-value, which exaggerates the small values, and you can see I found that curved blue line. Okay, so how do we learn that line? Of course it can live in very high dimensions, but I can show it best with a scalar. We can't cheat; we can't break the symmetry between the red and the blue regions. We can never look directly at the p-values in the red and blue regions, but we can mask the data so that it's legal. Every p-value in the black region we can see: we see the p-value and the side information x associated with it. For the ones in the red and blue regions, we see something partial: we're told the value is either p or 1 minus p, and we're not told which. That clever masking lets us get a lot of information without actually cheating, because a very small masked value is either really significant or really not significant, and we can use that to help decide where to draw the line at each step. At each step we reveal more black points, and we can never backtrack, because once they're in the black region we've seen them and we can't go back. Okay, so now the point is that we have to model the local false discovery rate. At each step we're going to have a
probability model that says, what's the probability that this point is actually null, given the side information and given the masked values, the p or 1 minus p. At each step we remove the SNP that has the highest estimated probability of being null: it's currently rejected, but it's the one least likely to be associated. To do that we need to build two models: one for the probability of a hypothesis being non-null based on the side information, and another that says, given that it's non-null, how strong is the signal, what do we think the p-value magnitude looks like for that x. Why? We want to compute this conditional local fdr, and to do that we have to have a model for the p-values. Our model is a mixture model: pi one, as a function of x, is the fraction that are non-null, so a fraction one minus pi one of the p-values at that level of x are null and follow the uniform distribution, and the remaining fraction follows a distribution under which the p-values tend to be small. The easiest choice is a beta distribution with its mass piled up near zero, at very small p-values. That's a natural choice, and it has one parameter, mu, describing how big the signal is, as a function of the side information. If we estimate those two things, the pi one and the mu, which are what that calculation needs, we can get this local fdr. And how do we do it? Very cleverly, with the EM algorithm. We duplicate the data, so every masked p-value appears in our data twice, once as p and once as 1 minus p, and we don't know which one is true. Then the EM algorithm imputes weights, which allows us to up-weight the right one and down-weight the wrong one when there's enough
information for it; when there's not much information, the weights are about half and half. This lets us figure out which is the least promising SNP; we remove it, we refit, putting it in the not-rejected set, and we keep making our rejected set smaller and smaller, as I showed in the animation. Okay, so what are the assumptions? This is really very important; it's surprising how often this assumption isn't met, so it's the first thing you have to check if you're going to try this algorithm. The mirror: we're relying very heavily on mirror-conservative p-values, that is, we're using the p-values near one in order to figure out whether we're controlled or not, so under the null hypothesis the fraction of p-values greater than one minus t has to be at least as large as the fraction less than t. Here's an example that's not mirror-conservative: we're missing some p-values up there near one, and that makes a huge difference; we'll make a lot of false positives. How would this happen? The test isn't properly calibrated: we said it followed a chi-square with one degree of freedom, but it didn't, so we didn't get enough p-values near one, and then when we try to estimate our false positive rate we can't use that mirror. Now, it's legal if there's a pile-up of p-values near one, but that means very low power, which is bad too, because you get a test that doesn't perform well. So we need tests that are well calibrated under the null, uniformly distributed. Okay, so at any rate, we were very excited back when we were working on this, about using AdaPT; it was perfect for our genetic problem because we had a lot of information we wanted to exploit. And in our literature search we found this paper from Korthauer et al., a practical guide to controlling false discoveries in computational biology, and we thought, wow, they scooped us, they already
did it. But then they said, oh, AdaPT is terrible, it doesn't work: weakly informative covariates degrade the performance, it requires careful specification of the functional relationships, and they often got no discoveries at all. This seemed bad, but we went ahead and looked anyway to see if we could do better. So here's the story. You have p-values, and you have covariates, which are your features or side information; those go into the AdaPT machine, which knows what to do with them, but you must model these quantities: the non-null probability as a function of the covariates, and the strength of the signal as a function of the covariates. You have to have a model, and of course it has to be a good model. When Korthauer et al. said that AdaPT suffers when those models are bad, well, those are the classic modeling problems; any method suffers when its models are bad, and our job is to make good models. It turns out that the off-the-shelf version of AdaPT had the most naive statistical input, just generalized linear models, which would lead people to fit their covariates additively and perhaps approximately linearly. That's not a very sophisticated way of handling a high-dimensional setting where many of the covariates are not very good, and that's probably why they didn't find anything. So our job was to solve the right-hand side of this diagram, the modeling procedures. We thought, well, we have a very high-dimensional problem, and whenever there are prediction competitions, who always wins? Gradient boosted trees. So we tried gradient boosted trees. The idea of boosting is that you fit a very simple model, but over many, many steps. In the first step maybe you get a tree with a path that's very predictive for what we're modeling, but we've still made a lot of errors, so we add in another small tree, and then we
see where we're still making errors, and add more and more trees. Eventually we have many, many simple models that add together and can capture very complex non-linear relationships and interactions, and this kind of model is very good for high dimensions with some uninformative covariates, so that's what we went with. Remember, we wanted to model the non-null probability and the strength of the signal, so we used a logistic model for the first and a gamma model for the second, and we did all this using the XGBoost software, which I think is really marvelous for this kind of problem. So we did that, and my graduate student Ronald Yurko then suffered for months and found out that many, many details matter: the parameter space for cross-validation, the frequency of cross-validation (it's very expensive to keep running the EM algorithm in this huge space), and even where you start your line, whether at 0.45 or more towards a sensible 0.05 for the starting rejection region. He worked out many of these details, and you can find them in the GitHub repository for the software. Okay, so how did it work out? Remember, we're actually interested in autism, but we're going to learn on schizophrenia, because a lot is known about schizophrenia, so we can use the 2014 data and then validate with newer data. Since bipolar disorder is very highly correlated genetically, we're using a bipolar GWAS as one piece of our side information. We're also going to use gene expression, and what kind of gene expression? For schizophrenia it seems pretty clear we should use brain tissue, so we use the BrainVar data for our gene expression information. The way we want to use that brain expression is to figure out the eQTLs. What does that mean? The SNPs whose genotype, as it varies, is highly correlated with the expression of a gene. So
we want to use that: what we're going to use is the magnitude, in absolute value, of the eQTL slope, and that will be one of our bits of side information. A SNP that drives a gene, and drives it with a lot of signal, is not necessarily causal, but likely causal. The other thing we're going to do is take that gene expression in BrainVar; BrainVar is brain expression across developmental ages, from fetal up to teenagers, so it seems like the right range of data, valuable data, for schizophrenia risk. In those data we assume the correlation structure is approximately block diagonal and find the communities of genes, using the great software from Zhang and Horvath called WGCNA, which uses colors to label the different gene communities. So we have, say, a yellow community and a turquoise community, and those are the correlated genes, and we're going to use that as side information too. SNPs get assigned to modules because they drive a gene, and that gene sits in a correlated set of genes, so that SNP would be said to go with, say, the turquoise module. Okay, finally, here's our data, and this is the QQ plot for the SNPs. You can see there's definitely signal in these data if we look at all SNPs, the blue points, but we're going to restrict ourselves to only the SNPs that are eSNPs, and you can see there's a lot more signal in those eSNPs. Also, our SNPs don't have to be independent, but we don't want them to be highly correlated, or FDR won't work very well, so we're only testing the eSNPs. So that's the input data, the bipolar GWAS and the gene expression from BrainVar, and we're going to test these 25,000 SNPs. Okay, so what do we have to do? We have the GWAS summary statistics, we select out the eSNPs, we join all that information, and now we have x,
the high-dimensional side information, and the p-values. We pop that into the AdaPT machine and start the algorithm with a generous cutoff: we're going to consider rejecting every p-value less than 0.05. Then we look at that and ask, is the estimated false discovery proportion less than 0.05, say, or alpha? No, of course it isn't. So we use the EM algorithm to discover the worst SNP that we rejected, we throw it out, we update our line, and we do this again and again; it converges in a reasonable amount of time. When we finally get a new rejection set, we check the mirror estimate to see whether we've controlled the false positive rate, and when we have, we return our rejection set and look at what we got. We want to compare this across various choices of side information, to see when we get the most discoveries. We start our search by only considering the p-values less than 0.05; if you remember, we had about 25,000 SNPs, and right away we reveal most of them, so our black region reveals the majority, 21,000 of the SNPs. That way our algorithm works very well, because remember, the EM algorithm has to have something to get going, and we can then up-weight and down-weight based on the features until we get a more powerful test. So how much more powerful is it? First, overall: if we don't use any covariates, AdaPT with intercept only is basically just an FDR-controlling procedure, and you can see there's of course the big strong signal on chromosome 6 in the HLA region, and a number of other signals. But once we put our covariates in, we get a lot more signals, and the signals are not all piled up in the same LD region; they're spread out over the genome
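The model being refit inside that loop is the two-groups mixture described earlier: a fraction 1 - pi1(x) of p-values are uniform nulls, and a fraction pi1(x) pile up near zero. Here is a minimal sketch under simplifying assumptions, with a Beta(a, 1) alternative standing in for the signal component and fixed scalar parameters (the actual method fits pi1 and the signal strength with XGBoost as functions of x; the names here are illustrative):

```python
import numpy as np

def local_fdr(p, pi1, a):
    """Local false discovery rate P(null | p) under a two-groups model:
    null p-values are Uniform(0,1); non-nulls follow Beta(a,1) with
    density a * p**(a-1), which concentrates near 0 when a < 1."""
    p = np.asarray(p, dtype=float)
    f = (1 - pi1) * 1.0 + pi1 * a * p ** (a - 1)  # mixture density
    return (1 - pi1) / f

def mask_weight(p, pi1, a):
    """E-step weight for a masked p-value: the probability that the true
    value is p rather than its mirror 1 - p, given the current model."""
    dens = lambda q: (1 - pi1) + pi1 * a * q ** (a - 1)
    return dens(p) / (dens(p) + dens(1 - p))
```

The SNP with the highest estimated local fdr is the next one moved out of the rejection set, and the masked pairs enter the EM fit with these weights rather than their hidden true values.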
and so let's see how it went; look at the bottom of these bars. If we use the intercept only, we have 361 discoveries. If we take a very simple model and only put in the bipolar GWAS, we get only three more. If we add in the eQTL slope, we get another 110, but not that much more beyond that. If we throw in the gene modules, we get another increment, and even the gene modules by themselves get us most of the way, as long as we allow for complicated interactions between those modules, up to 603, noticeably more than we started with. But with the full power of XGBoost, many simple trees adding together, with interactions, we get all the way up to 843 discoveries. So we found a lot more discoveries, and even if you count them in terms of correlated sets, we get a lot more independent regions. It worked very well, and we published that in PNAS in 2020. Now I want to show you a little about how we looked at this after the fact, because I know most of you are in machine learning. You can analyze your trees based on the relative importance of the variables, and remember, the tricky part is that we keep re-running the AdaPT model again and again, changing our curve at each step, so a variable can be very important in the early steps and somewhat less important in the later steps. You can see that bipolar, even though it hardly helped by itself, in conjunction with the other covariates is by far the most important variable. Another very interesting one is this gray line, a variable that was very unimportant in the beginning but becomes quite important towards the final steps, and that is the WGCNA gray module, which seems quite weird, because if you know WGCNA, the gray module is the module of
genes that are uncorrelated with all the other genes. The way this plays a role in our fitting is that any SNP mapping into the gray module gets down-weighted relative to the others, and that helps us a lot in the later steps when we're really refining things. The salmon module was important early on and then becomes less important; let me show you why. If you look at the SNPs that map to the salmon module, the bottom histogram shows a lot of small p-values, while the SNPs that don't map to the salmon module are closer to uniformly distributed, so we up-weight the SNPs in the salmon module and find them early on. Okay, I'm wrapping up on that subject, but the point is that we did find a lot of extra SNPs because of the side information, and we didn't worry too much about our SNPs being somewhat correlated due to LD, because AdaPT maintains FDR with this amount of correlation. Still, correlation really is an issue when we're testing, so we want to move on. We might prefer a gene-level analysis, because we believe we can put the correlated SNPs together and draw conclusions about genes; this might boost our power, because we have fewer hypotheses to test, and it certainly improves interpretability. So that was the next thing we tried: we want to combine the p-values and do a gene test, but we have to account for the LD within a gene. We looked in the literature, and there's a method called MAGMA. It's a really popular method; even though it's relatively new to the scene, it has 1,200 Google Scholar citations, probably more every minute. MAGMA computes the gene-level p-values using Fisher's combination test and a correction for correlation, an old correction from 1975. But there's a little problem with that correction: it's a correction for one-sided tests of
significance. So wait a minute: SNP tests are two-sided; will that be a problem? Well, we ran their software and looked at our histogram of p-values, this is for autism, and we were happy to see that big spike near zero, but there's a very serious problem: the p-values are not mirror-conservative; there are p-values missing over on the right-hand side. While the MAGMA authors aimed to fix this one-sided versus two-sided problem, they managed to fix it mostly on the left: the small p-values are pretty close to correct, but they failed to fix that dearth of big p-values. Most people think that doesn't matter, but it really matters; FDR needs uniform null p-values. Some other authors in 2020 used MAGMA with an FDR procedure, and when we corrected MAGMA, fixing that uniform distribution, it was shocking how many false discoveries they had. Look down there for schizophrenia: the orange genes are ones we also found, and the gray ones didn't replicate when we fixed the p-values. For schizophrenia about 13 percent of them were false, but if you roll back to autism, more than half of them were false. The reason is that the lower the power, the more important the distributional assumptions, and autism and ADHD are the lowest-powered of the psychiatric disorders, while schizophrenia is better powered. So that mirror-conservative property really matters, and we did fix it. Then, once we started worrying about correlation, and I need to wrap up soon, we thought: what if, say, p3 is the causal SNP? It's going to be correlated with p4, which is in the neighboring gene, and that's going to cause problems too, because we should reject
gene one, but now we're also going to reject gene two due to that bleeding of LD. That's a problem, because people will try to interpret that gene; it's worse than with SNPs, because people don't interpret SNPs, but they really do interpret genes. So what we recommended is that you find the genes that have a lot of bleed-over correlation and analyze them as a unit. I'm running out of time here, so you can look at our paper for that. (From the host: we are flexible, plus or minus five minutes.) Oh, okay, I'll slow down a little bit. What we do is this: first we take the positional SNPs and put the eSNPs into the genes, then we merge loci that have a lot of correlation, so we merge genes A and B there and leave gene C separate. Now we have loci, some of which are one gene and some of which are two or several genes, and we analyze those; we can control the FDR very well there. After the fact, we zoom in on the multi-gene loci and try to figure out which is the causal gene, although that's still just an exploratory analysis. But at least we never call out a gene that's merely correlated as an independent discovery. Now, this figure, which my student made, is a little too complicated; just ignore everything except the top three bars for the three phenotypes. The black bar is just Benjamini-Hochberg, the orange bar is AdaPT with no covariates, and the blue bar is when you put in some side information, because our goal is to have side information help. Now we get a lot of discoveries for autism, more like 125 genes, and the side information helped us a lot. Some of the side information was gene expression; we also used schizophrenia, which is correlated, and educational attainment. Surprisingly enough, educational attainment, in European samples, is quite heritable and very well studied, and
So we put those in, and we found more for autism. What was also interesting is that the side information helped a lot for a low-powered test, some for a moderately powered test like schizophrenia, and hardly at all for educational attainment, because that one is very highly powered: it's an easy phenotype to measure, so there's a lot of data. Side information is best when you need it and not so useful when you don't. My student also made an application for viewing where the signal is, and I'll let you look at that on your own. For the last step of his PhD we're moving from common variants to rare variants, and it's going to be really hard, but there is a lot of side information we can pull in, so I'm hoping that by next year we'll have a paper trying to interpret whole-genome signals. Let me summarize, then. We wanted to bridge the gap with selective inference methods: there are such beautiful machine learning methods in that area, and almost none of them are applied to real data. What we tried to do was build practical applications in genetic studies so that we could actually achieve the promised increases in power, and we feel that we did do that. We have these three papers; the one in PNAS is the main one I talked to you about, about the SNPs, and then we have a couple of papers about moving to gene-level tests. What we learned from all of this is that it is possible to greatly improve power by incorporating multi-omic information, but it has to be done with a great deal of care, not just to avoid false positives: if you don't do it with care, you don't get any improvements either. Good modeling is the secret to good results. I'll stop at that point; thank you. And we thank you, and we send you a round of virtual applause here on Zoom. That was fascinating and an extremely important topic; it also unified
several topics that were mentioned in the earlier talks today, like multi-omics and making discoveries through data integration. This was extremely exciting. Are there other questions from the Zoom audience? Yes, there is one: Lucas Miranda, one of the doctoral students in the network, will go first. Thank you, Garcin. Thank you very much, Catherine, for such an interesting talk. I have two questions, if I may. One, just out of ignorance, and it's something you may have mentioned during the talk, but I either didn't pay enough attention or didn't understand: when you compare different methods to threshold p-values, for example, and you say that one yields more discoveries than the other, how do you assess and compare true positives with false positives? Okay, that's a bear. Do you want to ask your second question and make it a unified question? The second question has nothing to do with it. Oh, okay, then hold that one until I answer this one. We don't know what the truth is, so we can't know that those results are definitely true. We did that analysis on schizophrenia with 2014 data, and there have been publications since then, and we got the expected level of replication: the new data supported the findings we had, and we felt happy with that. But that's all we can know; we can't know for sure whether they were true. Likewise, when we did our gene test, we found that the genes were enriched in the sets we expected, but you never know for sure. We believe it's true, but we don't know. Thank you. Second question? Okay. From my limited understanding of the psychiatric diseases you mentioned, such as schizophrenia, there is a lot of hypothesized heterogeneity behind them: people say that many of these diseases are defined based on symptoms, but there might be different biological entities behind
them. Do you have an opinion on whether this can have a high impact on the GWAS studies, given that the labels we have may represent different genetic entities? Yes, that is a very good question, and I commend you for asking it, because I've never given a talk where somebody didn't ask that question. The thing is, take autism, the one people say this about the most: different people with autism vary a lot; some are very high functioning and some can barely talk. With these differences, you would think it would make sense to partition your analysis by these different categories. That never works. I think the truth is that people have multiple hits: having autism is really kind of like being tall or being short; there are many, many different SNPs acting on it. So the severity and the slight differences don't take away from the fact that they do have the same problem: the brain is not wired quite the same way. There are the big hits that give you a very, very low IQ; set those aside. The ones that cause social challenges: some of us have more or worse social challenges, but they're still social challenges. But when people partition the data, they find nothing; they go from seemingly meaningful to nothing. Thank you. Then another question, by Julia. Hello, thank you so much for your talk; it was very interesting. My question is very much related to the first question from Lucas. I was interested in knowing, when you have results for natural phenotypes as you mentioned, what you are using for analyzing those results and for understanding whether those results are actually meaningful or not. Okay, that's a very good question; I wish I had a much better answer. The first thing that we
did was to look at the newest data set, which at that time was from 2018. The data is all pooled together, but we were able to statistically work backwards and figure out the strength of the test from the new data by subtraction. So we have the second roughly 25 percent of the data that has come in recently. That's not well powered enough to replicate these findings, but at least we can see whether we get, say, a 0.05 level of significance in the SNPs that we found. Even then you don't always, because these things are all right on the border of significance, so we did a simulation to see how many you would expect to replicate if everything you had found was true, and we got what we'd expect. You can look in the supplement of the PNAS paper to see how we did that, because it was tricky. We were able to replicate as many as we thought we would if our findings were all true, so we felt that was a good replication study, though not perfect. Beyond that, we do the same things everybody does: we look to see whether the results are meaningful, whether you get a lot of results that fall in certain pathways. We didn't, but neither does anybody else, because schizophrenia is broad. It's not like type 1 diabetes, where you know it's the pancreas. I study only brain things, and the brain is involved in almost everything: almost every gene is brain-expressed, and a lot of little things can go wrong that can cause you to be mentally ill. It's a miracle that we're mostly healthy, given how many little things can happen. So we were not able to validate our work in schizophrenia using those kinds of enrichment results, but we did analyze type 1 diabetes, where we knew what to expect, and it came out perfectly: we got good enrichment of exactly the things that we all know cause type 1 diabetes.
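The replication check described above, how many borderline hits you would expect to reach nominal significance in a smaller replication sample if every finding were true, can be roughed out like this. The z-scores and the 25 percent fraction are hypothetical, and a real analysis (as in the PNAS supplement) would be more careful, for instance about winner's curse:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

Z_ALPHA = 1.645  # one-sided critical value for p < 0.05

def expected_replications(z_orig, frac_new):
    """Expected number of discoveries reaching one-sided p < 0.05 in a
    replication sample `frac_new` times the size of the original,
    naively treating each original z-score as the true effect."""
    total = 0.0
    for z in z_orig:
        z_new = z * math.sqrt(frac_new)             # z scales with sqrt(sample size)
        total += 1.0 - normal_cdf(Z_ALPHA - z_new)  # power for this hit
    return total

# Hypothetical z-scores of borderline genome-wide-significant hits:
z_hits = [5.5, 5.6, 5.8, 6.0, 6.5]
print(expected_replications(z_hits, 0.25))  # expected replications out of 5
```

If the observed number of nominal replications is close to this expectation, the original findings are consistent with all being true, even though no single hit is individually confirmed.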
So that is a good way to check for phenotypes that you understand better, like the lipid phenotypes; it's not so good for the brain ones. Does that answer it? Yes, thank you so much for this answer, and I will definitely have a look at the paper. Excellent. Okay, are there further questions? If not, I would have a question. If I recall correctly, you mentioned in the beginning that the covariate information should be independent of the information that you are testing, like the genotypic information. How can you guarantee that this is the case, or what would happen if it is not fully independent? Can you detect this empirically? How can this affect your approach? Yes, absolutely, I have thought about that. Of course the covariates are correlated with the test statistics in the sense you want: you want them to share the underlying signal, but you want them to be independent in terms of their measurement errors. So you wouldn't want to use the same subjects. If you had a GWAS on some subjects and you measured, let's say, educational attainment on those same subjects and also whether they had autism or not, that would be wrong, because any measurement errors would be shared, causing the same small p-values, and that would be unacceptable. But when I used the bipolar data, those were different subjects, and we believe the underlying signal was the same. That's good; that's the kind of correlation you want, not a spurious correlation where the measurement errors are related. Thank you for that. So we've come to the end of this one-hour presentation. Catherine, like all our external speakers, has kindly agreed to also meet the doctoral students of our network for half an hour and to give them advice and
feedback on career and research issues. That's wonderful, so we also thank you for that, and for the excellent presentation you have given here; it was very thought-provoking and an excellent talk. That was great, and thanks a lot. The PIs and the YouTube audience say goodbye to you now. Thank you very much, Catherine, and the doctoral students will take you to a breakout room. Thank you, thank you for having me. You're welcome, and we'll be back here in the general audience in half an hour; Christoph Fisher will present at 4:30.