Right, so thank you for inviting me, and you know, when I heard that Guido got the Nobel Prize I was like, yeah, no pressure, I'll do such a good keynote there will be no comparison. So hi, I'm Sara Magliacane, I'm from the University of Amsterdam and the MIT-IBM Watson AI Lab, and today I wanted to talk to you about causality-inspired machine learning, as we call it. In general this is about the ideas from causality that we can use to improve machine learning, in particular its robustness and its safety. I think I'm preaching to the choir here, so we all agree causality is super important, and there are many questions that do require causal knowledge. But a maybe slightly less represented view is that causality also allows us to reason systematically about distribution shifts, and this is something that, for example, Carlos was mentioning, and that also came up a bit in the discussion about model drift. Obviously I'm not the only one thinking this: there are several papers, and this is just a collage of the ones that I found and have read, and there are many, many more, and many more are coming, so it's a very hot area. As far as I know this started at least in ICML 2012, so the connections between causality and improving the robustness of machine learning have been known for a while. And maybe to drive the point home, this is not a purely academic interest: a lot of the papers I put here, not by chance, are also from companies, from Google, IBM, Microsoft, Amazon and Facebook, so how to improve robustness is clearly also interesting for tech.

First, maybe I'll give you a very short example, since it's just after dinner, or breakfast or lunch depending on where you are, of why it's important that machine learning algorithms are robust to distribution shift. Maybe you already know this story; it's a story about a horse called Clever Hans. At the beginning of the century in Germany there was this horse, and the idea was that the owner would ask the horse a mathematical question, like what is one plus two, and the horse would tap its hoof three times. Everybody was thinking maybe it's a scam, so there was a proper investigation by researchers, and you can read about it sometime, but one of the findings was that essentially the horse was able to get cues from the person asking the questions. So if the person asked, what is five minus two, the horse would tap once, tap twice, tap three times, and when the person reacted, the horse would stop. So the horse would give the correct answer, but not for the correct reason. This is the Clever Hans effect: you can have an algorithm that is able to produce the correct answer, but that doesn't mean it's actually learning anything; it's just able to essentially game the system by finding some heuristics, or, as it's put in other places, it's right for the wrong reasons. Maybe to give an example in computer vision, there is this slightly older dataset from 2016, which was the first time I heard about this Clever Hans effect, and it's a dataset on visual question answering. You have these images and then some questions, so for example the question is, what color is the jacket in the picture; there's a jacket here, the candidate colors are red and blue, yellow, black, orange, and it's red and blue if you can see the image. So this is the task of visual question answering, and on this older dataset, pre-2016, the authors here found that a lot of the baselines were actually doing a decent job, let's say.
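A Clever Hans style baseline for a task like this is easy to simulate. Below is a toy sketch, with invented data and numbers (not from the actual dataset): a "model" that never looks at the image and just answers from per-question-type priors can look deceptively competent on a biased split.

```python
from collections import Counter, defaultdict

# Invented toy training data: (question, answer) pairs, no images at all.
train = [
    ("what color is the grass", "green"),
    ("what color is the tree", "green"),
    ("what color is the jacket", "green"),
    ("what time of day is it", "daytime"),
    ("what time of day was this taken", "daytime"),
    ("is there a person", "yes"),
]

# "Training": record the most common answer per question type
# (here crudely keyed by the first two words of the question).
prior = defaultdict(Counter)
for q, a in train:
    prior[tuple(q.split()[:2])][a] += 1

def answer_only_baseline(question):
    """Ignore the image entirely; answer from the question-type prior."""
    counts = prior.get(tuple(question.split()[:2]))
    return counts.most_common(1)[0][0] if counts else "yes"

# On a biased test split that shares the same priors, this looks fine...
biased_test = [("what color is the plant", "green"),
               ("what time of day is it", "daytime")]
acc = sum(answer_only_baseline(q) == a for q, a in biased_test) / len(biased_test)
print(acc)  # 1.0 on the biased split, with zero visual understanding
```

On a less biased split, where jackets are red and pictures are taken at night, the same baseline collapses, which is exactly the failure the newer, carefully de-biased datasets were designed to expose.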
But if you train just an MLP that throws away the image and uses essentially only the question, you would get similar results to the state of the art. The idea was that the dataset was quite biased: most of the images were taken by day, so if you had a question like "what time of day is it", "daytime" was a good answer, or in general for color questions there was a lot of grass and a lot of trees, so "green" was a good answer. Essentially this is what the visual question answering systems were often learning, these kinds of simple heuristics. And the way this was addressed in visual question answering was that people introduced new datasets which tried to diagnose and be very careful about dataset bias.

And this is not just computer vision and visual question answering; in another domain, NLP, there is this task called natural language inference. The idea is that you have a sentence, the premise, for example "the doctor was visited by the judge", and then you have a hypothesis, "the judge visited the doctor". These sentences are equivalent, so you would say in this case there is an entailment from the premise to the hypothesis. If you train BERT on MNLI, the standard dataset people use for natural language inference, you get very good results. But then the authors of this paper, from ACL 2019, so much more recent, found that there are a few heuristics you can think of. For example, you could just learn the overlap of the words: "the doctor" appears in both sentences, "visited" too, "the judge" too, so a heuristic that just says, if two sentences have the same words or a high overlap then they're entailed, works pretty well, because most of the sentences with a high overlap actually are entailed. So what they did is they created tricky sentences like this one here, which are a bit tricky even for humans: "the doctor was visited by the judge" and "the doctor visited the judge". The first one is in the passive voice and the second in the active voice, so they're not entailed. They built a new dataset that they called HANS, as you can imagine, and in this dataset half of the sentences were the simple ones and half were these tricky versions. What happened is that on the simple sentences, where the heuristics happen to be right, the machine learning methods did fine, but on the tricky ones the results were actually quite bad. This leads us to think that BERT and the other state-of-the-art methods in NLP were actually just picking up on this simple overlap heuristic, or some other similarly simple heuristics. The solution the authors proposed is data augmentation: if you know which heuristics the model exploits, you can train the model so that it does not rely on them and is able to deal with these tricky sentences. But the question is, what happens if you don't know which heuristics the model is using?

Going back to causality, I really like to think that causality is a way to do this kind of reasoning systematically, and especially, maybe my biases are showing here, through causal graphs. I was actually looking at all of these papers, and you can imagine I chose them because they are close to the papers I like, and they do all tend to have slightly different graphs, but graphs nevertheless. My hope is that at the end of this talk
you will be able to at least understand how people created these graphs and how they're using them. And especially, and I think this is very interesting, because Carlos already talked about transportability and generalization and so on: one interesting point is that even when you don't know the graph, you don't have the selection diagram, you don't know everything, you can sometimes still reason about distribution shift through causal reasoning. And even more: even if you have missing data, and this is the paper we did in 2018, so even when you cannot reconstruct the equivalence class of graphs, the Markov equivalence class, because some data are missing and some conditional independence tests you cannot run, you can still do something about it. This is what I hope you will be convinced of by the end of the talk.

I'll start with an example, and this is a bit back to basics, because after seeing the talks, especially the one from Carlos, I think you guys are much more advanced than I thought, but let's go back to the basics anyway. This is an example of a structural causal model, and I chose an example in which I have a log: I log how many calories I eat every day, how many calories I burn exercising every day, my basal metabolic rate, and my weight loss, because I'm trying to lose weight, so I want to see how I'm losing weight. There is a rule of thumb you may have heard: if you have a deficit of calories and you divide it by 7700, which is supposedly not an arbitrary number, then that is more or less the number of kilos of fat you will lose; that's why there is a tilde there, the "more or less". So we can write that formula, something like wl ≈ (ec + bmr − fc) / 7700, and we can just call it weight loss, wl. So wl is a function of the food calories fc, the exercise calories ec, and the bmr, the basal metabolic rate, which is the calories you spend just being alive. And as I said, it's "more or less", so maybe we can also add some probabilities, and here I'm trying to show you why structural causal models are so nice, because you can add these kinds of exogenous variables that do exactly that. Essentially this allows you to model some noise in this formula, or some kind of personal factors, or any kind of probabilistic version of this rule of thumb. I'm drawing this u_wl as an unobserved variable, lightly shaded and with this dotted outline around it. So we can write a structural causal model, and I think you guys are beyond this, but just to give you the whole story, we can write it like this, and we can also have exogenous variables for the food calories, the exercise calories and the bmr. Maybe there is a certain distribution, I don't always eat 1200 calories every day, so I can model that distribution with this unobserved exogenous variable u_fc; also for exercise things are quite stochastic, so I can model the stochasticity through u_ec; and the bmr is based on your weight, so it should slowly decrease, so even these kinds of distributions can be modeled here. I usually write these variables out explicitly, but for simplicity I will not draw them, and that's the convention, we generally don't draw them. Okay, and then I can also add another variable, for example my motivation m to actually pursue weight loss. So m is a function of my weight loss; let's say that when I'm losing weight I get more motivated, so there is a function, let's say f_m,
which is a function of just the weight loss. Essentially, in the graph we are drawing, on the left-hand side we have the variables and on the right-hand side we have the parents of the variables, and we connect the parents with an edge; this is the general idea. We can also model soft interventions, and I'll start with this more general version. A soft intervention, for example a soft intervention on m, is just the idea that I replace the equation of m with another equation, so I use another function. Maybe, because I read more about it, I know that my motivation to lose weight and my actual weight loss used to be more connected, more correlated, but now that I've read more the correlation is slightly weaker, so I can change my function here; this is a soft intervention. Another kind of intervention is a perfect intervention, and in this case, intervening on m means I will cut the edges incoming from weight loss to m, because now, for example, I suddenly decide that my motivation is fixed regardless of my weight loss. This would be a perfect intervention: a variable becomes independent of its parents, so I'm cutting the incoming edges.

So far this was all super standard, but now I want to really drive home a point, and this is a bit similar to selection diagrams, but not completely; we can discuss that in the questions if you want. I want to be able to represent with a single causal diagram three different distributions, because maybe I have three different datasets. For example, I want to represent the observational case, in which the motivation is f_m of wl; maybe I also want a soft intervention case, in which m is a different function, say f'_m of wl, after I have learned more about weight loss and the function has changed; or maybe m is f*_m, in which suddenly m is not connected, not caused anymore by the weight loss. I can represent all of this with a single graph very simply by adding this variable e; people call it environment, or domain, or regime, and this is an idea that shows up over and over again in several works. The idea is that this e can act as a switch: when e is zero I will use one function, when e is one I'll use another, when e is two I'll use yet another one. So I'll just have g, the function that determines motivation, and g will also have e as an argument. In this graph I can represent several distributions: observational, soft interventions, and, let's say, perfect interventions. In general I can also represent conceptually different observational distributions. For example, I have my log, but maybe I have a friend who also has a log, my green friend, and then I have a purple friend, and my purple friend also has another log, and we all have slightly different habits in how we eat and how we exercise, different bmr, and the relationship between our weight loss and our motivation is different. So what happens? I just add the environment variable, it will be causing all of the things that are different across us, and I can represent all of our data in a single graph with a single structural causal model. As I said, this is not my idea; it shows up repeatedly, people rediscover it, people reuse it, and it's in a lot of the papers I showed before. Some of the things we actually did provide are slightly different, and one of the reasons is that we wanted to learn essentially the selection diagrams, or things that look a bit like selection diagrams, from data. So we proposed this
joint causal inference, and in this case we also wanted to disentangle the changes in each dataset. So instead of having one environment variable, we added several context variables, as we call them. The idea is that if we have these three people, we can just add three context variables, so we add three columns. The first column is one for my data points and zero for everybody else; the second column, the second context variable, is zero for my data points and one only for my green friend; and the third context variable is one only for my purple friend. So these are three datasets that we can put together in a single table; you can do it in Excel, so why not. We can do this very simple thing, and then, with some extra assumptions, and that's why the paper is actually quite long and surprisingly complicated, we can show essentially that it's kosher to put everything together, treat the result as if it were observational data, and then learn a structure on it. This is the work we did in this joint causal inference paper. Okay, so before I had three context variables, but for simplicity we can also remove one, because we can encode people like this: my purple friend is zero zero, my green friend is zero one, and I'm one zero, so two variables are perfectly enough to encode the differences between us. The idea is that if I have this table in which I put all of the samples together, then I can essentially represent the same data with a graph, and the graph will have this e here from before that encodes all the changes, but because now I have the two variables c1 and c2, I can split the e in two parts. For example, maybe my purple friend here actually has very similar habits to me, and the only difference is that he is much more optimistic, so I will just have an edge from c1, which represents the difference associated with his dataset, to motivation; and for c2, let's say my green friend is much more similar to me in that respect, but on the other hand he may have different habits. So I can essentially disentangle the different contributions in the different datasets. And now we have a graph, a standard graph; we can use d-separation, we can use the do-calculus. That was the idea: to get back to a very simple structure that we can also learn from data.

So now we have a way to encode different datasets in a single graph, fantastic, but why should we care? First I wanted to give you an idea of the different tasks in transfer learning for which causality could be useful. The first one is, let's say, multi-source domain adaptation, the supervised version. Let's assume that for my friends I have all of the data, so they are the source domains, and for me instead I'm trying to predict my weight loss, but some of the data are missing, so my data is the target domain; it's supervised because I do have some values, but not all of them. The question is essentially how I can estimate something that predicts the weight loss from the other features. To make it a bit more formal, let's call all of the features x1, x2, x3 and x4, so these are my features as in classification, and y is my label. Now my task is to learn a model that is able to predict y well from the source domains, in a way that also transfers well to the target domain, and I can also use some of the target data. This is supervised multi-source domain adaptation. A more complicated case is unsupervised multi-source domain adaptation: in this case I actually don't
have any labels for myself, so there are no labels in the target domain; the target is completely unlabeled. In this case I want to exploit the source data to learn my estimator of y, but I also want to exploit the data I do have: I have some data on the covariates, or features, or however you want to call them, in the target domain, so I can use them to compare with the source covariates and see what has shifted and what hasn't. This is essentially one of the ideas I will talk about more later, so this is one setting, unsupervised multi-source domain adaptation. The final setting, and I think it's important to keep these distinguished because people tend to confuse them, is domain generalization. In this case we want to be able to generalize under any intervention: we do have the source domains as before, but we don't really have a target domain, there is no data at all. The question is, how can I find a predictor f hat such that it will generalize regardless of what happens in the target domain? As you can imagine, this is generally quite difficult, but one can see this as an intervention on all the other variables except the label y, and then in some cases you can use the causal parents to generalize; I'll show an example where this fails a bit later, and it's generally related to the presence of latent confounders.

Now I'll give a very quick example of unsupervised domain adaptation, which will be most of my focus, and for now I'll make a lot of simplifying assumptions. One is that there is only one source and one target domain, so c is either zero or one, simple; and we assume we know the graph, and it's this very simple graph. The question is, which features can we use: shall we use x1 to predict y, shall we use x2, or shall we use x1 and x2 together? I will actually relax these assumptions later, so we will not assume we know the graph, and we'll have more than one source and possibly also more than one target domain, but for now let's start with this one. I think this is actually very similar to the criterion in selection diagrams, the transportability criterion; I think people have been reinventing this criterion a few times too, under names like stable features, invariant features, and so on. Unfortunately we didn't know that, so in 2018 we called them separating features. The idea is that we just want features that d-separate y, the label we're trying to predict, from c, the context variable, or, in the case with multiple context variables, the context variable that represents the target domain. So we want to use d-separation, and for the people who don't know it, d-separation is a criterion that allows us, given a graph and under some assumptions, to read conditional independences off the graphical structure. I don't have the time to actually define d-separation, although I would really like to, so I hope that if you're interested you'll look it up, but I will give you a very quick intuition of how it works. Let's assume I choose x1 as a predictor of y, so I want to know if x1 is a separating feature, and this is the d-separation statement I'm looking for: I want to know whether y is d-separated from c given x1. Essentially this statement means that x1 has to block all of the paths. Just to give you a very simple example, there is a path from c through x1 to y, and when I condition on x1 like this, I can show that this path is blocked. You'll have to believe me, but to give you the intuition: if you know the weight loss, my motivation doesn't care about the BMR anymore; my motivation is fine with just the weight loss, so knowing the weight loss blocks the influence of the BMR on the motivation.
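These d-separation checks can be mechanized. Below is a minimal sketch, my own toy implementation of the standard moralized-ancestral-graph criterion; the hard-coded graph is my reading of the example (C → X1 → Y for the first path, C → X2 ← Y for the collider path), so treat it as an illustration rather than the exact graph from the slides.

```python
def ancestors(parents, targets):
    """All ancestors of `targets` (inclusive) in the DAG given as a parents dict."""
    seen, stack = set(), list(targets)
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(parents.get(v, []))
    return seen

def d_separated(parents, xs, ys, zs):
    """True iff xs and ys are d-separated given zs, via the moralization
    criterion: restrict to the ancestral subgraph of xs|ys|zs, moralize
    (marry co-parents, drop directions), delete zs, test connectivity."""
    keep = ancestors(parents, set(xs) | set(ys) | set(zs))
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, []) if p in keep]
        for p in ps:                      # undirect every edge
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):          # marry co-parents of v
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    blocked, goal = set(zs), set(ys)
    stack = [v for v in xs if v not in blocked]
    reached = set(stack)
    while stack:                          # BFS avoiding the conditioning set
        v = stack.pop()
        if v in goal:
            return False
        for w in adj[v]:
            if w not in blocked and w not in reached:
                reached.add(w); stack.append(w)
    return True

# Toy graph from the talk: C -> X1 -> Y and C -> X2 <- Y (X2 is a collider).
parents = {"X1": ["C"], "Y": ["X1"], "X2": ["C", "Y"]}
print(d_separated(parents, {"Y"}, {"C"}, {"X1"}))        # True: X1 blocks both paths
print(d_separated(parents, {"Y"}, {"C"}, {"X2"}))        # False: conditioning opens the collider
print(d_separated(parents, {"Y"}, {"C"}, {"X1", "X2"}))  # False: adding X2 spoils X1
```

This matches the verdicts discussed in the talk: {x1} is a separating set, while {x2} and {x1, x2} are not.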
And this is a similar type of path to the one that we have here. There is another path in this graph, and there are only two: this one goes from c through x2 to y, and the question is, does x1 block this path too? The answer is yes, it does, and you can think of it conceptually like this: the calories that I eat are independent of the calories that I spend exercising, regardless of the fact that they both cause weight loss. So this is, let's say, a simple example, and we can show that x1 is a separating set. Very simply, if you have this d-separation, then essentially by assumption the corresponding conditional independence holds, and you can rewrite that conditional independence as the statement that the conditional distribution of y given x1 in the source and in the target are the same; this is just how you rewrite conditional independences. Maybe this doesn't mean much to you, so let's just look at the data. You have y here and x1 here, and if we look only at the data from the source domain, these green points, and we fit a line, this line is essentially estimating the conditional distribution of y given x1, and the line still works even for the target data, the blue circles here. So a line learned using only the green data will also work on the blue data, which means this is a good predictor, at least in the linear case.

Now we can check x2: is x2 a good predictor? The answer is no. If we look at the path that we had before, now we're not conditioning on x1 anymore, so this path is open, and therefore y and c are not d-separated. This is similar to the case in which the BMR may influence my motivation if I don't know my weight loss. And there is a more complicated issue with the other path; we didn't even need to check it, because the first path was already open, so it's already a problem, but let's check it anyway. This other path also creates a problem, because when I condition on x2, x2 is a collider, a structure in which two edges are coming in, and conditioning on it opens the path from c through x2 to y. The intuition is that if I know I have lost weight and I know I haven't had any calorie deficit from eating less, then I definitely must have done a lot of exercise; essentially the calories I eat and the calories I spend exercising are not independent anymore once I know my weight loss. So x2 is a horrible predictor, really bad, and I can show you in the plots that it's even more clearly bad. Here we still have y on this axis and x2 on this axis, and again we fit a line with linear regression on the source domain; it is horrible for the target domain, completely wrong. Even worse, in our unsupervised domain adaptation case we never see y, so we don't know what's happening; we could have an essentially arbitrarily large error. So x2 is very biased and we shouldn't use it: x1 is a separating feature, x2 is a bad feature, not a separating feature, and x1 and x2 together are also not good, because x2 will essentially bias things. This is not something new, it's more a summary of things that have apparently been discovered repeatedly, but one thing that people who don't know this area so well maybe don't realize is that separating features are not necessarily causal. I can add this x3 here, and I can show, and trust me on this unless you want to
learn d-separation on the spot, that x1 and x3 are both separating sets. So both of them are good predictors, and maybe x3 actually helps us in predicting y, so it's even better than having just x1. So separating features need not be causal: they don't need to be parents of y. This is a bit of a difference with domain generalization: in domain generalization you would generally prefer the causal parents, because you assume that everything can be intervened upon, and therefore most other features would either be in this kind of situation or be unimportant, so you can always take the parents. And just to drive another point home: if you do have latent variables, latent confounders, for example this one, h, sometimes even the causal parents cannot really help you, and there are other features that you should be using instead. In this case x2 is a parent of y, but we can see from the separations and connections that x2 is not a good feature according to our definition, and even adding x1 and x2 together actually creates more problems, because essentially x2 opens this path here. In any case, if you know d-separation and you know the graph, you can figure out which set is a good set, one with a bounded generalization error, and it's not necessarily the parents.

Now I will talk a bit about an example, the dataset that we used in this causal domain adaptation paper. We will focus on unsupervised multi-source domain adaptation; multi-source, I forgot to mention it before, just means that there are multiple source domains. In this paper we focused on this dataset because we were competing in a causality competition. We had some mice, and for each mouse we had some phenotypes, for example, I don't know, height and weight, and then we wanted to predict something else, for example coat color, another phenotype. We had a few samples from normal mice, but we also had a few samples from mice which were genetically modified, and we didn't know what the gene was actually doing, because we only had phenotypes, so the distribution was different but we didn't know exactly how. And finally we had our target set, mice in which yet another gene was modified, and in this case we wanted to predict, let's say, our label Y. We needed some assumptions; some of them are quite technical, and we can discuss them later if you're interested, or you can check the paper, but one of the assumptions we had, and as far as I know almost everybody has, is that the label Y cannot be intervened upon directly. When I say intervened upon, I mean that there is no edge from the context variable to the label Y in the target domain, because otherwise we would not be able to transfer anything to this domain: things would change there in a way not mediated through the other variables. This doesn't mean that the distribution of Y cannot change; it can still change through the other variables. In this case the problem was that the graph wasn't known, and even worse, you might think, okay, if the graph is unknown I can just do conditional independence tests, since in the end conditional independences are what I want, right? No, because in our unsupervised domain adaptation case we didn't have any labels in the target domain, so we couldn't test those conditional independences, and we needed to be very, very careful about which conditional independence tests we could do, because anything that used Y had to be conditioned on being in the source domains. So the idea was: we can still run a lot of other conditional independence tests, and then maybe we can prove what we need from those. I will show you, at a high level, how we can essentially prove these
conditional independences that we cannot test from the other conditional independences that we can test. We needed some assumptions, for example the fact that Y is not intervened upon directly, and some other technical assumptions, and we needed the joint causal inference framework I discussed before, now in a context in which we also have missing data. What we did is run a lot of conditional independence tests, but we had to be careful: if Y was in a conditional independence test, we needed to condition on being in the source domains. Then we could find several graphs that fit these conditional independences, and we can represent them with a single graph in which the dotted lines are edges that may or may not be there. The idea is that in all of these graphs we could prove that Y was d-separated from C1 given x1: whether this edge is there or not, x1 would always be blocking. This is a minimal example that we proved by hand, and it doesn't matter whether this other edge is there or not; this feature could be bad or not bad, we're not sure, but just in case we're not going to use it. So this was done by hand, proving theorems by hand, and we thought, maybe we should just use a theorem prover. In this theorem prover we start with a query; we want to check, for example, whether x1 is a separating set. We have some assumptions, we add all the testable conditional independences that we had, carefully, so that wherever Y appears we only used source data, we use a logical encoding of d-separation, and then we throw everything into the theorem prover. The theorem prover would spit out one of essentially three options, and one is provably separating, which means that in all of the graphs that fit these conditional independences and these assumptions, x1 would separate Y from C1.
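Putting such prover verdicts to work can be sketched as a simple selection loop: rank candidate feature sets by their source-domain loss and keep the best one that is certified separating. The sketch below is my own illustration, with invented losses and verdicts standing in for the real fitted predictors and the real theorem-prover output.

```python
# Toy sketch of causal feature selection with a prover-style oracle.
# `source_loss` and `prover_verdict` are invented stand-ins: in the actual
# method the losses come from models fit on the source domains, and the
# verdicts come from a theorem prover reasoning over all graphs consistent
# with the testable conditional independences.

features = ["X1", "X2", "X3"]

# Hypothetical source-domain losses for each candidate feature set.
source_loss = {
    ("X1", "X2"): 0.10,   # best on the source domains...
    ("X1", "X3"): 0.12,
    ("X1",): 0.15,
    ("X2",): 0.20,
    ("X3",): 0.25,
    (): 0.50,
}

# Hypothetical prover verdicts: "separating", "not_separating", "unknown".
prover_verdict = {
    ("X1", "X2"): "not_separating",  # ...but X2 opens a path to the context
    ("X1", "X3"): "separating",
    ("X1",): "separating",
    ("X2",): "not_separating",
    ("X3",): "unknown",              # not identifiable from the constraints
    (): "not_separating",
}

def select_features():
    """Try candidate sets from best to worst source loss; return the first
    one certified as separating (i.e. with a bounded target-domain error)."""
    for s in sorted(source_loss, key=source_loss.get):
        if prover_verdict[s] == "separating":
            return s
    return None  # no provably separating set was found

print(select_features())  # ('X1', 'X3'): best source loss among certified sets
```

Note that the lowest-loss set on the source domains is rejected here: a set that predicts well in-domain can still carry an unbounded error in the target, which is exactly why the certificate, not the source loss alone, makes the final call.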
Another option is provably not separating, which means that in all of the graphs that fit the data the separation does not hold; or simply it's not identifiable: in some graphs it holds, in some it doesn't, and we're not sure which is the correct graph. We used this method in a very simple causal feature selection algorithm. The idea was: first we use the source domain data and do standard feature selection, and just with the source domains we list which sets of features are most predictive (and by features I mean both the ordinary features X1, X2, etc. and the context variables, which we also use as features). Then we select a set, starting with the top one, the set with the smallest loss for predicting Y in the source domains, and we use our theorem prover to check whether this set would actually work. Let's assume this set doesn't work, so we cannot prove it separating; then we just iterate and take the next set, and we keep going until we find a set that is provably separating. We just try everything, so it's very simple, essentially brute force, going down the list until we find a set that is provably separating. Provably separating means that, regardless of what the intervention in the target domain is (because we cannot see what Y is doing there), under some assumptions obviously, we can still bound the generalization error. So now we have this set S that we can prove is provably separating, and we can learn a function f-hat of S using only the source domain data. If this doesn't happen, we just continue iterating until we find one; but let's assume we find it, and once we have learned f-hat we can just use it on the target
domain to estimate Y, and we can prove, and we have proofs in the paper, that the generalization error is bounded. So this was our older work; now I'm going to talk about an application of it. We have a new MIT-IBM project in which we're trying to apply this work to cross-species translation. This is with Daniel Zoo and Douglas Lauffenburger and some other collaborators from IBM. I'm not sure whether you know it, but of the drugs which are tested on mice in preclinical studies, only about 10 percent of those that pass the preclinical studies also pass the clinical trials, meaning they're effective for humans; only 10 percent of the things that work on mice also work on humans, and we're essentially not sure exactly how the physiology translates. This is cross-species translation: we want to know how the physiology translates. There is some work from the PI at MIT who's working with us on computational modeling for cross-species translation, and most of the approaches try to find things that mice and humans have in common; our hope is instead that by using state-of-the-art domain adaptation approaches, for example the one I showed before, we should be able to do better. For example, this is what we would like to have: we have some control mice, maybe some treated mice, maybe also some other species (at the moment this is not happening, but this is the hope), and then we have some humans, and the humans have, for example, gene expressions, and we want to classify the humans. For example, now we are looking into tuberculosis: we want to classify which of the humans have tuberculosis, and we want
to learn it through mice and, essentially, the distribution of the gene expressions in the humans. So this is one of the use cases we're looking into. And now for something, in a sense, completely different: I'm going to talk about another method, not the method I mentioned, but still a method that uses ideas from causality for domain adaptation. There is a YouTube video from Kun Zhang talking about it at the Online Causal Inference Seminar; it is a method for data-driven domain adaptation. The idea is that you have this graph, and you have theta parameters that encode the changes. Instead of the context variables that we had in our case, their parameters are thetas, because they want them to be not just zero or one but actual numbers, so that they can estimate them more efficiently. The idea is that if you estimate these on a set of source domains, then when you're in a new target domain you only need to estimate a subset of the thetas, and the rest stays the same. We have actually used this method: we have collaborated and applied it in something that is a bit of a work in progress, so I wanted to tease you with it a little. It's an application to fast policy adaptation, so reinforcement learning. The idea is, for example, we have Pong: here we're playing Pong normally in this direction; maybe we have another source domain with rotated Pong, so we're playing in this other direction; here we have added noise to Pong. So we have a set of different source domains, and we want to learn a model such that from the images we learn a representation that is shared and reasonably semantically sensible and disentangled, and then we want to learn domain-specific parameters, for example something about the noise value. There are some assumptions with respect to the previous work; for example, one of the assumptions is that there
are no new edges in the target domain, which is something we didn't assume in the method before, so there we don't have this simplifying assumption. Using this model estimated on the source domains, we can identify and prune it a bit so as to use just a subset of the states, and then we can do policy learning on the source domains: we learn pi-star, and pi-star is parametrized by these domain-specific parameters. When we're in a target domain we already have the policy, we just learn the values of these thetas, and then we can apply the policy. It seems to work quite decently, so maybe next time we talk I hope we'll have better results. So, to come to the conclusions: maybe in a somewhat roundabout way, I hope I convinced you that causal graphs and d-separation are a principled way to reason about invariances and distribution shifts. I especially wanted to convince you that this can also work, and this is completely not obvious, when the graph is unknown, and even when there is missing data or untestable conditional independences. Especially with missing data and untestable conditional independences, logic-based methods like Hyttinen et al. (2014) seem to actually help us deal with the fact that some of the conditional independence tests are missing. And as I said, there is a lot more work in this area that I hope you'll see. So that's it, thank you.