We're starting, so please take a seat. I was kind of hoping more people would come just after lunch, but the people who are here are the best people. I wanted to introduce Kun. Kun is from CMU, and he's been doing causality since 2006, as I discovered recently. He doesn't like to say it because he's very humble, but I would like to say that, as far as I know, he's the first one to do causality and transfer learning together, since 2013. Maybe we can argue about that; maybe Elias has some ideas, we never know. But he's clearly very good, and I think this talk will be interesting to all of us.

Thank you very much, you're so nice, Sarah. And by the way, I'm the first among the people who are doing our kind of work. So in this talk you will see how we can connect causality with independence, and then how we can do adaptive prediction, or transfer learning, in causal and non-causal ways. I'll start with an example. In 2015, the photo categorization system developed by Google Photos made a huge mistake: it labeled African-Americans as gorillas. This is really a huge mistake. They used deep learning; they really tried to do the classification well. What caused the problem? And after almost three years, two and a half or almost three years, they finally solved it. How did they solve it? Do you have a clue? Sorry? Yes, that's one way, but they didn't use it, because they wanted to keep the system automatic. To us this is very simple, and finally they solved the problem by removing the label. Yes, that's what you wanted to say, right?
So, by removing the gorilla label from the label set. You can see they spent a lot of time; they really wanted to solve the problem. The problem is very easy for us to solve, but for a deep learning system it's so hard. Why? Similarly, we know the adversarial attack problem: after adding some noise, all of a sudden it's not a panda anymore. What caused the problem? Roughly speaking, it's just because the machines are using a different set of features than the ones we use. We derive our features in a certain way; machines just use whatever features are most useful for prediction as implied by the training data. That's all. So basically we use different sets of features.

Now, we say we are intelligent, and we want machines to be intelligent. What is intelligence? Traditional machine learning basically assumes that the training data and the test data come from the same distribution. You have a single distribution; that's why you can learn something here and apply the same thing in the future. In practice this is wrong almost all the time. For human beings, if you play badminton very well, you can easily learn to play ping-pong, and you can quickly play it very well: you know how to decompose the whole skill set into small modules, and you can automatically do the transfer. A human driver can easily make the right decision even in a completely new scenario, because you can see the essential part of the input, of the features, and you can draw the connection between this essential representation and something in your experience. We can do that.

Overall, for a machine to be intelligent, we basically have to ask for the following things. First of all, the system should have the ability to understand what's going on; I'm not giving a definition of understanding now, I'll try to do so later. Because of that understanding, the machine is able to do control and intervention. The machine can automatically decompose a complex situation, a complex task, into a lot of simple modules. The machine can do information fusion: it can put together information, or pictures, learned from different scenarios, from different situations, and form a bigger overall picture. And it can learn with a few examples in a new scenario, because it can see the connection between the new scenario and the previous ones; then you can do something with very few labeled data points. You can do not only interpolation but also extrapolation: even if the data points are not contained in the training set, you can see the connection and make use of it.

We can talk about a lot of phenomena related to intelligence, but how can we really achieve it? How can we define it? Some people may say: okay, I can see exactly how the brain works, and then I can try to build something to mimic that. I don't think this is really the right way to achieve intelligence, because there are so many possible ways. We just care about the input and output.
We don't care about the particular details inside the system. Then how can you define intelligence? You can view this problem from the perspective of evolution, selection, and growth. We want to define intelligence because we have those capacities, and now we want a machine to have the same capacities. What capacities? First of all, we survived. We survived different scenarios, meaning that we have the ability to make good predictions, or good decisions, across different scenarios, across all scenarios. Roughly speaking, this means we have come up with some inner, compact representation of the external world: you see different things, and you analyze and encode the connections between different scenarios, between different tasks, and so on. Second of all, we are creative in some way; that's why we can create new things, and we are powerful. To be creative, you have to make use of a causal representation, because you want to intervene, you want to change something to achieve your goal. And we prosper; that matters for how we define intelligence, because we want to develop something that is similar to us. We prosper because we are creative and we survive. Now we want to create a machine that has these abilities. What can we do? First, let's notice the following: the two kinds of representations are somehow consistent. When we were very young, as infants, we couldn't really do any intervention, right?
We could just cry and observe; we tried to keep ourselves secure. Basically, we tried to make predictions; we tried to form some inner, compact representation when we were very young. Later, we gained the ability to make changes, and we discovered a lot of causal relations, and the two seem to be consistent: there is no sudden big change between them. Given this, let me say what I'll try to do today. I'm not going to talk about how to really achieve a general-purpose AI system; that's another story. Today we want to have a look at the connection between the two kinds of representations, and we are going to answer the following three questions.

First: we can analyze observational data, and the relationships in the resulting representation are somehow consistent with the causal representation; so how can we analyze observational data to find the causal relations? This is the first issue, known as causal discovery, and you'll see how we can solve this problem. Second: clearly, a causal representation is really useful for transfer learning, or domain adaptation, because with it you can make very good predictions across scenarios, and that's basically transfer learning. So the second issue is: how can we make use of causal representations to improve our ability to do transfer learning? Suppose I know something about the causality in the system; how can we make use of it? Third: yes, we can use causality, but it's not strictly necessary, because not everything involved in this representation is causal. That means, if we are given only data, probably we don't have to go all the way to the causal level; we can make use of something else, something kind of similar to causality, to do transfer learning. What's that?
So you want to find an efficient way to encode the essential properties of the data distribution in order to achieve transfer learning. What's that? I'm going to focus on those three issues, and I will take questions during the talk; if you have any questions, please let me know.

We distinguish between causal connections and associational information. Here is an example, a news report from The Telegraph: couples who share housework are more likely to divorce. It's horrible, right? So we really want to see whether this is causal or just associational; that's why there were a lot of follow-up discussions: does sharing housework really lead to divorce? We care about such questions because we want to see what we can do to improve our lives. Associational information, most of the time, doesn't really help with that; but for machines it makes sense, because they make predictions. This caricature, I think all of you know it; it basically describes my experience when I was very young. If we see very strong correlations, or very strong dependence, we think: there is causality, there is some causal relation. Later we learn the very famous statement in statistics: causation implies correlation, but correlation doesn't imply causation, right?
A lot of people just change their minds at that point. However, after doing research in this field for, as you said, about 13 years, I'm now very optimistic. Basically, I can say that in almost all situations, if you believe that the causal relations hold between the given measured variables, then we can always discover the causal directions and the causal relations. However, this is only part of the story. In a lot of situations we don't have causality between the observed variables; we have causality between hidden variables, and we only observe a lot of reflections of those true underlying causal variables. This is much harder, and a couple of people are currently working on this problem. Hopefully within five years you will see some very good solutions, maybe because of you.

Okay, what's causality? Basically, X is a cause of Y if interventions on X can give you different values, or different distributions, of Y. When you do an intervention, you only change the value of X; everything else in the system, at least for the moment, stays the same as before. That's why, whenever you observe a change in the distribution of Y, you know it is because of the intervention on X; that's why X causes Y. Clearly, if two variables are causally related, then generally speaking they are associated, and they are associated if and only if they are dependent; association is a much weaker notion.

In reality, in many situations, we really want to discover causal information from observational data. Here is a particular example. This paper says that the largest-scale psychological differences within China can be explained by rice versus wheat agriculture. There are some explanations; I'm going to skip them, but if you are really interested you can read the paper. Basically, here you can see the kind of scenario we are in.
We cannot do interventions. Even if you are very powerful and can force people to do different things, you would have to wait really long to see the effect. So we have to analyze data; we have to make use of a set of properties of the data.

Here is another application, another example, from my clever collaborator Marlijn Noback from the Netherlands; her last name is very cool: Noback, so just go forward. She collected data on 255 skeletons from all over the world and measured eight variables, and they really wanted to see how those variables are causally related. The features include sex, cranial size, diet (what you eat), climate, and cranial shape differentiation, basically what this part of the skull looks like. They wanted to see why people in different areas look different, and then make predictions about the future, given things that will happen in the future. This is a very interesting problem, but you can see we cannot really do interventions; we have no choice but to analyze the data.

You are going to see how we can make use of independence to solve this problem. Basically, we make use of different levels of independence to find the causal structure, and after that you'll see how we can do transfer learning, also by making use of independence, or a particular type of independence, modularity. That's why independence is really an essential property, an essential concept, in causality. So how can we discover causal information, causal structure, from observational data? First of all, we know that if you have a causal structure, a valid causal structure, it implies a set of independence and conditional independence relations, given by the Markov condition, the causal Markov condition.
Suppose X causes Y and X causes Z, like this, and suppose there are no confounders. Then we know Y and Z are conditionally independent given X, because there is no other way for Y to be related to Z: if I fix the value of X, then no matter how this one changes, that one will not change because of it; they are conditionally independent given the middle variable. In general, each variable is conditionally independent of its non-descendants given its parents. This is the Markov condition. So from this condition we can read off independence relations in the data if we are given the graph. Now we want to solve the inverse problem: we want to go from the independence properties in the data, which can be discovered from data, back to the causal graph. What can we do? The contrapositive of the Markov condition says that if two variables are dependent, then, roughly speaking, they are d-connected. On its own this doesn't give us a sparse graph: starting from dependence we can only reach the conclusion that variables are d-connected, but we really want a sparse graph to explain the data. What can you do? You have to make use of another assumption, known as faithfulness. Faithfulness says that all independence relations observed in the data can be seen from the graph by applying the Markov condition; it means there are no fake, extra, spurious independence relations. Whenever you see an independence relation in the data, it is a property of the graph. Because of this assumption, there is always a connection between the causal graph and the statistical properties of the data, and from here you can derive the following procedure.
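As a small numerical illustration of these constraints (a hypothetical simulation, not from the talk): in a fork X → Y, X → Z, the two effects are dependent marginally but independent given the common cause; in a collider, the two parents are independent marginally and become dependent once you condition on the common effect. For linear-Gaussian data, (partial) correlation is enough to check both patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
noise = lambda: rng.normal(size=n)

# Fork: X -> Y, X -> Z. Markov condition: Y and Z independent given X.
x = noise()
y = 0.8 * x + noise()
z = 0.8 * x + noise()
fork_marginal = np.corrcoef(y, z)[0, 1]     # nonzero: share the cause X
ry = y - 0.8 * x                            # regress out X (true coefficient)
rz = z - 0.8 * x
fork_partial = np.corrcoef(ry, rz)[0, 1]    # near zero: independent given X

# Collider: A -> C <- B, with A and B independent.
a, b = noise(), noise()
c = a + b + 0.5 * noise()
coll_marginal = np.corrcoef(a, b)[0, 1]     # near zero
mask = np.abs(c) < 0.2                      # crude conditioning on C
coll_conditional = np.corrcoef(a[mask], b[mask])[0, 1]  # strongly negative

print(round(fork_marginal, 2), round(fork_partial, 2),
      round(coll_marginal, 2), round(coll_conditional, 2))
```

Within a narrow slice of C, a large A forces a small B and vice versa, which is exactly the conditional dependence a constraint-based method exploits.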
This is the PC algorithm for discovering causal relations from data. We make use of conditional independence relations; this family of methods is known as constraint-based methods, where by constraint I mean independence or conditional independence constraints. From data you can discover a lot of constraints, independence and conditional independence relations, and then you want to find a DAG, or a set of DAGs, that satisfies those constraints. From the Markov condition and the faithfulness assumption, you can show that if two variables are independent, or conditionally independent given some set, then there is no edge between them; they are not adjacent.

Now, say I start with this graph. I know 1 and 3 are not adjacent, 1 and 4 are not adjacent, 3 and 4 are not adjacent; you can see this directly from the data distribution, from the independence information. Next, 1 and 3 are independent, and a collider is the only way for them to both be adjacent to the middle variable and still be independent; otherwise they would share information, they could not be independent. This is known as a v-structure. Then you can go further. This other triple is not a v-structure, because its end variables are not marginally independent; they are independent given the middle variable. And this edge cannot point toward the middle, because if it did, it would mean that before you condition on that variable the two are unrelated, and after you condition on it they become dependent, which is different from what we observe in the data. That's why the edge must go the other way. This is called orientation propagation: you find the v-structures and their directions, and then you go further by propagation.

Okay, here is a very dense summary of the whole procedure. PC, like other constraint-based methods for causal discovery, makes use of conditional independence relations. These methods rely on the causal Markov condition and the faithfulness assumption, and a typical algorithm is the PC algorithm; P is for Peter Spirtes, C is for Clark Glymour. There are two steps: the first step is to determine the skeleton, the undirected graph, from the data, and the second step is to find the v-structures and do orientation propagation. If all the DAGs consistent with the data have the same direction for every edge, then the output is complete: it's just a DAG, because all directions are consistent. Otherwise, if the same edge can have different directions across different solutions, we use an undirected edge. For example, suppose the ground truth is this chain; we know the two end variables are independent given the middle one. All three of these graphs satisfy the same set of independence relations, so we can say the variables are adjacent, but we cannot determine the directions, because each edge can go either way. This is known as a pattern, or a Markov equivalence class.

Now you can see how this method works on this dataset. We applied the PC method combined with our kernel-based conditional independence test, and this is the output. A lot of edges are directed; some of them are not. And if you further make use of some background information, you can go much further. For example, this is location and this is climate, and climate shouldn't be a cause of location, right?
So I know this edge should go this way; and if this edge goes this way, that edge must also go this way, because otherwise you would create a fake v-structure. So all of a sudden, by making use of some background information, I have a DAG, and a lot of the relations reported here have been discussed in the literature. For instance, this one says climate causes cranial size. You can see that in very cold places people usually have a larger cranial size; if it's very cold, it's better to warm the inhaled air for longer, and that's why this part tends to be bigger, and so on. There are a lot of explanations. Furthermore, from data we recover not only those findings but more detail. In the literature they say that diet, what we eat, is a cause of cranial shape differentiation; essentially, what you eat changes your looks. From data we say: yes, that's true, it is a cause. However, relative to this set of variables, it is not a direct cause; the causal influence takes place via another two variables. So from data you can discover a more detailed, more complete causal picture.

[Question] What was the kernel-based conditional independence test? I'm curious why you had to innovate on conditional independence testing.

So, if you want to apply such a method, you have to make use of conditional independence relations; you have to test whether variables are conditionally independent. Basically, this is just a nonparametric method for conditional independence testing.

[Question] Okay, so that wasn't a contribution here; you had a nonparametric method that you used.

Right, it's just a kind of module; we just call the function.
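The kernel-based conditional independence test itself is involved, but the flavor of kernel dependence measures can be sketched with the simpler, unconditional HSIC statistic: embed both variables with kernels and measure cross-covariance in feature space. This is a minimal illustrative sketch under an arbitrary fixed bandwidth, not the actual conditional test used in the talk:

```python
import numpy as np

def rbf_kernel(v, sigma=1.0):
    # Pairwise squared distances for a 1-D sample, then the RBF kernel matrix.
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC: trace(K H L H) / n^2; larger means more dependent.
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_kernel(x, sigma) @ H @ rbf_kernel(y, sigma) @ H) / n ** 2

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
y_dep = x ** 2 + 0.1 * rng.normal(size=n)  # dependent on x, yet uncorrelated
y_ind = rng.normal(size=n)                 # independent of x

hsic_dep = hsic(x, y_dep)
hsic_ind = hsic(x, y_ind)
print(hsic_dep, hsic_ind)
```

Here y_dep is a function of x but has essentially zero linear correlation with it, so a correlation-based test would miss the dependence while the kernel statistic does not; the conditional version applies the same idea after accounting for the conditioning set.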
Wonderful. So far we haven't seen anything about confounders; by a confounder I mean a hidden direct common cause of two variables. What if you have confounders and you only measure a finite number of variables: can you say anything about the confounders? This is really cool, because if you can say something about confounders, then, given only a finite set of variables, you can essentially say something about the rest of the universe. The answer is yes: in a lot of situations you really can say something about the confounders. Let's look at two examples.

In the first example, this is the graph we discovered: 1 and 2 are independent, 1 and 4 are independent given 3, and 2 and 4 are independent given 3. If you assume there are no confounders, this is the graph. Now the question is whether it's possible to have a confounder behind 3 and 4. Do you think it's possible at all to have a confounder there, given this set of independence relations? The answer is no. If you had a confounder there, then X1 and X4 would be d-connected given X3, which is not the case here. The same argument applies to the relationship between 2 and 4 given 3. So in this case, supposing the independence relations are really oracle, that is, we are very confident about them, then we are very sure there is no confounder between 3 and 4, and furthermore X3 must be a direct cause of X4. So this is really informative, right?

Okay, now the second scenario; you can look at the example in detail later. Suppose you observe these independence relations: 1 and 3 independent, 1 and 4 independent, 2 and 3 independent. Of course, in reality some of these could be conditional on other variables; it doesn't matter. Do you think there are confounders between 2 and 4?
So this is the graph: 1 and 3 are not adjacent because they are independent, 1 and 4 are not adjacent, and 2 and 3 are not adjacent. Now you can see the following. Because 1 and 4 are independent, if there is no confounder, this edge must point this way; otherwise they would share information. Because 2 and 3 are independent, that edge must point the other way. You can see the contradiction. Why do we have a contradiction? Because of a confounder. So in this case we can conclude that there must be a confounder between X2 and X4, and furthermore, we can say there is definitely no direct causal relation between X2 and X4. So in many situations you can really say something about the confounders by making use of only conditional independence relations.

Unfortunately, if you use only conditional independence relations, most of the time you cannot uniquely identify the DAG, because you have the equivalence class; you can only identify the set of DAGs which share the same independence relations. Now let's try to go further. Here you can see some examples. We are given different data pairs; each time we are given only two variables, with no temporal information, so there is no temporal precedence. We are given i.i.d. data: each time, a number of data points for X1 and X2. We want to determine which one is the cause and which one is the effect. What can we do? Clearly we cannot make use of conditional independence relations, because we have only two variables.
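Before moving on to the two-variable setting, the confounder argument above can be checked with a hypothetical simulation: generate data in which a latent L confounds X2 and X4, and verify that exactly the observed independence pattern appears, with 2 and 4 dependent even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
noise = lambda: rng.normal(size=n)

# X1 -> X2, X3 -> X4, plus a hidden confounder L -> X2 and L -> X4.
x1, x3, L = noise(), noise(), noise()   # L is latent, never "observed"
x2 = x1 + L + 0.5 * noise()
x4 = x3 + L + 0.5 * noise()

dep = lambda a, b: abs(np.corrcoef(a, b)[0, 1])

# Observed pattern: 1 ind. 3, 1 ind. 4, 2 ind. 3, but 2 and 4 dependent.
c13, c14, c23, c24 = dep(x1, x3), dep(x1, x4), dep(x2, x3), dep(x2, x4)
print(round(c13, 2), round(c14, 2), round(c23, 2), round(c24, 2))
```

Note that X1 and X4 come out independent even though both are connected to X2: the path through X2 is blocked at the collider, which is what makes this pattern diagnostic of a confounder.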
So now you can see: if you assume something slightly stronger, then all of a sudden you can discover a lot of information about the process. Here you see a number of functional model classes. In the first one, the effect is a linear function of the cause plus noise; this is the linear model, and if we assume the noise is non-Gaussian, it is the linear non-Gaussian model. In the second, the effect is a nonlinear function of the cause plus noise. In the third, you additionally have a nonlinear distortion, for instance measurement error or measurement distortion, applied to the data. If the true generating process, the causal process, follows any of these models, with the last being the most general, then in theory you can show that the reverse direction, the wrong direction, cannot give you independent noise. For the correct causal direction we have an independent noise term, but for the wrong direction you can never estimate one. This gives an asymmetry between the two variables, and that's why we can uniquely identify the causal direction. Here you can see an illustration: I generate X first, and then I generate Y, so Y is the effect of X, with some noise.
This is the functional causal model, the causal mechanism. First, consider the jointly Gaussian case: X is Gaussian and E, the noise, is Gaussian, so all variables are jointly Gaussian. In this case, I try to explain the effect with the cause by linear regression. The residual is uncorrelated with the cause, with the predictor, by construction, and in the Gaussian case, as long as they are uncorrelated, they are independent. The same thing happens in the reverse direction: I try to explain X, the cause, with the effect by linear regression, and the residual is uncorrelated, and hence independent, from Y. So the two directions are symmetric: cause and effect are symmetric.

Here is another scenario: X is uniform and the noise is also uniform. Now, if I try to explain Y with X, I basically just recover the underlying true causal mechanism, and the residual is clearly independent from X, because the data were generated with an independent noise term. Note that if there is no confounder, the noise term must be independent from X; otherwise there would have to be some other way for X and Y to be related, meaning there would have to be a confounder. That's why we can say: with no confounder, the noise must be independent from X, from the cause. Then let's try to explain the cause X by making use of the effect Y. You do linear regression, and yes, the residual will be uncorrelated with the predictor by construction; whenever you do regression, they are uncorrelated. However, as you can see from this scatterplot, they are clearly dependent.
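These two cases can be reproduced with a hypothetical simulation. Since regression residuals are uncorrelated with the predictor by construction, a simple way to expose any remaining dependence is to correlate squared residuals with squared predictors, a crude stand-in for a proper independence test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50000

def residual_dependence(target, predictor):
    # Linear regression of target on predictor, then a crude nonlinear
    # dependence check: correlation of squared residual with squared
    # predictor (near zero when the residual is independent of the predictor).
    b = np.cov(target, predictor)[0, 1] / np.var(predictor)
    r = target - b * predictor
    return abs(np.corrcoef(r ** 2, predictor ** 2)[0, 1])

# Linear-Gaussian pair: Y = X + E, everything Gaussian.
xg = rng.normal(size=n)
yg = xg + rng.normal(size=n)

# Linear non-Gaussian pair: Y = X + E, with X and E uniform.
xu = rng.uniform(-1, 1, size=n)
yu = xu + rng.uniform(-1, 1, size=n)

gauss_forward = residual_dependence(yg, xg)   # causal direction: small
gauss_backward = residual_dependence(xg, yg)  # anti-causal: also small
unif_forward = residual_dependence(yu, xu)    # causal direction: small
unif_backward = residual_dependence(xu, yu)   # anti-causal: large

print([round(v, 3) for v in
       (gauss_forward, gauss_backward, unif_forward, unif_backward)])
```

In the Gaussian case both directions look equally clean, while in the uniform case only the true causal direction yields a residual that behaves independently, which is exactly the asymmetry described above.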
They are not independent: the distribution of the residual is very sensitive to the value of Y, just as you can see from the scatterplot. This is just an example, but generally speaking, as long as at least one of the two variables is non-Gaussian, we can recover the causal direction, because the wrong causal direction will not give you an independent error term; the error terms are merely uncorrelated with the predictors.

So far I've talked about the two-variable case, but you can handle many variables. Here you can see an application to stock market analysis: we analyzed the daily returns of a lot of stocks in the Hong Kong stock market, and basically all the major relations are consistent with domain knowledge. For instance, this is Hang Seng Bank and this is HSBC; HSBC holds something like 60 percent of Hang Seng Bank, and you can see the causal relation from Hang Seng to HSBC, and so on. Okay, I'll skip this.

Previously we assumed everything is linear, which is clearly too strong. In reality we want to use a very flexible model to approximate the process, and at the same time we want to guarantee that even with this very flexible model, the causal direction is identifiable, that is, the cause and the effect are not symmetric in terms of the model. So we came up with this model, called the post-nonlinear model, 13 years ago.
Yeah. So basically, the effect is a nonlinearly distorted version of a nonlinear function of the cause plus noise. This is very important, because very often we have sensor distortion, measurement distortion: whenever you use an instrument to measure something, you often introduce some additional distortion. Putting this together, we have this model with two nonlinear functions, and you can see the model is rather general. For instance, even if the noise is multiplicative rather than additive, that is nothing but a special case of this model.

We applied this model to the causality challenge, the second causality challenge, and with this method we correctly identified all the causal directions; that's what changed a lot of people's minds. Here you can see a particular dataset, dataset 8; you can see the data points. We first assume X1 causes X2, fit the model, and estimate the nonlinear functions and the noise: the noise is independent from the hypothetical cause. Then we assume X2 causes X1, because we want to see which direction is more plausible, and we again estimate the noise and the nonlinear functions: this time the noise is not independent from the hypothetical cause, which is X2. So we say the causal direction should go from X1 to X2. Actually, X1 is people's age and X2 is wage per hour, so clearly X1 should cause X2 and not the other way around.

[Question] Is this assuming no noise in the parent variables?

This is a wonderful question. Yes, and later you will see how we have to deal with measurement error. You are totally right: otherwise, basically, the true variable behind the cause acts as a confounder.
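As a quick check of the claim that multiplicative noise is a special case of the post-nonlinear model Y = f2(f1(X) + E): with a positive cause and positive noise, take f1 = log and f2 = exp. This is a hypothetical numerical verification of that algebraic identity, not the model-fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Multiplicative-noise data with positive cause and noise.
x = rng.uniform(0.5, 2.0, size=n)
e = rng.uniform(0.5, 2.0, size=n)
y_mult = x * e

# The same data in post-nonlinear form Y = f2(f1(X) + E'):
# f1 = log, f2 = exp, and the additive noise term is E' = log(e).
y_pnl = np.exp(np.log(x) + np.log(e))

print(np.allclose(y_mult, y_pnl))  # True
```

The additive noise term log(e) is still independent of X, so the identifiability theory for the post-nonlinear model applies directly to multiplicative-noise mechanisms.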
Yeah Okay, so you can see yes empirically speaking We can really discover causal direction and the color model from data even if there are no temporal information We're giving ID data So let me show some theory Empirically, yes, it works, but can we really guarantee that essentially we just Take this problem with proof of contradiction. We assume x 1 cause x 2 We have the data then have the distribution implied by the data now We let's see whether the same distribution can be explained by the wrong causal model wrong causal direction How can we really solve the problem? We assume the same data can also be explained by wrong direction Which is from x 1 from x 2 to x 1, right? Now, let's see In what situations this can be the case After solving some equations differential questions, you can see oh in only five situations This is the case meaning that in only five situations Given data generated this way the cause and the effect are Symmetric because even for the wrong direction you can find the independent noise But apart from those five situations cause and the effect are not a symmetric you can discover causal direction Those five cases are very specific first one in the linear Gaussian case Right, you already saw the picture in the linear Gaussian case basically called the direction the information called the direction just disappears All of them are very specific You have to tune the nonlinear functions and the distribution very carefully so that one of them holds true Means that generally speaking we can discover a causal direction from data Even if they even if the data were generated by this very general very flexible model By the way, it's very interesting to see why the linear Gaussian case Which is supposed to be very simple, right? Turned out to be so so strange in the linear Gaussian case We cannot find a causal direction. Why because central linear theorem if you have some of independent variables, right? 
And if the number of independent variables goes to infinity, we get a Gaussian distribution, right? Conversely, we have the Cramér decomposition theorem, which says that if the sum of a finite number of independent variables is Gaussian, then all of them are Gaussian. This means that when I actually see a Gaussian distribution, the process has already converged, right? And if the process has converged, then the process information disappears — you only observe something that is fixed. But before that, you can always recover the process information, which is the causal information, from the distributions. Okay, so now let's have a look at a number of practical issues we have to deal with in causal discovery. Why do we have to deal with those situations? Because we have to analyze the data, and the data were generated not only by the causal process but also by how you measured the data, right? So you have to deal with nonlinearities. You have to deal with not only continuous variables, but also categorical variables and mixed cases; we published papers to deal with those situations. You have to deal with measurement error — measurement error is a really important issue, and we just published some papers on how to deal with it in different situations. You have to deal with selection bias in causal discovery, because selection bias can change the dependence pattern in the data. Confounding is the most important thing in causal discovery. Here I observe X1 and X2; however, there is a variable Z that is a direct common cause of both. Unfortunately Z is not observable. What can we do? Can we discover the true causal relations? So basically we are trying to solve this problem: we observe a lot of variables Xi, generated by some hidden variables, and the hidden variables are causally related. You can consider those hidden variables as concepts, right?
You see a lot of images; those images were generated by a lot of concepts, and those concepts are related, right? Working with the observations alone makes no sense, because what you observe is nothing but a reflection of the underlying true causal variables. You have to go to that level to really make sense of the data. So what can we do — under what conditions can you really recover this kind of deep graphical representation of the data? We call this confounder-network discovery, because the hidden variables form something like a network of confounders. You can also see the relationship between this and deep learning; it also explains why we have to use deep structures in a lot of scenarios. Okay — missing values. You have to deal with missing values, because if you just ignore them you can get artifactual conditional independence relations in the data, so we have some publications there. Causality in time series is a traditional problem — you have temporal information. However, if you really want to apply Granger causality to the data, then in order for the discovered relations to be really causal you have to make very strong assumptions: you assume no unknown confounders, and you assume the data were recorded at the right frequency, the right resolution, and so on. To address those problems, we estimate the time-delayed causal relations as well as the instantaneous — very fast — causal relations. We can discover true causal relations from low-resolution data; by low-resolution data I mean you may have subsampled or aggregated the data, like daily returns — a daily return is nothing but the sum of a lot of local, short-term returns. And you also have to deal with causal discovery from partially observed processes, where some variables are not observable. How can you really recover the true causality
between the observed variables? Basically, we made some progress along those lines of research. Furthermore, recommender systems: if you really want to design a recommender system, you care about causality, because you make a particular recommendation in order to change behavior, right? If you just want to make a prediction, then it's nothing but recommendation for convenience — I know you want to buy this, that's why I recommend the same thing to you. But what you really want to do is to make changes, to change the user's behavior. For instance, if you want to improve revenue, maybe you want to find a way so that the user will buy more products in the long run; and if you want to make the world better, you want to find a way so that the user will become more fair, open-minded, and so on. That means whenever you design a recommender system, you have to come up with some goal to achieve, and then you can make use of causal inference techniques to achieve it. And then finally, non-stationary and heterogeneous data. We observe such data very often in reality. If you have time series, as time goes on the distribution of the data can change — this is non-stationarity. If you have multiple datasets collected in different locations or under different conditions, the distributions can be very different — this is known as heterogeneous data. So how can we find causality from such data? From a purely statistical perspective, non-stationary data is really horrible, because you don't have a fixed distribution. However, from a causal perspective, we really enjoy analyzing non-stationary data, because causality and non-stationarity — or distribution shift — are heavily coupled. We care about causality because we want to make changes, right? In other words, the causal model tells us how the distribution may change. To make it more specific, suppose X causes Y — raining causes the ground to be wet.
What do we mean by that? There is a process that generates X, and another process that generates Y, the effect, from X. The two processes are not related — they are separate, meaning I can change P(X) alone, or I can change P(Y|X) alone; they are not related, right? And we can easily discover and verify those conditions from non-stationary data. That's why, generally speaking, with non-stationary data we can discover causal information more easily. Let's see how we deal with the problem. Basically, we want to answer the following questions. First, how can we find where the mechanisms change? That means we want to discover the variables for which the generating process changes — and we can do so. Suppose this is the ground truth: from one data window to another, the mechanisms for V1, V2 and V4 change. Second, how can you discover the true causal skeleton? And the third problem is: how can you find the directions in the non-stationary case? It's really interesting that we can find the directions more easily, because we can make use of independent changes in the probability distribution of the cause and the distribution of the effect given the cause — because of modularity, right? We have separate, independent modules: this is one module, this is another; they are not related. If you have non-stationary data, we can check whether they change independently. Each of them is a high-dimensional object — a distribution — but we don't care; we can verify whether the changes are independent or not. So we can easily find the direction by making use of independent changes. And finally, we can also use a nonparametric method to produce a low-dimensional representation of the non-stationarity: I know this causal mechanism changes — how can I see how it changes?
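The independent-changes idea can be illustrated with a toy simulation. This is my sketch, not the speaker's algorithm: each domain's modules are summarized by one number each (input variance for P(cause), regression slope for P(effect|cause)), and we check whether those summaries vary independently across domains.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate many domains in which the cause distribution P(X) and the
# mechanism P(Y|X) change *independently* across domains.
domains = []
for _ in range(300):
    mu = rng.uniform(-2, 2)           # P(X) changes ...
    slope = rng.uniform(0.5, 2.0)     # ... and P(Y|X) changes, independently
    x = rng.normal(mu, 1.0, 500)
    y = slope * x + rng.normal(0.0, 0.3, 500)
    domains.append((x, y))

def module_coupling(u_list, v_list):
    # Per-domain descriptors of the two "modules": the variance of the
    # input for P(u), and the regression slope of v on u for P(v|u).
    var_u = [u.var() for u in u_list]
    slope_vu = [np.polyfit(u, v, 1)[0] for u, v in zip(u_list, v_list)]
    return abs(np.corrcoef(var_u, slope_vu)[0, 1])

xs = [d[0] for d in domains]
ys = [d[1] for d in domains]
print(module_coupling(xs, ys))   # X -> Y: modules change independently (small)
print(module_coupling(ys, xs))   # Y -> X: the descriptors are coupled (large)
```

In the true direction the two descriptors are uncorrelated across domains; in the wrong factorization P(Y) and P(X|Y) both inherit the same changing slope, so their changes are visibly coupled.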
Basically, you just want to find a low-dimensional representation of the conditional distribution across different scenarios or over time. Okay, so let me say a few more words about the independent changes — this is really a very nice thing. If you have non-stationary or multiple-domain data, you basically have independent changes between the distribution of the cause and the distribution of the effect given the cause, and this independence is generally violated in the wrong causal direction. There are some special cases; I think you are familiar with some of them. In this case we have an invariant cause. Here C is the surrogate variable — say, the domain index. C points to this variable Vk, so we say the mechanism for Vk changes. Vi is not adjacent to C, and from the data we know Vi is independent of C given some other variables. Then we can say we have an invariant cause Vi; that's why the direction goes from Vi to Vk. This is another scenario: here Vl is independent of C given some set of variables containing Vk. This is very different: now you can see that the marginal distribution of Vl changes, but the conditional distribution of Vl given Vk stays the same.
We say this is an invariant mechanism. Basically, if you have those invariance properties, you can discover causal relations — and independent change is just a generalization of invariance; you can make use of it, and here you can see that you can discover causal relations by making use of changes. So here you can see an application: we analyzed the daily returns of a number of stocks on the NYSE. You can find the skeleton and the directions, and you can see very nice patterns: companies in the same sector are closely related, while between different sectors the causal relations are more sparse. We tried to interpret the results, and they seem to be consistent with domain knowledge. Furthermore, we have a visualization of the non-stationarity of the causal mechanism. This is one US stock — I forgot which one it is. Basically, this is how the causal mechanism generating its daily return changed over time. This is 2007; this is the 2008 financial crisis, right? You can see something totally different from before, and actually this curve is very similar to the so-called TED spread, which provides a way to assess the risk in the market and is calculated in a rather complex way. You can see this line is consistent with it. Furthermore, here in 2007 the risk is already very high, and by 2008 clearly the risk is even higher. (Audience: do you give it the different environments separately, or can it automatically figure out the non-stationarity?) That's a wonderful question. You don't need to do that — you don't need to segment the datasets; you just treat the data as non-stationary, and you can use, for example, a kernel method to make use of all the data at the same time. Okay, finally — let me see how much time I have.
Let me spend about ten minutes on this part: transfer learning. Why do we need a causal model? First of all, if you want to make changes — if you want to intervene on something, or make proper changes to achieve your goal — clearly you want to make use of causal models, right? I think all of us agree with that. But in many situations we don't really want to intervene; we just want to make a prediction. In this sense the causal model provides a compact description of the properties of the joint distribution. Remember the example of raining and the wet ground: X causes Y, and the two processes are not related. Basically, if you observe multiple realizations of the process, the multiple distributions will follow some property given by the causal structure, by the causal model. Now let's try the following: with the same causal process we can observe a lot of distributions, but with different parameters of the causal model, right? So how can we jump from one particular distribution to another? We are familiar with this scenario: we know the x, y values, and we can predict y from x. In the next scenario I observe only x, and I want to predict y — I want to transfer information from this place to another place, right? What can we do? First of all, if you have no idea about the relationship between the domains, then it's not possible to do proper transfer learning, because you have no idea what to transfer — if they are arbitrarily different, you cannot do anything. Second, the causal structure is rather stable; we can make use of this. This is the bridge, right? I jump back to the bridge, and then I can jump to another place by plugging in new, proper parameters.
So this is transfer learning: basically, we have data from some source domain, in which we have both the values of the features and the value of the class label or regression target — we have x, y values. In the future we collect only the feature values, and then we want to make predictions. Clearly the distributions can be very different. This is different from the classical setting, in which the distribution is always fixed — you always have i.i.d. sampling from a single distribution. Now we have different distributions, so we want to jump from one place to another, and we know something can change. What can we do? (Audience: just to understand the setting of the previous slide — there are two things happening: one, the distribution is changing; and two, you are measuring different columns, because you measure only x in the testing — you don't measure y in the target domain.) Yeah, correct: there is something changing between P1 and P2, and you have different measurements. That's the setting. Okay, so for classical transfer learning — domain adaptation — with a single source domain, you're right: in the target domain we don't have the y values and we want to estimate them. And a lot of people, including Sarah, focus on the multiple-source-domain scenario: you have multiple source domains, in each of which you have both x and y. Okay, so let's consider this example — a very simple example. Suppose this is me: you can see my figure and my shadow. Now suppose I go to a completely new environment. First question: if you see my shadow, can you predict anything about my figure? Can you? Yeah, okay. Why?
Oh, let's talk about the why later. So that's the first one. Second: if you see my figure, can you predict anything about my shadow? Do you know how long it is, or where it is? No — and suppose you just care about me; you just see me. So can you really say anything? No, right? Why is that? And furthermore, what's the causal direction? This is because here, this is the cause and this is the effect, right? In our daily-life experience we have already learned a lot about this mapping, the causal mechanism — we have learned a lot of constraints on the relationship between this one and that one. By making use of those constraints and some properties here, we can recover this one perfectly. This is because the mechanism still provides a set of constraints you can make use of. More intuitively, the effect contains the information of the cause as well as information from the environment — including where the light comes from, and so on. That's why, with some constraints, you can separate the contribution of the cause from the rest, meaning you can recover the information of the cause from the effect even if the environment is non-stationary. In contrast, if you only look at the cause: there must be some information in the effect that cannot be explained by the cause. That's why, without stronger assumptions, you cannot really predict the effect from the cause — but prediction in the reverse direction is much easier, generally speaking, if you have non-stationary environments. And here you can see: we know the causal direction and we know the causal constraints — that's why we can do a lot of things. The idea is, if Y — the thing you want to predict — is... (Audience: I was just going to point out that's a good example not just for transfer learning but also for semi-supervised learning, in the right direction.)
Yeah, you're right. Okay, so fortunately, in classification as well as regression, in many situations Y — the thing you want to predict — is actually the cause of what you see, of the features. Here, what digit you want to write down is the cause of the digit image you actually have; here, the disease, which you want to predict, is the cause of the symptoms; and so on. That's why you have this causal model, and you can have changes: a change in the distribution of Y, a change in the conditional distribution — the mechanism — or changes in both. However, because of the previous intuition, we know that as long as Y causes X we can solve this problem: we can do transfer learning fairly easily, even if we have no idea about the y values in the target domain and are given only the feature values. You can see the methods in quite a few papers. Here you can see an application: a remote-sensing image classification problem. We have a lot of remote-sensing images and we want to do classification; in total we have 14 classes and two different areas. You can do transfer learning from area one to area two and from two to one. Basically, with this model the misclassification rate was reduced from 21% to 12% and from 26% to 14% in the two scenarios — you can really benefit a lot from understanding the underlying causal process. Here you can see another scenario. Essentially, again, Y generates X, and we have different parameters, theta, to explain the generating process across domains — a very low-dimensional representation of the changes. That's why, with this structure, we can really see how the causal mechanism changes between domains. We are given a source domain in which you have the class label, and a target domain where we don't have the class label, right?
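Since Y causes X here, the class-conditional distribution P(X|Y) can be treated as invariant and only P(Y) re-estimated in the target domain. Below is a minimal sketch of that idea — label-shift correction by EM on a toy Gaussian mixture. This is my illustration of the general principle, not the method from the papers on the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Y -> X: two classes with known class-conditional Gaussians P(X|Y).
means = np.array([-2.0, 2.0])

def sample(prior, n):
    y = rng.choice(2, size=n, p=prior)
    x = rng.normal(means[y], 1.0)
    return x, y

xs, ys = sample([0.5, 0.5], 5000)   # source: balanced classes
xt, yt = sample([0.9, 0.1], 5000)   # target: shifted P(Y), same P(X|Y)

def lik(x):
    # Unnormalized class-conditional likelihoods (shared across domains,
    # because Y causes X and the mechanism is assumed invariant).
    return np.stack([np.exp(-0.5 * (x - m) ** 2) for m in means], axis=1)

# EM re-estimates the target class prior P(Y) from *unlabeled* target x.
prior = np.array([0.5, 0.5])
for _ in range(100):
    post = lik(xt) * prior                    # E-step: posteriors
    post /= post.sum(axis=1, keepdims=True)
    prior = post.mean(axis=0)                 # M-step: new prior

print(prior)                        # close to the true target prior [0.9, 0.1]
pred = post.argmax(axis=1)
print((pred == yt).mean())          # accuracy of the adapted classifier
```

The same posterior computation with the source prior would misclassify many more target points; re-estimating P(Y) is exactly the "jump to a new parameter value" on the stable causal bridge.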
We do transfer learning by making use of this structure, and we have theta — one dimension of theta — to explain why the distribution in the target domain is different from the distribution in the source domain. In the source domain we found theta takes this value, and in the target domain theta takes that value; they are very different, and that's why the two distributions are different. If the theta values were the same, the distributions would be the same, because everything else is shared across domains. Okay, what's more interesting is what follows. Using this structure, I can change the theta value after estimating the deep network and see what happens. I change the theta value, and I see: oh, it's only a change in the angle. You can discover that, and I think this is how we understand the differences across scenarios: we want to find a compact representation of the difference, and then we make use of the difference to make a prediction. Now let's consider another, more general question. You can see that in transfer learning, causality really helps if you know the causal relations, and if you assume the data are a kind of perfect sample from the underlying causal model. Yes, it helps, because causality provides some constraints you can make use of; and if you have the causal structure, you can further make use of invariance in the relations — invariance for better prediction. People have done very nice work in this line of research. Now the question is: do we really need to use causality, the causal picture, for transfer learning? Probably we don't have to go that far. Essentially, we are given data; we just want to use the data to discover how the joint distribution changes, and then make use of those changes to produce the optimal prediction.
That's all. So this leads to the following automated method for domain adaptation. We are given multiple-domain data. From the data, you want to find a graph, or something like it, to encode the changes. In particular, if a variable has a theta variable attached, it means its mechanism — the conditional distribution of this variable given its parents — can be different across scenarios. Those theta variables are independent; we guarantee that, and that's why we have independent changes between P(X1) and P(Y|X1) — different modules, right? We discover all of this from data, making use of causal discovery from non-stationary data and related methods. However, we don't need to assume anything causal: we just discover this property of the data; it's not necessarily causal. Then what is domain adaptation? Basically, in the target domain we are given the values of the features. Given those values, domain adaptation is nothing but inference of the Y values in this graphical model, given the observed values of the other features. So in this view, domain adaptation, or transfer learning, is nothing but inference on graphical models. You can discover everything from data: you make use of the prior distributions discovered from data and the structure discovered from data, you use the target-domain data, and it's done. This graph is not necessarily causal — let me give a simple example. Suppose the underlying causal process is: Y causes X — say disease and symptoms, in clinics. We just assign patients to different clinics according to their symptoms. If you have data generated this way and apply causal discovery — graphical-model learning methods for non-stationary data — you are going to learn this graph: P(X) can change, while P(Y|X) is the same across scenarios. Why? Because Y is conditionally independent of S given X, P(Y|X) is the same as P(Y|X, S), so P(Y|X) is invariant across domains.
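The clinic example can be simulated directly to check the claimed invariance. This is an illustrative sketch with made-up probabilities, not data from the talk:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Underlying causal process: disease Y -> symptom X (both binary;
# all probabilities here are invented for illustration).
y = rng.random(n) < 0.3
x = np.where(y, rng.random(n) < 0.8, rng.random(n) < 0.1)

# Patients are routed to a clinic S based on the symptom X alone,
# so the domain index S depends on X, not directly on Y.
s = np.where(x, rng.random(n) < 0.7, rng.random(n) < 0.2)

for clinic in (0, 1):
    m = s == clinic
    # P(X=1 | S) changes across clinics, but P(Y=1 | X=1, S) does not:
    # Y is independent of S given X, so the learned (non-causal) graph
    # X -> Y with an invariant P(Y|X) is a valid compact description.
    print(clinic, round(x[m].mean(), 3), round(y[m & x].mean(), 3))
```

Even though the true causal arrow runs Y to X, the invariant module across domains is P(Y|X) — which is exactly what the learned graph encodes and what prediction needs.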
That is what we discover from the data. Clearly this is not causal, but it is a valid, compact representation of the changing properties of the data — and that's enough. Then you can make use of this graph, together with the observed values in the target domain, to make predictions. And here you can see a simple application of the idea. Suppose we want to predict someone's location from Wi-Fi signal strengths: we have a lot of routers, and he or she receives signals from different routers; from the strengths of the signals we want to predict the location of the person. That's the setup. We can measure the signals at different times — different time periods — and then do transfer learning: say, we use the signals recorded in the morning and at noon to predict something happening in the evening. First of all, from the data we have multiple domains — we have the signal strengths and the location Y. We first construct a graph to encode the distributional properties of the data, and then, by making use of this graph and the target-domain data, we make predictions. Here you can see the accuracy given by different methods. Domain-invariant component analysis gives basically 29%, and with this automatic transfer-learning method we get 64%. You can really see a big gap between the result given by this automatic approach and the other approaches, because here we can somehow optimally make use of the changing properties of the data. Of course the prediction should be more flexible, and at the same time you don't assume anything that is fake.
That's why the result is clearly superior to the other results. Okay, to summarize: we care about causality even in machine learning. If you want to deal with the problem of adversarial attacks, you really have to find a way to derive the features that we humans use for classification; otherwise, if the features the machines use are different from our features, clearly you can always do something like an attack. We can discover causal information from data by making use of different kinds of independence relations. With conditional independence we can find the skeleton and some directions — what is known as the Markov equivalence class. With the independent-noise condition, in certain constrained functional causal models, we can discover the causal direction; we can discover the whole DAG uniquely. Furthermore, if you have non-stationary or multiple-domain data, you can make use of independent changes in the causal modules; this way you can discover the directions, as well as the skeleton, more easily. Also, if you want to do transfer learning, essentially you want to make use of a compact representation of the changes in the distribution. The more compact the model is, the better the prediction will be, because you have fewer changing variables — fewer parameters to estimate. Here the modularity implied by causality clearly matters: causality implies modularity, and if you further assume that what you observe is a perfect sample of the causal model, then you can do transfer learning that way. Otherwise, if you just observe data, you can simply make use of some graphical model to encode the independently changing properties of the data, and then domain adaptation is nothing but inference on graphical models.
Thank you very much. Basically I have a lot of collaborators, so I cannot name all of them one by one; they are listed here. (Audience: so, specifically with regard to non-stationarity and domain adaptation — in many of the settings I think about, the underlying causal model is not really on the variables I observe, but rather on some hidden process. How would you think through applying these types of techniques in that setting?) Wonderful — look at this picture. Here you can see we have theta, the latent variable. You want to explain how the observed features were generated by making use of some hidden variables — you can think of theta_i as a confounder. You can do this in a nonparametric way, because in theory you can prove the identifiability of the conditional distribution by making use of a nonparametric model of the generating process. Essentially, I can put it this way: suppose Y generates X; you want to see why and how P(X|Y) — the generating process — changes across domains, and you want to encode such information in a very compact way. You can achieve that by making use of a structure like this. Does that answer your question? Any other questions — also on the other side of the room, by the way? (Audience: okay, so you said that if you encode the changes in the different environments appropriately in a graphical model, even though it is not causal, that is still sufficient, because the rest of the problem is just inference on a graphical model. I don't exactly understand what you mean by "just inference on this graphical model" — if I do not know, in the target domain, where my changes are going to be, how can you get the causality wrong but still have a graphical model and still generalize?) Okay, very good question.
So whenever you do learning, you have to assume something, even if you just do regression, right? You assume the future data points will follow the same distribution; that's why you can guarantee the learned model will have some capacity in the future. The same thing happens here: whenever you discover this model, it's an induction problem. You assume there is a mother distribution — a hyper-distribution — over all the variables, which generates the different domain distributions, and you want to recover the properties of that mother distribution, represented by the graphical model. In this representation you have the structure, and you also have the prior distributions of the theta variables — then basically you have specified everything. Now, in the new, target domain, you observe a set of values for those features. You are given the structure, you are given the prior distributions, you are given observed values for some variables: you do inference. (Audience: yeah, but I have a little bit of difficulty, because you impose a prior on the thetas, which you learn from the data you've already observed; and with respect to that prior, you're just saying: do inference on the graphical model and you should be okay.) (Moderator: so maybe one last question, and then we should probably go to the poster session. Anybody? Everybody's shy.) (Audience: okay, so there's some recent work on invariant risk minimization, which also seems to attack the same problem you're talking about — the causal graph is completely hidden and you have a projection, like in the form of an image, and what they advocate is a criterion, right? So I just want to hear your comments: how does that compare?) Very good question, very good question.
So basically, I would say invariance-based prediction is a special case: if you know this part is invariant, then you can make use of that relationship for prediction. However, this can be far from the optimal prediction, because even if everything changes, we human beings can still make predictions — it's about how certain you are about a prediction, right? So ideally, you have to make use of not only the invariance property but everything that is informative. That's why, with inference on the graphical model, you have a posterior — you have uncertainty. The point is that even if everything changes, you can still make non-trivial predictions, and in order to make optimal predictions you have to find a way to incorporate such information. That's why this approach is more general. (Audience: just about the setting — are you assuming that you measure everything in the target as well?) Say again? (Audience: I'm just trying to understand your setting, the constraints and so on. In the same way I asked before — are you assuming that you measure all the variables in the target?) We don't measure Y in the target domain. (Audience: that's what I thought — so you want to infer the y values from the x values. So it's unsupervised multi-source domain adaptation?) Yes — and you have finite data. Okay. (Moderator: so maybe we can thank Kun again.) Thank you very much.