Our next speaker will tell us more. Let's welcome Ruben Martínez, data scientist and AI researcher at Data Hack. Ruben, hi, how are you?

Good afternoon. Hi, good afternoon.

Ready to talk about causality?

Yes, if you want, I can start right now.

Take it away, Ruben. It's all yours. Are you sharing a screen, Ruben?

Yes. Okay. Well, first of all, I'd like to thank you all for joining this event, which I've been lucky enough to take part in over the last few years. Let me introduce myself. I'm Ruben Martínez, a computer engineer, and I work as a data scientist at a company called Data Hack. And what do I work on? Over the last few years I've been developing a research project with the friend you can see in the picture on the screen, the robot Pepper, building and deploying a series of deep learning models so that this robot can assist people with Alzheimer's. Having to deploy those models in real environments is precisely what has shown me both the potential and the limitations of artificial intelligence paradigms such as machine learning and deep learning. Okay? I'll leave you here my contact details, my LinkedIn and my Twitter; feel free to write to me. I normally use my Twitter account to share papers and information related to quantum computing, data science and robotics, and you also have my website and my GitHub repository.

Well, the question I'd like to ask, and the one that motivated this talk, is: why do we need causality when we already have paradigms such as machine learning and deep learning? On this slide we can see the many use cases where these kinds of artificial intelligence algorithms are already being applied, in sectors such as health, financial services, government services, and in process optimization for energy, transport, travel and industry. We are seeing these models applied quite successfully, at least according to the media. From the outside, there is a great boom telling us that our companies must invest in projects built on this type of algorithm, and we can read this in outlets as important as Forbes, which already talks specifically about the power of deep learning. In other words, this is not the future; it is already being deployed. And, as I said, we are basically being bombarded with articles about how these algorithms will help companies take off, maximize their profits, and improve the products and services they offer their users. One thing you hear a lot about is pattern recognition, which is exactly what machine learning and deep learning algorithms carry out. These two paradigms are pieces of the field of artificial intelligence, which is much larger, much broader. So we are going to introduce a few basic statistics to work with, because what we are going to do is take a journey starting from that famous phrase: correlation does not imply causality.
So, to better follow what we are going to see in this talk, which I will try to make as practical as possible with Python, let's go over some basic statistics. First of all, we need to distinguish the concepts of population and sample. The population, as you can see on the slide, corresponds to the complete set of elements of our dataset. But this set is often too big, which makes it very complicated to work with, so what we do is take samples, subsets of the population. The idea is to work with these smaller sets and then try to infer whether the results obtained from the sample can be extrapolated to our population. That is roughly what leads us to inferential statistics. So far everything is very simple.

Then there are a few basic metrics that are used a lot, such as the arithmetic mean and the geometric mean. Throughout the presentation I'm going to show a series of formulas, precisely so that we lose our fear of formulas. Since this talk may have all kinds of profiles in the audience, the idea is that after it the fear goes away, because in the end what they express are concepts that are not complex. The arithmetic mean: we simply add all the elements of our population and divide by the number of elements. The geometric mean: we take the n-th root of the product of all the elements. Simple.

In the world of data science, and also in machine learning and deep learning, it is also very useful to have dispersion measures that tell us how far the examples of a population are from the mean of that population. That leads us to measures such as the variance: for each example of our population we subtract the mean, square the difference to avoid positive and negative differences cancelling out, and then average over the number of elements of the population. Since we have squared the differences, the result of this dispersion measure may not be very representative of the scale of the values in our population, so what we can do is use the standard deviation, the typical deviation, which is simply the square root of the variance, so that the result is expressed in the same units as the values we are working with. These are the very basic statistics we work with daily in this field.

Why start with all this? Because it leads us, for example, to want to compare two random variables, and one way of doing it is through the covariance, a measure that tells me the average degree to which the examples of each random variable deviate from their respective means. That is, for each example of the dataset we take the element at position i of the random variable X and subtract the mean of X, and we multiply that by the element at position i of the random variable Y minus the mean of Y.
Finally we average over the total number of elements of the population, and the result lies in an interval between minus infinity and plus infinity. What interpretation does this have? Well, if the covariance is positive, we can say the random variables X and Y are positively related: as one variable grows, the other grows. And if the covariance is negative, they are inversely related: as one variable grows, the other decreases. So we can get some intuition from the result of the covariance. But of course, since the range of values goes from minus infinity to plus infinity, the magnitude can be a little scary to interpret.

That takes us to another measure that is used a lot, the Pearson correlation coefficient. What we do is normalize the covariance by the product of the standard deviation of X and the standard deviation of Y. What we achieve with this is to reduce the interval of possible values to between minus 1 and 1, keeping the same interpretation as before. The problem with this statistic is that it makes a series of assumptions: the random variables X and Y must follow a normal distribution, the relationship must be linear, and it must be homoscedastic. What does that mean? That the variance of the examples of the dataset must remain constant. Those assumptions can be a little restrictive, and that takes us to a last statistic, the Spearman correlation coefficient, in which we order the values of the two random variables from smallest to largest and then calculate Pearson's linear correlation coefficient on the resulting ranks. That way we can drop the requirement that the two random variables be normally distributed.

Well, at this point, if you haven't run away yet, you may ask me: this brick of statistics, why didn't you skip it? Basically, I wanted to go through it because what machine learning and deep learning models do is find correlations, correlations that can be linear or non-linear, within the dataset, but they do not perform any kind of causal analysis by themselves. Machine learning and deep learning algorithms, by default, are not in charge of analysing what causal relationship exists between the variables; they simply measure what kind of correlations there are between the variables and the label we are working with. That leads us to the concept of spurious correlations. I wanted to share a page that is quite entertaining, which is also shown here on the slide, where you can see correlations such as the per-capita consumption of cheese being positively correlated with the number of people who died by becoming tangled in their bedsheets. In this case our intuition tells us there is no causality whatsoever, but a machine learning or deep learning model would still find this correlation.
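Before moving on, here is a minimal sketch of how the three measures just described, covariance, Pearson and Spearman correlation, can be computed in Python. The data are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Made-up paired data, purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)   # roughly linearly related to x

# Covariance: the sign gives the direction, but the value is unbounded.
print("covariance:", np.cov(x, y)[0, 1])

# Pearson: covariance normalised by the standard deviations, in [-1, 1].
print("pearson:", stats.pearsonr(x, y)[0])

# Spearman: Pearson computed on the ranks, so normality is not required.
print("spearman:", stats.spearmanr(x, y)[0])
```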
And finding such correlations is already dangerous, because although here it is very intuitive that there is no causal relationship, in other situations we may believe that what the machine learning or deep learning model is finding are the causes that produce the target we want to predict. Another example of a spurious correlation: the per-capita consumption of mozzarella is also positively correlated with the number of doctorates awarded in civil engineering. In this example there is no causal relationship either, but a machine learning model would still find this pattern. And that is dangerous.

A last example, showing a little bit of code: I have loaded a dataset with the monthly number of shark attacks, the monthly ice cream sales and the monthly temperature. If we compute the Pearson correlation coefficient we have just seen, we find that monthly ice cream sales and the number of shark attacks are very positively correlated, and we could think that maybe one causes the other, which would lead us to a wrong conclusion. Why? Because analysing the data further, which is precisely what a causal analysis would do, leads us to a concept I'm going to introduce now, although we will come back to it later: confounding variables. Confounding variables are variables that cause both our treatment variable and our target variable. The treatment variable is the one we want to know whether it is the cause of the target variable. So if we have a variable that causes both the treatment and the target, then that confounding variable is the real reason the number of shark attacks increases as ice cream sales increase, and in our case that confounding variable is the temperature: as the temperature rises, the number of shark attacks goes up, and so do the monthly ice cream sales. That is why it is very important not to stop at what machine learning models give us.

That leads us to talk quickly about the limitations of machine learning and deep learning. This summer we have seen a boom, especially with architectures such as the Transformer, which have been applied with great success both in natural language processing and in computer vision, even for tasks they had not been trained for. But if we analyse how these algorithms work, specifically GPT-3, which is based on the Transformer architecture, we realize that it is precisely the amount of data it has been trained on that allows it to be applied to such different fields. Initially we might think the model has learned, when in reality what it has done is memorize at massive scale. I wanted to finish this part with a quote from François Chollet, the creator of Keras, who tells us that deep learning models are simply a chain of geometric transformations that map one vector space onto another, one manifold onto another. But there are many situations in which that mapping is not obvious, or cannot even be expressed with simple geometric transformations, and therefore neither adding more data to our dataset nor adding more layers to our model will let us solve the problem with deep learning algorithms.
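Since the shark-attack dataset from the slides is not reproduced here, the sketch below fakes it with synthetic monthly data in which temperature drives both ice cream sales and shark attacks, so the two end up strongly correlated without either causing the other. All names and numbers are invented for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset shown in the talk: temperature is the
# confounder that causes both ice cream sales and shark attacks.
rng = np.random.default_rng(42)
temperature = rng.uniform(5, 35, size=120)                      # monthly temperature
ice_cream   = 100 + 10 * temperature + rng.normal(0, 20, 120)   # caused by temperature
sharks      = 2 + 0.3 * temperature + rng.normal(0, 1, 120)     # also caused by temperature

df = pd.DataFrame({"temperature": temperature,
                   "ice_cream_sales": ice_cream,
                   "shark_attacks": sharks})

# Ice cream sales and shark attacks come out strongly correlated even though
# neither causes the other: the correlation is induced by temperature.
print(df.corr(method="pearson"))
```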
There is one thing in particular that always worries me a little, which is that from the business side people may think that data is the new oil, a phrase that has become very famous. In reality, I would say, or I would like to qualify a bit, that the focus should be on the data generating process, because what really matters are the physical processes responsible for the events we want to predict. For that we would have to go down to a series of physical laws, both macroscopic and, at the microscopic level, quantum physics, and then back up to how those quantum phenomena collapse, when we observe them, into the laws we can see in the day to day. That is where we should pay more attention: to those data generating processes, and not so much to thinking that with a large amount of data a machine learning model will be able to predict your target. The machine learning model will not be in charge of looking for causes, within the features, of the event we want to predict; it will simply find linear or non-linear correlations, okay?

So that leads us precisely to the subject of this talk, causality, or causal inference in this case. How could we define causality? Well, if we have two random variables X and Y, what interests us is to know how the random variable Y changes when the random variable X takes certain values, or more specifically, how the distribution of Y changes when I force X to take a certain value. Forcing the random variable X, which would be our treatment, to take a certain value is what is called an intervention, and it is normally written as the probability of Y conditioned on do(X): that do operator is precisely what forces the random variable X to take a certain value. When we are working only with observational data, where we cannot force the treatment variable X to take a certain value, then we are working with what is called an observational distribution, and there what we can calculate is the probability of the label Y conditioned on observing that X takes a certain value, but now without the do(X) operator (the two quantities are written out below).

So the question is: how can we address causality problems when many times we only have access to observational data and cannot perform interventions? Because, as we have seen, a pure causal analysis would require performing those interventions, but often it is not possible, either because we do not have the means or because it is not ethical. That is precisely what brings us to the techniques of causal inference: how to infer causality from a set of observational data where we cannot perform interventions. Well, from this point on we are going to work with a dataset from a study carried out between 1971 and 1982, whose goal was to check whether quitting smoking affected weight gain, because, again, there is the belief that if you stop smoking you will put on weight.
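In the standard notation (a reconstruction of what is being described on the slide), the two quantities look like this:

```latex
% Intervention: force the treatment X to take the value x
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)

% Observation: merely condition on having seen X = x
P\bigl(Y \mid X = x\bigr)

% In general these two distributions differ whenever X and Y share common causes.
```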
Well, precisely such a study was carried out over 11 years to check this fact, okay? In our case, I'll say in advance that the treatment variable is quitting smoking, and the target variable is the weight difference between the beginning of the experiment, when the individuals were weighed in 1971, and the end of the experiment, when they were weighed again in 1982. So the label is that weight difference between 1982 and 1971, and the treatment, as I say, is precisely quitting smoking.

Okay, I'm going to go through this part a little quickly; if anyone is curious, the code is on my website, or you can get in touch with me and I can share more code of this kind that I have written. We are going to be working with a library called causalinference. There are others too that let you work with causality from Python; one I want to get into is DoWhy, from Microsoft. If anyone is curious, contact me and we will keep talking.

Our dataset is loaded here, and this is the URL if anyone wants to download it. Here I have done a bit of exploratory data analysis; I'm going to skip some slides because of limited time, and what I want to focus on is what the treatment feature is going to be, what the label feature is going to be, and what our covariates are going to be, that is, features that also have an influence but that are neither the treatment nor the label. Here we see some of the features. Going down a little: here I have checked whether there are null values, here I have described the dataset, here is the head with the first five rows, here the column types, and here the shape, because we have 1629 rows and 64 features. The columns are all the ones you see. Our treatment feature, as I said, is quitting smoking, qsmk, which we have here, and our label is precisely the weight difference between 1982 and 1971. Then we have a series of covariates that we will see.

One thing to note is that, since the experiment lasted 11 years, there were unfortunately people who died during it, so what I have done is clean the data of those individuals who could not be weighed in 1982. Here I show which features are the treatment, the output and the covariates. In total we start with 1566 smokers. The treatment variable, which we are going to call A because it is quitting smoking, is binary: A equal to 1 means the individual quit smoking, and A equal to 0 means the individual did not quit. And here we see there is a big imbalance, although that shouldn't worry us for now: roughly 400 people quit smoking and 1163 did not.
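A minimal sketch of that loading and cleaning step, assuming the file has been downloaded locally (the URL from the slide is not reproduced here). The outcome column name wt82_71 is an assumption; only the treatment column qsmk is named explicitly in the talk:

```python
import pandas as pd

# Placeholder path: the slide shows a download URL that is not reproduced here.
nhefs = pd.read_csv("nhefs.csv")

print(nhefs.shape)                        # about 1629 rows and 64 columns in the talk

# Treatment A: qsmk (1 = quit smoking between 1971 and 1982, 0 = did not quit).
# Outcome Y: weight change 1982 - 1971 (column name assumed to be wt82_71).
nhefs = nhefs.dropna(subset=["wt82_71"])  # drop individuals not weighed in 1982

print(len(nhefs))                         # 1566 individuals remain in the talk
print(nhefs["qsmk"].value_counts())       # imbalance: roughly 403 treated vs 1163 control
```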
And here are those same values normalized as proportions; that qsmk column we keep in its own variable. The output is this feature, the difference in weight between 1982 and 1971 for each individual. Here we have the distribution of the values this variable takes: a positive value means the individual gained, for example, 3 kilos, and negative values mean they lost, for example, 4 kilos, and so on; and we keep it in its own variable as well.

Now the confounding variables, which, as I said before, are the variables that influence both the independent variable, the treatment, and the dependent variable, the outcome, creating spurious relationships. They are going to be, for example, the ones we have here: active, which is a measure of the daily activity the individual had in 1971; age; education; recreational exercise. We are going to look especially at active, the daily physical activity of the individuals, which takes three different levels: 0 for an inactive person, 1 for moderate exercise, and 2 for more intense activity. And there is another series such as sex, the smoking intensity, the number of cigarettes, along with the description of those values.

Above all, as I said, we look at the distribution of the label. For the control group, those who did not stop smoking, we have 1163 individuals, and for those who did stop smoking, 403; as we see, it is very unbalanced. Then here what I have done is plot it with kdeplot and histograms, simply to give us an idea of how the data are distributed between the treatment group and the control group (a short sketch of this comparison appears below). One thing we can also see is the average weight difference between the end and the beginning of the experiment: for the control group around 1 to 2 kg gained on average, and for the treatment group, those who stopped smoking, around 3 to 4 kg gained on average. Doing just this simple exploratory analysis makes us think, first of all, that the treatment, quitting smoking, has a positive impact on weight gain.

Well then, let's start with what would be the causal inference analysis. Why causal inference? Because in reality we are given these data; we are not performing interventions properly speaking. A pure causality analysis could be done, as I said, by performing interventions, for example running a randomized controlled trial, in which our entire population is randomly divided into two groups of the same size, one the treatment group and the other the control group. The key part here is that the individuals are assigned randomly; with that randomness, what we are trying to remove is what is called selection bias.
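The comparison just described, sketched with pandas and seaborn, continuing from the loading snippet above with the same assumed column names:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Naive comparison of the outcome between control (qsmk == 0) and treatment (qsmk == 1).
print(nhefs.groupby("qsmk")["wt82_71"].agg(["count", "mean", "median"]))

# Distribution of the weight change in each group, as in the kdeplot on the slides.
sns.kdeplot(data=nhefs, x="wt82_71", hue="qsmk", common_norm=False)
plt.xlabel("Weight difference 1982 - 1971 (kg)")
plt.show()
```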
Running that randomized controlled trial would, in the end, allow me to calculate a measure called the average treatment effect for the entire population, that is, the average effect of the treatment. I'm going to put the formula here and tell you a little about it. What we have to calculate for the average treatment effect is the expectation of the output for individuals to whom the treatment has been applied minus the output for individuals to whom the treatment has not been applied. That leads me to a framework, the potential outcomes framework, with a notation we are going to be working with: when the output corresponds to an individual who has not received the treatment, I am going to call it Y sub 0, and when the output corresponds to an individual who has received the treatment, I am going to call it Y sub 1.

To calculate the average treatment effect we would compute the expectation of Y1 minus Y0. That is, we would take the Y1 column for all the individuals who have received the treatment, the weight difference between the end and the beginning of the experiment for all those who stopped smoking, and take its average; we would do the same with all the individuals who did not stop smoking; and then we would subtract those two quantities, the average of Y1 over the individuals in the treatment group minus the average of Y0 over the individuals in the control group. That would be the average treatment effect, in the case where we could run that randomized controlled trial.

But if we have an observational distribution, the only thing we can access, as I was saying, is not the expectation of Y1 minus Y0 but this other formula: the expectation of Y conditioned on A equal to 1 minus the expectation of Y conditioned on the individual not having received the treatment, A equal to 0. Or, what is the same, the expectation of Y1 conditioned on A equal to 1 minus the expectation of Y0 conditioned on A equal to 0. And this is normally different from the real average treatment effect we calculated previously by running a randomized controlled trial. You may ask why these two quantities are different.
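Written out in potential-outcomes notation (a reconstruction of the formulas on the slides), the two quantities are:

```latex
% Average treatment effect: the interventional quantity an RCT estimates
\mathrm{ATE} \;=\; \mathbb{E}[\,Y_1 - Y_0\,] \;=\; \mathbb{E}[\,Y_1\,] - \mathbb{E}[\,Y_0\,]

% What observational data give directly: the associational difference
\mathbb{E}[\,Y \mid A = 1\,] - \mathbb{E}[\,Y \mid A = 0\,]
\;=\; \mathbb{E}[\,Y_1 \mid A = 1\,] - \mathbb{E}[\,Y_0 \mid A = 0\,]

% Without randomization (or ignorability given covariates X), the two need not coincide.
```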
They are different precisely because calculating the expectation of Y1 conditioned on the treatment taking a certain value is not the same as calculating the expectation of Y1. Why? Because we are working with different datasets. To calculate the conditional expectation we only use some individuals: we take the Y1 of those individuals whose treatment value is 1, that is, the individuals we observe to have received the treatment. In the other case we would be working with the whole population, or with a randomized selection, where the assignment to the treatment group or the test group has been randomized rather than merely observed. The datasets we work with from an observational distribution and from an interventional study are different, so an effect seen in an observational distribution may not be the same as the real effect.

What we are going to have to do, then, is make a series of assumptions. When we run a randomized controlled trial we randomize the assignment of the treatment, and what that gives me is that the outputs are independent of the treatment: Y0 and Y1 are independent of the treatment A. If we write the joint distribution, the probability of the treatment A together with Y0 and Y1 equals the probability of A times the probability of Y0 and Y1, because they are independent. Then we can say that the expectation of Y sub a conditioned on the treatment A is equal to the expectation of Y sub a.

That leads us to what I have put on this slide: if we want to emulate a randomized controlled trial from an observational distribution, we have to assume that we have enough additional information to explain completely how the treatment was assigned to each individual. If we call that additional information explaining the treatment assignment the set of covariates X, which can be several variables, then we have that the outputs, Y0 and Y1, are independent of the treatment assignment given those covariates X, which we can write as (Y0, Y1) being independent of A conditional on X. That means the joint distribution of the treatment and the two potential outcomes, conditioned on the covariates, equals the probability of the treatment given X times the probability of the potential outcomes given X; since they are independent conditional on the covariates, the joint distribution factorizes. This is what is called the ignorability assumption, and the catch is that if our set of covariates does not contain all the possible confounding variables, then any estimate we make will not be reliable, and that is a problem. And then there is the limitation known as the fundamental problem of causal inference: for a given individual we will not be able to observe both potential outcomes simultaneously. Why?
Because if an individual receives the treatment, then we can only observe Y1, which is what we will call the actual outcome, and the Y0 we did not observe is the counterfactual. It is precisely...

Ruben, I'm sorry to interrupt you, Ruben, but we really are running out of time. You did warn me, you told me that you had a lot to say. Would you mind just wrapping up in 30 seconds to a minute, and then we'll ask you some questions? So sorry to interrupt.

Perfect, I'm going to finish right now with an overview. What I wanted to tell you is that, from these assumptions, what we do is use machine learning models fitted on the outcomes we are able to observe, and with that machine learning model we model the output we do not see; then we can calculate that average treatment effect I told you about before. That takes us, for example, to using a linear model, as we see here, and when we apply that linear model we obtain a positive effect. That would be one technique, using a linear model; you can also use other types of models, such as matching, and you can try to address the imbalance between the two groups with techniques such as trimming or propensity scores (a minimal sketch of that workflow appears a bit further down). All of that, if you are curious, is in this notebook on my website, where you can check it. The trick of the potential outcomes framework is precisely to model the label we do not know with some machine learning model and to assume the ignorability assumption, that the covariates completely explain the treatment assignment. That would be a brief summary of how to address these kinds of projects. If there are any questions, we can answer them.

This has been a very interesting talk; it has been a privilege to hear such an expert, and I think you could have gone on for so much longer as well. You had me very worried at one point; I was very scared when you started talking about cheese. I love cheese, so I had a small scare there. It's very dangerous, you have to watch out or the relationship comes back around and you end up getting tangled up there. We don't have that much time, but there is one thing I was curious about, Ruben. When the Covid pandemic started to really grip the world in April and May, I found myself at home, like many of us, with more free time perhaps than we were used to, and I found myself in Excel making a column for each country, looking at the death rates and the infection rates: what could be the causes of why there was such disparity, why were some countries so much worse than other ones? It didn't seem to be anything to do with the wealth of the country; I looked at variables such as religion and all kinds of things, but obviously very superficially. I wondered if you had applied causal reasoning in any sense to what is happening right now in the world.

I have to say that it is very difficult to carry out a causal inference project on that, because in the end it is a physical process, and if we lack understanding of the physics behind the problem it is very difficult to get an accurate result. Because of that, as I said at the beginning of the talk, you have to dig into the real confounding variables of your problem; you have to have a very good understanding of your problem, because the key here is to identify the confounding variables you could have, and these confounding variables could be unobserved confounding variables. There are several techniques to fight against these kinds of problems.
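For reference, the workflow described a moment ago (linear adjustment, propensity scores, trimming, matching) looks roughly like this with the causalinference library mentioned earlier. The covariate list is an illustrative assumption, not the exact set used in the talk's notebook, and the column names follow the loading sketch above:

```python
from causalinference import CausalModel

# Continuing with the cleaned NHEFS frame from the sketches above.
# Covariate list assumed for illustration; the talk uses a larger set
# (activity, age, education, exercise, sex, smoking intensity, ...).
covariates = ["age", "sex", "smokeintensity", "exercise", "active", "wt71"]

Y = nhefs["wt82_71"].values          # outcome: weight change 1982 - 1971
D = nhefs["qsmk"].values             # treatment: quit smoking (1) or not (0)
X = nhefs[covariates].values         # covariates assumed to satisfy ignorability

cm = CausalModel(Y, D, X)

cm.est_propensity_s()                # propensity scores P(A = 1 | X)
cm.trim_s()                          # drop units with extreme propensity scores
cm.est_via_ols()                     # adjustment with a linear model
cm.est_via_matching()                # nearest-neighbour matching on the covariates

print(cm.estimates)                  # treatment-effect estimates under each method
```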
And if there is anyone interested in these kinds of topics, you can keep in touch with me to talk or to discuss these kinds of problems and solutions.

Okay Ruben, let me finally ask you one more thing. I'm going to need a very quick answer, but someone here was curious to know if there are any real business applications for what you've been talking about.

Well, in every business you could apply these kinds of techniques, and I have to say that you should apply them to every business, to every model, because in the end it is a must: machine learning and deep learning models have limitations, and causality will be the next step in the analysis of your project. If you don't do this kind of analysis you are going to miss a lot of information, a lot of useful information.

So it's all to come.

In the end you should try this kind of analysis to be sure about your conclusions.

Ruben, thank you so much indeed, that was a fascinating talk. I'm sorry to have cut you a little bit short, but time is our enemy here, so we've come to an end. Thank you once again, Ruben. We're going to take a 4-minute break and then we'll be back here at the garage. Thank you very much.