Hello everyone. Today we'll discuss the mathematical ethics of clinical trials. This is the podcast of the AI safety working group in Lausanne, Switzerland. This topic is very relevant in the current context of COVID, where people are looking for drugs, but as we'll see in our discussion, it is also very relevant when experimenting on humans on a large scale, and in the general context of safe decision making.

Yeah, this is pretty relevant for large-scale systems, but maybe you can talk first about clinical trials, which are a big deal these days with the COVID-19 situation. What's quite remarkable is that the WHO, the World Health Organization, has started a new clinical trial based on some very mathematical ideas around the tradeoff between exploration and exploitation. Maybe you can discuss this a little bit?

Yes, so the goal of clinical trials is to come up with an estimate of how good different clinical interventions are. This is the part that is called exploration: exploration consists of testing several methods in order to measure how good each of them is. On the other hand, there is something else we desire a lot, which is exploitation: once we have collected information about the different available methods, exploiting it means using the methods we know to be the best. This can be case dependent or context dependent, but exploitation always consists of using the information we have collected to make good decisions. Overall, both exploration and exploitation are needed.

An example to illustrate this: you have the choice between two treatments, A or B. Treatment A has already been given to 1,000 people, and you estimated a 50% success rate. Treatment B has not been tested much, say only three cases, and it was successful in just one of them, so you estimated a 33% success rate. Now you have 1 million people to treat with either treatment A or B. Pure exploitation would say: let's use only treatment A, since its estimated success rate is higher. But this can be a mistake, simply because if I now ask you how sure you are that treatment B is less effective than treatment A, the answer is that you should not be totally sure of it. Treatment B has only been tested on three cases. It's possible that it actually has a 70% success rate and that these three cases were simply a little unlucky, which is why you estimated only a 33% success rate. So if you want to implement a strategy that will save more lives among the 1 million people you have to treat, it's necessary to do some exploration with treatment B, just so that you get a better estimate of its chances of success. Maybe you will confirm that it's less than 50%; in that case, treatment A is the option you want to use on most of these 1 million patients. But there is a small chance that treatment B is better, and missing this chance concerns the lives of more than a thousand people. So that's why both exploration and exploitation are needed.

Yeah, I'm going to try to generalize this point right away, because it's extremely general. And the reason it's extremely general is that you can just replace treatment A and treatment B by action A and action B, and anything.
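To make the uncertainty argument above concrete, here is a minimal sketch (our own illustration, not anything from the trial itself) that places Beta posteriors on the two unknown success rates. The exact success count for treatment A (500 out of 1,000) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Posteriors under uniform Beta(1, 1) priors:
# Treatment A: assumed 500 successes out of 1,000 -> Beta(501, 501)
# Treatment B: 1 success out of 3 (from the example) -> Beta(2, 3)
p_a = rng.beta(501, 501, size=100_000)
p_b = rng.beta(2, 3, size=100_000)

print(f"P(B is actually better than A) ~ {np.mean(p_b > p_a):.2f}")   # ~0.31
print(f"P(B's true success rate > 70%) ~ {np.mean(p_b > 0.70):.2f}")  # ~0.08
```

Even though B's point estimate is only 33%, roughly 30% of the posterior probability mass has B beating A, which is why pure exploitation of A is premature.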
And usually when we think of ethics, ethics is a lot about decision making. When we think about ethics, I think intuitively we have this sense that there's the right thing and the wrong thing to do, and that it's a matter of doing the right thing. But prior to this, there's actually a very important phase, which is understanding what is good and what is bad. In the case of the COVID situation, for instance, this turns out to be very, very difficult, because there are different treatments. Right now, the WHO is testing four different treatments in addition to the standard treatment, and the differences between the treatments may be very small. You may say, well, maybe it's not that big a deal if we're not successful at telling apart the effectiveness of two different treatments. But suppose there's only a one-person difference between two treatments, meaning that one treatment saves one more life every 100 patients than the other. Treatment A is slightly better than treatment B. We would say, yeah, of course we should use treatment A. But the problem is: how can you make sure that you detect this small deviation? This turns out to be very, very difficult. In fact, if you do rough calculations, you can see that the number of experiments you have to be doing is essentially one divided by this difference, squared. The difference here is one patient every 100 patients, a one percent difference. This means that the order of magnitude of the number of tests you should be doing is one divided by one percent squared, and that's 10,000. That's already huge; that's more than what we usually do for clinical trials. So that's really, really big. And potentially you may need to detect even finer differences; maybe the differences are going to be smaller. And it is a big deal, because if you can save one percent more lives among the patients that are in critical condition from COVID-19, then, depending on sources, the total number of cases in the world is sometimes estimated at hundreds of millions, if not more. Let's say 10 percent of them are critical cases. Then if you can save one percent of these 10 percent of 100 million cases, this turns out to be 100,000 lives. So making sure you're doing the exploration right is a matter of saving 100,000 lives, and thus it is a huge, huge deal.

Just to explain maybe where this is coming from: the exploration-exploitation dilemma is, as you said, a very general dilemma that you can face whenever you're doing decision making and you have to choose between decision A or decision B, public policy A or public policy B, health decision A or B, or social welfare choice A or B, etc. The thing is that in the second half of the 20th century, there was a field that had to study this with large numbers and on tight deadlines, and this field turns out to be computer science. In computer science, we had to make these decisions at scale and fast. This may seem weird to people coming from outside computer science, but a place where this has found a lot of applications is, for example, online advertising.
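To spell out the back-of-the-envelope calculation from a moment ago: a success rate estimated from n patients has a standard error of roughly 0.5/sqrt(n), so resolving a gap of Δ between two rates needs n on the order of 1/Δ². A minimal sketch (our own, with the confidence-level constants that real trial designs include deliberately omitted):

```python
import math

def trials_needed(delta: float) -> int:
    """Order-of-magnitude number of patients needed to resolve a
    success-rate gap of `delta`. Real designs multiply this by
    constants depending on variance and confidence level."""
    return math.ceil(1 / delta ** 2)

for delta in (0.10, 0.05, 0.01):
    print(f"gap of {delta:.0%} -> ~{trials_needed(delta):,} patients")
# gap of 10% -> ~100 patients
# gap of 5%  -> ~400 patients
# gap of 1%  -> ~10,000 patients
```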
In online advertising, whenever you are on a social network, there's an optimization of which ad you are seeing that is the result of an exploration-exploitation experiment, where we test an advertising policy on you, and another advertising policy on another group, and then online, along the way, we're optimizing, selecting, and transitioning from A to B. In clinical trials, we talk in terms of a clinical arm or trial arm; imagine you have many arms, and the mathematics of this kind of decision making, of this kind of optimization, turns out to be very, very relevant for building the adaptive clinical trials the WHO is trying to run. So though the maths was developed in computer science, we believe it is very urgent that it becomes better known in other fields. And it is urgent: we are now in an urgent situation where people outside the fields of online advertising or machine learning should know about what we call multi-armed bandits, and all the developments that have been made on exploration-exploitation, so that we don't jeopardize 100,000 lives.

And with multi-armed bandits, this nice framework to think about this, which you've just presented, we can actually ask the question: when should we stop exploring, or carry on exploring, before moving to the exploitation phase? Typically for clinical trials, the question would be: what is the sample size, the size of the set of patients we are going to run the test on? The mathematics of multi-armed bandits has given us answers to these questions, at least in some models. And the answer, if you optimize this sample size, turns out to be 120,000 people, if you assume that there are going to be 10 million critical cases. 120,000 people is huge; it's maybe 100 times more than what we usually do. So it's very important not to neglect the mathematics, because again, there are hundreds of thousands of lives at stake.

But maybe you can do better than this, because here I'm just talking about a case where you first do pure exploration, you test all of your treatments, and then pure exploitation, you only administer the best treatment out of the ones you've tested. It turns out that mathematicians have found a way to optimize even this: instead of having a sharp distinction between exploration and exploitation, you can have a smooth transition between the two. Essentially, as you go, you use the data that you've learned to not cause harm to the patients you're doing the test on, but you're still exploring, so that future patients can receive the best possible treatment. And there's, again, a very nice framework to do this, and multiple algorithms. One of the most successful algorithms is called UCB, for Upper Confidence Bound. Essentially, what Upper Confidence Bound does is use a measure of the uncertainty of the effectiveness of the different treatments. So you can imagine a treatment that seems bad, but on which there have not been a lot of tests so far: it's probably worse than the best treatments, but there's still a lot of uncertainty, and maybe it's actually better and you just haven't gathered enough data to know it yet.
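As a concrete illustration, here is a minimal UCB1 sketch (our own toy version, not the WHO's actual protocol): each arm is a treatment, and we always administer the arm whose upper confidence bound on the success rate is currently highest. The true success rates are made up:

```python
import math
import random

def ucb1(success_probs, n_patients):
    """Simulate UCB1: always give the treatment whose upper confidence
    bound on the success rate is currently highest."""
    n_arms = len(success_probs)
    counts = [0] * n_arms     # times each treatment was administered
    successes = [0] * n_arms  # times it worked

    for t in range(1, n_patients + 1):
        if t <= n_arms:
            arm = t - 1  # initialize: give each treatment once
        else:
            arm = max(range(n_arms), key=lambda a:
                      successes[a] / counts[a]                    # empirical mean
                      + math.sqrt(2 * math.log(t) / counts[a]))  # uncertainty bonus
        counts[arm] += 1
        successes[arm] += random.random() < success_probs[arm]
    return counts

# Hypothetical true success rates: 50% vs 65%.
print(ucb1([0.50, 0.65], n_patients=10_000))  # the better arm gets the large majority
```

The bonus term sqrt(2 ln t / n_a) is what keeps an undertested arm in play: while its count is small, its upper bound stays high, so it still gets tried from time to time.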
So whenever this is the case, every now and then you're going to try this treatment a little bit, very rarely, not too frequently, but every now and then, to reduce this uncertainty.

Just one small comment: some people might be scared hearing that we need 100,000 or 120,000 people in experiments. Actually, the mathematics that gave us these numbers also gives us good news. For instance, if the difference between treatments A and B is significant, so if in the first trials you see that there is, I don't remember the exact computation, let's say a 15% higher success rate, so if the success rate of A is 15% larger than the success rate of B, then you only need what, 2,000? Yeah, 2,000 patients. So these mathematics are not only useful to know that we need a lot of people in clinical trials when the difference in drug success rates is very small. They also tell us that if there is a significant difference, you can stop after a small number of trials, and then give people treatment A and stop giving them treatment B.

Yeah, that's a very important point, because right now in most clinical trials, you first set a sample size and then you run the experiment over your whole sample, even if along the way you found out that one treatment was definitely better, like hugely better than the other. Using adaptive clinical trials, which is an idea that's been out there for the last maybe 30 or 40 years, but for which over the last decade we really have compelling mathematics, and there's a huge push towards these ideas, mostly by biostatisticians, you can optimize as you go: as you collect more and more data, you can use this data to know when you should stop the experiment and move to exploitation. And this is again a matter of saving lives. It's not that one of the treatments is going to be bad. I think there's this idea that when we run experiments, there's one good and one bad treatment, but what can happen is that there's one good treatment and one very good treatment, and you want to find this out very quickly, early on. You want to tell them apart, so that you know that this good treatment, even though it's saving more lives than no treatment at all, should still be discarded, because there's a better one. So the framework of multi-armed bandits, and in particular this idea of adaptive clinical trials, is critical if you want to maximize the number of lives saved.

Another advantage of these adaptive methods is that they are very flexible. For example, in the case where a new treatment is added during the course of the experiment, these solutions to the multi-armed bandit problem easily adapt to the new treatment. A new treatment will simply be one on which we have high uncertainty from the start, so at first it will be selected a lot, in an adaptive manner. But if the first tests with this treatment are negative, then it will very quickly stop being selected. On the other hand, if it shows performance similar to or better than the best treatment, then it will be selected more and more. Whereas the way experiments on treatments are run today, as you described before, they don't use these multi-armed bandit processes, and it's very complicated to make them flexible enough to incorporate a new treatment in the middle of an experiment.
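Going back to the early-stopping point above, here is a sketch of the idea (our own simplified version, not an actual trial design): alternate between two arms and stop as soon as their confidence intervals separate. The true success rates, the z = 3 interval width, and the minimum of 30 patients per arm are all assumptions for illustration:

```python
import math
import random

def stop_when_separated(p_a, p_b, max_patients=100_000, z=3.0):
    """Alternate two treatments until one arm's lower confidence bound
    exceeds the other's upper bound, then stop and report the winner."""
    counts, wins = [0, 0], [0, 0]
    for t in range(max_patients):
        arm = t % 2  # alternate A, B, A, B, ...
        counts[arm] += 1
        wins[arm] += random.random() < (p_a, p_b)[arm]
        if min(counts) >= 30:  # wait for a minimal amount of data per arm
            mean = [wins[i] / counts[i] for i in (0, 1)]
            radius = [z * math.sqrt(mean[i] * (1 - mean[i]) / counts[i])
                      for i in (0, 1)]
            if mean[0] - radius[0] > mean[1] + radius[1]:
                return "A", t + 1
            if mean[1] - radius[1] > mean[0] + radius[0]:
                return "B", t + 1
    return "undecided", max_patients

random.seed(1)
print(stop_when_separated(0.50, 0.65))  # 15-point gap: stops after ~1,000 patients
print(stop_when_separated(0.50, 0.51))  # 1-point gap: typically exhausts the budget
```

The large gap typically resolves after on the order of a thousand patients, while the one-point gap tends to run out of budget, matching the 1/Δ² scaling from earlier.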
You can show that you get great gains in terms of expected number of lives saved, or the effectiveness of the treatments overall, in both the exploration phase and the exploitation phase. Ethically, it's very, very important. Again, it's a matter of a huge number of potential lives, of psychological dramas for all the families, and so on. You really want to identify the best possible treatments rigorously, quickly, and efficiently. This framework is really posing that question, and it's really solving, in a sense, a huge ethical problem that was unsolved before.

Which brings us maybe to one aspect of this: how much it is overlooked by researchers. I will be provocative and include the researchers who work on it. For example, in the machine learning community, the fact that multi-armed bandits were motivated by adaptive trials is known. But if you ask many people in the machine learning research community about multi-armed bandits, most often they think of it as a way to optimize ads, because this is what we hear these days. The thing is that it is there, it is ready, it is already deployed, and it doesn't need too much change to be used for something else. We will maybe develop this later in the discussion, but it can also be used for experimenting with policies on social networks: decreasing hate speech, all the bad things we don't want to have in our public discussion.

Yeah, I think there's a reflex, too often maybe associated with the scientific method, the idea that you should first set a sample size and then do the experiment. This has been the norm in science over the last century, thanks to work by Fisher and all the statisticians. But arguably, we've grown since then; mathematics has evolved, and we have better tools to do these experiments in a more efficient way, in a more rigorous way, but also in a more ethical manner. I think it's very important that everyone understands this idea a little bit, that there's investment in the pedagogy and the explanation of these ideas, because this is extremely relevant for clinical trials, for sure, but not only for clinical trials.

Maybe you want to move on to other applications of this framework, to other areas. I guess one thing we have in mind is trying to understand the impact of social media algorithms, especially recommender systems, on people's behavior in general. I think we've discussed this a lot on this podcast, but this is extremely important because there is a huge number of people on these social media. Just to recall a few numbers: YouTube is used by two billion users worldwide, with an average of half an hour per user. That's one billion hours of watch time on YouTube per day. And most of it is coming from the recommender system: 70 to 80% of it comes from the recommender system and not from active search by the user.

Yeah, so it's really important to know what the implications of this are. What does it imply in terms of people's willingness to wash their hands, to keep their social distance these days? How does it affect what people think about climate change, about different potential risks, about future pandemics, about investments in education in different fields? I think it's a big deal, and we don't have a lot of data about this. We are in a situation where there's nothing to exploit yet, because we don't have an understanding of the impact of these algorithms. At least, arguably, not sufficient understanding to know what ought to be done.
In this context, if you want to do good, you first have to acknowledge that there's a lot of uncertainty, and, if you apply the exploration-exploitation framework, you should try to do some exploration to better understand what's effective, what the robustly beneficial actions are that you should be undertaking. And arguably you should apply this multi-armed bandit framework to what the algorithms are doing these days, and use this information to minimize the harm they're doing, or maximize the good they're doing.

One clear example of this kind of experiment on a social network is something we already talked about on the podcast: the experiment on emotional contagion on Facebook, where they slightly modified the algorithms to show more negative messages or more positive messages on the Facebook feeds of different users. And they observed a statistically significant difference: the reactions of users were changed by something between one per thousand and one percent. And this experiment was done on a very large number of users. What we recommend here is to do similar experiments following the multi-armed bandit concepts. And the reason is that this kind of experiment can yield very important information and can have a very positive impact, but the risk is that it can also have a negative impact. Showing more negative messages on Facebook made users post fewer positive messages and more negative messages. It's clear that this can be something we don't desire, and when you experiment, some of this is inevitable.

Just maybe, before you move on developing this, I want to react to something. These algorithms, these recommender systems: we say you can study them to make them do good and not do bad, and I can already see people thinking that you then have to define what is good and what is bad. We're not talking about defining good or bad in an absolute way. You can just pick a subset of very consensual things that society agrees on. For example, we don't want people to commit suicide; this is something we would arguably all agree on. So you can study the algorithms and understand how much they could influence, for example, suicide rates, and arguably we would all agree that our common goal is to minimize suicide rates. And then you can grow a list like this of topics on which you have social consensus. We're not talking about defining good or defining bad in a mathematical way, even though the title of this podcast is mathematical ethics. It's about starting from the easy topics on which we have a social consensus: let's reduce insults, let's reduce hate speech, let's not trigger suicide in teenagers, etc. So you can study these algorithms and their effects on these consensual topics, and by studying them, you can make them do less harm, at least on the topics on which you have social consensus.

Yeah. And where multi-armed bandits are interesting is that while doing this exploration, while doing this experiment, we are affecting the world, and we want to be affecting the world in the best way possible during the experimentation phase. That's what multi-armed bandits give us: the solution that, in expectation, has the highest positive impact on the world while you are still doing your exploration and your experiments.

Yeah. We stress that the exploration is needed to do good in the future. It's not exploration for exploration's sake, which is sometimes criticized, and arguably for good reasons.
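To illustrate the claim that a bandit keeps impact high during the experiment itself, here is a toy comparison (entirely our own, with made-up "positive impact" rates): a fixed 50/50 experiment keeps sending half the users to the worse policy for its whole duration, while UCB shifts traffic to the better policy as evidence accumulates.

```python
import math
import random

def run(policy, probs, n=20_000):
    """Apply `policy` to n users and count good outcomes."""
    counts, wins, total = [0, 0], [0, 0], 0
    for t in range(1, n + 1):
        arm = policy(t, counts, wins)
        counts[arm] += 1
        ok = random.random() < probs[arm]
        wins[arm] += ok
        total += ok
    return total

def fixed_split(t, counts, wins):
    """Classic fixed-design A/B test: alternate arms for the whole run."""
    return t % 2

def ucb(t, counts, wins):
    """Bandit: empirical mean plus a shrinking exploration bonus."""
    if 0 in counts:
        return counts.index(0)
    return max((0, 1), key=lambda a: wins[a] / counts[a]
               + math.sqrt(2 * math.log(t) / counts[a]))

probs = [0.50, 0.65]  # hypothetical impact rates of two policies (assumptions)
print("fixed 50/50 :", run(fixed_split, probs), "good outcomes")
print("bandit (UCB):", run(ucb, probs), "good outcomes")
```

The gap in good outcomes between the two runs is the ethical cost of a rigid protocol, which the adaptive approach reduces.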
But here, we want to do good, and if you want to do good, then you need to do some exploration. So just to simplify again, for people who are hearing about these topics for the first time: exploration is the phase where you want to learn about the state of the world, what works, what doesn't, what action causes which effects. And then exploitation is about taking the action. To take this action, you need to have done some exploration initially, so that you know that action A would lead to fewer suicides than action B, and that's why you take action A.

Yeah, maybe it's worth stressing this a bit more still. There are things we want about the world: we want people to be happy, to be enjoying themselves, to have intellectual fulfillment or whatever, to be in good health, and so on. Not to be infected by COVID-19. There are so many things we agree on and that we think are desirable. The difficulty is that for the actions we take, and this can be a medical treatment, but it can also be what is recommended by the YouTube recommender system, it is not clear what the impacts are going to be on all of the things that we care about. And if you think we already know what this impact is, then we are probably hugely overconfident, because this is extremely complicated. For instance, there's this paper showing that if you want to reduce polarization, then maybe adding diversity to the recommendations is actually a bad idea. What they did is that they got people to follow Twitter accounts of the opposite side, and probably what happened is that people got exposed to the politicians they already hate, and this made them hate them even more. So apparently good ideas may actually be very bad ideas, because the world is much more complex than we usually think it is. And in order to understand what is good, and to take actions that are robustly good for the end goals we have in mind, we absolutely need to understand the impacts, the implications of the different things we're doing. This is very, very difficult, and it requires a lot of exploration, which may be a bit harmful, but we have to do our best for it not to be too harmful. And that's exactly what the multi-armed bandit framework is giving us.

So just to put this in context and maybe be more explicit: using adaptive clinical trials, using multi-armed bandit frameworks, you won't explore things that actually lead to suicide. You would explore very incremental things that might suggest that the mood is going bad, and in a very slight manner, so that you can get the insight very quickly and know that this is not what you should show people. So this is something safer.

Yeah. And again, you may have different treatments that are all positive, that all help people in terms of mental health, but maybe there's one that's much more effective than the others. Using this framework, you can detect this much more quickly and do a lot more good. So yeah, I think it's very, very important that more people get familiar with these ideas, that they become something everybody knows about. And unfortunately, so far, I think we're still a bit far from this being mainstream. Yeah. Multi-armed bandits very rarely pop up
in discussions about ethics, the ethics of decision making, the ethics of algorithmic decision making. But they pop up a lot in terms of risk minimization, in investments and in ads.

Yeah. So we covered that. In the first part, we really wanted to insist that, given the current debates and controversies around molecules, what works for COVID, what doesn't work, we are facing a sickness where the rate of spontaneous healing, the rate of people who recover without any treatment, is quite high. So it's very easy to get confused. We're facing a sickness where, let's say, 90-ish percent of people recover without any treatment. So if I take a sample of 100 people, without any control group, without any proper clinical trial protocol, and I give them just water, about 90 percent of them will heal. Should I conclude that water is a treatment? This is simplified, but this is almost where we are now. And it's very dangerous, because many controversies are growing based on arguments that, for example, deny the role of mathematics in clinical trials, or claim that we don't have reliable tools to quantify things. It is wrong to say that we can't quantify the ethics of clinical trials. And that's maybe one of the reasons we wanted to do this topic this week: we should raise awareness that there is very serious research on this, and people should be more aware that it exists.

Yeah, I guess one of the things we're trying to push forward, in this episode but in the podcast in general, is the idea that mathematics is actually very important to ethics, in general and in computer science. And the idea of computation is really critical, because in the end, ethics is about making the right judgments, and making the right judgments depends a lot on understanding and computing the outcomes: having the data and doing something with the data. This is computation, and you need good computation to have good ethics. Now, the second aspect, which I hope was clear from the second part of the podcast, is that beyond the current debates about how to do proper clinical trials, there is also the AI safety discussion, which may have somewhat overlooked this question of how to experiment on humans on a large scale, and how to do the ethics of social media in a more mathematically informed way. We hope to have conveyed this in the last part of the podcast.

Yeah, because, and I think I've heard this in the latest episode of Your Undivided Attention, which is another podcast that's really about recommender systems in particular and that shows all the harms they're doing, they really stress the fact that essentially what's going on right now on social media is the biggest, most important, and, in a sense, least controlled scientific experiment ever undergone by mankind. And it's pretty scary that it's going on like this, and that people are arguably not using the right tools to do this kind of experiment. To come back to the emotional contagion example: there was an experiment by Facebook in 2013 on 600,000 Facebook users, and it drew a lot of controversy around the paper, which is fine.
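Coming back to the water example above, here is a toy simulation (our own, with the 90% spontaneous recovery rate taken from the discussion) of why an uncontrolled trial is so misleading:

```python
import random

SPONTANEOUS_RECOVERY = 0.90  # base rate from the discussion: ~90% recover untreated

def trial_arm(n=100):
    """Count how many of n patients recover; 'water' has no effect,
    so every arm just samples the spontaneous recovery rate."""
    return sum(random.random() < SPONTANEOUS_RECOVERY for _ in range(n))

# Uncontrolled: give 100 people water, observe ~90 recoveries, and
# wrongly conclude that water heals.
print("water, no control group:", trial_arm(), "/ 100 recovered")

# Controlled: the untreated group recovers at the same rate, so the
# apparent effect of water vanishes.
print("water vs no treatment  :", trial_arm(), "vs", trial_arm(), "recovered")
```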
But the thing is that since that controversy, Facebook stopped publishing their research, but social media are still there, and they are doing experiments on humans on a large scale. And I think it's high time, really high time, that public health professionals, mental health professionals, and social scientists get a word to say on this. Just as you can't run a large clinical trial on drugs without the approval of the FDA, or whatever your national institute of health is, it should become a standard that public health authorities have a word to say about the experiments that are going on on social media.

Yeah, and there's a potential for saving a lot of lives, especially because of what you said about suicide, for instance; this is a huge deal. I don't remember the figures, but it's at least hundreds of thousands of people per year, I'm guessing. And arguably you can do a lot of good by using social media to prevent such tragedies.

Yeah, there is also experimentation on mental health, and this is also something that has been neglected. And then there's what people now call the infodemic, the pandemic of misinformation, on which you can act if you start looking at social networks as something that should be treated with the same rigor as clinical trials.

Yeah, so for instance, against this misinformation there's an instinctive reaction, which is the debunking approach. And this is going to be effective for some people, but it's not clear at all that it's going to be effective for most people. Actually, various data suggest that maybe it's not at all the best way to go. And research on what is effective to communicate to different people, depending also on who they are, so that the debunking of misinformation can be customized and targeted, is a very important research area these days. And I haven't seen it tackled as much as I would want it to be.

Right. I think we covered all we wanted to for today. Thank you for your attention. Next time, we're going to talk about a difficult topic: the problem of privacy and security related to machine learning. And there is a connection with the COVID situation, very loosely, because there's this problem of contact tracing, which is a big problem, and there's so much to say about it. But maybe we'll wait for next week to discuss a lot of this. Cool. So I hope to see you next time. Bye.