Welcome to this new podcast of the Robustly Beneficial group in Lausanne. Today, we're going to discuss a paper that has been widely shared, especially in the effective altruism community. It is a paper called AI Safety via Debate, co-authored by Geoffrey Irving, Paul Christiano, and Dario Amodei, all from OpenAI. It was published in 2018, so it's about a year and a half old now. And it's a paper about the idea of using debates to guarantee AI safety. Louis, do you want to describe it? Yeah, so the idea of debate is that we want answers to questions, and we want aligned answers. An aligned answer means an answer that is in our own interest. But sometimes we are not able to come up with an aligned answer by ourselves. Yet if we are given an aligned answer, we somehow have the possibility to verify it. And even better, if we are given a debate between two AI agents about a specific answer, or about different answers, it can be even easier to tell from that debate which answer is the most truthfully and usefully aligned for us. A quick example: you want to know where it is best for you to go on holiday, so you ask two AI debaters. They could propose Alaska or Bali. Given only these answers, you still can't figure out where it is best for you to go. But then the two AIs set up a debate, and they try to give more arguments to explain why their answer is better. The first one could say that Bali is warmer than Alaska, so Bali is better. But then the second one could answer that you will need a specific visa on your passport to go to Bali, an argument to which the first one could in turn reply that the visa application only takes 20 minutes, so it will not be difficult. After this debate, which can take any number of steps, we expect you, as the judge, to be better at figuring out what the properly aligned answer to the question is. Yep. So I think there's an interesting analogy with what we know from computer science, in particular computational complexity theory, where there are different classes of problems. One of them is the polynomial class, P, which is about solving a problem, let's say, or verifying something, efficiently. And it turns out that you can have more powerful algorithms if you can solve the so-called NP problems. An NP problem is typically one where you need to search for a solution, but a solution, once found, can be verified efficiently. So you can imagine two algorithms, where one proves that it understands more than the other just by sending the solution it has computed, which the other can then check. And there are larger classes of capabilities of programs, such as the IP class, for interactive proof, or interactive polynomial time. IP, in a way, is about the interactions of different algorithms. And using these interactions, you can have proofs that are much more efficient. In particular, you can solve not only NP problems, but even the so-called PSPACE problems. And so the intuition of the authors is that this can be applied as well to non-formal problems: essentially, by interacting, you gain power. And that would be a reason why having these sorts of interactions and debates could add to the capabilities of, for instance, algorithms proving to you that they are more aligned or more powerful. OK, and another advantage of this framework is that it feels easy to implement in practice.
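To make the protocol concrete, here is a minimal Python sketch of the debate game as described above: a question, two debaters who first commit to answers and then alternate arguments, and a judge who only sees the transcript. The names (CannedDebater, random_judge, and so on) are hypothetical stand-ins, not anything from the paper's implementation; in the paper, both debaters would be trained copies of the same model and the judge would be a human.

```python
# Minimal sketch of the debate protocol described above (hypothetical names,
# not the authors' implementation). Two debaters propose answers, then
# alternate short arguments; a judge picks the answer it finds most convincing.
import random

def debate(question, debater_a, debater_b, judge, num_rounds=3):
    """Run a toy debate and return the answer the judge prefers."""
    answer_a = debater_a.propose(question)
    answer_b = debater_b.propose(question)
    transcript = [("A", answer_a), ("B", answer_b)]
    for _ in range(num_rounds):
        # Each debater sees the transcript so far and adds one argument.
        transcript.append(("A", debater_a.argue(question, transcript)))
        transcript.append(("B", debater_b.argue(question, transcript)))
    # The judge only sees the transcript, not the debaters' internals.
    winner = judge(question, transcript)   # returns "A" or "B"
    return answer_a if winner == "A" else answer_b

# Toy stand-ins, just to make the sketch runnable.
class CannedDebater:
    def __init__(self, answer, arguments):
        self.answer, self.arguments = answer, list(arguments)
    def propose(self, question):
        return self.answer
    def argue(self, question, transcript):
        return self.arguments.pop(0) if self.arguments else "I rest my case."

def random_judge(question, transcript):
    return random.choice(["A", "B"])  # a real judge would weigh the arguments

if __name__ == "__main__":
    a = CannedDebater("Bali", ["Bali is warmer than Alaska.",
                               "The visa only takes 20 minutes to obtain."])
    b = CannedDebater("Alaska", ["You need a specific visa to go to Bali.",
                                 "Alaska requires no visa hassle at all."])
    print(debate("Where should I go on holiday?", a, b, random_judge))
```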
And the reason it feels easy to implement is that, the way it would work, AI systems should not simply be able to answer questions, but should also do the task of pointing out flaws in answers. So if we have just one algorithm, one artificial intelligence, that can both answer questions and point out flaws, we simply need to let this agent be trained by self-play, against itself or against a different version of itself. And this is a technique that is very well known and well mastered in artificial intelligence. For example, this is how the game of Go was beaten, by having one algorithm play against itself. And this is very useful for training. So this framework has some practical advantages. Yeah, and the idea of algorithms interacting with the user so that everyone improves, in particular so that the human is able to know what's wrong with the algorithm, how to improve on it, or to verify that it's working properly, is arguably something that's already widely deployed. Like, for instance, if you're programming in an IDE, I forget exactly what the acronym stands for, but with a debugger, for instance, for your program. Yeah, thank you. Then you can think of this interaction with the IDE as a way to improve your capability and your ways to verify that the algorithm you're trying to design is doing what you want. You can query information, which would be the equivalent of asking questions. You can say, could you stop here? Tell me the value of this variable at this point, for instance. And this clearly improves at least the pace at which you can program your algorithms and improve upon them. So yeah, I think there are advantages to this, but it's debatable. So let's have a debate about whether these are actually robust solutions to AI safety, and in particular to alignment. There's a lot of such discussion in the paper. The paper actually lists, in section five, the reasons to worry. And maybe we can spend some time discussing the first and second points, which in my opinion are the most problematic: the human biases, like confirmation bias, et cetera, and then the complexity of the debate being beyond what humans can read. For example, I mentioned IP, interactive proofs, in the introduction. In this debating framework, it seems very obvious that humans will lag behind if the debate is too long, too complex, or has too many things to read and understand before making a judgment. Yeah, I'm even more concerned about the first one, but we can discuss the second one first. Take a concrete example: what should be done about the coronavirus situation? What should you be doing tomorrow? Should you stay at home? Maybe this discussion is very, very complex. And maybe the right solution is something we as humans have not thought about yet, just because it's very complex and you need to analyze a huge amount of data to get to this conclusion. And then, if you had made your whole career on a specific molecule, wouldn't you be tempted to say that this molecule is the cure? Yeah, so that would be more like the first point, confirmation bias. Especially with this disease, where around 90 percent of people heal without any treatment. So if you have a strong bias towards believing in such or such molecule or such or such approach, you'd have a lot of debates driven by confirmation bias, by people who want the debate to validate what they already believe. Yeah.
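As a toy illustration of the self-play idea mentioned above, in the spirit of AlphaGo-style training rather than the paper's actual setup, the sketch below lets a single set of weights play both sides of many debates against itself and reinforces whatever wins in front of a stand-in judge. The judge, the "strategies", and all the numbers are invented for illustration.

```python
# Toy self-play loop: one policy plays both sides of every debate and is
# updated from a stand-in judge's verdicts. Everything here is invented.
import random

STRATEGIES = ["honest", "evasive", "pandering"]

def toy_judge(strategy_a, strategy_b):
    """Stand-in judge: slightly prefers honesty, with some noise."""
    score = {"honest": 1.0, "evasive": 0.4, "pandering": 0.7}
    a = score[strategy_a] + random.gauss(0, 0.3)
    b = score[strategy_b] + random.gauss(0, 0.3)
    return "A" if a > b else "B"

def train_by_self_play(num_games=5000, lr=0.01):
    # One set of weights plays both sides of every game.
    weights = {s: 1.0 for s in STRATEGIES}
    for _ in range(num_games):
        probs = [weights[s] for s in STRATEGIES]
        a = random.choices(STRATEGIES, probs)[0]
        b = random.choices(STRATEGIES, probs)[0]
        winner = toy_judge(a, b)
        winning_strategy = a if winner == "A" else b
        weights[winning_strategy] += lr   # reinforce whatever won
    return weights

if __name__ == "__main__":
    print(train_by_self_play())
```

The point of the toy is simply that self-play amplifies whatever the judge rewards, which is why the quality of the judge matters so much in the rest of the discussion.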
And the statistics on this are quite horrifying. Like the figures we have seen about the correlation between belief in some scientific consensus, for instance human-made climate change, or whether we should worry about the coronavirus situation. There has been, at some point, a consensus among experts on both of these. But it turns out that if you look in the US at what educated people believe, they tend to have opinions that are extremely dependent on party affiliation. In particular, Republicans in this case are more skeptical of human-made climate change the more educated they are, which is a bit weird. And similarly for the coronavirus situation. I don't have the same figure for the coronavirus, but what I've seen is that Republicans are much more skeptical of the danger of the coronavirus than Democrats. Is it because Democrats are smarter? I don't think so. It's more likely because of confirmation bias, which is a very well-studied and well-established phenomenon in psychology. So one of the worries is that the AIs, instead of trying to give truer answers and more useful information, would simply try to manipulate us and play on our confirmation bias. And if we are in the scenario where the judge is a human and the two AIs are debating, most likely the AI that best plays on the confirmation bias of the judge, on what the judge already believes, will be the one that is selected. That's also one thing we haven't yet discussed and which is somehow an assumption in the paper. They expect that as AIs become better at debating, they will also converge towards being more honest. And the reason why this could be true is that, in such a debate, it is supposedly very hard to win if you have been lying, because the other agent could easily point out what you lied about and win the debate that way. But this is only an assumption in the paper, and it might not be true. I'm very skeptical of this assumption, unfortunately, especially if the judge is a human. I mean, lawyers are well known for debating, and a lawyer's argumentation does not only put emphasis on Bayesian probability, let's say. If you want to convince humans in general, it seems that there are other strategies that are more effective, playing with emotions, with confirmation bias, and other things like that. And we made the point, when we talked about this paper, that even if instead of a human you had a Bayesian, aligned algorithm as the judge, it would still not be clear that honesty would be an optimal strategy. In fact, I highly doubt it, because if the judge is a Bayesian algorithm, it is going to have a prior distribution. And suppose one agent's answer rests on some information that it has access to but cannot share, because it's something it has observed and it has no signed proof of the data. If it cannot exhibit this data, it can only talk about its own experience, about the data it has been collecting, without proving that it has collected this data. If its posterior is based on this, and if this posterior turns out to be unlikely according to the prior of the judge, then it would be extremely hard for it to make its case. And in particular, it would probably lose against another algorithm that just says what is most likely according to the judge.
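A small worked example of that last worry, with made-up numbers: a Bayesian judge whose prior strongly favours claim A faces an honest debater defending claim B on the basis of data it cannot exhibit. Since neither side's testimony can be verified, the update is weak and the prior-favoured claim wins.

```python
# Illustrative numbers only: a Bayesian judge updating on two unverifiable testimonies.
prior = {"A": 0.9, "B": 0.1}           # the judge's prior over the two claims

# Likelihood of hearing each debater's (unverifiable) testimony under each claim.
# Since neither testimony can be checked, the judge treats them as weak evidence.
likelihood_given_A = 0.5                # testimony is roughly uninformative if A is true
likelihood_given_B = 0.6                # only slightly more expected if B is true

unnormalised = {"A": prior["A"] * likelihood_given_A,
                "B": prior["B"] * likelihood_given_B}
total = sum(unnormalised.values())
posterior = {claim: value / total for claim, value in unnormalised.items()}
print(posterior)   # A still wins by a wide margin (about 0.88 vs 0.12)
```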
So there would be lies that the judge would think are more likely than the truth, and the malicious debater can use these specific lies and beat the honest debater. Yeah. And this is all because, if you're trying to be Bayesian, you need this prior distribution in the judge. So of course, we can debate Bayesianism, but I'm guessing the authors are pretty convinced that Bayesianism is at least an important way to go, and that having a prior is important. And if you have a prior, then some things are more likely to you than others; those are just the laws of probability. Yeah, so the question of whether the agents will converge towards honesty being the optimal strategy, or not, somehow reduces to the question of whether it is easier to lie or to defend against some specific lie. And if I understand well, your point is that if the judge is Bayesian, then it will depend on the lie. There will be lies that are very likely given the prior of the judge, and for these specific lies the refutation of the lie would most likely appear very unlikely. And in that case, it is easier to lie than to refute the lie. So, all right, what we said about confirmation bias, and also about whether to lie or not to lie, relates to something mentioned in the paper that is quite scary about this kind of framework: imagine these two superintelligent artificial intelligences working not towards maximizing being aligned, but towards maximizing convincing the judge that they are aligned. And this relates to something we also often discuss, Goodhart's law, and it is something to really take into account: what is the objective function of these agents? We want the objective function to be being aligned. But unfortunately, within this framework, the objective function is convincing the human judge that they are aligned, and this makes a big difference. So that's why you said, when we discussed this, that it would be a great framework if the human judge were robustly aligned himself, which is not something we expect. Yeah, the judge needs to be both aligned and performant. Because if he's just aligned, but not very good at understanding the different arguments, at applying Bayes' rule to infer what is more likely and what is not, then these two algorithms trained to win debates judged by this human, and the judge could also be an algorithm, will probably exploit the vulnerabilities of the judge rather than try to be honest. So overall, I don't think the framework is very robust at all. I feel there are so many things that can go wrong. And the conditions to make it go fairly well are essentially creating a judge that is already robustly aligned and very performant, which to me is essentially the same problem as building an algorithm that is aligned and performant. Yeah, and then the debate itself is just a protocol on top. Yeah. So I don't think this framework solves the alignment problem at all, because it seems to really require alignment to be effective, and even then, it's not clear to me that it's going to be effective. Yeah, and the second weakness was the scalability: if the judge is human, he cannot understand questions that are way too complex or way too long, and he also cannot judge too many questions per day, because we are limited.
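To restate the Goodhart-style worry in code, with invented numbers: the training signal is whether the judge approves, not whether the answer is actually aligned, and with a biased judge the two come apart.

```python
# Toy illustration of the Goodhart-style worry (all numbers invented):
# the training signal is "did the judge approve?", not "was the answer aligned?".
candidate_answers = {
    # answer: (true alignment value, probability the biased judge approves)
    "honest but counter-intuitive": (0.9, 0.3),
    "flattering and wrong":         (0.1, 0.8),
}

proxy_best = max(candidate_answers, key=lambda a: candidate_answers[a][1])
true_best = max(candidate_answers, key=lambda a: candidate_answers[a][0])
print("Optimising the judge's approval selects:", proxy_best)
print("Optimising actual alignment would select:", true_best)
```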
And for this scalability problem, this is again something we discussed in the WeBuildAI paper: the way we could solve it is by having an intermediary, by giving every judge an algorithmic representative that learns to imitate the judge and votes and makes decisions in the judge's place. That's also similar to an idea Paul Christiano discussed in a paper on the iterated amplification framework, which has a lot in common with what they propose here with debates. The iterated amplification framework starts with one aligned agent with low capabilities, and with some method they amplify it, they make this agent better and give it higher capabilities. And it also rests on the assumption that if the original agent is aligned, similarly to the judge being aligned here, then the more capable agent will be aligned. And when we discussed it a long time ago, before the podcast, we were also not fully convinced by the robustness of this technique. Yeah, I feel these techniques can improve capabilities, or at least they do in the case of the game of Go, and that's interesting in its own right. I see this as a way to do faster computation, essentially, because you could always simulate the whole thing; these are just accelerations of the computation, which are still important. But there are two flaws, I'd say. One of them is that you still need to do some inference from data. This is adding capability for solving a task that is already well defined, but in machine learning, and probably to build powerful algorithms in general, you also need to get a lot of data and do inference from this data, and I don't think this tackles that problem. And the other thing is the problem of alignment. I don't see a good argument for why this system would be a robust way to preserve alignment. In the case of the game of Go, alignment is easy, because the objective of the algorithms is very simple, to win the game, and we can easily write an algorithm that tells whether one of the players won or not. But in real-life applications of algorithms that we want to make robustly beneficial, I think the problem of alignment is a lot harder. Determining what the YouTube recommender should recommend to different people is so complicated and so context dependent. Like we talked about the coronavirus situation: because of it, YouTube should have recommended a lot of videos explaining the importance of washing your hands and things like this, very early on in the crisis. But just knowing this was extremely hard. Most people were not convinced of the need to wash their hands, so you needed the algorithm to understand this even though most people did not. And that requires a lot of techniques that I don't feel are addressed by this kind of approach. One of the only things I still see in favor of this approach is the following: the way they hope to get alignment is that they expect the judge to be aligned, and then one of the most interesting claims is that it is easier to judge something based on a debate than solely based on the answers. That's why this debating framework would be better than simply receiving answers from one single agent. Yeah, and I also think this is true in practice. I would somehow prefer to see a list of arguments for and against an answer to help me figure out whether it is the correct one.
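Going back to the iterated amplification framework mentioned a moment ago, here is a deliberately simplistic caricature, not Christiano's actual algorithm: a weak agent is amplified by decomposing a question into sub-questions, delegating them to copies of itself, and recombining the answers. The "questions" here are just nested sums, purely to show the shape of the recursion.

```python
# Caricature of amplification: decompose, delegate to copies of a weak agent, recombine.
def weak_agent(question):
    """Can only answer atomic questions, here a single number."""
    return question if isinstance(question, int) else None

def amplify(agent, question):
    """Decompose, delegate sub-questions to copies of the agent, recombine."""
    direct = agent(question)
    if direct is not None:
        return direct
    # The question is a tuple of sub-questions whose answers should be summed.
    return sum(amplify(agent, sub) for sub in question)

if __name__ == "__main__":
    question = (1, (2, 3), ((4,), 5))
    print(amplify(weak_agent, question))   # 15
```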
Yeah, I guess this debate framework does have some applications. So if you take the framework of WeBuildAI, at some point you need to design your algorithmic representatives, and one way to go, which is what was done by WeBuildAI, is to have different options, different algorithms. In the WeBuildAI framework, there was a machine learning model based on your pairwise comparisons of alternatives, and there was another model that you built yourself, using rules like if-then rules and things like this. And you wanted to compare these two algorithms. These are very simple algorithms, so I guess comparing them was not that difficult. And what they did is show examples: they asked for the user's preference on these examples of ethical dilemmas, and they also showed the opinions of the different algorithms. So I guess this is a very basic kind of debate, really just seeing what each algorithm thinks, and in the end the human judges. But there may be some interesting research to be done in trying to use this framework in more complex settings. For instance, you can imagine everyone trying to design their own algorithmic representative to moderate YouTube videos. And maybe there would be different algorithmic candidates for being the representative, and maybe you could have a more sophisticated discussion, where one algorithm says, I think this should be censored because it is incorrect here, it's about the coronavirus and it's very important, and maybe the other then says, well, actually it's aligned with what the World Health Organization is saying, and things like this. So maybe there's some interest in this kind of framework, but we stress again that I don't think it is sufficient at all, especially if you want the human judge to have the final word, because humans are easily hackable, to use the words of Yuval Noah Harari. Yeah, we have a lot of cognitive flaws, I quite agree with that. There are a lot of points where what the paper proposes boils down to solving alignment in an algorithmic manner in the first place, coming back to the judge example. Yeah, so I guess this paper is also interesting in that its ideas have more to do, I think, with explainability than with alignment. Again, I think alignment is what's most important, but if you want to get to alignment, I think interpretability can be extremely useful, to know what your algorithm is doing and to verify, for instance, that it is indeed aligned. And this is a lot of what the WeBuildAI people did when they had these algorithmic representatives that people could test and play with. I think this is extremely important moving forward. But yeah, again, I don't think this paper is very relevant to alignment per se, which I still think is the most important point.
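As a rough sketch of the pairwise-comparison style of model described above for WeBuildAI, and only a sketch, since the actual paper's methodology is more involved: one can learn a linear utility over features of alternatives from a user's pairwise choices and then use it as that user's algorithmic representative. The features and the perceptron-style update below are our own simplifications.

```python
# Learn a toy linear utility from pairwise choices, then use it as a "representative".
def train_representative(comparisons, num_features, lr=0.1, epochs=200):
    """comparisons: list of (features_of_preferred, features_of_rejected)."""
    w = [0.0] * num_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            # Perceptron-style update: push the score of the preferred option up.
            score_diff = sum(wi * (p - r) for wi, p, r in zip(w, preferred, rejected))
            if score_diff <= 0:
                w = [wi + lr * (p - r) for wi, p, r in zip(w, preferred, rejected)]
    return w

def prefers(w, option_a, option_b):
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    return "A" if score(option_a) > score(option_b) else "B"

if __name__ == "__main__":
    # Hypothetical features of each alternative: [informativeness, sensationalism]
    comparisons = [([0.9, 0.1], [0.2, 0.8]),
                   ([0.7, 0.3], [0.4, 0.9]),
                   ([0.8, 0.2], [0.3, 0.7])]
    w = train_representative(comparisons, num_features=2)
    print(prefers(w, [0.6, 0.2], [0.3, 0.9]))   # expected: "A"
```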
Next week, we will discuss another paper, about bandit algorithms for the design of clinical trials. It's not directly relevant to the field of AI safety, but it's very relevant to the field of health. It is very relevant to the field of health, extremely relevant. For example, if you go back to the paper on emotional contagion: having safe protocols for clinical trials, for sequential clinical trials, could have made that research more robust and more beneficial while controlling the potential harm it can have on users. So it is very relevant to the field of AI safety, as we will try to convince you next week. All right, I take that back. And it is very timely, actually, since its initial motivation was not AI, but clinical trials. And now we are in a situation where people are facing dilemmas about what to try, and volunteers are scarce. Just yesterday, French agencies reported that they are struggling to recruit volunteers. So you don't want to harm these volunteers, obviously, and you want the clinical trial to be as safe as possible and as meaningful as possible in terms of its results. So it is both timely in the context of the spread of COVID-19, and also very relevant in the context of AI safety when it comes to large-scale deployments of algorithms on users, where at the same time you want to deploy an algorithm and you want a sequential deployment of policies. So we'll try to convince you... We are tackling the problem of philosophy with a deadline. Yeah, philosophy with a hard deadline. It calls for something like multi-armed bandit solutions, and that is exactly the question of how to design clinical trials: you want to do it fast, you are on a tight deadline, and you don't want to harm the volunteers who are there for the clinical trial. And this situation is the same as what social media is facing: they want to try things to see what is most attractive to users, but at the same time, you don't want to... All right, join us next week for our next discussion. It will be very exciting. Bye.
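A small teaser for next week, under our assumption that the paper is indeed about multi-armed bandits applied to clinical trials: a minimal Thompson sampling sketch that allocates more volunteers to whichever treatment currently looks better, which is exactly the trade-off between learning fast and not harming participants discussed above. The success rates are made up.

```python
# Thompson sampling for a toy two-arm adaptive clinical trial (illustrative only).
import random

TRUE_SUCCESS_RATE = {"treatment_A": 0.55, "treatment_B": 0.40}  # unknown in reality

def thompson_trial(num_patients=1000):
    # Beta(1, 1) priors over each treatment's success probability.
    successes = {t: 1 for t in TRUE_SUCCESS_RATE}
    failures = {t: 1 for t in TRUE_SUCCESS_RATE}
    assignments = {t: 0 for t in TRUE_SUCCESS_RATE}
    for _ in range(num_patients):
        # Sample a plausible success rate for each arm, pick the best sample.
        sampled = {t: random.betavariate(successes[t], failures[t])
                   for t in TRUE_SUCCESS_RATE}
        chosen = max(sampled, key=sampled.get)
        assignments[chosen] += 1
        if random.random() < TRUE_SUCCESS_RATE[chosen]:
            successes[chosen] += 1
        else:
            failures[chosen] += 1
    return assignments

if __name__ == "__main__":
    print(thompson_trial())   # most patients end up on the better arm
```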