Hello, everyone. Oh, is the mic on? I can't hear you. You can't hear me? No. All right, we'll speak a bit louder then.

Thank you, everyone, in the room and on Zoom, for joining us for today's RSM speaker series event. We really could not be more excited to welcome Dave Willner for a talk entitled "Moderating AI and Moderating with AI." I'm sure everyone is excited to hear Dave, so I'll offer just the briefest note of introduction. Dave was a member of Facebook's original team of moderators, playing a key role writing its earliest content policies and building the teams that enforced them. After leaving Facebook, he took on roles building the community policy team at Airbnb and as head of trust and safety at OpenAI. He is now a non-resident fellow in the Program on Governance of Emerging Technologies at the Stanford Cyber Policy Center. Dave's experience in content moderation and trust and safety spans almost the entirety of their histories as fields, so we're extremely lucky to welcome him today to hear his thoughts on where the space may be heading. Dave Willner.

Hi, folks. It's great to be here. I just wanted to start by apologizing: as noted, I've been doing this the entire time, and it is all my fault. So sorry about that.

I wanted to make a case to all of you around the use of AI in content moderation and how I expect it to change things. I have come to think that powerful foundation models are going to fundamentally transform how we do moderation. There's been a lot of focus on the novel risks those models present. That's fine; those things are true. I'm not going to dwell on that today; I think it's been fairly well covered. But the models, because of their unique capabilities and ways of working, are also going to be very useful in solving problems that have previously been intractable. Those solutions are also, I think, going to prove deeply relevant to alignment questions in AI itself, because, at least today, our ways of controlling and steering models are themselves downstream of techniques that have a lot of shared DNA with how we do content moderation in the present.

Just to briefly cover who I am, and why you should care what I think about any of this: I've been working in this field for about 16 years, at the forefront of controlling social technology, first in social media, then in the sharing economy, then in AI. I spent a lot of that time trying not just to grapple with the problems of emerging technology, but to grapple with using emerging technology to solve the problems it creates. In addition to working on policy itself, I spent a lot of time at the intersection of operations and policy, figuring out how we actually enforce the rules these platforms claim to have. And I'm going to dwell a lot on that question of actual performance today.

Currently, I'm a fellow at the Stanford Cyber Policy Center, where I'm spending a bunch of time on this subject, learning how to use LLMs to do content moderation, with a guy named Samidh Chakrabarti, who's another fellow at the Center. He ran Civic Integrity at Meta from 2015 to 2021. We ended up doing the same fellowship sort of coincidentally, and we're having very similar thoughts. One thread of that work is using off-the-shelf models.
We're also doing some work trying to train smaller large language models to be good at this task specifically, because we think more efficient models that can do this would be a helpful contribution to the space. I bring all of that up just to make the point that my perspective here is very much a practitioner's perspective, not an academic one: someone who has just been desperately trying to solve these problems for the last nearly 20 years, and who is very focused on what practically works and how we use these tools in practice, not merely in theory.

So, first, to set the table about why I have this strong belief about the importance of AI in the future of content moderation, I want to do some grounding in how I see content moderation today: why it works, and frankly why it doesn't work very well. I think there's widespread agreement that it doesn't work very well; that doesn't seem like a controversial thing to say here. Bad things keep happening to people; watching the child safety hearing in the Senate earlier this year is enough to demonstrate that to anyone. There are very serious social externalities going on.

This has naturally led to a lot of public theorizing as to why. There's a lot of discourse in the atmosphere about why social media moderation doesn't work well, and I've come to think we're basically having a problem-of-evil conversation about tech giants. The problem of evil is the idea in theology concerned with reconciling the existence of a benevolent and omnipotent God with suffering in the world. And we're having a problem-of-evil conversation about Mark Zuckerberg: if Mark Zuckerberg is good and in full control of Facebook, why do bad things happen to people on his platform?

A lot of the popular explanations focus on the benevolence side. There's the notion that they aren't trying, that the tech platforms don't care, that they're indifferent. There's the notion that they're greedy, that they sort of do care but don't want to spend money. Or there's the notion that they're actively malicious, that they have bad, antisocial values that we don't want. Those things may or may not be true. I don't think they're the root of the problem. The root of the problem is that they're very far from omnipotent. Or, put another way: we're bad at content moderation because we're bad at content moderation. We're simply not good at the core task.

To understand why we're bad at it, it's important to take moderation apart into a couple of components: values, and the actual classification task. The values piece of moderation receives a lot of attention. There's a lot of discussion about what the rules should be, and community policies are primarily understood, I think, as expressions of companies' values. That's not untrue, but it is not the most significant thing those policies express. I've come to believe that the focus on values is a form of bike-shedding. Bike-shedding is an idea that comes from a story about a committee: if we were all on the board of a nuclear power reactor, all else equal, we would spend more time discussing the color of the reactor's new bike shed than we would discussing nuclear safety, because more of us can have opinions about colors and bike sheds than are qualified to have opinions about nuclear safety. Values function that way in this discussion. Everybody has values, so it's easy to have opinions about values.
The reality is that the sorting task underneath the values, the classification task, is the thing we are very bad at, and it dominates any possible set of values. To get into why classification matters, let me give you some examples. There are lots of situations where the value proposition you want to achieve is not particularly in dispute, but where the ability to do it is very, very hard.

To give you some social media examples: reclaimed slurs are a great example of this. It's very intuitive to say we want to allow members of a community to use certain language, but actually doing that requires you to know, at scale, who is a member of that community, who they are speaking to, and what the actual context of the conversation is. So doing the thing is hard, even when agreeing on whether it's good or bad is easy.

Similarly, the controversies here are often made worse by public pressure. Facebook's breastfeeding photos controversy back in 2009 and 2010 was very much one of these problems in classification. The question of whether breastfeeding mothers should be able to upload photos of themselves feeding their children on Facebook is not really that interesting a policy question. Getting a moderation system to very consistently distinguish between nudity where there's a baby involved and it counts as breastfeeding, and nudity where there's not, is hard and error-prone, and so the ability to execute the policy is the challenge.

Or take the napalm girl incident in 2016. "Napalm girl" refers to a Pulitzer Prize-winning photo of a girl who had been hit by a napalm attack during the Vietnam War. It's very, very intuitive to say all of the Pulitzer Prize-winning photos should be allowed on Facebook, but you have to know what all the Pulitzer Prize-winning photos are, and everybody doing moderation has to know that too, or you can't actually achieve that policy goal.

So why are we bad at large-scale classification? We're bad at it because, fundamentally, we're trying to solve an industrial-scale problem with pre-industrial solutions. Social media is a mass distribution machine that allows billions of people to talk to billions of other people nearly instantaneously, with no intervention from the enterprise and no direct human intervention in the communication itself. It is pure mass production of speech. But we don't really have mass production capability for moderating speech. We're still stuck in what is essentially a piecework system. Piecework was a way of manufacturing textiles and articles of clothing in the early Industrial Revolution, when we had invented modern machines but hadn't yet invented machines that could do, say, sock knitting. Work would be parceled out to people in their homes, to be done in an artisanal way to a particular spec. Moderation today essentially works exactly like that, right? We have specific people, sometimes distributed, sometimes in one place, working against a document that tells them how to classify content at an artisanal level, except they're doing it en masse. And systems like that have trouble achieving very high degrees of consistency. We don't have machinery for those core parts of the process, and I'm going to get into a little bit of why humans struggle so much to do those processes well. Even where we do have machinery for parts of this classification process, the machinery itself mostly replicates the problems that humans introduce into the system as it exists today.
And then, finally, I think the nature of language itself probably caps how well we can do this. We are not, you know, making machine parts here, or making fabric. We are ultimately dealing with classification of language and culture, which is an inherently fuzzy activity, and so the upper boundary of excellence is probably fairly low. That said, until we make progress on at least some of those human or machine constraints, we're not going to see better moderation online. I think AI is going to help with that, because it surpasses both human capability and the capability of our current machines in a number of very specific ways, which I'm going to get into after digging into specifically how humans fail, because the ways in which we're inadequate to these tasks are important for understanding the ways in which LLMs can be helpful.

Okay, so why are human beings bad at classification? We're bad at classification for a lot of physical reasons.

Our working memories are really, really small. The content policies you can feasibly write as a policy writer are maybe a few pages long, maybe five, maybe six. That's not because we couldn't write a much longer treatment of what hate speech might be, or how to tell whether a photo contains nudity. It's because most people can't actually use a hundred-page document about what hate speech might be to make a thousand decisions a day, particularly not if you want them to stay up to date on what that document says while you're changing it all the time.

Our long-term memories are also very unreliable. I alluded to this in the napalm girl context. Notions of art, which again is the thing I think we all agree is probably good and want to have on Facebook, or notions of what a real name is, are basically lookup questions, right? There are no "art pixels." Art is just all the stuff we've decided is art as a society. And so in order to treat that stuff differently, you have to know what it is, which means you have to remember what it is. And people are not terribly good at remembering huge amounts of very specific facts about individual pieces of content. It can be done, but that's what getting a PhD is for. It's not something you do as an hourly job.

We're quickly exhausted. This work gets coded as introductory-level work; it's entry-level work. But focusing intently on content for thousands of repetitions, for eight or twelve hours a day, is intensely, intensely draining. People get tired. They make mistakes.

They get bored, which is another understated part of this. There's the trauma and emotional part of the labor that's been much discussed, and that is very much real. But honestly, a lot of the time, the work is dull. Many of the things you're looking at are not interesting to classify. They're not violating. They're just kind of random noise. It's a little bit like staring at white noise on a television screen and waiting to see if something meaningful shows up, and it's very hard for people to maintain focus under those kinds of conditions.

We rely on our own internal models. People don't really use the rules to make these decisions. They read the rules once, use them a couple of times, internalize some approximate model of what the rules say, well enough to not get in trouble with their boss, and then just keep doing that until they get in trouble again. And so as the rules change, people lag behind the changes.

We also typically can't recall our reasoning.
If you ask a given moderator why they made a particular decision yesterday, when they made a thousand decisions that day, they're almost certainly not going to be able to tell you. And so while this is a human process, which seems like it has meaning, the meaning is often not retrievable.

And then, finally, and this one is often hardest for folks to grapple with, we really do not have any shared common sense. There are a couple of specific examples that were really important in shaping my thinking. And, warning up front: all of this is going to get a little unpleasant, content-wise. We were trying to figure out how to classify CSAM, child sexual abuse material, at Facebook, for the purposes of creating the PhotoDNA databases that today underlie a lot of the attempts to control that material online. We had twelve folks who'd been doing this for a year and had been reporting information to NCMEC that entire time. These were full-time employees, like me. These were kids who went to Stanford and Harvard. And when we asked them to classify material into simply "report to NCMEC" and "do not report to NCMEC," without talking to each other, they could only agree about 40% of the time. And they'd been doing this for a year. On what you would intuitively think is the worst thing, the easiest thing to get consensus on: no consensus. And most other areas are actually worse than that.

When we first tried to outsource nudity moderation to India, we ran into a similar problem. We had rules that said, beyond a bunch of particular things we listed, also take down anything that is sexually explicit, and immediately the moderators started taking down photos of people kissing and holding hands. Because what we meant and what they understood those words to mean were not the same, because you just can't assume shared reference or shared values.

All of that is made worse by the economic way we currently organize this labor. The work is very notably poorly compensated. I think some of that is probably inevitable, given both the scale at which it's done and a bunch of the other things we're going to get into here. Being forced into consistency is pretty demoralizing. It's an alienating form of labor, particularly because this is labor where people have strong moral feelings about what they're being asked to do, and so it's unnatural to be asked to put on a sort of other morality. But putting on that other morality, to get everybody on the same page, is the heart of the activity. So you kind of can't avoid that, and that is itself draining and not a ton of fun.

So people who have better options leave. These are high-turnover jobs as a rule. This is well known in the context of customer support, but even for Airbnb's outsourced trust and safety teams, the average retention was about nine months. It's a very, very short period of time, because when people have the ability to do better work, they go. That undermines the accrual of expertise. It also means you have to invest a ton into training and updating the system, because you're constantly teaching new people to do this stuff and having to constantly reorient them as you make changes. So the entire system is extremely cumbersome and slow to update or improve its results.

Cool. Okay. I have hopefully convinced you at this point that people are not good at sorting things into piles. Why is our automation bad at this?
Our existing automation is bad at this because all it's really doing is statistically copying what all those people who are not good at this did. Our most advanced automation technique, black-box machine learning, is just predicting what a human moderator would do if you asked them to sort a piece of content according to a policy. It's a mathematical simulation of the results you would get if you had bothered to ask a person. Which is very useful, to be clear, because a lot of the time it does a pretty good job, and it means you can avoid asking people, which is great: it avoids trauma, it's much faster, it has some real upsides. But it also means that it inherits the fuzziness and the unreliability and the nonspecificity that we bring to that process. There are other forms of automation, but they're actually even simpler: hand-asserted if-then rules, exact word or pattern matching. They're all even less nuanced and less capable. We have no automation that does the activity humans are actually doing as part of this classification process. We only have automation that simulates the outcome of that activity.

And actually, the automation we have today adds a bunch of other problems. Its decisions are meaningless, very literally meaningless, right? It's not making an argument about why a particular piece of content fits or doesn't fit a given policy. It's simply saying: 95%, this is shaped the same way as other things you told me about under this policy. There's no meaning to the decision, which both makes it hard to debug individual decisions and is very disturbing for, frankly, the people who are subject to those decisions, because we want these things to have meaning, and to be able to dispute them and argue with them.

Updating the models is also extremely cumbersome. Training large machine learning models under current circumstances needs thousands, tens of thousands, of examples, which means every time you change your policy, not only are you having to update the humans and wait for that to phase in, you then have to wait for those humans to label tens of thousands of new examples before you can update your machine learning model. So our automation is also often very out of date. It's enforcing the platform's policy as it allegedly stood at some earlier time, which produces confusing results and outcomes that are not ideal. So machines are also not very good at classification, at least in our current circumstances.

And then, on top of that, I think the best we can ever hope for here is significantly less precise than what we can hope for in material manufacturing. Despite all of the manufacturing analogies I've been making, we are not in fact making steel cylinders on a metal-turning lathe. We are playing around with words, and words are inherently vague. Language itself is just not terribly precise, and that is particularly true on mass-scale social media, where people often write, frankly, fairly badly, loosely, or approximately, from a grammarian's point of view. Everyday language is not meant to convey precisely specific meanings. It's meant to be efficient for people with a shared context. And social media moderation doesn't share that context. It's a very radically post-modern exercise. All of the authors might as well be dead. There's only the text. You're just staring at these words after the fact, and that renders them very, very difficult to understand.
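To make the "statistical simulation" point from a moment ago concrete, here is a minimal sketch of the classic pipeline being described. This is an editorial illustration, not anything from the talk; the example texts, labels, and model choice are placeholders:

```python
# Illustrative sketch of the "statistical copy" pattern: a classic
# moderation classifier never reads the policy document; it only fits
# the labels that past human moderators produced, then emits a score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# In practice this is tens of thousands of (content, human_label) pairs;
# these four are placeholders.
texts = [
    "buy cheap watches now",
    "happy birthday grandma!",
    "click here for free money",
    "lovely sunset at the beach",
]
labels = [1, 0, 1, 0]  # 1 = a human moderator removed it, 0 = left it up

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# The output is only "this is shaped like things you removed before":
# a probability with no reason attached, and no way to update the model
# short of relabeling a new dataset whenever the policy changes.
score = model.predict_proba(["totally free watches, click now"])[0][1]
print(f"p(remove) = {score:.2f}")
```

Note what is absent: the policy document itself. It exists only in the heads of the people who produced the labels, and changing it means relabeling and retraining.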
A version of this context problem is the conversation about cultural context that comes up a lot. Having more specific cultural context is helpful here, but it's only a version of the problem, and in a lot of ways the easiest version to solve. Local interpersonal context is as big a part of this problem as broader cultural context. If I call you an elephant, am I calling you old, wrinkly, grey, fat, wise, or a Republican? There's no way to know the answer to that question outside of our specific personal relationship, and there never will be. And so we're sort of capped at a maximum here.

So that's the doom part of this. As an aside: all of the problems I've outlined are also problems for AI alignment in our current circumstances. Our primary techniques for AI alignment, reinforcement learning, rely directly on curated datasets of desirable behavior that we are trying to get the machines to copy. All we're doing in RL is curating sets of prompts to the model and responses from the model that we want the model to behave more like, and then doing a mathematical process to get the model to imitate that behavior. Which means that, ultimately, what we are aligning the model to do is dependent on the same kind of content classification, and has the same kind of content classification problems, that I've just been talking about in the social media context. You can see this in the two kinds of reinforcement learning we talk about most frequently: reinforcement learning with human feedback, which is where we're having humans do the classification, and reinforcement learning with AI feedback, which is where we're using AI to do the classification. Even in the names, this is baked in. And all of our other techniques for controlling the output of generative AI today are just wrapping content moderation techniques around either the input prompts people send to the model or the outputs from the model in response. It's all the same stuff. And so thinking about how we can do this core task better is relevant both to social media and also deeply relevant to conversations around AI safety.

You can see this in the Google Gemini blowup, which was, at least from my point of view, almost certainly either some poorly thought-through alignment instructions or some poorly thought-through moderation and modification of prompts, trying to correct for problems in the model itself. These are simply failures of classification technique. For example, ChatGPT, when we first launched it, wouldn't tell you facts about sharks, because we had taught the model that violence was not good: we didn't want it to help you plan violence, and we also didn't want it to graphically describe violence. And it wildly over-rotated and was like, got it, sharks are canceled, no more sharks, we can't talk about sharks. Which is a perfect example of the sort of classification overcorrection you get from these kinds of systems.

So, with all of that context set: generative AI is, I think, actually going to be very helpful here. Used properly, it is possible for it to exceed both humans and machines under real, existing circumstances. And by "used properly," I mean something very specific. The name "generative AI" is in a lot of ways a bit distracting for this purpose. It's more important to understand it as language-parsing AI: reading AI.
We have machines now that can do something functionally equivalent to a human reading a document and responding to what it says, which means we have a machine that can directly address the core activity a human moderator is doing, instead of merely reproducing the result. And I'm not speaking theoretically here; this already works shockingly well. One of the first things we did internally with GPT-4, when it became available to us in August a couple of years ago, was try to figure out how to use it to do content moderation. And within a week or two, me and a couple of other engineers were able to get to 90%-plus consistency with my decisions, with the model reading a document that any of you could read and following the instructions it provided in order to classify content. And things have only gotten better from there. OpenAI published a blog post about this in the middle of last year, about using GPT-4 for content moderation. There are multiple startups pursuing this path, and it's something I've continued to work on at Stanford with Samidh, who is particularly interested in fine-tuning smaller models to be able to do this, because the smaller you can make the model, the more broadly adoptable it will be. Doing content moderation with GPT-4 is a little bit like going to the grocery store in a Ferrari: you get there, but it's very expensive and most people don't have one. And so building a smaller, more compact, more usable, more broadly accessible system seems to us to be pretty important. But when I say you can in fact use these models to read policy text, follow it, and classify content: that is not theory. It has already happened.

Used in this way, the LLMs directly address a number of the problems with human moderators. Their short-term memory is already better than ours. The largest models have context lengths of hundreds of pages of text, so you can load tons of information into a model for making a specific decision. Their long-term memory is, or will be, more reliable, using things like databases plugged into a model to give very exact recall of large amounts of information. They don't get bored, they don't get tired, they don't lose focus, they don't seek better jobs. They don't experience trauma, right? Which is a pretty important part of this; there's, I think, a real moral case here as well. We can reasonably expect them to record what they did, and why they did it according to the text as they understood it at the time, every single time they make a decision, and to store all of that information, which helps with things like the requirement for explicability that is embedded in a lot of recent law. They're also way, way, way faster. The largest models are much faster than people, even doing this very cumbersome policy-text-driven process.

And then, on the flip side, the LLMs are better than our existing machinery because, again, they're directly doing the task. They produce responses that are scrutable, or at least feel meaningful. There's a broader philosophical question here about whether or not they're really reasoning. Honestly, for these purposes, I don't think it matters, because they are producing reason-shaped answers in language, and those reason-shaped answers can be used to debug the decisions the model made by changing the instructions you gave it. So when you have a model make one of these decisions, if you don't agree with the decision, you can simply ask it why. It will make a bunch of words at you. Whatever you think those words are, metaphysically, they are useful for understanding how the model was functioning, and you can incorporate that feedback back into the policy text to produce changed behavior. And this works really, really well.
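To make that loop concrete, here is a minimal sketch: policy text in the prompt, a decision plus a printed reason out, both stored. It's an editorial illustration, not the speaker's code; it assumes the OpenAI Python client, and the model name, policy text, and JSON output format are all placeholder choices:

```python
# Minimal sketch of policy-as-prompt moderation: the model reads the
# actual policy document on every request and returns a decision plus a
# reason, both of which are kept as an audit trail. Illustrative only.
import json
from openai import OpenAI

client = OpenAI()

POLICY = """(The full, explicit, exhaustive policy text goes here:
the same document a human moderator would be handed.)"""

def moderate(content: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable instruction-following model
        response_format={"type": "json_object"},
        messages=[
            # Grounding: the policy is re-sent with every decision, so the
            # model is answering "what does this document say?" rather than
            # recalling rules from memory.
            {
                "role": "system",
                "content": (
                    "You are a content moderator. Apply the following policy "
                    "to the user's content and respond in JSON with the keys "
                    '"violates" (true/false) and "reason" (a short argument '
                    "citing the policy language).\n\n" + POLICY
                ),
            },
            {"role": "user", "content": content},
        ],
    )
    decision = json.loads(response.choices[0].message.content)
    # Persisting the verdict *and* the stated reason, every time, is the
    # record-keeping that humans can't reliably produce.
    return decision
```

Debugging then works the way just described: if you disagree with a decision, you read the stored reason, edit the POLICY text, and re-run. The behavior changes immediately, with no relabeling or retraining cycle.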
So it's functionally equivalent to an explanation, in the sense that it is a word-shaped response that helps you understand what happened and why, and do something about it. And so, at least to me, a lot of the hand-wringing around whether or not this is "real" reasoning is not that relevant, functionally, for this task in the short term. That's not me dismissing those concerns in a longer-term, more AGI-focused sense; but for this purpose, with something like GPT-4, it's neither here nor there to a very great degree. I'd also point out that when you're dealing with really any people, but certainly people in a mass bureaucratic system, you don't understand why anybody does what they do either. We don't know how to get reasons that mean anything out of the systems we have today. So it's also not super clear to me that the alternative is really well-thought-out, clearly described reasons.

I don't want to stand here and make the case that this is magically going to solve all of our problems, so please do not take me as saying that. First, the systems themselves will have flaws. They're going to make mistakes. Some of those mistakes are downstream of the language limits I talked about earlier, but some of them are simply going to be errors or problems with the performance of the model. They're going to have biases. There's been a fair amount of reporting about this already in the use of these systems for things like hiring decisions or other kinds of adjudication, and that's very real. I'm not minimizing the need to work on those problems. But I would say that those, at least, are static, engineering-shaped problems, instead of the situation we have now, where all of the individual moderators also make mistakes and also have biases, but who they are changes every nine months. Understanding, correcting for, and controlling those biases is, I think, essentially impossible at present, because it is this roiling mass of chaos. Simply pinning the biases down to a single set of them, such that we can start to study, understand, and engineer around them, would be more tractable than where we've been, with an essentially ever-churning cauldron of biases that is never static and therefore cannot be stabilized.

I also think this circles back to my earlier point that classification is more important than values. I weirdly think that, if I'm right about this, we're going to have more fighting about values, because we're going to be better at doing the thing, and so what the values are is going to start to matter more. I think we're already seeing shadows of this in some of the "woke AI" culture-war stuff that is starting to creep into AI alignment conversations, and in some of the reaction to Gemini. So, oddly, as we get better at doing the activity, we're just going to fight more about values.

That said, and I've shaded at this earlier, I do think it's morally urgent that we figure out how to do this, even given all of those flaws. Having people continue to do this work is bad. There's been a lot of focus on ways in which the working conditions can be made better: ways in which pay can be made more equitable, breaks can be given, preventative techniques for protecting wellness.
All of those things are good, given no alternative. But in a lot of ways, they seem to me to be questions of engineering a better radiation suit, when maybe we could just have a robot do it instead and not worry so much about radiation protection, right? The problem with the radium girls making watch faces wasn't just that they were licking the radium brushes; it's that they were painting watch faces with radioactive material at all, which is just not a safe or good idea. And I think getting to the point where we can relieve humans of the direct coal-face labor here is an actively good thing, even though it is fraught. I think that's actually doubly true for marginalized groups in particular. Part of the perverse shadow of the request for more cultural context being injected into moderation is that it's essentially a call for the enlistment of the people who are victimized by speech in the controlling of that speech, which is perverse when you think about it that way. This will mean, I think, job losses, particularly at BPOs. But again, it's not clear to me the job losses are per se bad if the jobs themselves are dangerous, toxic, and not conducive to human thriving. I also think it will create more jobs overseeing these systems, on the flip side.

So, even with all of those caveats, this is a really big deal, if you accept the case I have made to you. It's not just going to mean we lift-and-drop AI in place of human moderators; it will change the kinds of systems that are viable to have.

It opens up new possibilities for moderation. Things like deeply personalized moderation become more feasible. Ambient moderation becomes more feasible. I think in the future, LLM-powered systems are going to allow things like Siri to prevent your grandmother from being pig-butchered over the phone, which is an inconceivable thing to try to do right now, but seems very possible in this sort of future, in the same way that deeply personalized moderation filters seem possible.

It is utterly transformative of the policy drafting process. Right now, a lot of content policy is basically astrology about how moderators will react to the words that you wrote. You're sort of guessing, because the update time is so long and retraining is so cumbersome that you can't feasibly do empirical testing of the outcomes of your decisions. These systems respond instantly to your word changes, which means you can actually test different versions and approaches to things and see how each produces different outcomes, which is revolutionary for the policy process directly.

I think it also is going to open up new policy vistas, not just new processes. Right now, we have generally global moderation standards on the social web because, frankly, it's cumbersome to do nation-by-nation moderation for anything but the largest-scale blocks. That may no longer be true; we could potentially start to think about really localized or regionalized moderation standards. And, similarly, different moderation philosophies that no one has ever, to my knowledge, seriously tried to engage in at scale become possible. Things like deeply intersectional approaches to moderation, which no one has ever tried to do because it's just wildly cumbersome and impractical, might become possible with these sorts of tools. A bunch of those ideas are probably bad, to be clear. I'm not saying all of the things I just said should happen. I'm saying they're now not impossible, and there will be other, better ideas that are now not impossible, which is going to be very interesting.
Similarly, changing the cumbersomeness of our moderation technology will change the kinds of platform designs that are available. There's been a lot of discussion of network effects in social media, yes, but I think an under-discussed aspect of why you see a lot of centralization in social media is how annoying moderation is to do at scale. I have been very skeptical of federated solutions, simply because I did not understand how moderation was going to work at the level of a Mastodon instance scaled up to Facebook size. These sorts of systems might actually provide the ability to make that kind of design work. Just as an example of the sort of consolidation I think you can actually see due to moderation: think about Reddit and the power-mod situation. Reddit is technically a sort of flat, federated system, but the reality is that a few thousand people moderate half of Reddit, because it is in fact a full-time job. That has caused a bunch of concentration in the bureaucratic processes of doing that moderation, even in a system designed to avoid it. So I think that's really, really interesting.

Closer to the extreme: why are we even talking about moderation at all? If I'm right about this, why aren't you ending up in dialogue with the text box you're trying to write in, about whether or not what you're saying is constructive? Again: maybe creepy, maybe a bad idea, but a possible idea now. I think there will be more versions of that.

These systems are also going to create new kinds of abuse. Ultimately, this technology is technology for sorting things into piles, regardless of what the piles are and why you want to do that. So it is going to be useful for things like censorship and surveillance. It's going to be useful for things like jawboning. The virtuousness of these tools is simply a product of how they are used; it's not a product of the tools themselves. I can also easily see the law start to specify exactly what content moderation process standards should be. That is probably a bad idea, but I suspect we will see some folks attempt it at some point, as it becomes more and more possible.

All of that said, though: I think this is coming no matter what. I'm really quite sure that some version of this is going to come to fruition, and so I think we all have an obligation to embrace it and try to figure out how to use it well now, so that we're not left on the back foot as it becomes increasingly prevalent. So, with all that said: (a) I hope that was convincing, and (b) I just want to leave you with a more provocative question, which is: what would the internet look like if we weren't terrible at content moderation?
The internet has been assumed to be a sort of semi-anarchic space. There's been a lot of discourse about how that has become less true over time, and some, I think, wistful yearning for a freer version of the internet in certain quarters. But really, we're still pretty bad at content moderation. If I'm right about this, it will start to really seriously change the dynamics of how the web even could work, in ways that I think are really hard to get our heads around. Maybe bad, maybe good, but worth noodling on, insofar as you accept my case. So, with that, I'm done.

Alright, thank you, Dave. I'm just going to start out with a couple of related questions from folks online, but if people have questions in the room, just flag down my colleague, who will be running around with the mic. The first question deals with something you mentioned: the ability to retrieve the reasoning of LLMs while they're making these decisions. How does that intersect and interact with the possibility of hallucination? And then, also: policy statements are often intentionally kept high-level and broad. How well can LLMs capture the socio-cultural contexts that are relevant to implementing broadly described policies? And a related question to that second point: the meanings of words change pretty rapidly, and increasingly so, especially with regional dialects. How does that interact with the set of problems you're describing?

Great questions. So, on the recall of reasoning: really specifically, what I'm proposing there is asking the model to actually print a reason, which you then store, every time it makes a decision. The hallucination question is a little bit separate from that, mechanically. In the context of how we've used these so far, you're actually feeding the model the policy text every time as part of the prompt, and that grounding in the specific document you're asking it to use to make a decision is very helpful, not perfect, but very, very helpful, in controlling hallucination. Hallucination is often worse when you're asking the model to remember what it knows, whereas if you're asking it "what does this document say," that helps reduce hallucination. It is absolutely still an issue. Again, though, people also make all kinds of really bad calls, and so the question is not "is that an issue or not"; it's "is it better than the status quo, and will it continue to improve?" We have spent nearly two decades and billions of dollars trying to squeeze more juice out of the human moderation lemons, and I just don't think there's any more juice in those lemons. But we have some new lemons, and maybe there's juice there.

You had two other questions. The changing meaning of words over time: yeah, that's happening rapidly, and it definitely is a problem. Part of that is something you can address with the database and long-context stuff I was talking about, where you simply tell the model, "here is how you will understand these words for this task," at any given time. So there is actually, I think, in some ways more ability to direct models there, relative to people, who, again, also have that freshness problem. So there are very serious advantages in keeping everything on the same page.
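A minimal sketch of that "same page" mechanism, again an editorial illustration rather than anything shown in the talk; the example terms and wording are made-up placeholders:

```python
# Illustrative sketch: handle term drift by injecting a living glossary
# into the moderation prompt, next to the policy, instead of retraining.
POLICY = "(full policy text)"

GLOSSARY = {
    "mid": "current slang: mediocre, unimpressive",
    "ratio": "current slang: a reply out-engaging the post it criticizes",
}

def build_system_prompt(policy: str, glossary: dict[str, str]) -> str:
    terms = "\n".join(f"- {word}: {meaning}" for word, meaning in glossary.items())
    return (
        "You are a content moderator. Apply this policy:\n"
        f"{policy}\n\n"
        "For this task, understand the following terms as defined here, "
        "even if you believe they mean something else:\n"
        f"{terms}"
    )

# Updating how a word is understood is now a data edit, not a retraining
# cycle: change the glossary entry and every later decision uses it.
print(build_system_prompt(POLICY, GLOSSARY))
```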
We have a question in the room. Hey Dave, back here. I used to work on content moderation issues before I got to law school. You know, I think the time horizon for what people were concerned about with social media and content moderation, back when you were first starting at Facebook way back in the day, was long; probably a lot of folks on the Hill in particular weren't as concerned with content moderation, weren't thinking about it as much. Fast forward to today, and probably most legislators and folks on the Hill know at least a little bit about Section 230; at least they've heard about it, right? It seems the time horizon with artificial intelligence, specifically generative AI, is much shorter, but the rising concern also happened much faster than it did with the content moderation issues. And I'm wondering, to the extent that you're speaking to policymakers or regulators, what are you saying to them about things they should actually be concerned about, instead of maybe some of these culture-war, boogeyman-type issues?

Yeah. There's been a lot of focus in AI around longer-term questions; there's been a lot of discussion around biorisk and similar issues. I'm not saying those aren't important, but I actually think we missed the boat a little bit on some of the shorter-term content-related issues, less in the text model space than in the image and video model space. Open-source image models are able to generate large amounts of CSAM, and it's really, really difficult to control them. There's been some interesting work from SIO around this topic. Right now, I think regulators are focused on the long term versus some relatively specific short-term problems that are happening, which honestly shouldn't be super controversial from a culture-war point of view, because they revolve around some pretty core abuses that we still do have social consensus on.

Another question? Go ahead. Yeah, it seems like one difference between the internet that worked fairly well 20 years ago and the one that doesn't work that well now is the scale we talked about. But maybe the answer here isn't to try to throw more AI at the problem, but to back away from the large scale, back to a time when you had many people on many different platforms, each of which had its own idiosyncratic moderation policy, and people would just sort themselves into communities they felt they belonged to.

Yeah, this is essentially the federation proposal as a solution. I am skeptical that federation is a solution, because it doesn't address the worst kinds of harms, because some space owners are going to be bad actors. Insofar as we're dealing with harms where the issue is "I don't want to be exposed to a particular kind of content," sorting is great and we should encourage it; this isn't me saying it's a bad thing. But it doesn't deal with the kinds of harms where the problem is content that other people see about me. The problem in the context of NCII, non-consensual intimate imagery, isn't that I saw my nudes; it's that you saw my nudes, right? And so there's no really great answer, absent some central enforcement, for those kinds of problems. And those are in fact the worst harms that happen in the content space, nearly universally, in my view.

My name is Matilda, from the business school. I work with the UN on the circulation of online harms, particularly CSEM, and with MIT on analysis. So my question to you: you have mentioned that there are many developments currently leading towards end-to-end encryption. We know what happened with CSEM reporting after WhatsApp
went end-to-end encrypted, and now we're seeing the same with Messenger, right? So how do you see these developments leading towards anonymization or end-to-end encryption, and all of these other similar developments, affecting moderation? Thank you.

So this is where the sort of wild-possibility stuff gets really interesting. I wish, in a lot of ways, we had a healthier conversation about the trade-offs between privacy and safety, because I think they are real and they're uncomfortable, and that leads us to sometimes try to dodge them. When I say that, I feel like when people say that, they often steer into "and we're overvaluing privacy versus safety." That's not necessarily what I'm saying. I'm more saying we need to be honest with ourselves about difficult trade-offs and the downsides they carry in both directions. I also think, though, that some of the stuff I'm pointing to might have really interesting implications for ways to compromise. You could conceivably, if I'm right, get an AI moderator to be really, really excellent at interdicting certain kinds of super-problematic material, and lock it inside of an encrypted space, such that it's never doing any reporting but it is doing interception. Again: maybe a bad and creepy idea, but a now-not-impossible idea, in a way that was previously outside the design space to even consider. So I think there are some interesting implications for even this tension, if I'm correct about where it leads.

Great. And just a couple of questions from online. One is asking how the moderation, tuning, and red-teaming of LLMs themselves has parallels to, and differences from, moderating social media; have the generative AI companies learned the lessons they should have? And then a couple of related questions about using generative AI in linguistic and cultural contexts other than English, and outside of the minority world.

So, in terms of the similarities and differences with social media moderation: content is content, at some level. AI is not capable of making a picture that a human could not conceivably make; it just works more quickly. We have Photoshop; billions of us have been uploading things to the internet for a decade, so functionally, all of the photos that can exist have been on social media. In that sense, yes, there very much are lessons to learn, and there has been a movement of trust and safety professionals with social media backgrounds into the AI space, I think because the companies do realize this. There are other parts of it that are unique, and red-teaming is actually one of the areas where it's particularly unique. That's because red-teaming, which is the process of prodding the model and trying to get it to do something it shouldn't do, is a multi-turn conversation. It's not just an interception, or a single-pass question. You're essentially trying to convince-slash-bully a robot into generating something for you that its creators are trying to teach it not to create. And that process is very different from testing social media systems, because there is evasion in those systems, but there isn't that sort of convincing process in quite the same way. And that is really interesting.

Do I think the AI companies have learned? It's hard to generalize about that, right? Google is a very different thing from OpenAI or Anthropic as an
organization, because it's a massive behemoth that directly owns a bunch of social media; OpenAI and Anthropic are both very much startups. So "having learned the lessons of social media" is sort of a weird question to ask about that group. But I do think we are not starting over in quite the same way. There's still an organizational capacity-building question, particularly on the startup side, but there has been a lot of learning from that history.

In terms of low-resource languages: it's actually a great question that I skipped over in my notes. Supporting low-resource languages and cultures with traditional moderation solutions is really hard. There are as many people in Massachusetts as there are in Serbia, and that's difficult to staff for, and Serbian is frankly a fairly big example, right? As you get to smaller minority languages, it becomes really hard. I don't think LLMs are going to magically solve that, but I do think that if we intentionally train them to be capable of it, they will eventually prove better at covering those languages, because you can support the capacity to understand a language in detail better than you can when you have to assemble a group of humans. If you want 24/7 coverage, 365 days a year, you're talking about at least 21 people to cover a decent amount of content, and that gets difficult to justify if you're not seeing enough content in a particular language to support that many jobs. So I think there's hope there as well, but it will require focus. It's not just going to magically happen.

Just another round of online questions, and then we'll move back to the room. Two about the commercial implications of what you're describing. One question is asking for your bull and bear cases for when a company could implement 100% AI-powered moderation that's fast, cheap, and accurate. And another question is asking about the way in which content moderation has become a competitive advantage for some platforms: do you think the development of AI moderation models will commoditize content moderation, for better or for worse?

Taking the second one first: it's already commoditized, and in a lot of ways the fact that it is a mass-produced commodity is the thing I'm pointing towards, and I think a root of a lot of our problems. I do think we're going to see trust and safety as a field, and moderation as a part of it, move in a direction that is more similar to cybersecurity over time, where there are more external vendors providing solutions instead of everything being homegrown and roll-your-own, which is the case at today's very large platforms, mostly for historical reasons: the external solutions just didn't exist, so we had to build them, because there was no one to buy them from. I suspect that will change over time, and I think you're already seeing some of that.

In terms of the bull and bear cases: my bull case is that, right now, there are startups that will do this for you, and at least if you are focused directly on text, and you have a flexible relationship with the definition of "cheap," they can do this pretty well with automation using LLMs today. Once you talk about images, that is also possible, but probably not under any feasible definition of cheap. The bear case is less about possibility than adoption. Bureaucracies are slow to change; the uptake of new technologies into these big processes is slow. So I've been using five years as a number, but that's kind of because I don't have any idea what the future is going to look like more than five years from now, for
any really well-thought-through reason.

Super interesting, thank you. So, you said that you think one of the problems with the AIs is that they're relying on human labels, and so there's kind of a cap, which is the cap of the highest-performing crowd workers. And many of the benefits you talked about for using generative AI in this context were more that they don't have the same problems of exhaustion and lack of memory and so on that humans do. Do you think that first set of problems, the cap of the best crowd workers, will also be removed, such that AI is able to exceed human performance even in the most focused and best human contexts? Or will that always remain, so that we only see the benefits from that kind of soft replacement?

I suspect it's going to exceed us for these kinds of tasks even under our best conditions. I am positive it's going to exceed us under the actual conditions; that is, I think, the way I would think about it. But that is mostly a statement about how difficult and mediocre the actual conditions are. I don't think it's a super high target. Another example: I have had one of these systems change my mind while working on policies with it. I was using it to label content and asked it why it made a choice that I thought was the wrong choice, and it responded with an answer that was better grounded in the policy text than the reasoning I was using, such that I was the one who was incorrect, if "correct" is defined as fidelity to the document I gave it.

Hi, I'm from the University of Toronto, and I'm a law professor there. I'm super sympathetic to everything you're saying; colleagues at U of T in the tech space have been telling me how good things like GPT-4 are at classifying content. I say that because I want to push a little bit on some of what you're saying, not to undermine the direction you're going, but to maybe open up another problem. So: you gave examples where the amount of disagreement on human labels is huge, and that's a problem. And you also talked about how we have values and the policy, and the problem is at the implementation level, and one of the things this proposal does is turn that implementation from a sort of human problem into an engineering problem. So here's a perspective from law. We deal all the time with analogy; we deal all the time with these kinds of broad policy-like statements, rules, whatever, and we have to apply them in the real world. And the application is deeply normative all the way down. There are values, and there's implementation, and in cases it simply depends on whether we have enough information, and other things like that. And we have questions in law that often look deeply contested. It's not just that we can't agree and there's some ground truth; we can't agree, and it's normative all the way down. So what law does is it doesn't give you the right answer; it settles the matter for a period of time, and that's open to change. And in settling the matter for a period of time, we agree to that if we agree that the process is legitimate, and once that trust goes away, then we've got lots of problems, as we all know. So what interests me about your proposal is: there are all these problems with human moderation; you can have a technical model, and we can learn how that model settles the matter. But there's no ground truth, so we still need to answer the question about the legitimacy of that. And that seems to open up kind of interesting possibilities, like: how could we do that? I'm just kind of
curious what you think of that.

Yeah. And actually, there was a question asked online that we skipped, about policy standards being very vague; I'm going to fold it in as I answer your question. That is true of the standards that are public, the standards most companies publish; those things are basically PR for what the actual rules are. The actual written standards being used to make these decisions are very, very, very detailed, because they are trying to solve this problem of inter-human coordination. So they get extremely explicit and exhaustive. And you do need to write in that explicit and exhaustive, not high-level, way to get good performance out of these LLMs. So it is a very, very concrete writing process.

You're totally right about legitimacy. I think there are, and this is not an area where I'm an expert, some really interesting experiments going on around using the models themselves to gather people's preferences and help combine them to create proposals for policies of various kinds. Anthropic's done some interesting experimentation here, using a sort of input process to figure out what those rules should be. And a lot of what is cumbersome about those kinds of processes today, again in my non-expert understanding, is how manual they are. Actually using AI-assisted systems to gather those preferences in a more dialogue-based way might make doing them at scale more practical, which is really cool. So there are some very interesting directions there, where gathering the input as to what the rules should even be, as a way of establishing legitimacy, might also become easier under this system. Not my area of focus or expertise, but I think it's a valid problem, and a really, really interesting one.

Your point is also totally correct about there not being a full difference between values and implementation; please don't read me as saying there is. It's more that the act of performing the instructions, whatever they happen to be, has a technocratic component, which we can be good or bad at, and there are also these values questions intersecting with that. Our technocratic inability is a big part of the problem as the system exists today, particularly because, at the mass scale we're talking about, which I can't figure a way for us to get out of, some of the processes law uses to make these decisions simply will not scale up. So we're stuck in this more technocratic mode; thus the focus on improving the technocratic mode.

Just another question online, but I think we have a few more in the room as well. One question: you seem to imply that, as content moderation improves, the work of trust and safety will be focused on updating content policies. What would be your guess about who decides these policies? Related to what you were just saying: will it be the platforms themselves, the users of platforms? You were talking earlier about platform assemblies or citizen assemblies. So I guess the question, more broadly, is about mechanisms for deciding those values as implementation becomes increasingly automated.

I think it's going to depend. So, a couple of things. One: even with more automated implementation, you're going to have human oversight over these sorts of clouds of frontline decision-making agents, to make sure they're still dialed in. So
You're totally right about legitimacy. This is not an area where I'm an expert, but there are some really interesting experiments going on around using the models themselves to gather people's preferences and help combine them into proposals for policies of various kinds. Anthropic has done some interesting experimentation here, using a sort of public-input process to figure out what the rules should be. And, again on my non-expert understanding, a lot of what is cumbersome about those kinds of processes today is how manual they are, and using AI assistant systems to gather that input in a more dialogue-based way might make running them at scale more practical, which is really cool. So there are some very interesting directions where gathering input on what the rules should even be, as a way of establishing legitimacy, might also become easier under this system. Not my area of focus or expertise, but I think it's a valid problem and a really, really interesting one. And your point is also totally correct that there is no full separation between values and implementation; please don't read me as saying there is. It's more that the act of carrying out the instructions, whatever they happen to be, has a technocratic component, which we can be good or bad at, and there are values questions intersecting with that. Our technocratic inability is a big part of the problem as the system exists today, particularly because, at the mass scale we're talking about, which I can't see a way for us to get out of, some of the processes law uses to make these decisions simply will not scale up. So we're stuck in this more technocratic mode; thus the focus on improving the technocratic mode.

Just another question from online, though I think we have a few more in the room as well. You seem to imply that as content moderation improves, the work of trust and safety will focus on updating content policies. What would be your guess about who decides these policies? Related to what you were just saying: will it be the platforms themselves, the users of platforms? You were talking earlier about platform assemblies or citizen assemblies. So the question, more broadly, is about mechanisms for deciding those values as implementation becomes increasingly automated.

I think it's going to depend. A couple of things. One: even with more automated implementation, you're going to have human oversight over these frontline decision-making agents to make sure they're still dialed in. So there is very much going to be work; I think that work shifts from frontline labeling work to quality-oversight work, in addition to the policy work. In terms of who decides the values, it's a really interesting and difficult question, because this is an adversarial space, right? At a very high level, it's pretty easy to want buy-in systems that create legitimacy, and we should do that, that's great. At the specific level of exactly which things violate which rules, and where we need to tweak language, that becomes harder to run consensus processes for, simply because the amount of stuff that happens every day on an internet of three and a half billion people means you are constantly responding to attacks, and you do need that flexibility. So I think that's going to remain contested. The obvious trend has been toward more government intervention in this space, and I don't think that's going to change, because it's a site of power and will become more accessible. But I don't think it's going to settle into one grand resolution where everyone is happy.

On that point, there's another question from online about how the regulatory space seems to be moving toward requiring some sort of human in the loop, or a human reason or explanation for content decisions. How does your proposal, or what you see as the future, intersect with those regulatory movements?

I think the rise of generative AI problematizes a bunch of the European moves here, because they assume that a human answer is going to be better, more fulfilling, and more correct, and my basic thesis is that that's wrong: we are very, very quickly going to end up in a world where some of these large-model-powered systems produce better, more fulfilling-feeling, more detailed, more accurate and consistent answers. And then they'll probably have to change the law, or we'll just have a weird version of the internet here; probably some mix of the two. We've ended up with weird versions of the internet before, and now we're going to try it again.

I have one last question, and thank you for your time; it's super interesting. Do you have any concerns that AI tools will also be used to generate images, text, or whatever else, to confuse or overwhelm the system, simply for the purpose of moving the needle on the norm? You could basically keep gaming it until you push the norm to the place where the particular thing you want to say, or the way you want to articulate an idea, or a piece of misinformation, becomes allowable; it becomes valuable to know what the rules are and test the system until it yields. So I keep thinking: yes, moderation gets faster, but what happens if the creation of text is also faster, learning faster, figuring out what's being rejected and what's acceptable, and keeps moving the line of what's acceptable toward where they want it to go?

So, a couple of things. One: most of these systems don't do online learning, so simply interacting with them isn't going to change what they do, and you can decide as a system designer whether to allow that or not, and you wouldn't, for exactly that reason. The other thing, though, is that I'd flip it around: they're going to do that anyway, so we need to learn to use the robots to help us, because we're not going to be able to respond quickly enough, given the scale of what is going to happen and the existence of this technology. So it's less "will you create that arms race" and more "the arms race is upon us; better start building battleships." That's never ended badly; what could go wrong?

We're technically at time, but I'm not leaving until tomorrow. Perfect. Well, a couple more from online, then. One: is anything you're saying relevant to data annotation? Could it solve some of the problems there?

Yes. What I'm proposing is, at root, a text-directable classifier: a sorting machine for content that sorts based on a document you wrote. If you have a problem that's about sorting content, and you can describe how you want it sorted as a set of written instructions, this is useful to you, for a lot of different activities, in the same way that it's applicable to both AI alignment and content moderation.
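Read that way, annotation is the same machine with a guideline swapped in for a policy. Here is a hypothetical, model-agnostic sketch; `call_llm` is a stand-in for whatever model client you use, not a real library function.

```python
# The "text-directable classifier" applied to data annotation: the same
# sorting machine, pointed at an annotation guideline instead of a policy.
# `call_llm` is a hypothetical stand-in for your model client of choice.
from collections import defaultdict
from typing import Callable

def annotate(items: list[str],
             guideline: str,
             labels: list[str],
             call_llm: Callable[[str], str]) -> dict[str, list[str]]:
    """Sort items into label buckets according to a written guideline."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for item in items:
        prompt = (f"{guideline}\n\n"
                  f"Valid labels: {', '.join(labels)}\n"
                  f"Item: {item}\n"
                  "Answer with exactly one label.")
        answer = call_llm(prompt).strip()
        # Route off-menu answers to a review bucket rather than trusting them.
        buckets[answer if answer in labels else "NEEDS_REVIEW"].append(item)
    return buckets
```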
Any other questions out there?

I have a general question: what is the difference between content moderation and content censorship? I'm a little bit confused about that. Do you have any defining line, technically speaking?

I don't think there's a technical defining line. To me, we are careless with our use of the word censorship; I think censorship is what a government does. But sometimes they overlap, you know, the issues they cover. The techniques overlap, because again, it's just a question of sorting. But Mark Zuckerberg does not own any prisons and has never put someone in one, and I think that is an important distinction. From a technique point of view, I flagged at the end of the talk that these techniques are definitely usable for malicious purposes by governments, whether well-intended or ill-intended; that is not disentanglable from the existence of the technology. If you're hinting more toward jawboning, the pressure that governments put on companies, that will remain a problem independent of the existence or non-existence of these techniques, and it is a thing we need to continue to work on as a society.

A couple more questions from online. One is about the future of volunteer content moderators, on places like Reddit: do you think those sorts of spaces will continue to exist, given the developments you're describing?

I think they'll change. I think the role will move from being the person directly reviewing content to being someone who is dialing in and overseeing a version of these systems, maybe provided by Reddit itself, to help you do this, such that you can scale up with less direct human labor. Which, if you want dispersion of power in who oversees subreddits, is a good thing actually, because it'll allow normal people, in their spare time, to viably moderate larger communities without having to be professionalized.

I really liked your comparison between the industrial machine and the artisanal machine in the case of content moderation. You've been in and out of the platforms; is it safe to assume that these companies are underspending on content moderation, given the legitimacy questions proposed here?

I think that's a version of the benevolence thesis I hit on at the front. Maybe some of them are, some of them aren't. I don't think there is an amount of money you could have spent, with prior technology, that would produce a result we all really loved. The crux of what I'm trying to get across is that we're bad at content moderation because we're bad at content moderation, not because we're secretly good at content moderation and mailing it in to be jerks.
Also: yes, there probably should be more investment, but I specifically think more investment in learning how to do what I'm talking about, so we stop being bad at content moderation, is what's important. And there probably has been an under-response on that score, simply because the large companies are giant bureaucracies that change slowly. So I think a lot of this will get figured out at startups and smaller companies and then percolate its way up, or come from special labs within larger companies; I suspect you'll see interesting things in this vein from Jigsaw within Google. How quickly that translates into all the giant processes Google runs is a bureaucratic question more than anything else.

Perhaps a fitting question to end on: if what you're describing becomes the brave new world of trust and safety and content moderation, what AI skills would you recommend someone interested in entering the field develop? What sort of skill sets?

Having a schematic understanding of how these models work is really, really important. You don't have to be able to build them, but you should at least understand how they work, so you understand their tendencies, properties, and leanings; remembering that they are still ultimately just prediction machines, predicting word outputs, is helpful. And honestly, in the policy space, the same clarity of writing that helps when writing for 10,000 people at a BPO translates pretty well to writing clearly for these machines, because a lot of the places where policy writing fails is when it moves into questions that are not discernible to the person who has to use the policy. Things like "what were their intentions when they uploaded this?", which, from the point of view of a content moderator, might as well not exist. So a lot of those skills actually translate super well.

Alright, well, unless there are any more questions in the room, I hope everyone will... oh, John does. No, no, let's squeeze it in; I'm literally not leaving until tomorrow. I also know that people are going to lunch.

I have a few questions, but I'll try to synthesize them into one. If you were to evaluate the startups entering this space, how would you evaluate their viability and methodology going forward? And, to give you background, I also have 17 years of experience overseeing teams, and I am concerned about some of the problems I don't think we should be applying AI to; part of it is that I worry about issues of bias and inequity, and about the calcification of the model itself, not adapting to the space. So I'm curious, from your perspective: what are the problems we really shouldn't be applying AI moderation to, or where we maybe need an adaptive hybrid approach that is AI-assisted rather than AI-decided? And therefore, what should startups in the space be avoiding? Because they all say they're trying to solve it all, and I don't actually think that's a safe way of progressing.

In terms of evaluating startups, I would evaluate them the same way you'd evaluate a BPO, right? I would actually ask them to do the work and see whether or not they're producing the outcomes you want, and you should be thinking through how you really put them through their paces, in terms of what you actually send them: not just a random sample, but the hard, weird, edge-casey kind of stuff. I would personally prefer startups that are giving you a toolset to do it yourself, not saying "hey, plug this in and we'll make all the bad things go away," because that feels very much like giving away your control. A lot of the better ones don't do anything on their own; they're like, "here's a way to write a document, and we'll fire it at the LLM for you in a structured way, and you can set up your rules engine." It's a system for doing the trust and safety work within, not just a plugged-in thing that claims to solve all the problems, and it keeps you in control and able to assess whether or not it's working well, which is the thing I would look for and would want as an insider.

In terms of where automation becomes dangerous, there's an interesting question you got to in the second half of that, around how much this becomes pure substitution versus side-by-side working. Does it go to full replacement, or does the actual interface moderators work in just become profoundly different and AI-enabled? I suspect it's a little of both. For some percentage of decisions, we're already in this world, right? Classifiers make the very easy and the very hard decisions, and then humans make the middle decisions. I suspect you're going to get another band of AI-made decisions, and then the core is going to be this AI-assisted set of human decisions.
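That banding is straightforward to sketch as a confidence-threshold router. The thresholds below are invented for illustration; in practice they would be tuned against labeled quality data.

```python
# A sketch of the banding described above: automated action at the confident
# extremes, human review for the ambiguous middle. Thresholds are illustrative.

AUTO_ALLOW_BELOW = 0.05   # classifier is very confident the content is fine
AUTO_REMOVE_ABOVE = 0.95  # classifier is very confident the content violates

def route(violation_score: float) -> str:
    """Route an item, scored with an estimated probability of violating
    policy, to an automated action or the human review queue."""
    if violation_score <= AUTO_ALLOW_BELOW:
        return "auto_allow"
    if violation_score >= AUTO_REMOVE_ABOVE:
        return "auto_remove"
    # The middle band is where AI-assisted human judgment lives.
    return "human_review"
```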
I'm not proposing taking humans out of the loop entirely, and I do think human oversight over the instrumentation of the model decisions you have running will continue to be important for any foreseeable future. That's a little in the weeds for a half-hour talk, but I do think it's pretty important. I don't know that there are any categories of decision that should never be made by a machine. To me, the question is who will achieve better results for the people who actually bear the potential harm. So I think bias is an obstacle there that we need to account for, a very important one, but the question is what is the most effective method that both intercepts the harms we're trying to deal with and creates the least amount of suffering for the people who are part of the process. If AI wins that, AI wins that; and if there are situations where it doesn't, you should use the thing that is best, period.

Alright, well, I hope everyone will join me in thanking Dave. We'll be back here next week for another speaker series event, so please join us either online or in person. Thank you again, everyone.