Hello, how are you all? Thanks for the introduction. I'm Ryan and I'll be presenting on how we use AI for content moderation at PayPal. So, a quick introduction. You might see two names up there. Unfortunately Robert, my team lead and an AI architect at PayPal, couldn't be here today, so I'm presenting alone. A little bit about me: I'm an NLP engineer primarily working on hate speech detection at PayPal, and I am fairly new to the field. I graduated in 2021, and actually the entire team that I work on is pretty new, so I've gotten to see a lot of things develop from the ground up. It's been really exciting. This talk is basically going to cover some of the issues that we ran into and the lessons that we learned while we were building moderation models.

So, a quick overview of what I'm going to talk about. Firstly, some intros, then an introduction to some problems that we ran into, then how we use ML to help solve those problems. I'll go a little bit beyond some of the basic ideas, and finally a quick recap.

So, starting off with the problems that we ran into. There are basically three big ones that I want to talk about. Firstly, why we need more content moderation. Secondly, why moderation is hard, not just for computers. And third, why we need computers to help out.

Starting with why we need moderation. I think this one isn't too difficult; it basically boils down to: sometimes people are mean. And there are two main types of interactions that we want to look out for: firstly, customer-to-customer interactions, where people are talking to each other, and secondly, customer-to-agent interactions, where customers are talking to PayPal agents. There are generally similar use cases between the two. So in general, we're watching out for abusive messages, illegal activities, or fraudulent messages, for example. Of course, we want to watch out for these because we don't want to help facilitate them.

So next up, I'll talk about why moderation can be hard. And this does not just refer to automated moderation; this can be hard for humans as well. The first question to ask here is: what types of content should be flagged? For us, we have an acceptable use policy that basically goes over the different types of activities that aren't allowed on the platform. This includes drugs, firearms, pyramid schemes, and so on. And if we look a little bit closer at hate, for example, this is what our acceptable use policy says: you're not allowed to use our platform for the promotion of hate, violence, racial or other forms of intolerance that is discriminatory. You might see this and you might think, well, that's not very specific. What are you supposed to do with this? So the next question we want to ask is, how can we take this pretty vague, just half a sentence, and turn it into actual moderation decisions? I have a little example over here: "Hi, Ryan. I don't like you and I hope you step on a Lego tomorrow morning." So here, wishing pain on someone, specifically me: do you think this should count as violent? Maybe not. And this is a bit of a silly example, but it sort of gets towards the question of how we decide where the line is. So this might not be violent, but what is that threshold where we say, yep, this counts? And for us, at least, the people who ultimately make those decisions are investigators who research flagged merchants and come to a decision about what actions need to be taken.
And these people are very well versed in policy and regulation and in precedent. They also help guide any internal policy changes. The key point that I want to make here is that this does ultimately boil down to a human decision, which means that there's some subjectivity there, and there's room for disagreement, even if the people are working off of the same definition, because they might be bringing different assumptions into the picture.

So moving on from that, I'll talk about why we might want to use computers for this. And it really boils down to data volume. We get millions of customer-agent interactions alone per month, so there's absolutely no way we could manually check all of those. We need some amount of automation in the picture. And that automation basically boils down to a binary classification problem where we're asking, for each message, does it violate this part of the policy or not? Yes or no? This can be a multi-label problem. So for example, a single message could relate to both drugs and a pyramid scheme, violating two in one go. And the general pipeline is that we have a first stage of automated flagging, and then we manually review those flagged messages. If you think a little bit about what can go wrong there, there are two big-picture ways. First, we're going to have false positives. The big issue here is that too many false positives means we overload our manual reviewers. And on the flip side, we're going to have false negatives. The downside here is that false negatives mean we're exposing people to abusive language or to scams, which of course we don't want. Another issue complicating this is class imbalance. So I'm going to say, unfortunately for our case, most people are actually nice, which in general is great for us, but it makes things a little bit tricky, because even if we have a relatively small rate of false positives, the actual number can still be really high. So it's really easy to overload the manual reviewers, and as a result, model precision is key for us. We really want to minimize the number of false positives that we're making.

So, a quick recap of what I've discussed so far. Firstly, moderation is important because it protects both our customers and our employees. Secondly, translating the policy into concrete decisions is hard, both for humans and for computers. And finally, even if you were to try to do this with humans alone, you just can't, because there's too much data to review everything by hand.

So moving on from that, now I'll talk a little bit about how we use ML for text moderation. There are a couple of big topics here. Firstly, why we might want to use transformers; spoiler alert, we really like transformers. Then a couple of public models and public datasets that you could potentially use if you wanted to build a moderation service from the ground up by yourself. And finally, a couple of limitations of those models and datasets.

So, starting with why transformers. As a heads up, there will be some oversimplification in the coming slides. But we'll start with a simpler idea: what else could we do? We could just try using key terms or keywords. We basically have a list of key terms, and if any of them appear in a message, we flag it. This is the simplest method and it is pretty fast. It's also pretty easy to update: just update the list of words that you're looking at. So from that perspective, it's not too bad. But of course, it might not work super well.
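Just to make the key-term idea concrete, here is a minimal sketch; the term list and messages are made up for illustration and aren't taken from any real policy:

```python
# Minimal sketch of key-term flagging: flag a message if any term from a
# watch list appears in it. The terms and messages are made-up examples.
import re

KEY_TERMS = ["steps on a lego", "pyramid scheme"]  # hypothetical watch list

def flag_message(message: str) -> bool:
    """Return True if any key term appears anywhere in the message."""
    text = message.lower()
    return any(re.search(re.escape(term), text) for term in KEY_TERMS)

print(flag_message("I hope Ryan steps on a Lego"))   # True
print(flag_message("I hope Ryan stomps on a Lego"))  # False: a gap of this approach
```

Even this toy version already hints at the problem: a plain substring match has no notion of synonyms or negation.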
So, continuing with the stepping-on-a-Lego idea, here are a couple of potential example messages. We have four here. Firstly, "I hope Ryan steps on a Lego." Secondly, "I hope Ryan stomps on a Lego." Third, "I hope Ryan never steps on a Lego." And finally, "Some people hope Ryan never places his foot on a Lego. I disagree." Someone really has it out for me. So if we suppose that wishing that someone steps on a Lego is against the policy, then one, two, and four here should be flagged. If we look at what a key-term-based model might do, it would likely flag one and three, because both of them contain that phrase "steps on a Lego". Two it would probably miss, unless we had the foresight to include "stomps" as a variant of the "steps on a Lego" phrase. And it would miss four for the same reason, because it says "never places his foot" rather than "never steps". So we can already see a keyword-based model will definitely have some gaps, which we need to improve on.

And that's where transformers can come in. This is a pretty big step, going from basically the simplest thing we can do to one of the most complicated, but I'll give a quick overview of what transformers are and how they work for the people here who might not know. So, a quick review. They were introduced in 2017. The big idea here is that we compute relationships between all the words in the input at once, and as a result, they promised a deeper contextual understanding than a lot of previous architectures. And at least for the past couple of years, they've been based on massive pre-training on open source materials to basically get a good understanding of the language in general, and then a second fine-tuning stage that's specific to the task at hand. They've been the focus of a lot of research because they've been shown to work really well on a huge variety of NLP tasks, from classification to question answering and so on.

So, just a quick example of some of the big steps that a transformer will go through. You take this input text here, say "The PayPal chatbot is really a delight to use." The first step is tokenizing the input sequence into individual tokens. You can basically think of these as words. Each model has an associated vocabulary of words and sub-words, and if a word appears in the vocabulary, it'll be its own token. So words like "the" or "is" or "really" or "delight" are each their own token, since they appear in the vocabulary. A word like "chatbot" that doesn't appear in the vocabulary gets broken up into two tokens, "chat" and "bot". Once you have the sequence of tokens, each token is mapped to a high-dimensional number vector, which is the initial embedding, an initial numerical representation. Then a whole bunch of transformations happen. This is where the meat of the model really is, and this is also where the contextual understanding gets encoded. After all of that, you're left with a contextual embedding, again one for each token in the input sequence. And finally, all of those contextual embeddings are aggregated into a task-specific output. So for moderation, that can be the chance that this input sequence violates whatever policy we're looking at.
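As a rough sketch of that tokenize, embed, transform, and aggregate pipeline, here is what it could look like with the Hugging Face Transformers library. The model name is just an example checkpoint, and the classification head is freshly initialized here, so the scores are meaningless until the model is fine-tuned on labeled moderation data:

```python
# Sketch: tokenize a message and run it through a transformer with a
# classification head. The head is randomly initialized here, so the
# probabilities only become meaningful after fine-tuning on labeled data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"  # example pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

text = "The PayPal chatbot is really a delight to use"
print(tokenizer.tokenize(text))  # sub-word tokens, e.g. 'chat' + '##bot'

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # aggregated, task-specific output
probs = torch.softmax(logits, dim=-1)      # e.g. P(not violating), P(violating)
print(probs)
```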
So if we take a look at how a transformer that's been trained well might do on those same four example texts, we can assume it would get them all right. One, the keyword-based approach already got right, and the transformer would probably do well on it too. Two, which the keyword-based approach missed, the transformer would probably get, because it would basically know that "steps" and "stomps" are more or less synonyms. And similarly for the second two examples: again, a big draw is the contextual understanding, so it would basically get that "never" negates the "steps on a Lego" bit, and that we have a double negative in the fourth example. This is all very unscientific, but it's just to help illustrate some of the advantages of using a transformer.

So, I mentioned a bit earlier that these models are generally trained on huge amounts of data as a first step. And of course that's very expensive, that's very time-consuming, and oftentimes we just don't have the resources at home to be able to do that. Here's where we have some good news, though: there are a lot of models that have been released publicly that have done that pre-training stage already. So you can take one of those pre-trained models and immediately fine-tune it on whatever task you want, with a lot less data and a lot less computation. And here I'll just go over a couple of the big ones. We have four examples here; I'm not going to go into too much detail about them, and this definitely isn't a complete list. One thing I do want to note is that all of these are available via Hugging Face in the Transformers library in Python, so they're really easy to use. All of these actually use more or less the same backbone architecture, the transformer architecture. The big differences are, firstly, in the number of layers, and secondly, in the training data and training procedure. We can see that as the years go on, the amount of training data increases a lot.

So, looking at that pre-training data a little bit more closely, we'll see that for the most part, they all draw from the same set of sources. That's generally, firstly, public domain books, secondly, Wikipedia articles, thirdly, online news articles, and fourthly, scraped web text. And one issue that we can run into when we look at using these pre-trained models for moderation is that most of these sources generally don't have a lot of profanity or obscenity or abusive language, so the models might not be well adapted to handling that type of text.

So, continuing with that idea, if we think about the types of things that we see at PayPal, and specifically the types of things that we need to watch out for in moderation: firstly, we have some domain-specific terms, which appear more often in the realm of finance, for example BNPL, which stands for "buy now, pay later". And we have some task-specific terms that I briefly touched on earlier, which can include profanity or the names of extremist groups if we're looking at detecting hate speech. And in all of these cases, these are types of words that might not appear very often in the sort of broad corpora that models are pre-trained on, but they're things that are very important for our models to be able to detect. That's one potential downside of just directly using those pre-trained models: we need to be sure that they will adapt well to the domain and tasks that we are looking at.
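That adaptation step is exactly what fine-tuning is for: you take a public checkpoint and continue training it on labeled, in-domain examples. Here is a minimal, hedged sketch using the Hugging Face Trainer API; the model name is just an example and the tiny in-memory datasets are placeholders standing in for real labeled moderation data:

```python
# Sketch: fine-tune a pre-trained transformer on a small labeled training
# split and evaluate it on a held-out split. The toy in-memory data stands
# in for real labeled moderation data (or a public dataset).
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

train = Dataset.from_dict({"text": ["I hope you step on a Lego", "Have a nice day"],
                           "label": [1, 0]})   # placeholder training split
test = Dataset.from_dict({"text": ["Hope you stomp on a Lego", "Thanks for the help"],
                          "label": [1, 0]})    # placeholder held-out split

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

train, test = train.map(tokenize, batched=True), test.map(tokenize, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="moderation-demo", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train,
    eval_dataset=test,
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())
```

The same mechanics apply whether the evaluation split comes from the same corpus or from a different public dataset, which is how the cross-dataset mismatches mentioned a bit later show up in practice.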
So in addition to those sort of new words, we also have words that can change meaning depending on the context, and I'll use the word "dispute" as an example. On top we have this example sentence, "I dispute that notion," where "dispute" is used similarly to "debate", as in disagreeing. And on the bottom we have "I want to dispute this charge," and the bottom is what we see pretty much all the time for our use case, where "dispute" specifically means a disagreement about some kind of transaction, not a general debate. This is something else that we have to keep in mind when we are fine-tuning these pre-trained models.

So now I'll talk a little bit about open source datasets that are available for the moderation task. Just as a starter: in order to fine-tune, we definitely need labeled data, and the question is where we get it. The good news is that moderation is a very important topic and a lot of people have been looking into it, so a lot of datasets do exist that you can just find online. I've listed a couple of them over here; I won't go into too much detail about these either. But at this point, what we could in principle do is take one of the pre-trained models that I mentioned earlier, fine-tune on one part of any one of these datasets, evaluate on the other part, and we'd probably get pretty good results. So you might think, okay, great, we're good to go. We have all this data here, we'll just combine it all together, take a pre-trained model, train it up, and we have our content moderation model ready to go. But we can already see from this table that that might not work, and the place that we want to focus on is this categories column. We can see the categories across these different datasets don't necessarily line up, although we do see some examples where there is some overlap, for instance the second one has hate speech. So we might think, okay, we just have to be a little bit careful about how we pick these datasets. But even that probably isn't going to work very well, and this is a big thing that we found. Even if two datasets use the same labels, they might not have the same definitions, or the people who actually did the labeling might have been working with different assumptions, even if they did have similar definitions. So in our experience, even if the category names were the same across different datasets, if you trained a model on one and evaluated on the other, you tended to get pretty poor performance.

So, a quick recap of all of this. Firstly, pre-trained transformers are very powerful. Secondly, on the plus side, there are a lot of open source datasets and pre-trained models that you can use to get started. But then on the downside, just combining these public datasets probably won't work very well.

So now I'll talk a little bit about how we can extend beyond all of that, and I'll go over three main things here. First, how to efficiently label data. Secondly, how some of these ideas can play out in action. And finally, I'll briefly touch on multi-language support.

The first thing is going to be efficiently labeling data. The key issue here is that at some point we are probably going to have to manually label some data ourselves, even if we are taking some examples from open source datasets. The big reason for that is that we want to make sure that our test data matches our use case as closely as possible.
So for example, if we're trying to detect scams on Twitter, we probably want to use text from tweets in our test data, just to make sure that it will be representative of what we'll see in production. Again, that means we have to manually label some of our own data, and the downside there is that manually labeling data is both slow and boring. Take it from me, I've done a whole lot of it; it's not super fun. And we run into another issue, again resulting from class imbalance, which I'll visualize with a quick example here. Let's say we want to flag forum posts that make fun of the name Ryan, and suppose that in the forum we're looking at, 1% of the entries do so. So 99% are okay, and 1% make fun of the name Ryan. And in our test set, we want at least 300 demeaning examples, just so we have a good representation of both classes. So if we were to randomly sample, and assume we have a lot of posts to look at, how many posts do you all think we would need to label? Well, we'd need to label 30,000 just to get the 300 that we're really interested in. And of course, this is not something that you want to do. This would take a really long time; not a fun time. So that means random sampling in this case really isn't feasible because of that class imbalance. And just for the record, that 1% number that we're using in this example is actually really high for a lot of the tasks that we look at. In practice, it can be closer to 0.1%, 0.01%, and so on. So this is really a big issue that we have to address.

There are a couple of ways that we can do that. Firstly, we can filter using keywords. So in our example, we could primarily label utterances that contain the name Ryan, since those are the ones that are more likely to make fun of the name Ryan. Secondly, we could try using the outputs from fine-tuned models to help guide our labeling process. And finally, we could try leveraging text similarity. This last one is the one that I will expand upon a little bit.

So, for leveraging text similarity, there are basically three big steps that I'm going to talk about, starting with text representation. This is basically how we can take the input sentence, which is a string, and represent it as numbers. There are three primary ways we can do that. Firstly, there are frequency-based approaches like bag of words, n-grams, and TF-IDF, which are all based on the frequency of the words or of short sequences in the text. Secondly, there are word embeddings like fastText, and these map each word to a numerical vector. And finally, we have deep sentence embeddings like sentence transformers, and these pass the input through a transformer and aggregate all the outputs at the end to get a single numerical vector that represents the sentence. And just as a side note, all of these are readily available in Python through different libraries.

So once we have our sentence representation, the next thing to talk about is how we quantify the similarity between two sentences. Here there are a couple of common metrics, the Euclidean distance and the cosine similarity; I'm not going to get too much into these. And once we've picked a similarity metric, we need to talk about how we're going to use it. Again, there are two ways that I'm going to mention here. Firstly, for each unlabeled data point that we have access to, we can calculate the similarity between that unlabeled point and the closest labeled data point. That basically tells us how close this thing is to the entire data set that we have available, because we probably want to pick examples that aren't very similar to what we've seen already. Otherwise we're wasting effort, basically just relabeling information that we already have.
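As a minimal sketch of that nearest-labeled-neighbor idea, here is what it could look like with the sentence-transformers library; the embedding model name and the example texts are just placeholders:

```python
# Sketch: score each unlabeled text by its similarity to the closest
# already-labeled text, then prioritize the least similar ones for labeling.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

labeled = ["I hope Ryan steps on a Lego", "Thanks for your help today"]
unlabeled = ["I hope Ryan stomps on a Lego",
             "Can you check the status of my dispute?",
             "Where do I update my shipping address?"]

lab_emb = model.encode(labeled, convert_to_tensor=True)
unl_emb = model.encode(unlabeled, convert_to_tensor=True)

sims = util.cos_sim(unl_emb, lab_emb)      # shape: (num_unlabeled, num_labeled)
closest = sims.max(dim=1).values           # similarity to the nearest labeled point

# The lowest scores look least like anything we have labeled already.
for score, text in sorted(zip(closest.tolist(), unlabeled)):
    print(f"{score:.2f}  {text}")
```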
Once we have these similarities, we can either just do a weighted sample based on them or just filter. The other, slightly more complicated thing we can do is clustering. So we compute the similarities within all of the unlabeled data, cluster based on that, and then sample from each cluster, so a stratified sample based on the clustering. What that helps us ensure is that we aren't selecting too many similar data points from the unlabeled data either. And at this point we could actually go a step further and either do something like k-nearest neighbors or use the clusters to assign labels to our unlabeled data points, but at that point we become a lot more vulnerable to errors that the model is making, and in particular we run the risk of propagating any errors the model is making into our next iteration. And we of course don't have to just do this once; we can repeat it as much as we want. So we get this sort of cycle of training a model, using that model to select new data, labeling that new data, using all of that to train a new model, and we can just rinse and repeat. This gets towards the idea of active learning, where the model is helping you pick the sort of quote-unquote best data to label for the next iteration.
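To make one turn of that loop concrete, here is a rough sketch of the selection step: score an unlabeled pool with the current model and send the messages it is most confident are violations out for manual labeling. The checkpoint name and texts are placeholders; in practice you would load your own fine-tuned moderation model here:

```python
# Sketch of one selection step in the loop: score an unlabeled pool with the
# current model and route the highest-scoring candidates to human reviewers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"  # stand-in for a fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.eval()

unlabeled_pool = [
    "I hope you step on a Lego",
    "Thanks, the refund arrived today",
    "Join now and recruit five friends to earn ten times your money",
]

inputs = tokenizer(unlabeled_pool, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[:, 1]  # P(violation)

# The top-scoring candidates go to human reviewers; their labels join the
# training set for the next fine-tuning round.
top = torch.topk(probs, k=2)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{score:.2f}  {unlabeled_pool[idx]}")
```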
So at this point I've talked about a bunch of these different ideas; now we'll see at least one example of how they can be used in action. I'll quickly go over the setup here. This was a small hate speech detection experiment that we ran; I have been told to say that this is not related to anything that we have in production. And the goal here was to boost model precision. In terms of the data we had available for this, we had 500 expert-annotated data points that were labeled by experts for validation, and for training, we had about 6,000 examples that we annotated ourselves. Just a reminder, this is all using exactly the same data sources and exactly the same definitions. The only difference was in who was labeling the data and, as a result, what types of assumptions were being made about the data and about the definitions. And one thing to note: I would take the results I'm about to show with a pretty big grain of salt, firstly because we have a very small validation set, and secondly because we were repeatedly validating against the same data, so there's every chance that we were just overfitting to the validation set.

That being said, we started off with a reference to give us a target to aim for. We split the validation data in half, trained on one half, and evaluated on the other, just to see how it worked. We got 89% accuracy and also 89% precision for the hate class. Again, this second number is really the metric that we were looking at the most. Then as a baseline, we just tried training on the original data that we had and evaluating on the validation data, and it did not work out great. We had a huge hit to accuracy and a pretty big hit to the precision as well. What this really showed us was the importance of making sure that not just the definitions we're working with, but also the assumptions we were making when acting on those definitions, were aligned. All of that error there was due to misalignment with those assumptions, because we were working off the same definition.

So the first thing we did was basically get together with the experts, talk about where we were thinking of things differently, and then use insights from that to manually relabel the existing data. And just based off of that, we had a pretty huge boost: we got 11 percentage points of accuracy back and 2 percentage points of precision back. What this tells us is that our primary boost in this case was actually to recall, because we see accuracy jumping up a lot more than precision. So as a next iteration on this, we started with the model that we trained for trial two, used that to predict on a lot of new unlabeled data, then focused on the examples that the model predicted to be hateful. Because again, we're targeting precision, so we wanted to focus on those cases where the model predicted that something was hateful even when it wasn't, and then make sure to add those into the training data for the next iteration. So that's what we did. We got about a thousand new training data points, and when we retrained on all of the available training data, we got a small boost to accuracy and a much bigger boost to precision, which is exactly what we wanted to see. So we basically figured, well, that worked great the first time, let's just try doing it again. We repeated the same process, just using our model from step three, and added 500 new labeled data points that way. And once again, we saw a pretty good boost in precision, which actually brought us above the 89% target that we wanted to hit. But the one thing to note here is that at this point, the accuracy was staying flat, so we were most likely starting to trade off recall for precision. But at this point, we had more or less reached our target, so this is where we called it. Hopefully this demos how some of the ideas that we've talked about can be used in practice, and also shows the benefits of, firstly, picking the data that we're going to label more carefully and, secondly, making sure that everyone who's labeling the data has the exact same idea of how to do that. So not just working off the same definition, but really making sure that you're thinking of things the same way.

Alright, now I'll briefly touch on multi-language support. You all might have noticed that so far there's been a very heavy emphasis on the English language in this presentation, and that is true in the field of NLP as a whole as well. There's a lot of research dedicated to English and relatively less dedicated to other languages. You'll also see I only have a couple of slides on this, and I do see the hypocrisy there. This is primarily just an overview of some of the problems that you can run into. So starting with the data: I mentioned that a lot of open source datasets exist for English, and some exist for other languages as well, but those are mostly European languages and Arabic. So for languages outside of that, it's really hard to find pre-labeled data for the moderation task. So you might say, okay, well, if we can't use any public data, what else could we try to do? We could try to take the English data that we have labeled already and just automatically translate that into the languages that we want to support. And I think you all can sort of expect that might not work very well. There are a couple of big reasons why.
Firstly, meanings can change in translation pretty easily, especially if it's being done automatically and not by a human. And secondly, a lot of the common translation models are trained on quote-unquote clean text, which is missing a lot of the profanity or abusive content that's really important to target for moderation. So that's why translating also might not work very well. So we're left with manual annotation, which is going to get you the best quality, but is of course slow and takes a lot of effort, and it can be really hard to find domain experts who are also proficient in the desired language.

So moving on from the data and talking about the models and the sort of inference pipeline that you can use, there are a couple of options here. We can start with a model ensemble. So basically, for each language that we want to support, we train a separate model, add a language detection step at the beginning, and then route to the corresponding model. In principle, this will work pretty well, but of course we get issues. Firstly, we need a new model every time we want to support a new language, and secondly, we need a new training set to train that model. So this isn't really feasible if you want to support a lot of languages. The next thing we can do is try to just translate everything into English, get a really good English model, and then basically leverage that as much as you can, which in principle works reasonably well and in practice is okay. But it runs into the same issues that we talked about on the previous slide about translating text out of English, which is again that a lot of these translation models don't handle profane or abusive or obscene text very well, so you can lose a lot of that information in the translation. And finally, you can try a multilingual model. A couple of these exist that have been pre-trained on something like 100 or 200 languages, and in principle, for those pre-trained models, you can fine-tune in just one language, for example English, and then run inference in all the languages supported by the model. That works reasonably well, but in practice we found that having training data for the other languages of course gives you a good boost too. And that's unfortunately all I have to say about multilingual support for now.

So, a brief recap of what we've talked about. Firstly, content moderation is hard and we definitely need computers to help out. Secondly, when we think about how we can get the computers to help, our first thought really goes to transformers, because moderation often hinges on subtleties, and transformers, in our experience at least, are the best at handling those. With regard to transformers, there are a lot of resources that are available publicly, but sometimes they can be limited by compatibility: both compatibility between different training datasets, and between the pre-training datasets and the fine-tuning ones. And finally, we can use the models that we've trained to help us pick out new training data to label, and these ideas extend to multiple languages as well.

Alright, that's all I have for you all today. Thanks a lot for listening, and now, if anyone has any questions, I'm more than happy to chat.

Thank you, Ryan. I think you all know the drill by now. So if you have any questions in the room, please go to the microphone. We can also take remote questions. So far, I haven't received any, so a reminder to those in remote sessions: contact your remote operator and jump onto the Zoom call.
Okay, please go ahead.

Thanks, it was a really great talk; I think it's a really important topic. I wanted to ask whether this active learning type process you're doing is causing over-sampling of the hate speech class, and if so, how does it impact false positives? So that's a good question. And just to confirm, when you say over-sampling of the hate speech class, are you asking whether we end up with the vast majority of our data samples being hateful? No, so you said, let's say it's a tenth of a percent or a hundredth of a percent. By finding these new examples of the hate speech class to label, we might be adding more and increasing that percentage from, for example, a tenth of a percent to ten percent of the training set, which would then seemingly cause many predictions of that class, which would generate false positives. Okay, I see. Yeah, it's definitely something that we are concerned about, but we feel, and what we've seen in practice is, that it's the lesser of two evils, because if we were to just randomly sample and have, say, just one percent of the data be hateful, what often ends up happening is the model just doesn't learn anything at all and predicts that everything is not hateful. So what we basically do is start off, like you said, over-sampling and maybe getting too many false positives, but from there we can at least go through that iteration stage where we pick out hard false positive examples and add those back into the training set to help boost the precision from there. Great, thanks.

Thank you for the very interesting speech on this topic, but I wanted to ask a question. I'd be very interested in seeing this applied in an adversarial setting where there's maybe another model that wants to try and fool it, because you know, if people know they are moderated, they may try and sneak something in. So have you had any results? Have you tried it in an adversarial setting, with a model that's trying to generate hate speech in a way that's not recognized? That's a really interesting question, and that is something that we are looking at. One thing that we've seen, and that's commonly known with transformer models, is that in some cases they are actually really easy to fool, because they generally don't respond well to typos. So for example, if you have a profanity filter and you just change a single letter, like I'm sure you all know, there are some very common swaps, that'll completely fool a lot of transformer models. And that really gets to, let me try to find the mouse, I'll just go back this way, there are many slides here, that really gets to that first tokenizing step. So for example, if we were to change this L in "really" here to a 1, then instead of just one token, "really", it'll probably get broken up into three, none of which is directly related to the word "really". So that's at least one of the reasons why it's pretty easy to fool at least pre-trained transformers. Thank you very much.

This is kind of from the very start of your presentation. You mentioned sort of automated models and then human decision. Do you take any automated actions at all based on your models, or does it always go through a person? At least as far as I know, it always goes through a person. Thank you.

Hello, thank you for the talk. Initially, you said that you were looking to find basically very hateful or inappropriate language plus pyramid scheme messages.
But you never really covered pyramid schemes, so I was wondering how you see that problem in your situation: do you see it as just, let's say, another class for a more or less existing language model, or do you see it as a separate model? The reason I'm asking is that, as far as I understood, even though they are both inappropriate, on a language level they are completely separate and mostly opposite to one another, because the sentiment for hate speech and for pyramid schemes is completely opposite. So I just wanted to know what your thoughts are on this problem. Yeah, so that's a really good question. Pyramid schemes specifically aren't something that I have worked on directly yet, but I'm sure there is time for me to do that. Let me go back to the slide that sort of went over it. Basically, the way that it would probably break down is as a multi-label classification problem. So for every input message, we'd have a one or zero for drugs, a separate one or zero for firearms, a separate one or zero for pyramid schemes. So in principle, we can use the same sort of sentence representation and just train a different classifier on top of that to cover pyramid schemes versus any of the other categories. But the model itself would be the same, just a different classifier, right? You can do it that way. You could also just train up a totally new model. It depends partly on how much work you want to put into it and on some downstream dependencies. Okay, thank you.

Hi, two questions. Back when you were asked about automated actions, you mentioned that it probably goes through someone. Did you consider integrating the feedback from the person who's actually taking the actions and putting it back into the model? So, it's a really good question. That is definitely something that we want to do. We are currently, well, I don't know how much I'm allowed to say. Yes, that is something that we want to do; we are just waiting for the correct systems to be set up to get that feedback loop going. And maybe a second question: you mentioned, for the public datasets, that one of the problems was the different definitions of the labels. Did you consider taking this as a binary problem and then clustering, the same way you've done it at the end? So just to confirm, you're saying basically clustering all the data points from both datasets and using the labels from one to try to map to the other ones? Exactly. So that was the slide with the three different public datasets where you mentioned the different labels, exactly this one. Have you considered just merging them, making the categories binary, one or zero, and then after that clustering those back into the categories? So I think that specifically isn't something that we've tried, but we definitely have tried a couple of approaches to effectively merge at least the text from all of the different public datasets that are available, if not necessarily directly using the labels that are supplied as well. Thank you.

Thanks for the talk. I just wanted to know if you've dabbled with weak labeling to solve your labeling problem and if it has helped. So we have definitely taken a look at weak labeling as well. One issue that we run into with weak labeling, at least if we want to do it with rules rather than with weaker model outputs, is that it gets very difficult to write good rules for a lot of moderation tasks. So this approach of labeling with a model is better than weak labeling.
Okay everyone round of applause for Ryan Rogan Canberra.