We have with us here Ruben Miguel, CTO of Neutral. Ruben, welcome. Welcome, thank you very much. Fake news, what a sexy subject, okay? Yeah, it is. So, looking forward to listening to you. All yours.

Okay, thank you for the kind introduction, Lena. Let me first of all make a brief introduction about Neutral, because maybe some of you won't know us. Neutral is a media startup. It was founded in 2018, so we are really young. We are a TV producer, we are a digital newsroom, Neutral.es, but we are also a fact-checking agency, and this is the business line of our company that matters most for this talk. The team behind Neutral pioneered bringing fact-checking to TV in Spain, and they did this in 2013 with the program El Objetivo. In 2017, El Objetivo became the first Spanish signatory of the International Fact-Checking Network. And from that moment on, since 2018, as Neutral, we have been establishing partnerships with the main social media players, Facebook, WhatsApp, TikTok, and with technology players such as Google, for fact-checking content.

For many of you, probably, the term fake news is something that started in 2016. Lena said she was not going to talk about politics; I am going to talk about politics, it's my job, okay? And maybe the person who made fake news famous is Donald Trump, with the U.S. elections. At that moment we had 6.6 million fake tweets, and 126 million Americans were exposed to this information according to Facebook's estimation, so the real numbers are probably much higher. And things got worse. Do you remember what happened in 2021 in the U.S. Congress? What happened there was not accidental. There is a really interesting project from our colleagues at the Washington Post: they tracked the claims made by Donald Trump during his four years in office, and he made 30,500 false claims. If you do the math, that is 21 false claims per day. If you account for the hours this guy sleeps, he was making a false claim at least once per hour. And if you look at that graphic, there is a high peak of false claims in November: that was the U.S. elections in November 2020. And this is only one politician. Imagine what is happening out there.

But fake news is bigger than politics. There is also fake news impacting our economies: there are disinformation campaigns specifically targeted at damaging a company's brand, and the brand of a company is its main asset. And all of you know, for sure, that fake news also impacts health. With the COVID crisis, all of you surely had a lot of fake news about COVID on your mobile phones. At some points of the pandemic, at Neutral, where we have a verification service to which anyone can send something to be verified, we were getting one request per minute. And this is only a small fragment of what is happening, only the small part of our audience that wants to share content with us. So the magnitude of the problem is huge, really, really huge. We cannot solve this problem alone, and we cannot solve it without technology. This is why, when Neutral was founded, I was invited to join the company to create a team working on artificial intelligence, trying to automate the fact-checking process. The problem, as Lena said, is that fact-checking nowadays is a manual, intensive process. We divide it into these four steps: monitoring, spotting facts, verification, and exploitation.
The idea is, first, you need to monitor everything that is being said online, in traditional digital outlets, or on TV. Then you have to select what is worth verifying. Then you need to find official sources with the data that will help you verify the information. And finally, once you have your fact-check, you need some way to make it go viral. The problem is that the truth normally travels slower than the lie, and it's very difficult to make the truth go viral because it's not as sexy as a good lie. That's the main issue we are having.

Before moving forward, I would also like to point out the difference between fact-checking and debunking, because for this presentation it's important to have this difference in mind. When we talk about fact-checking, we are talking about political statements, something like "during the last four years, the unemployment rate was the lowest ever". However, when we talk about debunking, we are thinking about statements like "5G causes the coronavirus" or a deepfake of Obama saying "I'm in love with Donald Trump". In this second scenario the content is normally totally fabricated, and we have to do some multimodal analysis: it's very important to analyze whether an image or a video has been tampered with, and the focus of the technology is normally on detecting that kind of content. However, when we speak about fact-checking, and this presentation is about political fact-checking, we need to go deeper into the NLP analysis, because all the analysis is textual and the focus is on data verification: we try to say whether something is false, half-false, or totally wrong. So we have to deal with misleading content, not purely fabricated content.

Our main question is: can artificial intelligence automate fact-checking? If I have to provide an answer for that, or at least envision the ideal AI that Neutral wants to build, I could summarize it in this sentence: fact-checking in 100 languages, reasoning like a human journalist would, but with the computing power of a machine. This is what we are building, or trying to build. The main idea is a combination of humans with technology, enhancing the capabilities of the journalist: the work that now takes eight hours, we want to do in one hour, and we want to increase their productivity at least 30 times, all of this through the use of deep learning.

This is how we approach the problem, following the four steps I presented before. In the monitoring step, the first one, we basically need to create a common output format for all the sources we monitor. We need to monitor YouTube, Twitter, Facebook, a TV interview, a radio interview, and all of this needs to be converted into text, so speech-to-text technologies are very important for us. After that, we need to be able to process this massive amount of information and automatically detect what is worth fact-checking; this is what is called claim detection. Something is worth fact-checking if it has some factual data in it. This is an example of a sentence with some factual data in it.
If you look at how the sentence is built, you can see that factual statements normally contain dates, and verbs in the past tense, because if a sentence is about the future we cannot check it, we cannot verify the future; we don't know what is going to happen. We also have to take into account some specific grammar structures that are typical of factual statements: we need to look for comparisons like "more than 4,000 deaths", and figures, since numbers are normally a good sign that a sentence can be verified.

When we started to tackle this problem, we followed the typical machine learning approach based on feature engineering. This is a summary of what we were doing at that moment. Basically, we were building many, many features with the typical NLP pipeline: we were working with dependency parsing, with part-of-speech tagging, with custom-made regexes to try to capture the kinds of structures we knew were common in factual statements, and in later stages we even added word embeddings to the mixture; the word embeddings were what worked best for us. We combined all these features with four main families of classifiers, SVMs, decision trees, logistic regression, and Naive Bayes, and did hyperparameter tuning on all of it. In the end you have a lot of processes running and running: you fit the data, you check which features are working and which are not, you do a new round with all the potential combinations, creating new features, removing features, combining features. And all of this, in the end, was not working for us at that moment.

Why? First of all, we build systems for the real world, and the real world is imperfect. We have noisy transcripts: when a sentence is broken, when there are misspellings in it, when the sentence has no real meaning, the features built with these NLP frameworks come out wrong or broken, and the performance of the classifiers goes down. Besides that, the machine learning models we were building didn't generalize properly to different content. We were also spending a lot of time creating the features, and if we wanted to do this in a new language, not only in Spanish, we would have to start from scratch, creating new features. So doing this was very much like looking for a needle in a haystack.

We started to think: can we use deep learning to solve the problem? In a deep learning approach, the network is the one that does the feature extraction for us. The problem was that, with the amount of data we had at that moment, this approach was not viable, because deep learning needs big data, and fact-checking is a field where we have a scarcity of data. Why? Because deciding whether something is check-worthy requires expert knowledge: we need fact-checkers to decide if something is check-worthy or not; it's not as easy as saying whether there is a dog in a photograph. Also, these datasets are highly unbalanced: in a normal transcript of a politician, only between 5% and 15% of everything that is said can be verified, because the rest is blah, blah, blah, opinions, subjective things. So in order to get a large number of samples from the minority class, we need massive amounts of data, and this is expensive; we cannot simply go to Mechanical Turk for it.
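To make that earlier feature-engineering pipeline a bit more concrete, here is a heavily simplified sketch in scikit-learn; the regex features, the toy sentences and the labels are invented for illustration and are not the features Neutral actually engineered.

```python
import re
from sklearn.linear_model import LogisticRegression

# Hand-crafted features of the kind described above: figures, dates, comparisons, past-tense cues.
def extract_features(sentence: str) -> list:
    return [
        float(bool(re.search(r"\d", sentence))),                             # contains a figure
        float(bool(re.search(r"\b(19|20)\d{2}\b", sentence))),               # contains a year
        float(bool(re.search(r"more than|less than", sentence))),            # comparison structure
        float(bool(re.search(r"\b(was|were|fell|rose|died)\b", sentence))),  # simple past-tense cue
        len(sentence.split()) / 50.0,                                        # normalized sentence length
    ]

sentences = [
    "More than 4,000 deaths were registered in 2020.",      # check-worthy
    "I believe our country deserves much better.",           # not check-worthy
    "Unemployment fell by 4% during the last four years.",   # check-worthy
    "Thank you all for being here tonight.",                 # not check-worthy
]
labels = [1, 0, 1, 0]

features = [extract_features(s) for s in sentences]
classifier = LogisticRegression().fit(features, labels)
print(classifier.predict([extract_features("The budget rose by 2,000 million euros in 2018.")]))
```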
But with the rise of the transformers, we can solve this issue of data scarcity for our domain. Starting in 2018, transformer architectures began to change the NLP world; the transformer model has changed the way we work with text data. There are a lot of different architectures and families based on transformers, and these are only some of them, I cannot show them all. It's very difficult to follow all the new proposals, with new pre-training objectives, new ways of dealing with cross-lingual material, and so on, that researchers across the world are pushing forward every day.

Basically, the transformer architecture is composed of an encoder and a decoder. The encoder has the goal of encoding the natural language, the text you give it, into a vector of fixed size, while the decoder has the opposite goal: it takes a vector and generates a natural language representation. This encoder-decoder architecture was defined for machine translation, so the idea was: I have an input text and I get as output the same text, not in the original language but translated into the target language. For instance, here the input is "I am a fact-checker" and as output I get the same sentence in Spanish, "Soy un verificador". Transformer architectures are normally built from stacks of encoder and decoder layers: with more layers the learning capability of the network is better, but it requires more computational power, so you have to play with the number of layers to get a proper system.

What was really interesting is that, when researchers started to figure out why these models were so good at translating a language, they realized that this kind of architecture also works really well for many other NLP tasks, and that you don't need the whole architecture to do it. For instance, encoder-only architectures are really good for any application that needs to understand language, such as sentence classification or sentence similarity, while decoder-only architectures, like the GPT models, are really good for tasks where you have to generate text, for instance creating fake news with a GPT-3 model.

Why are transformers so powerful? This is the main question we need to ask ourselves, and this is what we are using at Neutral: these three key points, the pre-trained language models, the fine-tuning capabilities, and the transfer learning approach they enable. I'm going to explain each of them very quickly. Pre-trained language models are like building blocks. A pre-trained language model is basically a massive neural network that someone has trained with terabytes of data with one main objective: to encode the knowledge of a specific language. How can you do that? You feed the neural network the whole Wikipedia, the whole Common Crawl, the whole internet, a lot and a lot of sentences, and you define an unsupervised learning objective: I give you one sentence and I mask some words. This is really very similar to the exercise that probably all of you did some day when you were learning English, filling the gaps: you get a sentence and you have to fill the gap. You are capable of doing that, and machines are also capable of doing it. In this case, with the two masks in the example sentence, if we ask our neural network to guess which words go in those spots, the machine will probably say "Game of Thrones" for the first one, and it will say that the preposition "of" is the word that should appear in the second mask position.
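As a minimal illustration of that fill-the-gaps objective, here is a fill-mask sketch using the Hugging Face transformers library; the model choice and the example sentence are mine, not Neutral's setup.

```python
from transformers import pipeline

# Load a pre-trained masked language model (multilingual XLM-RoBERTa in this sketch).
unmasker = pipeline("fill-mask", model="xlm-roberta-base")

# Ask the model to fill the gap, exactly like the "fill the blanks" exercise.
mask = unmasker.tokenizer.mask_token  # "<mask>" for XLM-RoBERTa
for prediction in unmasker(f"The final season {mask} Game of Thrones aired in 2019."):
    print(prediction["token_str"], round(prediction["score"], 3))
```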
The important thing to remember here is that language models are task-independent: we can use them as a building block for any task that needs to understand language. And they are language-dependent: if I want to use one for Spanish, I need a language model that supports Spanish; I cannot use just any language model. This was a problem at first. When we started to work at Neutral, there was only one language model in Spanish, BETO, a BERT-like model, and that was the language model we could use at that precise moment.

But the most important thing about the transformer architecture is that we can combine the language models with fine-tuning, and I will explain why this can turn a big-data problem into a small-data problem, which was exactly what Neutral needed. How does this work? We have our language model and, with our small annotated dataset, the claims that Neutral has, the factual statements that have been labeled by our expert fact-checkers, we can change the weights of the neural network, teaching it not just to understand a language but to know how to differentiate between a factual statement and a non-factual statement. Basically, we add some layers on top of the neural network and we fine-tune those layers with this small annotated dataset. We started with only 10,000 records (now we have half a million), but with 10,000 records you can already start to get some good claim detection models. And, if you need it, you can also continue the pre-training of a language model with a big corpus from your domain; this is called domain adaptation. We are also doing this, feeding the pre-trained model with sentences from the European Parliament, because these sentences are from the political domain, so our pre-trained language model, which is based on the XLM-RoBERTa architecture, is a model that in the end has more knowledge about the political domain.

And the third important point about the transformer architecture is its capability for transfer learning. This is key for the goal of 100 languages. The XLM-RoBERTa architecture supports 100 languages by default, and it creates language-agnostic representations of our input text. What does that mean? It means that when we are fine-tuning the network, when we are teaching our model to detect claims in Spanish with Spanish content, we are also teaching it to do the same thing in English, in German, in Romanian, and in many other languages. Of course, the transfer of knowledge from Spanish to English is small, but there is some transfer, and the transfer from Spanish to Catalan is bigger, because Spanish and Catalan are closer languages. This helps us create what are called zero-shot learning systems, where we can classify sentences in German with no training data in German: we can detect factual statements in German even though our system was never trained with a single sentence said by Angela Merkel, for instance. Here you can see the performance of our algorithms in 21 languages of the EU. In Spanish and in English we have the best performance: in English we have some 5,000 additional training samples, and for the other languages we have zero training data at this moment, and we still get good performance; we only lose 10, 15, 20 points of precision at most.
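A minimal sketch of this kind of fine-tuning with the Hugging Face Trainer, assuming a small labeled claims dataset; the two example sentences, the label convention and the hyperparameters are placeholders rather than Neutral's actual configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny annotated corpus: 1 = check-worthy factual statement, 0 = not check-worthy.
data = Dataset.from_dict({
    "text": ["Unemployment fell by 4% in 2019.", "I think people are tired of politics."],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="claim-detector", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data,
)
trainer.train()  # the multilingual encoder can then be applied zero-shot to other languages
```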
How does all of this work in the real world? We have a real use-case scenario where, in our day-to-day, we record videos of interviews with politicians. We upload them to our asset manager, the asset manager launches an automated transcription job, we get the transcriptions, and when the transcriptions are finished our neural network starts to work and detects which sentences are factual. The sentences are reviewed by journalists in an editor, which we are going to see now, and the feedback of those journalists goes back into the system, so we can keep training the system with more data. This is why we moved from the 10K sentences to the half a million sentences we have now.

This is how the editor works; I will show you a small video. This is a transcription automatically generated by a machine, and when we press a button the neural network starts to process the data. You will see that this is a video of one hour and thirty minutes, and in three to five seconds the system has detected 93 factual statements. The numbers with a percentage are the confidence the system has in each prediction, and you can see that the predictions the system makes are really good, at least for Spanish and for the political domain, which is the main target of our system. A journalist can also select another sentence and add it to the system, or directly delete a sentence if it's not of interest to them.

And this is a demo of the same kind of behavior, but in this case we take a text in English from Boris Johnson and try to figure out whether there are factual statements in it. You can see we copy the text here, we send the information to the neural network, the neural network starts to think, and there is the list of sentences it has detected. You can see that these sentences contain factual statements, and we also have a confidence level, in this case simply represented as low or high. The low confidence corresponds to a sentence that is somewhat ambiguous. Now I'm going to do the same, but in German. I don't know a word of German, okay? Not a single word. So I simply paste the text in German, I press the send button, again we get some sentences, and as I don't know anything in German, I copy a sentence, go to Google Translate, and check whether it looks like a factual statement, and indeed it does. So this is how our system is now capable of detecting factual statements in 100 languages.

Another real use case that we have implemented, and that our journalists are using in their day-to-day, is a system we call Claim Hunter, which monitors Twitter in order to detect factual tweets. It's basically the same behavior as before, but in this case we don't have automated transcripts, we have tweets, so we had to train a specific neural network for them. When the network detects a new factual tweet, it sends a notification to a Slack channel, so the interface is already done, it's as simple as that. Our journalists are connected to this Slack channel and they get these notifications saying "this tweet contains factual statements". They can simply accept the recommendation, sending us a signal that says "okay, you were right, this was a factual tweet", or they can reject it.
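A stripped-down sketch of what a Claim Hunter style loop could look like: classify an incoming tweet and, if it looks check-worthy, post a notification to a Slack incoming webhook. The model name, the label string, the threshold and the webhook URL are all placeholders, not Neutral's production code.

```python
import requests
from transformers import pipeline

# Hypothetical fine-tuned claim-detection model applied to incoming tweets.
claim_detector = pipeline("text-classification", model="my-org/claim-detector-tweets")

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder incoming webhook

def process_tweet(tweet_text: str) -> None:
    result = claim_detector(tweet_text)[0]
    # Notify journalists only when the tweet looks check-worthy with high confidence
    # ("FACTUAL" stands in for whatever positive label the fine-tuned model uses).
    if result["label"] == "FACTUAL" and result["score"] > 0.8:
        requests.post(SLACK_WEBHOOK, json={
            "text": f"Possible factual tweet ({result['score']:.0%}): {tweet_text}"
        })

process_tweet("More than 4,000 people died last month according to official figures.")
```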
With this real-life feedback loop we can continue training the algorithm. From this real-life test we are getting an average precision of over 85% with this claim detection mechanism in Spanish, and of all the tweets we recommend to the journalists, in the end around 15% are selected to create a fact-check, so they are sent to a new queue for verification. What is more interesting is the graphic at the bottom, where you can see the amount of work we are saving them: the red line is the tweets being processed by the platform, the blue line is the tweets that actually have something factual in them, so the journalists are saved roughly 85% or 90% of the work they had to do before.

Another thing you have to take care of when you are working in the real world is that BERT models require high computational power, GPU and RAM, so real-time inference is much more expensive than training, because you have to keep your machines on 24/7 if you want to provide inference in real time. To deal with this in a small company like ours, we started to use the AWS Inferentia chips. They were launched, I believe, one or two years ago, and they provide roughly six to ten times faster inference than BERT on CPU, and they are cheaper than GPUs. What you have to do is compile your model with the SDK they provide for this specific architecture, and you get something that is more affordable for your company, so this is the way we decided to implement it in the end.

Now we are going to the third step of the verification process, what we call automated claim matching. How can we automatically verify information? Using an AI approach, one way of doing it is simply to figure out whether a claim has already been fact-checked, by Neutral previously or by another fact-checker, and then you can automatically send that information to the final user, because you already fact-checked that info. Another way of doing it is trying to retrieve the actual data and give it to the journalist. We are going to see how we are starting to work towards this objective. So far, we have fully solved step one and step two; our research work now is focused on step three.

When we tried to work with sentence similarity, what we found with the traditional BERT architecture is that it was really slow when you try to compute thousands of embeddings at a time, and you need to do that if you want to compare a lot of sentences. There is a newer BERT-based architecture that is the way to go, called Sentence Transformers, or Sentence-BERT. If you have 10,000 sentences, this architecture reduces the time needed to compare them from the range of hours to the range of seconds, and this is really powerful when you need to make inferences or provide recommendations in real time. How is this possible? Basically, you train the network with what is called a Siamese network architecture: you pass sentence A through a BERT layer and get the embedding of sentence A, you do the same with sentence B, and you compute the cosine similarity between both embeddings. Then you compare this cosine similarity with your ground-truth similarity between the two sentences, and you fine-tune the model that way.
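As a rough illustration of that Siamese setup, here is how such fine-tuning might look with the sentence-transformers library; the base model, the example pairs and the similarity labels are invented for the sketch.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Bi-encoder: a transformer body plus a pooling layer that turns each sentence into one vector.
word_embedding = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Sentence pairs with a ground-truth similarity score in [0, 1].
train_examples = [
    InputExample(texts=["Unemployment fell by 4% in 2019.",
                        "In 2019 the unemployment rate dropped four points."], label=0.9),
    InputExample(texts=["Unemployment fell by 4% in 2019.",
                        "The new stadium will open next summer."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Siamese training: both sentences go through the same encoder, and the loss compares
# their cosine similarity with the ground-truth score.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```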
Also, once you have all of these embeddings, which as I said are much cheaper to compute than with plain BERT, you need an efficient way to search in this embedding space. For that, we are using Elasticsearch with kNN-style algorithms that allow us to do this search in a short time.

The problems we have discovered so far with this approach: we get bad performance when the sentences are short; for those scenarios it is better to compute sentence similarity with the typical Elasticsearch queries rather than with these architectures. The other thing we figured out in our first tests is that we also get bad performance when we want to compare sentences with rather different structures: imagine comparing a very long sentence with a short one, where you are interested in knowing whether the short sentence is somehow contained in the long one. For those scenarios, our first tests say this is maybe not the best architecture, or you will need to figure out new ways of solving the issue.

Again, cross-linguality is very important for us: can we do the same when we are trying to measure semantic similarity? There are some models, one of them is LaBSE, with great capabilities for generating cross-lingual sentence embeddings, and with these cross-lingual embeddings you can figure out whether two sentences say the same thing. This is a screenshot of a real proof of concept we did with sentences that were fact-checked by Neutral: we started to look for similar sentences from fact-checkers across the world in the Fact Check Explorer database maintained by Google. We found a specific sentence that, when I read it, I thought was in Arabic. I went to Google Translate again, and it turned out it was not Arabic, it was Telugu, an Indian language. What was really impressive for us was to discover that both sentences were really the same claim, in very different languages and expressed in different ways. So this kind of system is capable of finding similar sentences across very different languages.

The problem is that these models are really, really slow. They are big models, and we cannot afford to run them with our computational resources. So we are starting to work with knowledge distillation or quantization techniques, trying to get these models with a smaller footprint. Knowledge distillation basically consists of having a teacher model and a student model: the teacher model is the heavyweight architecture, the one with really good performance, and you train a lightweight student architecture to imitate the teacher network. You have to make a trade-off between speed and performance at that point, but that's all.
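To illustrate the cross-lingual claim-matching idea, here is a small sketch using the sentence-transformers port of LaBSE; the claims, the threshold and the match logic are invented for the example.

```python
from sentence_transformers import SentenceTransformer, util

# Language-agnostic sentence encoder: similar claims end up close together
# regardless of the language they are written in.
model = SentenceTransformer("sentence-transformers/LaBSE")

# Claims that were already fact-checked (in reality these would come from a fact-check database).
fact_checked = [
    "Las antenas 5G propagan el coronavirus.",
    "Unemployment fell by 4% in 2019.",
]
new_claim = "5G networks are spreading the coronavirus."

corpus_embeddings = model.encode(fact_checked, convert_to_tensor=True)
query_embedding = model.encode(new_claim, convert_to_tensor=True)

# Cosine similarity between the new claim and every previously fact-checked claim.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
best_score = float(scores[best])
if best_score > 0.6:  # illustrative threshold
    print(f"Possible match ({best_score:.2f}): {fact_checked[best]}")
```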
And finally, what is really interesting is the idea that BERT knows data. Some people are starting to use language models as knowledge bases. Let me try to explain this. If you remember, the pre-trained language models were trained with massive amounts of data from the internet: they have the whole Wikipedia inside them, the whole Common Crawl, the whole internet, so there is some factual knowledge inside them. And if you remember the pre-training objective we were discussing, it works in such a way that the language model is capable of guessing a masked word in a sentence. If we build a sentence like "El Quijote was written by [MASK]" and ask the model to fill it in, a well-trained language model is probably going to answer "Cervantes". So in the end the language model knows the answer; the language model knows the data. Of course, the knowledge a language model has is a shallow representation: it doesn't hold complicated relationships between the data, but it knows data, and we can use this knowledge to try to answer questions.

One interesting thing is that architectures like T5, which are complete transformers, not encoder-only architectures like the BERT models we were using before, can be reframed to build question-answering systems. In fact, the verification work can be rephrased as a Q&A problem: the question is the claim we want to verify, and the answer we are looking for is the set of evidence, the data we can use to decide whether something is true or false. So these kinds of transformer models, which can also transfer learning across many tasks, can be an interesting solution for the future, and we are starting to work with them to see where they can lead us. Another potential solution for our problem is to work with stance detection: you decide that some amount of textual data is evidence, and with stance detection you can see whether each piece of evidence supports or contradicts the claim. If you have many pieces of evidence from textual data supporting the claim, you can simply say this is probably true, or probably false. And this is all, guys, thank you very much for your time here. Thank you, Lena.

Thank you so much, Ruben, my God, I'm taking all my notes here, and it's okay, I think I just got some water on me. Wow, Ruben. Well, one has to be careful now that we know all this, careful with what we say on social media, there's no way to escape. Thank God we're not as important as politicians. Well, Ruben, thank you so much. In terms of questions, we have a couple of minutes, so let me go ahead with them: what are the limitations of the transformer model, such as context fragmentation, fixed length, etc.?

Yes, there are many limitations; when I explain all this, it may seem that everything is totally solved. First of all, working with long spans of text is difficult, and what is even more difficult is working with big amounts of text. For our problem we don't need a big context, but when you need to handle a text with a lot of characters, maybe four or five paragraphs, at that moment you need to go to different architectures, at least to something like the Longformer, and try different things.
With big amounts of data you need big architectures with billions of parameters, and in the end the computational expense is usually too much for the trade-off between the cost and the benefits you get from them. Regarding the data verification part, the main limitation, which I only highlighted at the end, is that the knowledge these models can hold is really shallow, so their capability to reason is limited. For instance, we know that the plain BERT-style architectures have problems reasoning with numbers; there are specific architectures for working with numbers, so this is something the transformers still have to improve.

Okay, well, it's impressive nevertheless. They ask you: how do you design for false-positive scenarios? Okay, this is something that is really difficult for us. Basically we work with the usual metrics, but we are not, for instance, weighting one kind of error more than another. For our use case we normally prefer higher recall over precision, because for us it is more important that the journalist doesn't miss any lie a politician is saying than the opposite scenario. And at Neutral there are people working on it, but we still have to study the bias of the system. The problem with deep learning networks, specifically transformers, is that you don't really know why they are making their choices, and maybe we have some bias in our datasets. The next thing we have to do is study that.

Which is one of the basic questions everybody asks when we are talking about data, algorithms, etc., because there's always a human behind it. Look at Twitter, what happened with the content of the right versus the left. So just the last final question, very quickly, because they're telling me we ran out of time: latency versus accuracy, question mark? It depends on the use-case scenario. If you are not in a real-time debate, where you have to say liar, liar, liar right away, then for us accuracy is more important. Okay, accuracy is more important. We have to say thank you so much to Ruben Miguel from Neutral. Well done, what a fantastic job, and much love to Ana and all the team at Neutral.