So, bienvenido Álvaro Barbero, Chief Data Scientist at IIC. Álvaro, welcome. Great to see you. Thank you, Nicolas. Thanks for the introduction. So, that's right, I'm going to talk about how to fine-tune your natural language models for the Spanish language, and about the experience we got working with these kinds of models. So, let me just share my slides and we will get going. Okay, here you are. So, the first thing I would like to mention is that I was already present at Big Things last year, so I'm grateful for being here once again. And I'm not only mentioning this to give thanks, but also because my last year's talk was quite related to this year's talk. So, let me start by saying: previously on Big Things... okay, you know how these things go. So, what was my talk last year about? It was also about language models, and I'm just going to give you a very brief summary of it, because I think it's relevant to understand what I'm going to talk about today. You don't need to actually have watched that talk; I'm just giving you the summary. So, last year I proposed this problem: let's try to solve this Kaggle competition about text classification, but instead of using the whole training dataset we had available for it, let's try to use a more real-world-like version of this dataset. So instead of having like 100,000 data points available, we just use 1,600 texts for training our model. And this made things quite a bit more difficult, because now you have to put a lot of engineering and thinking into your model, and you can't just trust big data to do everything for you, right? So there are a lot of techniques you can use from the world of NLP here, and here's a summary of what worked and what didn't. So, if you use standard models, and when I say standard models, I'm talking about very basic machine learning models like doing bag of words with random forests, linear models and so on, well, this can only take you so far, right?
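To make the "standard models" baseline concrete, here is a toy sketch of the bag-of-words representation mentioned above: each text becomes a vector of word counts over a fixed vocabulary, which you can then feed into a random forest or a linear model. The tokenizer, vocabulary, and example sentence here are hypothetical, just for illustration.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a text as a vector of counts, one entry per vocabulary word."""
    # Very naive tokenizer: lowercase and keep runs of (Spanish) letters only.
    tokens = re.findall(r"[a-záéíóúñü]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocab = ["good", "bad", "movie"]
vector = bag_of_words("A good movie, a good one indeed", vocab)
print(vector)  # one count per vocabulary word: [2, 0, 1]
```

Note the limitation the talk points at: every feature is a raw surface-level count, so with only 1,600 training texts there is very little signal for a downstream classifier to generalize from.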
Of course, if you use recent advances in neural networks like LSTMs, gated recurrent units and all that neural stuff, you can do better. But there is a significant point of improvement that is very useful when you have so little data for training: it's when you use parts of the network that were pre-trained on a different dataset, so when you use transfer learning. And you can see in this plot that when we used the fastText embeddings, we got a significant improvement. So fastText here refers to embeddings that were pre-trained by Facebook on a general dataset of the English language. And if you incorporate them into your model, it means that you don't have to learn that part yourself. Okay, but take a look at what worked best for this problem. So we have this logo here of a hugging face. Actually, that's the name of the company that created this library, Hugging Face. The library is called Transformers, and it worked really well for this problem. So what is this Transformers thing about? Why did it work so well in this problem, with so little data? Well, the idea is that this library mainly provides language models, which will be the main point of this talk, right? So what's a language model about? Well, if you think about the way you learn any language, you actually do it in two steps, right? Just think about the way a baby learns language. You first learn to read, or learn to speak, let's say. And then, after you are, let's say, proficient in that, you can start doing some more interesting things, like classifying topics, detecting emotions, summarizing arguments, answering questions, all kinds of different tasks, right? But before doing all this interesting stuff, of course, you need to learn how to read, right?
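The transfer-learning trick with pre-trained embeddings can be sketched very simply: instead of learning word representations from your 1,600 texts, you look each word up in a pre-trained table (fastText vectors in the talk) and, in the simplest scheme, average them into a sentence feature vector. The two-dimensional toy vectors below are made up for illustration; real fastText embeddings have around 300 dimensions.

```python
# Toy pre-trained embeddings (stand-ins for real fastText vectors).
embeddings = {
    "good":  [0.9, 0.1],
    "movie": [0.2, 0.8],
    "bad":   [-0.7, 0.3],
}

def sentence_vector(text, embeddings, dim=2):
    """Average the embeddings of the known words; zeros if none are known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

features = sentence_vector("good movie", embeddings)
```

The point is that the geometry of the embedding space was learned from a huge general corpus, so the downstream classifier only has to learn the task, not the language.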
Now, the point is that in most of our NLP models, we try to do both things at the same time, because we just have a single dataset, like the dataset I just described for this Kaggle task, and we have a model that starts with no knowledge about language and tries to learn everything, in order to be able to classify these texts into different classes, right? So it's a lot of learning effort we are putting into the model, right? So what happens if we split the training procedure into these two tasks? This is what language models are currently doing. And just to give you an example, BERT is probably the most famous language model. It was introduced about two years ago by Google, and they performed the training procedure in exactly the two-step way I described. So first, you take a very large dataset of the English language, in this case, or any language you want to model. And this dataset does not need to be labelled, which means you can use the whole Wikipedia, for instance, or any dataset of books, or dumps of the internet, really whatever. And you make the model learn how words are structured in this language, okay? So once the model has learned this part, which is akin to the idea of learning how to read, you can fine-tune the same model for your specific natural language problem, right? For instance, in this slide, you can see spam classification, which means you will take your small dataset of labelled data about spam or non-spam messages, and you will fine-tune the language model for that task, right? So how is this actually performed, right? Because, well, probably most of you understand how a model can learn from a labelled dataset, but what about the first part, with the unlabelled dataset? Okay, the key idea is to build a language learning task by, well, changing a little bit the unlabelled dataset we have constructed.
So the first task that was used in BERT for learning the structure of the language is called masked language modeling. And the idea is very similar to what you might have seen when you do exercises to learn a second language. You take a sentence, pick a word from that sentence, and take it out, so you just leave a blank space there. And the model tries to predict which word is missing from that gap, right? So in the example on this slide, we remove the word "improvisation" and replace it with a special token called [MASK], and the model needs to predict it. And the way it does it is, well, BERT is essentially a very large, deep neural network, and it will tune all the weights in the network to optimize this task of filling in the gaps you are creating. So this is the first task used for learning these language models without actually having labels. And the second task is a little bit of a higher-level task, in which you take two consecutive sentences from the same document, put them together, and tell BERT: okay, BERT, this is a positive sample. Then you take two sentences from different documents, put them together as well, and tell BERT: okay, BERT, this is a negative sample. So in this way, what BERT is trying to solve is a binary classification task in which it's trying to tell apart pairs of sentences that came from the same document or from different documents. So you are trying to make this language model understand a little bit more of the overall structure of the document, right? It needs to understand the structure, the correlations between words and sentences, and so on. And then, after that, you can perform this fine-tuning step I told you about, in which you perform a few more steps of backpropagation to tune your neural network for your specific supervised dataset. Okay, so that's great. And the thing is that, well, this was just the first language model.
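The two self-supervised tasks described above can be sketched in a few lines: both manufacture labelled training pairs out of raw, unlabelled text. This is a toy sketch of the data construction only (the hard part, training the network, is omitted), and the function names and example sentences are hypothetical.

```python
import random

def make_mlm_example(sentence, rng):
    """Masked language modeling: hide one word; the model must predict it."""
    tokens = sentence.split()
    position = rng.randrange(len(tokens))
    target = tokens[position]          # the label is the word we removed
    tokens[position] = "[MASK]"
    return " ".join(tokens), target

def make_nsp_example(same_doc_pair, other_doc_sentences, rng):
    """Next sentence prediction: consecutive sentences from one document are a
    positive pair (label 1); a sentence swapped in from another document
    makes a negative pair (label 0)."""
    first, second = same_doc_pair
    if rng.random() < 0.5:
        return first, second, 1
    return first, rng.choice(other_doc_sentences), 0

rng = random.Random(0)
masked, target = make_mlm_example("the pianist loved improvisation", rng)
```

Because the labels come for free from the text itself, you can run this over all of Wikipedia without any human annotation, which is exactly why the pre-training corpus can be so large.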
But since BERT was created two years ago, we have seen an incredible evolution in language models. In this plot, you can see how, as time has passed, we have been obtaining language models with more and more parameters. So here, more parameters means that the deep neural network we are using to learn the language is larger and deeper. It even goes to the point where you have GPT-3, which was released last summer by OpenAI, and this model has 175 billion parameters. So it's a huge model, right? It's so large that you can't even use it for, let's say, business-as-usual tasks, but it is able to solve very complex problems. Now, the thing is, we have seen this evolution here, right? And we have seen improvements not only in the size of these language models, but also in different aspects of these models, like improving the architecture and improving the efficiency. I have summarized more or less how these models have been improving in the two trends you can see here. On the one hand, we have models that try to exploit larger datasets, or build larger neural networks so they can learn more from that data. And there is another trend, which goes in parallel, which tries to compress these models: to build smaller models that perform at more or less the same level of accuracy on your NLP task, or architectural improvements, which try to learn the model in a different way, maybe using slightly different layers or slightly different ways of learning from the data, so you can make the learning procedure more effective, right? So here we have some examples of both trends, like RoBERTa from Facebook, T5 from Google, GPT-2 and GPT-3 from OpenAI, DistilBERT from Hugging Face, BORT from Amazon, BART, ALBERT again from Google, and ProphetNet from Microsoft. So we can't really draw a clear line as to whether each one of these models follows one trend or the other.
Some of them provide mixed improvements, but we can see that the evolution of this field is more or less following these lines. Right, so that's great. It's not only companies that have very large clusters that can use these models; we also have smaller versions that can be used for, let's say, our daily tasks. And the thing is that this doesn't seem to stop here, right? The plot you are looking at right now is extracted from the GPT-3 paper. And what you can see here is that when you use a larger model and you spend more time optimizing that model, the improvements you get in these language learning tasks are quite noticeable. And we haven't yet reached a point at which increasing the model even more doesn't provide additional benefits. Of course, the issue is that for every step of improvement you have to scale up the model size, so it gets more and more costly to build these models. But it seems that we still have a long way to go in terms of making these models bigger and more efficient, and getting further improvements in our NLP tasks. So, well, that's great, since we have a promising future in this field. But the thing is, if we take a look at all of these models, it seems like they are very focused on one very specific language, right? So BERT was created for the English language. RoBERTa, which is the Facebook version, also speaks English. We have GPT-2 and GPT-3, which also speak English. We have BART, and also ALBERT, which speak English too. So, well, maybe we could ask a question: so, ALBERT, how is your Spanish going? Hmm, not so well, right? So, where are the other languages? We are focusing on Spanish because, well, this conference used to be named, two years ago, Big Data Spain, and I also work at a data science firm in Spain, but we could ask the same question for other languages. Well, fortunately, we have a multilingual version of BERT.
So, this version was trained with about 100 Wikipedias, which were selected basically by taking the Wikipedias with the largest number of pages. So, we have here documents in Spanish, German, Italian, Russian, Chinese, Japanese, and so on, up to about 100 languages. And they built a single language model that was able to work with all these languages. This means, in particular, that you have a shared vocabulary, which includes characters from the Latin alphabet, but also Cyrillic, Kanji, Hiragana, Hangul, and so on. So, this is a larger model, but it actually works, and it can be used for languages other than English. So, we ran a little test to see if this works well, if we could apply it in particular to the Spanish language. The test we created was based on the TASS task. TASS is a competition that is held every year in Spain; it accompanies a conference called SEPLN, which is about the Spanish natural language processing community. And, well, this is a sentiment analysis task, right? So, we get some short texts, and we try to predict whether each text is positive or negative. And the interesting thing here is that this task is defined for the Spanish language, but also that it mixes several Spanish variants. So, we have texts from Spanish speakers from Costa Rica, Spain, Mexico, Peru, and Uruguay. We created, let's say, an artificial dataset by aggregating different datasets from several years of this competition. And, well, you can see here the results on the test dataset we prepared. It seems that when you use the classic methods, like, again, bag of words, random forests, and so on, even if you use some embedding methods, like the embeddings provided by spaCy, you're still far away from the performance of this multilingual model. And even though this model wasn't specifically designed to learn how Spanish works, it worked very well on a Spanish task. So, that's good news, right?
So, most of the language models we have around are for English, but this multilingual model works well in Spanish. Now, the thing is, if you take a look at the community, you can see that some language-specific models are starting to appear. And here you can see some funny examples: we have CamemBERT for the French language, RobBERT for Dutch, GreekBERT for the Greek language, right? So, what about Spanish? It's one of the most spoken languages in the world. Well, fortunately, we do have a language model that is specific to Spanish, and it's called BETO. It was created by some researchers at the Computer Science Department of the University of Chile. An interesting thing about this model is that it closely follows the architecture of the original BERT, but with some of the improvements that RoBERTa, the model by Facebook, introduced. It's also available in the Transformers library, so you can use it straight away. And another interesting point is that they used a new corpus, a new dataset that the authors created by aggregating a lot of different sources of Spanish documents, like Wikipedia, books, documents from the European Union, and so on, right? So, we have a dataset of three billion tokens that was used to train this model, and this model is specific to the Spanish language. So, this should be something useful for Spanish NLP tasks, right? Well, we tested it on the same task I showed you two slides ago, and indeed the improvement is quite significant. We already got an improvement by using a language model that wasn't specific to Spanish, but when we actually use a model specific to Spanish, we get a further 10% improvement in the F1 score for this task. So, this is a very significant improvement. So, that's great, good news: this BERT-style language model works very well for Spanish. Now, the thing is, we wanted to push it a little bit further.
So, we designed this experiment, which is a little bit mischievous, if I might say, because we wanted to test what we call the domain gap. This means we are going to fine-tune BETO for one domain of language, and we are going to test it in a different domain. When I'm talking about domains, let me give you an example of what I mean. So, this is a problem about tweet classification, again about positive or negative opinions. And the first domain is related to tweets about social issues, like people discussing politics, feminism, inclusive language, and so on. So, we trained the model on this domain, and then we took that model and changed it to classify tweets from a completely different domain. This is domain two, as you can see on the slide, and it was about customers expressing their opinions or complaining about different supermarket chains. So, you can see these are two completely unrelated domains. The task is the same, it's sentiment analysis, but the domain is different, so the language used is different. You can see here some examples of the kind of tweets we gathered. They're in Spanish, of course, so for non-Spanish speakers, let me explain. In the first example, for instance, we see somebody saying that they are in favor of associations that fight violence towards women and children, while the negative tweet is complaining that somebody seems to be very interested in these feminist issues now, but in the past didn't pay much attention, right? So, these are the kinds of discussions we're working with. Meanwhile, in the second domain, you can see a positive tweet expressing support for some supermarket chain, while in the next tweet, you can see somebody complaining that they're never going to buy from that supermarket chain again, okay? So, these are the kinds of tasks we're trying to solve, positive and negative. And here are the results.
So, again, we are comparing a classic machine learning method, like TF-IDF plus a linear model, against BETO. And you can see two bars here. The blue one is the result we obtained for the first domain, which was the domain on which we trained our model. And the orange one is what we got when we actually took this model into a different domain. The scores are normalized, so the performance on the first domain for the first model is 100%. And you can already see there is a domain gap there, right? When we try to apply the same model to a different domain, even if it's the same task, we see a drop in performance, which is to be expected, right? Because the language is a little bit different. So, what happens if we use a language model like BETO? What we see here is that, first, we have a general improvement, just because we are using a language model: we go from 100% to 110%. And then, the domain gap we find when changing domains is actually smaller. And this makes a lot of sense, because the first method is based on just counting words and expressions, while the method based on a language model has learned the general structure of the language. So, even if you change domains, because it has seen data from a lot of different domains in the pre-training stage, it knows more or less how to make the jump, right? That's a little bit of a hand-wavy explanation, but this seems to be the main reason. Right, so, again, this works very well. Now, there's one more thing. What happens if we compare this language model we have, which works beautifully, against the latest language models in English? We can see a huge difference, starting with the training datasets: the datasets we can find for English language models are like 20 times larger. And this also means that the model, the neural network you are training, is quite a bit larger.
So, GPT-3 is probably an exception, because it's too large for most practical applications, I would say. But some other language models that work very well in English are like two or three times the size of the language model we have for Spanish. And this also implies that you will need better hardware to train, right? So, the language model we have for Spanish is great, it works very well, but there is a clear path for improvement, right? All of the improvements that we have seen for the English language over these past two years can also be applied to the Spanish language. So, there's a long way to go. And because of this, let me introduce you to the RigoBERTa team, okay? We are a team of experts in Spanish NLP, and the team comprises a variety of experts, from computational linguists to computer scientists and machine learning experts. And our objective with this project is to build a new language model for Spanish, which we are going to call RigoBERTa. Our plan is to build a model that works better than the one we have right now by incorporating all the new features that English language models have been showing in the past years, past months, let's say. So, we want to incorporate five key ingredients here. First, we are going to build a larger corpus. We also want to apply more detailed control over the kind of texts our model will learn from, so, better corpus quality. This will imply that we will also need to use better hardware. We also want to incorporate into the neural network model itself some of the advances we have seen, again, in recent English language models. And there's a final key step that we want to try, which is domain adaptation, in order to reduce even more the domain gap I was talking about before. So, let me tell you a little bit about these points and how we are doing this right now.
So, the dataset we are building for pre-training this model is built from three different sources. The first one is called OSCAR, which is an openly available dataset that you can find online. It's split by languages, and, well, it's basically a dump of the internet, right? For the Spanish language, you have about 26 billion words there. So, that's a lot of content, right? We are also adding our own dataset that we gathered from different providers, which is about 128 gigabytes of news from different Spanish media outlets. And we are also adding the Spanish Wikipedia, which is actually not so big compared with the other datasets, but it's a useful reference for high-quality texts. So, let me tell you what I mean when I talk about high-quality texts, which is another key ingredient we are thinking about. The idea is that we want to take all of this massive dataset we are building and filter it a little bit, in such a way that texts that are not really representative of the language get dropped. We are doing this in three steps. First, we are removing duplicates that we find in the dataset. Even if these datasets are more or less curated, you can sometimes find the same piece of news in different parts of the dataset, so we want to remove that to have a more faithful representation of the language. We are also removing texts that include a significant portion written in a language other than Spanish. And we also want to add a quality filter, so that texts that are more similar to how people really write in Spanish are the ones given more relevance when training our model. Right. So, for instance, to remove the texts that are in a different language, we are building two lists of characters, one of which we call the positive list. You can see here that it contains characters that belong to the Spanish language, like those from the Latin alphabet.
But you will also find symbols, numerals, and different kinds of emojis that we find to be commonly used, and we do want our model to learn embeddings for those emojis. On the other hand, we also have another set of characters which are not so useful, because they might be characters from different languages like Korean, Japanese, or Chinese, or special Unicode characters we don't want in our data, or emojis that are not so common, so we don't want to spend the time learning embeddings for them. So, that's one filter. Now, the second one is our quality filter. The idea is that we developed, inside our team, a guideline for which kinds of text we consider to be of high quality. To give you an example, this means that any text that looks like HTML garbage, with lots of HTML tags and so on, we don't want in our model, because that's text that somehow entered the dataset but doesn't represent the way people write Spanish. On the other hand, news or pieces of Wikipedia articles and so on, those are high-quality texts. Okay, so we standardized this definition of what it means for a text to be of high quality. Then we manually labelled a gold standard of texts of high and low quality. And with this gold standard, we trained a simple machine learning model that, just by looking at word counts and simple features, is able to produce a score. So, you input a text and it tells you the quality of this text is 90%, or 10%, or so on. The way we want to make use of this is by sorting all our data by this quality score, and making our language model start learning by looking at the top-quality texts, then keep going down the list. In this way, we are going to put more focus on text that looks like the text people actually write on the internet and in other sources, like real Spanish. Right.
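The three cleaning steps just described (deduplication, the character-list language filter, and quality-based ranking) can be sketched end to end in a few lines. This is a deliberately crude stand-in: the character whitelist is abbreviated, the threshold is invented, and the quality score here is a hand-written heuristic, whereas the real filter is a trained model over word-count features.

```python
# Abbreviated "positive list" of characters (the real list is much longer).
ALLOWED = set("abcdefghijklmnopqrstuvwxyzáéíóúñü0123456789 .,;:!?¡¿'\"-()\n")

def allowed_fraction(text):
    """Share of characters that belong to the positive character list."""
    return sum(c in ALLOWED for c in text.lower()) / max(len(text), 1)

def quality_score(text):
    """Crude heuristic stand-in for the trained quality classifier:
    penalize HTML-looking garbage."""
    score = allowed_fraction(text)
    if "<" in text and ">" in text:
        score -= 0.5
    return score

def clean_and_rank(corpus, min_allowed=0.9):
    seen, kept = set(), []
    for text in corpus:
        if text in seen:                              # 1) drop exact duplicates
            continue
        seen.add(text)
        if allowed_fraction(text) < min_allowed:      # 2) drop mostly-foreign text
            continue
        kept.append(text)
    # 3) sort so the highest-quality texts are seen first during training
    return sorted(kept, key=quality_score, reverse=True)
```

For example, a corpus containing a duplicated Spanish sentence, an HTML snippet, and a Korean sentence would come out as just the one clean Spanish sentence, ranked first.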
Now, to train all of this, we are currently testing different hardware platforms, so we are currently benchmarking GPUs and TPUs. We don't have a clear-cut winner right now, because that's also something we are working on in this project. And we are also incorporating some technical details from different English language models that have worked very well. Like, for instance, using very large batch sizes, which gave very good results in the RoBERTa model; or changing the pre-training tasks a little bit, like what is done in the ALBERT model; or the cleaning procedure that was used in GPT-3, which our cleaning procedure is actually heavily inspired by. So, those are some improvements that, at the same cost of training the model, we expect will result in a better model. Right. There is one more thing that we want to add, which is domain adaptation. Our plan is that we are going to build a general language model for Spanish, but after that we want to perform some more fine-tuning iterations to learn how particular domains of language are structured. For instance, we are now working on the legal domain, and we want to create a legal version of RigoBERTa, which we can then apply to problems in the legal NLP sector. And I can say that we are working together with the law firm Garrigues in trying to find applications of this model in the legal sector, which is a great opportunity. We also have in mind that in the future we might specialize this model for biomedical and other tasks, right? So, that's our plan. Now, what can we say about the model right now? Just very quickly: what we have done so far is to create an alpha version of RigoBERTa by running more training iterations over the BETO model using our dataset, right? And with this, we have already seen some improvements. So, here you can see the learning curves.
We can see how, as the learning progresses, we are able to obtain better results when we use our cleaning pipeline versus when we use the unclean corpus. So, it seems that cleaning the corpus is useful. The same happens if we use larger GPUs, which means we can use a larger batch size. As we learned from the different papers on English language models, this seemed to be a good idea, and indeed, it looks like it is. But to give you some more specific results, we also tried this on a document classification task. Very briefly: we gathered some tweets talking about the Spanish city of Toledo, and we tried to classify them into nine different topics, which are highly unbalanced, and they can also be mixed, so the same tweet can belong to different categories. We compared different models on this problem, and we got some surprising results. Because, since this is a document classification task where semantics are not so relevant, a basic machine learning model works quite nicely. And actually, the language models we had before don't work so well. But it seems that with these alpha versions of RigoBERTa, we are now able to obtain slightly better results than with the classic model. So, this is a good piece of news: it seems that the work we are doing is actually working out. Now, what do we plan to do from here? Our next step will be to completely build this corpus, this training dataset, and to finish the cleaning pipeline. Then we will gather a cluster of GPUs or TPUs, whatever we decide to use, and train this model from scratch, right? That way we don't depend on the way BETO was trained, and this might allow us to go one step further, right? So, this is our plan for the end of this year and probably next year. And that brings me to some final conclusions, not only about our plan in NLP, but about the general picture, which is that, well, language models work really, really well.
Geoffrey Hinton joked in a tweet that maybe building larger and larger models will give us the answer to the question of life, the universe and everything. And also, François Chollet, who is the author of Keras, said that, well, you might keep building these models larger and larger, but this doesn't really take you to strong AI, in the same way that building taller and taller towers won't take you to the moon. It means, well, we are kind of going in the right direction, so these models get better and better. But probably, with this technology, we won't be able to actually build a really strong AI, like a language model you can talk to that can reason and answer you in a human way, right? And the reason is that this is kind of like the old Chinese room argument. From the outside, you see a very large model that has surprisingly good results, like the results you have probably seen for GPT-3. But on the inside, it's just a giant correlation machine. That's what we have. These models do not have a real grounding in what each word means. What's a pen? The model doesn't know; it only knows that the word "pen" goes together with these other words. And with that, it produces very beautiful, rich sentences, but it won't give you strong AI. So, what I'm trying to say with this is that we are not really closer to the perfect NLP AI. Alvaro, hi. I'm so sorry to interrupt you, but we are starting to run over. Would it be possible for you to wrap things up fairly quickly? Actually, I was wrapping up already. So, just my final sentence: these models won't take us to strong AI, but they will be more and more useful in practical problems. And that's my key takeaway. That's fantastic. Thank you so much, Alvaro. So, we have time for a few questions. That's great. Thank you. Very detailed presentation. Fascinating. A question from Angel.
He asks, when do you expect to get final results from the RigoBERTa project, and what business applications do you plan for it? Okay, good question. Our original plan was to have this model finished by the end of this year, but cleaning the corpus is becoming a little bit more difficult than we thought, so we are still working on that. So, maybe the first months of next year. And for practical applications, we are thinking about, well, sentiment analysis and document classification, but maybe also document summarization is something we have in mind. Okay. I'm curious about NLP in general, and I'm wondering if some languages are intrinsically more difficult for any NLP. You mentioned the multilingual BERT platform that was adapted for 100 languages. Does it perform equally in all languages? Or are some just easier, structurally, or because of the pronunciation, or for whatever reason always going to be more of a challenge? Well, we do have the hunch that Spanish is more difficult than English because of the more complex structure of the Spanish language. But it also highly depends on how much training data you have. For languages in which you don't have so many data resources, it will be more difficult. Right. And you started to outline some of the sectors that you're going to be working in going forward. I think you mentioned the biotech sector. Can you talk in a little bit more detail about where you see the greatest opportunities for what you're doing right now? Okay. So we have a parallel project that is about building a kind of question-answering bot, which you can ask questions about COVID, about different things about this virus. And it will try to look in a database of documents and tell you, well, I think the answer to this question is this one. So if you improve the base technology you are using, which is the language model, you will be able to build a bot that can give you more accurate answers to your questions.
So that's one of the applications, but this is a general technique, right? So once you improve the base language model, all your NLP applications can benefit from it. Right. I don't know if you have this kind of information or data, but is it possible to quantify just how big an industry NLP is in 2020, and how much it's grown over the last few years, just to get a sense of how important this is becoming now? Well, I don't have precise data to give you, but I will say that, at least, the interest, let's say the industrial and academic interest, in this field has grown in the last years. You can see it in the number of papers, and also in the number of applications we can find around. So I would say it's a rising trend right now. We have some requests for your slides. Will you be sharing those later? Yes, of course. Is there some way for people to get them? Well, I guess we could share them through Twitter; we might send a link, okay? Okay. So if you follow me on Twitter, I will try to share them that way. Fantastic. There was another request for your contact details, but people can always reach you through the networking platform as well. So, Alvaro, once again, thank you so much for that highly entertaining and interesting talk. Hopefully we'll see you again next year. I hope so too, okay? Okay, thanks so much. Thank you, Nicolas. Thank you, everyone, for listening.