OK, good morning, everybody. Thanks for choosing this talk so early in the morning. So we're going to talk about text. And text is everywhere. Maybe you already know this infographic: it's about how much data is generated in different regions of the internet every minute. And the highlighted regions here are those that are mainly made of text. So we are talking about Google searches, about texts, Tumblr posts, emails. If you are unable to process text in your data analysis pipeline, then you are missing out on a lot of information. And when I talk about processing text, I'm talking about different kinds of problems, different kinds of challenges, that we can address using text analysis techniques. For instance, we could talk about text classification. This task is mainly about: I give you one document, and you have to place some labels on that document. So maybe say what the topic of this document is, or whether this document is saying something positive about this issue or something negative, and so on. That's the easy part, let's say, of text processing. But we have more complex problems. We can also talk about text tagging, and that means we are going to put a single label on every single word that appears in the document. We can do this for useful tasks, like finding names of people in documents, or names of places, and so on. And if we want to go one step further in complexity, we also have question answering, or information extraction. Here the idea is that I'm providing the machine learning model not just a document, but also a question that is phrased as written text, and the model should be able to tell me which part of the document gives me the answer to that particular question. Maybe this is a little bit abstract, so let me give you some examples. Let's take this very simple sentence: "There's a dog in John's garden." Maybe we have a model that tries to classify this document, this sentence, into two classes. We heard earlier, in the Google talk, this example about classifying images as cat or not-cat. We could do the same thing with text: is this document saying something about dogs? Well, yes it is, so we can classify it as a dog document. Not something really useful on its own, but we will get to more complex things in a bit. We can also try to do some kind of sentence tagging here, some sequence tagging. So I might ask about this sentence: is the name of a person mentioned in this text? And it turns out there is: John is a person, so that particular word will get this person tag. The other tokens here have nothing interesting, so they will remain empty. And as I said, we can also try to do some kind of question answering. I might ask this document: well, where is the dog exactly? And if the machine learning model is clever enough, it should be able to highlight this region of the document, telling me that the answer to this question is that the dog is in John's garden. So these are the kinds of tasks that we can try to solve today. There are other ones, like translation and so on, but I'm going to focus on these. What I will do in this talk is apply a range of techniques, from very simple to more complex, to one very specific problem. It will be this problem. This comes from a Kaggle competition that took place two years ago, and it's about classifying comments found on the internet by their levels of toxicity. So the dataset looks like this.
So we have a single piece of text for each document, and we have to tell whether it contains any of these six types of toxicity. It can range from toxic to severe toxic, obscene, threat, insult, identity hate. And this is a multi-label problem. Multi-label means that each of these kinds of toxicity might appear on its own or combined with others. Maybe we have text in which all of them appear at the same time, or maybe none at all. So we have all these possible combinations. Fortunately, for this challenge we had a lot of data to train on: about 160,000 training texts. That's quite a huge dataset, right? And we also have more or less the same quantity for testing. So when this competition appeared on Kaggle, what did everybody start doing? Deep learning everything. Whenever you face a complex multimedia problem, deep learning is the way to go. So everybody tried this, and you can actually get very good results that way. You won't win the competition, because for winning you will need to do a lot of very clever feature engineering and, well, work the data to get some more synthetic samples and so on. But if you do that, you can more or less do something useful. Now, the problem is that I'm here to talk about real world problems, not about Kaggle competitions. Kaggle is very nice, but the real world is very different. The real world looks like this. So welcome to Arrakis. Maybe there are some Frank Herbert's Dune fans around here. The thing with Arrakis is, you don't have data. So maybe your data lake looks like this: an empty can in the desert. There's no data to be found. So that's an issue. Now you can't train very large machine learning models, because you have no data. By the way, since we are on Arrakis, those of you that work on the data engineering side will also sometimes find that you have some terrible bugs in production, right? That's a different issue. I won't address it in this talk, but I just wanted to mention it. Okay, so instead of working with this data, which might not be so realistic for industry problems, we are going to apply a real world filter here. And now we only have 1,600 training data points. Okay, that's it. That's a more reasonable dataset for a real world problem. I will still keep the whole test data, because I want to show you that the methods I will try generalize well enough. But I will only use the test data to measure accuracy, right? I will apply all the methods on the training texts only. And by the way, all the code I use for this talk is publicly available on GitHub. So you can just follow this link, or if you look for my name and Big Things on GitHub, you will find it right away, okay? So how do we solve this problem? Well, let's start with the very basic stuff. We can build a baseline model, and on top of that we will see how to improve it using more recent natural language processing techniques. So probably the simplest observation is that you can't just take some text and input it into a machine learning model; you have to do some kind of feature engineering. And the simplest feature engineering method you can do is bag of words. What does this mean? Well, essentially, you will take your training data and build a dictionary with each and every one of the words that appear in your training data. And you will assign a unique number to each one of those words.
So when a document arrives in your system, you will build a very large binary vector, with one entry for each one of your words. And the way you codify your document in that vector is just by placing a one at the position of each word that appears in the document. The rest of the vector will be all zeros. To put it differently, what you're doing essentially is building a vector that tells you which words appear in this particular document or that particular document and so on. This is very simple, but if we do this, we already have a fixed-length feature vector that can represent any document in our data. And once you have that, you have a nicely framed data frame, you can already use a standard machine learning model like a random forest, and you get a result. You can take one step to improve this a little bit: instead of using the simple bag of words method, which just takes a look at each one of the words independently, you could take a look at couples of words that appear together, or triplets of words that appear together. You could still use the bag of words technique for that, but the problem is that the binary vector will grow very, very, very large. To avoid that and get something that works efficiently, you can use this other technique that is called the hashing trick. You can think of it as a way of compressing the bag of words representation, but the idea is the same: you will make a dictionary, and here you will take note of which groups of two and three words appear in your data. And now, since your feature vector is so large, maybe a random forest is not such a good idea; you might use a linear SVM, which is a model that works very well in high-dimensional spaces, okay? So this is the theory, but implementing it in Python is really, really easy. You just need to use scikit-learn, and this is the code. I'm skipping the messy code that deals with the data cleaning and so on, but the code for your model is just this. For the bag of words plus random forest model, you just define a scikit-learn pipeline, use the CountVectorizer transformer, which implements the bag of words stuff, and then you apply the random forest model, and that's about it, right? If you want to use the more advanced technique here, with the hashing vectorizer, essentially you just need to replace the CountVectorizer by a HashingVectorizer. There are also some not so nice things about using a linear SVM, because it doesn't really fit the multi-label problem that we are trying to solve here, so you need some wrappers. But the key idea is that you just define a transformer and a model, and that's about it. That's very simple.
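To make that concrete, here is a minimal sketch of both pipelines along those lines. This is a hedged reconstruction, not the exact code from the repo, and the hyperparameters are assumptions:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Baseline: binary bag of words + random forest
# (random forests handle multi-label targets natively)
bow_model = make_pipeline(
    CountVectorizer(binary=True),
    RandomForestClassifier(n_estimators=100),
)

# Hashing trick over single words, couples and triplets + linear SVM;
# LinearSVC is not multi-label by itself, hence the one-vs-rest wrapper
hashing_model = make_pipeline(
    HashingVectorizer(ngram_range=(1, 3)),
    OneVsRestClassifier(LinearSVC()),
)

# bow_model.fit(train_texts, train_labels)  # train_labels: (n_samples, 6) binary matrix
```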
If you do this, you can train these models with your handful of training data, and then you get these results on the test data. This problem is, fortunately, very unbalanced, and I say fortunately because most of the comments you find on the internet are non-toxic. That means we can't really use accuracy to measure performance on this problem, because it would be way too biased towards the negative class. So instead we measure the area under the ROC curve. And what happens here is that you can get an AUC of about 0.86 using the simple method, and if you use the hashing vectorizer, which takes couples and triplets of words into account, then it works a little bit better, okay?

So that's not so bad for a first attempt, but as I said, this is the baseline. I want to introduce more modern text analysis techniques, and we will see how we can apply them even when we have so little training data. The first thing I want to talk about is embeddings. It happens that I was actually here three years ago giving a talk, and I already talked about embeddings. They were in fashion back then, and they are still in fashion. So what's an embedding really about? The idea is to represent each word with a vector of features that the model can learn by itself, instead of using a fixed binary vector. Let me show you how this works. We have a word like "dog", and the first step is to just pass it through the bag of words model. There's nothing special here: we have a dictionary of words, and this dictionary tells us that "dog" is the 20th word in the dictionary. So we represent "dog" as a binary vector in which everything is zeros except for the 20th position, in which we have a one. Sometimes these kinds of vectors, all zeros except for one position with a one, are called one-hot encoding vectors, okay? That's the traditional representation in the bag of words model, but the key idea that embeddings provide is that now we are going to compute the product of this vector with a weights matrix. And something interesting happens here. How do you compute the product of a vector with a matrix? Essentially, you take the vector and multiply it, column by column, with the matrix. But if you do that with a vector that contains all zeros except for one position, then all the entries that get multiplied by zero simply disappear, and the entry that corresponds to the 20th position, the one that corresponds to the index of the word "dog", gets multiplied by one. So multiplying this vector by a column of the matrix just means: take the value at that position. You do the same with the second column and the third column and so on, so what this vector-matrix product is really doing is just picking that row from the embedding matrix, and that row will be the representation for your word. Now, why is this so fancy? What's the interesting point of doing all this linear algebra? Well, the key thing is that we have implemented an operation that selects a row from a matrix, but since we implemented it as a matrix product, this operation is differentiable, and everything is beautiful. Why is it so great for an operation to be differentiable? If you know a little bit about how neural networks work, it's because when the operations in a neural network are differentiable, so that you can compute their derivatives, you can make the neural network optimize the parameters of that part of the network. So with this idea we can build a neural network model that looks like this. First we have our text; we apply some tokenizing procedure to get each one of the words of that text; and then we have this embedding layer, which performs the operation we have just seen: take each token and replace it by its corresponding embedding vector. But now the embedding matrix we have been discussing will be a parameter of this layer.
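As a quick illustration of that row-selection trick, here is a tiny NumPy sketch (the sizes and the index are made up):

```python
import numpy as np

vocab_size, embedding_dim = 1000, 8
W = np.random.randn(vocab_size, embedding_dim)  # the embedding (weights) matrix

one_hot = np.zeros(vocab_size)
one_hot[20] = 1.0  # "dog" is the 20th word in the dictionary

# Multiplying the one-hot vector by the matrix...
via_product = one_hot @ W
# ...gives exactly the same result as picking that row directly:
via_lookup = W[20]
assert np.allclose(via_product, via_lookup)
```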
So what happens next? After the embedding layer, we need to take all the embeddings for each word of the document and summarize them in some way. Probably the simplest way of doing this is just by computing the average, so that means taking the average embedding over the whole document. After that, you have a fixed-dimensional vector, let's say a feature vector that represents the meaning of the complete document you are inputting, and now you can build a standard machine learning model on top: you can have a dense layer and an output layer and whatever you want to add. The thing is that this whole system works as a single block. If you take a closer look, what is going on here is that this part of the system is doing the feature generation for you, because it's looking at the words and deciding which kind of vectors, which kind of features, should represent each word. And this part of the system here is performing the classification: it takes a look at the features and decides what kind of document you are looking at. But since, as I said, everything is differentiable, you can do backpropagation through the whole network. And that means the network will be learning two things at the same time. It will learn how to classify your documents, but it will also learn the parameters of this embedding layer. So it will be learning the best feature representation for each one of the words in your training data. That's a very powerful idea: you don't need to think about feature generation anymore; the network will do that for you, okay? This is also easy to implement. For instance, in Keras, which is a popular deep learning framework, it works like this. First we have some imports. We define the Keras model here: this instruction initializes the model, and then we define each one of these blocks one by one. So here we are adding the embedding layer; then we add the average layer, this block over here; and then the dense hidden layer and the output block, which we add in the same way, okay? As you can see, it's very easy. You almost have a one-to-one mapping between the network architecture and the blocks you are adding here.
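A minimal Keras sketch of this average-based model could look like the following; the layer sizes here are assumptions, not the exact values from the repo:

```python
from tensorflow.keras import Sequential, layers

vocab_size, embedding_dim, max_len = 20000, 100, 200  # assumed sizes

model = Sequential([
    layers.Embedding(vocab_size, embedding_dim, input_length=max_len),  # learned features
    layers.GlobalAveragePooling1D(),        # average all the word embeddings
    layers.Dense(64, activation="relu"),    # dense hidden layer
    layers.Dense(6, activation="sigmoid"),  # one output per toxicity type (multi-label)
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```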
And if you do that and you train this model, it should work better, right? Because now you are not using a fixed feature representation for your words; you are allowing the model to learn the best feature representation for your particular problem. So we are hyped about this, we train the model, we test it on the test data... and it doesn't work. Okay, so are we wasting our time here? What's going on? What's the problem? Take a look at this again: there's a particular block that is doing an operation that kind of violates the way language works. Can you spot it? This one. You are computing the average of all the words. Think a little bit about this. What you're telling the model here is: you can get the meaning of each one of your words, you compute the average of all of those meanings, and that's the meaning of the whole sentence. And language doesn't really work like that. To give you a more precise example, this model cannot tell the difference between these two sentences, because it doesn't take order into account. It's just computing the average, and the average is an operation that does not take order into account. So we need something better. What can we do? Well, we can use what I will call a mixing model.

We have the embedding layer, which tells us which vector best represents each word. But now, instead of computing the average of all the embeddings, we add a mixing model. The mixing model is an operation, which I will discuss in a while, that computes some kind of merge of all these vectors, but in a way that really takes into account the order in which these tokens, these words, were written. Still, after this mixing model, you will again have a fixed-length feature vector for your whole document, and after that you can add the usual neural network layers. Usually, what you will use for the mixing block is a recurrent layer. Maybe you have heard about LSTM layers, which were quite popular in recent years; now what most people are using are GRU layers. The key idea of these layers is that they compute a mixture of your embeddings step by step. They take the embedding for the first word and mix it with the embedding for the second word, and you get a new mixture. Then you take that mixture and mix it with the embedding for the third word. Then that new mixture is mixed with the fourth word, and so on. If you do this step by step, it looks more similar to the way we work with language: we read from left to right, right? So that's the idea. And again, you can do this in Keras. In particular, a nice way of implementing the mixing model is by combining two layers. First we define a GRU layer with a bidirectional wrapper. Bidirectional essentially means that the network will read the sentence two times: first from left to right and then from right to left. After this mixing, you get a new embedding vector for each one of the words in the input sentence, and each one of these vectors is now a mixture of the word and those that appear on its right and those that appear on its left. Okay, that's the idea. And after that, you again compute the average. The difference with the previous architecture is that before we just computed the average straight away, and now we are allowing these embedding vectors to mix a little bit among themselves before computing the average, okay? So again, the code is down there. It looks very similar to what I showed before. The only big difference is that now we have this GRU layer here, with the bidirectional wrapper to make it work both ways, and after that we have the global average pooling that we had before, okay?
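A sketch of what that change looks like in Keras, under the same assumed sizes as before:

```python
from tensorflow.keras import Sequential, layers

vocab_size, embedding_dim, max_len = 20000, 100, 200  # assumed sizes

model = Sequential([
    layers.Embedding(vocab_size, embedding_dim, input_length=max_len),
    # Read the sentence both left-to-right and right-to-left,
    # producing one contextualized vector per word
    layers.Bidirectional(layers.GRU(64, return_sequences=True)),
    layers.GlobalAveragePooling1D(),  # average the mixed embeddings
    layers.Dense(64, activation="relu"),
    layers.Dense(6, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```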
Now, again, we train this model with the data, we try it on the test data, and now it works. Great. But there's a subtle question here, which is: why is it working now? Is it because the mixing model is great? But the embeddings were also a very nice idea, and they didn't seem to work out. So are the embeddings really helping? Maybe if we just replaced the embeddings with bag-of-words feature generation, it would work just as well? To answer this question, we have to take a step back and think a little bit more about the general framework. So let me tell you a little bit about language models. This is a completely different problem, not related to the toxicity problem I have been talking about; it's a more general way of thinking about language. A language model is some kind of function, some kind of model, a neural network, whatever, to which you can provide a sentence, and it will tell you the probability of that sentence appearing in the language.

So if you give an English language model a sentence like "a black dog", it will tell you that this sentence is quite plausible: it has a chance of, I don't know, 75% of appearing. If you give it a random string, like "blah blah blah asdf", it will tell you this has very low probability. Well, let's try to train such a model. There are two issues when training a model: first, you have to get hold of a training dataset, and then you have to decide which kind of model you are going to use, a neural network, whatever. Actually, the first problem is very easy to solve, because you can train on everything, absolutely everything. For your training data here, you just need real examples of language use, that's it. You don't need labeled data, you just need sentences that were written by real people. So for instance, you can download the whole Wikipedia, or Twitter, or even use Common Crawl, which is essentially a backup of the internet. You can train with the whole internet; you can do that. Great, we have an unlimited amount of data. Now, how do we train the model? We can use the same ideas we have been discussing so far, and here the key trick is to use a probability decomposition. I said a language model is some model that computes the probability of a sentence. But the sentence is made of words, and by using the chain rule of probability, we can write this probability as P(w1, ..., wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) ··· P(wn | w1, ..., wn-1). So the probability of the whole sentence is the probability of the first word, times the probability of the second word given that we have already seen the first word, times the probability of the third word given that we have already seen the first and the second words, and so on. And if you look at the problem in this form, it gets interesting, because this looks like a supervised problem. What are you trying to model here? You are trying to predict a word given all the previous words in the sentence. Every time you write on WhatsApp, you're actually doing this: you have seen that all your smartphones try to predict your next word, and that's helpful for writing on your smartphone, right? So that's a language model, okay? And such a model could be implemented like the network I showed before: you take all of your words, tokenize them, compute the embeddings for all of them, compute the average or some other kind of mixing operation, and then the output layer predicts the next word. So we have the model and we have the data. Now, what's the point of doing all this? Well, predicting text is something useful, but that's not what we are here for today. The point is that if you train this model, you are not only training a model for that specific task; you're also learning the embedding layer. That means you are learning which representations are useful for your tokens, for your words. So you are learning a very powerful feature generator for text, based on a huge dataset: the whole internet, okay? Fortunately, some people have already done that for us, like Facebook. Facebook has this project called FastText, in which they trained these language models for 157 different languages, including English, Spanish, Catalan, and so on. So you can go to the FastText website and download the embedding matrix.
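As a minimal sketch, loading one of those downloaded files looks like this. I'm assuming the English cc.en.300.vec file from fasttext.cc; the format is one word per line, followed by its vector components:

```python
import numpy as np

# Load the pre-trained FastText word vectors into a dictionary
embeddings = {}
with open("cc.en.300.vec", encoding="utf-8") as f:
    next(f)  # the first line holds the vocabulary size and the dimension
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

print(embeddings["dog"].shape)  # (300,)
```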
Now, how do you use this? We are going to use it for transfer learning, and this means we will keep the same architecture we used before. We have the embedding layer, the mixing layers and so on, but now we are not going to learn the embedding layer: we are going to initialize it with the embeddings we downloaded from FastText. That means the feature generation part of the problem is already solved for us. And we will freeze this layer in the neural network. Freezing means that we are not going to backpropagate through the whole network; we are just going to backpropagate through a few layers of it, okay? In practice, this means we will only learn the mixing model and the output part of the model, not the feature generation. This has two advantages. The first one is that training is cheaper and faster: we don't need to backpropagate through so many layers. And we are also avoiding a great risk of overfitting, because the embedding layer has a lot of parameters; if we try to learn it with little data, and we do have little data, then we are probably not going to generalize. So this is helping us in a lot of ways.
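In Keras, freezing boils down to a couple of arguments on the embedding layer. A sketch, assuming an embedding_matrix that you have assembled yourself from the FastText vectors, one row per word in your tokenizer's vocabulary:

```python
import numpy as np
from tensorflow.keras import Sequential, layers

vocab_size, max_len = 20000, 200  # assumed sizes, as before
# In practice: rows copied from the FastText file for the words in your vocabulary
embedding_matrix = np.zeros((vocab_size, 300))

model = Sequential([
    layers.Embedding(
        vocab_size, 300,              # FastText vectors are 300-dimensional
        weights=[embedding_matrix],   # initialize from the downloaded embeddings
        trainable=False,              # freeze: no backpropagation into this layer
        input_length=max_len,
    ),
    layers.Bidirectional(layers.GRU(64, return_sequences=True)),  # the mixing model
    layers.GlobalAveragePooling1D(),
    layers.Dense(6, activation="sigmoid"),
])
```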
And actually, if we try this, it works great. You can see that I have added two new bars here. The first one still uses the average for the mixing model, but instead of learning the embeddings, it uses the FastText embeddings straight away, and this alone gives you a huge leap. And when I use the mixing model on top, I still see that advantage. So we have two sources of improvement here: transferring the embeddings learned by another model, and using the mixing model. Okay, so that's nice. But as I said, this embeddings idea was already around three years ago, and I'm here to talk about new things. So what are the new things? Well, if we were able to transfer the feature generation part, the embedding part, then maybe we can also transfer the mixing model. Because what this mixing model is doing, essentially, is mixing the words in a way that makes sense in your language, and that should be a general skill, independent of the particular problem you're trying to solve, okay? Can we do that? Yes, we can, but we need a better mixing model. So what's that model? We are going to use a model called self-attention. Let me show you first how it's used, and then I will show you how it works. The idea is that we will have several layers of self-attention. This model takes all the embeddings for the words in your document and performs a mixing between those embeddings. After that self-attention layer, you still have an embedding vector for each one of those words, but each embedding vector now has some information from its neighbors, so there is some kind of context in there. If you repeat this pattern again, you get more and more context. So after a few layers of self-attention, you have vectors that represent not only each word, but also the context in which that word appears. And that's a very powerful idea, okay? Great, but what's inside this box? What does self-attention do? Well, what I always like to say is that self-attention is like Tinder, believe it or not, but for words, okay? The idea is as follows.

We have our input words, we have an embedding for each one of them, and the self-attention layer has three groups of parameters: the queries matrix, the keys matrix, and the values matrix. The first thing we do is multiply the queries matrix by each one of these input embeddings, and with this we get what are called the query vectors. Now think about Tinder again. The query vectors are what you're looking for. You go to Tinder and say, I'm looking for this, okay? That's the query vector: what each one of the words is looking for. Then we have the keys matrix, which again gets multiplied by each one of the input embeddings, and you get the key vectors for each one of the words. The keys mean what each word is able to offer: see, this is what I can offer, okay? Right, if you have used Tinder, you know what I'm talking about. So you have what you're looking for and what you can offer, and then you also perform a similar operation to compute the value vectors. These value vectors are what the combinations of words will actually be made of. Let me show you a very specific example and you will see how this works. For one of the words, say "dog", you take its query vector and compute the product with the key vectors of all the other words. With this you get a matching score, let's say: if these vectors are very aligned, you get a match; if they are not, then forget about it. So you get some scores, and you apply some normalization through a softmax layer, so that you get a set of scores that sum up to one. Then you multiply these scores by the value vectors, the other set of vectors I told you about. So essentially, what you're doing here is computing a weighted combination of all the words in your document, where the weights are computed automatically depending on how the different words match: for matching words, the scores will be very high. And again, the cool thing about all of this is that, at the end of the day, this self-attention layer gives you a vector that contains the context; the way this context is computed depends on these parameters here; and these parameters are part of a differentiable function. So once again, the neural network can learn how the words should be matched to produce a contextualized embedding that makes sense for your problem. If we use these kinds of layers, we can build what is called a new breed of neural networks, generally named transformers. But not these ones, actually these ones. So this neural network block is called the transformer. It has a lot of stuff inside, but the main idea is that you have a self-attention layer, and you can already see that you get, as input, one embedding for each one of the words in your text; they get mixed here; and as output, you get the same number of embeddings, but now these embeddings do have context. And if we make a neural network out of these blocks, we arrive at what is called BERT, which is the method that is nowadays at the top of fashion, I would say. BERT stands for Bidirectional Encoder Representations from Transformers, but it's actually nothing but a stack of these transformer layers. So you can see that this is already a mixing model: you are stacking different self-attention layers that produce better and better embeddings. There is also a larger version with 24 layers, but the key idea is that this is a stack of transformers, okay?
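Putting the Tinder metaphor into code, here is what a single self-attention layer looks like in its usual scaled dot-product form. This is a NumPy sketch of the core idea; real implementations add multiple heads and other details:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n_words, dim) input embeddings; Wq, Wk, Wv: the layer's parameters."""
    Q = X @ Wq  # query vectors: what each word is looking for
    K = X @ Wk  # key vectors: what each word has to offer
    V = X @ Wv  # value vectors: what each word contributes to the mixture
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # matching scores, rows sum to 1
    return scores @ V  # weighted combination: one contextualized embedding per word

# Example: 5 words with 16-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 16)
```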
So how does BERT work, and how do we apply it for transfer learning? Let me just give you the general idea. You first train BERT with a very large dataset from the internet, like we did with the FastText embeddings, and then you fine-tune it for your particular problem, okay? So first, how do we do the pre-training step with the unsupervised data? For instance, you download data from Wikipedia, and what you give BERT is all the tokens in your document, but you replace some of them, at random, with this MASK symbol. And what the model tries to optimize is: well, you have hidden this word here, but I will still try to predict which word is under the mask, okay? This looks very similar to when you are learning a new language and you have those usual exercises in which a word is missing and you have to propose a word there. That's the way this model is trained; it's the same idea. And you will see that there is a special symbol that appears here at the beginning, this CLS symbol. This is added by BERT automatically to all of your sentences, and this symbol will kind of summarize all the information that your sentence has, okay? Well, you train BERT this way, and then you fine-tune it for your particular problem. Fine-tuning means that you input your document through BERT, you take the output embedding that was created for this CLS symbol, which should represent all the information in your text, and then you add a small neural network layer here that solves your problem: that gives you classification, toxicity labels, whatever, okay? And then you have to backpropagate through the whole stack, okay? So now the model we have is: we have the document; we apply the BERT tokenizer, which includes this CLS token; everything goes through BERT, but we take the single embedding that represents everything; and then we have an output layer, a small layer that solves your particular problem. We have to compute backpropagation through all these steps. This is hard to train, because BERT is large, but it is doable. And actually, we have this library, which is quite new, called Transformers, also in Python, and it gives you the ability to make use of these kinds of BERT models. So let me show you a few hints about how to use it. I'll also tell you that there are a lot of BERT models floating around. The original BERT was created about a year ago, but for this talk I used DistilBERT, which is a reduced version of the model, okay? There are different kinds of BERT models here: you have RoBERTa by Facebook, XLM, and also ALBERT by Google. Well, I used DistilBERT for this talk, and what I did is the following. First, how do you tokenize the data using BERT? In this Transformers library, you have this BERT tokenizer, and you can see that when you apply it to some text, you get the tokens represented as some kind of indices. The transformation that the BERT model is doing is this one: you can see that the text is split into small pieces, and it adds this special CLS token. You can also compute the contextualized embeddings for all the words in your document, and it works this way. There's a little bit of PyTorch boilerplate here, but essentially, you input a sentence and you get one row, one embedding, one vector of values, for each one of your words.
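With a recent version of the Transformers library, that part looks roughly like this (a sketch using DistilBERT; the example sentence is the one from the beginning of the talk):

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

text = "There's a dog in John's garden"
print(tokenizer.tokenize(text))  # the text split into small sub-word pieces

# Encode the text; this adds the special [CLS] (and [SEP]) tokens for us
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextualized embedding per token: shape (1, n_tokens, 768)
print(outputs.last_hidden_state.shape)
```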
And I would like to say that training is also very easy, like in scikit-learn or Keras, but it's not. Unfortunately, this works the PyTorch way: if you have ever worked with PyTorch, you know that you have to write your own training loop, go through all the batches, compute the backpropagation, and so on. So I won't give you the specifics of how this works here, but again, you have all the details in my GitHub repo.
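Roughly, such a fine-tuning loop has the following shape. This is a sketch, not the exact code from the repo; names like train_loader are assumptions, and you still have to build that DataLoader from the tokenized texts yourself:

```python
import torch
from torch import nn
from transformers import DistilBertModel

class ToxicityClassifier(nn.Module):
    """DistilBERT with a small output head on top of the [CLS] embedding."""
    def __init__(self):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.head = nn.Linear(768, 6)  # one logit per toxicity type

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])  # the embedding of the [CLS] token

model = ToxicityClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.BCEWithLogitsLoss()  # multi-label objective

for epoch in range(4):  # a few epochs of fine-tuning are enough
    for batch in train_loader:  # train_loader: your DataLoader of tokenized batches
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"])
        loss = loss_fn(logits, batch["labels"].float())
        loss.backward()  # backpropagate through the whole stack, BERT included
        optimizer.step()
```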
So I will just show you the results, and it works very well, really, really well. The only thing I did here, again, is that I'm not only transferring the embeddings, I'm also transferring the mixing model, which in this case is BERT, and just by doing a little bit of fine-tuning, and by a little bit I mean just running backpropagation for a few epochs, you already get this result. Well, maybe some of you are thinking: you know, from here to here there wasn't really a very large change, and you have been talking about mixing models and so on for 15 minutes. But notice that we are very close to perfect accuracy here, so this improvement is not trivial, right? We can get a lot of value from these pre-trained models; even with very, very little training data, we can adapt these very large models to our setting. So what's the key takeaway from this talk? Well, the key takeaway is that in life you need to be lazy: the less you do, the more you get. Well, no, not really. The key idea is that you should use pre-trained models, because you don't want your model to learn how your whole language works just from your very small training dataset. You'd better get a large pre-trained model that has learned how your language really works, and then fine-tune it a little bit with your data so that it fulfills your specific task, okay? So my main goal here was to give you this message, and also to get you hyped, so you go straight to my GitHub repo, test these ideas, and see that they really work, okay? I hope I have piqued your interest, and if you have any questions, we still have two minutes for that. So thank you for everything.

Okay, thank you, Alvaro. Like you said, we have time for maybe one or two questions. So if there are any questions, please raise your hand and the host will come around with a microphone.

Very nice talk, Alvaro. Question: does BERT support multilingual scenarios, if my text comes from different languages?

Okay, that's a very nice question. So most of these language models work only in English. There is also a BERT model for German and Chinese, I think, and there is a particular multilingual BERT model, but it doesn't work very well, at least in my experience. I have been trying to use that multilingual model for Spanish, and it works so-so. So if you want to apply this to a multilingual problem, maybe you will have to gather a very large dataset of those different languages, pre-train BERT with that, and then fine-tune it for your specific problem.

Because I'm aware that Google, for example, just released MUSE, the Multilingual Universal Sentence Encoder, which does the embedding part in such a way that you get the same embedding for a sentence in English as in French, for example. How would that fit into the workflow, so that we could use this information together with the BERT architecture to improve things?

Okay, so when you're working with different languages, it's not so easy, because the way these things work is that you start working at the token level, the word level, small pieces; you build embeddings for those and then you try to mix them. And the problem is that a small word might mean different things in different languages. So it's true that there are models that give you an embedding that seems to be the same across different languages, but usually these models do not do as well as building a specific model for a specific language. So you can try that, but I still think it would be better to build a model for your particular language. I don't know if I've answered your question, but if you need more detail, maybe we can talk later.

Any other question? Okay, thank you very much, Alvaro. A round of applause for Alvaro. Thank you.