Hello everyone, my name is Vincent. Just to get a quick confirmation though, people can see this screen, this is the purple thing? Yes we can, yeah. Okay, perfect, perfect, perfect, everything works well. My name is Vincent, I work as a research advocate at Rasa. Essentially my job, by and large, is to make sure that whatever our lovely team of researchers comes up with is relatively well understood by our developers, and that we, you know, contribute a little bit of knowledge to the whole NLP realm. So there's this new technology, you might have heard of it, it's called a transformer. And the goal of this talk is to give you a somewhat intuitive explanation of how transformers work, why they work and why it matters. And I also want to demonstrate how, in two core areas of our chatbot technology, we're actually making use of this quite a bit. So I hope that's going to be interesting to some of you. Before I can go in depth into, you know, how this technology works, let's first discuss the problem that we have at hand. So at Rasa we like to provide you with the technology to build your own digital assistant. So that could be a chatbot, it can be whatever you'd like, but it means that we have this conversation with an end user. And the conversation could go something like: hello, and then there's a sort of digital assistant here that says hello back. Then we have a user that says hey, I would like to buy a pizza. Then there's this reply asking what kind of pizza. Then the conversation could get interrupted, because a user could ask: by the way, are you a human? And we think it's ethical to never pretend that we're human. So it's really important that we first reply: no, I'm not a human, I'm a bot. But then we would like the chatbot to automatically pick up the conversation where we left it. That's one of the scenarios that you might have.
And there's actually quite a lot of things happening in just this short conversation if you think about all the moving parts and things that we would like to detect. So every time that a user speaks over here, I could argue that's an intent. Every utterance suggests that a user wants something, and it's for us then to figure out what to reply back. But it's not just that one such utterance is an intent. I mean, you could say that's a classification, but it's also the case that inside of the text there's a specific bit of subtext that has a lot of meaning. So for the intent that we have over here, for example, the person wants to buy something, and what does the person want to buy? Well, a pizza. But it could also be that the person wanted to buy a burger or a different type of product, and we really need to pick up this entity in order to get the best reply here. So we have this interesting mapping from intents and entities to actions that the digital assistant should take. And note that I've drawn a couple of actions here: the act of saying hello is an action. But of course there's also an action happening right after here, because the chatbot also has to recognize when it should listen and when it should speak. So listening is also an action that we have to detect. And this is basically one of the core problems that we have. We want to provide you with essentially Lego bricks or building blocks to help you navigate this sort of problem. So that means that we have models to detect these intents, we have models to detect these entities, and we have models to detect these actions. And all these things, by the way, are super open source, so you can check them out. But we are interested in techniques that can do this accurately. And not just in English, by the way. We really want to have systems that work well for other languages too. So Chinese, Zulu: we really want our system to be broadly applicable.
Now the interesting thing about the problem that I've just described to you is that both problems sort of have this sequence aspect, if you think about it. If I see text appear in a sentence, then you could say, ah, it's kind of like a sequence. There's a first word, a second word, all the way up to a fifth one. And if we look at the dialogue that's happening, then you can also argue there's something of a sequence happening here. We've got this text at time step one. We've got this text at time step two. That's a sequence over time. So the way that we're going to deal with these text problems is going to be somewhat related to time series. And here's where I'm going to try to at least intuitively explain to you what a transformer is and what it does. In order to get there, what I'm going to do is make this one analogy. Let's say I've got this time series. This is something that's happening over time. And what I could do is I could sort of move this filter over this time series. And you can see, as the filter moves, at one point in time there's going to be more attention in some places than in others. So right now there'll be a lot of attention like that. So if I look at the screen right now, at this point in time you would get a lot of attention on these points. And let's say at this point of time you get a lot of attention around these points. And this is a technique that you can use to maybe denoise the original data. It's a filtering technique. And you can say, well, let's have a wide attention span, or let's have a thin attention span. But the idea is that if this were a time series, right, we're going to come back to text, but just one aspect of this is: I start out with this time series of dots. And then I have this, I will argue, attention mechanism on how to smooth this out. And you could argue that this thick red line that I have on top here, that's a more contextual representation of the original data that I started with.
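To make that filter concrete, here's a minimal sketch of this kind of kernel smoother (my own toy example, not code from the talk; the Gaussian weights play the role of the moving attention span):

```python
import numpy as np

def kernel_smooth(y, bandwidth=3.0):
    """At each index i, take a weighted mean of all points, with weights
    that decay with distance from i -- the moving 'attention span'."""
    idx = np.arange(len(y))
    out = np.empty(len(y))
    for i in range(len(y)):
        w = np.exp(-0.5 * ((idx - i) / bandwidth) ** 2)  # Gaussian weights centred on i
        w /= w.sum()                                     # normalize so weights sum to 1
        out[i] = w @ y                                   # weighted mean = smoothed value
    return out

# a noisy sine wave as the toy time series
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 50)
noisy = np.sin(t) + rng.normal(0.0, 0.3, size=t.size)
smooth = kernel_smooth(noisy)
```

Widening `bandwidth` gives the wide attention span; shrinking it gives the thin one.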
There's less noise in my data by applying this filter. And that might be more useful for whatever task I have later. So this filtering, this preprocessing step, that is something that can actually be really, really useful. The question is, can we do something really similar for text? Because if we think back to the original problem, the text sequences that we have in the chatbot setting, those are also sequences. So let's sort of pick apart everything that's happening. In this current example, we can say that, well, let's say that there's this one dot over here. And let's say there's some dot that I'll call x_i, because we're at index i over here. As far as that x_i is concerned, we can say that this arc over here describes where we are going to spend our attention. And as far as this point is concerned, we're not going to put any effort into points that are over here, because those are super far away. We don't have to care about any of that. So by understanding, for this time series, how we should place our attention span, we allow ourselves the opportunity to come up with this more contextualized line, because it allows us to reweigh the original points. This filtering step is effectively taking a weighted mean. All the points that are closer to the point in question will get weighted more than points that are further away. So it's a weighted mean that we're doing here, actually. And I would love to do this weighted mean, but when I'm going to do it for text, I have to be really careful, because maybe I cannot make this proximity argument there. Now, if you're a little bit familiar with natural language processing techniques, you can imagine that we might have, let's say, a token in a sentence here. So I've got bank of the river. And I've got this other sequence of text, the money on the bank. And we have these four tokens that correspond to these texts.
But you might remember that we have these things called word embeddings, or vector representations for words. And let's say that I'm interested in doing the reweighing trick based on the vector that I have over here. Well, then, I hope you might agree that in order to get the most context for the word bank over here, proximity is just not going to cut it. If I were to think, hey, this token bank, what other word in the sentence is giving it a lot of context, then it should be this one token over here called river. That one word river is going to give me a lot more information about this word bank. And same here: if I want to understand what the word bank means here, then I would really like it if some of the attention came from money instead. Not the word 'the', though; the word 'the' might be more proximate, but it's not something that gives me extra meaning. So we're going to apply a mathematical trick to go about this instead. But I hope that you still have in your mind that, yeah, we're trying to get this attention mechanism to reweigh all of these vectors. That's kind of the idea. Now, as luck would have it, there's this mathematical coincidence. Let's say that this vector and this vector are similar in some way. If you have pre-trained word embeddings, you might just take the dot product between the two vectors. And if that number is relatively big, then that's an argument that you could say, well, they might be more similar. In practice, this seems to hold. And you can imagine that, if you look at the word vectors for these stop words, since they don't carry a lot of information, there's probably not going to be a huge overlap in the dot product of bank and of, but there might actually be a lot of overlap between bank and river.
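As a toy illustration of that dot-product trick, with made-up three-dimensional vectors standing in for real pre-trained embeddings:

```python
import numpy as np

# Invented toy vectors -- in practice these would come from pre-trained
# embeddings such as GloVe or fastText.
emb = {
    "bank":  np.array([0.8, 0.1, 0.6]),
    "of":    np.array([0.1, 0.9, 0.1]),
    "the":   np.array([0.1, 0.8, 0.2]),
    "river": np.array([0.9, 0.0, 0.7]),
}

# The dot product with "river" is much larger than with the stop words,
# so "river" would get most of the attention for "bank".
for word in ["of", "the", "river"]:
    print(word, float(emb["bank"] @ emb[word]))
```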
So that means that instead of using the proximity that we used in the time series to say where we should put our attention, we might be able to do something with the word embeddings that we already start out with. So then you might get an attention mechanism, so to say, that kind of looks like this. And this is a rough sketch, of course, but let's say that I do this dot product between v1 and v1, v1 and v2, v1 and v3, v1 and v4; then this might be an okay sort of distribution of my attention for this token. As far as bank is concerned: sure, the token bank is important, but of and the are not important, and river probably is important. And the way that I derive this is through this dot product with the other word embeddings. And note that we have a very similar thing over here. The main difference, though, is that here I'm looking to project everything onto v1, and here I'm looking to project everything onto v4. But it's the same calculation by and large. It's just that this one is at the end of the sentence and this one is at the beginning of the sentence. So if we have pretty proper word embeddings, right, this can be an okay way to do some reweighing. But before we do the actual reweighing, one thing we should remember is that if we take the height over here, and add the height over here, and over here, and over here, then we might get a number that's not exactly equal to one. And if you're going to do some reweighing, I mean, it's kind of nice if all your weights sum up to one, like a probability distribution. So what we'll do is a normalization step after this. And what we can then say is, well, we've got these reweighing weights now. Because it's normalized, these are the weights that I could go ahead and use for my attention. And the way that it would sort of work, and I apologize for the math symbols, but I'll explain what they mean.
But the idea could be, if I start out with this set of word embeddings, then I'm going to do all the dot products, then I'm going to normalize everything so I have these weights. And then I can say, okay, this is the word vector one that I start with. Then I'm going to multiply this with this, this with this, this with this, and this with this. And then add all of that together to get this more contextualized word vector out. Effectively, what these weights mean is how much should this vector listen to itself, its neighbor, or maybe all the way at the end of the sentence. So this is the new reweighed word vector that should potentially have more context. It's like that denoising thing that we did with the time series in the beginning. And you can do the same thing for, if you can do it for v1, you can also do it for v2, v3, or what I've got here, v4. It's just reweighing that we're still doing. And if I were to make a comparison to what we saw before, we have some sort of token at some point in time, that would be this. Then we have some sort of an attention mechanism. We're using the dot product of word embeddings over here. But that gives us this reweighing factor. And by reweighing all these different points that we already have, then we have this contextualized word embedding. That's this guy. And I hope you appreciate this analogy. It's actually like a very similar filtering technique that we have in the time series domain. But now we just have this dot product as the hack to do something really similar in the language domain. And that is really nice and convenient, again, because in a time series, you can make this assumption that when things are close to each other, they should be related. And you really cannot do that in language. Bank really needs to listen to river if you want to understand what bank means in this sentence. So on the right hand side again, we have reweighing based on time distance. 
But on the left-hand side, we have reweighing based on embedding similarity, so to say. And yes, to do this trick, you will need some sort of word embedding that's already pre-trained. But those are relatively common, so those are available to some degree, I would argue. Okay. So we have bank of the river, money on the bank. And we have this system where we start with some word embeddings. It goes through this, I will argue, attention block, and out comes a vector that's more contextualized. That's what that star means. And I don't just have that for v1, by the way. I also have that for v2, v3, and v4. Those can all just go into that block. This operation here would happen, and then what would come out is v1 star, v2 star, all the way up until v4 star. That's kind of the idea. And because we have that block that I just drew, maybe we can put this in a neural network as well. We might have all the operations that we just did defined as a layer. That'd be kind of nice. So I'm going to take one sip of water, so everyone can have a breather. But the idea is that the mathematical operations that we've just described, we can actually rephrase such that we might have proper Keras layers. I know that that's kind of interesting, because then you can do this with your deep learning stack, et cetera. So what I'm going to draw now is this self-attention block. And I'm going to be a little bit diagrammy, but I'm going to repeat all the steps that I did just now. So on the left-hand side, I've got all of these vectors that are coming in. And on the right-hand side, I've got all of my vectors that are more contextualized. That'd be these vectors. Now, these vectors as they're coming in, it's kind of like an array of vectors, right? So that means that when I actually get in here, that's kind of more like a matrix, if you will, than a set of vectors, because a collection of vectors can also be represented as a matrix.
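The whole block, dot products, normalization, weighted sum, fits in a few lines of NumPy (a sketch of my own; I use a softmax for the normalization step, which is the common choice):

```python
import numpy as np

def softmax(x):
    # subtract the row max for numerical stability, then normalize rows to sum to 1
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(V):
    """V has one row per token; out comes one contextualized row per token."""
    scores = V @ V.T           # every pairwise dot product at once
    weights = softmax(scores)  # normalize each row into attention weights
    return weights @ V         # weighted means of the original vectors

# same toy vectors as before: bank, of, the, river (invented for illustration)
V = np.array([[0.8, 0.1, 0.6],
              [0.1, 0.9, 0.1],
              [0.1, 0.8, 0.2],
              [0.9, 0.0, 0.7]])
V_star = self_attention(V)
```

Because `V` is a matrix, the whole thing is just matrix multiplications plus a normalization, which is exactly what makes it expressible as a layer.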
So that means that it would also hold here on the outside. So you might argue, hey, maybe they're not sets of tensors. Maybe it's like one giant tensor that's two-dimensional. Well, okay, then the first thing that needs to happen is we're first going to do a matrix multiplication, right? So like we had before, we had the v1 times v1 thing, and we had the v1 times v2 thing. All that stuff is happening here. After that, we had this normalization step, so that we would have these weights in between, right? So those were the normalized weights. And then we would, again, do that multiplication thing down below, because we would do something like v1 times w_i1 plus v2 times w_i2. And you would do that for all of these vectors and add it all together. And that's how you would get this contextualized guy over here. And I'm not going to do the whole formal math thing. That's going to be a bit boring for a Python talk. But I do hope that you appreciate that, yeah, these are just matrix multiplications and some normalization steps. And a matrix multiplication is essentially a layer inside of a deep learning framework. So we're in the green here as far as implementation possibilities go. So that's kind of nice. But if I now start thinking about this, right? This system is pretty cool. We have this attention mechanism that's already kind of useful. But we should maybe think about the properties of what we've got here, because this is not a standard neural network layer. You'll notice that if I'm doing multiplication here, that's great, but I'm not learning anything yet. There are no weights in this matrix multiplication. If you think about a dense layer in a neural network, then there are weights that you're going to train. There's a label somewhere. You get your gradient update. But currently, there are no weights at all in this entire system.
So that means that the vectors that I'm going to get out here, they're going to be really general. They're effectively not trained toward a certain task. And, you know, I can imagine that if you have a label, like you're doing entity detection, or you're doing translation, or you're doing whatever with this neural network, you still want this attention mechanism to be able to learn and to specialize toward a certain task. So roughly what you could do, and I'm definitely skipping a couple of steps here because I want to keep it on the intuition level, but one thing that you could do is say, well, let's move some stuff around inside of this schema. And let's put a couple of neural network layers there. Now, the idea is that these can still be regarded as matrix multiplications, but these matrix multiplications will be multiplied with weights that we're going to learn. The matrix multiplication that we had here was a multiplication between this matrix and itself. These layers will effectively say, well, that's a multiplication between these vectors and some other set of weights that we're going to be training. That's a pattern that we're going to have to learn. And the idea essentially is: if I get some sort of gradient update, at some point, all of this stuff is going to be connected to some sort of label. And if I've got a label, that means that I have a gradient update that tells me how well I've been doing. But because I now have those gradient updates, these are going to travel through this system and eventually they're going to update these layers, so to say. That's kind of the idea. And you can extend this idea as well. But the main idea now is: we've just found a couple of places where we can put these extra layers, so we can also specialize this attention mechanism towards a certain task.
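A rough sketch of where those trainable weights can go, using the standard query/key/value formulation (the names, sizes and scaling here are the conventional ones from the literature, not specific to any one implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)
dim = 4

# These three matrices are the trainable part: in a deep learning
# framework they would be Dense-layer kernels, nudged by every gradient
# update that flows back from the label.
Wq = rng.normal(size=(dim, dim)) * 0.1
Wk = rng.normal(size=(dim, dim)) * 0.1
Wv = rng.normal(size=(dim, dim)) * 0.1

def learned_self_attention(V):
    Q, K, Val = V @ Wq, V @ Wk, V @ Wv         # learned projections of the embeddings
    weights = softmax(Q @ K.T / np.sqrt(dim))  # attention now depends on learned weights
    return weights @ Val

V = rng.normal(size=(5, dim))  # five toy token embeddings
out = learned_self_attention(V)
```

Before training, this behaves much like the weight-free version; once gradients start flowing, the projections let the attention specialize toward the task.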
And I hope it's plausible that if you're doing neural translation, then you want to learn different things than if you're just doing entity detection, for example. So I hope that that makes sense, why we might want to have that. And there are variants of this, I should say, by the way. So one thing that some people really like to do is add lots and lots of extra layers here in parallel. What they would do is call each of these layers a head, and by having a couple of them in parallel, you get this multi-headed attention, which is what people like to call it. But to keep it simple for now, essentially what we're doing is still nothing more than a variant of that time series task. The main thing to remember is that we have word embeddings coming in over here and they're getting contextualized. And these word embeddings over here might be able to focus on different parts. And that's very useful. Again, in the example of bank of the river, money on the bank: bank should mean something different in both of these sentences. That's the idea. Now, to complete the whole story: we're almost, sort of intuitively, there at the transformer, but not quite. So to wrap it up into what a transformer is: in the end, the main component that sits in the middle of a transformer, the main thing that made it slightly different, interesting, and an extra step in the ecosystem, is that attention block. But what people like to do is put an extra dense layer at the end, and maybe a sort of dense layer in the beginning, in front of it. And all of those components together, that's what people like to call a transformer layer. And this, again, I should stress: I'm skipping over a couple of details and there are many variants of this, but at least on an intuitive level, this is it. This is a transformer layer. And again, the idea is that we have text here, the text will be converted to word embeddings.
That's a sequence. And that sequence will at some point be passed through this transformer. And we have something here that's more contextualized. Eventually that'll be hooked up to a label so we can start learning. And that means that this transformer is able to update any of the weights that are in here. And that's what a transformer does when it's learning. So I'm definitely glancing over a couple of details. But in the end, intuitively, we're still doing the same thing as with the time series. Now, if there's extra time, I do have an appendix so we can go a little bit more in depth. But what I kind of want to do now: that was the theoretical part, the part where I was explaining what a transformer sort of is and what it sort of does. In the next bit, what I hope to do is show you how we actually apply these. Because this transformer layer is nice, but we've made some customizations to it over at Rasa to make some models that we think are pretty good at a lot of scenarios for digital assistants. And this is the part where it gets exciting, because we've been able to add some nice customization. We've been able to customize this transformer for our needs. And I think there's a nice lesson there. So one thing that we like about this transformer, if you think about it, and it's a bit of a hand-wavy argument: if you were to say, well, this sounds really familiar, you wouldn't be wrong, because RNNs, these recurrent neural networks, were doing something really similar, right? You could say that also for an RNN, there are word embeddings going in here and word embeddings going out. So why would this be better than that? Well, the answer kind of is: if I didn't have any training data whatsoever, imagine this scenario, I haven't seen any training data yet, then the node that I have over here is, even without any training data, going to be slightly biased to listen to this one and this one.
You're going to need a whole lot of data before the node over here is going to start listening to the node over here. And that's a bit of a bummer. There's already a preference in the sequence for the words that are close to each other, even if it hasn't seen any data yet. And that's different with the transformer. If the transformer hasn't seen any data whatsoever, then the attention is pretty much uniform over the entire sentence, which, if you're thinking about, hey, what's a good starting point to learn from, is a lot better, especially in our scenario, where we typically don't have too much training data. If you're starting out with a digital assistant, you're designing a chatbot, then it's probably going to involve different language than what people are using on Wikipedia. So that means that very often you start from scratch and you're not going to have bajillions of conversations that you can already train on. It's typically maybe tens of conversations when you're starting out. So we like having a system that's relatively robust, that's able to start in a lightweight fashion. And this hand-wavy argument, it is hand-wavy, I agree, but I hope that you intuitively also appreciate here: hey, that transformer is a little bit more flexible in that scenario. And there are also more parallelization options. Anyway, theory; I want to go to practice. So if we look at the original problem that we had, there are two systems that we have in Rasa that use a transformer. The first system is a system that detects these intents and entities. Those are two tasks, and we've built one algorithm to handle both. And the other part of the system that we have is a system that is able to detect what the next best action is. And these two different systems use a transformer in a different way. And I would like to show you how that works. So intents and entities are found by the system that we have called DIET. Now, you can definitely also use scikit-learn.
If you have your own preference for that, that's perfectly fine. But the algorithm that we've designed recently is called DIET, which stands for Dual Intent and Entity Transformer. The idea is that we've built a system that is able to handle both intents and entities. And because it's able to handle both of them at the same time, we have reason to believe that it might also be more accurate at many tasks. So the way it works is: I've got one utterance of a user on the left-hand side, and it's saying play ping pong. That's what the user is sending my way. Now, the way that we encode that internally in Rasa is we say, well, you probably have some sparse features, like a count vectorizer. Just count how often each word appears. And the thing is that we typically generate these features for every single word that we have. And what we also do is we try to have one token that represents the entire sentence. So you could say, well, if this is a one-hot encoded thing, I've got a one, zero, zero here, a zero, one, zero here, and a zero, zero, one here, right? So those could be the counts for the words. And in that scenario, I would have three ones here, because I'm counting all three of the words. And a similar thing would happen with the word embeddings as well. But the idea is: this is the utterance that I get. This is what I start out with. Then what we do is we pass that through one or two feed-forward layers. The main reason for that is to make sure that we have a hyperparameter for the size of what comes out over here. And then what we do is we pass that into a transformer. And the thing that's interesting to note here is we also pass this class token into the transformer. That's as if it's another token. It's something we just throw in there. But that also means, if we look at the same argument we had before, that what I get here is essentially a vector like what comes out over here, but more contextualized.
And we have that over here as well, over here as well, and over here as well. And in a minute, we're going to hear an argument why that might be super powerful. Because what we can do now is since these words are all separate words that could be an entity, right? And in this particular case, you could argue, well, I want to play a video game. That's what the user is telling me. But what video game do I want to play? Well, that's ping pong. Well, what I can then do is I can sort of say, well, I've got this entity layer over here, and there's an entity that should be detected there, namely the game I want to play. There's an entity over here, namely what game do I want to play. And you can imagine that there's this nice little correspondence between the token I have over here and the entity label over here. And it's a very similar thing happening here. So I've got this separate layer for the intent, and I've got this intent play a video game over here. But here comes something that's actually quite amazing. Now let's think about the gradient signal that I get, sort of the feedback if I'm actually taking the system, applying the gradient descent on it and trying to learn patterns from it. Well, then as an example, let's say that I get a gradient update from this intent. And note, in production, I'm also definitely getting a gradient update from these over here. But to keep it simple, let's assume there's a gradient update from this intent over here. I'll be updating this layer. But then we have this attention mechanism happening here. And you could potentially argue that for the intent that you want to play a video game, you know, this word play is going to be a key part of detecting that. If a user types the word play in a commanding tense in the beginning of the sentence, you could plausibly argue, I don't need to know the rest of the sentence, just that word play is already telling me what your intent might be. 
So that might also mean that the gradient update signal that this layer is going to receive and this layer is going to receive, those are probably going to be more fundamental and stronger than the updates that I get over here. Potentially, because these don't necessarily immediately tell us anything about this intent. It might be that you want to play a game or buy a game. And the fact that there's a game in here might certainly still give a boost. But you can imagine the nice thing about this transformer is that it also routes the gradient update in an appropriate way. Now again, hand-wavy argument, definitely aware of that. But to me at least it feels intuitive that that is something that's going to be extremely useful here. Because this isn't just happening with the intents, it is also happening with these entities. And especially if a certain intent has certain entities that occur often together and very infrequently without each other, then that's a pattern that this system can learn. Instead of having two separate models, one for intents and one for entities, we think that by having this transformer in the middle, we might actually be able to handle both. And so far, it seems that DIET is working quite well. One of the nice things about DIET, and that's something I do want to mention: this system will work even if you don't have any word embeddings at your disposal. And this is really important to our users. Because again, English word embeddings are frequently out there, but there are also some languages, like Zulu, where you don't have word embeddings readily available. Now, I'm working on some open source projects to make sure that we also have word embeddings for those. But the nice thing about this system at least is: even if you don't have word embeddings, either because no good ones exist for your language or because they're maybe super heavy, this system will still go out and work and do its best. There's still lots of features that can be learned here.
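A hand-rolled sketch of those sparse count features for play ping pong (toy vocabulary; the `__CLS__` framing is just my illustration of the "one token that represents the entire sentence" idea):

```python
import numpy as np

# Toy vocabulary and utterance, mimicking what a count-vectorizer
# featurizer would produce.
vocab = ["play", "ping", "pong"]
tokens = "play ping pong".split()

# one one-hot count vector per token ...
token_feats = np.array([[1.0 if w == t else 0.0 for w in vocab] for t in tokens])

# ... plus one extra __CLS__ vector for the whole sentence: here simply
# the sum of the token vectors, so every word is counted once (three ones).
cls_feat = token_feats.sum(axis=0, keepdims=True)

features = np.vstack([token_feats, cls_feat])
print(features)
```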
Another thing that's kind of nice is that if you want to go super heavy, say you're a big fan of the new BERT models out there, then we can chuck those in here. And that's kind of nice, because if you have a use case for it, then you can put the super heavy word embeddings in there as well. And this is the nice thing about this model, and I think the transformer helps here: this model can be used in different ways, according to whatever you think is best for your application. And this customizability, that's a super nice feature. So that's a feature, I would argue, of our DIET system that we built on top of the transformer. But this gradient updating that's inside of the transformer, that's kind of a nice milestone. There's something really cool happening inside of that. So this is the base model that we provide inside of Rasa to handle entities as well as intents. And what I'm going to conclude with now is how we also handle actions. But I hope the intuition already has your mind boggling at this point. The thing is, we also have this thing called TED. It's a Transformer Embedding Dialogue system. And this is what we use to handle the actions to take. Now, I'll go over this relatively quickly in the interest of time. But if a user were to say, hey, I'd like a pizza, well, then we have a model called DIET that's going to give us an intent. We have an entity that's detected; we have that from DIET as well. We might also have some long-term information filled in, like the address of the user might be something that we know. And we also know the previous action. And you can argue, well, that's a feature space that I have for an utterance at a single point in time. It'd be nice if we could pass that to a model, and then the model says, oh, this is the best action you should take. But again, we have a sequence here, right? So that means that we can have those feature spaces over time.
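A toy sketch of what such a per-turn feature vector could look like (every intent, entity and action name here is invented for the example; the real featurization in Rasa is richer):

```python
import numpy as np

# Hypothetical vocabularies for one small assistant.
intents = ["greet", "buy_pizza", "ask_bot"]
entities = ["pizza", "burger"]
actions = ["utter_greet", "utter_ask_kind", "action_listen"]

def turn_features(intent, found_entities, prev_action, address_known):
    f = [1.0 if i == intent else 0.0 for i in intents]            # intent one-hot
    f += [1.0 if e in found_entities else 0.0 for e in entities]  # detected entities
    f += [1.0 if a == prev_action else 0.0 for a in actions]      # previous action
    f.append(1.0 if address_known else 0.0)                       # long-term slot info
    return np.array(f)

# Stack the turns over time: a sequence a transformer can attend over,
# even though none of this is text.
dialogue = np.stack([
    turn_features("greet", [], "action_listen", False),
    turn_features("buy_pizza", ["pizza"], "utter_greet", False),
])
```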
And then we can chuck that into a model and have the actions come out, so we have a sequence over here as well. And I hope you can guess what we're actually using inside of this model to facilitate this: another transformer. The one thing that struck me as super interesting here is that you don't have to use word embeddings in order to use a transformer. We're not even dealing with text per se; these are just features that we know at each moment, and it's sparse data. But still, in our experience so far, we're able to use a transformer quite well in situations where a user is interrupting a chatbot and we suddenly have to make the right decision. In the interest of time, I'm going to skip a couple of details about how this transformer works. It's a unidirectional transformer, because we're only allowed to look at the past, not the future. There are a couple of interesting details there, but I just want to do one quick demo so we'll still have a little bit of time for questions.

What I have here is the Rasa configuration file, and I also have some training data where the task is that the chatbot is supposed to count down: it starts from 10 and counts down from there. And I'm asking it not just to count down, but also, if I interrupt it by asking whether it's a chatbot, to be able to recover. So if this is my input and I say "hey, start counting," it starts counting at 10, then I say "okay," but then it starts counting at five, and when I say "okay" again, it's messing up over here. The reason it's messing up is that in this configuration, for this particular TED policy, I'm saying it's only allowed to look one time step back, which is not enough information for the transformer to actually learn a nice pattern.
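One of the details being skipped — the unidirectionality — comes down to a causal attention mask. Here's a small sketch (not Rasa's code; the scores are random stand-ins) of how masking keeps each dialogue turn from attending to future turns:

```python
import numpy as np

# A causal (unidirectional) attention mask: position t may only attend
# to positions <= t, so the dialogue model never peeks at future turns.
T = 4
mask = np.tril(np.ones((T, T), dtype=bool))

scores = np.random.default_rng(0).normal(size=(T, T))
scores = np.where(mask, scores, -np.inf)             # block future positions
scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
attn = np.exp(scores)
attn = attn / attn.sum(axis=1, keepdims=True)

# The first turn can only attend to itself; later turns see the past.
print(np.round(attn, 2))
```

The upper triangle of the attention matrix is exactly zero, which is all "unidirectional" means in practice.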
It gets a whole lot better if I say: your max history is now actually supposed to be three, you're allowed to look back three time steps. Then you notice it's actually not the worst: it responds "okay," and now it can count down. When I say "hey, are you a bot?" it's able to say that it's a bot, not a human, and it starts counting down again. But it's not able to generalize. There's this moment that's not in the training data, and it's not able to look back far enough to pick up the right context. However, if I put the max history at 10, then even if I try to interrupt it at instances that aren't in the training data, it is able to recover. It keeps counting down, and at every instance it still replies that it's not a human, which is a feature we like. The reason I think this is a nice result is that when we tried doing this not with a transformer but with an LSTM-type model, we noticed the LSTM actually has a whole lot of trouble: in this benchmark, you need to give it way more data before it learns that you're still supposed to count down after saying that you're a chatbot and not a human.

Anyway, there are only a couple of minutes left, so very quickly, the one thing I do want to remind everyone of: super fancy machine learning is great, but if you're building your own chatbot, definitely label your own data and learn how users are interacting with it. If this was interesting, know that I'm paid to do this on the Rasa channel on YouTube; I have lots of videos that go more in depth into the material I've discussed today. So if you're interested in this, there's the Algorithm Whiteboard series on YouTube that you can definitely go check out.
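For reference, the knob being toggled in this demo looks roughly like this in a Rasa-style `config.yml` (a sketch — exact keys and defaults vary by version):

```yaml
# Dialogue policies: TED with a larger max_history can look further
# back in the conversation and recover from interruptions.
policies:
  - name: TEDPolicy
    max_history: 10   # 1 in the failing run, 3 in the partially working run
    epochs: 100
```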
And finally, one thing I'd also like to mention: on behalf of Rasa, I've open-sourced and maintain a project called whatlies, which is an interactive visualization tool for many common word embeddings. Some of the features I'm currently implementing are features for detecting bias in word embeddings. So if that's interesting, reach out to me, because these are features I'm building. I'm going on holiday next week, but the week after, I'll be implementing some things that help you detect bias in word embeddings.

Anyway, a bit of a rush at the end, but I hope people found this interesting, and now would be a good time to go ahead and ask me some questions. I've already noticed some people asking what gear I'm using, because people seem to like this pen. I've got a Wacom One tablet, and there's an app for Mac called Screen Brush; every time I hit Alt-Tab, I'm able to just doodle. You can also do this with your mouse — again, the app is called Screen Brush. Come to the chatroom on Discord and I'll share a link and a tutorial. And I'm using a Yeti microphone; it was 140 bucks before corona started.

"So we also have a bunch of questions in the chat." Sure. "All right, which one should I take? All right, I will ask you." Sure, go ahead. "The moving graphs which you made, are they using matplotlib?" Oh, yes, they are. Go to the Discord and I'll send you the link: there's actually a cool project called gif, and it's basically a decorator that you put around a function that renders matplotlib — gif is just the easiest way ever to get pretty GIFs out of matplotlib. But yes, it's matplotlib-based. "Okay, the next one. Really interesting explanation of DIET and TED. What do you have available to QA a predicted label, or to explain a label to a user?"
Oh, that's super tricky in essence, because there's a transformer in the middle, and you can have multiple transformers in sequence. What you can do — and that's what whatlies does — is visualize the vectors going into the transformer; typically, if there are clusters there, that's something that can help inform you. But what Rasa also does is give you an overview of common mistakes that happen often. We mainly try to tell you, "hey, these two intents are often confused with each other," and then you can investigate from there. That's typically the approach we offer.

"Okay, so this is a question from Francesco. You said that RNNs perform worse because they are intrinsically biased toward assuming that words occurring closer together tend to be more related. However, isn't this assumption in most cases more correct than assuming, as a transformer does, that a word in a sentence has equal chances of being related to any other word in the sentence?" Well, that's why I mentioned it's a hand-wavy argument, right? It's not necessarily 100% solid and perfect, and it definitely depends on the use case. For our use case, though, you have to imagine that when you get started with a digital assistant, your first demo is going to be based on 20 really, really short sentences. So at least for our specific use case, that assumption holds just a bit more. You're definitely correct if you say, "okay, but now let's do this for an entire document, and the document is like 20 pages" — sure, then the situation differs and you cannot make this argument anymore. But so far, at least, our results seem to confirm it. That's the main thing I'd like to say about that.

"All right. So this question asks: how does this compare to GPT-2 or the recent GPT-3?" Oh, okay. So I wrote a blog post on that, which I'm going to be spell-checking in like five minutes.
The main thing with GPT-3 — and I hope people take this seriously — is: don't underestimate the sort of bigotry that thing can generate. If you give it sentences like "the man is known for," "the woman is known for," or "the person with black skin is known for," you'll see that it generates all sorts of stereotypes. And yes, it's great that we have a system that can generate text, but that's not the use case that our users have. Our users would like to build a digital assistant that is reliable and predictable, where you can order a pizza and do that kind of thing. We are currently investigating GPT-3 as maybe a way to have a human in the loop and generate more expressive training data; that can be a use case. But in general, I think it's a really bad idea to have something like GPT-3 generate text as if it were a chatbot. I have trouble coming up with a good use case for that, especially in enterprise.

"Okay. Thank you, thank you very much for your talk. And I think this is for you." Sure — happy people liked it. If there are any questions, I'll be on Discord for another 20 minutes or so to send links and stuff. I hope people learned something today. "Yes, thank you very much."