OK, thank you very much. Today I get to talk about something that is not related to my work, so that's more fun. It's an NLP project that I set up myself, and it's about a machine understanding how we speak and trying to mimic the style in which somebody speaks. So let's dive into it. I'll briefly introduce myself, although I was already introduced. I'm a data scientist at Hotelbeds. I'm also a co-organizer of AI Club for Gender Minorities, trying to help people who identify as a gender minority to get into the industry, to learn, and not to be scared. I'm also a member of Python Sprints. Yesterday we had a lovely sprint contributing to open source, which I think is amazing, and I thank the sprints very much for bringing me into the open source world. Everybody should try it out and give themselves a chance.

So why did I do this project? Mostly I was inspired by the news that somebody wrote a Harry Potter fan fiction using AI, and the result is hilarious. Even the title already sounds very funny, something about a large pile of ash. Is that the title? Is it about the ash? So I started investigating to see whether I could do the same thing with deep learning, because at that time I was very interested in deep learning and I wanted to use a neural network to do something cool. Luckily, today I have the chance to show you the result, so I'm glad.

This is a side project, it's for fun, so my approach is very naive and very simple. I just wanted a simple neural network that I can play around with, not something that takes weeks to train; I just want to understand how it works and hopefully get some decent results. So it's very simple: it's only four layers. I don't know if I can call that deep learning, maybe it's shallow learning. But it uses some very core elements of deep learning for NLP tasks. For example, there are word embeddings involved, and there is an LSTM, which is a kind of recurrent neural network; I will explain a little more later. We also need an input layer and an output layer.

For the input layer, we have to encode our paragraphs or sentences, because deep down a machine only works with zeros and ones, so everything has to be numerical. I encode the text by fitting the encoder on all of it; we call that collection a corpus. Fit it in and label each word with a number. That's the encoding. After that, I have to turn it into a vector, so I do a one-hot encoding. For example, if a word is labelled 22, then in its vector everything will be 0 except position 22, which will be 1. That's one-hot encoding, and you can already do something with it. But it's not the best representation because, as you can imagine, a corpus can have a lot of words, so the vectors become very big and the resulting matrix is very sparse. That's not ideal. It also treats each word independently; it doesn't have the magic that word embeddings have, so the logic between the words is lost. I will explain that when we talk about word embeddings. We also have to store the mapping of our encoding, because at the end we want to generate a paragraph: if we predict a number, we want to know which word it represents. So we store that mapping as a dictionary. Then we have the two magic layers, the word embeddings and the LSTM, and at the end we have the output.
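As a rough sketch of the encoding step just described (integer labels plus one-hot vectors, with the reverse mapping kept so a predicted index can be decoded back into a word), here is a minimal example in plain Python and NumPy; the toy sentence and vocabulary are made up for illustration and are not the speaker's actual code:

```python
import numpy as np

corpus = "the king and the queen spoke to the crowd".split()

# Label each unique word with an integer, and keep the reverse mapping
# so a predicted number can be turned back into a word when generating text.
word_to_index = {}
for word in corpus:
    if word not in word_to_index:
        word_to_index[word] = len(word_to_index)
index_to_word = {i: w for w, i in word_to_index.items()}

vocab_size = len(word_to_index)

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index: big and sparse.
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

print(word_to_index)
print(one_hot("queen"))
```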
For the output, I'm not actually predicting just one single word directly. Everything is a vector, so at the end I predict a vector as the result. I use a softmax layer for that, because it gives me a probability distribution as a vector, and with that distribution I can sample a word. Why do it that way? Because when we write, there is creativity, so the text won't be the same every time. Even a word that is unlikely to come next in a sentence still has a chance to appear, just a smaller chance. So I would rather draw the next word from a probability distribution than always predict the single most likely word. That's the setup.

So what are word embeddings? As I said, they are like magic. As you can see from the graph, we place different words in a vector space, which sounds very sci-fi. What's interesting about this space is that we can actually do calculations with words. A very typical example is "king": if you do some arithmetic with the vectors, you get king minus man plus woman equals queen. The demonstration here is in 2D so it's easy to see, but the real space is high-dimensional. And it's not just this one relation; in the same space we can also have Paris minus France plus Germany equals Berlin. So we try to preserve all of this word logic inside the space.

But how can we achieve that? This blew my mind: how do we get from just zeros and ones, from a one-hot encoding, into this embedding space? There are two popular approaches (there are more, but I'll focus on two today). One is GloVe, the other is Word2Vec. I have a slide for each, so let's see what they are.

GloVe stands for Global Vectors for Word Representation; I had to look the name up, I just remember the abbreviation because I think it's beautiful. It's a count-based model. What that means is that when we train it, when we try to find this mapping, we have lots of words; for example, we scrape a corpus from Wikipedia, which is what the pre-trained vectors I use today are built from. You can see the relations between words through their co-occurrence: for example, "teddy" almost always comes together with "bear", so you can tell they are related, they are talking about the toy. From those co-occurrences across the paragraphs we can build the space, because we count how often two words appear together. From that matrix of co-occurrence counts we can map from the original one-hot space into the new embedding space.

Word2Vec, I assume, stands for "word to vector", which is fairly obvious. It's a predictive model. Rather than just counting co-occurrences, we train the weights with a neural network, so it's a neural-network, predictive approach; that's the contrast with GloVe. The first example I'll show today uses this Word2Vec-style approach, where we train the embeddings together with our model.
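As a small illustration of the word arithmetic mentioned above, here is a minimal sketch assuming you already have word vectors loaded as a dictionary; the toy three-dimensional vectors below are made up purely so the example runs on its own (real GloVe or Word2Vec vectors have 100 to 300 dimensions):

```python
import numpy as np

# Toy "embeddings" purely for illustration, not real trained vectors.
vectors = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.0]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 1.0]),
}

def cosine(a, b):
    # Cosine similarity: how close two word vectors point in the space.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(
    (w for w in vectors if w not in ("king", "man", "woman")),
    key=lambda w: cosine(target, vectors[w]),
)
print(best)  # -> "queen"
```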
But in the second example, I will show what happens if we feed in pre-trained GloVe word embeddings, and how that compares to training the embeddings together with the model. I'll show you that later.

Remember we have two magic layers. The second one is the LSTM, which stands for long short-term memory. The idea is that we have short-term memories: we know what we remember and what we forget. For example, you remember the happy things from your childhood, but maybe you forget all the things you learned in school. That's how our memory works, and this network is meant to hold on to that kind of short-term memory for a long time, which is why it's called long short-term memory. It's a recurrent neural network. What that means is that in a normal neural network, every input vector comes in without any relation to the data you fed in before. In a recurrent neural network, when you train on sequential data, the previous items in the sequence are trained together with the current one, so the output has a relationship with the previous data. That's why it's called recurrent. I'll skip some of the details here, because it's hard to go through all the maths in a talk. The general idea is that a long short-term memory has gates that control what to remember and what to forget. That's why it's so powerful: sometimes you have a lot of data and you can't store everything, so this works very well.

OK, let's see if we have time for a demo; I hope we do. No internet connection, why is that? OK, how much time do I have? It's fine, I have time to log back in; this always happens, don't worry. Does it work now? Let me try again. Sorry, do you have any questions? Maybe I can take some questions while I try to log back in. Don't be shy. I need some help getting back on the internet.

Sorry, sorry, we need to repeat the question for the recording, otherwise people won't hear it. So: GloVe and Word2Vec seem similar, but they are obviously calculated in different ways. When do you use one or the other? Is there any kind of wisdom on that?

Yes. With GloVe you have to be careful what you feed in. In my case, what I meant to show you is that I have pre-trained GloVe vectors, which were trained on Wikipedia. I actually have two demo tasks: one is Shakespeare, one is Trump. Wikipedia is written in modern English, so you can't really use those vectors for the Shakespeare task; it won't be as good, because the language is too far apart. Wikipedia isn't exactly how Trump speaks either, but it works better for anything in modern English. So it really depends on your task and whether the pre-trained model you feed in matches it. Sometimes, if you can't find a suitable pre-trained model online, you have to build the embeddings yourself. You can also train GloVe yourself, but I think it's easier if you can just download it, and Stanford provides very good pre-trained vectors for download; that's where I got mine.

So, let's check it out now. This is the Shakespeare one. Let me see if I can run all of it; it should be fast, I hope. But let's look at the other one as well. It's running, which is good; yes, it's training here. Let me walk you through what it does. Is it too small? Let me make it bigger. Oh, now it's too big.
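The notebook itself is shown on screen rather than read out, but the four-layer setup described earlier (encoded input, embedding layer, LSTM, softmax output) can be sketched in Keras roughly like this; the vocabulary size, sequence length, and layer widths are placeholder assumptions, not the speaker's actual settings:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 5000   # assumed cap on the number of words kept from the corpus
seq_length = 40     # assumed number of words used to predict the next one
embed_dim = 100     # embedding size (the Trump demo later uses 200 to match GloVe)

model = Sequential([
    # Integer-encoded words -> dense embedding vectors, trained together
    # with the rest of the model (the "Word2Vec-style" setup).
    Embedding(vocab_size, embed_dim, input_length=seq_length),
    # Recurrent layer that carries context from earlier words in the sequence.
    LSTM(128),
    # Probability distribution over the whole vocabulary for the next word.
    Dense(vocab_size, activation="softmax"),
])
# Targets are one-hot vectors of the next word, matching the encoding above.
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```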
This is not my computer, by the way. OK, is this size OK? Yes? OK. So what I have here are two helper functions, one for Trump and one for Shakespeare, and I only use one of them in each notebook. It's a bad coding habit, but bear with me. Here we load in the plays for Shakespeare. If I scroll a little more you can see it: it's written in very Shakespearean English, very elegant and so on. I fed in five plays, I think, with both comedies and tragedies, because I wanted it balanced; I don't want it to always be sad. So I put some of Shakespeare in, not every single word he wrote. You also have to set a maximum number of words, because you aren't going to capture everything if it's a big corpus, so you set a limit, and words that aren't frequent enough to be represented just become an "unknown" token or something like that.

Let's fast forward. The model is the one I showed you before. There are a lot of parameters, and I only trained it for one epoch, because this recurrent neural network is a small one. I tried it with two epochs, but training for one more epoch doesn't really improve the result, so that's why I stopped there. Also, when I generate a paragraph, as I said, there is a sampling function that samples from the softmax output, the predicted probability of the next word. You can also tune it: you can give it more noise, so it has a higher chance of picking something that is less likely to be the next word.

I think this is the most interesting part, the result. Let's see how it goes. Is anybody here a Shakespeare expert? No? So, do you think this is a good paragraph? English is not my first language and I didn't study Shakespeare at school, so it looks all right to me. Let's see if the other one is easier to understand. Basically everything is the same here, so I won't walk through it again, but I'll show you the result. This is what I fed in: it's all scraped from Trump's speeches, which somebody recorded and wrote down, because when you're president and give a speech there is a transcript, so you can get access to it. We build the same network as in the slides. There is one minor difference from the previous one: I changed the embedding size to 200, because the GloVe embeddings are also 200-dimensional, and I want them to match for the comparison. It doesn't actually affect things that much, but we'll see.

So this is the Word2Vec-style approach; this is what it generated. You can see it's still a bit nonsensical, but there is some logic inside. Let me see if I can find an example... yes, for example this "sad ones believer" bit. It follows a certain word-level logic: for example, a noun tends to be followed by a verb, and so on. But it still reads like a crazy man's speech. I mean, who is the crazy one? I can't tell.
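For reference, the sampling step described above, drawing the next word from the softmax distribution with a knob for extra noise, can be sketched like this; the temperature parameter and the NumPy implementation are my assumptions about the approach, not the speaker's exact code:

```python
import numpy as np

def sample_next_word(probs, temperature=1.0):
    """Draw a word index from the softmax output instead of taking the argmax.

    temperature > 1 flattens the distribution (more "noise", more surprising
    words); temperature < 1 sharpens it towards the most likely word.
    """
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-10) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return np.random.choice(len(probs), p=probs)

# Even unlikely words keep a small chance of being picked, which is where
# the "creativity" of the generated text comes from.
softmax_output = [0.70, 0.20, 0.05, 0.05]
print(sample_next_word(softmax_output, temperature=1.2))
```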
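The comparison that follows feeds Stanford's pre-trained GloVe vectors into the model instead of training the embedding layer. A rough sketch of how that is typically wired up with a frozen Keras embedding layer is below; the file name, vocabulary cap, and the `word_to_index` mapping from the earlier encoding sketch are all assumptions, not the speaker's actual code:

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

embed_dim = 200     # matches the 200-dimensional GloVe vectors mentioned in the talk
vocab_size = 5000   # assumed vocabulary cap, same placeholder as before

# Parse a downloaded Stanford GloVe file: one word and its vector per line.
glove_index = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:   # assumed file name
    for line in f:
        parts = line.split()
        glove_index[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Row i of the matrix is the GloVe vector of the word with index i;
# words missing from GloVe stay as zero rows.
embedding_matrix = np.zeros((vocab_size, embed_dim))
for word, i in word_to_index.items():   # word_to_index from the encoding sketch
    if i < vocab_size and word in glove_index:
        embedding_matrix[i] = glove_index[word]

# Frozen embedding layer: fewer parameters to train, as noted in the comparison.
glove_embedding = Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,
)
```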
Compared to that, one thing I want to point out about GloVe is that it's actually faster to train with the pre-trained GloVe embeddings, which is expected: we don't have to train the word embedding layer, so there are fewer parameters to train. But the loss is actually quite a bit higher. What I'm trying to say is that it's not as good as having embeddings trained on my own corpus, because the pre-trained ones only come from Wikipedia, which is not exactly the text I used to train the model. That sounds bad, but it can also be an advantage. What if, for example, Trump had only given one or two speeches that I could get transcripts for? Then I wouldn't have enough words to train my own representation, and downloading pre-trained word embeddings built from Wikipedia, which has loads of words, might be better than training my own on too little data.

So this is the GloVe result. I tried to fix the seed so both models generate from the same starting point, to compare the two: which result is better, which one do I like more? This one is different; I would say I like the Word2Vec one more. This one also, you see, has things like "demonstration fans" and "Brooker"; it similarly has some word-level logic in it, but it's still a bit crazy. So this is my side project, this is the fun I had, and this is the result. I'll let you ask questions and then we can have lunch.

That was really interesting, thank you very much. Looking at all of the examples, one of the things they kind of miss is the end of a sentence. I guess it's very easy to have a model that takes one word and then finds another word after it that makes sense; it's harder to have a higher-level view, like knowing where a sentence ends. One of the very distinct things about Shakespeare is that he has iambic pentameter: he has a line and then a new line. But with yours, I guess it's quite hard to find the point where you end a sentence.

That's good thinking. Actually, you can include, as I said before, a token for words the model doesn't know, an "unknown" token that represents all the unknown words, and in the same way you can also have an end-of-line token. Then you can train your model to know when to stop the line. I didn't do it here, but it's doable, and it's one way to improve this project. It's still at the micro level; it's not a high-level view of "this is a sentence", it's just that after this word comes the end of the sentence. Basically, the full stop becomes another token. Because the end of a sentence is treated as one of the words, the model would learn, for example, that every time I say "thank you" there's a very high chance an end-of-line follows, so you can stop the sentence there. So yes, it's possible.

Hi, thanks very much for the talk. Just following on from the last question, you mentioned iambic pentameter. This is one of the things we can spot Shakespeare by: the number of syllables in a line. Have you integrated neural networks to look at syllable recognition as part of this?

For this project, no. As I said, it's a very naive, simple, fun model.
But that's a very good inspiration. I think it's something that interests a lot of researchers, and it's a good point.

Do you have the code for this up somewhere?

Yes, it's on my GitHub. Let me see if I can show you... no, sorry. I'm promoting myself now: if you follow me on Twitter, I'll upload the slides once I finalize them, and there's a link to the demo notebook, so you can go to my GitHub and find the code there.

Have you played around with different ways of generating text? Have you compared the output with something like Markov chain models, to see which one makes more sense?

Yes. I'll advertise myself even more, because I have another project on GitHub that uses n-grams and a Markov chain to do predictive text. Actually, by the way, about the Harry Potter thing: I found the source, and they use something similar, a predictive-text approach. It's similar to what I did in that project, but I didn't do Harry Potter. If you have my slides later, there are references at the end; actually, I can show you now. Yes, they use predictive text. I'm not sure exactly which model they are using, but the thing on my GitHub is that kind of predictive model. Let me show you this, since you reminded me. Where is the Harry Potter one? It's here; I found it yesterday. But it's different from my approach, because my approach is done purely by the machine. Maybe the internet is broken again; whatever, you can go to the link yourself and play around with it. In that one, you click on a word and it gives you the next word, so you can write your own Harry Potter fiction by clicking. There's some human interference in that one, because when you click, you are actually choosing the next word. My model just draws from the distribution, so there's no human picking.

Sorry, we are out of time.

Oh, sorry. Thank you very much.

No, no, it was amazing. Thank you very much. Please give it up.

Thanks.