I do software by day, and by night I do a bit of machine learning, which is my interest. One of the things I wanted to share is my experience classifying duplicate questions with TensorFlow. For those of you already on Kaggle, I'm sure many of you have attempted this particular competition, Quora Question Pairs. It's going to be a short talk, so I'll plow through, and we can take questions later.

The challenge is to build a classifier that determines whether two questions are duplicates. The key thing is that this is a human-labeled dataset, which means there's a degree of human judgment baked into the labels. The dataset comes with only the pairs of questions and the corresponding class labels, and the evaluation criterion is log loss. As usual, we go through the typical analytical workflow: EDA, data prep, feature generation, model building, evaluation. This mirrors what happens on Quora's website: if you type in a question, it gives you a ranking of the closest, most similar questions. I think one reason they chose log loss is that they care not just about whether your prediction is right or wrong, but about the confidence with which you predict that a question is a duplicate.

I'll jump to how I prepared the dataset, then the text feature generation, and then we can go through the models. With the data provided, we start with the word and character counts for each question. Just based on intuition, the character counts and word counts of two similar sentences should be fairly close to each other, so that's an intuition-based piece of feature generation.
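As a rough sketch of what those count features look like (illustrative names, not the code from the talk):

```python
# Intuition-based length features for a question pair: word and character
# counts per question, plus their absolute differences.

def count_features(q1: str, q2: str) -> dict:
    """Return simple length-based features for a question pair."""
    w1, w2 = q1.split(), q2.split()
    return {
        "q1_words": len(w1),
        "q2_words": len(w2),
        "q1_chars": len(q1),
        "q2_chars": len(q2),
        "word_diff": abs(len(w1) - len(w2)),
        "char_diff": abs(len(q1) - len(q2)),
    }

feats = count_features("How do I learn Python?", "How can I learn Python?")
```

Near-duplicate questions like these produce near-identical counts, which is exactly the signal the feature is meant to capture.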
The next feature is the share of matching words: what proportion of the words appearing in one sentence also appear in the other? The higher the proportion, intuitively, the closer the match. We all know that's not necessarily the case, but at this stage we're trying to generate as many features as possible. One decision I took was not to do stemming, to preserve the meaning of the tenses, and I used a custom stop word list that preserves keywords such as how, why, when, and where. These matter a lot: a question that begins with how is obviously different from one that begins with why.

The next feature was the TF-IDF weightage. TF-IDF stands for term frequency–inverse document frequency; it just means that a word that appears very rarely across the corpus is likely to be pretty important. So intuitively, questions containing unique terms that appear in one question and not the other are less likely to be duplicates. That's the matrix at the bottom: for these two questions you get the TF-IDF weight of every term, and then you calculate a score. You can sum it, multiply it, or take the mean; it doesn't matter, as long as you generate many, many features out of it.

The next feature is semantic similarity. Here I took the approach of using WordNet from Princeton, which is about finding the semantic distance between words in the two sentences. With WordNet, we use the synsets to determine the length distance and hierarchy distance. Length distance is the length of the shortest path in the semantic ontology between two synsets: each word belongs to a synset, and we measure how far apart they are.
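A minimal sketch of the word-match share with a custom stop list (how/why/when/where deliberately kept), plus a toy IDF table over a tiny corpus; the names and the corpus are illustrative, not the talk's actual code:

```python
import math

# Custom stop word list: common fillers removed, but question words like
# "how", "why", "when", "where" are deliberately NOT stopped, since they
# change a question's meaning.
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "does", "so"}

def word_match_share(q1: str, q2: str) -> float:
    """Share of (non-stop) words appearing in both questions."""
    w1 = {w for w in q1.lower().split() if w not in STOP_WORDS}
    w2 = {w for w in q2.lower().split() if w not in STOP_WORDS}
    if not w1 or not w2:
        return 0.0
    return 2 * len(w1 & w2) / (len(w1) + len(w2))

def idf_weights(corpus: list[str]) -> dict:
    """IDF per term: terms that are rare across the corpus weigh more."""
    n = len(corpus)
    df: dict = {}
    for doc in corpus:
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {t: math.log(n / c) for t, c in df.items()}

share = word_match_share("why is the sky blue", "why is the sky so blue")
idf = idf_weights(["how to cook rice", "how to cook pasta",
                   "quantum entanglement basics"])
```

Here "quantum" gets a higher IDF than "how", matching the intuition that a rare term appearing in only one question of a pair argues against a duplicate.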
Next is the hierarchy distance, a measure of depth in the ontology: nodes closer to the root are broader and carry less semantic similarity. What this means is that we compare the two sentences and, for every single word pair (what vs. weightage, what vs. best, and so on), we do that semantic similarity lookup for both length distance and hierarchy distance. You can see there are a lot of zeros, because WordNet doesn't cover a lot of the conjunctions, for example, so those get zero; but other words like what and best do have synsets, so you can get their similarities. From this we calculate the semantic similarity of the two sentences; in this example the score is 0.6674, which intuitively looks about right.

Next we took the word order similarity. Similar to the previous approach, we look at the order in which words appear in the two sentences and compare that, again using WordNet; the word order similarity here is about 0.67, which is also intuitively right. So what I've been doing here is really generating a lot of different features, trying to find as many as possible to pass into the convolutional neural network that I built.

The next one is word2vec embeddings to determine word similarity. Again, we do a word-by-word vector comparison, this word against that word, using the word2vec similarity score. The word2vec model was trained on Google's three-billion-word dataset; I loaded it in, ran the similarity score, and computed the score between every pair of words. What you get is a matrix like that. I was actually quite inspired by Martin when he said that for CNNs, everything's an image, so following that approach we visualize that matrix.
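As a hedged sketch of how length distance and hierarchy distance are commonly turned into similarity scores in the WordNet sentence-similarity literature: path length decays exponentially, subsumer depth saturates via a tanh-shaped curve. The alpha/beta values are common literature defaults, not necessarily the ones used in the talk:

```python
import math

ALPHA, BETA = 0.2, 0.45  # assumed decay/saturation parameters

def length_similarity(path_len: float) -> float:
    """Shorter shortest-paths between synsets -> higher similarity."""
    return math.exp(-ALPHA * path_len)

def depth_similarity(depth: float) -> float:
    """Deeper (more specific) common subsumers -> higher similarity."""
    e_pos, e_neg = math.exp(BETA * depth), math.exp(-BETA * depth)
    return (e_pos - e_neg) / (e_pos + e_neg)  # tanh(BETA * depth)

def word_similarity(path_len: float, depth: float) -> float:
    """Combined score: product of the two components."""
    return length_similarity(path_len) * depth_similarity(depth)
```

Identical synsets (path length 0, deep subsumer) score near 1.0, while words whose only common ancestor is near the root score near 0, which matches the zeros seen for conjunctions.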
Intuitively you can see it's about right: the red dots are the matching words along the axis, and the rest of the weights spread out around them, so the visualization makes sense. For CNNs we typically use a nice square input, so I chose 20 by 28, following the MNIST example; it's just easy to do, but you can use other dimensions as well, as I'll show later.

Here are some of these text visualizations, just to check whether they're intuitively correct. The first pair is a duplicate: "Does waist training really work?" and "Do waist cinchers really work? If so, how?" That's what the image looks like. The other duplicate pair is "What's the most beautiful French song?" and "What are some good French songs?" It gets interesting here, because you no longer see that diagonal centerline across the image; you see a very different pattern coming out. So intuitively, maybe CNNs would be able to extract these features and use them for prediction. The last pair is not a duplicate: "How do bartenders actually become bartenders in California?" versus "How do bartenders actually become bartenders in Texas?" Every word is the same, every position is the same, only the last word differs, and that changes the entire meaning of the sentence. This is something a lot of neural networks may not be able to solve, though sequence-to-sequence models might.

Next we combined the word2vec word-similarity matrix with the other features we'd previously created into another matrix representation; in this case I converted it to a 24 by 33, 792-pixel representation. With this, you'll see that the nice spatial structure splits apart.
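A minimal sketch of the word-by-word similarity matrix that gets visualized as an image, using cosine similarity over tiny made-up embedding vectors standing in for the pretrained word2vec model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(words1, words2, emb):
    """One row per word of question 1, one column per word of question 2.
    Out-of-vocabulary words get 0.0, like the WordNet gaps."""
    return [[cosine(emb[a], emb[b]) if a in emb and b in emb else 0.0
             for b in words2] for a in words1]

# Toy embeddings (made up for illustration):
emb = {"song": (1.0, 0.1), "songs": (0.9, 0.2), "french": (0.1, 1.0)}
m = similarity_matrix(["french", "song"], ["french", "songs"], emb)
```

Matching words light up near 1.0 (the "red dots" on the diagonal), while unrelated word pairs stay low, which is the pattern the CNN is expected to pick up.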
It's no longer a nice diagonal, and intuitively that's okay, because we expect the CNN to identify features even when they're not so nicely aligned and still use them to build the model.

I'm going to show a bit of code now and then the results. After the usual mandatory initialization, we do a bit of EDA. From the training set you can see the number of occurrences of each question against the number of questions: many questions are unique, but there are also a lot of repeated questions in the training set. That gives some comfort that this is probably trainable and usable.

Next we visualize the word-count distribution, and this distribution is actually quite important; it gave us a lot of information. The peak is at 10 words: most questions were about 10 words long, tailing off towards 30. Because of that, a 20 by 28 matrix would cover the dimensions of most questions, but that leaves a problem: with a 20 by 28 matrix, what do I do with sentences that are only 10 words long? The approach I took was a zoom: I zoomed the image to fill the matrix, instead of padding it with zeros.

I also did the mandatory data cleansing and pre-processing, and here I just loaded the pre-built data. You can see the original question; the tokenized and stemmed question set; the word counts; the normalized word counts; the percentage word match; the semantic similarity and word order similarity; followed by the various TF-IDF weights, for which I took a sum as well as a median.
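A sketch of that zoom step: instead of zero-padding a small similarity matrix up to the fixed CNN input size, resample it to the target shape. Nearest-neighbour resampling stands in here for whatever interpolation the actual pipeline used (e.g. `scipy.ndimage.zoom`):

```python
# Stretch a small 2-D matrix to a fixed target shape by nearest-neighbour
# resampling, so short questions fill the CNN input instead of leaving
# large zero-padded borders.

def zoom_nearest(matrix, out_rows, out_cols):
    in_rows, in_cols = len(matrix), len(matrix[0])
    return [[matrix[r * in_rows // out_rows][c * in_cols // out_cols]
             for c in range(out_cols)]
            for r in range(out_rows)]

small = [[1, 2],
         [3, 4]]
big = zoom_nearest(small, 4, 4)  # 2x2 stretched to 4x4
```

Each source cell is simply repeated to cover its share of the target grid, preserving the spatial pattern at the larger size.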
This is just another view of the data, before I created the images. For this example I pre-built the dataset and loaded it, because it's really big; this is after all the word-match translations.

Then some further EDA to see which features are actually usable. Word match percentage was the one we'd intuitively decided to look at, on the idea that the number of shared words across sentences might indicate a duplicate. It looks like word match percentage may indeed be useful, because you can see a clear differentiation between the duplicate and non-duplicate labels. Word order similarity looks okay as well, might be useful. Semantic similarity looks pretty good; both the word order similarity and the semantic similarity figures came from WordNet. We also visualized the word-match similarity matrix. This example is not a duplicate pair, but it is obviously a very long question, and intuitively the image suggests it wouldn't be a duplicate. It isn't.

From here we build the CNN. In this case I'm using a convolutional network with three layers instead of the standard two. We apply one-hot encoding to the labels so that it's easy for TensorFlow to do the comparison. I partitioned the dataset 80/20: 80% training, 20% validation. In the actual competition the training set was huge, over 400,000 training pairs, and the test set was over a million records; for this code presentation I just took a very, very small sample.
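A minimal sketch of the label one-hot encoding and the 80/20 train/validation split described above (illustrative names, seeded shuffle for reproducibility):

```python
import random

def one_hot(labels):
    """0 -> [1, 0] (not duplicate), 1 -> [0, 1] (duplicate)."""
    return [[1, 0] if y == 0 else [0, 1] for y in labels]

def train_val_split(pairs, labels, train_frac=0.8, seed=42):
    """Shuffle indices, then cut at the train fraction."""
    idx = list(range(len(pairs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_frac)
    train, val = idx[:cut], idx[cut:]
    return ([pairs[i] for i in train], [labels[i] for i in train],
            [pairs[i] for i in val], [labels[i] for i in val])

X = [f"pair-{i}" for i in range(10)]
y = [i % 2 for i in range(10)]
Xtr, ytr, Xva, yva = train_val_split(X, y)
```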
We initialize the variables with an image size of 24 by 33. What I found was that batch size and number of epochs had a lot of influence on my model. With a very small batch size, my accuracy wasn't very good; if I went way up, to 300, 350, 500, accuracy was pretty bad as well. After a couple of experiments I settled at around 250 with 20 epochs, which seemed to work for this particular dataset.

Then we set up the variables and helper functions to create the various layers of the convolutional neural network, with standard 2 by 2 max pooling. The first convolutional layer uses a 3 by 3 filter, so it iterates over the entire image with a small 3 by 3 window. The second layer is 5 by 5. The third layer was really an experiment: with two layers the accuracy was okay, but adding another layer took it up a notch, so I stuck with a third layer at 5 by 5 with 64 filters. I didn't go to 128, because the training time was simply too long. The dense, fully connected layer is then 3 by 5 by 64, a 960-neuron dense layer (1024 was my initial test, so please ignore that), followed by a dropout layer and a readout layer.

Next we define the loss function, regularizers, optimizers, and evaluation functions: standard softmax cross-entropy with logits, regularizers on the cross-entropy, a variable learning rate, and the Adam optimizer, which I found was probably the best; momentum and plain gradient descent weren't very good, but Adam gave pretty good results. And then, of course, evaluation and accuracy.
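The 960-neuron figure falls out of the shapes. Assuming SAME padding and 2 by 2 max pooling with stride 2 (each pool maps dimension d to ceil(d/2)), three conv+pool rounds on a 24 by 33 input end at 3 by 5, and with 64 filters that flattens to 960:

```python
import math

# Feature-map size arithmetic: with SAME padding, a 2x2 max pool with
# stride 2 maps dimension d to ceil(d / 2).

def after_pools(dim, n_pools):
    for _ in range(n_pools):
        dim = math.ceil(dim / 2)
    return dim

rows = after_pools(24, 3)   # 24 -> 12 -> 6 -> 3
cols = after_pools(33, 3)   # 33 -> 17 -> 9 -> 5
flat = rows * cols * 64     # 64 filters in the last conv layer
```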
As with any classification algorithm, accuracy is just one metric, and it can be very misleading, so we also look at precision and recall, as well as the F-measure, to compare models. One thing I found really easy to do, even in raw TensorFlow, not Keras, was adding summaries so you can visualize what's happening in your neural network as it trains. I'm a beginner with TensorFlow and it didn't take me long to figure out; I thought it would be rocket science, but it was surprisingly easy. You create a log writer to capture the summaries as training runs, and then visualize them in TensorBoard.

Then we execute the graph. I stopped it halfway here because it was taking too long, but you can see the accuracy improving. Let me show what it looks like in TensorBoard. As you can see, accuracy goes up, and with just these very basic feature sets you get about 80% accuracy, which is surprisingly good. When I tried the Kaggle favorite, XGBoost, it took about three and a half days of training on a 16-core CPU to get about 83%; I got this in about four to five hours, so I was pretty surprised. The error rate goes down, but it's still not good, still around 0.3. The competition winners got it down to 0.1, but they were using ensembles of five or ten different models, not just one. One more thing to note about classification algorithms: precision and recall. Here you can see the recall.
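A minimal sketch of those extra metrics, computed from predicted versus true labels with "duplicate" (1) as the positive class:

```python
# Precision, recall, and F-measure from label lists; accuracy alone can
# hide a model that over- or under-predicts one class.

def prf(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A model that finds every duplicate but over-predicts the positive class:
p, r, f = prf([1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0])
```

High recall with weaker precision, as in this toy example, is exactly the pattern reported for the trained model.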
Recall plateaued around 0.85, which is not too bad, but precision was fairly lacking. I think this signals a problem with the model itself: it couldn't fully differentiate false positives and false negatives from true positives and true negatives. The F-measure was not too bad either.

One nice thing about TensorFlow is the ability to add different views of your statistics to the model. In this case I visualized a simple confusion matrix; apologies, it's just a black-and-white image, I did this very last minute. The axes are the target label and the prediction, one and zero. You can see the model actually learned to detect non-duplicates, questions that were very different, quite well, but it didn't do so well on questions that were duplicates, which is to be expected given the current dataset and model. So that's TensorBoard, and at the end you run the model against the validation set.

The other thing I wanted to experiment with was CNN plus SVM: at the dense layer, instead of going on to the output layer, you feed that entire layer, the 960-neuron layer, into an SVM and see what the output looks like. Unfortunately I wasn't able to complete this experiment yet, but I anticipate it might give a better result.

Anyway, back to the presentation. A couple of things I observed about CNNs: imbalanced versus balanced datasets. My experience with this particular exercise was that, as with any other neural network, having a balanced dataset is very important.
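A minimal sketch of the confusion matrix shown on TensorBoard, with rows as the true label and columns as the prediction (0 = not duplicate, 1 = duplicate):

```python
# 2x2 confusion matrix: m[true_label][predicted_label] counts.

def confusion_matrix(y_true, y_pred):
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# Good at spotting non-duplicates, weaker on duplicates, as observed:
cm = confusion_matrix([0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 0, 0])
```

Here the top row (true non-duplicates) is perfect while the bottom row shows two duplicates missed, the same asymmetry seen in the trained model.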
With the imbalanced dataset you get 83.6%, which might seem very good, but it's actually due to learning the negatives rather than identifying the positives, so that 83% is a very misleading figure. When you balance the dataset, the accuracy drops, but the error rate drops as well, so it's actually a less overfitted model. Both models suffer from poor precision but relatively high recall, and these are probably due to common NLP issues.

A couple of NLP issues typically need to be looked at, not just from a TensorFlow modeling perspective but from the overall approach. The first is training set labeling. Take this pair: "What's your new year resolution for 2017?" and "What are some of the best new year resolutions for 2017?" These are arguably not duplicates, but the human labelers marked them as duplicates. The labeling is very subjective, the dataset itself is very subjective, so the model you learn is subject to the labelers' biases, which we somehow have to find a way to deal with. The other issue is named entities: these two questions are obviously not duplicates, but according to the trained CNN they were.

So what are the possible improvements from my learnings here? First, add NER, to identify duplicate questions with high structural similarity but different question targets. For example: "How do I become a bartender in Chicago?" versus "How do I become a bartender in California?" Chicago versus California is a named-entity distinction that we need to build in. Second, tune the CNN hyperparameters. Filter shapes helped a lot: my first layer was 5 by 5 and didn't get too good a result; when I changed it to 3 by 3, both accuracy and error rate improved. Stride size: currently the stride is one.
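A sketch of one way to balance the dataset, undersampling the majority class (non-duplicates outnumber duplicates in the Quora training data); the helper name is illustrative:

```python
import random

# Keep every minority-class example and a same-sized random sample of the
# majority class, then shuffle the result.

def undersample(samples, labels, seed=0):
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    major, minor = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    keep = minor + random.Random(seed).sample(major, len(minor))
    random.Random(seed).shuffle(keep)
    return [samples[i] for i in keep], [labels[i] for i in keep]

X = list(range(10))
y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]   # 3 positives, 7 negatives
Xb, yb = undersample(X, y)
```

After balancing, accuracy can no longer be inflated by simply predicting the majority class, which is why the headline figure drops while the model actually improves.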
According to the literature out there, it can be one or two, so that's something to try out; likewise 2 by 2 pooling layers versus no pooling layers. The other thing I came across was channels. Right now the input has only one channel on the 792-pixel image. You could add another channel and apply another filter, for example a Sobel filter; if you do, you can see in the third image that it's able to pick out edges of certain features of the image. I know CNNs are supposed to learn that already, but adding another channel, another perspective, might help the network. This is an experiment I'm still working on.

And that's it. These are some of the links that were helpful in my research and in pointing me in the right direction, and you can download the source code from my GitHub account. All right, that's it. Thank you. Any questions?
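A minimal sketch of that extra-channel idea: apply a 3 by 3 Sobel filter to the similarity "image" and stack the result as a second input channel. Valid-mode convolution on plain nested lists, for illustration only:

```python
# Horizontal-gradient Sobel kernel: responds strongly to vertical edges.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def convolve2d(img, kernel):
    """Valid-mode 2-D convolution with a 3x3 kernel."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h - 2):
        row = []
        for c in range(w - 2):
            row.append(sum(img[r + i][c + j] * kernel[i][j]
                           for i in range(3) for j in range(3)))
        out.append(row)
    return out

# A hard vertical edge: zeros on the left, ones on the right.
img = [[0, 0, 1, 1]] * 4
edges = convolve2d(img, SOBEL_X)
```

The filter fires along the boundary between the two regions, handing the network a pre-computed edge map as its second channel.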