I'd also like to thank ODSC for giving me the opportunity to speak here. I hope I'm audible — okay, maybe the mic wasn't on; sorry about that. I'm Venkat Ramanjay, a senior software engineer at Metapack, and I'm also pursuing my master's at the University of London in the area of NLP and machine learning. Today's presentation is about detection and classification of fake news using convolutional neural networks. If I go too fast at any point, please raise your hand and I'll slow down.

The outline: in section one we talk about why fake news matters now and what the implications are. Section two is my own work — this is my master's project, done end to end — where I'll go through the network architecture I used and the deep learning approaches to combat fake news. Section three is the results summary, and if we have time we'll take some questions.

What is fake news? This is a definition from Wikipedia, nothing new: fake news is content created to mislead people, specifically online. There are different reasons why people create it — financial, business, and so on. It's quite a hot topic in the deep learning world at the moment. Companies like Google and Facebook are trying to productize fake news identification and classification models, because online media is where those companies get their advertising revenue. If people stop trusting online media, the advertising revenue for Facebook and Google goes down, so it's very important for them to productize these models.

A general point: social media takes a lot of the blame for the spread of fake news, because on social media people like and share whatever their friends share. They don't really check whether it's trustworthy — sharing and liking are easy. And the more an article is shared and liked, the higher Facebook ranks it, so a fake story that gets a lot of likes and shares ends up at the top of the news feed. When I did my research I found that in 2010 Morgan Freeman was declared dead, according to tweets attributed to CNN, and CNN later had to report that it was fake news. I've included a link from snopes.com — they are a fact-checking site; they fact-checked the claim and confirmed that Morgan Freeman wasn't dead. Even Jackie Chan has been declared dead at some point.

Next, dataset and corpus exploration. In this work the dataset is the really challenging part, because even humans find it difficult to tell what is fake and what is not. If you want to teach a machine to identify what is fake, you need human intervention to annotate the data, so the labelling was done manually in my case. I also used a Chrome plugin called B.S. Detector — I'm not sure if you've heard of it; it flags known fake sites and warns you before you read them. It's a nice plugin if you want to try it, and it's available for Chrome. I also took some datasets published on Kaggle and GitHub. Kaggle is a data science competition site — reasonably trustworthy, since it has been acquired by Google, and the datasets published there are curated.
A lot of PhD researchers have published datasets on GitHub, so if you want to start experimenting you can use those. Most importantly, something like 80% of a data scientist's time goes into dataset cleaning and preparation — and they keep moaning about having to do all that cleaning. As I said, I also spent most of my time preparing the data for the learning algorithm. I used a wide variety of NLP techniques: stopword removal, padding the documents so they are all the same length, and dropping columns with missing values. I forgot to mention that I also used some entity extraction, which was also used in the previous talk. Once you have the documents you can do clustering and keep only the documents matching certain labels; in my case I selected the documents manually.

Okay, deep learning approaches. I model fake news identification as a binary classification problem: a function f(article) that returns 1 if the piece of news is fake and 0 otherwise. Now, the question is: why deep learning? The baseline classifiers in the text classification world — naive Bayes and support vector machines — have done extremely well on stance classification, sentiment analysis, and general text classification. But I personally found that to use these baseline classifiers you have to use a bag-of-words model, which is essentially a one-hot or count encoding, or TF-IDF, and then you have to do the feature engineering, dimensionality reduction, and feature extraction yourself. So I went ahead with deep learning approaches.

The other issue is that traditional models do not capture semantics in text. By traditional models I mean the bag-of-words and TF-IDF schemes: they don't capture the idea that words with similar meanings, appearing in similar contexts, should have similar representations. This is key. A word can mean different things in different contexts, but words that appear in similar contexts are, in that sense, similar to each other. That's where embeddings come in — word embeddings, and that work was done by Tomas Mikolov at Google. He came up with the algorithm called word2vec. Word2vec vectorizes words in terms of similarity: words that are very similar to each other are placed close together in a high-dimensional space. If you have a document like "I am Venkat", and "Venkat" also appears in another document, "Venkat" will get a similar representation in both. This connects to n-gram models, which are just language models — we've been hearing about language models since this morning, and the keynote mentioned bigram and trigram models. Those are probabilistic language models, while word2vec itself is trained with a shallow two-layer neural network over word contexts, and it's used for the vectorization. The key difference between a baseline classifier and a deep learning approach is exactly these word embeddings.
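To make the word-embedding idea concrete, here is a minimal, hedged sketch using gensim's Word2Vec implementation (the talk does not show code for this step; gensim >= 4.0 and the toy corpus below are my own assumptions for illustration):

```python
# Minimal word2vec sketch. The toy sentences are placeholders, not the project's data.
from gensim.models import Word2Vec

sentences = [
    ["the", "president", "signed", "the", "bill"],
    ["the", "prime", "minister", "signed", "the", "treaty"],
    ["the", "actor", "was", "declared", "dead", "in", "a", "fake", "tweet"],
]

# A shallow two-layer network learns one dense vector per word from its contexts.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["president"].shape)          # (50,): a dense vector, not a one-hot index
print(model.wv.most_similar("president"))   # nearest words in the embedding space
```

With a real corpus, words like "president" and "minister" end up close together in this space — exactly the property that bag-of-words and TF-IDF representations cannot give you.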
If we don't give words with similar meanings similar representations, we miss the semantics of the text, and in text documents semantics matters: where and in what context a word occurs is far more informative than just assigning a unique integer to each word. Facebook, about two years back, released a model called fastText, again developed by Tomas Mikolov after he moved from Google to Facebook. The difference between fastText and word2vec is that fastText operates on character-level n-grams — for example two-character n-grams. The GloVe vectors come from researchers at Stanford. You can use any of these vectorization mechanisms, but I personally chose fastText in this case.

Convolutional neural networks for text classification. We all know convolutional neural networks: they are the state of the art in computer vision and have proven themselves in video analytics and image recognition. Two or three years back, Yoon Kim published a paper on convolutional neural networks for sentence classification — I still have the paper — and that triggered the use of CNNs for text processing. Until then, as we heard in the keynote, recurrent neural networks were considered the best fit for text; before that paper came out, people didn't really think CNNs would suit text processing. Since then, as the keynote speaker mentioned, even Google is productizing CNNs for text classification. I hope everybody has a fair understanding of convolutional neural networks: they have convolutional layers, pooling layers, and fully connected layers, which makes them quite different from a multi-layer perceptron, a plain feed-forward neural network.

How does a CNN fit text and NLP? The key thing in computer vision is that images have spatial structure, and that is exactly what a CNN tries to preserve. Text also has structure and orientation: words don't just appear together randomly, they appear in a context, and you need to capture that context. The convolutional layers preserve that structure, which is one-dimensional in the case of text; in computer vision the spatial structure is two-dimensional, but for text it's one-dimensional. Feature extraction from text is effective using convolutional layers, and this is the key difference with deep learning: the feature engineering is done automatically by the network. Of course you still need some manual work, like annotating labels, but you don't need to tell the algorithm explicitly which features to use — the convolutional layers learn from the data which features are best for the classification. The network itself is trained by minimizing an objective function, and on top of the convolutional layers the pooling layers pick out the strongest feature activations from the data.
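To make the one-dimensional convolution idea concrete before the architecture slide, here is a minimal, hedged shape check with Keras (TensorFlow 2.x assumed; the sizes are illustrative, not the project's actual settings):

```python
import numpy as np
import tensorflow as tf

# Toy batch: 2 documents, 10 tokens each, with 8-dimensional word embeddings.
x = np.random.rand(2, 10, 8).astype("float32")

conv = tf.keras.layers.Conv1D(filters=4, kernel_size=3, activation="relu")
pool = tf.keras.layers.GlobalMaxPooling1D()

h = conv(x)   # (2, 8, 4): each filter slides along the one-dimensional token axis
z = pool(h)   # (2, 4): the strongest response of each filter across the whole document
print(h.shape, z.shape)
```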
So the global feature extraction using a GlobalMaxPooling1D layer helps you find the strongest features from your dataset. Along the way you reduce the dimensionality: you convert words to vectors using fastText, apply the convolutional layers, and reduce the dimensionality as much as possible so that the training time stays reasonable. In my case I trained on a couple of gigabytes of data, but in production people train on terabytes — really huge amounts of data.

Okay. This is the network architecture I built for my project. It's nothing new — it's a convolutional neural network. The input layer is a document vector; as everybody knows, neural networks take numbers as inputs, not words, and the document vector for each document has to be the same length, otherwise the network will not work properly. I use pre-trained embeddings, which in the machine learning world is called transfer learning. Why transfer learning? Because Google and Facebook have built embedding models on their own data — I think Google trained theirs on the Google News corpus and Facebook trained theirs on Wikipedia — and you transfer the knowledge from that pre-trained model onto your own dataset. Without this, I would be training the embeddings myself for hours or even days. Those companies have open-sourced these models, which lets us do the transfer learning, so "pre-trained" is the key word there.

The next layer in the network is a Conv1D layer; that's where the feature extraction starts. The slide just says "windows and filters", but it's not so straightforward — you do a lot of iterations to find the optimal window size and number of filters for your problem. In my case I started with 128 filters and a window size of 5. The next layer is a max pooling layer, which, within a small window, tries to find the best feature to extract. Convolutional networks work that way: on images they take a square patch from your data and extract the best feature from it; on text they take a window of words. You can keep adding layers and layers and it becomes a really big, deep neural network — if you add more Conv1D layers after this, it grows very quickly. I think DeepMind — I'm not sure if you know them; they're owned by Google now — did some work on medical imaging, with, I believe, hundreds of layers and millions of neurons per layer, trained on GPUs. In my case, I then feed the result into a second Conv1D layer, where again I tune the windows and filters, and then a global max pooling layer is used. This is where the strongest features across the whole document are identified, and it matters because that's where you detect what's important in your data.
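A minimal sketch of the stack described so far, assuming TensorFlow 2.x / Keras; the vocabulary size, embedding dimension, document length, and the zero-filled embedding matrix are placeholders (in the real project the matrix would be filled with pre-trained fastText vectors), while the 128 filters and window size of 5 follow the talk:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 300, 1000        # assumed sizes
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))      # fill from fastText vectors in practice

inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")  # padded word-index sequences
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                     trainable=False)(inputs)              # transfer learning: embeddings stay frozen
x = layers.Conv1D(filters=128, kernel_size=5, activation="relu")(x)
x = layers.MaxPooling1D(pool_size=5)(x)
x = layers.Conv1D(filters=128, kernel_size=5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(2, activation="softmax")(x)         # fake / not fake, discussed next
model = tf.keras.Model(inputs, outputs)
model.summary()
```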
In computer vision terms, if you want to identify Venkat, you detect whether he has round eyes, what kind of nose he has, and so on — that's what matters. The global max pooling layer identifies the best features in the same way. Also bear in mind that these are iterative algorithms: you need to keep iterating, again and again, to let the network learn the best representation of your data. One pass may not work; after repeated passes the network learns, okay, this is the best representation of the text.

Finally, the last layer is a softmax, a probabilistic output layer, trained with a cross-entropy loss — which is the negative log-likelihood in machine learning terms. If you know logistic regression, it's the same idea: you model a probability and work with the log-likelihood. So the output is modelled as a probabilistic layer giving the probability that an article is fake given a new input document — a probabilistic classifier. The softmax layer gives outputs between 0 and 1 that sum to 1, so you get a proper probability distribution over the classes; it isn't a Gaussian or anything like that, just a categorical distribution whose probabilities sum to 1.

Now, this is just the model, and you train on it. How you do inference — how you save the model, load it into main memory, and then query it — is a different topic entirely. Here, in your labelled training set, you explicitly say: for this sequence of inputs, this is the label I attach, fake or not fake; it's up to how your dataset is created. Yes — to answer the question — the softmax will give that: in the output layer you specify the output dimensionality, which is 2 here, and the softmax handles it. For the output I used the RMSprop optimizer with a categorical cross-entropy loss. If you were feeding this into a plain feed-forward network, you would need a flatten layer at that point, because a feed-forward network needs flattened input.

Okay, the results summary. The model was trained using Keras with a TensorFlow backend, on my own laptop — a Mac with 16 GB of RAM and a fairly powerful CPU. The dataset I trained on was about 2.5 GB; anything much bigger and I would have to go to the cloud — Amazon has nice solutions there. I trained for about 150 epochs. I'm sure you know an epoch is one iteration of the network over the data: in the first iteration the network won't have learned the representations, so you keep repeating, and eventually, after some number of epochs, it learns the best representation. The batch size was 256, and I trained on a CPU — I haven't trained on a GPU yet. This is the comparison I want to give on model accuracy. The metric shown is accuracy, but you can use different metrics with a neural network — mean squared error, for example. I found 99.8% accuracy using the CNN. I did the same using an SVM, and the SVM gave me 90%.
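For comparison, the kind of bag-of-words baseline being referred to might look like the following (a hedged scikit-learn sketch; the talk does not say which features or kernel were actually used, and the texts and labels here are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder corpus: 1 = fake, 0 = real.
texts = ["celebrity declared dead in viral tweet", "parliament passes new budget bill"] * 50
labels = [1, 0] * 50

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=0)

# TF-IDF features plus a linear SVM -- the manual feature engineering the CNN avoids.
baseline = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
baseline.fit(X_train, y_train)
print(accuracy_score(y_test, baseline.predict(X_test)))
```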
Actually, my friend was saying that I didn't do the feature engineering properly and that's why the SVM gave a lower accuracy — but this is what I got. With a naive Bayes classifier, which is a probabilistic classifier, the model accuracy was 85%. So that's my project. The source code is with me, and I'm publishing a paper right after ODSC. If anyone wants the source code, I'll make it public after my project is approved; until then the university doesn't allow me to publish it. The real-news data was taken from trusted news media — I did this for the UK, where you have the BBC, The Guardian, and The Telegraph. They're trusted because people trust them, and when you download data from those sites you can assume it's real, because fact-checking is done before a reporter publishes anything on the BBC.
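Putting the training configuration from the results slide together, a hedged sketch might look like this; it reuses `model`, `VOCAB_SIZE`, and `MAX_LEN` from the architecture sketch above, and the random arrays are placeholders for the real padded sequences and fake/real labels:

```python
import numpy as np
import tensorflow as tf

# Placeholder training data: padded integer sequences and one-hot fake/real labels.
X_train = np.random.randint(0, VOCAB_SIZE, size=(512, MAX_LEN))
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 2, size=(512,)), num_classes=2)

# RMSprop + categorical cross-entropy, 150 epochs, batch size 256, as described in the talk.
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=150, batch_size=256, validation_split=0.1)
```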