Hello everybody. Welcome to my session at the Open Source Summit North America 2022. My name is João and it's a pleasure to be here talking about machine learning. I know that right now machine learning is a buzzword, in the sense that every algorithm becomes "machine learning" somehow, but that is not a problem, because machine learning helps us solve many different problems, in our daily lives and in our jobs, and we can use it to understand and predict data and extract valuable information. And this is exactly the point I want to make here: we don't need to be alarmist about artificial intelligence and start talking about all the conspiracies revolving around evil cyborgs, terminators and the end of humanity. And we don't need to be super academic and only talk about math either; not that math is a problem, and actually I'd like to give a shout-out to all the mathematicians out there. But this time let's be very practical and ask this question: how can we implement a very simple algorithm that tries to learn some patterns and solve the problem that lives inside this information? Maybe not solve it completely, but gather some clues on how to solve it.

First off, I'd like to introduce myself better. My name is João Vale and I'm a Brazilian engineer. Currently I work with open source and cloud native technologies at SUSE, and my daily job is solving challenges for companies around Brazil. I have worked for SUSE since 2019 and I'm a graduate of the University of Brasília with a bachelor's degree in communications network engineering. Sometimes I lecture and produce videos about cloud native technologies for the open source community, and I'm also a machine learning researcher.

Today we will be talking about some basic concepts of machine learning and natural language processing, but I would also like to do a demonstration by the end of this presentation showing the experiments I did for this task. If you are interested in what I did and want to replicate it, or maybe try something similar, this is the list of things that I used: openSUSE as the underlying operating system for the tests, Jupyter Notebook to organize and run the whole machine learning pipeline, and Python with some important libraries like pandas, scikit-learn and NLTK to run all these tests.

Alright, going back to the basics, I want to give some background on the misinformation spread throughout the media and on why it's important to bring machine learning into this problem. Right now people are very concerned about misinformation on social media, and this happens on many different platforms. Depending on the country you live in, people are more concerned about WhatsApp, like in Brazil, or, if you live in a country like the Philippines, more concerned about Facebook. This image shows a survey that the Reuters Institute did last year with 92,000 internet users in 46 countries, and we can see that there is a real concern, in that case about COVID-19 misinformation. This applies to many different themes, not only the pandemic, but the pandemic was something that powered up all these studies about misinformation, because we watched an unprecedented amount of information spreading throughout the Internet.

Moving on: traditional media is rapidly losing ground to digital media.
This is a fact, and this fact strengthens the argument that it is important to focus efforts on studying the dynamics of digital media and of how social networks drive online news sharing. This is something we all know about, because we all consume news on the Internet today, and the problem is that the veracity of such news is only questioned, and perhaps proven, after a long period of digital dissemination. There is a term, infodemic, that got a bit famous around 2020, and it describes the large flow of information that spreads over the Internet on a specific subject. There are studies that compare the spread of this information to the spread of viruses, and sometimes we see that misinformation spreads much faster than any virus we have observed in nature in all those years. This affects the credibility of the entire journalistic structure, because it can bring negative impacts to the lives of those who consume this news, especially given the lack of ideal strategies to verify and flag fake news.

Okay, so how can machine learning help us with this problem? Well, machine learning consists of consuming a large amount of data (sometimes maybe not huge, but some data), analyzing this data, and starting to gather patterns so we can predict future data. This brings the question: would it be possible to identify patterns in true and fake news that are already published, and try to predict the veracity of future news? Today we are in contact with a huge amount of data: tweets, news texts, images, videos. When we analyze this data and transform it into information, we can make predictions, and this is the very interesting part about machine learning. This type of technology can help fact-checking agencies, which are very important nowadays, quickly identify news with a high probability of being false. Thus, we can leave the efforts of those agents for the more challenging news that needs more serious investigative work.

So, starting our second topic: how can machine learning approach and detect fake news using natural language processing techniques? First off, these are the three main components of any machine learning pipeline: data, features and algorithms. Starting with data: it is the source of information from which we want to obtain patterns; in our case, it's news extracted from online portals. This data can be anything. Sometimes we build an algorithm to identify pictures of cats and dogs in an automated way, and what is the data in that case? It's the thousands of photos of cats and dogs that we push through the algorithm while telling it: learn what a dog is, learn what a cat is. So the data is the piece of information from which we want to extract some knowledge.

Moving on, we have the features. Features are inherent characteristics of the data that provide information for the algorithm. In our case, the features are the words that compose the news. We are going to gather all those words and try to understand their dynamics: whether a word appears more often in one type of text, whether in the case of fake news there is a certain word that appears more frequently. We can start to understand what is happening by taking these features and processing them.

And finally, the algorithm. The algorithm is the method chosen for the task, and today we're going to give some emphasis to the passive-aggressive algorithm.
I had some really nice results with the passive-aggressive classifier. It's not a very famous algorithm, but I think it works pretty well with big data: if you have big inputs, for example Twitter or an important news portal, you can work with the passive-aggressive classifier, because it is very fast and efficient, as I'm going to show you in a few moments.

Okay, so now I'm going to talk about the data. We have the three main topics, data, features and the algorithm, and we start off with data. I divided the data into two sets: a training data set and a test data set. I'm not using a validation data set, because the test data set is composed of news that the training data set doesn't have the slightest idea about. So I created two different data sets, and they contain different news. The main question I asked back then, when I was doing this research at the university, was: would it be possible to train an algorithm with old news, for example from a period of five years, maybe 2015 to 2020, and have this algorithm predict whether a piece of news published, say, in 2022 is fake or true? So this is what I did. I created a training data set composed of 7,200 news items from a very important work by some Brazilian researchers called Fake.br. It is a corpus composed of 3,600 real news items and 3,600 fake news items that were published between January 2016 and January 2018. The test data set, on the right, is composed of 1,940 news items: I gathered the real news from a famous news portal here in Brazil called G1, and I gathered the fake news from a fact-checking site called Boatos.org. This news was published between January 2015 and September 2021, so the test data set has news from before and from after the training period. I gathered news from a very large period to try to understand whether the accuracy of the trained model would hold up on news it does not have a clue about. Okay, so how did I do this? I did web scraping to create the data sets, using Python with the pandas and Beautiful Soup libraries, and I started scraping those portals, going there and collecting all the news. I didn't make any separation by theme, so I gathered all the news, politics, sports and much more, and created the data sets I showed before.

Moving on, we have the preprocessing step. Preprocessing consists of organizing the data and cleaning useless characters, for example periods and commas. I used the pandas library to manipulate and organize these data sets. One important step is the removal of stop words. By the way, stop words are grammatical structures with little relevance to the algorithm; think of articles or conjunctions like "a", "an" and "the". Those words are not very relevant for the algorithm, so we take them out. And this is the basic structure of the data set: you have a link to the news, a timestamp for when the news was published, the news text itself, and the label that tells us whether the news is true or fake. In this case it's the training data set, so you can see how I organized the data here. By the way, this is a really important step for machine learning, creating this structure in the cleanest and most organized way. It takes time, but it's really worth all the effort.
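As a rough illustration of that scraping and cleaning step, here is a minimal sketch, assuming a hypothetical article URL, generic HTML selectors and NLTK's Portuguese stop word list. It is not the exact script used for the experiments, just the general shape of it.

```python
# Sketch of the scraping + preprocessing idea (hypothetical URL and selectors).
import requests
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

def scrape_article(url):
    """Download one article and return link, timestamp and raw text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1")
    title = title_tag.get_text(" ", strip=True) if title_tag else ""
    body = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    time_tag = soup.find("time")
    timestamp = time_tag["datetime"] if time_tag and time_tag.has_attr("datetime") else None
    return {"link": url, "timestamp": timestamp, "text": f"{title} {body}"}

def clean(text, stop_set):
    """Lowercase, strip punctuation and remove Portuguese stop words."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    return " ".join(t for t in tokens if t and t not in stop_set)

stop_set = set(stopwords.words("portuguese"))
urls = ["https://example.com/news/article-1"]           # placeholder list of article URLs
df = pd.DataFrame(scrape_article(u) for u in urls)
df["text"] = df["text"].apply(clean, stop_set=stop_set)
df["label"] = "true"                                     # or "fake", depending on the source
df.to_csv("training_dataset.csv", index=False)           # link, timestamp, text, label
```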
So take your time and put your effort into the data sets; sometimes that is even more important than choosing the algorithm.

Okay, so now we have the features. The features, as mentioned before, are in this case the words that compose the news, and I decided to do the tokenization using TF-IDF. Let me explain what this is. TF stands for term frequency, and it defines the frequency of appearance of a given term in a document. If I take a word, for example "apple", I count how many times the word "apple" appears in a piece of text or a piece of news, and I save this frequency in a variable. Inverse document frequency, IDF, is the other part of the process: it is a measure of how significant a term is within the entire corpus. So I take that word "apple" and analyze it across all the news, because if I only looked at the term frequency I might see, for example, 200 appearances of the word "apple", and my algorithm would say: I think this word "apple" is very relevant for a fake or a true piece of news. And this might not be true, so we need the IDF to understand how significant the term is in the entire corpus. We are not just considering its frequency, but its importance in the entire corpus, I mean the entire data set of news. I have the two equations here that define the value of TF-IDF, and the IDF is implemented on a logarithmic scale in a way that avoids division by zero. Taking all these values, for each word we get a value, and we start to build a vocabulary of size t containing the tokens after the tokenization step. So you have a word, you have a token for that word, and this token defines the importance of this word within our entire data set.

Another tokenization process that I used is called Word2Vec. Word2Vec works differently from TF-IDF: it represents a word as a vector within a context. It takes a word, for example "woman", and identifies what the context of this word is in those texts. So this process can start building semantic relationships, for example "woman" with "man", "queen" with "king", "dog" with "cat". We start to see skip-grams, which are pairs of the type (target word, context word). So we take individual words and analyze their context: we start to see what sits beside each word. For example, if we take the word "politics", what are the words that come together with politics? We analyze this in true news situations and in fake news situations, and our machine learning algorithm tries to learn the patterns revolving around the context of a certain word. So in this case each word is represented by a vector that dictates its proximity to other words in a context, not just by a single token weight.
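Before moving on to the algorithm, here is a minimal sketch of the two tokenization approaches just described, using scikit-learn's TfidfVectorizer and gensim's Word2Vec as stand-ins for whatever the original notebook used; the file name, the "politica" probe word and the parameters are illustrative assumptions.

```python
# TF-IDF and Word2Vec tokenization sketch, assuming the cleaned CSV from the previous step.
# A common TF-IDF form is tfidf(t, d) = tf(t, d) * log(N / (1 + df(t))),
# where N is the number of documents and df(t) is how many documents contain t.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

df = pd.read_csv("training_dataset.csv")

# TF-IDF: one sparse vector per news item, one weight per vocabulary token.
vectorizer = TfidfVectorizer(max_features=50_000)        # vocabulary of size t
X_train = vectorizer.fit_transform(df["text"])
print(X_train.shape)                                     # (n_documents, vocabulary_size)

# Word2Vec (skip-gram): one dense vector per word, learned from its context words.
sentences = [text.split() for text in df["text"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)
if "politica" in w2v.wv:                                 # hypothetical token
    print(w2v.wv.most_similar("politica", topn=5))       # words that share its context
```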
Okay, moving on, we have our third important step in a machine learning pipeline, which is the algorithm. In our case, we are doing supervised learning with a classification task: we have to classify whether a piece of news is true or fake. Classification is a classic machine learning method, and it separates entries based on previously known attributes, so it's always necessary to label the data. This is what we did: we took the news, labeled it true or fake, and gave it to the algorithm. This method uses pure statistics; you can't escape the math, so shout out again to math. It's important to understand some math concepts when building a machine learning model. It's not a prerequisite, but it helps, because then we can propose changes or analyze which algorithm suits a certain task better. I also want to give some examples of everyday applications of this kind of classification algorithm: spam filters; the Google search engine, which uses decision trees to classify the URLs that you search for; document classification, which is our case; and sentiment analysis, for example taking a bunch of tweets and deciding whether each one is happy or angry, and so on.

Alright, so we chose the passive-aggressive classifier for this classification task. It's a pretty straightforward algorithm: it takes a new document from the internet, a piece of news, it updates the weights of the model, and then it throws this new piece of information away. If you want to dive deep into the mathematical details I can share the papers with you, but in simple words this is how it works. You take, for instance, a new document D and you place it on this hyperplane, where the x-axis represents the fake label and the y-axis represents the true label. The closer a document is to the y-axis, the higher its probability of being true; the closer it is to the x-axis, the higher its probability of being fake. So you take this document and place it on the hyperplane. Alright, I take this document D and the model classifies it as a true piece of news. This black line here is the old weight vector of the classifier: in the figure on the right, the document landed in the true region of the model, and below the line is the fake region. So it was classified as a true piece of news, and it was classified wrongly, because it actually is fake news, represented by this red X. The algorithm sees that it made a wrong classification and changes its weights: it calculates a new weight, the green one, and puts it into the model. So here we have this w in green representing the new weight of the model, and you can see that this vector is calculated using a loss, represented by this L, and D, which is the representation of the piece of news that was classified. It takes the old weight, the black vector, and shifts it in the graph, so you get a new, more precise weight for the model, represented here by the green line.

Okay, so this is the summary of the machine learning pipeline we built. We did web scraping with Python scripts to compose the training data set and the test data set, and the preprocessing and the tokenization with TF-IDF are part of the natural language processing step. Next we have the training step, which produces an output, the machine learning model, and we can use the data from the test data set, meaning the tokens we got from the test data set, for the inference, which is the step where we take the model, feed some input data into it, and get output metrics to check whether we can predict data well or badly. So we get metrics like accuracy, precision, recall and F1 score.
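A compact sketch of that train and inference flow, assuming the two CSV files built earlier and scikit-learn's PassiveAggressiveClassifier, could look like this; the file names and parameters are assumptions, not the notebook's exact values.

```python
# Train on the Fake.br-style training set, infer on the held-out test set, report metrics.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

train = pd.read_csv("training_dataset.csv")   # columns: link, timestamp, text, label
test = pd.read_csv("test_dataset.csv")

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train["text"])
X_test = vectorizer.transform(test["text"])    # reuse the training vocabulary

clf = PassiveAggressiveClassifier(max_iter=50)
clf.fit(X_train, train["label"])

pred = clf.predict(X_test)
print(confusion_matrix(test["label"], pred, labels=["fake", "true"]))
print("accuracy :", accuracy_score(test["label"], pred))
print("precision:", precision_score(test["label"], pred, pos_label="true"))
print("recall   :", recall_score(test["label"], pred, pos_label="true"))
print("f1       :", f1_score(test["label"], pred, pos_label="true"))
```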
Alright, and now we finally get to the demo. So here I am in my Jupyter notebook. I don't want to just throw code at you, I know it gets hard to understand what is happening here, so I just want to focus on some important things. Starting with the libraries that I used: NumPy, pandas and the CSV module to import and organize all the data. This is part of the preprocessing step: I take all the data that I gathered in the web scraping step, which I did with other Python scripts, and I organize it in a very clean way. So these are all cleaning steps, and this is the part where I did the feature extraction, which is what TF-IDF does. It's worth emphasizing the NLTK library here: you can do many things with NLTK, it covers a lot of natural language processing workloads, and it works alongside TF-IDF, Word2Vec and the other NLP tooling used here. What this part does is take each word of the data set we created and attribute a token to it, so we can start to understand which words are more or less important in our data set. And finally we trained our passive-aggressive classifier.

In this part we can see a very nice picture that shows the accuracy of our model: this is a confusion matrix, and it shows us how well our algorithm did. You can see here that 96% of the items that are really fake were classified as fake, and only 4% of the fake news was wrongly classified as true. And in the line below, for news that was true, 12% of the time it was classified as fake and 88% of the time it was correctly classified as true. This gives us an accuracy of 91.73%, which is really good for such a quick and fast algorithm; by the way, the training took about 0.2 seconds, just to give you an idea. The precision was 0.95, the recall 0.87 and the F1 score 0.91, so these are the overall metrics for our machine learning algorithm, which is something very important for our study.

Something else I want to show you is this table, which shows a ranking of important words. This was a case study for Brazil, so all these words are in Portuguese, but you can see the 50 tokens that are most likely to appear in fake news: the more negative the weight, the more important this word is for classifying a piece of news as fake. And you have the same for the true tokens, for the true news: the more positive the weight, the more important this word is for the classifier to classify a piece of news as true.
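If you want to reproduce a ranking like that, a small self-contained sketch along these lines should work, assuming the same TF-IDF plus passive-aggressive setup as above; which sign ends up meaning "fake" or "true" depends on how the labels are encoded, so treat the print labels as placeholders.

```python
# Rank vocabulary tokens by the weight the trained passive-aggressive model assigns them.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

train = pd.read_csv("training_dataset.csv")
vectorizer = TfidfVectorizer()
clf = PassiveAggressiveClassifier(max_iter=50).fit(
    vectorizer.fit_transform(train["text"]), train["label"]
)

feature_names = np.array(vectorizer.get_feature_names_out())
weights = clf.coef_[0]                 # one weight per vocabulary token
order = np.argsort(weights)            # most negative first, most positive last

print("50 tokens at the most negative end of the weights:")
print(feature_names[order[:50]].tolist())
print("50 tokens at the most positive end of the weights:")
print(feature_names[order[-50:]].tolist())
```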
We also wanted to compare the passive-aggressive classifier with other algorithms, so I ran the same task with a decision tree classifier, a random forest classifier, SVC (a support vector classifier) and logistic regression. We had some results over here, and you can see that the passive-aggressive classifier did well, actually better than all the other options in accuracy and F1 score. In precision it fell behind the random forest classifier and the logistic regression, but when you take the F1 score into account, which weighs precision and recall together, you see that it does better than everyone else. So the passive-aggressive classifier did a really good job for this task.

Alright, now I want to show you something else that we didn't talk much about, but that I think is better shown than told. This is the knowledge extraction part. We have this algorithm performing well at detecting whether a piece of news is fake or true, but it's also important to understand the structure of this news. This is why I used an unsupervised algorithm called t-SNE, t-distributed stochastic neighbor embedding, and for this case I used the other tokenization step, Word2Vec, that we talked about before. There are some nice things we can show here. For example, this is a cloud of words: we take a word, say from a true piece of news, with a vector associated to it, and we try to understand what sits next to this word. We take a 2D plane, we put the word on this plane, and we start to see what is beside it; we start to see the context of this word that we talked about before, which is exactly what Word2Vec gives us. Looking at the whole picture you can't see much information, only that there is a true label and a fake label, which is consistent with the ranking that I showed before. But this is a different representation: we are not looking at TF-IDF, we are looking at Word2Vec, which gives us the context of each word, and we can take this picture and analyze it in a more specific way. We can start seeing things like this: for example, here is a plot of the features that were very important for classifying a piece of news as true. You can start seeing that some words come together a certain number of times. It's hard to see because there are too many words, but you start to notice that some words have a context around them: when they appear, they usually appear with other particular words, and the algorithm can learn that. Take the word "política", which stands for politics in English: in true news you find certain words beside politics, and in fake news you find different words beside politics. In this way, we can see that the algorithm can use this context to make the separation and classify a piece of news as true or fake. And we can see something like this in the case of the fake news: for fake news we see fewer words, but you can see that the groups are different here. This is what is important to identify in the kind of pictures that t-SNE gives us.
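A minimal sketch of that kind of projection, assuming gensim for Word2Vec and scikit-learn's TSNE plus matplotlib for the plot (none of which are confirmed as the exact tools from the notebook), might look like this; the file name and plotting details are illustrative.

```python
# Project Word2Vec word vectors to 2D with t-SNE and label each point with its token.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

df = pd.read_csv("training_dataset.csv")
sentences = [text.split() for text in df["text"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)

words = w2v.wv.index_to_key[:500]              # a few hundred most frequent tokens
vectors = np.array([w2v.wv[w] for w in words])
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(vectors)

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=4)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=6)     # label each point with its token
plt.title("t-SNE projection of Word2Vec word vectors")
plt.show()
```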
Okay, so going back to our conclusion, we can now talk about our results and what we achieved with these tests. In less than ten seconds it was possible to train a passive-aggressive model and classify 1,940 news items never seen before with an accuracy of over 90 percent, which is pretty good. There seem to be lexical patterns in Brazilian news; I'm not sure whether the same holds for English news or news from other countries, but in Brazil there seem to be lexical patterns that make this differentiation possible, so you can separate true news from fake news with ease. I know that sometimes you can fall into traps, and a true piece of news can be classified as fake or a fake one as true, but this is the point of such an application: it eases the job of classifying a huge amount of news, so you can leave the more challenging news to a more human approach, let's say. The model was trained with news from a period up to five years old, and it was still able to classify more recent news with great accuracy. This gives us some clues about how a machine learning model works for this case: you do not need to feed this algorithm with recently published news; you can use old news, it will build those maps of patterns, and you can apply these patterns to newer pieces of news.

Finally, for future work, we can start talking about web applications, bots, APIs and more. What would the adoption of such a classifier look like in social networks? How could we put this kind of feature into social networks and do a fast classification of the news being shared, for example? And there is a very important question that I bring to you, so we can all go home thinking about it: an ethical and social discussion about how to ensure the reliability of the training data, because someone needs to tell the computer which news is true and which news is fake. This brings us to an ethical discussion of extreme relevance that should really be talked about and studied, sometimes even more than the application itself.

Okay, so this sums up what I had to share with you today. It's been a pleasure being here and I hope we see each other in the future. Stay safe. Bye bye.