Hello everyone, my name is Obiyama Kagban Eje, and this afternoon I'll be talking about building a naive Bayes text classifier with scikit-learn. Like I said, I'm from Nigeria. I'm a graduate of computer engineering from the University of Lagos, and I've worked in the IT, telecoms and travel industries. I heard that data science is the sexiest job of the 21st century, and I said, okay, I want a piece of that action. So I hopped on a plane, and I'm currently studying data science at Robert Gordon University, Aberdeen. The objective of this talk is to hopefully help you understand how the naive Bayes algorithm works and why it's such a good algorithm for text classification, and also to show you how it's implemented in Python using scikit-learn. Naive Bayes is a supervised learning algorithm based on probability, specifically Bayes' theorem. Bayes' theorem is named after Thomas Bayes, a clergyman from the 18th century. He was trying to figure out a proof of God's existence and somehow found a new way of thinking about chance. It wasn't really developed further, though, until Pierre-Simon Laplace came along and built it into what it is today. Naive Bayes is probabilistic because it calculates the probability of each class given an instance, and then assigns the instance to the class with the highest probability, and this is done by counting. Naive Bayes is considered naive because it assumes the features are independent of each other, and by features, in terms of text classification,
I mean the words. That means it doesn't consider word order, but it does consider the weight and frequency of the words in the text document. As for the advantages of naive Bayes: it's very simple to implement and it's very fast. It works well even when the assumption of feature independence doesn't hold, especially with text, and it deals well with datasets that have very large feature spaces. But the naivety of naive Bayes also means it doesn't work well with expressions that combine words into unique meanings. For example, in Google's early days, if you searched for, say, the basketball team Chicago Bulls, you would get results about the city of Chicago and about bulls, not the actual basketball team. So here is the equation, and it seems a little complicated, even to me; I don't fully understand it myself. Hopefully by the end of this talk, we'll all be better able to understand what it all means. The dataset I'll be using for this talk is the YouTube Spam Collection, which is a corpus of text documents obtained from the UCI Machine Learning
Repository. It's a corpus of comments from five of the ten most-viewed videos from around 2011: Psy's Gangnam Style, Eminem's Love The Way You Lie, Shakira's Waka Waka from the 2010 World Cup, LMFAO's Party Rock Anthem, and Katy Perry's Roar. The corpus was provided by Alberto et al., and it can also be found on their website. It consists of about two thousand comments: 1,005 of them are spam and about 951 are ham. For this talk I used Python 3.6.5 and Jupyter Notebook, along with scikit-learn. You can download these with pip or conda, or you could use Anaconda, which has all the packages you would need to get started. You'll need some experience using Python and a little knowledge of machine learning and natural language processing. Now, before I get to the code, I'd like to give an example that will help conceptualize the inner workings of the naive Bayes algorithm. Say, for example, you want to find out whether a comment is spam or ham. How would you do that, conceptually? We'll use this example of five toy comments to build our classifier: two of them are spam and three are ham. Then we introduce a new comment and want to find out whether it's spam or ham. How do we do that? First, we have to give the machine learning algorithm a form of the data that it can understand, and it doesn't understand text; you can't give it text. So we have to encode our text as numbers, which it does understand. To do that, we first tokenize the comments; tokenization means breaking the text into pieces called tokens. Then we change the text to lowercase and remove stop words, which are words that carry little or no information, like 'a', 'the', 'with', things like that. And then we count the words, and those counts are numbers, right?
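Those preprocessing steps, tokenize, lowercase, drop stop words, then count, can be sketched in a few lines of plain Python. The two comments and the tiny stop-word list here are invented for illustration:

```python
from collections import Counter

# Hypothetical toy comments (not from the actual dataset)
comments = ["I love this song", "Check out my channel"]

# A tiny hand-picked stop-word list; scikit-learn ships a much larger one
stop_words = {"i", "this", "the", "a", "my", "out"}

def preprocess(text):
    # Tokenize on whitespace, lowercase, and drop stop words
    return [tok for tok in text.lower().split() if tok not in stop_words]

# Count the remaining words per comment -- these counts are the numbers
# the learning algorithm can work with
counts = [Counter(preprocess(c)) for c in comments]
print(counts)
```

scikit-learn's CountVectorizer, which comes up later in the talk, performs essentially these steps internally.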
So that's something the algorithm can work with, and these counts are collated into what is called a document-term matrix. Now we have numbers, which the algorithm can understand, and we want to find out: 'I love song', is it a spam or ham comment? We use math; we use Bayes' theorem, which I mentioned earlier, to calculate this, and this is the equation for it. Let me quickly explain. We want to find the probability of the comment being spam given that it is 'I love song', and the probability of the comment being ham given that it is 'I love song'. The plan is, once we've calculated these probabilities, we compare them, and whichever is greater, that label is assigned to the comment. First we find the probability of documents in general being spam or ham, and this is simply counting: we can see there are five comments, so we count the number of spam and the number of ham separately, and divide each by the total number of comments. We then find the probability of each individual word in the comment. Remember the assumption of independence that naive Bayes uses to classify the data: we break 'I love song' into individual tokens and calculate the probability of each word given that the label is spam, and given that the label is ham, individually. So you count the number of times 'love' appears in spam and the number of times 'song' appears in spam, divide each by the total number of words in spam altogether, and multiply them; then you do the same for ham. Next, each product of word probabilities is multiplied by the class probability we calculated first, and we compare the two results. We can see that the probability of ham given that the comment is 'I love song' is significantly greater.
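To make that arithmetic concrete, here is a minimal sketch of the calculation in plain Python; the five toy comments are invented stand-ins for the ones on the slide:

```python
from collections import Counter

# Hypothetical toy training comments: two spam, three ham
spam_comments = ["check out my channel", "win money now"]
ham_comments = ["i love this song", "love the video", "great song"]

def tokens(text):
    return text.lower().split()

spam_words = [w for c in spam_comments for w in tokens(c)]
ham_words = [w for c in ham_comments for w in tokens(c)]
spam_counts, ham_counts = Counter(spam_words), Counter(ham_words)

# Priors: fraction of comments in each class
n_docs = len(spam_comments) + len(ham_comments)
p_spam, p_ham = len(spam_comments) / n_docs, len(ham_comments) / n_docs

def score(prior, counts, total, query):
    # P(class) * product of P(word | class), assuming word independence
    s = prior
    for w in tokens(query):
        s *= counts[w] / total
    return s

query = "i love song"
score_spam = score(p_spam, spam_counts, len(spam_words), query)
score_ham = score(p_ham, ham_counts, len(ham_words), query)

# 'i', 'love' and 'song' never occur in the spam comments, so the spam
# score collapses to zero -- exactly the problem that Laplace smoothing,
# discussed later in the talk, is designed to fix
print(score_spam, score_ham)  # ham wins, so the comment is labelled ham
```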
So that's obviously ham, which intuitively you would know just from reading the comment. And now for the code. First, we load the dataset into our environment. The dataset is actually five CSV files, so I had to figure out a way to import all the files at the same time. I used the glob function from the glob module to do that: it selects the files in my directory that match a given pattern. Then I create an empty dataframe and an empty list, and through a loop, for each file in the list, a dataframe is created and appended, building one full dataframe. Then we want to see what the dataframe looks like. There are five columns, and some of them won't be important to us, but let's just have a look at the first few rows, and then we confirm that it's about a 2000-by-5 dataframe. What we need is the text, right? So we use the iloc method to slice out the rows and columns of interest to us, which are the text and the label. You can see that we now have just two columns: the text and the label, the label being the class. There are two classes, where one is spam and zero is ham. We confirm that the dataframe has the dimensions we expect, and then we split our dataset into training and testing sets. For this talk I'm using a 70/30 train-test split, where 70% of the dataset is for training and 30% is for testing, for evaluation. I had to separate the data and the label to be able to pass them through the train_test_split function, and then I confirmed that the proportions are about what I want.
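A sketch of that loading-and-splitting step. Since the actual CSVs can't be reproduced here, this snippet first writes two tiny stand-in files to a temporary directory; the file names and the CONTENT/CLASS column names mirror the real YouTube Spam Collection, but the rows are invented:

```python
import glob
import os
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split

# Create two tiny stand-in CSV files (the real dataset has five)
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"CONTENT": ["i love this song", "check out my channel"],
              "CLASS": [0, 1]}).to_csv(os.path.join(tmpdir, "Youtube01.csv"), index=False)
pd.DataFrame({"CONTENT": ["great video", "win money now"],
              "CLASS": [0, 1]}).to_csv(os.path.join(tmpdir, "Youtube02.csv"), index=False)

# glob selects every file matching the pattern; each file is read into a
# dataframe and the pieces are concatenated into one full dataframe
files = sorted(glob.glob(os.path.join(tmpdir, "*.csv")))
frames = [pd.read_csv(f) for f in files]
df = pd.concat(frames, ignore_index=True)

# Keep only the text and the label, then do a 70/30 train-test split
X, y = df["CONTENT"], df["CLASS"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(df.shape)
```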
I showed you earlier how the text is converted into a document-term matrix, changing it into numbers that the algorithm can understand, so that's what we want to do here: create a document-term matrix. To do that we use the CountVectorizer object. CountVectorizer is what's called a bag-of-words method of feature extraction, and feature extraction is basically creating features that the model can understand. What it does is simply count the number of times each term occurs in a comment or text. For this CountVectorizer object I used the stop_words argument, set to 'english', to remove the stop words, those low-information words we don't need. There are lots of other arguments for CountVectorizer, but I just decided to use this one. Then we use the fit_transform method: the fit part learns the bag-of-words parameters we'll use to create our document-term matrix, such as the vocabulary of terms, and the transform part uses those parameters to transform our text into a document-term matrix. We then transform the test set as well. So now we want to implement naive Bayes. We get our naive Bayes classifier from the naive_bayes module of scikit-learn, fit it with the training set to build our model, and then run predictions and evaluate the model.
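Putting those two steps together, here's a minimal, runnable sketch using a handful of invented comments in place of the real training data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented stand-in training data: 1 = spam, 0 = ham
train_texts = ["win money now", "free prize click here",
               "i love this song", "great video thanks",
               "love the song so much"]
train_labels = [1, 1, 0, 0, 0]

# Bag-of-words document-term matrix, with English stop words removed
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)  # fit + transform training text

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# New comments must be transformed with the SAME fitted vocabulary
X_new = vectorizer.transform(["i love song"])
print(clf.predict(X_new))  # predicts ham (label 0)
```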
There are several ways of evaluating a model, but for simplicity I'll be using accuracy and the confusion matrix. Accuracy is basically the proportion of the data that was correctly classified by the model, and from our results we can see that it's 91.1 percent. The confusion matrix is just a way of illustrating where the model went right and where it went wrong: the model was able to correctly classify 289 spam and 246 ham comments, but misclassified 43 spam and 9 ham comments. Now I want to talk about another form of feature extraction, called term frequency–inverse document frequency, or TF-IDF for short. It starts the same way as the bag-of-words term frequency: it counts the number of times each term occurs in a comment or document. But it also considers the number of documents that the term appears in, so it adjusts the weights: it doesn't just look at a word's frequency within a comment, it also looks at how many documents in the corpus the word appears in. For this I'll be using the TfidfVectorizer object from the feature_extraction.text module. I set the stop_words argument to 'english' again, and I also used the max_df argument, set to 0.7. The max_df argument simply sets a threshold: it removes terms that appear in more than a certain proportion of the documents, in this case 70 percent, so if a word appears in 70 percent or more of the documents, it's considered unimportant and removed. Then I use the fit_transform method to convert my text into a document-term matrix that the model can understand, and I do the same for my test set. Then I initialize a multinomial naive Bayes classifier, fit my model, and run predictions on the test set.
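The TF-IDF variant only changes the vectorizer; everything else stays the same. Again the comments are invented placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win money now", "free prize click here",
               "i love this song", "great video thanks",
               "love the song so much"]
train_labels = [1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# max_df=0.7 drops any term that appears in more than 70% of documents,
# on the assumption that such ubiquitous terms carry little information
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

X_new = vectorizer.transform(["i love song"])
print(clf.predict(X_new))  # predicts ham (label 0)
```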
I can see that it's 91.4 Percent, but that's a very slight improvement and it could be Just due to chance maybe the way I spit my data or or it could be Because of the settings on for the max TF removing those terms Or it could also be because of I choose TF ID TF IDF, but I Guess the best way to do this is to Use the same settings in both situations in count vectorizer and TF IDF and then compare the two properly to get a better comparison and then The Confucian matrix shows that it was able to classify tonic in 9 ham and turn it 48 spam correctly and Mr. Fight 41 spam and 9 ham so we've looked at two possible methods of classifying Implementing like naïve base in like it learn count vectorizer and TF IDF vectorizer but How well did this model do and can it be improved so I'm one way of improving your model is by tuning it and Tuning is basic and basically like adjusting The dial of a radio set to check for your favorite station something like that and In naive base one way of tuning is called Laplace smoothing so Laplace moving me basically Make sure that the probabilities of Terms in your in your text are not zero because when you multiply Everything becomes zero basically So in order to mitigate that it adds a constant called alpha To make sure that the probability is not zero so to implement this in Python I Set a range of alphas from zero to one and with increments of zero point one and then I Created a function train and predict With alpha as a as the argument and then in the function I Fit my data and data to the model and run some predictions and calculate the accuracy and I run it through a loop with those alphas like I created earlier earlier the offer ranges I created earlier and Print I print it out And we can see from the results that There alpha 0.3 and offer 0.7 are the highest and That's that's okay. 
That seems Like a reasonable boy still a bit small so so I I Talked about the naive base Algorithm and why such a good algorithm for working with text and also how to implement it in Python Hopefully You were able to understand what I was saying all this time and you can better appreciate This knife classifier. Thank you Okay, thank you very much. We have five minutes times for question You look put up all the characters lower case if you are analyzing something where you wanted to know what were the emotions and People generally express Anger or something by all caps. How would that be affected by? Like how would your analysis be affected and it was a little off point, but I'm just curious how you handle that That's a good question and maybe I would have to find out myself But that's a good question so any other questions so then maybe I Thank you Thank you for the talk In in your examples, you get probabilities of 91 percent So for anyone who would want to try to write something like this 91 person doesn't seem very accurate What what are we what are people supposed to to expect is it's actually am I wrong is 91 percent very good or How do you how do you know what to expect and to know if your filter if your algorithm is working? Well or not well it depends Like it's really subjective really like it depends on It's subjective really like 91% could be good to some people it could be But you could there's still room for improvements like there's something called Limitization which could make the model even better. I didn't do that here it's basically and Making those words that are similar like maybe go go in All those was a pretty much the same thing go go in went You could group all those was into one word So that would have probably have improved the model, but it's all subjective really like 91 percent could Might probably be the best or Not it's it depends on you What's you what do you want to do with the model basically? Okay, thank you. 
I don't know if that helps.

Q: One comment on that: an accuracy of 91 percent is very good. I wouldn't trust it if it were 99 percent. From the scientific point of view, I think 91 percent is very good.

Host: Okay, so if there are no more questions, give a big hand to Obiyama again.
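As a footnote to the lemmatization point raised in the Q&A, here is a toy sketch of the effect. Real systems use a dictionary-backed tool such as NLTK's WordNetLemmatizer; the small lookup table below is invented purely for illustration:

```python
# A toy lemma table -- real lemmatizers cover the whole language
LEMMAS = {"going": "go", "goes": "go", "went": "go",
          "loves": "love", "loved": "love", "loving": "love"}

def lemmatize(token):
    # Fall back to the token itself when the table has no entry for it
    return LEMMAS.get(token, token)

tokens = ["go", "going", "went", "loves", "song"]
print([lemmatize(t) for t in tokens])  # ['go', 'go', 'go', 'love', 'song']
```

Grouping inflected forms like this shrinks the feature space, so the classifier pools evidence across variants of the same word.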