Good afternoon, everyone. Welcome to this afternoon's session. I'd like to introduce Lorena Mesa. She's a platform engineer at Sprout Social in Chicago, she's a Star Trek fan, and she's going to talk to us about spam and natural language processing.

So, real fun fact, I have a loud voice. Is this too strong, or should I be a little quieter? Louder? Oh, this is great. I can be loud. Alright, thank you so much for joining me tonight. This afternoon, I should say. The name of this talk is "Is That Spam in My Ham? A Novice's Inquiry into Classification." As my announcer already said, my name is Lorena Mesa, and as you can see, I'm a huge Star Trek fan, so live long and prosper. Apart from that, I'm here from Chicago.

A little bit about me and why I wanted to chat on this topic. I'm actually a career changer. A few years ago I came from being a data analyst in the social science space. Specifically, I worked at Obama for America doing data governance, and then I switched into doing software engineering about three years ago. Some big questions that were driving me at the time are captured in this talk. Some other things I do: I love Django Girls. I helped with the workshop yesterday. It's a glorious, glorious thing. If you have the opportunity to mentor, please do. If you would like to sign up for another one, please do that as well. PyLadies Chicago is a group that I founded in Chicago, and I was recently voted onto the board of directors for the Python Software Foundation, which is very exciting.

So I'm going to chat a little bit about this great experience that we've all had before. I think we've all had some email at some point that flutters into our inbox with language like "de-junk and speed up your slow PC." And of course, we would trust an email that comes from aol_member_info at emailz, yes, with a Z on it, dot aol.com. And of course, I'm going to trust anything that tells me this is free, this is great, you really should do it. So I think when we see emails like this, we know visually, just by looking, that it's a piece of spam. We know it's junk. We don't care about it. We ignore it. So how do we move from saying "I know it when I see it" to saying "I can programmatically detect what a piece of spam is by using Python"?

In today's chat, we're going to be thinking about three questions. One, what is machine learning? Two, how is classification a part of this world? And three, how can I use Python to solve a classification problem like spam detection? This chat is going to be really focused on a beginner's understanding of machine learning. If you are looking for more intermediate and advanced talks, this is definitely a great conference to check some of those out. But we're going to be taking this from the lens of a beginner.

So, machine learning. If you were to follow the emojis on the left-hand side, the top left would be me. Confused. Not sure what machine learning is. I'm like, is it a robot? Is it Johnny 5? Johnny 5 being a hero from a children's movie I loved when I was a little kid, who's super quirky, can arch their eyebrows, and comes to save the day. Well, I don't really think machine learning is Johnny 5. So let's think a little bit about what machine learning is. One of the things I like to do when I begin working in a new problem space is to find some language to orient myself, to understand what types of problems I will be solving.
If I were to look around for some language defining machine learning, I might find something like this: discussions saying that there's pattern recognition, computational learning, artificial intelligence. What's going on? I don't know what that is. But there is a part of this that does make sense to me: the study of algorithms that can learn and make predictions on data. I like data. I like algorithms. Tell me more.

So I think a better way we can think about machine learning is to borrow some language from Tom Mitchell, the chair of the machine learning department at Carnegie Mellon. He wrote Machine Learning, which is kind of the quintessential text for folks who want to start learning about machine learning. And he says we can think about machine learning in three parts. We can say a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Okay, so we have a task, we have experience, we have a performance measurement. I can do this. This makes sense to me.

So when I think about experience: how do I know what I know? Well, I'm a human, and the way I know what I know comes from my memory. I have memories stored up that teach me things about what I like, what I don't like, what I should do, what I shouldn't do. Maybe as a kid, and I was a very hyperactive child, I would be running around like a maniac all the time because I had to be super fast. But what happens when you run around as a little kid, still growing into your body? You might be klutzy. You might fall and skin your knee. How many times do you have to skin your knees and elbows? For me, it probably took quite some time to learn I shouldn't run around like a maniac; I should walk around like a normal person so I don't hurt myself. That pain was a teaching experience for me. Likewise, when my grandmother was in the kitchen making tamales, because I love tamales, I was always trying to stick my hand near the stove, and more than once I definitely burned my hand. The idea of putting your hand on red-hot coils: not very smart. So over time I learned to recognize that as a sign. I shouldn't do that. So when we think of experience as humans, we may think of our memories.

What does that mean in different problem spaces? If I were to ask the question, what is the historical experience of the stock market? Well, if I want to understand what a stock has done historically, I might go look at what the records tell me about the price of that stock two years ago on July 17th, one year ago on July 17th. And depending on how far back I want to do some analysis, I have historical data that can tell me something about the historical performance of that stock. So we have human memories, and in other spaces we can go to historical data that can teach us something.

So coming to machine learning and classification, what does experience actually mean? Let's frame this in Mitchell's framework. Our first problem is going to be identifying a task. For us, we want to classify a piece of data. So our question is: is an email spam or ham? The idea here of ham is just anything that's not spam. It's cute. It rhymes. So spam or ham, that's our task. Our experience: we're going to have a set of labeled training data. Essentially, what does that mean?
We have a collection of emails, and we have a label saying each email is either ham or spam. So we have a collection of emails that we already know is one thing or the other. And then our performance measurement is: is the label correct? So what we need to do is be able to verify whether emails are indeed spam or ham.

So, thinking about a classifier that we can use, we can think of naive Bayes. Naive Bayes is a type of probabilistic classifier. I love this image because I really want to know who's the person that has a neon light of Bayes' theorem in their office or in their front window. I don't know who that person is, but I applaud you. You are really great. So naive Bayes comes to us from stats theory. It's based on Bayes' theorem, no surprise. One of the key things with Bayes' theorem, when we talk about the likelihood of events, is that we treat events as independent of one another. That's where the naive assumption comes from when we say we're going to be using a naive Bayes classifier.

For those of us who may not remember exactly what it means when we talk about independent and dependent events, let's have a quick refresher. If I was going to ask you what's the probability of flipping a quarter six times in a row and getting heads, how would you go about solving that problem? Well, let's think about it. On the first flip, I have two outcomes: heads or tails. So the likelihood of getting heads is going to be 0.5. The second time I flip it, 0.5. Third time and so forth, 0.5. Because each flip is independent, we just multiply: 0.5 six times over, 0.5 to the sixth power, or about 1.6%. So the likelihood of flipping a quarter and receiving multiple heads in a row is built from flips that are independent of one another. When we talk about independent events, that's how we think about the outcomes.

In contrast, for dependent events, let's say we're talking about horse number five on, I guess, your right-hand side. If my question was, what's the likelihood that horse number five is going to win the big derby, one of the things I would say is, well, we need to think about the weather conditions. Is it rainy? Is it sunny? Perhaps we want to think about the age of the horse, the health of the horse. There can be other things that are tied up in the likelihood of horse number five winning. So in this context, the probability of horse number five winning is going to be dependent on other things, for example, the weather.

So when we talk about naive Bayes, our assumption is that we have independent events. And when we talk about emails, we're really going to be thinking about the words that make up the emails. So let's think about these words. If I was going to ask, what's the likelihood of the word Messi appearing with the word Barcelona, we're going to assume that there's no relationship. That's what naive Bayes tells us to do, even though in our heads we might think there's a relationship. Or, back to some really spammy language we love, what's the relationship between "buy" and "now"? We're going to assume that there is no relationship, that the likelihood of "buy" appearing is not going to impact the likelihood of "now" appearing in the corpus of words for an email.

So, naive Bayes and spam classifiers. Again, our question is: what is the probability of an email being ham or spam? With Bayes' theorem, here in the middle of the slide, we've got three things we need to think about. One, the likelihood of the predictors given the class. Two, the prior probability of the class. And three, the prior probability of the predictor.
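For reference, the formula on that slide is presumably the standard statement of Bayes' theorem, where c is a class (ham or spam) and x is the set of predictors (the words):

```latex
P(c \mid x) = \frac{P(x \mid c) \, P(c)}{P(x)}
```

Reading off the three pieces: P(x | c) is the likelihood of the predictors given the class, P(c) is the prior probability of the class, and P(x) is the prior probability of the predictor.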
All of these together will help us compute the a posteriori probability of a class. When I say class, our classes here are ham and spam. Those are the only two classes we have. Our predictors are going to be the words in the email itself. So for example, if I'm looking at a piece of content and I ask, okay, what's the likelihood of a predictor being in the spam or ham class? If I'm looking at the word "free," we can think of it as: well, 28 out of 50 spam emails have the word "free." We will do this for each word in our email, find the likelihoods of all the predictors, and multiply them together. We also then need to consider the prior probability of the class. Given the entire collection of data we're looking at, how many emails are of one class and how many of another? So for spam, if we have 150 emails we're working with, we can say 50 of those documents are spam, so 50 out of 150. And then the prior probability of the predictor: here we're asking, well, how many times has the word "free" appeared across all of our emails? Let's say it's 72 out of 150, and there you go.

So Bayes' theorem is basically frequency tables. How many times has this word appeared? How many times has it appeared in the class? How many times has this class appeared in the collection of things we're looking at? Great. We've made some calculations. We've found some values between zero and one. How do we know which one to pick? Pretty easy: whichever one has the higher, the maximum, a posteriori probability. The reason we say a posteriori here is that we're not looking at anything new; we're looking at historical data, things that have already happened. Once we've made a calculation for class ham and for class spam, we simply pick the larger of the two, and we say this email is going to be either ham or spam. Pretty simple.

So why naive Bayes? Well, I think just walking through this, we can arrive at an answer. It's pretty straightforward. It's as simple as frequency tables. I think we can all do this together. It may seem a little daunting at first, but once you start seeing the application of it, you can see that it's pretty straightforward. So if you are starting to think about classifiers and problems you want to look at, I would say this is a great one to start with. The math is accessible. And while you can use other algorithms, and we will talk about some of the limitations in a moment, this is a good one to start with.
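To make those numbers concrete, here is a tiny sketch of that frequency-table arithmetic in Python. The talk only gives the spam-side counts, so the ham count of 44 below is just the remainder (72 appearances of "free" overall, minus the 28 in spam); the whole thing is illustrative rather than her actual code:

```python
from fractions import Fraction

# Frequency tables from the worked example: 150 training emails,
# 50 spam and (by subtraction) 100 ham. "free" shows up in 28 of the
# spam emails and in 72 of all 150, so in 44 of the ham ones.
n_spam, n_ham, n_total = 50, 100, 150
free_in_spam, free_in_ham, free_in_all = 28, 44, 72

# Likelihood of the predictor given the class, times the class prior.
spam_score = Fraction(free_in_spam, n_spam) * Fraction(n_spam, n_total)
ham_score = Fraction(free_in_ham, n_ham) * Fraction(n_ham, n_total)

# Dividing by the predictor prior P(free) gives the a posteriori
# probabilities; it's the same for both classes, so when you only
# need to pick a label you can skip it.
p_free = Fraction(free_in_all, n_total)
print(float(spam_score / p_free))  # P(spam | free) ~= 0.39
print(float(ham_score / p_free))   # P(ham  | free) ~= 0.61

# Pick the class with the higher (maximum) a posteriori probability.
label = "spam" if spam_score > ham_score else "ham"
print(label)                       # "ham" for this one-word email
```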
That's great, but how do I use Python to detect spam? Okay. Well, I cheated a little bit. I didn't do all my own data collection and munging and cleaning. As fun as that is, I instead went and found a data source out there that was already cleaned and labeled for me. And where did I get it? I got it from Kaggle in the Classroom. Kaggle is a website that hosts competitions; the classroom component is more of their teaching problems. They have open competition problems as well. But I loved that my data was cleaned and labeled and I could get right to work building a thing. So in our example here, our training data has 2,500 emails, 1,721 of which are labeled as ham, with a one, and the balance labeled as spam, with a zero. The labels themselves are just in a CSV: we have an ID and we have the prediction, zero or one. Pretty straightforward. And, that slide is a little grainy, I apologize, but the emails themselves are collections of text with some HTML in them.

So what are we going to use when we write our very, very simplistic naive Bayes spam classifier? We're going to use these three things. We're going to use the email module, which is going to parse our emails into Message objects. We're going to use lxml, because as I said, those emails have some HTML embedded in them, and right now all I care about is the words themselves, so I want to strip that stuff out. And then we'll use NLTK, the Natural Language Toolkit, which is going to help us filter out stop words.

So let's go ahead and get to it and train the spam filter. When I say train, we're going to go through these steps. The first thing we're going to do is tokenize the text; we'll explain that in just a moment. One thing I do want to say: when we look at the whole corpus of words in an email, I am not treating words like "shop" and "shopping" as the same word. You can actually do that; it's called stemming. That would be a bonus feature, and I encourage you to go try it on your own. I didn't do it for this example. So we're going to tokenize our words, which we do for each email that we process. We then want to keep track of the unique words that we see across all the documents we process; this will come into play to help us with zero word frequencies. We are then going to increment the word frequency for each category, our categories here being ham and spam. We're going to increment the category count, which again feeds that prior probability of the classes that we need to take into account. We're also going to keep track of how many words are in each category. And it's good to know how many training examples we've actually processed, so that's the last step.

So training pretty much starts with this: tokenizing text into a bag of words. That's what it is, a bag of words. Essentially, this is very simplistic; I've trimmed it down a little. What we want to do is pull out the words. This is after we've already removed the embedded HTML. And we say, hey, for each word in our text, let's lowercase the word. We check that it's a word, because why not? And we say, as long as this word isn't in the corpus of stop words for the English language, let's keep it. Stop words are words like "the" and "or," words that may appear often but may not provide us a lot of value in deciding whether this thing is spam or not. You can get that list from NLTK; I'm glad I didn't have to compile it myself. We do this for each email, and now we have a bag of words.

So remember that zero word frequency thing I was talking about? Well, let's think about this. I've done my training, and I have a new email that I'm trying to classify. In this email, I have the word "free," but, problem: I've never historically seen the word "free" in the spam collection of emails I've looked at. So what's going to happen when I calculate the likelihood of all my predictors? I'm going to get zero. To offset that, we can add a small constant, which is what Laplace smoothing permits us to do, and that small offset keeps one unseen word from throwing our math out the window.
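Her actual code isn't reproduced here, but a minimal sketch of that tokenize step might look like the following. The function name, and the assumption that the HTML-laden body has already been pulled out of the email Message object, are mine, not hers:

```python
import string

from lxml import html
from nltk.corpus import stopwords  # needs a one-time nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def tokenize(raw_body):
    """Turn one email body (which may contain HTML) into a bag of words."""
    # Strip out the embedded HTML; we only care about the words themselves.
    text = html.fromstring(raw_body).text_content()
    words = []
    for word in text.split():
        word = word.lower().strip(string.punctuation)
        # Keep it only if it's an actual word and not an English stop word.
        if word.isalpha() and word not in STOP_WORDS:
            words.append(word)
    return words
```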
So let's talk about classifying. All right, this is a giant wall of text, but I just wanted to point out that it's quite literally iteration, counting, and dictionaries. That's all this is. There is no black-box magic here. Essentially, in the classify step, for each category we create this a posteriori probability: we find the probability of all the predictors, multiply that by the prior probability of the class itself, and pick the one with the higher value; that's what we classify the email as. Not very magical. In the predictors' probability, if we see something we haven't seen before, we go ahead and add a value of one to it.

And this point right here about floating point underflow: when you are doing computations where you really care about very precise decimal points, you're going to need to use specific objects. You could use logs instead, but in this case I used Decimal objects. There is a note here, which you probably can't read, I will share these slides, which comes from the Stanford natural language processing materials about how to handle doing this floating point computation. And they said use decimals, so that's what I went with.

Okay, performance measurement. I've classified. I've picked a thing. How do I know how well I did? So I go ahead, my detector says let's train and evaluate, and what I eventually come out with is 223 classified correctly and 27 incorrectly. My performance measurement is about 89%. As a small footnote, the idea of about 90% accuracy is, I believe, a benchmark. We obviously can do better here, and we'll talk about what doing better can mean in a moment.

On the idea of how to split up our training data: let's do a 90-10 split. It's pretty much what I've seen as a standard. I'm sure in different problem spaces you might want to chunk things up differently, but I went with a 90-10 split. Essentially, all I did was say, hey, on 90% of my data, let's go ahead and classify, let's go ahead and train, that is. And then on the other 10%, we're going to classify. And how do we know if a thing is correct or incorrect? Well, whatever label we ultimately assigned it, check that labels.csv, see if it's correct or incorrect, and it's basically straight math. That's how we got the 89%.
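Again, the wall of text on the slide isn't reproduced here, but a minimal sketch of a classify step like the one she describes, with Laplace smoothing and Decimal objects to dodge floating point underflow, could look like this. The parameter names and data layout are my assumptions:

```python
from decimal import Decimal

def classify(words, word_freqs, category_counts, word_totals, vocabulary):
    """Return "ham" or "spam", whichever has the higher a posteriori score.

    word_freqs[category][word] -- times `word` was seen in that category
    category_counts[category]  -- number of training emails per category
    word_totals[category]      -- total words seen per category
    vocabulary                 -- set of every unique word seen in training
    """
    total_docs = sum(category_counts.values())
    scores = {}
    for category in category_counts:
        # Start from the prior probability of the class.
        score = Decimal(category_counts[category]) / Decimal(total_docs)
        for word in words:
            # Laplace smoothing: the +1 keeps a never-before-seen word
            # from zeroing out the whole product.
            count = word_freqs[category].get(word, 0) + 1
            denominator = word_totals[category] + len(vocabulary)
            # Decimal keeps this long product of tiny fractions from
            # underflowing to zero the way a float would.
            score *= Decimal(count) / Decimal(denominator)
        scores[category] = score
    return max(scores, key=scores.get)
```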
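And the 90-10 train-and-evaluate loop she describes might be sketched like this, where train and classify are hypothetical callables standing in for the steps above:

```python
import random

def evaluate(labeled_emails, train, classify):
    """Train on 90% of the labeled emails, then score the held-out 10%."""
    random.shuffle(labeled_emails)          # list of (bag_of_words, label)
    cutoff = int(len(labeled_emails) * 0.9)
    training, holdout = labeled_emails[:cutoff], labeled_emails[cutoff:]

    model = train(training)                 # build the frequency tables

    correct = sum(1 for words, label in holdout
                  if classify(model, words) == label)
    # With 2,500 emails the holdout is 250; 223 correct gives ~89%.
    return correct / len(holdout)
```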
So, some things to watch out for: false positives. This one's fun. Google does things really well, right? They do really well with spam filtering, but even they can have some flaws. I actually do like to sign up for Patagonia emails, and this email was actually flagged as spam. Basically, a false positive is when something is incorrectly identified. So you can run into this. And we can ask, when something is incorrect, what's the problem? Is it that our implementation, because we're talking about naive Bayes, is too naive? One way we can correct this is to tell Google, hey, this is actually not spam. I can validate the data and send it to them, and they can feed it into their implementation and try to correct for it in the future. So false positives are a thing to watch out for.

And there are some limitations and challenges with naive Bayes. Obviously this independence assumption is very, very simplistic. If I get a marketing email about Barcelona and they aren't talking about Messi, I'm going to be very confused. Granted, there has been some talk about him being traded, so we shall see. But obviously this independence assumption is quite simplistic. It is not the way things work in the real world. What are the side effects of that? Well, one is that we're going to overestimate the probability of the label ultimately selected, meaning we create harder binaries: we say it's either more to the left or more to the right in how it aligns with the category label.

And also, remember how I said I cheated and didn't label my own data? Well, here's the other thing: human error. This type of algorithm, a classifier, is called supervised learning. Supervised learning requires historical, labeled sets of data to learn from in order to make predictions, and that data process can be prone to human error. What happens if, let's say, I'm a professor making use of all my student lackeys, and some of them have been up all night, and ten of them looked at the same email and all came up with different labels for it, but it's in my training set? That's going to be very inconsistent. So I need to think about that as well: how is the labeling of the data happening? As much as we don't like to think about data munging, data cleaning, and data collection, that's actually a really important part of the process when working with supervised machine learning problems.

So how can we improve our performance? Well, we can do more and better feature extraction, because while I would like to say that emails can be identified only by the words in them, we know that's not true. Predicting the sentiment of emails is very complicated, very difficult; natural language processing is a huge field. I'm not getting into that myself, but we need to think about other ways we can identify spam. So what are some things? Perhaps the subject: is there something weird in the subject I can pay attention to? What about images: is there an abundance of images in spammy emails? Or maybe there are none, I don't know. How about the sender? Remember that really cool email address with the Z in it? Because clearly I would trust AOL emails, whatever that was. Then again, I don't trust most AOL stuff. Some other possible features we could consider: capitalization, irregular punctuation, things like that. Ultimately, we also want more data.

So do you, like Data on Star Trek, want more? Want to learn more? Go to Kaggle. They're super sweet. I would also highly recommend Sarah Guido's Introduction to Machine Learning with Python. She's a great data scientist at Bitly, and I've heard great things about the book. And also your local friendly Python user group. We love talking. We love learning together. Talk to people here. There's a great talk after this one, talking more about machine learning. Stay for it.

So, thanks. If anything, I hope what you may have learned is that correlation may be causation, or causation may be correlation, I don't know. We can implement a thing, but the question then becomes: how do we interpret those results? And that's where I challenge you to go ahead and try some things out. Thank you so much, y'all. Any questions? I did such a great job, no one has any questions. Do you have questions? I'll be hanging out in this area out here for a few minutes, but like I said, I do want to hear the next talk, so I'll be around. My name is Lorena. Please reach out and say hi. It's a pleasure to be here. Thank you so much for listening.