Thanks for being here with me. This is a very intro-level talk, because I'm going to spend this time being really excited about how easy it is to do machine learning these days and convince you that you can do it. Which also means that if you've ever made a model with scikit-learn, you should probably just leave now, because I'm going to bore you. It's okay. I won't feel bad. It's fine. You can just leave. It's all right. Please? All right. This is fun stuff. It's very exciting. It's going to be good. I hope you know a little bit of Python, or at least some pseudocode, so this is not going to be totally Greek. We're going to start out with what even is machine learning, we're going to take a look at what's available in scikit-learn, and then we're going to do an example with real data. Real data, friends. I hate the term machine learning. I have a statistics background; this should all just be operations research, I don't know. I found this lovely illustration of machine tailoring, and I feel like that's a better example of what's actually happening. You tell the computer: you're good at looking at lots and lots of options, and you can do it fast. I want something in this kind of shape. Here's some data. Make it fit. Make the sleeves longer, change the hem, do whatever you need to do to make something in this shape fit this data that I've given you. It's not learning, but it's making the best of the shape you've told it with the data that you've given it. Does that seem fair? And scikit-learn helps us out so much, and they have a ton of examples for us. This top row is kind of the top level: what are the types of machine learning? Classification is anything where you're going to get back out a categorical variable. Whether you're gendering chickens or trying to guess people's favorite colors, you're going to be doing some kind of classification problem. Regression is anything that gives back out a continuous variable.
You're trying to guess people's salaries, or their IQ, or their SAT scores. Clustering is a little different, because all you tell it is: I think I have some number of groups, like four. How about four? See what you can get. Give me some groups back. And, given whatever shape of algorithm you told it to use, it finds the best groups in that data. These other three things are that 80% of the work of data science, where you're actually pre-processing data, or figuring out which of these models is the best model, or changing those variables, or picking the variables that go into the machine learning models. But the important thing for getting started is that there are a whole bunch of really cool examples up on scikit-learn for those three main machine learning problems you might want to do. We're going to look at classification, because I like it, and it's simple. And I stole this straight off the scikit-learn example page, so you too can look at it. There are tiny little numbers in the corner. It's trying to show you the difference between these algorithms and how they behave on different data sets. The top data set has a lump of blue up in the middle towards the red, the middle one has blue in the center, and the one at the bottom has a nice split. And so you can see how different algorithms treat those differently. The shaded areas are for probabilistic algorithms, and they're less sure in the shaded areas. So if you already know something about the shape of your data, if you have some information about what you're dealing with, you can already make an informed guess about what classifier might be good. But we're using the same library to fit any of these. So you can get set up and do your example with one of them, and if it's terrible, it's not that hard to follow the scikit-learn example for a different one, because they all take very similarly shaped data. So you don't have to choose the right one right away. It's okay. We're going to play.
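Since the point is that every scikit-learn classifier shares the same interface, here's a minimal sketch of swapping algorithms, on a synthetic data set (nothing here is from the talk's real data):

```python
# A synthetic two-class data set, and two classifiers tried through the
# exact same fit/score interface.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trying a different algorithm is a one-line change because every
# scikit-learn classifier exposes the same fit/score methods.
scores = {}
for clf in (DecisionTreeClassifier(max_depth=5, random_state=0),
            KNeighborsClassifier(n_neighbors=3)):
    clf.fit(X_train, y_train)
    scores[type(clf).__name__] = clf.score(X_test, y_test)
print(scores)
```

So a bad first pick costs you almost nothing; you just drop a different estimator into the loop.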
We're going to learn. We're going to do decision trees. I really like decision trees because it's easy to explain what they're doing. It's not super easy to explain how they're doing it. But all they do is say: I have some data. How can I best split this data so that my next guess is the best possible guess I can make? Then they just go until they hit whatever stopping criteria. The nice thing about that is: your model is going to be wrong, and you're going to misclassify something. With a decision tree, you're able to figure out where it ended up. You can follow the path of how that observation got classified as favorite color blue when clearly it should have been favorite color pink. Then you can see which variables went into that decision and decide whether there are more things you want to add, or things you maybe should have taken out because they were misleading. This is a very simple example that, again, I took straight from the scikit-learn webpage. I'm here to evangelize the scikit-learn package to you and tell you that it is wonderful and it is there for you. I work for a company called Big Cartel. We're 35 people. We make shops online. We especially like to support artists and makers. We have about 90,000 active shops right now, and I've been their sole data person for the last year. It's a lot of fun. We're going to pull some data and look at that. I'm going to go to my Jupyter Notebook. Don't worry: the slides, the Jupyter Notebook, and that big ugly tree graph are all up on my GitHub. You can find them later. I have a bunch of URLs in here because I want you to know that I don't know these things. I look them up. You will look them up. It's okay. We're following examples. It's great. That's what the web is for. We're specifically following this decision tree classification example, just trying to do exactly what they did. This is the one that generates that graph that we just saw.
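To make the "easy to explain what they're doing" point concrete, here's a small sketch, on scikit-learn's built-in iris data rather than the shop data, of printing the exact splits a fitted tree made:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders every split as a readable rule, so you can trace
# exactly how any observation ended up in its class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

That traceability is what lets you follow a misclassified observation back through the decisions that put it in the wrong bucket.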
We're going to import NumPy so that scikit-learn can do things with numbers. That's always good. We're importing matplotlib just for the heck of it; we're not actually going to use it. Sorry. We're going to import pandas, which is a data frame library for Python. It's big and it's complex. I don't really know how to use it, but I'm going to show you a couple of cool things that it does for us here that I had to Google to figure out how to do. It's okay. It makes it a little easy to do some things, and probably makes it easy to do a lot of things; I just don't know all of them yet. Luckily, I have a lot of data that we've already gathered in a wide format, because while we have our normal database that has everything segregated out, we also have a tool that we use for support. When people want to ask questions and find out more about their shop and make sure things are working right, a whole bunch of data about their shop is already available in that tool for our support professionals to use. I just downloaded a bunch of data out of there. I got a whole bunch of data that we know is very relevant about the shops that we know we're interested in, and I was able to just get it in one place. I cheated. I recommend cheating. I'm not going to run these, but it is a lot. I did run it recently, so we can inspect these later in the Q&A. I grabbed a whole bunch of recent people, things we've seen recently on the website. There are a whole lot of columns. If they're capitalized in fancy format, they probably came out of the support tool. If they're lowercase with underscores, they're probably things we put in there. It has all sorts of information: how many products do they have, how many times did they log in, when was the thing created, where did we find them in our system, whether they have a custom domain. All sorts of information that is useful to support people and might be useful for building a model. At least it gives us a place to start.
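The imports just described, plus a stand-in data load — the columns here are invented placeholders, not Big Cartel's real support-tool export:

```python
import numpy as np
import matplotlib.pyplot as plt  # imported, but, as admitted, unused
import pandas as pd

# The real talk loads a support-tool export; a tiny inline frame stands
# in for it here so the sketch is runnable. Column names are made up.
df = pd.DataFrame({
    "products_count": [3, 0, 12],
    "logins_count": [5, 1, 40],
    "SignupDate": ["2017-01-02", "2017-03-04", "2017-05-06"],
})
print(df.shape)
```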
This is a pretty big data frame, a little over 100,000 observations. So we can just go willy-nilly; we're like, yeah, we've got some stuff. It's fine. We don't have to worry about whether we have enough data. Normally, you would like to have some data, probably more than 100 things. This is just an example, so I just picked a random categorical variable that had some variation in the data set. We're going to pick some variables and see if we can make a decision tree that predicts whether or not a particular customer is using discounts, whether they've created any discount codes. I create my y column, which I probably didn't need to do. I probably could have just set it straight from using discounts, but I felt better knowing it was explicitly zeros and ones. You can see that not all of using discounts is filled in; the count is lower than the shape of the thing. So by explicitly making them one-zero, I get the zeros for the undefined ones. We just pick a couple of fun variables that I thought might be interesting. I've got a mix of categorical and continuous variables. Product count and login count are going to be between zero and a whole bunch. Browser has a much more limited set of values that you're going to see. Theme is which of our themes they've chosen to make their website look pretty; plan is how much they're paying us. We're going to build up our X matrix. I hope matrices are not scary; I'm sorry, I have a math background, they're fun to me. We know y: do they use discounts or not? We're trying to build the set of things that we want to predict y with. We're going to start with those continuous ones. We get product count in there, login count in there. We can't just put the rest of them in as-is, though: it turns out scikit-learn requires your classifier's categorical inputs to be zero-one columns. This is when we get to use some really fun stuff out of pandas to help us out and be lazy. Pandas has this get_dummies function, and it will take a column and make a set of 0/1 columns out of its values.
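A sketch of the y-column step described above, assuming a hypothetical using_discounts column with gaps in it:

```python
import numpy as np
import pandas as pd

# Stand-in for the support export: some shops have a discount-codes
# value, others were never filled in (NaN).
df = pd.DataFrame({"using_discounts": [1.0, np.nan, 1.0, np.nan, 1.0]})

# Explicitly making the target zeros and ones turns the undefined rows
# into 0s, which is what the count-vs-shape mismatch was hinting at.
df["y"] = df["using_discounts"].notnull().astype(int)
print(df["y"].tolist())
```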
So when we look at, say, this ninth observation, its theme is Sidecar. get_dummies is going to create all the theme variables, and that observation will have a one under Sidecar and a zero under all the other theme variables, and they all become columns of zeros and ones, which makes machine learning algorithms much happier. And we do that for all of our categorical variables, which I didn't know was a thing until I tried to set this up for y'all. Then we concatenate them all together and jam them on the side of our X matrix. I didn't know how to do that in pandas until I Googled it for y'all. Just know it's possible. And then, because our data set comes from a support tool, some of those people are just leads, sometimes things aren't filled in, things aren't synced up; there are some things that aren't filled out in there. So we're just filling them in. Pandas, please fill all the things with zero if they're not defined. Please, thank you. That makes the algorithm happier, and it obliges. So now that we've built up our X matrix, we ask it again what it has, and you can see that the themes are now pulled out as different columns, and you can see the time zones. So we have a whole lot more columns now: we have 266 columns that we're going to try to use to predict whether these people use discounts or not. Well, cool. That's all we need. We have an X and a y. Let's fit a decision tree, y'all. So we fit one. Cool. And then we say: hey tree, how'd you do? And it's like, I'm 99% correct. That's pretty cool, right? Pretty cool. Well, let's ask the tree some other stuff. Tree, how many splits did you do? It went 56 levels deep. Lots of splits. Really good. I found all the things. That's kind of a lot; it gets really hard to deal with, to look through. So, picking a smaller but not too small number out of a hat, I picked 20, and we just try again. And these are just things you can tell it. They're good. And it's Python.
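Putting the get_dummies / concat / fillna / fit steps together, sketched on invented shop-like data (all column names, theme names, and values are stand-ins, not the real 266-column matrix):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "products_count": rng.integers(0, 50, n),
    "logins_count": rng.integers(0, 100, n),
    "theme": rng.choice(["sidecar", "grid", "sunset"], n),
    "plan": rng.choice(["gold", "platinum"], n),
})
y = (df["products_count"] > 20).astype(int)  # stand-in 0/1 target

# One 0/1 column per categorical value, jammed onto the side of the
# continuous columns, then undefined values filled with zero.
X = df[["products_count", "logins_count"]]
X = pd.concat([X, pd.get_dummies(df["theme"]), pd.get_dummies(df["plan"])],
              axis=1)
X = X.fillna(0)

tree = DecisionTreeClassifier()
tree.fit(X, y)
print(tree.score(X, y))       # hey tree, how'd you do?
print(tree.tree_.max_depth)   # tree, how deep did you go?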
It's documented. These are just things it tells you; the docs tell you what the possible values are, and you can just try some of them. So we're going to change the max depth to 20. We're going to say: 20 splits is enough. You're overfitting a little, man. When we do that with our same data set, it only fits 91% of them correctly. So that made a big difference in how much it was able to explain. But actually, we're kind of cheating, because we're asking it how well it did on fitting and predicting the data that it just fit. It knows what that data is. It just tailored itself to that data. So what we ought to do, what we really ought to do, is split our data set into training and testing. I don't know why I picked 30,000; it seemed big, seemed good. So we split them up, and now we make our next tree, and we tell it to make the model based on just the training data set. And it's like, yeah, I did good on the training data set; I got 94% of those right. Okay, but how does that generalize to the other ones? Like 82%? It tried. And that tells us something. That tells us that these are probably not very good variables for predicting this. That's cool. Maybe we go back and choose some different variables to put in the model. Not everything you do has to be a model that fits 99%. You will still learn something by finding a model that fits badly. So we're going to look at another couple of cool things. Following that same exact example that gave us the graph before, we're going to make a graph just like theirs. The tree library has this ability to make Graphviz files, which are plain text files. I put it on the GitHub repo, and then you run a magic command and you get this horrible, horrible PDF that I'm sorry for. That's our tree. That's a max depth of 20. Isn't it beautiful? And we tried to make it so much nicer! At 56 levels it would have been so many more.
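The train/test split and depth experiment, sketched on synthetic data since the shop data isn't public; the exact 94%/82% numbers won't reproduce, but the train-versus-test gap will:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data: 10 features, one of which actually matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deep tree tailors itself to the training data, so the training
# score flatters it; the held-out score is the honest one.
tree = DecisionTreeClassifier(max_depth=20, random_state=0)
tree.fit(X_train, y_train)
train_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)
print(train_score, test_score)

# export_graphviz writes the plain-text .dot file that renders into the
# (horrible, horrible) tree PDF.
export_graphviz(tree, out_file="tree.dot",
                feature_names=[f"x{i}" for i in range(10)])
```

A gap like this is exactly the "it tried" moment: it tells you the variables (or the depth) aren't capturing the thing you're predicting.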
Can you all see those? We were able to pass it the names of the columns and things like that, so we can see how it made these decisions, what variables went into those decisions, and then the purity of each of the nodes afterwards, what ends up in that set after it's been split. So it's really cool to see how it does it and be able to trace back through it, even if 20 splits might be a little more than you want to go through for more than one or two really busted variables. Pause to look at terrifying PDF. Okay. Another thing the tree will tell you is how important the various variables you gave it are. The variables are called features; you feed it features. So we ask it about the feature importances, and it's like, cool, here's a vector of 266 importances. It's not the most fun thing to read, so we'll play with that a little bit. So: tree, what is the most important thing? And it says zero. I don't know about you, but when the computer tells me the answer to something is zero, I'm a little suspect. I want to make sure I check a couple of other things. It could really mean zero, or it could mean "I am mad in some unspecified way; here's a zero." So we're going to get the top 10. And that turned out to be weird and hard. Thank you, Stack Overflow, for helping me. I gave you the URLs of the nice people on Stack Overflow who helped me figure this out. It turns out NumPy has this argpartition thing that will give you the top X things in an array. It doesn't give them to you in any particular order; apparently that makes it more efficient. Great. Cool. Thanks, NumPy. And then argsort lets us get them in an order, but it goes 10, 9, 8; I want number one to be first. One is important to me. So then np.flipud, flip up-down, because it's a vector. Right? I'm so picky. I just want it to be what I want. So this is what we actually get. It says the 0th one. So our argmax was right.
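The argmax / argpartition / argsort / flipud pipeline just described, sketched with a toy importance vector standing in for the tree's real 266-element feature_importances_:

```python
import numpy as np

# Toy stand-in for tree.feature_importances_.
importances = np.array([0.40, 0.02, 0.25, 0.0, 0.1, 0.0, 0.05, 0.08,
                        0.03, 0.01, 0.06, 0.0])

# np.argmax: index of the single most important feature.
best = np.argmax(importances)

# argpartition pulls out the ten largest (in no particular order, which
# is what makes it efficient); argsort orders them least-to-most
# important; flipud flips the vector so number one comes first.
top10 = np.argpartition(importances, -10)[-10:]
top10 = top10[np.argsort(importances[top10])]
top10 = np.flipud(top10)
print(best, top10[:3])
```

The resulting indices line up with the columns of X, which is how the talk maps them back to products count, login count, and the Gold plan.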
It wasn't just lying to me because I didn't know what I was talking about. The first one, and then the 17th one. And what are those? This is the same index the columns of X are in, so that's our products count and our login count. Gold is our free plan, so it kind of makes sense that whether you pay us or not tells me a lot about how you act. It's interesting to see the browsers come up. But when you look at the actual values of the feature importances, they trail off really quickly. So if I'm actually trying to do a better job with this model, maybe I keep those first two continuous variables and swap out those random categorical variables that it was really fun to learn how to make dummies for. I try to find something that might be a little more relevant, put those into the model next time, and give it a shot, because we're seeing that so many of these are really not contributing. So we're able to just inspect that model and learn something about the data set and about what kinds of models fit it well. So: you can do this. Have a data set. Just follow the example and Google a lot. That's how we do. I do have some resources for you besides this. Data Science from Scratch helps you build this stuff up from scratch in Python; I have not used it personally, but I've heard really good things. I really like The Elements of Statistical Learning, but it's pretty dense. It's a free PDF, though, so at least you're not investing a lot of money in dense things. And all its examples are in R. Some of my colleagues helpfully provided these other titles for us, so I cannot vouch for them directly, but Ryan and Rebecca are very smart people; I'm sure they know what they're talking about. Thank you for hanging out with me. It's been an amazing conference. We have transferred a lot of data between us these two days.