Now that I have all of your attention, thank you very much for the introduction. I was told to reintroduce myself for the video. So I'm Juliette Hogan. I'm the head of data science for engineering at Cloudera, and I used to work in a more customer-facing capacity, and in that time I used a lot of Spark with our customers and also spent a lot of time explaining how to use its machine learning capabilities. So I was asked to come and give a useful Python Spark talk, and I figured the best thing that I could do was give you a very typical example of a machine learning application and show you a simple example of how to build a model using Spark and then test the efficacy of the model. And as you'll see, I've kind of beaten myself to the punchline: the model is very bad. But I have a notebook available on GitHub which you can go find, and I'll point you to the link at the end, along with the data, so you can then experiment and improve the model, which is really the constant fight of doing data science and building models that you're putting into production in your systems. It's really rare that you have a static model that you build once and say it's just good, it's fine, it's done, it's good enough. So without further ado, I'm gonna try and keep this really punchy, keep this fun, and show you an interesting machine learning example.

So Spark is all the rage these days, as are other new distributed computation engines. Holden and I have used Spark quite a lot, and Ellen of course is gonna be talking about Flink, which is another alternative and a good one. A lot of the excitement around Spark is because people that have used MapReduce know how frustrating it can be and how slow it can often be, because you end up doing quite a bit of disk I/O. You're reading from files and writing to files, and that's just sort of a law of physics. When you're processing huge amounts of data that's stored on a magnetic disk, you need to actually get those ones and zeros off the magnetic disk and into memory somehow, and that's limited by how fast the disk can spin. The issue with MapReduce is that if you're doing something really iterative, which a lot of machine learning model training procedures are, you end up reading and writing very frequently, and so your job is I/O bound, meaning you're limited by the speed at which you can read and write. But you're doing the same reading and writing over and over again. So the advantage that Spark gives is that it builds up a computational plan before it actually executes, and it'll read once, do a bunch of computation in memory as much as possible without having to write to disk or spill to disk, and save a lot of the time that MapReduce would have used in reading and writing from disk.

So Spark is exciting. These distributed computation engines are very exciting, and there are two big parts to that. I talked about it being fast to run; particularly for machine learning algorithms it's a lot faster. It's also relatively easy to develop with. The APIs are very simple, and they're rich APIs: you can express a lot of the computation that you want in them. I think for just about any distributed computation engine, the rubber really hits the road when you put it in production and start running it. Holden started getting into the complications of using Spark and what it means when you begin to run out of resources and you need to tune and optimize how fast it's going.
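To make that read-once, compute-in-memory point a little more concrete, here is a minimal PySpark sketch; the file path and column name are made up for illustration, but the pattern is the one Spark encourages: cache the data after the first read so an iterative loop does not go back to disk on every pass.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()

# Hypothetical input path; nothing is read yet, Spark only records the plan.
events = spark.read.parquet("hdfs:///data/events.parquet")

# Ask Spark to keep the data in memory after the first action,
# so the iterations below reuse it instead of re-reading from disk.
events.cache()

for threshold in range(10):
    # Each count() is a separate action over the same cached data.
    n = events.filter(events["value"] > threshold).count()
    print(threshold, n)
```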
But I'm gonna take the easier out and describe to you the simple, rich APIs for building machine learning models, because they do save human time when you're developing these models.

So, predicting churn. This is such a common use case; I have worked with many, many customers on building churn models. Usually the context for this is that a business has some sort of subscription. Telco companies are a good example, where the business has a bunch of users and they want to understand who might not be a user next month so that they can do something proactive and actually prevent them from stopping being a customer. And so that retention rate ends up being very important, and a lot of these models are trying to predict what kind of customers are gonna leave, what kind of attrition is gonna happen. So let's look at an example of three customers that might be calling. We've got President Obama, we've got a young girl, we've got a baby. Who might not be a customer next month? President Obama will probably be a customer next month. This little girl will probably continue to make phone calls. This baby might eventually realize that it doesn't know how to talk yet, and so it doesn't need to keep paying the phone company. And so how do we take the data that we might have about these people and figure out whether or not they're gonna continue being a customer?

The example dataset that I'm gonna use, like I said, is a real example; I'll give you the link to it on GitHub later. This is from the UCI (University of California, Irvine) Machine Learning Repository. There are some tried and true places where I find datasets for examples like this, and UCI is really great. The UCI Machine Learning Repository has information on customer churn, a dataset that they've cooked up, and this dataset is in a CSV. The first twenty-something fields are information about the customer, and the very last field is a true/false boolean: whether or not that customer churns next month. And so given this pile of phone numbers and categorical features and numeric values, like how many international plan calls customers have made, how do we go about using Spark to build a machine learning model?

I'm of course gonna give you this example entirely in Python, because Python is a very useful, flexible language. But as Holden pointed out, Python with Spark is more complex, because you have more layers that you're adding: you have two different languages that need to interface and actually communicate. So when you're using Spark from Python, the biggest thing to remember, if you want this to be fast, is that you wanna push as much computation as you can into the JVM, which is something Holden already set up for me today in explaining that crossing that blood-brain barrier between the Java virtual machine and a Python process is a really expensive thing to do. And so what we're gonna try to do is keep as much computation as possible inside of the JVM while using Python. So how are we gonna do that? One of the things that has become a best practice, and is actually very ingrained in the way the machine learning library is set up, is that you're expected to build up your collection of data as a data frame. So you're gonna create these tables that have fields, and the fields have types; we have a schema for our table.
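As a rough sketch of what that looks like, here is roughly how you can declare a schema and read the CSV into a data frame; only a few of the UCI columns are spelled out here and the file name is a placeholder, so treat this as a sketch rather than the exact code from the notebook.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("churn-example").getOrCreate()

# Only a handful of the twenty-something UCI fields are shown here.
schema = StructType([
    StructField("state", StringType(), True),
    StructField("account_length", DoubleType(), True),
    StructField("intl_plan", StringType(), True),
    StructField("total_day_minutes", DoubleType(), True),
    StructField("total_intl_minutes", DoubleType(), True),
    StructField("churned", StringType(), True),
])

# Supplying the schema up front keeps parsing and typing inside the JVM.
churn_data = spark.read.csv("churn.csv", schema=schema, header=False)
```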
And using that information, Spark is gonna be able to take a lot of operations that we describe at a high level in Python and, when it actually executes on the cluster, execute them entirely in the JVM. So what we need to do is define the schema for the table that we have. If you go to the UCI Machine Learning Repository, you'll see the titles for all of these fields: the state the customer is in, the length of time this person has been a customer, the total number of minutes used, daytime minutes, nighttime minutes, international minutes. So we have a bunch of this information, and then finally we have churned or not churned. And then we're gonna read this in as a data frame, because it's faster. Data frames help us herd the cats and really impose some sort of structure onto our data, which is useful.

So now we're gonna build some sort of model. And in order to understand how this happens in Spark, it's useful to take a step back and talk about how the machine learning modeling lifecycle usually works. So who here has used scikit-learn? Who likes the scikit-learn API and its abstractions? Yeah, so the scikit-learn API, I think, is relatively well thought out. When you think about what a model is, that's actually kind of a hard question and a fun one to discuss. If you meet another data scientist at this conference and you wanna have a somewhat interesting conversation: what is a model? What is a machine learning model? What does it consist of? What are the pieces that you need? In my mind, a machine learning model is the data that you trained on, a description of the entire fitting process, and the parameters that you might need later when you apply it; all of the pieces that you need to fully describe the thing that you can apply later to produce results.

So what that usually means is that there's some sort of initial data that you have, and you translate that into a label and a vector that is entirely made out of numeric values. The label, if it's continuous, might be representing a regression problem; if the label's categorical, then you're probably talking about a classification problem. And the vector of numeric values we call our feature vector, right? So that first step we call feature extraction, where we're going from a pile of data to a label and a vector of doubles or numeric values, and in this case we're always gonna be changing these things into doubles. And then we have some sort of specification of the type of model we're gonna use. Usually it's enough to say, I'm going to use a logistic regression model. Sometimes you wanna get a little more specific, and there might be an optimization method that you're using specifically inside of that regression model, and there are little ways that you might tune it or tweak it, but you describe a model and how you're gonna fit it. And that is the modeling portion. But given that you've done that, you should have enough information to know how to apply it when you need to apply it. So if I have a feature extraction step and a model fitting portion, which gives me some parameters, I can then save those parameters, remember how I described the process of turning that pile of data into a feature vector, and turn around and get new data in, data that I haven't seen yet, do the same feature extraction, and apply the model that I've already fit. So there's a model fitting portion, which we usually call model training, which happens at the top.
And that's at the top of this diagram. When we're building models, it's very rare that we just take all of the data that we have, fit it, and call it a day. It's a machine learning model, it must be good, right? Its math must be right. That's not actually what we think, right? We need to actually prove to ourselves that we're doing a good job of building a generalized model that can be applied in many situations. And so what we often do is take our big dataset, take a section of it, say 90%, train our model on that, and then leave ourselves blind to the last 10%. We apply our trained model to that last 10% and see how well we did; we use that as a holdout set. There are sort of two levels that this might happen on; we might do something like cross validation inside that 90%. But we're always trying to make sure that our model generalizes well, so that when we take it out into the real world, we're confident that it might do pretty well and actually generalize, and not overfit or be too specific to the data that we actually trained it on.

So this means quite a bit of iteration, and needing to be able to easily take a model specification, the feature extraction steps and the model training step, apply it to training data, get some fitted parameters, apply it to our testing data, evaluate it, and keep doing that again and again. So this is a flow that happens regularly, and there's a series of steps that need to happen sequentially. And this brings up the concept of a pipeline, where what we wanna describe is a sequence of steps for feature extraction and then a model training or model application step, depending on what context we're using it in: whether we're fitting the model or applying the model.

So how do we do this in Spark? Like I've already said, we need to be able to use Spark data frames, because Spark data frames make things quite a bit faster, particularly when we're using PySpark, and MLlib just expects data frames at this point. The newest components of MLlib work on data frames, work on columns. So with MLlib, what we're gonna do is define the stages of our pipeline. We're gonna say, here are the first pieces that need to happen in our pipeline: we need to do something to this column, do something to that column, and then we'll specify what type of model we need to use. One of the weird things about MLlib is that I keep talking about MLlib, and you might think to yourself, ah yes, that package must be pyspark.mllib.whatever. pyspark.mllib is actually the first version, a version that didn't rely on data frames. They decided that relying on data frames is so important that they're gonna build a new package and deprecate .mllib. So in my imports you will notice that this is pyspark.ml. If you go out and try to replicate this, don't import from mllib, import from ml.

So with pyspark.ml, we define what columns we have. Out of all the columns that we had, some are already numeric: we have a bunch of double columns. But we also have a bunch of string columns, like whether or not an international plan is being used, or what state someone is from. And so we need to do something to actually encode those as numeric values so that we can use them easily inside of our feature vector, right? We can't just pass it a string; we have to encode it numerically. Spark provides tools for that. These are called string indexers.
And so the string indexer is basically allowing you to do one hot encoding on whatever column you point it at. So the US has 50 states; we would expect that if we look at the state value, we'll end up with 50 more components in our vector, and we'll have those encoded as a flag, like a boolean flag, for whether that state is being represented or not. And that happens if there are more than two values. If it's a binary value, it'll default to just one element with zero or one, which is the degenerate case of one hot encoding with only two values. So for example, we have churned as a string that says either true or false, and we're going to turn that into a zero or one, which is what we need for classification anyway. We're also going to take international plan and turn that into a one hot encoded yes or no value. So we specify these indexers, and the indexers have an input column, the column whose value is a string, and an output column, which is the new column that we're going to create that contains these numeric values.

We then need to take the columns that contain all of our numeric values as individual values and create a single column that has the type of a list, or of a vector, and use that as our features. So we can use a thing called a vector assembler, which is a built-in part of MLlib. For this vector assembler, the input columns are going to be all of the numeric values that we already had, plus this one hot encoded international plan column that we created, and the output column we're going to call, very cleverly, features. So we now understand what pieces need to go into the feature extraction pipeline.

And we're going to use something called a decision tree classifier. Decision tree classifiers, again, punchline already given away, are really not that good; they tend to be very biased toward the dataset that you're using. So we can fit a single decision tree classifier on this dataset, but as we'll see at the end, it doesn't work very well. And I did this very purposefully, because I want you to be so infuriated by this horrible model, why would I choose such a terrible thing, that you go and download this notebook, replace it with a better model, actually run it, and see how much better you can make it. So we're going to have this decision tree classifier. To specify what this classifier is going to do, we need to specify which column contains the features that we have and which column contains our labels. The label column I've, again, very cleverly called label; the features column, very cleverly called features. And that is our classifier.

So our pipeline consists of stages that we list in the order that we want them applied. We need to do our indexing first, so our plan indexer and our label indexer; the vector assembler then assumes that the output columns of the indexers already exist, so that needs to come next. So we do indexing on our string columns, we create a feature vector from those columns, and then we have our actual classifier, in this case a decision tree classifier. At this point, we could do something fancier than what I'll show you: we could do k-fold cross validation, and that is built into Spark in a very nice and easy way. But I want to do this very simply. So I'm going to take my dataset and do a random split where 70% is going to be our training data and 30% is going to be our testing data; train is that 70%, test is the 30%.
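Pulled together, a minimal sketch of the pipeline I've just described looks roughly like this. The column names are placeholders standing in for the full list in the notebook, and only a few numeric columns are shown, so treat it as a sketch rather than the exact notebook code.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Index the string columns into numeric columns.
plan_indexer = StringIndexer(inputCol="intl_plan", outputCol="intl_plan_indexed")
label_indexer = StringIndexer(inputCol="churned", outputCol="label")

# Assemble the numeric columns plus the indexed plan column into one feature vector.
assembler = VectorAssembler(
    inputCols=["account_length", "total_day_minutes",
               "total_intl_minutes", "intl_plan_indexed"],
    outputCol="features")

# The classifier only needs to know which columns hold the features and the label.
classifier = DecisionTreeClassifier(featuresCol="features", labelCol="label")

# Stages run in order: index the strings, build the vector, then fit the tree.
pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])

# Simple hold-out split: 70% training, 30% testing.
train, test = churn_data.randomSplit([0.7, 0.3])
```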
And then, much like scikit-learn, we take this pipeline, call fit on it, and pass it our training dataset, and what it returns to us is a model. So this model is something that we can apply later to our testing set. But what are we going to do with that? And how are we going to know if we're doing anything useful? I think one of the funniest things about having a title like Data Scientist is that people think you're like a shaman or a priest: you're doing math, you must be right. And I think one of the signals that someone is taking their job seriously is that they're really skeptical about whether or not the models they're building are any good or the conclusions they're reaching are correct. So are we doing a good job? How are we going to measure this?

One of the obvious, simplest, clearest ways to do this, and it's very, very common, is to use the area under an ROC curve. On the y-axis we have the true positive rate, and on the x-axis we have the false positive rate. And we would expect that as we begin to classify everything as being part of the class that we're trying to predict, we begin to capture all of the true positives. So we're always going to have a point on this curve in the far top right-hand corner, and we're always going to have a point in the far bottom left-hand corner. The question is: does our model look perfect, which would be a model that looks like A? Does it look kind of average, like model B? Does it look like random guessing, like model C? Or is it worse than random guessing? One way to summarize this really quickly is to look at the area under the curve. A perfect model would have an area under the ROC curve of 1; a random guessing model would have an area under the curve of 0.5. Somewhere in between those two is where most models fall. But if we get something less than 0.5, that means we should take our predictions and guess the opposite, that is how bad our model is; we would be doing a better job.

So let's see how we're doing. Of course, PySpark's MLlib comes with these model evaluation facilities built in, and there's a binary classification evaluator. If you're interested in a multi-class classification evaluator, things like getting the confusion matrix out, that is also something you can get out of MLlib. We take the model that we fit before, and we transform our test dataset using this model, so it's aware of the pipeline and all the fitted parameters that came out of it. Then we create the binary classification evaluator and get the metrics that we need out of it, like the area under the ROC curve and the area under the precision-recall curve, which is a variation on an ROC curve.

So I'm now gonna ask you what the punchline of my talk is. Are we doing better than random guessing with this decision tree classifier? Who has a guess, given all of the very subtle hints that I've dropped? No, probably not; I'm probably not doing better than random guessing. And the answer is no, the area under the ROC curve is 0.49, which again means what I should do is take this classifier and guess the exact opposite, and I would be doing much better than this. And that's because decision tree classifiers tend to overfit and be too specific to the dataset you hand them. Luckily, you can do things like ensemble models, create a random forest, and get considerably better performance.
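In code, that fit, transform, and evaluate loop looks roughly like the following, continuing from the pipeline sketch above; again, a sketch rather than the exact notebook code.

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Fit the whole pipeline on the training split; the result bundles the
# fitted indexers, the assembler, and the trained decision tree.
model = pipeline.fit(train)

# Apply the fitted pipeline to the held-out data to get predictions.
predictions = model.transform(test)

# Area under the ROC curve: 1.0 is perfect, 0.5 is random guessing.
evaluator = BinaryClassificationEvaluator(labelCol="label")
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print("AUROC:", auroc, "AUPR:", aupr)
```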
So I encourage you to go find this entire example in a GitHub repo called DS for telco. The instructions for downloading the dataset are there, along with a notebook of worked examples that's more in depth than what I've shown here, with its output available, and a blank notebook with instructions on places that you can change things and things that might be useful is also available. I think it's kind of a fun way to get some hands-on work. Thank you very much for your attention and your time. Great, do we have time for questions? Great, do we have any questions? Yes. What are your thoughts on using Spark to train deep learning models?

Okay, the question that I heard was: what are my thoughts on using Spark to train deep learning models? The closest thing that I've seen in Spark to something you would call deep learning is an implementation of Word2Vec; that exists. If I needed to train a distributed deep-learning-type model at the moment, I would use something like H2O. And if you're committed to using Spark, if you're already using Spark in your stack, there's a thing called Sparkling Water, which lets you read data out of files, load that into a Spark data frame, and then pass that back and forth between a Spark data frame and H2O. H2O has legitimate deep learning facilities and very good documentation. Other questions?

The next question, as I understood it: what I want to do is write my own classifier and stick that one into the pipeline, but so far I'm not sure if it's even possible. So I'm pretty sure that, much like scikit-learn, there's an abstract class or an interface that you can extend from the MLlib library, and if you extend the correct classes, you can put your own stage inside of a pipeline. I think the classes usually say that you need to have two functions, a transform and a fit function, on your classifier, but I believe it's possible to build your own classifiers or your own models as long as you're extending the correct classes. Oh, sorry, I should repeat the question. The question was basically: how do you build a custom classifier and use it in a pipeline? Yeah, you said you have a second question: Spark is early in the adoption curve, so it's relatively new. The comment that I heard from the question asker is that when you look for documentation about doing custom this or custom that, you don't see it immediately. Spark is an open source project where, generally, the core pieces are documented well, and other pieces take more time and hunting to find out whether or not they're documented. So for example, building your own custom classifier might not be documented anywhere, but the source code is available, and generally what you can do is go look at the source code and try to figure out how to use it. If you have a good relationship with a vendor like Cloudera, Hortonworks, or MapR, call the person that you work with and say, how do I do this? I know Cloudera has lots of people on their team that know how to do this and help customers through it all the time.
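To make that answer about custom pipeline stages a bit more concrete, here is a rough sketch of a custom stage in PySpark, under the assumption that you only need a transform step; a full custom classifier would instead extend Estimator and return a fitted model from its fit method. The class, columns, and logic here are made up purely for illustration.

```python
from pyspark.ml import Transformer

class MinutesPerCallTransformer(Transformer):
    """Hypothetical pipeline stage that adds a derived ratio column."""

    def __init__(self, minutesCol="total_day_minutes",
                 callsCol="total_day_calls", outputCol="minutes_per_call"):
        super(MinutesPerCallTransformer, self).__init__()
        self.minutesCol = minutesCol
        self.callsCol = callsCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        # Pipeline fitting and transforming ends up calling this method.
        return dataset.withColumn(
            self.outputCol, dataset[self.minutesCol] / dataset[self.callsCol])

# It can then be dropped into a Pipeline alongside the built-in stages, e.g.
# Pipeline(stages=[MinutesPerCallTransformer(), plan_indexer, label_indexer, ...])
```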
On the documentation point, there are also good books like Learning Spark; Advanced Analytics with Spark I think is also very good if you're trying to work on that. But it's also a quickly evolving open source project, so new pieces are getting added all the time and not necessarily documented as well as everyone would like. Writing down your experiences is also a great way to contribute to the open source community: if you figure out how to build a custom classifier, write a blog post about it so other people can learn from what you've done. Other questions? Yeah? Not out of the box. It has some sampling facilities with which you can build your own down-sampling strategy, but there's nothing that just says, this is imbalanced, please go fix it. Sorry, the question was: does PySpark or Spark have facilities out of the box for dealing with imbalanced classes in MLlib? Answer: not really. Other questions? All right, thank you very much for your attention.