I'm very glad to be here to share this with you all. First, a brief introduction of myself: I work in data science at MSD, and my projects have focused on applied machine learning in drug discovery research, as well as advanced analytics for the marketing and finance departments.

The case study I'm going to show is adapted from a real one I worked on before. The catch is that the real data lives in our data lake and is proprietary, so the data set I'm showing here is a mock-up. That means, unfortunately, you won't be able to follow along on your laptop. But the idea is to give you some demos so that in the future you can hopefully copy-paste the code for your own PySpark projects.

The workflow will be like this: I'll give some basic background on the question we're trying to solve and on the data sets, and then I'll cover four parts. First, pre-processing using PySpark. Second, using the ML library, specifically random forest, for a binary classification problem. Third, using the AUC score to test the model's performance. And finally, one of the challenges in this data set, which is that it's very unbalanced, and how we can handle that in PySpark.

The question we're trying to answer is: what is the probability that a given customer will like a certain product? The data set has two parts, the product's features and the customer's features, and we want to predict the probability that a customer will like a product based on all of these features. Here is an overview of the mock-up data set we're using. Some columns are continuous variables, some are categorical, and the feedback column is the one we need to predict, which is a binary categorical column.

So let's start. The first step is to load libraries. As you can see, we have the ML libraries, we even have pandas, though we may not really need pandas for this tutorial, and then NumPy and a few other libraries.

At this step, we're assuming the usual company setup: you have a big data team that helps you set up the platform, a traditional IT team that prepares the data sets and gives you whatever views or tables you need, and then you, as the data scientist, sit down and write Python scripts to do the processing and modeling and demo your POC. We're assuming all the data sets are ready, which is why we can load them into a PySpark DataFrame with this one line. Then we select the features from the table that we'll use for further processing and modeling; I'm naming all the features here. After that, you simply drop the duplicates from the table.

Then let's zoom in. The first thing we want to see is what the target column's distribution looks like, because, as I mentioned, the target column is supposed to have two categories, one negative and one positive. With a simple bar plot, we can see that it actually has several categories: neutral, negative, and positive.
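If I were to sketch those loading and inspection steps in code, it would look roughly like this. The table name and column names here are made up for illustration, since the real data set is proprietary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feedback-poc").getOrCreate()

# Assumed table/view prepared by the IT team (hypothetical name).
df = spark.table("customer_product_feedback")

# Keep only the columns we will process and model on (hypothetical names).
feature_cols = ["product_feature_1", "product_feature_2", "product_feature_3",
                "customer_region", "customer_job_type", "customer_gender"]
df = df.select(feature_cols + ["feedback"]).dropDuplicates()

# Inspect the target distribution: groupBy + count, then plot the rows.
counts = df.groupBy("feedback").count().collect()
labels = [row["feedback"] for row in counts]
values = [row["count"] for row in counts]

import matplotlib.pyplot as plt
plt.bar(labels, values)
plt.xlabel("feedback")
plt.ylabel("count")
plt.show()
```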
This is very typical of real-world problems, because, first of all, the data set is very skewed. We have a lot of positive cases, which is a good indicator that people like our products, but we have too few negative cases, which makes it hard to train a model because we don't have enough information about the negatives. And we have quite a lot of neutral cases. Since the challenge is to differentiate the positive cases from everything else, one thing we can do is simply group the neutral and negative cases together, and that's exactly what we do here.

These few lines illustrate the idea. First we do a groupBy count to find the count for each category in the feedback column. The result is a list of rows, from which we take out the categories and counts and make a bar plot, which is what you see here.

The next step, as I mentioned, is to group the neutral and negative cases together so that we end up with two categories. Of course, I also drop the null rows. In PySpark you'll find that in many cases you have to use a UDF, a user-defined function, and it's actually quite easy to do. First you define your own Python function, here called finalize. After that (there's a typo here, by the way) you pass it to the function called udf, together with the output data type, which is string. With this UDF, you call withColumn, which creates a new column: you give it the new column name and the input column for the function, and you end up with a new column on your DataFrame, df. Remember, df is our original DataFrame holding the data from the table. The user-defined function converts all the neutral cases to negative and leaves the positive cases unchanged. That's exactly what it does here; I think it's quite straightforward.

The next step is to deal with all the null values in the categorical columns and to cast the data types properly. Among the columns we selected, some are continuous variables, so we convert them to floats and integers using cast: that's column.cast, and then you specify the type you want to cast to. The remaining columns are categorical, and for those the next step is to convert all the null values to the string "NA", so that the columns no longer have nulls. This is quite straightforward as well.

After we finish cleaning the data set, the next step is to see which categorical columns have too many categories, so that we can think of a strategy to deal with them. First we do a distinct count for each categorical column, to see how many different categories there are in total. The way you do this is very simple: you take the distinct values and count them. We can see that one of the columns, called product feature three, has 500-something different categories, which is too many. One way to deal with this is to group the minority categories into a single category and leave the majority unchanged.
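A sketch of those cleaning steps, under the same made-up column names; the function name finalize matches what I described, but the exact code here is mine, not the original:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Merge neutral into negative; positives stay unchanged.
def finalize(feedback):
    return "negative" if feedback == "neutral" else feedback

finalize_udf = F.udf(finalize, StringType())
df = df.dropna(subset=["feedback"])
df = df.withColumn("feedback", finalize_udf(F.col("feedback")))

# Cast continuous columns (hypothetical names) to numeric types.
df = df.withColumn("product_feature_1", F.col("product_feature_1").cast("float"))
df = df.withColumn("product_feature_2", F.col("product_feature_2").cast("integer"))

# Replace nulls in the categorical columns with the string "NA".
categorical_cols = ["product_feature_3", "customer_region",
                    "customer_job_type", "customer_gender"]
df = df.fillna("NA", subset=categorical_cols)

# How many distinct categories does each categorical column have?
for c in categorical_cols:
    print(c, df.select(c).distinct().count())
```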
For example, out of 500 categories, maybe 400 are actually minorities, so we can group them into one category called "minority". That's one idea, and the way we do it is like this (there's a sketch of it below). Again we have to use a UDF, but this part is a bit more challenging, because this UDF accepts two input columns: one is the category column, and the other is the count. As before, we first do a groupBy count: we group by this column and count. Then we join the resulting DataFrame back to the original df, so that we have an additional column called count; for each row, the number represents how often that row's category appears in the whole data set.

After we join this column, we create our user-defined function. As I mentioned, it accepts two inputs: the first is the original column, the product feature, and the second is the count. Based on the count, we set a threshold, say 150, to filter out the categories below the threshold and assign all of those values to a single universal "minority" category string. After we create this function, we pass it to udf and specify the output type. Then, again, we use the same withColumn function to create a new column called product feature three reduced, which is product feature three but with a reduced number of unique categories; we pass in the UDF and specify the input columns, the product feature and the count. After that, we drop the original product feature column, which had 500 categories, as well as the count column, which is no longer useful.

Okay, so after we finish, the last step will be... any questions? Yeah, sure.

How did I define this threshold? Why this value? Okay, the story is that I did another plot to see, for each category, how many counts are associated with it. I found that 150 was a good threshold, letting me filter out roughly 400 categories. The data set is in the cloud and I can't run it here, so I can't actually show you, but what you have to do is make a bar plot of the counts per category, and if you feel 150 is good enough to reduce enough categories, you just use that.

But what is the purpose? Is it to collapse to a fixed number of categories, say you want only a hundred, or is it a percentage of the collection? The threshold is on the count: any category whose count falls below it is considered a minority.

So the purpose is to reduce the number of categories. Yes, because we have 500 different categories, and some of them may appear only three, five, or ten times. All of those I group together to form a minority category.

Okay, so to answer the question directly: how many categories are there? 500 in total. And after this transformation, how many? A hundred or something. It's not ten or twelve; it's a hundred or so. Yes, a hundred or something, because I feel I don't want too many; two hundred is too many for me, so I went with roughly a hundred. I didn't show that part here. Yeah, that's a good point.
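Here is a minimal sketch of that minority-grouping step, again under my own assumed column names (the original code isn't reproducible here):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

THRESHOLD = 150  # chosen by eyeballing the per-category count plot

# 1) groupBy count, then join the counts back onto the original DataFrame.
counts = df.groupBy("product_feature_3").count()
df = df.join(counts, on="product_feature_3", how="left")

# 2) Two-input UDF: keep frequent categories, collapse the rest.
def reduce_category(category, count):
    return category if count >= THRESHOLD else "minority"

reduce_udf = F.udf(reduce_category, StringType())
df = df.withColumn("product_feature_3_reduced",
                   reduce_udf(F.col("product_feature_3"), F.col("count")))

# 3) Drop the original high-cardinality column and the helper count column.
df = df.drop("product_feature_3", "count")
```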
So what this achieves is that after you run this function, the original column, product feature three, will only have a hundred or so unique categories left.

And why is it even a problem that there are so many categories? That's a good question. We don't want five hundred categories, but then how many do we want? If you increase the threshold to, say, 350, you end up with something like 70 categories; if you increase it to 800, probably 20. So what is the optimal number of categories, and how do you choose the threshold properly?

These are actually two different concerns, and both are reasonable. On what counts as a reasonable number of categories: on one hand, it depends on your data. If you find that the top hundred categories each have quite a lot of rows, and from the hundred-and-first onward the counts drop substantially below some level, and that break is very obvious, then a hundred is probably a good cut-off. It all depends on the data: you plot it and see, okay, from here onward everything becomes minority, and it looks right for this data set and distribution. That's how I set the threshold. Or, if you think a hundred is the maximum number of categories you want for a random forest, because with too many categories it doesn't make sense to train the model, which was the other concern, then you can use a hundred as the threshold. So it's all up to what your concern is and what your data looks like.

Okay, so the next part, the final part of the data processing, is one-hot encoding. How many of you are familiar with one-hot encoding and know what it does? Okay, so maybe I'll explain it a little. For categorical features, we can't just push them into the model and let it train, because in the end the input data for the model is a matrix of numerical values. So in order to use categorical columns, like city or gender, you have to convert them into numbers. There are two ways to do that. One is to use an integer per category: Singapore is 1, New York is 2, and so on. But this isn't the best way, because it imposes an ordering relationship that doesn't exist: New York is not two times Singapore. The other, better way is one-hot encoding. If your column has ten different categories, say ten cities, you end up with ten columns, and in each row exactly one of them is 1, the one corresponding to that row's city, while the other nine are 0. If the row is Singapore, then the column representing Singapore is 1. That's the idea of one-hot encoding: convert your categorical columns into numerical form so your model can train. But unfortunately, in PySpark it's a bit harder to do one-hot encoding than in pandas or other packages.
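To make the idea concrete, here is a toy illustration of my own (not from the slides), encoding three cities by hand:

```python
# Toy example: one-hot encode three cities by hand.
cities = ["Singapore", "New York", "Tokyo"]
vocab = sorted(set(cities))  # one output column per category

for city in cities:
    # Exactly one position is 1 (the matching category); the rest are 0.
    one_hot = [1 if city == v else 0 for v in vocab]
    print(city, one_hot)
# Singapore -> [0, 1, 0]: no fake ordering, unlike integer codes 1, 2, 3.
```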
So the first step is to use the StringIndexer to index the strings into label indices: originally you have strings, like Singapore and New York, and this converts them into label indices, 1, 2, 3, 4, 5. After that, you pass the indices to the OneHotEncoder, which converts them into the one-hot encoded columns, which are sparse columns. So those are the two steps.

First you specify the categorical columns you want to one-hot encode, and you choose a new name for each column after conversion; for example, you want to name "feature three reduced" as "feature three reduced cat vec". Then you have the two lists. For each of the original input columns, you create a StringIndexer with an input column and an output column; that output is further passed to the OneHotEncoder function as its input, and the output of the OneHotEncoder will eventually be your one-hot encoded, sparse column. This is very tedious, I know, but there's no other way to do it in PySpark; if you know one, just let me know. So the sequence is: pass the original column to the StringIndexer, feed the StringIndexer's output to the OneHotEncoder, and take the OneHotEncoder's output as your final column. Two steps. Because I want to make it a pipeline, I flatten everything into a list of stages, step by step for each of these columns. If you find it hard to follow, in the future just copy this code; it will do the one-hot encoding for your columns.

Okay, so the last step. Now you have all the features, and after this tedious feature-engineering part you can finally do the modeling, training a random forest on your data set and making predictions. One step before that: you have to assemble all the features into one column, to let the model know that these are the features I want you to train on as predictors, and that this specific column, the feedback column, is the target to predict. To do this, you use a function called VectorAssembler, whose input is the list of names of the columns you want to use as predictors. This is very straightforward: all of these columns are the features we created from the original columns, as we've seen. Then you specify the output column, giving it the name "features", which represents all of the predictor columns. The second step is to create a label indexer; here you use the StringIndexer function again, with the binary response, the target column you want to predict, as input, and you name the output column "label". After you finish this, remember that tmp is the list of pipeline stages we created previously, so you add these last two steps to it as the final two stages, and then you can finally build your pipeline. The pipeline goes step by step through your feature engineering all the way to these two last steps. If you're familiar with scikit-learn, you'll know that a lot of the classes there have fit and transform functions. It's exactly the same here as in scikit-learn.
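A condensed sketch of that pipeline. I'm assuming the older PySpark 2.x API, where OneHotEncoder takes a single inputCol (in Spark 3.x the encoder takes lists of columns instead), and the column names are again made up:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

categorical_cols = ["product_feature_3_reduced", "customer_region",
                    "customer_job_type", "customer_gender"]
numeric_cols = ["product_feature_1", "product_feature_2"]

stages = []
for c in categorical_cols:
    # Step 1: string -> label index; step 2: index -> sparse one-hot vector.
    indexer = StringIndexer(inputCol=c, outputCol=c + "_idx")
    encoder = OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec")
    stages += [indexer, encoder]

# Assemble every predictor into a single "features" vector column.
assembler = VectorAssembler(
    inputCols=[c + "_vec" for c in categorical_cols] + numeric_cols,
    outputCol="features")

# Index the binary target into a numeric "label" column.
label_indexer = StringIndexer(inputCol="feedback", outputCol="label")

pipeline = Pipeline(stages=stages + [assembler, label_indexer])
```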
So remember, df is your DataFrame. After you create your pipeline, you fit and transform it on your DataFrame, and it goes through all the steps you specified, all the way to the end. After that, I cache my data, because later I'm going to do a lot of ensembling and I don't want PySpark to redo these steps again and again; caching keeps the output in memory so it can be reused directly. Then I do a split, because I want to test my model's performance: I use 20% of the data as test data and 80% as training data. Remember to set a seed here so that you get the same split every time. Then I look at the distribution of the negative and positive labels in the training data, and it's the same ratio we saw before, around 1 to 4 or 1 to 5.

Okay. Finally, you have your data set ready for modeling, the training data set. Remember, this is now a purely numerical matrix; there are no strings, no categorical features. This will be the input to train your random forest model. But before that, you have to configure the random forest model, to let it know that the "features" column represents the whole bunch of features in the data set and the column called "label" is the target to predict. There are a few parameters, one of which is the number of trees, which you have to specify. After that, you fit on your training data in the same way, and then you predict, which in this package is called transform, on the test data set. The transformed output is a set of predictions for your test data. So at this point we've finished the processing and modeling part. Any questions about that?

Now you'd like to see the performance of your model on the test data. Because this is a binary classification problem, one way to evaluate the model is the AUC score, the area under the ROC curve. This score is normally between 0.5 and 1, where 0.5 means you're flipping a coin to randomly assign predictions and 1 means the predictions are perfect. We can do this easily by importing the metric class from the evaluation module, called BinaryClassificationMetrics. Remember, "transformed" is the set of predictions we got from the random forest on the test data; we select the predicted probabilities and the labels, which are the ground truth, pass them to the metrics object, and it gives us the score, which is 0.64. That's not very good; there could be reasons for that, like the data set or the model not being very good, but compared with 0.5 it's still an improvement of 0.14. And if you want to visualize your AUC, you can do that easily too, using matplotlib to plot the ROC curve, like this. The shaded area represents the information, the advantage, your model has gained over the diagonal, which is random guessing; that area corresponds to the 0.14 or so.

Okay, after this step we've come to the last part of the tutorial, which is dealing with the imbalance problem, because, remember, from the beginning we found the data set very unbalanced: we have far more positive cases than negative cases.
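A sketch of the training and evaluation steps, carrying over the pipeline and df from the earlier sketches. Note that BinaryClassificationMetrics lives in the older pyspark.mllib API and expects (score, label) pairs; the numTrees value of 200 is the one I mention later:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Run the whole feature-engineering pipeline, then cache the result
# so later resampling experiments don't recompute every stage.
prepared = pipeline.fit(df).transform(df).cache()

# 80/20 train/test split; fix the seed so the split is reproducible.
train, test = prepared.randomSplit([0.8, 0.2], seed=42)

rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=200)
model = rf.fit(train)
transformed = model.transform(test)  # "predict" is called transform here

# AUC: pair the positive-class probability with the ground-truth label.
score_and_labels = transformed.select("probability", "label") \
    .rdd.map(lambda row: (float(row["probability"][1]), float(row["label"])))
metrics = BinaryClassificationMetrics(score_and_labels)
print("AUC:", metrics.areaUnderROC)
```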
Unfortunately, we can't do much more inside the random forest package in PySpark, because class balancing isn't implemented in its random forest, as far as I know, unlike other packages such as scikit-learn, which is more robust and gives you a lot more features. So what we can do here is manual down-sampling or up-sampling. Down-sampling means that because we have many more positive cases than negative cases, we take a sample of the positive cases and keep all the negative cases, bringing the positives down toward the negatives. This plot just shows the resulting distribution, so there's not much to say about it.

Let's see how we can do the down-sampling. First, we specify the desired ratio of positive to negative cases; because we want to bring it down, we set the ratio to two, meaning two positives for every negative. We keep all the negative cases because they're the minority. Then we do a groupBy, same as before, grouping by positive and negative to get a count for each, and we calculate the ratio. This next part is a bit tricky. The way I do it is that for each positive case I assign a random integer within a bound; then, using a threshold I've calculated, I keep all the rows whose random integer is below the threshold and throw away all the rows whose random integer is above it. What remains is roughly two to one in terms of numbers, and this bit of code takes care of that. Eventually your sub-sampled DataFrame is a subset of your training data with a reduced number of positive cases. You use this data to retrain your random forest and make predictions, and you see it improves a little, to 0.646. So with this approach you can improve your results a bit.

Another, slightly trickier approach is to build an ensemble of down-samplings. Every time you down-sample the positive cases, you throw away quite a lot of them. So instead, you down-sample multiple times, taking a different subset of the positives each time; do it ten times, get ten sets of predictions, and take the average, which is your ensemble. Hopefully this works better than a single down-sampling. Let's see how it does. This code is a bit more complicated, so I'm not going to go through it, but the basic idea is that you do ten rounds of down-sampling, each time getting a prediction, and then you average them. Looking at the results: the first round, without averaging, is 0.645, and then it increases and increases. So the ensemble actually helps a bit compared with the single down-sampling.

These are two strategies you can play with when you have an unbalanced data set. There could be other ways to do it; these are just some possible approaches that can help. And like I said, you don't have to fully understand how all of this works; just copy-paste and hopefully it will do the job for your future PySpark projects.
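A sketch of both tricks. I'm assuming the rf estimator and label column from the earlier sketches, plus a unique "id" column on the rows for the averaging step, and I've replaced my random-integer trick with the equivalent F.rand() filter:

```python
from pyspark.sql import functions as F

def downsample(train, ratio=2.0, seed=7):
    # Counts per class; assumes label 1.0 = positive, 0.0 = negative.
    counts = {r["label"]: r["count"]
              for r in train.groupBy("label").count().collect()}
    # Keep each positive row with this probability so that
    # positives : negatives ends up roughly ratio : 1.
    keep = min(1.0, ratio * counts[0.0] / counts[1.0])
    pos = train.filter(F.col("label") == 1.0).filter(F.rand(seed) < keep)
    neg = train.filter(F.col("label") == 0.0)  # keep all minority rows
    return pos.union(neg)

# Ensemble of down-samplings: a different subset of positives each round,
# one model per round, then average the positive-class probabilities.
def ensemble_probs(train, test, n_rounds=10):
    preds = None
    for i in range(n_rounds):
        model = rf.fit(downsample(train, seed=i))
        p = model.transform(test).select("id", "probability") \
                 .rdd.map(lambda r: (r["id"], float(r["probability"][1])))
        preds = p if preds is None else preds.union(p)
    # One score per round per row, so averaging divides by n_rounds.
    return preds.groupByKey().mapValues(lambda scores: sum(scores) / n_rounds)
```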
Actually, based on my experience working with PySpark, it's not very easy; it's much harder compared with using pandas and other packages. But why would you still need PySpark? Well, one of the obvious reasons is that you have too much data to fit on your laptop, so you need a cluster. At this moment, I think the PySpark documentation is quite limited, and there aren't many packages; random forest, for example, comes only in a basic implementation with its basic features. That's why we had to do a lot of extra work to enhance the performance. But hopefully this gives you an idea of what you can do with PySpark. So, with that, I finish the tutorial, and if you have any questions, you can ask.

Which version? Oh, which version of PySpark? Sorry, actually I can't remember. It should be around that, because the last update was last year, which is when I finished the project. It's not the latest one; at that time there was no class balancing for random forest, and you couldn't even get feature importances out of the random forest models.

If I have two negatives and ten positives, why not just randomly sample from the positives directly? Why go through that long procedure? Well, yes, you have to randomly sample from your positive cases, but how do you do the random sampling? There's no function that does it for you. That's why I keep saying PySpark is not easy for a lot of things; there isn't even a convenient random-sampling function for this, a lot of functions are missing. So when you're working on a PySpark project, you hit this along the way, which is why I hope this makes your life easier.

I have one question. I'm just starting to explore PySpark; could you share some of your experience? What are the limitations of PySpark compared to classical scikit-learn and pandas, and when should one use PySpark? That's actually a very good question. The one-sentence answer is that there are many disadvantages. I'm not saying PySpark is not good; it's an open source project, and, how to say it, you can't compare it with pandas, because pandas and scikit-learn are so well developed; PySpark is not even ten percent as good. But in some cases you have no other choice: like I mentioned, you have too much data to fit on your laptop, so unless you have a very large HPC machine with a lot of RAM, you have to go to a cluster, and if you only know Python, not Scala, you have to use PySpark. I don't think SparkR is as good as PySpark either, so you shouldn't expect more from SparkR. That's the reason I finally ended up with PySpark. After working through all these obstacles, I feel that even though it's very hard to work with, you can understand the logic, because it's very different from other packages, for example the lazy evaluation. So you have to understand some basic building blocks of Spark, and then you can work your way step by step through your projects. One tip is to use other people's code; this is very important. Even for a simple step, you may have to find a lot of workarounds to accomplish what would be one line elsewhere. But be careful: if you eventually want to give up and convert to a pandas DataFrame, that's going to be another huge issue, because you'll blow up your whole memory.
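As an aside, a defensive pattern for that last point might look like this; the row limit is an arbitrary number I picked for illustration, not a hard rule:

```python
MAX_ROWS = 100000  # arbitrary safety limit; tune to your driver's RAM

n = df.count()
if n <= MAX_ROWS:
    pdf = df.toPandas()  # small enough: convert and work locally
else:
    # Too big to collect: work on a sample instead of killing the cluster.
    pdf = df.sample(False, MAX_ROWS / n, seed=1).toPandas()
```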
Because sometimes I just didn't want to spend any more time, so I converted to a pandas DataFrame, and then I received an email from my big data team: you have actually caused the whole cluster to stall because you're using all the RAM. That's the thing: the data is very large, you can't even store it in one pandas DataFrame, so there's no way to do that. But for small data sets, you can convert to pandas and do a quick calculation, which will make your life much easier; just make sure the DataFrame is very small.

How small is small enough? It depends on your RAM, and you have to negotiate with your big data team; it's all up to them, because it's not just you, there are many other projects sharing the same resources, so it's hard to say. And it doesn't depend on the number of rows; it depends on the size of the data, the gigabytes or terabytes. If you only have ten gigabytes, there's absolutely no reason to use this; don't get into trouble. If it's one gigabyte, well, modern laptops have sixteen gigabytes of RAM, so why not just use a laptop? Why go to all the trouble of using a big data platform? That's always the first question you have to ask yourself: why do I need a big data platform? Not because everybody is using it and everybody is talking about it; that's not a good reason. Think about your challenge, think about your data sets: do you really need it? Only then should you come to this platform.

How big was the size of my data? It was around 100 gigabytes for this particular task. And how long did the full notebook take to run? Actually quite long, because the resources were very limited; it was only a POC, so they didn't give us many executors, so it was not very fast. I think a few hours; sometimes it could take a few hours or not even finish. It depends on your resources.

Was I tuning the parameters? The parameters are one factor. Two hundred trees is already quite a lot; sometimes I had to use 50 for the number of trees because it couldn't finish. So one parameter you have to be careful with is the number of trees you want in the forest: the more trees, the longer it takes. Beyond that, I don't think it matters much, at least for my data sets, because the good thing, or the only good thing, about random forest is that it's easy and there's almost no parameter to tune, compared with XGBoost or other models. You just have to specify the number of trees, which is enough for many cases. The maximum would be something like 1,000 trees, but I don't think that's necessary. Was the consideration actually the lack of resources? Yes, that's one of the considerations, of course. Like I said, 200 is already the maximum I can afford; otherwise it's not going to finish within one day.

About the down-sampling: when I was doing the down-sampling, each time I got a subset of the data I ran one random forest, then I took another subset and ran another random forest. Oh, you're asking about the algorithm itself? The trees do bootstrap sampling out of your data set, so each time there are some samples left out, the so-called out-of-bag samples. But I don't think you can set any more parameters there; the random forest package is very basic.
If you go to the PySpark random forest documentation, you'll see that I don't think you can set that, because it's very basic. You can only set things like the total number of trees and the number of columns to sample from your total columns. Beyond that, you can check the random forest documentation, but I don't think you can do it.

What are some of the key steps that would help improve the accuracy beyond these sampling tricks? Yeah, that's a good question. If your business users or stakeholders are very curious about the accuracy and want you to push it up a little, then you really have to think about what could increase it, because this score is not very high. One way is to get more features. Like I mentioned at the beginning, the features we have are product features and customer features. This is a demo, of course, but it resembles many real-world cases. From the product side, you can get even more features, and for the customers, we only have the basics, the region, the job type, and the gender, so we can get more features there too: how frequently the customer comes to buy our products, how many years this customer has been loyal to our company. And beyond that, we can do a lot of feature engineering, for example a group-by on the customer to count how many products this person bought in total, or how many times they gave positive versus negative feedback. That would definitely increase your model's accuracy. But unfortunately it's very hard to create even one such feature in PySpark, so I didn't want to go into that area; I always try to avoid that kind of sophisticated work in PySpark. But of course, that would definitely increase your accuracy a lot.

In PySpark there are the ml and mllib libraries; which one am I using? Okay, I'm not very familiar with the difference, but I'm using ml, and I'm also using some tools from mllib, for example this one, the AUC score. So the model is from ml, and the feature engineering too. Honestly, I can't really tell the difference between the two; I think they should be very similar. Okay, I think that's it for my part. Any other questions? Okay, yeah, sure.

Do I have a problem saving the model? Saving? Oh, okay. No, I don't think so. But that's for other projects, right? For example, if you have a cluster of a hundred machines to run your random forest, eventually the model size will be very, very large, so it becomes a real concern how you carry the model along, since every time you want to make a prediction you have to load the model from disk. That's definitely something to worry about if you have a large cluster and a lot of data. My cluster isn't very large, which is why the random forest takes a few hours to finish, so I don't think saving the model is a big problem. I actually haven't tried reloading it; I just save the model on the server every time I retrain it, because training takes a few hours, and I don't reuse the model from disk. But I don't think it's a big problem. Of course, if you have a large cluster, which was the case we had previously, the model size is going to be very large.
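Coming back to that group-by feature idea for a moment, a minimal sketch of what I mean, assuming a hypothetical customer_id column that the mock-up data doesn't actually show:

```python
from pyspark.sql import functions as F

# Hypothetical extra features: per-customer purchase count and
# positive-feedback count, joined back onto the main DataFrame.
customer_stats = df.groupBy("customer_id").agg(
    F.count("*").alias("n_purchases"),
    F.sum((F.col("feedback") == "positive").cast("int"))
     .alias("n_positive_feedback"))

df = df.join(customer_stats, on="customer_id", how="left")
```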
Any more questions from anyone? Okay, good. Thank you for...