I'm going to share with you a few of the ideas and experiences behind what we did. Thank you very much for the introduction, Justin; I'm very glad to be here. This talk is a sharing of knowledge about what we have been doing with our customers. Sorry, can you hear me? Yes. Much better. Okay.

So employee attrition prediction is what we have been doing with one of our customers, and today I'm going to talk about how we approached this data science problem and how we used the programming language R to solve it. This is the outline of the presentation. First I will give you a very brief introduction, then show how to predict employee attrition in a scientific way, I mean in a data science way, and then walk through what we call an R accelerator. It's actually an R solution template that anyone can take and customize to solve their own problem.

We are with the Microsoft algorithms and data science team, and we are located all across the world, from the headquarters near Seattle to London; we are in Singapore, and we also have team members in Melbourne. The Asia Pacific team here primarily does customer engagements, providing data science solutions to our customers to solve real-world problems, and we also develop scalable tools and algorithms for high-performance analytics.

I'm not sure how many of you know data science, machine learning, artificial intelligence, and what the differences between these terms are, or whether you just don't know what I'm talking about. Can you raise your hands if you know data science? Oh, okay, great. And don't worry if you don't; there are plenty of people who still want to learn data science. Data science, in my own understanding, is something like this: you start with a business problem, you analyze the data, and you use machine learning techniques to solve that problem.

Let me give you a quick review with a very simple data set: the Iris data. I believe most of you have heard about it. It's very simple: you have species of iris, and based on a couple of the features you can classify the flowers into three different species. One question is: how can we, by analyzing this data, develop a model to predict, or to classify, new data with unknown labels, putting each observation into the right group according to its features? What we usually do with this kind of problem is use a machine learning algorithm; for example, here I show a decision tree. The decision tree will take new input data and classify it into one of the three iris species.
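For instance, here is a minimal sketch of that iris example in R, assuming the rpart package is available (the new flower's measurements are made up for illustration):

```r
library(rpart)

# Fit a decision tree that classifies species from the flower measurements.
model <- rpart(Species ~ ., data = iris, method = "class")

# Classify a new, unlabeled flower (hypothetical measurements).
new_flower <- data.frame(Sepal.Length = 5.9, Sepal.Width = 3.0,
                         Petal.Length = 4.2, Petal.Width = 1.5)
predict(model, new_flower, type = "class")
```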
But my question is: is that all? Is data science really just as simple as this? Actually, the answer is no. A real-world data science project involves a much more complicated process. I believe some of the data scientists here have heard about CRISP-DM, the cross-industry standard process for data mining, right?

Microsoft also has its own standard for the data science process, called the Team Data Science Process (TDSP). The advantage of TDSP is that it is collaborative, meaning it is designed for a team of data scientists, and it is iterative: you start from, let's say, a business problem, then go to data acquisition and understanding, and after some exploration and pre-processing of the data you move to model creation. When you create a model, you may find that some of the features are not so useful; then you have to go back to the business understanding part, talk more with your domain experts, and maybe do some other feature engineering. What I mean is that this is not unidirectional. There is a lot of back and forth, and you have to iterate multiple times in order to arrive at an optimal model. Once you have the optimal model, you can deploy it so that everyone else in your department can use it easily, through an API or whatever. So this is the typical workflow of a data science project.

Today I'm going to share a real-world use case: predicting which employees of a company will leave soon, by analyzing data about the employees. Let's say we have one person. We know his name, we know his age, and we know how many years he has been working in this company. On top of that, employees nowadays also post their own opinions, or just random chat, on social media, and that kind of information is also very useful for analyzing whether someone will leave the company soon. The motivation for predicting attrition is that the cost of losing an employee is very large, especially for big companies, and actually for small ones as well; they all have to care about this problem. That is why they look for solutions using machine learning and data science technology.

Just a quick question. Did you manage to quantify the dollar loss from attrition for every single employee?

Sorry, I didn't catch that.

You mentioned that it has a huge impact, a lot of losses. Did you quantify the dollar impact?

Yes, we have to do that, because machine learning or data science by itself can only give you the accuracy of the model. That is why I said it's very important to talk with the domain experts and also with the executives or management in the company, so that they can give you a success criterion: based on the accuracy of the model, they know how much they can save on attrition. Something like that. I'm not sure, did I answer your question?

No, I just wanted to know what range you discovered, like a $100,000 loss?

That will depend. I mean, this is not decided by the data scientists but by the customer, the company.

So did you actually link that to the dollars needed to hire and train a replacement? There could be a correlation in that sense.

There will be, but we don't have to do that. We just provide a model with a certain accuracy, and the customer handles that part themselves.

So you looked at how to reduce attrition, but not at the dollar impact of attrition. That's the client's side.
Yes, that is the client's side. All right. So before we do any machine learning project, the first thing is to understand the business problem well, and the second thing is to know what kind of data we have. If we don't have high-quality data, a lot of the time people will say that what we are doing is just garbage in, garbage out; it's nonsense, right? So if we want to predict employee attrition, we first have to analyze what kinds of factors, what kinds of data, we need to collect.

Here I have a list. The first column shows the typical reasons why an employee might leave. For example, sometimes a person just wants a career change because he has been in the company for many years without getting a promotion; so from that factor we can derive a feature such as the years taken for a promotion. This is just one example; I can't claim this is an exhaustive list of all the reasons, it's just for illustration.

We can put the data into two groups. The first is static data, meaning data that doesn't change or only changes in a deterministic way. For example, age changes every year, but only by increments; a name never changes; gender doesn't change at all. The other group is dynamic data, meaning data that changes randomly. For example, performance: every year the performance may be different. The values may be correlated in some way, but who knows; it can be regarded as dynamic data.

Another thing: after we identify the data we are interested in, we have to go to certain departments of the company to collect it. The HR department may be one source of data. The IT department also has data that may be useful. Your direct manager knows you very well, so things like performance reviews can sometimes be used as well. And from social media you can collect the public posts of employees and analyze those posts in order to predict attrition.

What's next? After we have the data, what do we do? A lot of the time people just go straight to model building, but that is actually not a very good practice. Doing some simple statistical analysis and visualization of the data first is very important. For example, here I show a sample data set. The first chart shows the percentage of employees who have left the company versus those who have not, with the data grouped by job level. The second shows the correlation between job level and monthly income. Before this, we just intuitively thought these two factors might be correlated; so let's visualize them, plot the two together, and see whether there is some correlation between monthly income and job level. The labels here, which I didn't explain, are 'yes', meaning this person has left. We can see that most of the people who have left the company are distributed here, at low income and comparatively low job level.
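As a rough sketch, the two charts could be produced with ggplot2 along these lines, assuming the IBM sample data is loaded as a data frame `df` with columns `Attrition`, `JobLevel`, and `MonthlyIncome` (these names are assumptions):

```r
library(ggplot2)

# Share of leavers vs. stayers within each job level.
ggplot(df, aes(x = factor(JobLevel), fill = Attrition)) +
  geom_bar(position = "fill") +
  labs(x = "Job level", y = "Proportion", fill = "Left company")

# Monthly income against job level, colored by the attrition label.
ggplot(df, aes(x = factor(JobLevel), y = MonthlyIncome, colour = Attrition)) +
  geom_jitter(alpha = 0.5, width = 0.2) +
  labs(x = "Job level", y = "Monthly income")
```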
So this is very useful information: it gives us some confidence that these two are important factors we can use for prediction. After we have identified the initial factors, we have to develop a framework in which we can build the model from the data. In the framework here, we have, for example, the data of an employee for the past M months. This M is a variable; it can be anything, depending on the requirements of the real case. The static data and the dynamic data are aggregated into the columns of the data set, and these become the features for creating the model. What we are then going to predict is whether this person will leave the company in month M+1. The plus one is because most companies have a notice period, so we have to take that into account. This is the basic framework for predicting employee attrition.

After constructing the framework, we have to decide on feature extraction: which features from the original data will be used for the prediction. There are many techniques for extracting features from employee data. For example, simple statistics such as the max, the mean, and the standard deviation. Also, since we are analyzing the historical behavior of employees, we can form the data into a time series and apply trend analysis or a time series model to extract features. And, as I mentioned, a lot of the time we also have text data, and we can apply text mining techniques to deal with that kind of data. Once feature extraction is finished, we do feature selection: we select the most salient features, the ones most correlated with the label we are going to predict, because often not all of them are useful.

After that comes the core part of the whole process, which is model creation. Our problem is a supervised classification problem: classify whether this employee is going to leave or to stay. It's a binary classification, and for binary classification we have several choices, for example logistic regression, support vector machines, decision trees, or other algorithms; all of these are very popular machine learning algorithms for this kind of task. Also, to improve the overall performance of the prediction, we can create an ensemble of the basic models in order to get a boosted performance. There are several commonly used ensemble techniques: bagging, boosting, and stacking. A very important practice in data science is how you select the best algorithm, how you build the ensemble to improve performance, and how you fine-tune the parameters. This is really more like an art than a science, because it depends not just on the algorithm itself but also on the characteristics of the data. A lot of the time what we do is take a selection of algorithms, do a grid search, and find the optimal one; or sometimes we put them together to form an ensemble and get better performance.
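As a minimal sketch of what such a grid search can look like with the caret package, assuming a prepared data frame `df` with a factor label `Attrition` (the names and grid values are illustrative, not the exact ones from the engagement):

```r
library(caret)

set.seed(42)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# Grid-search the number of variables tried at each split of a random forest.
grid <- expand.grid(mtry = c(2, 4, 6, 8))
fit <- train(Attrition ~ ., data = df,
             method    = "rf",
             metric    = "Accuracy",
             trControl = ctrl,
             tuneGrid  = grid)

fit$bestTune   # the mtry value that won the cross-validated comparison
```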
After we have created the model, we have to validate whether it has the required performance. For a classification problem we normally create the confusion matrix, and from the confusion matrix we can compute metrics such as precision, recall, and the F1 score, or, if you want, you can calculate the area under the ROC curve; all of these can be used to judge whether the model is good or not.

So far I've shared with you how we approach the data science problem of predicting employee attrition. Now I'm going to present our R accelerator. Why do we call it an accelerator? A lot of the time, when we work with our customers, the very beginning of the project is: can you do a proof of concept for us? With some sample data set, can you show me that this is going to work? It's not, okay, let's start with the big project and put all of my available data onto Spark or some other big data engine; it's not like that. Everything starts small, and that is why we have the so-called R accelerator: to accelerate the proof of concept, to accelerate the process of developing a data science solution, and to help either the customer or a machine learning learner understand the business problem and the typical workflow of a data science project. The R accelerator is very lightweight; it's small, you can easily adopt it, and you can easily adapt it to your own problem. It also follows the Microsoft TDSP format, which means you can organize the project following the recommendations of TDSP, and it's very easy for prototyping, presenting, and documentation.

How much time do I have? Five minutes? Right, so I will not go into the details of the code here because...

Please do, that's interesting. I'll give you ten more minutes.

Okay, let me try my best, because I'm hungry; I'm not sure about you guys. This is the walkthrough of the R accelerator for predicting employee attrition: exactly the same thing I described just now, but as an implementation, showing how we do it with R. I'm not sure how many of you know R or have any experience with it. Okay. What I'm going to show is a lot of code, very technical stuff, so I will just try to talk you through it.

This is my session info: my working environment and the packages I'm using for the problem. I used two data sets because, due to confidentiality, we cannot share customer data. That is why I used public data: the employee attrition data is from IBM, and the text data is from Glassdoor. It's just review comments, and you can easily find it on the website.

First of all, as I said, we need to do some preparation to make sure the data is ready for model creation. What we do in data pre-processing is, for example, handle NAs, remove zero-variance variables, do some normalization or scaling of the data, and sometimes some data type conversion as well. These are the initial steps to prepare the data for later use. Here, what I show is converting some of the columns in this data set to factors.
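A minimal pre-processing sketch in that spirit, assuming the IBM data is loaded as `df` (the column names are assumptions):

```r
library(caret)

# Drop rows with missing values.
df <- na.omit(df)

# Remove (near-)zero-variance columns, which carry no information.
nzv <- nearZeroVar(df)
if (length(nzv) > 0) df <- df[, -nzv]

# Type conversion: make the label and ordinal columns factors.
df$Attrition <- as.factor(df$Attrition)
df$JobLevel  <- as.factor(df$JobLevel)

# Center and scale the remaining numeric columns.
num_cols <- sapply(df, is.numeric)
df[num_cols] <- scale(df[num_cols])
```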
Another very important thing: the sample data I'm using here is, how should I say, a clean data set. It's not like what we see in real-world projects; with the actual customer, the raw data was very messy, and you have to do a lot of aggregation and a lot of feature engineering in order to get data that is ready for use. But here the sample data set looks very nice: the columns are well organized, so we don't have to do the feature engineering and can use it directly.

We still have to do feature selection, though. The problem is that this is mixed-type data: not just numerical but also categorical, so we cannot simply do a correlation analysis. A workaround is to create a model, and that model can help us find which factors give the best performance. This is the feature selection part. You can see that after we create the model, we get a ranking of the variables in the data set, and based on some criterion we can select the top several variables for creating the model. Let's say we just remove the last three; the remaining variables are the ones we use for model creation.

The next step, as you probably know, is to partition the data set into two groups: a training set and a testing set. The training set is used for training the model, and the testing set is used to validate it. So first we do this. But one thing worth mentioning is that the training set is not balanced, and an imbalanced data set creates a lot of problems for model training. I won't explain this in detail, but the usual tricks for the class imbalance problem are either cost-sensitive learning or resampling the data. The second one is more straightforward: we can up-sample the minority class so that the two classes are roughly balanced. There is a very popular technique for this called SMOTE, which does the resampling and rebalances the data set so that we have a better training set. You can see that after we apply SMOTE to the training set, we get a more or less balanced set to train our model.

Then we do the model training. Here I'm using the caret package; there are many other choices, but this is one of the most convenient packages for training a model, and these are the training controls. For a comparison analysis, and purely as an illustration, I use three different algorithms: support vector machine, random forest, and extreme gradient boosting, which is very popular nowadays. Besides the basic models, I also build an ensemble of the three. One of the ensemble methods I introduced just now is stacking of the basic models. Let's say we have some models M1, M2, up to Mn; the idea of stacking is to have a meta-model at the final stage of the overall model, so that after the basic models give their prediction outputs, the meta-model takes those predictions as inputs and produces the final prediction.
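A sketch covering the rebalancing and stacking steps just described, assuming a split already exists as `train_set` / `test_set` with a factor label `Attrition` (levels "Yes"/"No"), and that the DMwR and caretEnsemble packages are installed; the method names and SMOTE proportions are illustrative:

```r
library(DMwR)           # provides SMOTE
library(caretEnsemble)  # provides caretList / caretStack

set.seed(42)

# Rebalance the training set by synthesizing minority-class examples.
train_bal <- SMOTE(Attrition ~ ., data = train_set,
                   perc.over = 200, perc.under = 200)

# Train the three base learners under a shared resampling scheme.
ctrl <- trainControl(method = "cv", number = 5,
                     savePredictions = "final", classProbs = TRUE)
base_models <- caretList(Attrition ~ ., data = train_bal,
                         trControl  = ctrl,
                         methodList = c("svmRadial", "rf", "xgbTree"))

# Stack them: a logistic-regression meta-model takes their outputs as inputs.
stack <- caretStack(base_models, method = "glm")

# Validate on the held-out testing set.
pred <- predict(stack, newdata = test_set)
confusionMatrix(pred, test_set$Attrition)
```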
So that is the basic idea of a stacking ensemble, and a lot of the time it is very useful for boosting the overall performance; intuitively, you can think of it as leveraging the strengths of the different algorithms. Here I build a stacking ensemble on top of the previously trained models, the SVM, the random forest, and extreme gradient boosting, and I use a meta-model to create the overall ensemble. After training, I use the testing set to validate the performance of these models. As I said before, we can get the confusion matrix of the validation results and calculate accuracy, recall, and precision, and here I also compare the elapsed time.

The funny part is that the stacking did not actually improve much on extreme gradient boosting and random forest. What might be the reasons? If you are doing a stacking ensemble, one very important thing to bear in mind is the diversity of the basic models. If the models are all the same, stacking will not improve anything at all; your models should be diverse. Another thing is that the data size, or more generally the data characteristics, also matter a lot for the performance of a stacking ensemble. Here my data set is not so big, just about 1,400 rows; I only have three models, and extreme gradient boosting has probably already done most of what can be done, so there is not much room left for the stacking ensemble to improve. That is why you see the improvement in performance is very limited.

Usually how much data would you require? How many rows?

It really depends, but normally it might take millions of rows before you see a remarkable change in performance.

But specifically for this model evaluation, what would be your recommended data size, the number of rows?

For what we have been doing with customers, the data sets can be millions of rows; they can be very big. With the sample data it is tens of thousands of rows, and at that scale you can also see improvement in the models.

What is the accuracy here? Is it the mean squared error, or R squared, or an F test?

Accuracy you can calculate directly from your confusion matrix. It is not like mean squared error; that is for regression.

At which point does the accelerator come into the picture? Because these are mostly just R commands. What about the accelerator concept?

You mean the accelerator, right? What I am presenting right now is the accelerator, actually, because I am giving a presentation, but you can check the GitHub repository: it is written in R Markdown, and there is an introduction to the business problem and a walkthrough of the data science techniques, why we do this and why we do that, so people can easily understand it. The most important part is that the accelerator can be used to generate other artifacts, like pure code, or PDF and HTML, so that you can easily present and distribute the idea to your colleagues or collaborators. That whole thing is the accelerator, not just the code I am showing here.

So what is it, a guideline for creating models?
Something like that; it is a how-to guideline.

On your page about visualization: we did a lot of data visualization six or seven years back, on static as well as real-time data, and we added predictive, prescriptive, and descriptive modeling. How different is this compared to that? How different is your visualization compared to Tableau and QlikView? Because those give me more accuracy, down to one part in a million, when I visualize the data.

Tableau is commercial software, and it is specifically for visualization; visualization is only one of the points here, and as far as data science is concerned there is much more. What I have done here is use ggplot2, which is one of the R packages for data visualization. What I want to say is that when you do data science in R, or in Python as a lot of other folks do, you want to immediately visualize your data in your working environment. Say you are developing some code and you want to see the correlation in the data you just collected: you can do that right there in R, whereas, I'm not sure whether there is any interface between Tableau and R, but going out to another tool would be a little bit tricky.

No, that's not what I'm trying to say. My point is that it gives me much more precision and much more accuracy. As a matter of fact, I don't have to go through all of this: I use descriptive modeling and combine it with Tableau and QlikView, and it gives me much better accuracy. I also do a lot in R and Python as well. My question is: how different is it, or where exactly are you saying you are different, if at all?
Actually, I'm not trying to say that this is different from Tableau. I mean, this is just how we do data visualization in R. You can definitely also go with Tableau; you'll get different trade-offs. It also depends on the algorithms implemented: Tableau may have some algorithms, but not many tools have XGBoost at the moment, and XGBoost is probably one of the most accurate classification algorithms available, probably surpassing much of what Tableau offers. I haven't used Tableau, though.

The new one? Yeah, all of the vendors are competing to improve things little by little, and you're getting pretty marginal improvements, and often the variance in the performance of the model wipes out the actual difference between the different vendors. What's the variation of the performance values, I would ask, and it all depends on the data.

Sorry, but I have just five minutes left to talk, so let me finish first, and after that maybe we can have a discussion.

The next thing I would like to share: just now I showed how to deal with the employee HR data; now another important thing is how we can use the text data to get a sense of the sentiment of the employees, so as to predict whether they are going to leave. Here I also collected data from Glassdoor, and you can see these are employees' review comments on their company. For text mining for sentiment analysis, we basically follow the steps here. First we do some initial transformations; by that I mean we remove unnecessary elements such as stop words, numbers, and other tokens that carry no useful information in the text. Then, because a lot of large companies are multinationals with employees in different countries and locations using different languages, a very important step is to align the text in different languages into one single language, maybe English, so that we can reduce the number of terms; if I say "ni hao" in Chinese, it means the same thing as "hello" in English. After the initial transformations, we can create a bag-of-words model and count either the term frequency per document or the term frequency-inverse document frequency (TF-IDF). It's a little bit involved, but these are ways to convert text data into numerical data, and numerical data is what machine learning algorithms are friendly with.

Now let me show you how we can do that. This package is called the tm package; it's an R package which is very, very popular for text mining. What I have done is simply follow the steps introduced just now: do the transformations and convert the original text into a document-term matrix. After that you will see something like this. Each of the docs here represents an employee, and the terms you can see, 'company' and 'google' (that's the company name), 'great', 'people', 'smart', 'work', are the term frequencies for each employee. We can use this as the data set, the feature set, for prediction. After that, it is very similar to what we did with the other kinds of data: we attach the attrition labels to the term-frequency data set.
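A minimal sketch of those preprocessing and document-term-matrix steps with the tm package, assuming the Glassdoor comments are in a character vector `reviews` (a hypothetical name):

```r
library(tm)

corpus <- VCorpus(VectorSource(reviews))

# Initial transformations: strip what carries no useful signal.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Bag of words: plain term frequencies, or TF-IDF weighting instead.
dtm       <- DocumentTermMatrix(corpus)
dtm_tfidf <- DocumentTermMatrix(corpus,
                                control = list(weighting = weightTfIdf))

# Each row (one employee's review) becomes a numeric feature vector.
features <- as.data.frame(as.matrix(dtm))
```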
Then we partition the data into training and testing sets and create a model; here I just use an SVM, and then do the prediction and validation, and here are the results. These parts are pretty much the same as before. The different part is that when we deal with unstructured data like text, we have to do some initial feature engineering that is specific to this kind of data; for text, that means the initial transformations, then the conversion from text to vectors, and then model creation. This is the general workflow of text mining in a data science project.

So here is the conclusion. In today's talk I have shared with you how we do a data science project for a real-world problem. What I want to emphasize is that feature engineering is very, very important, and it consumes the majority of the time in a project. I have also shared some techniques that are commonly used for model creation and model validation, and shown how we can do sentiment analysis on text data so as to predict whether an employee has negative opinions about the company. All of the resources are available on GitHub; here is the URL, including the presentation. The presentation is actually generated from the R code as well, so you can run it yourself. These are the references for the talk, and this is my contact. If you have any questions after the talk, you can approach me and we can have a discussion as well. Okay, bye.