Hello everybody, thank you for coming to this presentation. I'm going to talk about an application of data science that is pretty common in real life: how to build a classifier when you are faced with an imbalanced data set. I work for IBM as a principal engineer and have been there for around three and a half years. In the past I developed machine learning algorithms for scalable big data platforms, and recently I've been developing data science use cases and applications and doing some research as well.

So what's our agenda? We're going to talk briefly about IBM and its commitment to data science and what IBM is doing. The second, most important question is what you're going to take home from this talk, so we'll spend a little time on that, and then on the tools and technology we're going to use. We're going to address three main issues: how to build a scikit-learn-based ML pipeline; what the various model evaluation techniques are, especially when you have a data set that is highly imbalanced; and third, we're going to build a real model in a Jupyter notebook — if we have time, we'll spend five or ten minutes on that at the end.

With that, let's move on to the slides. IBM has been investing a lot in open source and data science. Our group used to be called STC, the Spark Technology Center, but it has since been renamed CODAIT. We run a bunch of projects around Spark, data science, and TensorFlow, and we also created the Model Asset Exchange, which contains a set of models that are very helpful for real-life applications. IBM has built both software and infrastructure for data scientists: we give you five free machines in our cloud with, I believe, close to ten gigabytes of space, and you can keep it forever. With that, I'm going to show you one video about IBM's commitment.

[Video] Making data simple and accessible to everyone gives it meaning. We've interviewed hundreds of data scientists to understand not just what they do, but how they think, how they learn, how they build on the work of others, and how they get feedback to improve their output. Step one is getting the right tools together in one place with the data. We started with the open-source tools that are familiar to so many — Jupyter notebooks, RStudio — and we're working on incorporating many more. But data scientists do much more than work with data.
They research previous efforts in the field, gather assets from searches, use tutorials to gain new insights, and exchange ideas with their peers. We're sharing the first steps of our journey: designing what will soon be the easiest way to do great things with data. Here's what it looks like. When you land on the homepage, you have immediate access to tools and data — data from your organization, a marketplace of both free and premium data, and data stored on your machine. Filter the results to get to the content you really care about and that relates to your work. If you find a notebook that is related to your work, copy it into your project and give credit to the originator of the work. Here's where it gets interesting: the right side panel, which we call the maker palette, lets you drag and drop code snippets and algorithms from articles, comments, or projects right into your notebook, and lets you watch tutorial videos while working on a notebook without opening multiple windows from various websites. This data space is extremely powerful, not just because of the easy access and analysis of data — although that capability is fundamental — but because of the power of a community: the ability to easily learn from others, work with peers, and discover work that moves the field forward. We can't wait to see what this becomes and what amazing work it enables. [End of video]

So let's get into the meat of this presentation: what we're going to do in this talk. There are many challenges we face in building a machine learning model. Somebody else collects our data set; they may not collect all the details, the data may be corrupted, or it may contain insufficient information. People also sometimes add extra features just to be safe, so there are a lot of irrelevant columns. We're going to see how to take care of those things, and of course, once you build a model there is the standard problem of overfitting and underfitting, and we'll see how to make the model generalize well.

Let's start with one story that's very common — we have a lot of data scientists here, so many of you may be familiar with it. Your boss or your customer comes to you and says, "Hey, I've got this data set. I want you to build a classifier to predict whether somebody is going to buy our product or not," and you think, "No problem, I'm going to build it." You follow all the standard rules: you build a machine learning model, you do feature engineering, you do cross-validation, and you follow all the tricks and techniques from your books and tutorials. But when you run your model on the test set, it doesn't perform well, and you don't know why. Then you explore a little more and realize that the number of positive samples is very small — in this case we're talking less than one percent, around 0.001. There are many applications where this happens. For example, if you're building a model to detect whether somebody is going to click on an advertisement, in reality less than 0.01% of people click. What happens is that when the machine learning model learns from such a data set, it treats the positive samples as a kind of noise — there are so few of them that it mostly learns from the negative samples. It's like a five-year-old child: if you always teach them A-B-C-D but teach them one-two-three-four only once, do you expect them to know one-two-three-four? Of course not — and it's a similar situation here.
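As a first step in a situation like this, it helps to quantify just how imbalanced the labels actually are. Here is a minimal sketch using pandas; the file name, the DataFrame `df`, and the `target` column are hypothetical placeholders for whatever data set you are handed.

```python
import pandas as pd

# Hypothetical data set: 'target' is 1 if the customer bought the product, 0 otherwise
df = pd.read_csv("customers.csv")

# Absolute counts and class proportions of each label
print(df["target"].value_counts())
print(df["target"].value_counts(normalize=True))

# A quick imbalance ratio: majority count divided by minority count
counts = df["target"].value_counts()
print("imbalance ratio: %.1f : 1" % (counts.max() / counts.min()))
```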
So we're going to look at how to mitigate those problems and solve them. In this workshop-style talk, first we'll explore a data set and see whether it really is an imbalanced data set or not. Then we'll use various pre-processing techniques, such as one-hot encoding, to clean up the data. Then we'll naively make our first attempt at building a model, analyze it, and see that it doesn't perform well. Then we'll look into three or four techniques to make it better — we will actually improve it from 33% recall all the way to close to 90%. At the end we'll talk about the SMOTE algorithm and oversampling techniques.

What tools and technology are we going to use? This Jupyter notebook is built on scikit-learn, but you can definitely use R or Spark or any other big data tool — these are just techniques, and they're applicable anywhere. I believe most big data libraries like Spark don't have as many useful libraries for this as R or Python, but it depends on your use case. We have chosen XGBoost; we could have used logistic regression or anything else. It simply illustrates the concepts well, because XGBoost allows a lot of parameter tuning, and if you look at Kaggle, a lot of competition winners used XGBoost (nowadays they are using deep learning to win competitions). It also integrates very seamlessly with the Python and scikit-learn way of doing things — estimators and ML pipelines — so it was a good choice.

I think most of you are familiar with what a tree-based machine learning algorithm is, so I'm only going to go over it very briefly; there's a very detailed tutorial on the official website, with links. What actually happens is that you want to predict something, so you get a data set and you build a tree. In this example we have a data set of a bunch of people, our features are age, gender, and occupation, and we want to see whether they are going to play computer games or not. You can build a tree that says if you are from the younger generation you are more likely to play computer games, and a male is more likely to play a computer game than a female. But that's one tree — what if there is another possibility? Instead of splitting first on age, what if I split on something else? Here's an example of a second tree, where we split on whether you use a computer every day or not. What this shows is that when you build a tree, you can build it in many, many ways. So which tree do you use? There are multiple approaches: in the random forest approach, you build multiple trees on subsets of the features and then create an ensemble model. XGBoost is a little different. Look at this equation: instead of building many independent trees, you build the first tree, then you figure out what your error is, and you build the second tree only on those errors; the third tree is built on the errors left by the first and second trees, and each subsequent tree on the errors of all the previous ones. What this allows you to do is that instead of having n cooks, you pick one cook and keep refining until you get good enough performance. That, in short, is the whole idea of a boosting tree. There's a very nice tutorial from the creator of XGBoost, Tianqi Chen, who developed it at the University of Washington — you can look at the whole tutorial, but this is the basic idea.
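To make the "scikit-learn-based ML pipeline" idea concrete, here is a minimal sketch of the kind of pipeline the talk is describing: one-hot encode the categorical columns and feed the result into an XGBoost classifier. The file name, column names, and parameters are hypothetical placeholders, not the exact notebook code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

df = pd.read_csv("customers.csv")            # hypothetical data set
X, y = df.drop(columns=["target"]), df["target"]

categorical = ["gender", "occupation"]        # hypothetical categorical columns
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",                  # keep the numeric columns as-is
)

pipeline = Pipeline([
    ("prep", preprocess),
    ("clf", XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))         # plain accuracy, revisited below
```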
So now let's look at what data scientists typically do and how they build their models. First they read the input, then they do exploration just to get a feel for the data — what's happening and what they want to do. Then of course you clean the data: convert strings using one-hot or numeric encoding and things like that. Then you do feature engineering and build a model. Model building is usually a very iterative process: you create a hypothesis, you build your model, you test it — if you like it, good, you move on; if you don't, you keep iterating. In the end you test your model on data it has never seen before, so you can really see how good it is.

Before we build a model: as I said, model building is an iterative process, and when you're iterating you need some kind of measurement criterion to decide which model is better — just as we have an education system with grades A, B, C, D to say one student is doing better than another. In the model universe there are multiple metrics you can use. It's not that one metric is better than another in general; it's that for a particular use case, one metric is more appropriate than another.

Probably everybody knows the accuracy score: you build your model and count how many samples it correctly classified, divided by the total number of samples. It's a very simple score, and it works perfectly fine in a fairy-tale world where your data set is balanced — the number of positive samples is approximately the same as the number of negative samples. But there are cases where you cannot rely on the accuracy score, because it only cares about how many things you classified correctly overall.

So there is another tool, called the confusion matrix. The confusion matrix just says this: your data set has actual positives — say you want to detect whether a transaction is fraudulent or not, and that is the ground truth. Your model may say, "Hey, this is actually fraud," and it is — that's a true positive. But your model may also say, "No, that's not positive," when the sample originally was positive — your model says negative when the truth is positive, and that's called a false negative. Similarly, there are definitions for false positives and true negatives.
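To see why accuracy alone can be misleading on imbalanced data, here is a small sketch (continuing the hypothetical pipeline above) that compares accuracy with the confusion matrix. A model that predicts "negative" for almost everything can still score a very high accuracy while catching almost no positives.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = pipeline.predict(X_test)

# Accuracy looks great when 99% of the labels are negative...
print("accuracy:", accuracy_score(y_test, y_pred))

# ...but the confusion matrix shows how few positives were actually caught.
# Rows are the true classes, columns the predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))
```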
Your goal is to maximize the green parameters (true positives and true negatives) and minimize the red ones (false positives and false negatives). But the confusion matrix also has shortcomings, especially when the data is imbalanced.

So there is another metric, the receiver operating characteristic (ROC). It actually comes from the old army days, when radar operators had to detect enemy ships. What it does is plot the true positive rate against the false positive rate: on one axis you plot the equation at the top of the slide, and on the other axis the equation at the bottom. The key idea is that it measures your model's ability to detect a positive sample as positive and a negative sample as negative. Why is that important? Especially when you're building a medical model — say, cancer prediction — you want to detect all the positive samples, but you don't want to tell somebody they have cancer when they don't, because their life would be emotionally destroyed. So you want to balance true positives and true negatives, and that's one of the use cases where you want the receiver operating characteristic.

A little more detail on the ROC curve. On the top left-hand side of the slide we have two distributions, one for the true positives and one for the true negatives. We know that in real life there is no single definite answer; there is a gray area, and in machine learning we usually represent that gray area with a distribution — a Gaussian distribution is very common. If you look at the curves in the top left-hand corner, they tend to overlap, and you may notice there is a vertical line. By moving that vertical line, I can change the definition of positive and negative. What that means is: say you did a blood test or some medical test and your doctor says, "If I put the threshold at 0.5, your test results say you likely have cancer. But 0.5 is not a good threshold, so I'm going to reduce it a bit to 0.3" — and then your probability of having cancer is much lower. This ability to tune a model by changing the threshold from the default 0.5 to 0.3 or 0.7 is a very powerful technique, because with this tuning parameter you can adapt your model to specific use cases. The ROC curve is very commonly used for medical questions, especially tests related to your chances of having a certain disease. When you move the threshold, the black dot in the bottom figure moves around, and your goal is to move that dot in such a way that the area under the curve is maximized. That is the key idea of the ROC curve.
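The threshold-moving idea from the slide is easy to try in code. Here is a minimal sketch, again building on the hypothetical pipeline above: instead of the default 0.5 cut-off on the predicted probability, we pick a different threshold and see how the confusion matrix changes.

```python
from sklearn.metrics import confusion_matrix

# Probability that each test sample belongs to the positive class
proba = pipeline.predict_proba(X_test)[:, 1]

# The default .predict() behaviour is equivalent to a 0.5 threshold
for threshold in (0.5, 0.3, 0.1):
    y_pred = (proba >= threshold).astype(int)
    print("threshold", threshold)
    print(confusion_matrix(y_test, y_pred))
```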
So you are comparing two small number So when you plot these two things on your visually, you may be able to notice a difference between them because these numbers are very very small But let me go back into this equation in this equation One equation is TP and the second numerator of a second equation is TN So in an imbalance data set TN gonna be really really huge Let's say you're trying to predict if your customer gonna buy your product or not True negative gonna be like 990 out of one sample means those people are not gonna buy your product only 10 people gonna buy it. So when you plot number 10 and 990 you may not be able to notice a small changes happens in the true positive Because it will be drawn by the bigger picture on the which is TN But that problem doesn't exist with the precision recall curve So typically you would like to use precision recall curve for the cases where you have the imbalance data set And that is a key idea. We started from accuracy curve confusion macros ROC curve and precision recall. It's nothing like one is better or something It's a use cases depending upon your business use case. You want to use one the other Now this graph illustrate one of the key point Precision and recall have a inverse relationship means you want to make something highly precise Where you call gonna be go down. So then you again in dilemma, which one to optimize precision or recall so then Engineers and data scientists, they are like more resourceful people they come up with that Hey, we want to optimize two things. Why not just take a harmonic mean of both of them? So that's what we call F1 score But in our use case F1 score may not be the best thing to do and why is it so? because as we notice precision and recall gonna have a inverse relationship and we already proved that in our imbalance data set precision recall is much better. So Do you wanna build your model on the matrix of precision or do you recall the answer is you have to go to your business person? Whoever is your client or your you know manager or whoever gave you the problem ask them Where you are building this model for I can give you an example here So there's a lot of explanation, but in the interest of time. I'm gonna just you know talking a little bit higher level So let's say you are building a machine learning model for Net 90 means it is the ability to block bad website for your kids You know like like who are young so in this case, what's your goal? You definitely don't want your kid to see even one bad website And you don't care if your machine learning model say that your CNN website is bad But as long as your machine learning model detect all the bad website, you are happy So in this case, you really really want high precision and you don't want a high recall So if you are building this use case you want to choose high precision because you cannot Optimize precision and recall at the same time, but look at the another example. Let's say You are trying to predict if some of the customer gonna buy from your shop or not Or maybe you have some financial product. You want to make sure a customer do buy it or not What is your goal there? The goal is that hey, I want to predict some customer whether he gonna buy or not If I predict he gonna buy I'm gonna give you incentive like 20% off or things like that But what if my prediction is wrong? It's okay I predicted this customer gonna buy from my shop and he didn't buy but that's okay But I was able to reach out to all the potential customers. 
Once we understand these metrics, it becomes a simpler question: how are we going to optimize our model? So let's go into the model training steps. As I said, model training is a pretty iterative process: first you pick a model based on certain criteria and certain input features, you train it, you evaluate it using your metric — in our case we're going to use recall, because we are predicting whether a customer is going to buy your bank's product or not — and then you iterate again and again.

In the first attempt we will naively build our XGBoost model, and we'll see that its performance is not good. Then, what can we do to improve it? We know that out of 1,000 samples (in reality you may have a million samples, but let's assume 1,000) only ten are positive. The obvious approach is this: since your machine learning model is treating those ten samples as noise, why not increase their weight? Instead of giving every sample equal weight, give the positive samples a higher weight. You can try weights of 1, 10, 100, 1,000, and 10,000, run your cross-validation, and look at the performance; once you see the performance, you can choose the right amount of weight. That is what we call the weighted-class classifier strategy. This is what the plot looks like for the different weights — the font is probably too small, but we tried weights from 1 all the way to 10,000. In this case, since we are trying to optimize recall but we don't want to make our precision really bad, instead of taking the weight of 10,000 we chose 1,000, because we also have to take care of the overfitting problem.
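In XGBoost this kind of weighting is exposed through the `scale_pos_weight` parameter (many scikit-learn estimators have an analogous `class_weight`). A minimal sketch of the weight search described above, scored on recall, might look like this; the candidate weights and the pipeline are the same hypothetical ones used earlier.

```python
from sklearn.model_selection import GridSearchCV

# Give the rare positive class progressively more weight and cross-validate,
# scoring on recall of the positive class.
param_grid = {"clf__scale_pos_weight": [1, 10, 100, 1000, 10000]}

search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
# Inspect precision as well before committing, so recall is not bought at any cost.
```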
The second approach to solving this problem is minority oversampling and majority undersampling, which is related to the weighting idea. Not every machine learning model or library lets you set a class weight as a tuning parameter — if your implementation of, say, logistic regression doesn't expose a weight parameter, you have to do it manually with the data itself. You have only ten samples which are positive, so you just duplicate them, say ten times over — instead of ten samples you get a hundred. And because you have 990 majority samples, you can undersample them down to around 300. If the two classes end up roughly close to each other, your classifier tends to perform pretty well.

So you get a pretty good result, you build your model, your boss is happy, and all that. But your boss comes back the next week and says, "Hey Joe — or Mr. X, Y, Z, whoever — your model really works, but our competitor is doing the same thing. We have no competitive edge. What can you do to get that little bit extra and beat the competition?" That's where the paper by Chawla and others comes in; they call it the Synthetic Minority Over-sampling Technique, SMOTE. Intuitively, when you just copy a sample it's a bit like cheating, so what can we do instead to correct the class imbalance problem? It's a very nice paper — please read it — but I'm going to explain it visually in simple terms. The big blue circles are the majority samples and the smaller red circles are the minority samples. What we do is choose two minority samples and create a new synthetic sample by interpolating between them. Please note that the interpolation is not always in the middle: you generate a random number between zero and one and pick somewhere in between. The way I like to think about it is that sometimes one baby looks more like mom and another looks more like dad; babies are an interpolation between mom and dad — they don't get one hundred percent from one parent, and you cannot guarantee the mix. It's the same idea: you randomly choose what fraction comes from sample one and what comes from sample two.

It may turn out that near the decision boundary a synthetic minority point lands among the big blue majority circles, so you get a kind of hybrid, and you don't really want to create synthetic samples that look like the majority class. But that's okay, because on the decision boundary there will always be a few mistakes; even if you create a synthetic sample that is a hybrid of red and blue, there won't be too many of them. You could just blindly interpolate between every pair of minority samples, but that turns out to be computationally too expensive: if you try to connect all the possibilities, it's all n-choose-2 pairs, an n-squared kind of thing. So what we do in practice is, instead of trying all those pairs, we only look at the k nearest neighbors: you draw a circle around a minority sample, look only at the minority samples within that circle, and create synthetic samples from those. Once you do that, you definitely have a competitive edge over the competitor who just copied his minority samples. That is the key idea.
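In practice you rarely implement SMOTE yourself; the imbalanced-learn package provides it. Here is a minimal sketch, assuming imbalanced-learn is installed and reusing the hypothetical encoder and splits from earlier — SMOTE interpolates numeric feature vectors, so resample after encoding and only on the training set.

```python
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Encode first, then oversample only the training data
X_train_enc = preprocess.fit_transform(X_train)
X_test_enc = preprocess.transform(X_test)

# k_neighbors controls how many nearby minority samples are used for interpolation
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_enc, y_train)

clf = XGBClassifier(n_estimators=200, max_depth=4)
clf.fit(X_resampled, y_resampled)
print(clf.score(X_test_enc, y_test))
```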
With that, let me show you a little of the Jupyter notebook — I think I have about ten minutes. I may not have time to go into detail on everything, because it's a really big notebook, around 50 to 60 pages if you print it out, but the key ideas have been explained in this presentation. The notebook is on the IBM GitHub and the slides are on my personal home page — I'm going to upload the latest slides in maybe half an hour — and you can run it on your laptop or in IBM Watson Studio.

In the notebook we load the data set and establish that it really is an imbalanced-data-set problem. Then we explore a bit more: we do correlation plots and a heat map, and we find that some of the columns are correlated. That's okay here, because XGBoost is pretty robust to correlated columns. But if your company policy, or your CTO or head of data science, says "We don't want to use XGBoost, just use a linear model like a support vector machine or logistic regression," then you have to be very careful: if columns are correlated, you want to remove the correlated ones, otherwise the performance of the model will be really bad.

With that, we make our first attempt: we build a pipeline, build a model, and look at the confusion matrix graph — you can go into the details; there's a lot of explanation in the notebook. In the first attempt, please look at the row for class 1, because remember we only care about the positive samples: our recall is only 33%. That's too bad. So what can we do? As we discussed, we use a weighted classifier. Once we use the weighted classifier and run the model again, I'd like to point your attention to the row for class 1: you may notice that we have increased recall to 98%, at the expense of precision — and in our use case it's okay to have a little less precision. That's pretty good, but this is only on our training data, so how does that 98% generalize to data the model has never seen before? This is the result: we get 84% on unseen data. So we were able to improve the performance of the classifier on this imbalanced data set from 33% all the way to 84 or 85%, because we decided what we wanted to optimize. You may notice the combined precision/recall figure is around 70% overall, but we only care about recall here. At the end of the notebook we also have proper links to all the papers and examples.

One more thing I want to point out is feature engineering — that example is also in the notebook. Originally the model was built on 50 features, but we achieved the same performance using 17 features. The way we achieved that is that XGBoost has a very nice capability called feature importance; it uses an information-gain-based measure to figure out which features are important. One key advantage of using fewer features is that your model will generalize much better. That's the key idea there.
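As a sketch of how that feature-importance-driven reduction might look with the hypothetical model above (the exact notebook code isn't shown here), you can read the importance scores off the fitted classifier and keep only the top-ranked encoded features:

```python
import numpy as np
from xgboost import XGBClassifier

# Fit on the full (hypothetical) encoded feature set first
clf = XGBClassifier(n_estimators=200, max_depth=4, scale_pos_weight=1000)
clf.fit(X_train_enc, y_train)

# Rank encoded features by importance and keep, say, the top 17
importances = clf.feature_importances_
top = np.argsort(importances)[::-1][:17]

# Retrain on the reduced feature set and compare held-out performance
clf_small = XGBClassifier(n_estimators=200, max_depth=4, scale_pos_weight=1000)
clf_small.fit(X_train_enc[:, top], y_train)
print(clf_small.score(X_test_enc[:, top], y_test))
```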
So that's a majority one So now if we consider this as a decision boundary then Of course, if somebody is behind that a lot fire, you don't have to worry because you are overly confident that that's really Minority sample, but if I'm close to this decision boundary. I could be there or here I don't know yeah, because my model could be wrong So if I create more sample around this area where we have confusion You are and you run your model again. You are more likely to create a better model and to answer your question specifically Even though let's say some guy from that side come over here So from majority side he come over there, but they those kind of thing happened very small So even though you create some hybrid sample, that's okay because your machine learn Learning will say that there is five minority sample and one majority sample So that will be considered as a noise So that will take care of that problem even though there are some overlapping issues because remember that They will be always be there's nothing called as a perfect solution, but in majority cases will win Hopefully that answers your question Thank you Yes, yeah, I think it's been used by a lot of big companies to take care of their needs like I know for sure that Yeah Thank you. Thank you Any further question? There is any other question? If not, you can go to ask the expert over there. There is a guy with your hand up Let me I'm gonna go faster and I gonna run. I'm gonna try to do my best Okay, or I can throw the microphone to you, but will be Not the best option that's for you Here you are. Thank you very much for your talk My question is about what when do you recommend going for a small because Obviously small needs more processing power that Using just a weight example or a weight in class So what what is the point where you say okay? No actually it's most of it the one you have to use So what it wouldn't you recommend going for one or the other? so I think machine learning creating a machine learning model is never been a science like Like your Newton equation force equal to mass into acceleration It's kind of fixed right, but data science is a little bit more like data arts. So So for example Weighted one will give you some better result so It's always about the competition if your competitors are doing better than you what you can do to create a better result so you can try Multiple approaches first of course weight samples and then you can collect over sampling of a minority sample and then of course Smote algorithma so nowadays you don't have to code everything from a ground level because most of the standard library in R or Python they do contain package to you know create a Esther samples out of the smote algorithma So you just have to make a function call, but as you mentioned that for the Certain large amount of data set You can always use this parameter call K K will only look for the K nearest neighbor and that will be able to take care of your if it runs for a long time and There are of course other techniques you can create a better preprocessing you can create a drive features and other I touch one aspect of that assuming you have created a feature engineering and everything and you still not able to beat your competitors or your Clients are not happy. What would you do? So thank you very much