Hello, my name is Emil Hvitfeldt and I will be talking about themis, dealing with imbalanced data by using synthetic oversampling.

We'll start with a motivating fictional scenario. Imagine you work at a health-care startup whose mission is to provide preventive care to lower overall medical costs. A new cancer screening is being made available, and you have been tasked with developing a model that classifies which customers would benefit from it. So you spin up some modeling: you load in the data, load up tidymodels, run your usual modeling workflow, and you get awesome results with a very respectable accuracy. But before you report it to your manager, you go and look at some of the other diagnostics to make sure that your model is actually sound. You start with the confusion matrix. But oh no, it appears the model always predicted the majority class; we didn't see any of the observations being categorized as at-risk. Even worse, the class distribution is quite skewed, with the vast majority of cases being labeled as low-risk. So the fitted model wasn't able to distinguish between observations that are at risk and those that are not, and we need to find a way to deal with that problem.

One of the assumptions we make about our data is that there is a fundamental difference between the distribution of the people who are at risk and the people who aren't. Furthermore, we have the added complexity that there are far more people not at risk than people at risk, so a lot of modeling techniques will not be able to calibrate properly. Many models perform best when there is a more or less even number of observations within each class.

So how do we deal with this problem? One way is to weight the observations according to which class they belong to; in this case, we would assign higher weights to the observations in the minority class so that they carry enough influence. Another way is to use ensemble methods that cleverly resample the data so that each model in the ensemble sees a more balanced set. Another way is to perform over-sampling or under-sampling, and that is what I'll mostly be focusing on for the rest of this talk.

Before we go on, I'll give a brief definition of what I mean by over-sampling and under-sampling. These definitions are quite broad, but they will work for what I'll be talking about today. When I talk about over-sampling, I mean the process of creating additional observations from a minority class, where a minority class is any class that isn't the most populated class (that one is the majority class). Under-sampling is defined as removing observations from the majority class. Ideally, we end up with the different classes being more or less even.

Before we continue, a disclaimer. To make the different methods easier to comprehend, I'll be doing all the visualizations in two dimensions, but rest assured that all the methods generalize to higher dimensions, as long as you are working in a Euclidean space. Similarly, the examples I'll be showing only have two classes, but these methods generalize to multiple classes by working through the classes one by one.
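Before we get to the pictures, here is a minimal sketch in R of what random under- and over-sampling boil down to. This is my own illustration rather than code from the talk; the toy data frame, the column names, and the 100-to-20 class split below are made up for demonstration.

```r
# A minimal sketch (not from the talk) of random under- and over-sampling
# on a two-class data frame `df` with outcome column `class`.
set.seed(1234)

df <- data.frame(
  x     = rnorm(120),
  y     = rnorm(120),
  class = rep(c("low_risk", "at_risk"), times = c(100, 20))
)

counts   <- table(df$class)
majority <- names(which.max(counts))
minority <- names(which.min(counts))

# Random under-sampling: keep only as many majority rows as there are minority rows.
keep_major <- sample(which(df$class == majority), size = min(counts))
under      <- df[c(keep_major, which(df$class == minority)), ]

# Random over-sampling: duplicate minority rows (with replacement) up to the majority count.
extra_minor <- sample(which(df$class == minority),
                      size = max(counts) - min(counts), replace = TRUE)
over        <- rbind(df, df[extra_minor, ])

table(under$class)  # both classes now have 20 rows
table(over$class)   # both classes now have 100 rows
```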
So here we have a small selection of data that could be a subset of the data we had before. In the lower left, the blue points are the customers at risk, and the green points are the people who are not at risk. As we can see, there is a fairly clear decision boundary between the two, but we still have some points that lie in between. One of the main outliers sits right in the middle: a blue point surrounded by all these green points.

One of the first ways of dealing with this is randomly removing samples from the majority class. That would mean taking some green points at random and removing them. So first we take half of the green points, illustrated here in a dark green color, and remove them. Now we have a more even split between the number of blue points and green points, and our models will hopefully perform better. It can be scary to think that we are losing some information, but rest assured that this is one of the ways to make our models work.

Another way of dealing with this is to perform over-sampling, where we create additional points, and there are a lot of different ways of creating points. One is to duplicate existing points. This is a little like what we did before, but instead of removing points, we are adding copies, sampled with replacement, of the minority cases. We can also generate points around existing points; we want to do this in some space, which is generally the Euclidean space. Lastly, and this is a method I won't be going over, you can build a generative model of the distribution of each class and draw samples from those distributions. The samples will be completely synthetic, but they should have roughly the same distributional properties as the original data. I'll be focusing on generating points around existing points.

Here we'll introduce the SMOTE algorithm. SMOTE stands for Synthetic Minority Over-sampling TEchnique. SMOTE is a very clever technique that works by generating points between existing points within a class, and it has a lot of different variants and related methods. I'll showcase how this method works by showing you how to SMOTE one point, which can then be generalized to SMOTE-ing many points.

So we have the data, and I'll focus on this little area to zoom in a bit so we can more clearly see what's going on. To SMOTE a point, we first select one of the points; here I've marked it in dark gray in the upper right. Then we find the nearest neighbors of that point in the same class. In this case I found the five nearest neighbors around the point, and to illustrate, I've added dotted lines between the point and its neighbors. Then we randomly pick one of those neighbors; here I've illustrated that with a solid black line. To generate the point, we then randomly place a new point on the line between these two points. And now we have one new point.

To SMOTE the whole dataset, we just repeat this process many times. Here we see the data from before, but all the gray points are new synthetic points that have been created between existing points. If we want to create a completely balanced dataset, we take the number of points in the majority class minus the number of points in the minority class and generate that many points. Commonly in SMOTE, you make sure that each existing point is used more or less evenly, so you would generate one or two new points around each point, to make sure that all of the new points haven't been created around a single one.
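To make the single-point step concrete, here is a minimal sketch in R. It is my own illustration of the idea rather than code from the themis package, and the small random minority matrix, the function name smote_one_point, and the choice of k = 5 are just for demonstration.

```r
# A minimal sketch of generating one synthetic SMOTE-style point.
set.seed(2020)

# Minority-class points, one observation per row (two features here, but the
# same idea works in any number of Euclidean dimensions).
minority <- matrix(rnorm(20), ncol = 2)

smote_one_point <- function(points, i, k = 5) {
  # Euclidean distance from point i to every other minority point.
  d <- sqrt(colSums((t(points) - points[i, ])^2))
  d[i] <- Inf                          # never pick the point itself
  neighbours <- order(d)[seq_len(k)]   # the k nearest same-class neighbours
  j <- sample(neighbours, 1)           # pick one of them at random
  lambda <- runif(1)                   # random position along the segment
  points[i, ] + lambda * (points[j, ] - points[i, ])
}

# One new synthetic point; repeat many times to balance the whole dataset.
smote_one_point(minority, i = 1)
```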
One of the variants of the SMOTE algorithm is borderline-SMOTE. Here, as the name suggests, we want to generate points along the border. You take all the points whose neighbors are all of the same class as themselves and mark those as safe. These will be all the blue points over here; they are not near the green points, they are far away from them, so we call them safe. On the other hand, the points that are completely surrounded by neighbors of the other class we mark as noise; they are too far across the decision boundary to really do anything, like the points here and right in here. Creating points between those points and other points would overlap with the other class, which would create some weirdness. Then, of the remaining points, all the ones that have more than half of their neighbors from a different class we label as danger points, and we then run the SMOTE algorithm, but only on those danger points. Notice that since this dataset doesn't have that many danger points, the newly created points are focused in a smaller area.

One variant of the borderline-SMOTE algorithm generates points not just between the neighbors of a point's own class but between all the neighbors. We can very clearly see that right in the middle, where the previous variant doesn't place anything, there is a key difference: this variant will put points between the two classes.

The last variant I'll be talking about is ADASYN, which stands for adaptive synthetic sampling. It follows the SMOTE principle, but the points that form the basis of where we create new points are selected with probability proportional to how many of their neighbors are from a different class. So all the points we previously marked as safe have a low probability of being selected, and the points that are very close to the other class have a high probability of being selected to have new points generated around them. These methods all have different pros and cons depending on the data you're using and how you expect it to be distributed.

To be able to use this in practice, I decided I needed three different criteria. First, I need the methods to handle more than two classes, so you could have two or four minority classes and one majority class, and they would all be brought up to balance in one go. Second, I wanted it to be fast and have a low memory footprint. As you can probably imagine from the algorithm I just explained, it's very easy to write this code in a way that isn't very memory efficient, because you're creating a bunch of small objects, but with some clever engineering you can do all the randomness at once and use some smart indexing. Lastly, I want to be able to generate exactly the number of points I ask for; some implementations will only generate a multiple of the existing class count, which is not what we want, because then we can't achieve complete balance.

That's where the themis package comes in. Last year I developed the themis package, which is now part of the tidymodels framework. It adds additional steps that work seamlessly with the recipes package and implement the methods I've just talked about. In the small example you can see right here, we have the credit dataset with some bad statuses and some good statuses. We just put step_smote in our recipe specification, and when the data comes out the other end, there is an even number of points in each class. The package comes with a couple more methods than what I've talked about so far, and there are more methods planned for the future.
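The credit example just described looks roughly like the following. This is a sketch along the lines of the package's announcement post rather than the exact code shown on the slide; the specific predictor columns and the mean-imputation step (SMOTE needs complete numeric predictors) are my own choices.

```r
library(recipes)
library(modeldata)
library(themis)

data(credit_data)          # outcome column `Status` with levels "bad" and "good"
table(credit_data$Status)  # heavily skewed towards "good"

rec <- recipe(Status ~ Age + Income + Assets, data = credit_data) %>%
  step_impute_mean(all_numeric_predictors()) %>%  # SMOTE needs complete numeric predictors
  step_smote(Status) %>%                          # generate synthetic minority rows
  prep()

table(bake(rec, new_data = NULL)$Status)  # both classes now have the same count
```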
If you want to know more about these methods you can read the announcement post on the Tidyverse blog and that's all I have for now. Thank you.