Okay, hello. Now we're going to learn about extending scikit-learn with your own regressor, from Florian Wilhelm. Thank you.

Hello, everybody. In my talk, "Extending scikit-learn with your own regressor", I'll first give a short introduction to scikit-learn, which most of you probably know. Then I'll talk about an estimator which is not yet included in scikit-learn, a robust estimator called Theil-Sen. With this as an example, I'll show you how you can implement your own estimator, how to extend scikit-learn. Then I'll talk a little bit about what you need to consider if you want to contribute your own estimator to scikit-learn, and I'll tell you a little bit about my own experiences contributing to scikit-learn.

So first of all, what is scikit-learn? scikit-learn is a machine learning library. Whenever you have some kind of data and you want to extract some insight from it, you can use scikit-learn. It's a simple, efficient tool for data mining and data analysis. It's really simple to use, which makes it accessible to everyone, and you can apply it to all kinds of problems. I took these marketing sentences right from the web page, but they're really true: it's extremely simple, so if you haven't used it, you should definitely look into scikit-learn. It's built on NumPy, SciPy and matplotlib, three famous libraries which are used all over the Python ecosystem. And what is really good: it's open source but still commercially usable, since it's BSD-licensed. So even if you don't want to contribute everything you do with it back to scikit-learn, you can still use it, which makes it really good for commercial applications.

Okay, so this picture can also be found on the scikit-learn website. I like it because it gives a nice overview of the things you can do with scikit-learn, the basic areas of application. You can do classification: a typical example would be handwritten digits, where you want to classify whether a digit is a 1 or a 7, for instance. Then you have everything related to clustering, where you're just looking for patterns in the data without having labels or a real target; so for unsupervised learning you can use clustering. It also supports dimensionality reduction techniques: when you have too many features and you want to avoid overfitting, for instance, you have a lot of tools like PCA and so on. And of course there's the whole regression part, if you want to find the relationship between a target variable and some features, and this is what we're going to talk about.

But before we start, first a little refresher, maybe from school for those who've learned about it: the least squares method, which is called linear regression in scikit-learn. I want to briefly explain how it works, because Theil-Sen is a kind of extension of this regressor. We have independent variables x_1 to x_p, which in scikit-learn speak are called features, and we have a dependent variable, the so-called target y. Now we want to build a model: we want to use the features to somehow predict the value of y, and a typical, really simple approach is just a linear model.
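As a minimal sketch of what this looks like in scikit-learn, here is an ordinary least squares fit; the data is made up purely for illustration.

    # Ordinary least squares in scikit-learn; toy data for illustration only.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature, as a design matrix
    y = np.array([3.1, 6.0, 9.2, 11.9])         # target, roughly y = 3 * x

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)        # fitted weight w and intercept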
So you have a linear combination of the features x and the coefficients w, and you try to explain your target variable y with the features x. In order to find the w's, you minimize the functional given here, min_w ||Xw - y||^2: this is least squares, you are minimizing the squared distances. The typical one-dimensional case is the picture here: the blue dots are your data, the x-axis is the single feature, and the red line minimizes the squared distances to all the dots.

This works really well if you have perfect data, because there's an internal assumption that the error is normally distributed. But in practice, in many, many projects that I worked on, the data you get, maybe from customers, is less than perfect. You have a lot of outliers, you have corrupted data because of measurement errors, or because someone put in a wrong value somewhere. Then quite often your data looks like this in one dimension. You directly see on the right half that there are some values that don't really fit the really dense line on the left side. What you would do in this case is maybe just remove those dots, just by looking at this plot, and decide: okay, I don't want to take these into my fit. But what do you do if you are in a ten-dimensional or an n-dimensional space? Then you can't tell which points are your outliers just by looking at a plot like this, and you need some complicated preprocessing to eliminate those outliers.

So what happens if you now just apply ordinary least squares? You would of course get a completely wrong result. You would not expect the line to go like this; you would rather want the line to go through the dense cluster on the left side. This is something you really do need to consider whenever you look at new data: whether there are outliers in it, and how to come up with something robust.

Theil-Sen, as a natural generalization of the least squares method, is an algorithm that looks at all possible pairs of your sample points and calculates a list of slopes. Once you have the list of slopes, you take the median, and the median is what makes the method really robust, because the median doesn't care about a single value; it only cares about the ranks, the order of those values.

I think this is easily shown and understood with an example. So here again is our plot with the outliers. We take two points, the two red dots here, calculate the slope of the line connecting those two points, and add it to the list: a slope of 3.1 in this case. Now we just go on with all possible pairs of points, and this time it's 3.1 again. And now we are not so lucky anymore: we connect one outlier with a point we would consider not to be an outlier, and the resulting slope is off; you see the list is kept sorted. We go on, another pair, maybe with one or even two outliers, and we could go on and on. But already here we see that if we look at the center of the sorted list of slopes, the median, the center is correct: it's 3.0. And 3.0 is the slope of the line we would expect, the line that goes through the dense cluster of our sample points.
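A toy sketch of this pairwise-slope idea in one dimension; this is just an illustration, not the scikit-learn implementation:

    # 1-D Theil-Sen idea: the slope estimate is the median over the slopes
    # of all point pairs. Toy illustration, not the real implementation.
    from itertools import combinations
    import numpy as np

    def theil_sen_slope(x, y):
        slopes = [(y[j] - y[i]) / (x[j] - x[i])
                  for i, j in combinations(range(len(x)), 2)
                  if x[j] != x[i]]
        return np.median(slopes)

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([3.0, 6.1, 8.9, 30.0, 15.2])  # two corrupted values
    print(theil_sen_slope(x, y))               # close to 3 despite the outliers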
So the whole principle is that you take the median, and then the outliers are not really considered anymore. That was the case for a two-dimensional problem, so just one feature and a target variable. Of course, this method can be extended to n-dimensional space, because in most cases with scikit-learn you will have a lot of features, not only one. I've given the citation to the paper here. In an n-dimensional space you don't have slopes anymore: the slopes become hyperplanes, and the list of slopes becomes a list of vectors. But you basically do the same thing: you sample n plus one points in n-dimensional space, which make a hyperplane, and put the vector of that hyperplane into the list. Then it becomes a little bit tricky, because you need to decide what the median of such a list is. It can be, for instance, the spatial median: if you see the list of vectors as points in n-dimensional space, you try to find the one point such that the sum of the distances to all other points is minimized. This is the so-called Fermat-Weber problem. But basically it works exactly like it does here.

Okay, then again the comparison of ordinary least squares and Theil-Sen: if you do this iteration really for all points, it finds the perfect line.

So much for the motivation of Theil-Sen. In one project I had to deal with corrupt data and outliers that I could not really remove by hand, and then I tried: okay, how would I implement this estimator inside scikit-learn? The good thing about scikit-learn is that you have a lot of good documentation. I think scikit-learn is used so often because the documentation is just so good: if you look for how to write your own regressor, you directly get a manual. If you want to write your own regressor, you have to provide four functions. First, set_params and get_params. These are of course for setting and getting the parameters of your estimator, and they're more or less used only internally, for instance if you do cross-validation or if you use some other kind of meta-estimator. Those functions are used to set and get the parameters of your estimator, but you need to implement them for your own estimator. And of course you need a fit and a predict method. The BaseEstimator class, which is inside scikit-learn, already gives you an implementation of set_params and get_params, so you can just inherit from it. And since Theil-Sen is a linear model, we can also directly inherit from LinearModel, which also gives you the predict method, because in the linear case, as we've seen before with the formulas, predicting for a feature or design matrix X is just a matrix-vector product: you just take X times the weights w that we have calculated before.
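For the spatial median, here is a toy sketch of the Fermat-Weber point computed with a plain Weiszfeld-style iteration; the actual algorithm discussed later in the talk uses a modified variant, so treat this purely as an illustration:

    # Spatial median (Fermat-Weber point): the point minimizing the sum of
    # Euclidean distances to all given points. Plain Weiszfeld iteration,
    # toy version only.
    import numpy as np

    def spatial_median(points, n_iter=100, eps=1e-9):
        x = points.mean(axis=0)                 # start at the centroid
        for _ in range(n_iter):
            d = np.linalg.norm(points - x, axis=1)
            d = np.maximum(d, eps)              # avoid division by zero
            w = 1.0 / d
            x = (w[:, None] * points).sum(axis=0) / w.sum()
        return x

    pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
    print(spatial_median(pts))  # far less affected by the outlier than the mean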
So if we inherit as shown on the right side, if we just let our Theil-Sen estimator inherit from LinearModel, we already get set_params, get_params and predict. Additionally, we have so-called mixins in scikit-learn. The principle of mixins is that you have some reusable code that only works together inside something larger, and you can combine different mixins inside a class. In Python, mixins are done with the help of multiple inheritance. There are a lot of mixins: classifier, regressor, cluster and transformer mixins. In our case, since we're writing a regressor, we of course also inherit from RegressorMixin, which gives us additional functionality, like for instance a score function. And that's already about it.

So, to see the source code: TheilSen, as I said before, just inherits from LinearModel and RegressorMixin to get set_params, get_params and predict. We override the __init__ function; I made an abbreviation here. Of course, you state all the different parameters you have in your __init__ function, like whether you want to fit the intercept or not. In my case there are about ten different parameters, for example whether you want to work only on a subset of your sample points, and whether you want to do this subsampling with the help of some random state, and so on.

The more interesting part is the fit function. It takes X, the design or feature matrix, and y, the target, as usual in scikit-learn. Here I check the random state with the help of check_random_state, since we do some subsampling, working on a sub-population of X, if you don't want to consider all combinations. We also check the arrays X and y. check_arrays and check_random_state are two functions in sklearn.utils, and if you write your own estimators, you should have a look at sklearn.utils for all the developer tools, which help you a lot with those repetitive things: checking an array (is it float, is it in a dense format?), and checking whether the random state is given as a number that should be used as a seed, or is a RandomState object itself and should just be passed on. So much for the developer tools inside scikit-learn.

Then comes the actual algorithm. I don't want to go into too much detail about it; as I said before, it's basically quite simple, just technical, because you need to create all those different combinations of sample points in n-dimensional space. You also need to make sure you don't do too much work, depending on the maximum number of samples you might want to consider. I also did the parallelization with the help of joblib, which is included inside scikit-learn; scikit-learn also ships some external packages directly, like six and joblib.

Okay, and then in this green "Theil-Sen algorithm" part I calculate the coefficients. Of course, the source code is online, so you can check it out.
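A minimal skeleton of what such an estimator can look like. The import paths match scikit-learn around the 0.15 era (LinearModel lived in sklearn.linear_model.base back then), and the names and parameters are simplified; this is a sketch of the pattern, not the actual pull-request code:

    # Skeleton of a Theil-Sen-like estimator following the pattern from the
    # talk; simplified and era-specific, not the actual implementation.
    import numpy as np
    from sklearn.base import RegressorMixin
    from sklearn.linear_model.base import LinearModel  # sklearn ~0.15 path
    from sklearn.utils import check_random_state

    class TheilSen(LinearModel, RegressorMixin):
        def __init__(self, fit_intercept=True, random_state=None):
            self.fit_intercept = fit_intercept
            self.random_state = random_state

        def fit(self, X, y):
            rng = check_random_state(self.random_state)  # drives subsampling
            X = np.asarray(X, dtype=float)
            y = np.asarray(y, dtype=float)
            # ... Theil-Sen core: sample combinations of points, fit a
            # hyperplane per combination, take the spatial median of the
            # resulting coefficient vectors ...
            self.coef_ = np.zeros(X.shape[1])            # placeholder result
            self.intercept_ = 0.0
            return self

    # predict() comes from LinearModel, score() from RegressorMixin.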
Now the coefficients need to be stored for the predict function to work, and we store them in self.intercept_ and self.coef_, so that the inherited predict method, which uses those arrays, works. And at the end, of course, we return self, which allows us to chain different methods together, so that we can call fit and then directly call predict, for instance.

After having programmed this, I was really happy that it worked so well. Without being a scikit-learn developer or anything, I could really easily take my Theil-Sen prototype and put it inside this framework, so that it can be used with things like cross-validation and so on. And I thought: okay, why not just give this back to scikit-learn? So I got the okay from my boss and looked into what I needed to do to really contribute this.

Again, contributing to scikit-learn is well documented, and they have really high quality standards. What you need to do if you want to contribute something: your code should of course be unit tested, at least 90%, but preferably 100%, to make really sure your method works. Then documentation is really important. Looking back, I think the documentation took me way longer than actually writing the code, because you need to find good examples, you need to explain your method a little bit, you need to describe all your parameters in docstrings, and so on. You should also state the complexity of your algorithm, both the space and the runtime complexity. And as I said before, you need to draw some figures; maybe you want to compare your method to a method already implemented in scikit-learn. And if you got the idea for this method from some paper, you should of course reference this paper or those papers.

Then of course there are coding guidelines. As usual, PEP 8 and pyflakes are used in scikit-learn, and they really help a lot to find quite obvious problems; it's good that this can be checked automatically. And as I said before, you should make sure that you use the scikit-learn utils, so that you don't re-implement things that are already there. Another big barrier for me was that I had to make sure my code runs on Python 2.6, 2.7, 3.4 and so forth. This can be done with the help of six, which you've probably heard of and which is also included in scikit-learn, and the parallelization with the help of joblib.

Okay, so much for the requirements for contribution, and then I thought: okay, why not just contribute this? So, a little bit about my experiences. My first pull request started on March 8, and it was my first pull request in the open source world. The community of scikit-learn is really great. There were a lot of improvements due to really good remarks: with the help of the scikit-learn maintainers, the performance was increased by a factor of five or even ten, so it was a really huge improvement. I also got pointers to coding guidelines I still had wrong at that time. So this is really good.
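Coming back to the chaining and cross-validation point from a moment ago, this is the kind of thing that fitting the scikit-learn API buys you: method chaining because fit returns self, and free integration with tools like cross-validation. The TheilSen class here is the simplified skeleton sketched above, not an official class, and the cross_val_score import path is the modern one (it was sklearn.cross_validation at the time of the talk):

    # Chaining and cross-validation with the sketched TheilSen estimator.
    import numpy as np
    from sklearn.model_selection import cross_val_score  # sklearn.cross_validation back then

    X = np.random.rand(100, 3)
    y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.randn(100)

    est = TheilSen()
    y_pred = est.fit(X, y).predict(X)          # chaining: fit returns self
    scores = cross_val_score(est, X, y, cv=5)  # works thanks to the standard API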
Of course, showing your code to other people always gets you good feedback. Then there was also a discussion about Theil-Sen being more of a statistical regressor and not really machine learning, and whether it should maybe better be included in statsmodels. And there was the question of whether RANSAC, the random sample consensus method, is maybe almost always better than Theil-Sen. RANSAC is included in 0.15, but at that time it was scikit-learn 0.14, so it was not included yet, and I didn't even know it existed. So during that time I learned about new methods, which was really cool. If you want to follow up on this pull request: it's currently still cycling, so Theil-Sen is still not included, and I'm still working on it. The discussion was a really interesting one, and I can only recommend to everyone: if you want to contribute to an open source project, it's always a good idea, because along the way you really learn a lot about how to improve things, what the common standards are, and so on.

Okay, so that's about it for my talk. A little marketing slide: Blue Yonder, the company I work for, is hiring. Maybe you've seen us just outside at our booth, and we will be here until Sunday, so even throughout the PyData. If you want to, come and talk to us. Thanks a lot.

Any questions?

So the question was whether there are more traditional techniques, like ridge regression. Ridge regression is included, but it really depends. What ridge regression and techniques like lasso address is a different problem: you want to avoid overfitting when you have too many features. Say you have 100 features but only 1,000 samples; this is really prone to overfitting, and then you give it to lasso or ridge, or another one, ARD, and it kind of says: okay, I shrink or throw out feature number five. So it reduces; it's more like a model reduction thing. The thing with the outliers is different, because you can have outliers inside a single feature. So I think it's a good idea to also include more robust estimators in scikit-learn. As of now, RANSAC is included, which is an algorithm coming more from computer vision. It's more heuristic, it's not that complicated: it tries to select the right points and checks whether it can add other samples to this consensus set, and so on. So I think the scikit-learn developers are really now looking for more robust methods in addition to what they already have.

Some more questions?

So the question was whether Theil-Sen could be parallelized. Yes, it can, and I parallelized it. Taking all those different combinations of points can of course be done perfectly in parallel, calculating the hyperplanes can be done in parallel, and writing back to one large array can be done in parallel. This is what I did with the help of joblib, which is included in scikit-learn, and this works really well. Only the last step, where you need to find this one single spatial median, is different: the algorithm there is based on a reweighted least squares, I think it's called the modified Weiszfeld method, and this is iterative and can't be parallelized. But the first part, of course, is easily parallelized.
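To make the parallel part concrete, here is a toy sketch of distributing the per-combination hyperplane fits with joblib; the function and variable names are made up, and this is not the actual pull-request code (which used the joblib bundled with scikit-learn):

    # Parallelizing the per-combination fits with joblib; toy sketch only.
    from itertools import combinations
    import numpy as np
    from joblib import Parallel, delayed  # bundled as sklearn.externals.joblib then

    def fit_hyperplane(X, y, idx):
        # Solve the small exact linear system defined by one subset of points.
        idx = list(idx)
        return np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

    X = np.random.rand(30, 2)
    y = X @ np.array([3.0, -1.0]) + 0.01 * np.random.randn(30)

    subsets = list(combinations(range(len(X)), X.shape[1]))  # p points per plane
    coefs = Parallel(n_jobs=2)(
        delayed(fit_hyperplane)(X, y, idx) for idx in subsets)
    # The remaining sequential step is the spatial median over coefs
    # (the modified Weiszfeld iteration).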