Okay, thank you everyone. Next talk: Adrin is going to tell us about scikit-learn custom transformers. Sorry, you'll need the mic.

Okay, thank you. Hello everybody. I'm Adrin, I work at Anaconda and I'm one of the scikit-learn maintainers. Today I'm going to talk about how to write your own estimator. Before I start, I need to know how many people have used scikit-learn. Okay, cool. So I don't need to focus too much on the background; that's good. For the rest of us who are not familiar with it, it's a statistical machine learning library. That means it covers all the old-school stuff: support vector machines, random forests, k-means and whatnot. It does not include the deep learning methods, and it doesn't cover GPU acceleration; that's just not in its scope.

That said, when we look at the library, what are some of its main components? Before we start writing our own estimator, we need to understand that. We have estimators. Estimators are either transformers, in which case they take some data, transform it, and spit it out, or they're predictors: classifiers or regressors. Then we have scorers. Once we have models, we need to know how they perform, and we have different scorers to measure performance in different ways. Then we have meta-estimators. A meta-estimator takes an estimator and does something with it. Two important ones, and the relevant ones for this talk, are the pipeline, which lets you chain a set of transformers with, if you will, a predictor at the end (your classifier at the end, your transformers before it) and treat the whole pipeline as one single estimator; and grid search, which is easier to explain with a little example.

In the usual workflow, we have our data, and we need to preprocess and prepare it before giving it to our classifier, in the case of classification. So in this case I have two steps to prepare the data, and then I feed that to an SGD classifier. But each of these steps usually has some hyperparameters you can tune. If a transformer is doing principal component analysis, how many components should it return? If you're doing k-means, what's the k? If you're regularizing, what's the regularization parameter? That collection is your parameter set, and that set defines a space. Now you want to search that space and find the best point for your data. Grid search does that for you: you pass it your estimator and your parameter space, and it does the search. If you want to use a different score than the default one, you can pass that too. A sketch of this workflow follows below.

So with all that flexibility, why would we want to write our own estimator? There are a couple of cases. One is that scikit-learn doesn't have all the algorithms out there. It has the classical ones, but it's not really possible for us to include everything, so if you fancy a brand-new algorithm, it's probably not in there. Or if you're a researcher who wants to implement and work on their own method, you probably want to write your own and see how it works in combination with the other methods and transformers out there. Or, my favorite example, if you're doing ethics: ethics work like bias mitigation and detection is not in the scope of scikit-learn, so if you want to work on that, you'd have to write your own scikit-learn-compatible estimators to mitigate your bias.
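Here is a minimal sketch of that pipeline-plus-grid-search workflow. The step names and parameter values are illustrative, not from the talk:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    # Two preparation steps, then the classifier; the whole thing
    # behaves as one single estimator.
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA()),
        ("clf", SGDClassifier()),
    ])

    # Each step's hyperparameters are addressed as <step>__<parameter>;
    # together they define the search space.
    param_grid = {
        "pca__n_components": [2, 5, 10],
        "clf__alpha": [1e-4, 1e-3, 1e-2],
    }

    # Grid search over that space; a non-default scorer can be passed too.
    search = GridSearchCV(pipe, param_grid, scoring="accuracy")
    # search.fit(X, y) then exposes search.best_estimator_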
We also don't include things that are extremely specific to certain use cases. If you need to do something that applies only to your data, you'll probably need to write it yourself; it's not going to be included in the library. Another use case is writing meta-estimators. If you want to do something before or after every call into an estimator, you can easily write a meta-estimator, wrap it around your estimators, and then do logging or auditing or whatnot.

So what's the basic API? What does it look like? Estimators expose fit, to train on the data; predict, if they're doing classification or regression; transform, if they're a transformer; and score, because if they're a predictor you need to know how they perform. When I look at people's code trying to write their own estimators, it looks as if they watched this talk, or the equivalent of this talk, and stopped here. So please don't.

So if I want to write one (this estimator is not really doing anything fancy, it's just to show how you could write it), what are the components that you need? Before that, one thing: it is a very opinionated API and it has its own design. I know probably half of us in this room may not agree with that design, but that's what it is, so that's not the discussion here; we can talk about it later. We do composition: if you're writing an estimator, you inherit from BaseEstimator; if you're writing a classifier, ClassifierMixin; and depending on what you do, you need different mixins, like RegressorMixin, MetaEstimatorMixin, and a bunch of others. We also have a bunch of really nice methods for input validation. You really don't need to write your own input validation; you don't need to check whether the input is a numpy array or not. All of that is there.

My classifier is going to wrap around an SVC, in a very poor way. Things to note here: I have my init, and in the init I accept my hyperparameters, and the only thing I do is store them. I store them in public attributes and I do no validation. That is important: all the validation goes into fit. In fit, I do input validation, and if needed, I validate my parameters; if two parameters are not compatible and I need to check that only one of them is set, this is where I do that. Then I store my trained SVC in an attribute with a trailing underscore, estimator_. That's again important. The convention is that plain attributes are public; if there's a trailing underscore, the attribute was set in fit; if there's a leading underscore, it's private and backward compatibility is not guaranteed. Then I have predict: I check that I'm fitted, then I check my input, and then I delegate to my wrapped estimator's predict. A rough reconstruction of this wrapper follows below.

So what did I use there? One of them was check_is_fitted: it checks whether there is any attribute with a trailing underscore, and you can tune its behavior. check_array is a really important one. It returns a numpy array, unless you explicitly say you want to support sparse input, in which case it doesn't convert a sparse matrix to a dense array. And if the input is a pandas DataFrame, it converts it to a numpy array; so if you want to, for example, get your feature names from your pandas DataFrame, do that before passing it to check_array. And check_X_y does the same thing, plus some extra validation on y.
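To make that concrete, here is a rough reconstruction of the kind of wrapper described above. The class name and the single hyperparameter are illustrative; the structure (store in init, validate in fit, trailing underscore, delegate in predict) is the point:

    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.svm import SVC
    from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

    class MyClassifier(ClassifierMixin, BaseEstimator):
        def __init__(self, C=1.0):
            # Only store the hyperparameters, in public attributes,
            # with no validation at all.
            self.C = C

        def fit(self, X, y):
            # All input and parameter validation happens here, in fit.
            X, y = check_X_y(X, y)
            # The trained estimator goes into an attribute with a
            # trailing underscore, meaning "set during fit".
            self.estimator_ = SVC(C=self.C).fit(X, y)
            # classes_ is required for classifiers by the common tests.
            self.classes_ = self.estimator_.classes_
            return self

        def predict(self, X):
            # Am I fitted? Then validate the input and delegate.
            check_is_fitted(self)
            X = check_array(X)
            return self.estimator_.predict(X)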
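And here is what that check_array behavior looks like in practice; the small DataFrame is just for illustration:

    import numpy as np
    import pandas as pd
    from scipy import sparse
    from sklearn.utils.validation import check_array

    df = pd.DataFrame({"age": [20.0, 30.0], "income": [1.0, 2.0]})
    names = list(df.columns)   # grab feature names before conversion
    X = check_array(df)        # the DataFrame comes back as a numpy array
    assert isinstance(X, np.ndarray)

    X_sparse = sparse.csr_matrix(X)
    # check_array(X_sparse) would raise, since dense input is the default;
    # opting in keeps the sparse matrix sparse:
    X_still_sparse = check_array(X_sparse, accept_sparse=True)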
Now that we have the estimator, how can we be sure it's compatible? Compatibility is usually checked through our common tests. We have check_estimator, which runs a whole bunch of tests, and we recently added the parametrize_with_checks decorator. You put that on top of your pytest test and it runs all the checks individually, so you can easily see and debug what went wrong. For example, when I was writing this one, I forgot to set the classes_ attribute, which is required if you're a classifier. The tests complained, so I went back and set it. A sketch of this test setup follows below.

Once you have that, it's easy: you can use the estimator the way you'd use any built-in one. I have a bunch of data; I can fit on my data; I can get my score; I can put it in a pipeline (here I have a SelectKBest and then my classifier); and I can even pass that to a grid search. I fit my grid search, and if I check its best estimator, I see my classifier with the hyperparameter selected by the grid search. A usage sketch also follows below.

So what are some of the conventions? I pretty much mentioned all of them except the parameters passed to fit. The one you usually see in the existing scikit-learn API is sample_weight, but you could pass other stuff; you could pass groups. In the context of bias detection and fairness, we usually have protected attributes that are not part of the data, like gender, zip code, race; all of those you can pass to fit as fit parameters. The convention is that everything you pass as a fit parameter should be sample-aligned. If you have feature-aligned attributes, probably don't pass them there; if you have something you could pass as an init parameter, do that instead. This matters because if you do pass things that are sample-aligned, the grid search, when it does the folding and cross-validation, slices these extra parameters for you and passes them along with your data to the fit function.

These are the usual conventions, but not all estimators follow all of them, and other meta-estimators or some of the tests need to know that. That's why we recently introduced estimator tags. They're still experimental, as in we may change them without prior notice (they don't go through the usual deprecation cycles), but they're pretty useful. You can tell another meta-estimator, or the tests, what kinds of input you allow: do you support multi-output? Do you accept NaNs? If you want to change any of the defaults, you can do that by overriding _more_tags; a small example follows below.

So what are we doing now? This is how it works today, but we're adding a bunch of things to the API. They're useful, but it means you may also need to change your estimators a little. The first one coming in, which hopefully will be there in the next release, is n_features_in_ and n_features_out_. We want to be able to inspect models and know how many features went in and, for a transformer, how many come out. That's the first step: it helps a lot for us to clean up the code, but it also helps you understand what's going on in a pipeline. The step after that is feature names. Usually, if my data is not just a numerical block, if I have a pandas DataFrame with a bunch of feature names and I have a pipeline, I'd like to follow how my features travel through that pipeline. If I have a bunch of transformers and a classifier at the end, I want to know what went into my classifier.
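Here is the test setup mentioned above, roughly as one would write it; the test name is arbitrary:

    from sklearn.utils.estimator_checks import check_estimator, parametrize_with_checks

    # One call that runs everything at once:
    # check_estimator(MyClassifier())

    # Or, nicer: one pytest case per individual check, so a failure
    # (say, a missing classes_ attribute) points at the exact check.
    @parametrize_with_checks([MyClassifier()])
    def test_sklearn_compatible_estimator(estimator, check):
        check(estimator)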
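And the usage described above, sketched with generated data and made-up parameter values:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    clf = MyClassifier().fit(X, y)   # fit on my data
    print(clf.score(X, y))           # get my score

    # In a pipeline: SelectKBest, then my classifier.
    pipe = Pipeline([("select", SelectKBest(k=5)), ("clf", MyClassifier())])

    # And the pipeline in a grid search over my hyperparameter.
    search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}).fit(X, y)
    print(search.best_estimator_)    # shows the selected C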
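For the tags, overriding a default looks roughly like this; allow_nan is one of the documented tag names, but remember the whole mechanism is experimental:

    class MyNaNFriendlyClassifier(MyClassifier):
        # Note: for this tag to be truthful, fit and predict would also
        # need to validate with check_array(..., force_all_finite="allow-nan").
        def _more_tags(self):
            # Tell meta-estimators and the common tests that this
            # estimator accepts NaN values in the input.
            return {"allow_nan": True}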
If I have a linear model and I want to inspect its coefficients, I want to know which feature it is that now has a high coefficient. For that, the API right now lets you have get_feature_names, which returns the feature names, but it's ambiguous: is that the input feature names or the output feature names? Sometimes it's not even clear how to define it. So we're deprecating that, and we're going to have feature_names_in_ and feature_names_out_, which are pretty clear. That means that if you pass a pandas DataFrame, the feature names are extracted for you, and at the end all of that is propagated through the pipeline.

The next one, which I'm really excited about, is data properties: sample props, feature props and data props. sample_weight is an example; the gender attribute I mentioned is another. The issue is that right now in a pipeline, if I want to pass sample weights to fit, I have to say: pass this one to the fit of that particular step of the pipeline. If I then want to pass the same sample weights to another step's fit, I need to duplicate them and say: also pass them to this one. And if I have a meta-estimator, I don't know whether the meta-estimator should pass them through or not; maybe I have to duplicate them, one copy used by the meta-estimator itself and another handled by the wrapped estimator. So it's really not clean. (A small illustration of today's routing follows below.) The idea is to have proper routing, so that the pipeline and every meta-estimator would know what needs to be passed to which step, and not just for fit but for score and predict too. If you need to pass other properties to them, you should be able to. That requires changing the API a little; there are some prototypes, and hopefully they'll go forward and we'll have this soon.

But that's not all of it. I only showed you a really, really simple example. For instance, if I really wanted to write a clean estimator here: since I'm wrapping around an estimator, I may be a classifier, but I'm also a meta-estimator, which means I shouldn't have to care about which hyperparameters the wrapped estimator has. The user should be able to pass in an estimator, and I should be able to discover its parameters. You don't have to do that yourself; you can use MetaEstimatorMixin. (A sketch of a simple logging meta-estimator follows below.)

You can see all of that in the pointers I give here. This documentation covers most of the stuff I talked about. base.py has everything except the meta-estimator pieces, and there are other mixins you could probably use; then there's the meta-estimator module; and validation.py has a lot more utility functions you can use.
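To illustrate today's routing, here is roughly what passing sample weights through a pipeline looks like; the step names are again illustrative:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(random_state=0)
    pipe = Pipeline([("scale", StandardScaler()), ("clf", SGDClassifier())])

    # sample_weight has to be addressed to one specific step by prefix;
    # if a second step needed the same weights, they would have to be
    # passed again under that step's own prefix.
    pipe.fit(X, y, clf__sample_weight=np.ones(len(y)))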
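And a minimal sketch of the logging meta-estimator idea from earlier; the class is hypothetical and deliberately bare:

    from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone
    from sklearn.utils.validation import check_is_fitted

    class LoggingClassifier(MetaEstimatorMixin, BaseEstimator):
        def __init__(self, estimator):
            # The wrapped estimator is a hyperparameter: stored as-is.
            self.estimator = estimator

        def fit(self, X, y, **fit_params):
            print(f"fit: {type(self.estimator).__name__}")
            # Work on a clone so the user's instance stays untouched.
            self.estimator_ = clone(self.estimator).fit(X, y, **fit_params)
            return self

        def predict(self, X):
            check_is_fitted(self)
            print(f"predict: {type(self.estimator_).__name__}")
            return self.estimator_.predict(X)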
Thanks, I'll take questions now.

We have time for lots of questions.

So, can this estimator work with TensorFlow data structures, or the ones from Keras, or only with basic data structures? If I have data in TensorFlow's structures, can I just use it with this estimator from scikit-learn, or will it require some kind of conversion along the way?

So, if I understand the question: could you plug, for example, a PyTorch model in as an estimator here?

Yes, yes, or something like that, yes.

Yeah. The default API of any of those libraries doesn't follow this API, but usually what happens is that they have an sklearn wrapper. So you could; I don't remember which one has it where, but for example, if I see something like pytorch.sklearn, then I know that's where I can find the scikit-learn-compatible estimators. Those estimators wrap around the library's own models, but they expose an API that is compatible, so you can take one and plug it into a pipeline here.

Okay, so in principle it can.

Yeah, people do that.

Okay, thank you.

Don't be shy, raise your hand. If I don't see it, raise it higher. Nothing? Okay. Thank you.