Hello everyone, my name is Olivier Grisel. I work at Inria in Paris and my work is funded through the Inria Foundation — maybe I don't really need to mention that. I am a software engineer supporting the development of the scikit-learn project, together with a couple of colleagues at Inria and an international community of contributors from around the world. scikit-learn is a machine learning library in Python. How many of you do not know scikit-learn, or have never used scikit-learn in the past? OK, a couple of new people. For this presentation, what I would like to do is first give an introduction to scikit-learn for the newcomers, and then focus on new features since the latest release of scikit-learn, which is 0.21 — in particular the new gradient boosted trees algorithm that we contributed — and then give a demo of that. We probably won't have time for much more, but if we do, I can also talk about other things that shipped in recent releases and about ongoing work. So first, scikit-learn and machine learning. Predictive modeling, which is also called statistical machine learning, is the process of using repeated events that are recorded in a database — keeping a historical record and extracting statistical structure from those records in order to turn them into an executable model. The goal is to generate a program that can predict the outcome of such repeated events. You can see this as an alternative to hard-coded rules written by experts who know the modeled process well: instead, we use a record of data and hope that this executable artifact will make good predictions on future events.
So it is generally most useful for small predictions — a large number of small predictions, where an individual mistake does not matter much, but where on average you want your predictions to be better than a random guess, or better than simple hand-designed rules that you could write quickly. It can then be used to optimize many business processes, scientific discovery, and so on. The general flow starts with recording data: you have an acquisition process that could be based on smartphones, cameras, microphones, or user interactions via a web app, and all those individual events are recorded in a database. The first step is to find a numerical representation of those records — these are the blue rows in this database. And there is a specific column that you are interested in, the target variable that you want to predict — shown in grey. Sometimes it is naturally present in your database: for example, if you want to forecast something, the past information is already there. Sometimes you have to collect human annotations: for example, for image classification, translation, or things like that, you need professional annotators to give you those labels. Once you have that, you can plug in some machine learning algorithm — mathematical models implemented in Python, C++, or something else. And the output of the training process is a statistical model. The statistical model is a kind of summary — a couple of megabytes of parameters, for example — of the large training set that you used as input.
So typically, statistical models are a couple of megabytes, or even smaller, whereas the training set might be gigabytes, or terabytes in some cases. Once you have that, you can deploy it: you do not need to copy the whole training set, you can just copy this statistical model to another server or to a mobile phone, collect new data with the same acquisition process, and then run the prediction algorithm by feeding the new test records to the model to get the output. A typical example of this: if you are a real estate agency, you can record historical transactions for different kinds of houses. For each of those records, you collect numerical or categorical features in different columns: the number of rooms is an integer, the area in square meters is a continuous variable, the year is an ordinal variable. And typically you also have the target variable — in this case the price, in euros — in a specific column. You can also have new records in your database where you do not have this target variable, because they come from new customers who want an estimate for the house they want to sell — and for those, by definition, you do not know the price of the future transaction. So here the goal is to find statistical relationships between the price and the descriptors in the historical data, and then use this statistical relationship to predict an estimate of the price. The names we use: the columns used as input are the features; the target variable is the output of the model; and the records in the database are what we call samples in machine learning.
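As a sketch, the real estate example above could look like the following small table — all values and column names here are made up for illustration, not taken from the talk:

```python
import pandas as pd

# Hypothetical historical records: one row per sample, one column per feature.
housing = pd.DataFrame({
    "n_rooms": [3, 5, 2, 4],                      # integer feature
    "area_m2": [70.0, 120.5, 45.0, 95.0],         # continuous feature
    "year_built": [1995, 2008, 1978, 2015],       # ordinal feature
    "price": [210_000.0, 430_000.0, None, None],  # target; unknown for new customers
})

features = ["n_rooms", "area_m2", "year_built"]
train = housing[housing["price"].notna()]  # historical transactions with a known price
new = housing[housing["price"].isna()]     # new records whose price we want to predict
X_train, y_train = train[features], train["price"]
```

The rows with a known price play the role of the training set, and the rows without one are the new records on which the model will be asked to predict.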
And the samples where we have labels, which we can use to fit the models, are the training set. The new records where we want to make predictions are the test set. And when we later collect the true value, we can compare the prediction with the true value and compute the accuracy of the machine learning model on the test set, to measure generalization — which is the quality of the model. So, scikit-learn is a machine learning library with hundreds of classical machine learning algorithms implemented — the traditional families. The goal is not to implement the latest state of the art, but to give scientists good baselines to run on their data, to build something useful, or, if they have new ideas for machine learning algorithms, to check that their new thing is actually better than the traditional ones. It is an open source project under the BSD license, so everybody can use it; there is just the usual disclaimer that if you hit a bug, it is your own responsibility. It has a community of hundreds of contributors — I think it is more than 1,000 now — and a team of core developers spread across the globe: in Australia, China, France at Inria, Germany, and in the United States at Columbia University. Most of the algorithms are exposed through a Python API and are sometimes implemented in the Python language itself, with the help of the vectorized numerical and linear algebra operations provided by NumPy. But sometimes the vectorized operations that are very efficient in NumPy are not sufficient to implement algorithms like decision trees, where the bottleneck is not naturally a matrix multiplication. In those cases, a compiler can be very effective.
We use a programming language called Cython, which is an extension of the Python syntax that can generate C code with types, so that a compiler can build an extension module for Python from a very high-level syntax while still giving you efficient numerical operations. The nice thing is that scikit-learn gives you mathematically very heterogeneous models under a homogeneous API, so scientists can swap models very easily, because they all follow the same method signatures: the fit method to train the model, and then the predict method to apply the model to new data. There are standard tools to evaluate the quality of a model — for classification, accuracy measures; for regression, the mean absolute error, for instance — as well as cross-validation procedures, model selection to tune hyperparameters, ways to ensemble several models into one bigger model, and also tools to build pipelines with preprocessing. It is a very active project: I think we are at more than 800,000 monthly active users online, and in total it is estimated at maybe a million users or something like that. So, what is new in 0.21?
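Before looking at what is new, the shared fit/predict API, the pipelines, and the hyperparameter selection just described can be sketched like this — a minimal illustration on synthetic data, where the dataset, the estimators, and the parameter grid are all made-up choices, not from the talk:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: X holds the features, y the target variable.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Preprocessing and model chained into one estimator exposing fit/predict.
pipe = make_pipeline(StandardScaler(), Ridge())

# Hyperparameter selection by cross-validation over the whole pipeline.
search = GridSearchCV(pipe, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)            # train: pick the best alpha, refit on all data
y_pred = search.predict(X)  # apply the fitted model to data

mae = mean_absolute_error(y, y_pred)  # standard regression metric
```

Because every estimator follows the same fit/predict contract, swapping `Ridge` for any other regressor leaves the rest of this code unchanged.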
So, if you look at the changelog, this is just a snippet: if you were to scroll the whole page and take screenshots, it would be far too much to cover, so I will just focus on a subset, and in particular on the new implementation of gradient boosting in scikit-learn.

Gradient boosting is a very useful machine learning model. It is based on decision trees that you fit sequentially, one after the other: the second decision tree tries to correct the prediction errors of the first, and you build an ensemble by sequentially refining the predictions using different trees. Gradient boosting has been implemented in scikit-learn for about 8 years, but the way it was implemented was the traditional exact method, and that traditional method has since been outperformed by a newer method implemented, for instance, in another C++ project called LightGBM, and in other libraries like XGBoost. In particular, LightGBM showed that you can get very high performance through what is called binning: binning the data into a small number of buckets and computing histograms of the frequencies with which data points fall into each bucket. This makes it possible to avoid the sorting operations that are very expensive in the traditional algorithm: thanks to the binning we can get rid of sorting and only do comparison and counting operations, which is much, much more scalable. Furthermore, the binning step itself can also improve the quality a bit by adding some regularization.

To implement this for scikit-learn, we first built a prototype using the Numba framework, which is basically a just-in-time compiler for scientific Python, because I wanted a good opportunity to learn how to use Numba. This was implemented in the pygbm project, which is also open source, but it is just this single algorithm implemented with Numba. After that, once we had shown that we could reach this kind of performance with Numba, we translated the code into Cython, which is very similar to Numba — we basically just need additional type declarations — so that it could easily be embedded in scikit-learn, which already has a dependency on Cython; we did not want to introduce a new dependency lightly. So maybe in the future scikit-learn will use Numba, but in the short term it was easier to just translate the code. We still plan to keep the pygbm project around, so if you are interested in Numba and gradient boosting you can still use it directly; the two implementations have basically the same performance. When we measure on some benchmark datasets, for instance the Higgs boson dataset, it is quite competitive with LightGBM, which is a very optimized C++ implementation — sometimes it is even a bit faster. The Cython translation was contributed by Nicolas Hug at Columbia University, and we are still working together to integrate the missing features. This slide is slightly outdated: we now have more losses and we can do multi-class classification; we still want to add new losses for quantile regression and so on, support sparse data, merge missing-value support, and there are other features like this from LightGBM that are not yet present in scikit-learn — we are still working on that.

To give you some intuition of how Numba makes it easy to write high-performance algorithms in pure Python, this is a snippet of the code for the binning part, which maps continuous values to integer bucket indices. It is a very naive algorithm — it is the traditional way to write binning, and there is no obvious way to do better — but written in pure Python you have two nested loops, which is very inefficient. If you just import Numba's njit decorator and decorate your function like this, then Numba will use LLVM to compile a native version of the code automatically for your platform, and it runs as fast as C++. Furthermore, if you replace the range call of the for loop with Numba's prange function, you automatically get parallel execution of the loop: the independent iterations run in different threads on multicore machines, which gives you a good speedup. With Cython it is very similar, except that additional type declarations are needed — it would no longer fit on the slide, but the code would be very similar anyway.

The Cython implementation is still flagged as experimental in scikit-learn, because we plan to implement more features in it, and we know we may need to change some behavior or the default hyperparameters. Because of that — and we are very conservative in scikit-learn about not breaking our users' code — if you want to use the new HistGradientBoostingClassifier, you need to acknowledge in your source code that you are using an experimental feature, and that for this specific model we do not guarantee backward compatibility through the usual deprecation cycle. We will not break it for fun; we will probably just change small behaviors or default hyperparameters. We still need to implement sample weights, sparse data, missing values (almost ready), categorical variables, and so on, but it is already usable for numerical values — or if you preprocess your data so that you only have numerical values.

From a performance point of view, here is a benchmark on a synthetic classification problem with, I think, 20 features. On the x axis you have the number of samples, from hundreds of thousands to millions — I think the last point is 5 million; it could go to 10 million, but on this laptop that used too much memory in the end. You see this kind of linear scalability — it is a log-log plot; on the y axis I omitted the label, but it is the time in seconds. You see that for 5 million data points it takes approximately 10 seconds to train a typical ensemble of trees with this algorithm. The blue line is scikit-learn, the orange line is LightGBM. For small datasets we have some overhead in scikit-learn because we do additional input validation, but when you move to larger datasets that take more than a second, we are very competitive with LightGBM and significantly better than XGBoost. XGBoost and LightGBM do not have exactly the same hyperparameters, so it is hard to compare exactly, but this is the kind of performance you can expect.

Now I would like to switch to an interactive demo. Here I have a Jupyter notebook where I will show how to use scikit-learn in a typical use case, compare different models, and in particular focus on gradient boosted trees. The dataset — sorry, I was not at the beginning — the dataset I am going to use is what we call the California housing dataset. It is a small real estate dataset with about 20,000 samples, so it is not very big; you have these features describing the housing in each district, and the goal is to predict the price of the houses in different neighborhoods. It is all numeric data. First we split the dataset into a training set and a test set, so that we can measure quality on the test set and not cheat by just memorizing the prices observed on the training set. Then we start with a linear regression baseline — with scikit-learn, doing linear regression is about as easy as in Excel, and it is very fast: you see it takes 11 milliseconds. We can compute the error — the mean absolute error; I do not remember the units, but it must be several tens of thousands of dollars or something like that. You see that on the training set the error is slightly lower than on the test set, which is expected: it is easier to memorize than to generalize.
You see this number — it is just a baseline. We can also draw this kind of plot to compare the predictions, on the x axis, to the true labels we wanted to predict on the test set. If you had a perfect model, all the points would lie on the diagonal — the identity function, basically. Here you see many off-diagonal points, and especially for the large values our model tends to under-predict: it goes off-diagonal, there is this kind of bias. This is probably because of the distribution of the true labels: this is the histogram of the prices, and you see a censoring effect at 5, which means that housing above this price was censored — capped — and the database just recorded 5 instead of the true value. This is an artifact of the training set, and for linear models it can actually be a problem, because the loss function they use does not make this kind of assumption. So we can filter those out, and we can also take the logarithm of the target so that it is distributed more like a Gaussian, and see if our linear regression performs better in this case. We also build a pipeline with some standard preprocessing, fitting the linear regression on data that has been preprocessed and on labels that have been preprocessed. If we do that and compute the scores again — when computing the scores we need to invert the preprocessing we applied to y — you see it is already significantly better: just by using a linear model, let's say, more correctly, we improved the predictions by quite a bit.

Let's see if we can use a non-linear model to gain some more expressive power. In scikit-learn you can build complex pipelines that transform the data: in this case we do polynomial feature extraction, keeping the original numerical features and adding pairwise interactions — actually interactions up to degree 5 between the numerical variables — and after this feature expansion we fit a final linear model. But the full pipeline is no longer a linear model: it is more complex, more expressive. It is significantly slower — 4 seconds instead of 11 milliseconds — but you see the plot is more and more diagonal: you no longer see that kind of extreme pessimism from the model. Instead of doing feature engineering to turn a linear model into a non-linear pipeline, we can directly use a non-linear model like a neural network — a kind of generalized linear regression with built-in non-linear capabilities — and that is also more expensive. scikit-learn has basic neural networks; if you really want to use neural networks, I would advise you to use Keras, TensorFlow, or PyTorch instead, because you have more flexibility to design the architecture you want, but if you just want a traditional MLP architecture you can use scikit-learn on the CPU. Here you see the error is getting even slightly better than with the feature extraction we did manually with the pipeline — it is 0.21 on the test set — and the training time is starting to be slow: 8 seconds now. We can draw the plot again: it is more and more diagonal; still not perfect, but slightly better than before.

Finally, gradient boosting. Whenever you have this kind of tabular data — some kind of Excel spreadsheet with different columns with different physical units, like the number of rooms, the criminality in the neighborhood, whether or not it is close to public transportation, whether or not it is close to the ocean, the GPS coordinates — different columns with very heterogeneous types and units — this is what we call tabular, structured data. For this kind of data, neural networks are typically not significantly better than traditional machine learning; in particular, decision trees and gradient boosting are very, very competitive, and most of the time significantly faster and less finicky to tune. So if you have this kind of tabular data, this is the kind of machine learning algorithm we recommend trying very quickly — it is always good to start with a linear model, but then very quickly try this one to see if it is better. This is the traditional gradient boosting that was previously implemented in scikit-learn — the exact algorithm. You see it takes a bit more time, 6 seconds — and this is with the original dataset, without filtering the censored values — and it is already much closer to the diagonal; the test error is even smaller than with the neural network, and it is easier to find good hyperparameters with this algorithm. Something interesting to note: if you serialize the trained model, it is not very big — 1.5 megabytes — so it is easy to store on disk and to deploy: you can load it in memory on servers to run predictions on a compute farm, or even on mobile phones; they are small models most of the time. And if you time the predictions for a batch of 100 samples, you see that predicting for 100 houses takes just a couple hundred microseconds — less than 1 millisecond — so they are also very fast at prediction time, which is a good property for putting a model in production.

You can contrast this with the random forest, which is another way to build ensembles of decision trees — more traditional, and maybe more popular in the past. From a training time point of view it is roughly equivalent, and from a test error point of view it is very similar: it is 0.19 here, versus 0.18–0.17 there, so very close — typically gradient boosting tends to be slightly better than a random forest. But the big difference is the size of the model: with random forests you generally build much deeper trees, so they take a lot more space in memory, and furthermore they take a lot more time to predict — you see it is hundreds of milliseconds for the same predictions, more than 30 times slower than the gradient boosted trees. This is why, in production, people typically favor gradient boosted trees over random forests: you can reduce the inference cost.

Now in scikit-learn you also have the new histogram-based gradient boosted trees, so you can do the same, and what you observe is that for this small dataset the training time is very similar — on small datasets it does not make a big difference — the accuracy is very similar to the traditional gradient boosting algorithm, and the model size is slightly bigger, maybe because this particular run built slightly more trees, I don't know exactly, but this is not very important; and it is also quite fast at prediction. What is very interesting is that if you used this on a dataset with millions of data points, the traditional method I demonstrated before would not work at all — it would crash, use too much memory, and be much too slow — whereas this one has no problem with tens of millions of data points: it would take between tens of seconds and a couple of minutes, depending on the hyperparameters. Something else that is very interesting with histogram-based gradient boosted trees is the ability to do early stopping. Remember that I said we train the trees sequentially, one after the other, each trying to fix the errors of the previous one. While doing that, we can keep monitoring the training and validation error of the model: on the x axis is the number of trees in the ensemble, on the y axis the score function — the higher the better, the negative error in this case. You see that as you add more trees, the training accuracy basically keeps increasing.
validation accuracy on the held out validation set is also increasing but at some points it's reaching a plateau it's making no longer good significant improvement so what early stopping does is basically computing this accuracy on the validation set after each tree such that whenever you detect that you are reaching this plateau you stop and by doing this we stop before reaching 100 trees so it's smaller than before we can build smaller models that are faster to train because we do not need to go to the end and faster to predict and smaller to store in memory or on a disk so it's very good to use early stopping in practice ok, I will stop here for the demo so that I have some time left to talk about other things in Psykitlan so in the previous release of Psykitlan 020 that was published in September or October last year they were also a bunch of very cool features that some people might not necessarily know in this room and in particular one that I would like to highlight is the column transformer and there was a lot of effort done to make it much easier to do a feature engineering typically on heterogeneously typed typed pandas data frames so with columns with categorical variables columns with numerical variables different distributions and so on there were also other improvements but those are the ones that I would like to emphasize so for instance here we are using pandas to read a CSV file from the titanic dataset so it's the list of all the passengers of the titanic with a specific column that highlight whether or not the passengers have survived the titanic or not and so we can introspect the different columns of this data frame and look at the data type the data type of the data frame whether or not they are integer of floating point values and if they are integer of floating point values we say that this is a numerical column numerical data if it's not integer of floating point for instance if it's a string, an object or something it's probably the label 
of a categorical variable at least it's a case on this data set so because of that we decided to split the columns into two groups, the numerical ones and the categorical ones and then we can use this method to define two different pipelines one that will apply for the numerical values and one that will apply for the categorical values so for the numerical values we will do a missing value imputation with the median whenever there is a missing value in a record in one of those numerical columns we compute the median for the non missing ones in the same column and we put that instead we can also insert a new indicator column to say that we have done this imputation and then we will use the numerical processing tool which is called kbin's discretizer that we present in the next slide so this is the pipeline for numerical values and we can do a similar pipeline for categorical values but in this case the missing values we cannot compute the median because a category is just a name, a label so in that case we just fill in with a constant value which is missing and that's enough and then we use the one-hot encoder dummy categorical value encoder and then we can combine the two pipelines using the column transformer and call that the resulting operation the preprocessor once we have this preprocessor we can make a final pipeline that stage the preprocessor first and the classifier second the logistic regression in this case and we can pass this full pipeline to the calculation procedure that will do the model evaluation of the full set of modeling decisions basically and if you do that you see the accuracy that you get is 81 which is a very good baseline for this data set and you see that the code that we have tried to do this level of flexible preprocessing is kind of limited and still very easy to understand so the numerical preprocessing that I mentioned is the cabins discretizer so this is interesting because on this we have three different data sets so the first 
data set you have the purple points and the green points that are arranged into overlapping folded half moons so the goal here is based on the position in this 2D space the position is the input for the model and the color of the dot is the output that we want to predict basically the goal of the model is to generalize the color basically to all the possible locations in this space and on the first column you see that we fit a logistic regression classifier which is basically a linear classification model so it will try to find a linear boundary into that 2D space to separate the two classes and when the two classes are overlapping like this you see that it's not possible to use the linear model for that for the last row you see that the two groups are approximately separable but in this case the linear model is optimal but for the other data sets the first two lines it's not possible to use a linear model and get good performance in this case but what we can do is use the cabins discretizer as a preprocessing stage before the linear model and in this case we will group values between ranges of possible values and output more features and if we do a linear decision boundary in that higher dimensional feature space then it corresponds to a non linear decision boundary in the original feature space and this is what we observe so the second column is the combination of cabins discretizer with linear regression and you see that logistic regression and you see that the quality even on non linear problem is very good and this is a very fast model so you can compare to gradient boosting that are also non linear by nature so they can solve this classification problem quite easily but the first one is significantly faster and smaller to deploy so in this case it would make sense to just do this kind of preprocessing plus a linear model and that would be enough so there are other features that were introduced I will skip those I just highlight that when we worked on that we 
also fixed a lot of things directly into the python standard library for the serialization of large objects with non pyraise to work on a distributed computing cluster so it's also important to keep in mind that the python ecosystem is a community and when you work on your project sometimes you want to quickly fix and hack stuff on your project but sometimes it's better to make the investment to fix the problem upstream so that other project can benefit from the solution so this we improved the pickle module over a span of 2 years and now it's going to be integrated in python 3.8 so I would like to just go to the end and thanks to the partner of the INRIA foundation who supported this work and thanks more generally the scikit-learn developer community and users community so you see pictures of last couple of sprints international sprints in Paris, in Austin and in New York and thank you for your attention maybe we have a couple of minutes for one or two questions so there is a microphone here hello, thank you for the presentation are there slides somewhere and ipython notebook available I will, they are already online but I need to tweet the URLs I will tweet them my twitter is ogrizel ogri is here and probably that the conference organizers will collect the slides and put them on the page of the website and the second question you introduced a few algorithms one after another in a ipython notebook and random forest was the last one but the random forest and you said it's used in production very often but the random forest often over fits from experience the random forest can over fit but it's not necessarily the case typically in random forest the more trees you put in the random forest the lower you will over fit so by putting more trees you can decrease a bit the over fitting you can also limit the depth of the trees using max depth in scikit-learn and you can also do some kind of feature selection to remove the features that are not predictive and that will 
also help combat overfitting. Sometimes you have a lot of noise in your data and there is no perfect solution, but generally I would say that it is possible to reduce the overfitting with random forests. The main problem with random forests is that they are much bigger models and they are slower to predict compared to gradient boosting, as in the new scikit-learn, or in XGBoost, or in LightGBM: those models are smaller and faster to predict, which is why they tend to be favored in production.

OK, thank you. I have a question about your ColumnTransformer. Basically it seems that pandas now works directly with sklearn: you do not have to transform everything to a NumPy matrix before you input it. There was a tool called sklearn-pandas; you do not need that anymore?

sklearn-pandas was basically a big inspiration for the ColumnTransformer. We wanted to have something similar by default in scikit-learn, because when I started to learn scikit-learn I was really a bit put off by that. It makes it much easier to do feature engineering on original data that was loaded with pandas, and in the future this could probably also be adapted to work with other kinds of tabular data structures.

Hi, thank you very much for the talk. A question regarding the support of categorical variables: you said it is ongoing; is there a timeline?
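As a hedged sketch of the ColumnTransformer workflow described above (the column names and the choice of transformers here are invented for illustration, not taken from the talk), feeding a pandas DataFrame directly into a scikit-learn pipeline might look like this:

```python
# Illustrative sketch: a pandas DataFrame fed directly to scikit-learn
# via ColumnTransformer, selecting columns by name instead of converting
# everything to a NumPy matrix first. Column names are made up.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Paris", "Austin", "New York", "Paris"],
    "clicked": [0, 1, 1, 0],
})

# Apply different preprocessing to numeric and categorical columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

model = make_pipeline(preprocess, LogisticRegression())
model.fit(df[["age", "city"]], df["clicked"])
```

The point of the design is that each transformer only sees the columns it is assigned to, so heterogeneous tabular data can stay in its original pandas representation until the very last step.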
OK, so there are two ways to support categorical variables. Either you do preprocessing, as I showed with the ColumnTransformer, and that works for all the models in scikit-learn and gives you much more flexibility to put in some business logic, like filtering out rare categories. We plan to improve this: so far we have one-hot encoding, which is good for linear models and this kind of neural network, and we also have ordinal encoding, which is typically better for decision trees. In the future we want to have impact coding, which uses the target variable to find a good representation, and better support for weird distributions of rare values. Sometimes you also have categorical labels that are informative but have typos, that are almost words; for this there are third-party projects like dirty_cat that are meant to be used in a scikit-learn pipeline to do this kind of categorical variable preprocessing.

And then, for decision trees, it is possible to deal with categorical variables directly inside the decision tree. This is not yet implemented; right now we are focusing on missing values, and there are a couple of other things that have more priority, like sample weights, but I think this comes after that: in the coming months we plan to work on it.

Can this new algorithm work on GPUs?

Most of the scikit-learn solvers depend on either NumPy or SciPy, and those libraries do not support working with GPUs right now. Furthermore, some models would not really benefit from GPUs: for instance linear models are memory bound, so there is no point in trying to use GPUs for them. But there will be a presentation by NVIDIA later today, and maybe another one on RAPIDS AI, one on Dask; I am not sure, you will see on the program, or you can ask Peter here. Those projects basically provide estimators similar to scikit-learn's, with a compatible API. It is not necessarily an exact
drop-in replacement, but it has basically similar features, and some of them can really benefit from running on the GPU. For decision trees and gradient boosting, though, it is not always the case that running on a GPU will be faster, so keep in mind that it is not as with neural networks: convolutional neural networks really benefit from GPUs, but for decision trees and for linear models, sometimes the CPU is good enough.

Hi, thanks for the very informative talk. Maybe you can quickly comment on regularization: you mentioned L2 regularization in the prototype; are you also planning to include others, like L1?

So basically, when we do the binning preprocessing, we simplify the representation of the data because we decrease its precision: we use 8-bit integers, 256 levels, to represent the original values that were encoded as 64-bit floats, for instance. By doing so we reduce the complexity of the model, and this is a kind of regularization. For decision trees it can help a bit; it is not magical in any way, but sometimes you observe that with this approximate method you get better performance than with the exact method. This might be a bit surprising, but it is caused by this regularization effect.

I think we should stop here, because we are running out of time. OK, thank you very much again.