Now we are going to have Katharina and Matthias talking about auto-sklearn — sorry if I mispronounce the name — Automated Machine Learning in Python. Hello, people. How are you? Hi, good, thanks. Where are you streaming from? Freiburg in Germany. Oh nice, and how's the weather over there? Pretty warm, it's too hot outside. And is it sunny? Yes, very sunny. Oh, you're lucky. Here in the Netherlands it's warm but super cloudy. Well, it might change in the next hour. Let's hope so. So if you're ready, we're going to start with Katharina, right? Yes, exactly. Okay, thanks.

Yeah, welcome, and also thanks for the kind introduction. It's my pleasure to have a talk here at EuroPython on AutoML in Python and to present our open-source tool auto-sklearn.

First of all, why are we doing this? Why do we need AutoML? Machine learning is a key technology that is already being used in many applications and will be used in many more. However, mastering the art of machine learning requires expertise and experience. AutoML democratizes machine learning and makes it available for everyone. Our vision is a bit more specific, because we want to do this in four lines of code. These four lines perfectly summarize our session today: import auto-sklearn, instantiate, fit, predict. Done.

Our goals for today are: we want you to get excited about automated machine learning and to see what has been achieved and what could be possible in the future. Also, we want you to understand how auto-sklearn works, so that you can apply it to your own applications. We're going to split the session in two parts. First I'll tell you about auto-sklearn and how it evolved over the past years, and then Matthias will give you a short demo session. We'll have two Q&A slots, one after my part and one after Matthias' part.

So then, let's start. Let's assume you have some data and plan to use scikit-learn for that. You read through the docs and stumble across this cheat sheet. It describes a huge decision tree compiling many rules of thumb for when to use which algorithm. This is awesome — I do like this cheat sheet. You follow a path and end up with a model to use. While this is a good starting point, there are many more decisions to be made in order to find the best-performing model. And wouldn't it be great if there were something that automatically made these decisions for you?

Before I tell you how we do this in auto-sklearn, let's have a closer look at the design space we search. I said assume you have data, so now we assume you have X_train and y_train — your data and labels. You might also have a test set for which you don't know the labels yet. You have a budget, which is the compute resources you're willing to spend on solving this problem, and there is a loss function that describes the performance measure you're interested in, like accuracy, error, or AUC.

Then, the machine learning pipeline. For us, that's a pipeline that consists of a data preprocessor, such as outlier detection or imputing missing values, then a feature preprocessor that does dimensionality reduction or adds new features, followed by a classifier that does the actual predictions. More concretely, this is what it looks like for scikit-learn. For each of these steps in the pipeline you have a lot of choices.
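To make that pipeline structure concrete, here is a minimal scikit-learn sketch with one illustrative choice per step (imputation, PCA, random forest); these particular components and the synthetic data are just examples, not what auto-sklearn would necessarily pick.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer              # data preprocessor: impute missing values
from sklearn.decomposition import PCA                 # feature preprocessor: dimensionality reduction
from sklearn.ensemble import RandomForestClassifier   # classifier: does the actual predictions

pipeline = Pipeline([
    ("data_preprocessing", SimpleImputer(strategy="median")),
    ("feature_preprocessing", PCA(n_components=10)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Synthetic stand-in data, only so the sketch runs end to end.
X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=0)
pipeline.fit(X_train, y_train)
```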
Not only do you need to select an algorithm, you also need to select the hyperparameters for each algorithm. In total there are over 150 hyperparameters, and growing, which can be set in order to construct a pipeline. So the question is: how do we find the best configuration in that space?

The answer is black-box optimization. It's called black box because we have no idea what's inside: we don't have any gradients, and we assume we don't know anything that helps us find the best configuration. Really, the only mode of interaction is querying it with an x, which is the configuration of a machine learning pipeline, and observing f(x), which is the performance of that pipeline. And since time and resources are finite, we want to execute this loop as few times as possible, and we want to select the x's to evaluate in a smart way.

The most obvious way would be a human optimizer, also called grad student descent. That comes with the advantage that you learn a lot about your system, but it's probably not very efficient, it's error-prone, and you really need expert knowledge or a lot of time. Slightly better would be grid search. It's a very simple approach: you just discretize each dimension and evaluate each grid point. This can be done in parallel, and you can also use it to study your problem. However, it does not scale to high dimensions, and the grid needs to be defined up front. Even better is random search: even simpler, still easy to parallelize, and it eventually converges to the optimum. It's still not very data-efficient and thus computationally expensive.

So what else is there? There is Bayesian optimization. That is the state of the art, and it's a very data-efficient search procedure. It trades off exploration, meaning it explores areas of the design space that it has not seen, against exploitation, meaning it exploits areas of the design space that it believes to be promising. On the downside, we used to say it does not scale with parallel resources, but there are solutions for that as well.

And that's what we do: we use Bayesian optimization. Combining this with our pipeline, we already get the very vanilla version of auto-sklearn. Well, this works, but it's not very efficient, and we added many things to speed it up.
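To illustrate the black-box loop, here is a small, self-contained random-search sketch over a toy configuration space; the dataset and the two candidate models are arbitrary stand-ins, not auto-sklearn's actual search space.

```python
import random

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Toy configuration space: which model to use and one hyperparameter for each.
search_space = [
    ("random_forest", {"max_depth": [3, 5, 10, None]}),
    ("svm", {"C": [0.1, 1.0, 10.0]}),
]

def sample_configuration():
    """Draw one configuration x uniformly at random from the space."""
    name, params = random.choice(search_space)
    return name, {key: random.choice(values) for key, values in params.items()}

def evaluate(config):
    """f(x): train the configured model and return a loss (1 - CV accuracy)."""
    name, params = config
    model = RandomForestClassifier(**params) if name == "random_forest" else SVC(**params)
    return 1.0 - cross_val_score(model, X, y, cv=3).mean()

best_config, best_loss = None, float("inf")
for _ in range(20):                # finite budget: as few evaluations as possible
    config = sample_configuration()
    loss = evaluate(config)        # the only interaction with the black box
    if loss < best_loss:
        best_config, best_loss = config, loss

print(best_config, best_loss)
```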
The first thing we did was adding meta-learning. We have a lot of data obtained in previous experiments, because we run auto-sklearn over and over again on datasets, and we want to reuse that experience, so that for a new dataset we don't have to start from scratch. Concretely, we collected a lot of datasets and searched for a good configuration on each of them. For a new dataset, we compute the distance of this dataset to the datasets in our database, and we start the optimization procedure by running the best-performing configurations on the nearest ones. So we warm-start Bayesian optimization.

The second thing is that we figured that simply returning only the best model is kind of a waste of resources, and it's well known that a team of models performs much better. So we construct ensembles to get the most out of it. Combining all of this, that's the first version of auto-sklearn. It works nicely and defined the state of the art in 2015 by winning several prizes in competitions.

However, times have changed, datasets grew larger, and auto-sklearn needed to scale. We discovered two shortcomings: (a) meta-features can be quite expensive, and we need them to compute the nearest datasets for warm-starting; and (b) large datasets can be an issue, since it can be too expensive to evaluate even a single model. I'll describe our approach to tackling these in the following.

The first change was getting rid of the meta-features and the k nearest datasets. Instead of computing the initial configurations from scratch for a new dataset, we went with a portfolio. This means that for every run of auto-sklearn we execute the same set of pipelines, and for this it's very important that the set is diverse and covers as many use cases as possible. For that we use a greedy algorithm to compute the portfolio. Assume that on the right you can see a large set of candidate configurations on the x-axis, C0 to C4, and some datasets, D0 to D5. We start by adding configuration two to our portfolio — that's the one in the middle — because it has the best average performance: it's not really good on any dataset, but also not bad. The next addition to our portfolio would be configuration zero, because it performs really well on datasets one to three, and then we would add configuration four, which performs well on datasets four and five. These three configurations cover all datasets in the database and would be a good, diverse portfolio to warm-start optimization.

Having solved that, we took care of large datasets. For that we rely on successive halving, which recently gained a lot of attention. Successive halving itself is a pretty simple concept. If you have an iterative algorithm like gradient boosting or neural networks, you can get a good estimate of the final performance after only a few iterations. This also works for data subsets, when you evaluate the algorithm on a smaller subset of the full dataset. Successive halving exploits this by allocating more resources to promising configurations. How does this work in detail? You have a few configurations, and you evaluate all of them on the lowest budget. Then you drop half of them — those are the lines that do not continue — you double the budget and evaluate the remaining configurations. Then you again drop half of them, and so on. Of course, you don't need to halve; you could also keep only a third and then triple the budget, but that's the basic idea. You do this until only the best configuration survives, and by that you can evaluate many more configurations than if you evaluated all of them on the full budget.
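As a rough sketch of the successive halving idea (not auto-sklearn's actual implementation), assume an `evaluate(config, budget)` function that trains a configuration with the given budget — e.g. boosting iterations or a data subset size — and returns a loss.

```python
def successive_halving(configurations, evaluate, min_budget=16, eta=2):
    """Keep the best 1/eta of the configurations at each rung and multiply the budget by eta.

    `evaluate(config, budget)` is assumed to return a loss (lower is better) for a
    partial training run with the given budget.
    """
    budget = min_budget
    candidates = list(configurations)
    while len(candidates) > 1:
        # Evaluate all surviving configurations on the current budget.
        results = sorted((evaluate(config, budget), config) for config in candidates)
        # Drop all but the best 1/eta of them ...
        candidates = [config for _, config in results[: max(1, len(candidates) // eta)]]
        # ... and give the survivors eta times more budget.
        budget *= eta
    return candidates[0]

# Usage sketch: best = successive_halving(sampled_configs, evaluate)
```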
But what about small datasets? We want an AutoML system that works on all datasets, on large ones and on small ones. On small ones, we probably don't want to use successive halving, but rather evaluate everything on the full budget and maybe use cross-validation.

So let's have a look at how these decisions impact performance. What we have here are results from running our tool with different optimization strategies. All of this is balanced error rate, so lower is better. On the left-hand side of the plot you have full-budget evaluations, runs that use the full budget; on the right-hand side you have runs that use successive halving. On this dataset, we would probably want to use successive halving and holdout. But on another dataset exactly that policy performs very badly, and we would rather use the full budget and 10-fold cross-validation. Also, if you compare different time horizons — on the bottom left we run the system for 10 minutes, and on the bottom right for 60 minutes — for the shorter run we would want to use holdout, but if we run for longer, cross-validation would have been a better choice. So the conclusion is: it depends on the dataset.

So, did we make it worse? AutoML aims at making the application of machine learning easier, and now I've told you that there are even more hyperparameters. Did we solve one problem and create a few more? And can we automatically select the optimization policy? That's what we covered in our latest research, and the short answer is yes, that's possible. We can do this with a learned selector; for more details I refer you to our latest work, while I'll now jump to a comparison.

In the beginning I told you about auto-sklearn 1.0, which uses holdout, evaluates everything on the full budget, and uses the nearest datasets. Now we have auto-sklearn 2.0, which uses the selector and portfolios to warm-start optimization. Here are some results from our work. We show the average balanced regret across 39 datasets from the AutoML benchmark for two time horizons, 10 minutes and 60 minutes. We can observe the following: the longer you run the tool, the better it gets, and second, auto-sklearn 2.0 is significantly better, because it achieves a lower regret. Yay!
You'll also see more numbers in Matthias' part. Let's now move on to the conclusion, and let me briefly talk about the success stories of auto-sklearn. The first version of auto-sklearn was built in 2015 to participate in the first AutoML challenge. For this challenge, participants had to submit software that runs without human interaction and produces predictions for unseen datasets — so, the ideal AutoML setting. Our small side project evolved into a full-grown project, and we spent many nights making our submission robust and efficient. The effort paid off, and in the end auto-sklearn dominated the challenge by winning a substantial number of prizes. For the next two years or so we did some further research and maintenance, and then in 2018 we participated in the second AutoML challenge, for which we had to scale — that's why we introduced successive halving and portfolios — and we won again. A nice intermediate milestone for us was in 2020, when we opened the 1000th pull request and had more than 5k stargazers on GitHub. And finally, earlier this year we released auto-sklearn 2.0 as a first step towards truly hands-free AutoML.

This also brings me to our current team putting effort into this, with Matthias, me, and Eddie working on the code, and Marius and Frank providing new ideas and feedback. Most importantly, since this is open source, there are many other contributors — maybe you — providing bug reports and fixes. Thanks a lot for that.

That concludes my part. I talked about auto-sklearn and AutoML in four lines of code. It's built on scikit-learn and our latest research, it's open source, and you're welcome to try it yourself. And now I'd say we take some questions, if there are any. Otherwise I'd hand over to Matthias, and we'll have another, larger Q&A at the end.

Okay, yeah, we have some questions, let's see. The first one is: what are the pros and cons of Bayesian optimization as used in auto-sklearn, compared to evolutionary algorithms like the one used in the library TPOT? Shall I take that? Sure. Yeah, so the difference is the way it searches. Bayesian optimization uses another machine learning model to guide the search, while an evolutionary algorithm has a population of solutions and evolves them over time. The evolutionary algorithm used by TPOT is actually a very special one, as it can handle basically an infinite space of possible machine learning pipelines. That has the advantage that it can find really nice solutions that are out there, but it also doesn't scale that well. There is an open-source AutoML benchmark where both are compared, and at least in the runs we did recently ourselves, we fare a bit better than TPOT.

Okay, the next one is quite a long one. Can you please briefly explain the idea of meta-learning that is applied for meta-features? Given a new dataset, how is it possible to find the relevant nearest datasets using meta-learning? You take that too? Yeah, I guess. So the idea here is to describe a dataset by meta-features — meta because we are on the meta level. These features describe the properties of the dataset, such as: how large is it, how many features does it have, how many classes does it have, how many missing values does it have, etc. We then have a feature description of the dataset. We can use something like the L1 distance — or also the Euclidean distance, it doesn't really matter — and then we basically have a space of datasets. If we map a new dataset into that space, we can compute the distance to all the neighbors and figure out which ones are closest. We take the closest ones, look at what performed best on them, put those configurations in a list, and run those first.
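A minimal sketch of that warm-starting idea: describe each dataset by a few meta-features and pick the nearest previously seen datasets by L1 distance. The meta-features below are only illustrative, not auto-sklearn's full set.

```python
import numpy as np

def meta_features(X, y):
    """Describe a dataset on the meta level: size, dimensionality, classes, missing values."""
    return np.array([
        np.log(X.shape[0]),     # number of samples
        np.log(X.shape[1]),     # number of features
        len(np.unique(y)),      # number of classes
        np.isnan(X).mean(),     # fraction of missing values
    ])

def nearest_datasets(new_mf, database, k=5):
    """Return the k database entries closest to the new dataset under the L1 distance."""
    ranked = sorted(database, key=lambda entry: np.abs(entry["mf"] - new_mf).sum())
    return ranked[:k]

# The database would hold, per known dataset, its meta-features and its best configuration:
#   database = [{"mf": meta_features(X_i, y_i), "best_config": ...}, ...]
# Warm start = run the best configurations of the nearest datasets first:
#   warm_start = [d["best_config"] for d in nearest_datasets(meta_features(X_new, y_new), database)]
```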
Just checking — is there anything else? Let's get back to the questions; we have two more. If you want, we can read one more and then you can continue, and at the end, if we have more time, you can finish them. This one is a continuation of the previous one and asks: are you collecting meta-features like statistics — mean, standard deviation, data skewness — in order to compare the datasets? I think the short answer is yes. Perfect. Among others. And well, now if you want, you can continue.

All right, thanks — and see you in the second Q&A. So yes, let's now continue with the demo session. I will be using two notebooks; you can find them on github.com/automl/auto-sklearn-talks. As said before, auto-sklearn is a drop-in replacement for scikit-learn, and we'll demonstrate that in this notebook. Briefly, I'll run you through installing auto-sklearn, getting data, and setting up a few baselines before actually running auto-sklearn. We will then have a look at what the tool did and inspect the model to see which features are most important.

The first thing — I didn't execute this cell because I already had auto-sklearn pre-installed — but it's as simple as pip install auto-sklearn. It'll download the package dependencies, and for the dependencies we have built wheels, so you do not need a compiler on your local machine.

The second step is loading the data. For this we'll be using a scikit-learn helper function called fetch_openml. You provide it with the dataset name; for this example we'll use the demo dataset credit-g, short for the German credit dataset. We pass it the argument as_frame=True so that we actually get a DataFrame, and we also ask scikit-learn to return the X and y values directly instead of giving us an internal object. We then split into a train and a test set.

This dataset comes from OpenML. OpenML is a platform for hosting machine learning datasets and results, and we can go there and have a look at what this dataset looks like. One of the nice things it does for us is calculating statistics of the features and visualizing them. We can see here the features of the dataset, 21 in total. Most important for us is the feature 'class', which is the target that we aim to predict. It is nominal, so it's a classification dataset; it has two unique values, good and bad — whether it is a good request for credit, or a bad request where we don't want to give that person credit. We can see that the data is imbalanced: there are 700 samples of good, while there are only 300 samples of bad. Besides that, this also tells us that a lot of the other features are nominal and some of them are numerical, so we have mixed feature types. If we scroll down further we get more information about the dataset, such as that it's pretty small — that's why we chose it as the example dataset — and that it has zero missing values, which actually makes our life a bit easier, because we do not need to take care of those as well.
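For reference, the loading step just described looks roughly like this; the exact notebook code may differ slightly.

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch the German credit dataset from OpenML as a pandas DataFrame,
# returning X and y directly instead of a Bunch object.
X, y = fetch_openml("credit-g", as_frame=True, return_X_y=True)

# Hold out a test set for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```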
So that was a look at the dataset. I briefly want to mention: if you use data from somewhere else, you can also use your favorite pandas method to figure out what's in the data, for example describe, or you can use a tool like pandas-profiling, but that's not within the scope of this presentation.

So let's move on to actually doing some machine learning. As a first step we'll fit a model that is often used as a default or first example model in a machine learning class, because it's really nice to explain: the decision tree classifier. We import DecisionTreeClassifier and instantiate it without specifying any hyperparameters, so we just use it as scikit-learn provides it to us. We then have to do something about the categorical data, because unfortunately scikit-learn doesn't like categorical data very much; it needs to be transformed into something it can handle. What we do here is use a so-called ColumnTransformer, which applies individual transformers to certain subsets of the columns. In this case we need to convert the categorical features into numerical ones. Because we have a decision tree, we can use the OrdinalEncoder, which simply replaces each category by consecutive integers. The numerical features we can simply pass through to the decision tree. We then put these together into a pipeline, call fit on it, and then predict probabilities.

Now, the first important thing: as we've seen before, the dataset is imbalanced, so we cannot use accuracy for assessing the quality. That's why we opted for the area under the curve score, and we use it now to estimate the goodness of fit of the decision tree to the data. We can see that we obtain an area under the curve of 0.57, which, if we recall the properties of the area under the curve, is pretty bad: 0.5 is pure random, while 1.0 is a perfect fit, and this is closer to random than to perfect.

Let's have a look at another model that is commonly used in machine learning examples, which is a support vector machine. In contrast to the previous example, we have shown here the hyperparameters of the support vector machine as well, so that you can see how much you can tune in order to get the best performance. The support vector machine also requires different preprocessing. Instead of the OrdinalEncoder we now need to use a OneHotEncoder, which replaces each category of a feature by an individual feature, one per category, which is set to one if that category is active for the sample or zero otherwise. Also, this time we need to scale the numerical data in order for the support vector machine to work, and we use a StandardScaler, which scales the data to have zero mean and unit variance. Again we put this into a pipeline, call fit on it, and again compute the area under the curve. This time it is 0.79, so that is quite an improvement. But as you might have seen, there were quite some steps to get there: you need to know how to transform the data, actually do it, and then fit the model.
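A sketch of the support vector machine pipeline just described — one-hot encode the categorical columns, scale the numerical ones, then fit an SVM. It assumes the X_train/X_test split from the loading step above; note that predict_proba requires probability=True on the SVC.

```python
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

preprocessing = ColumnTransformer([
    # Replace every category of each categorical feature by its own 0/1 indicator column.
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_include="category")),
    # Scale numerical features to zero mean and unit variance.
    ("numerical", StandardScaler(),
     make_column_selector(dtype_exclude="category")),
])

svm_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", SVC(probability=True, random_state=1)),
])

svm_pipeline.fit(X_train, y_train)
# Column 1 corresponds to classes_[1] ('good' for this dataset).
y_proba = svm_pipeline.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_proba))
```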
The next model I'd like to look at, which is a model that is currently very popular, is gradient boosting — the GradientBoostingClassifier. It has even more hyperparameters that you can set and change, but as it is based on trees, we can use the same encoding as we did for the decision tree, namely an OrdinalEncoder. We again put the whole thing into a pipeline, call the fit function, again predict the probabilities, compute the area under the curve, and obtain a slightly improved performance.

So we've now seen three different machine learning models applied to the same dataset, but you actually need to know which ones you can use, what their hyperparameters are, and how to do the preprocessing in order to obtain such performance. As we said before, we would like to do automated machine learning in four lines of code, so let's get to that and see what auto-sklearn 1.0 can do for us. Again, we import — that's the first thing — then we construct the classifier, we call fit, we call predict, and that's it. In contrast to earlier, there are a few arguments to the classifier, but these are mostly problem-specific. The default time limit would be one hour; we didn't want to wait that long, so if you download and run the notebook, you don't have to wait for an hour — it actually runs in 10 minutes. You put in a seed so that it's reproducible. Because the dataset is rather small, we decided to use cross-validation instead of holdout. And here, the area under the curve: we need to let the AutoML tool know that we want to optimize the area under the curve, so that we get good performance with respect to the metric we're interested in. And that's it. It does all the machine learning for us, it does all the preprocessing for us, it directly ingests the DataFrame as it is, and if we look at the score, it is now better than any of the three models we had above. Which is pretty nice, because we didn't really do anything: we just imported a tool, called it, waited 10 minutes, and got the results.

To show you a bit of what it did, let's have a look at what happened under the hood. There's a function called sprint_statistics, which gives you the dataset name, the metric that we optimized, and the performance of the best model. As you can see here, the performance is quite a bit lower than up there, so I'd say this is due to the ensemble that we constructed. In total, auto-sklearn constructed 44 models — or, it tried to construct 44 and succeeded in constructing 39. One fit crashed, most likely due to some numerical issues in scikit-learn, but that doesn't matter to the overall process, because auto-sklearn is robust to algorithms failing. Three runs exceeded a time limit — that's another thing that keeps it robust: there is an internal time limit that makes sure that (a) we finish on time after these 10 minutes and (b) no single model that we train can take all the time and eat up the whole budget. And one model exceeded the memory limit. We also apply a memory limit to the target machine learning algorithms, so that auto-sklearn doesn't bring your system down, which could otherwise happen in quite a lot of cases.
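Put together, the auto-sklearn 1.0 call just described looks roughly like this. The argument names follow the auto-sklearn API, but the exact values are assumptions based on what was said (10 minutes, a seed, cross-validation, AUC).

```python
from sklearn.metrics import roc_auc_score
import autosklearn.classification
import autosklearn.metrics

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=600,          # 10 minutes instead of the one-hour default
    seed=1,                               # for reproducibility
    resampling_strategy="cv",             # cross-validation, since the dataset is small
    metric=autosklearn.metrics.roc_auc,   # optimize the metric we actually care about
)
automl.fit(X_train, y_train)
# Depending on the auto-sklearn version, you may need automl.refit(X_train, y_train)
# before predicting when cross-validation was used.
y_proba = automl.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_proba))

print(automl.sprint_statistics())   # how many models were trained, crashed, or timed out
print(automl.leaderboard())         # the models in the final ensemble, sorted by cost
```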
Next, we can have a look at the models that were actually trained and selected. These are the models that are in the ensemble, sorted by cost, which is one minus the area under the curve. We can see that, because the dataset is so small, it took very little time to actually fit the models. The first and best one, sorted by cost, is something called a passive aggressive classifier. The good thing is you don't really need to know what this is — it comes from scikit-learn, and you can read up on it later in case you haven't heard about it. The second best one is a QDA classifier, followed by a linear model trained with SGD, and only then do we get the more up-to-date gradient boosting. So if you took a regular machine learning class — I've never seen one which teaches the passive aggressive classifier — you probably wouldn't have chosen it for this dataset. What's also interesting is that the best model does not necessarily get the highest ensemble weight: the tool decides which model should get the highest weight based on how much it contributes to the ensemble performance. It turns out that this model here is very helpful overall, despite not being the best one by itself.

Next we can have a look at the individual models. This is a list of tuples, the first element being the weight again, and the second being the actual pipeline with the hyperparameter settings that were run. These are all the hyperparameters that are required for the pipeline. If you look closely, you can see that a feature preprocessor called kitchen sinks is being used. This basically means that an SVM approximation — an approximate kernel feature map — is used here, which auto-sklearn chose because it apparently performs very well on this dataset in combination with the other models. We can use all this information to learn about what's in the dataset, how to best read it, and how to best handle it.

Finally, for compatibility with scikit-learn, we also provide access to a field called cv_results_, which is how you would access the results when you do hyperparameter optimization with scikit-learn, so that our tool is really a drop-in replacement for a scikit-learn estimator.

Now that we've seen this, let's have a look at what else we can do with the model. We also want to demonstrate that we're really a drop-in replacement, so we'll now have a look at what the inspection module of scikit-learn can do for us and what we can learn from it. For this we'll look at the permutation importance score. Permutation importance is a measure of variable importance, and it tells us which feature of our dataset was most important for the model — in this case the auto-sklearn ensemble. For each feature, it randomly shuffles the data, computes the performance, and also computes the performance without shuffling; the difference is then basically how much worse we are if the information in that feature is removed. If the performance decreases quite a lot, that feature obviously was important for the model.
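A rough sketch of that inspection step, assuming `automl`, `X_test`, and `y_test` from before; it relies on scikit-learn's permutation_importance treating the fitted auto-sklearn object like any other estimator.

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the test AUC drops.
result = permutation_importance(
    automl, X_test, y_test,
    scoring="roc_auc",
    n_repeats=10,
    random_state=1,
)

# Print the features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    print(f"{X_test.columns[idx]:<25} {result.importances_mean[idx]:.3f}")
```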
With that — give me a second here — we learn that a feature called checking_status was the most important one. Not having that information would reduce the area under the curve by about 0.12, and we would end up somewhere around 0.68, so that's quite drastic. On the other hand, we learn that a few features like job, age, or housing are not that important for the model. And you can use this with anything that works with scikit-learn. One caveat, maybe, is that a lot of tools unfortunately cannot work with categorical features — such as, for example, the famous SHAP tool, or also scikit-learn's other ways to figure out feature importance. But we have uploaded another tutorial about regression that only contains numerical features, using the California housing dataset. I'm only briefly going over this, not in detail: there we again have the permutation importance plot, but we also have partial dependence plots, and we also demonstrate how to use the SHAP package to actually compute SHAP values for this dataset and feature importances based on them.

Coming back here, I'd now like to finish this demonstration by showing you how to go even more hands-free with auto-sklearn 2.0. As you can see here, I removed the cross-validation argument — you cannot even specify cross-validation as an argument anymore, because, as you saw before, this is now decided automatically by the AutoML system itself. All the arguments left are how long we want to wait, the seed for reproducibility, and the metric that we are interested in. This actually improves the performance even a bit further, and we get about two points of area under the curve over the standard scikit-learn models by doing less, by using an AutoML tool. What we can also see in the leaderboard is that we get quite some different models: more tree models and a few multi-layer perceptrons, a.k.a. neural networks, that are then used in the final ensemble.

Now that you have this AutoML tool that takes the machine learning part away from you, you can focus on other things, like getting more data or doing feature engineering to improve the results further. Of course, it's also possible to customize auto-sklearn a bit more to your needs. If you go to our documentation website you'll find a lot of examples, starting from basic examples on how to do classification, regression, etc., over how to restrict it to only use interpretable models, how to change metrics, how to do the model explanations I just showed you, and then also things like changing the search strategy, running in parallel, doing random search, or changing the internals of the Bayesian optimization, and finally also how to plug in your own models or replace the models that we currently have in there.
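For reference, the hands-free auto-sklearn 2.0 call described a moment ago looks roughly like this; the time limit and metric are assumptions matching the earlier run, and there is no resampling argument because the system selects the evaluation strategy itself.

```python
from autosklearn.experimental.askl2 import AutoSklearn2Classifier
from autosklearn.metrics import roc_auc

automl2 = AutoSklearn2Classifier(
    time_left_for_this_task=600,  # how long we are willing to wait
    seed=1,                       # reproducibility
    metric=roc_auc,               # the metric we are interested in
)
automl2.fit(X_train, y_train)
print(automl2.leaderboard())      # more tree models and a few MLPs end up in the ensemble
```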
Yes, with that, let me go back to the slide deck. That was the demo session. I'd like to point you to more material that we have; it's available on the AutoML.org website. There is, for example, the book on AutoML, which goes very much into the details, and there's also a blog about the work that the machine learning research group at the University of Freiburg does, including blog posts about the auto-sklearn method, and it has information about upcoming talks and events if you're interested in AutoML. If you are a student of machine learning, a practitioner, or a researcher, there will also be an AutoML fall school with hands-on sessions, networking sessions, invited talks from experts, and so on.

And with that, I'd like to close the presentation. Thank you very much for your attention. If you like auto-sklearn, please leave us a star, and we'd be happy if you drop us an email if you're using it. Thanks again for having us at EuroPython, thanks for your attention, and we'd be happy to take your questions now.

Okay, thank you so much. It's mind-blowing; even though I'm not into the topic, it seems fascinating. And there is quite a bunch of questions, let's see. The first one is: are there possibilities to use reinforcement learning as an optimizer? What are the downsides of using reinforcement learning? Okay, yeah. Yes, it's possible; there are scientific works that use reinforcement learning as an optimizer. I haven't seen any compelling evidence that they are better, and it actually makes the optimization procedure more complicated.

Then we have: can auto-sklearn detect overfitting on the dataset? Well, it would be great if it could, but it doesn't do that. What we do instead is use ensembles, which fight a bit against overfitting, and of course we use cross-validation or holdout to prevent overfitting.

The next one: are there any known limitations to auto-sklearn? Yeah, that is obviously kind of a tricky question, because I guess there are still quite a lot of them. I think one of the biggest limitations is that it only works for supervised tasks, so you need to have a target and a metric. The other thing is that, basically, the way auto-sklearn is built requires a predefined pipeline, which is then searched through. So it will not discover any fancy new pipeline unless you specify it. On the other hand, I think that's also an advantage, because it doesn't go crazy trying all the stuff that most likely won't work; it actually gives you good results in a decent amount of time. And the unfortunate thing, the limitation that was asked about before: it cannot detect overfitting — but none of the AutoML tools can detect overfitting. Maybe also to add here, sorry — maybe not a limitation but a clarification: auto-sklearn so far works only on tabular data. That's what we focus on. It doesn't natively handle images, for example, or audio signals; you would need to transform these into tabular features.

The next one is: can you use auto-sklearn for natural language processing ML applications? You could, if you transform the text into tabular features, but I guess there are other tools for that. And the next one: can auto-sklearn also use models that are not from sklearn, such as the XGBoost classifier? Yes, it can. We provide examples on how to extend auto-sklearn with other models. They just won't be in the metadata, so you would have to run auto-sklearn a bit longer for them to actually have an effect. And the last one: where can we find the notebook? I will post the link in the chat.
I guess that's easiest. Perfect. Okay, well, if people want to continue the discussion, you can go to the breakout room for this talk and ask more questions there. Thank you so much to both of you, Katharina and Matthias. Thanks for having us, and see you over in the breakout room. Yeah, perfect. Okay. Bye. Bye.