useR series that we had in the last month or so. This is going to be led by Michel Lang from the University of Munich; he's a statistician and software engineer; and also by Bernd Bischl, who is a machine learning researcher at LMU Munich. This tutorial will be in English, but closed captioning is available: you just need to press the CC button towards the bottom of your screen, and those closed captions will be in Spanish. We're going to switch between English and Spanish for the introduction only. [In Spanish:] Good morning everyone, and thank you very much for joining us for this webinar on machine learning with mlr3, which is part of the useR series of talks. This webinar will be given by Michel Lang, from the University of Munich, who is a statistician and software engineer, as well as Bernd Bischl, who is a machine learning researcher working at LMU. The webinar will be in English, but we have Spanish subtitles available: simply press CC at the bottom of your screen and the subtitles will appear. The subtitles are available for the first two hours due to scheduling conflicts, but we are working on having subtitles for the entire duration of this event when we upload it to the R Consortium YouTube channel. [In English:] So this webinar will be recorded. I'm just going to mention a few things now. Whoops, the English ones. I just need to remind you that you're going to be abiding by our code of conduct, as well as the useR one, which basically means that you just need to be nice to each other. We're going to be using Slido for your questions; the link is available on your screen right now, and Michel and Bernd will provide the details as well. This webinar will be made available later through the R Consortium YouTube channel. [In Spanish:] Now in Spanish: this webinar will be recorded and will be available on the R Consortium YouTube channel, as I said. If for some reason you cannot access the subtitles directly from Zoom, you can simply go to the link that now appears at the bottom of your screens and activate them there. We will be using the Slido platform to take questions during this event, and that link is on your screens now as well. And finally, I would like to remind you that when you registered for this course, you accepted our code of conduct, like the useR one, which basically says that we must be respectful to one another. The co-hosts today are from R-Ladies Ecuador: Sulema Basurto from R-Ladies Guayaquil, myself from R-Ladies Galápagos, and Elena Chicaiza from R-Ladies Quito. [In English:] So this webinar is being co-hosted by R-Ladies Ecuador; we are three groups doing this today: R-Ladies Guayaquil, led by Sulema Basurto; R-Ladies Galápagos, led by myself, Denise; and R-Ladies Quito, led by Elena Chicaiza. And we should start now, so I'm just going to ask Michel to take over, and I'm going to stop sharing my screen. Thank you very much. Yeah, okay. Thanks for having us. I would like to point out the GitHub page we set up, where you can find all the links that you will need today. It's also linked on the Meetup page, so if you have that link, you can make your way through to this page here. If not, you can go to github.com/mlr-org/user2020.
And then you find this GitHub repository here, where you can find the PDFs for the slides and also some important links. For example, here is a list of packages you need if you want to follow the exercises interactively, and also some other important links: for example, here's a link to our book, right? So if today is a bit too fast, and maybe it will be, you can just look everything up in our book or in the documentation and so on. We also recommend, if you are starting with mlr3, to look at our cheat sheets, which you can also find linked here. For the first part it's something like this here; if you have a printer next to the screen, then now would be a good time to print it out. We are going to talk about three packages today. The first one is the base package mlr3. Then Bernd will take over and introduce mlr3tuning, so for tuning, as the name says. And the last package will be mlr3pipelines, which is a package for pre-processing, pipelining, and all sorts of stuff; really exciting what's possible with it. And as Denise already said, you can ask questions via Slido. We have three parts, and after each part we will have a short break and answer some questions. We will also have a 15-minute break somewhere after one and a half hours. So, yeah, I think I've mentioned everything I wanted to mention; we'll start with the slides, and Bernd has nothing to add. Okay, I'm going to talk about the first package, mlr3. This is the base package; we have something like 20 packages now, and I will give an overview of the ecosystem at the end of my talk here. So we are now just talking about the base infrastructure package. So, why would you want to use a package for machine learning? As you might know, R already has many, many packages for machine learning, for example packages for random forests, for SVMs, things like that, right? And the main problem is that you don't have a unified interface for them. So if you want to do something like a benchmark, if you want to find out which learner works best on a data set, you have to do a performance comparison, right? And if you do this without an extra package like mlr or mlr3, you would start writing something like this. For example, you want to try how well an SVM works. Then you take the e1071 package, for example; there's an SVM implemented there, and this implementation supports a formula interface. So if you want to fit an SVM on the iris data, you say: hey, this is the column I want to predict labels for, and use everything else as features, right? And then you get back something like a model here for the SVM. On the other hand, if you want to do something like gradient boosting here, from the extreme gradient boosting package xgboost, there's no formula interface supported. So you first have to convert the data into a numerical matrix, and then you have to provide the target column with the labels as a vector here. And yeah, so: two packages, two different interfaces. And if you want to compare something like 10 learners, it's often 10 different interfaces, and so you're writing boilerplate code, and this quickly piles up to hundreds or even thousands of lines of code. So yeah, there are other packages which have the same goal as mlr3. You might know the predecessor, mlr; same guys, so we are also the developers of mlr, and mlr3 is its successor. And there's also caret, or tidymodels now.
But they all have the intention to solve this problem here: to make performance comparisons and benchmarking more convenient. So what's in mlr3? We have objects for all the machine learning stuff: objects for tasks, objects for learners and for measures, and so on. And this enables you to do performance evaluation and performance comparisons in an easy and convenient way. Before we jump into machine learning, I have to quickly introduce you to R6, if you don't know it yet. R6 is the class system we use internally, and not only internally: you have to work with these kinds of objects if you want to use mlr3, so you need something like a brief introduction. It's not too hard. R6 objects are constructed by calling the class name and then the constructor new(). For example, if you want to create a classification task, you call TaskClassif and then the constructor new(), and you can pass some arguments here: for example the ID, the data frame of data, and what the target column is. What you get back is an instance of this class, an instance of a TaskClassif object. These objects have fields which you can access, and which just hold information about the object. For example, you can ask the task: how many rows do we have? Just by accessing the field nrow, and the return value here is the integer 150. Objects may also have methods, so it's not just like a list; there can also be functions in there. The task object has a filter() method, for example, and you can say: hey, filter the task so that only the first 10 rows are kept and everything else is discarded. And what's special about R6 objects is that if you do this operation here, you mutate the object in place. This is something you might know from environments in R; it's also called reference semantics in other languages. So after calling the filter method here, and without assigning the result back to task, just by calling this, you change the object task. If you now ask again how many rows we have, it's only 10. This is something many people are not used to in R, but it's basically what environments in R also do. There's another thing: so-called active bindings are also supported. So these fields can be just a value on the object itself, but behind them there can also be a function, which is then called automatically. For example, if you take the field nrow, which is actually an active binding, and you assign something to it, you get an error here, because behind it there is a function which says: hey, you can't set this field, it's not allowed, right? We also use this a lot for things like argument checking. So if you have a task here, and the task has properties, you can't assign NULL, that's not allowed, but it would be allowed to assign some character values here. So this is not that important for working with mlr3, and you don't need to completely understand it, but we will be talking about active bindings a lot later, so that you have an impression of what we mean by this. So it's basically just like a field, but internally a function is called, and this can have side effects. Okay, and that's all on R6 for now.
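To make the R6 mechanics just described concrete, here is a minimal sketch (a sketch only; it uses the iris data frame that ships with R, and the printed output will vary by mlr3 version):

    library(mlr3)

    # construct an R6 object: class name, then the new() constructor
    task = TaskClassif$new(id = "iris", backend = iris, target = "Species")

    # fields hold information about the object
    task$nrow                    # 150

    # methods can mutate the object in place (reference semantics)
    task$filter(1:10)
    task$nrow                    # 10, without any reassignment

    # nrow is actually an active binding: assigning to it errors
    # task$nrow = 5              # -> error, this field cannot be set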
Back to mlr3. When we started developing mlr3, we wrote down some things which we find important and which we have learned from the predecessor mlr. The previous package mlr used S3 as its class system, and we felt kind of limited by that approach. So we really wanted to do object orientation, and so we switched to R6; this was the first thing we wanted to accomplish. Second, we wanted to use data.table more, because, well, it's pretty fast, and we also liked that it has reference semantics and that you can place complex objects in the columns. To be fair, that's also possible with data frames, but not that well supported for printing and so on. And this allowed us to work with the data structures in a more efficient way. And third, we wanted to be light on dependencies. So these are our external dependencies: we use R6, data.table, a logging package, a package to create unique identifiers, a package with some data sets, and digest for hashing. And that's it. And all of these packages don't have any further recursive dependencies. There are some more packages in the dependency profile, but these are all developed by ourselves, so we control those. So, yeah. Now, moving slowly towards machine learning again: we usually start with tabular data for most problems, right? And this is what is supported in mlr3 for now. And we say that we have some columns which are called the features, and one column, or for survival maybe two columns, which is the target: for example, for classification the labels we want to predict, or for regression the outcome we want to predict. Yeah. So the target column basically determines the type of task: if it is continuous, you have a regression task; if it is a factor variable, so discrete, you have a classification task, right? For example, in the iris data set, here is the target column, so this will be a classification task, and these here will be the features. And for construction, we've already seen this: we use the TaskClassif object and its constructor with an ID, the data frame, and the name of the column we want to use as the target column, and we get back an instance of TaskClassif. If we print the task, we get a nice overview here: for example the dimensions, the target, what properties the task has, and what kinds of features are in there. And you can access certain things of these objects: for example, you can access the dimensions here again, the feature names, the target names. You can query the data, you can, oops, subset or filter the task, and you can also extend it, combine columns and rows; there's much more, but the more technical stuff is in the manual pages. One last remark here: calling these constructors is kind of bulky and lengthy, so we use dictionaries for all the objects that mlr3 provides for you. We have some dictionaries where we store, for example, tasks that we use a lot in use cases or in examples. For example, in mlr_tasks the iris data set is in there, plus five or six other tasks, and you can get a task out of the dictionary with a short-form function: tsk() is for tasks, and there are analogous ones for learners, measures, and resamplings. Another advantage is that these dictionaries can be populated by other packages. So if you load the mlr3learners package, the dictionary mlr_learners will be populated with 20 more learners, and if you load the survival extension package, this dictionary then also holds survival learners. So this is what you have to remember: these short forms, which we will be using a lot in the next slides.
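A minimal sketch of these dictionary short forms (a sketch; which entries you see depends on the extension packages you have loaded):

    library(mlr3)

    # list the contents of the task dictionary
    as.data.table(mlr_tasks)

    # retrieve a single predefined object by its ID
    task = tsk("iris")

    # analogous short forms for learners, measures, and resamplings
    learner    = lrn("classif.rpart")
    measure    = msr("classif.ce")
    resampling = rsmp("cv", folds = 5)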
So you can, for example, query what's in there: if you don't provide any argument to this short-form function, you see: oh, the Boston housing data set is there, breast cancer, and so on. And if you want to have a single object out of the dictionary, you provide the ID here. For example, to get an instance of the iris classification task, we just call the tsk() function with the "iris" string, and you get back the object. Okay. And you can also convert it to a data table; you can basically do this for all objects implemented in the package, I think. We love these tabular representations. So, for example, you can convert the implemented learners to a data table, and then you can start subsetting it, for example by predict type or things like that; there are many more columns here. Okay. So we have tasks now. The next important thing: we want to learn something from the data, so we need learning algorithms. This is basically what learning algorithms do there; they are two-stage, with a training step and a predict step. In the training step, you provide the learner with some data set, right? This is what we call the training data. The learner then learns something about the data, learns parameters and coefficients, and stores something internally, which we call the model. And then, if we have another tabular data structure, which we call the test data, we can provide it to the learner, and what we get from the learner, based on the learned model, is a prediction. So this is the estimate for the target variable, right? To construct a learner, we use the short form lrn() here, for learner, and say we want to have a classification tree from the rpart package. The learner has the ID classif.rpart, and then we can just train it: we have an object of type Learner here, we call the train() method and provide a task. The learner will update itself and store the model internally, and we can also access the model by accessing the model field. And what's in there is the model as returned by the rpart() function. So if we want, we can now leave mlr3 and call plot() on this object here, or something else, summary(), whatever. Learners have hyperparameters; for example, this is the data table of hyperparameters for the decision tree. We have stored information about these hyperparameters: the ID, the type, lower and upper bound, allowed levels, and so on. There's a little bit more in here. This is really handy if you want to start tuning these learners, or if you just want to look something up, say the allowed feasible range or something like that; you can do this by querying the learner directly. Michel, is it okay if I add one sentence? Sure. So I think some people are wondering how to follow along with respect to the code and so on; maybe we didn't make this clear enough at the beginning. The tutorial consists of three parts, because it's mlr3, mlr3tuning, and mlr3pipelines, but each part actually contains two subparts, okay? The first subpart is either Michel or me presenting the PDF, and you're just supposed to listen there; it's much too fast anyway to type this out and follow along, right? But we are going to do nearly exactly the same thing on a concrete use case after the PDF, and these are also linked on the GitHub page, okay? I think we are mainly linking to the rendered HTML.
We'll also show you, before Michel goes to that first use case, how you can download the Rmd directly, okay? I think it might be a little bit hidden, and we should have linked to it better, but we'll show that, and then you can really download it on your machine and follow along if you want to, okay? But for now, just listen, try to understand the rough concepts, and then we're going to do this in the use cases again, kind of a second time, okay? Yeah, thank you. Okay, so you have these hyperparameters in there, and you can also set hyperparameters for a learner by just assigning a named list to the values field of the learner's parameter set. Here, for example, we say the maximum depth for the classification tree is now one, and if we train the learner again, well, it will behave differently, right? It will give a smaller decision tree: a decision tree of depth one, so this is just a stump, a really, really shallow tree with just one split point, right? Yeah, and then to get a prediction, obviously, you have to provide a data set again. For prediction we have some test data here, which is called new data; there are two rows in there, with values for all features, right? And we can then call the method predict_newdata() of the learner on the new data frame, and what we get back is a prediction object, right? And as you can already see here, this is kind of tabular. We have stored the row ID, and we have stored the truth values, which are NA here because they're not in the data, so we don't know them; but we could have opted to also provide them, and then they would have been stored in the prediction too. And here we have the predicted label from the decision tree, for the first row and the second row. We can also say that the learner should predict not only the label, but also probabilities. To do this, we just set the field predict_type to "prob", for probabilities, and then if you call the predict_newdata() method of the learner again, we again get a prediction object, but now with additional information in it: not only the row ID, the truth, and the response, but also the probabilities for class setosa, class versicolor, and class virginica. So, for example, the decision tree is perfectly sure that the first row is setosa, and doesn't know whether the second row is versicolor or virginica; it's 50-50, so it then just sampled the label, I guess. Taking a closer look at the prediction object: you can again use as.data.table() to get a tabular representation of the data stored inside, and there are also fields or active bindings where you can access some of the data directly, for example the response here; and there's some more stuff which you can look up in the manual page.
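A minimal sketch of the learner workflow just described, before we move on to measures (a sketch; the two-row newdata frame is just an arbitrary example):

    library(mlr3)

    task = tsk("iris")
    learner = lrn("classif.rpart")

    # hyperparameters: assign a named list to the values field
    learner$param_set$values = list(maxdepth = 1)

    # also request probabilities, not just labels
    learner$predict_type = "prob"

    # train stores the fitted rpart model inside the learner
    learner$train(task)
    learner$model

    # predict on a new data frame containing the features
    newdata = iris[c(1, 51), 1:4]
    prediction = learner$predict_newdata(newdata)
    as.data.table(prediction)    # row_ids, truth (NA here), response, prob.*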
So now we have trained the learner and got a prediction. Now it's time to score the prediction, to measure how well the learner performed on the data set. This is what is called performance evaluation here. We now have a learner which is already trained, so there's already a model in there, and we have some new data here, and we start by splitting this new data, the test data, into features and target. The features go to the learner, and the learner gives us a prediction, right? And the target column, also called the truth, we then compare with the prediction, using a measure. This could be, for example, a quadratic loss or something like that, and what we get out here is a single number which, depending on the measure, is low if the predictions are really good and high if it's not much better than guessing, or the other way around, depending on the measure. So it's just a comparison of these two vectors: just a function which, given these two vectors, returns a number. To do this in mlr3, we first create a task where we also have the target column. We could also do this with just the data, so we don't necessarily need a task, but on the slides it's done with a task. So we have a learner, and we call predict() instead of predict_newdata(), because we're now providing a task here, and what we get is a prediction with the truth column filled in, not NA but actual values. And then we can call the score() method of the prediction object and provide a measure, the classification error here, and what we get back is a number which says: hey, you have scored half of the predictions correctly and the other half incorrectly. So it's basically just counting how many are incorrect and then dividing by the number of observations in the prediction. If you repeat this, it is then called resampling. For resampling, you first split the complete data you have into two parts: the training set, the upper rows here, and the lower rows are the test set; and the test set you again divide into features and target. The training data goes into the learner, where you again learn a model, and then you pass the features of the test set to the learner, get a prediction, compare it with the ground truth, and get a performance value. This is basically exactly what we did before with the manual train and predict steps. But we usually do this not just once but repeatedly, because we could be lucky: for example, if the observations of this particular test set are really easy to predict, we would get a skewed impression of the prediction performance. So we typically repeat it, measure each time, and then aggregate the performance values we get. That's what we are basically doing: we look at different splits of the data into training and test, where this is the training part and the grayed-out stuff is the test part, we calculate the performance for each of these splits, we get a performance value each time, and then we aggregate them into one aggregated performance. And this is typically what we report for a learner. All of this is done by the resample() function. So to do this in mlr3, we again have to create an object: a resampling object. We use the short-hand form rsmp() here to get a resampling strategy, which just defines how we are planning to split up the data. Here we say we want to do cross-validation with five folds; this is what we've done here, right? And when you have created such a resampling object, you can pass it to the resample() function, together with the task and the learner. What you get back, after some computation, is a resample result object. Again, we can convert it to a data table, and as you can see here, there are now objects stored in the columns: here is the task and there is the learner, the resampling, the iteration, and the prediction, all stored as objects.
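A minimal sketch of scoring and resampling as just described (a sketch, again on the iris task):

    library(mlr3)

    task = tsk("iris")
    learner = lrn("classif.rpart")

    # manual train / predict / score
    learner$train(task)
    prediction = learner$predict(task)
    prediction$score(msr("classif.ce"))

    # the same evaluation, repeated over splits: 5-fold cross-validation
    resampling = rsmp("cv", folds = 5)
    rr = resample(task, learner, resampling)
    as.data.table(rr)            # one row per iteration, objects in columns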
And we have active bindings and fields to get information out of this resample result. For example, you can call the aggregate() method and provide a measure for the performance evaluation. So that's what we've done here, right? If you call aggregate(), it calculates the individual scores and then aggregates them into one aggregated measure; that's what happens internally. You can access the merged prediction over all resampling iterations, and you can also access the individual predictions per resampling iteration; there's a list of prediction objects here, this is the first one, for example. You can also calculate scores per resampling iteration, and so on. The last part is comparing learners, and this is what we call benchmarking. Instead of doing a single resampling, we now repeatedly do resampling with different learners, and we can also repeat this with different tasks. So you get something like this grid here: this is the aggregated performance measure which results from resampling learner one on task one, learner two on task one, learner three on task one; here we have a different task, and the same applies to learners one, two, and three, in an exhaustive grid fashion. This is what we call benchmarking, and doing it is really easy in mlr3 as well. Here we load the mlr3learners package to get some more learners populated into the dictionary, because in this example we want to use the k-NN learner to compare with the rpart learner. And here we have a list of tasks: the iris data set, the sonar data set, and the wine data set; these are just some small example data sets. And then we build this design, a grid design: we say we want to apply each learner to each task using a five-fold cross-validation. This design is then passed to the benchmark() function, which does all the computation. And what we get back is again an object, a benchmark result in this case. You can call the aggregate() function and get back a data table, and, well, this is the important part: here we access the columns task ID, learner ID, and the performance measure, and we see, for example, for the task ID iris, we've trained the learner classif.rpart and got a classification error of 0.06. And we could now say: hey, k-NN outperforms the decision tree over all tasks here, for example, just by comparing these numbers. A closer look at the benchmark result; it's always the same: you can convert it to a data table, and you have active bindings and functions to easily get the information out of it. And we also have a visualization package, which is called mlr3viz. If you load it, you can call the autoplot() function on a benchmark result and get a nice summary, so you don't have to compare the numbers by looking at them: it just generates some box plots here, and you can draw your conclusions from those box plots. One more technical aspect of mlr3 is the control of execution. Before you start resampling with the resample() function or benchmarking with the benchmark() function, you can choose a backend which defines how the learners are calculated, meaning, in a way, how they are parallelized. We are using the future package for parallelization. So before you kick off your benchmarking with the benchmark() function, you put this line here first, and then the benchmark function will run on all your local cores, right? You're just choosing a backend here for parallelization; everything else is handled internally. So this is all you have to do.
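A minimal sketch of the benchmarking and parallelization just described (a sketch; it assumes the kknn and future packages are installed):

    library(mlr3)
    library(mlr3learners)

    # choose a parallelization backend before kicking off the computation
    future::plan("multisession")

    tasks = list(tsk("iris"), tsk("sonar"), tsk("wine"))
    learners = list(lrn("classif.rpart"), lrn("classif.kknn"))

    # exhaustive grid: each learner on each task with 5-fold CV
    design = benchmark_grid(tasks, learners, list(rsmp("cv", folds = 5)))
    bmr = benchmark(design)

    # aggregated scores per task/learner combination
    bmr$aggregate(msr("classif.ce"))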
If you are doing larger, more comprehensive benchmark studies, you will usually encounter some problems, because some learners will crash for certain hyperparameter settings, or even segfault. This is always pretty inconvenient. So we support what we call encapsulation of learners. You can say: when this learner is trained, do it in a separate R session, for example, or in a special protected environment. For example, if you say here that this learner should encapsulate the training step using the callr package, and the predict step also using the callr package, then before the learner is trained, a separate R session is started in which the learner is trained, and the result is communicated back to the master. So even if the learner segfaults, this won't tear down your R session, and you can continue computing. This is really convenient for larger studies. Also, logs are captured, and you have the possibility to fall back to other learners, for example simple learners which don't segfault, just to get predictions and do something statistically sound in the aggregation of the results. Some notes on how to get help: okay, check these slides. I also already mentioned the book. If you have an mlr3 object and you are unsure how to get help, you can always ask for the class. Here, for example, this object bmr is of class BenchmarkResult (and R6); then look up BenchmarkResult and you will usually find something. Another, newer way to get to the respective manual pages is to just call the help() method, and something will pop up; but this is not yet implemented for all objects, we just started doing this. So these are the most important parts. For data, you have the task objects for classification and regression, and you can access predefined tasks with the tsk() function. Same for the learning algorithms: you can use the lrn() function to get a learner, you can call train() and predict(), and you will get a prediction. For resampling it's rsmp(), for measures msr(); the resample() function, of course, will give you a resample result, and the most important method there is aggregate(). And for performance comparison over multiple learners and multiple tasks, you use the benchmark() function, which will give you a benchmark result. As Bernd already said, I will shortly start showing a use case on the German credit data. But before that, here's another slide so you get an impression of what is already there and what's still in the planning. This is our current ecosystem: in the middle is the mlr3 package, and, where do I start, for example to access or handle data we have a package here to communicate with database backends, for example MySQL servers or something like that. We have a connector to the OpenML database. Here's the visualization package, and right here are learners. We have packages for feature selection here, a package for tuning, which Bernd will be talking about in a few minutes, and mlr3pipelines, which will be the last package for today. And for some special tasks, for example ordinal regression or clustering or survival analysis, we also have some packages already. Okay. Any questions on Slido already, Bernd? Should I proceed? There was a little bit of discussion going on in the Zoom chat, with mainly me posting links. And I watched Slido, but I didn't see anything there. I hope it works for everyone.
And because of your good presentation, there are just no questions. I can see a couple of questions at the moment. Should I read them out to you? Well, actually, I didn't see them before; let me check. I can see them now, sorry. Okay, great. Michel, can you see them, or should I? Okay, I'll answer the middle one first. There's one question: what is encapsulation? Michel already tried to explain this. So if you run a large-scale benchmark, you have, I don't know, 20 different learning algorithms and a couple of data sets. Some of these implementations might be faulty, or you just might run into edge cases. So what very often happens is one of two things: either you sometimes get a segfault, if you're very unlucky, or, even worse, the whole R process runs for an infinite amount of time, or it just runs for hours and hours and never stops. If you're unlucky, you run into one of these numerical edge cases, and that happens even more often if you later do things like tuning and AutoML. So what encapsulation means is that we run this computation, the benchmark or resampling experiment, in a fresh R session. Which means: if that goes down, it only tears down this new session. It doesn't tear down the master session, because in the master session all of the other experiments are also being run, right? And if one experiment goes down, you don't want that to destroy all of the other results that you are computing. I mean, there's nothing worse than setting up an experiment in the evening, starting it for 20 hours, then you come back to your machine or to your cluster, and everything is gone because it stopped after 30 minutes due to one faulty process. So encapsulation is exactly this: spawning a fresh new R session, and we do this by using the callr package. Have a look into it if you want to learn more details; it's not from us, we are basically just reusing it, as we reuse future, and it's nicely combinable with the future package. What else? Conceptual question: does mlr3 have an ordinal SVM? The answer is currently no, but we are working on an ordinal package which would include this. So check back in maybe three to six months, and then maybe yes. There is intermediate code, so to speak, to work with ordinal tasks; at the moment it's not officially supported, because we're not done with it yet. Next question: 'I couldn't follow. Could you please clarify the commands learner$predict(task) and learner$predict_newdata(newdata)? Is the latter for validation with a new data set?' That's a good question, and it's actually very simple. These two predict functions are both there to predict on new data. The predict(task) function predicts on data which is already included in the task. Very often, that's the situation if you do model selection or a scientific experiment: you have all of the data in your hands already. But later on, in applied work, you create your model at one point in time, and then, I don't know, three months later new data comes along, and it usually comes along in tabular format, so in R you have it in a data frame or a data table. And to be able to predict on that as well, we have this predict_newdata() function. It's basically just about what format the data is in, and we don't want to force you to merge your new data into the task, right?
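A minimal sketch tying those answers together (a sketch using the field-based API from around the time of this talk; newer mlr3 versions configure encapsulation and fallback slightly differently):

    library(mlr3)

    learner = lrn("classif.rpart")

    # encapsulation: run train and predict in a fresh R session via callr,
    # so a segfault cannot tear down the master session
    learner$encapsulate = c(train = "callr", predict = "callr")

    # fallback: a simple learner that practically never fails
    learner$fallback = lrn("classif.featureless")

    task = tsk("iris")
    learner$train(task)

    # predict on rows that are already part of the task ...
    learner$predict(task, row_ids = 1:10)

    # ... or on new tabular data that arrives later
    learner$predict_newdata(iris[1:2, 1:4])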
That's it. I hope that answers all three questions. Okay, on to the use case. We have the mlr3 gallery, which you can also find linked on the project home page, or on the GitHub page for this tutorial, where we just collect use cases. And we have three use cases on the German credit data set: one basic use case, one for tuning, and one for pipelines. This one here basically sums up most of the things you can do with mlr3. You can read it at your leisure, and there are many explanations in there. Unfortunately, I don't have the time to read it out to you, and I don't think that would help anyway. So I will just try to do this here interactively, so that you get an impression of how you are supposed to work with mlr3, right? So first you need to load some packages; I load the mlr3 package, and the next thing would be to load your data. The German credit data set is a binary classification task where you have to predict whether someone should get a credit or not, that is, whether this is a good or bad credit risk, and we have some 20-odd personal, demographic, and financial features here. I don't want to go into detail; this talk is not about the data set, and you can look it all up here. But it's stuff like the credit history, the job, things like that, right? The data is from the rchallenge package. But it is also an example task, so we could also get it by just calling the short name with the tsk() function here, "german_credit", right? And we would get the task that way too; the manual construction is skipped here. So we have 1,000 observations with 21 columns: 20 features and one target column. There are 14 factor variables in here, three integers, three ordered factors, right? And what we are doing here in the gallery is an exploratory data analysis, basically calling skim() from the skimr package on the data. Okay, it's not installed here. So, this just gives you a summary of the factor variables and so on. The next step would be to use DataExplorer to get some plots. Again, it's not really the aim here to do this analysis, but you might draw some conclusions: for example, that you have skewed distributions, that you have missing values, and that you have empty or rare factor levels. This is something you might have to react to, or account for, during modeling. But we will just go to the next step and start with the modeling. We already have the task, right? I cheated a little bit and just constructed it using the tsk() function. The next step would be to create a learner to actually learn something. These are the available learners, not that many, but if I load the mlr3learners package beforehand, you will see that all these learners are now available for me to use. And I can also convert this to a data table. Okay, that doesn't work that well with this screen resolution. Yeah, but you get an impression of how this is meant to be used. So here's the example: we are using logistic regression, right? So I'm doing the same; this learner is called classif.log_reg, right? And now we want to train the logistic regression on the task. You just call the train() method and provide the task, and you get nothing back, because the learner internally updated its model slot. So we can just ask: hey, what is the stored model? And there is the output of the logistic regression.
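A minimal sketch of these first use-case steps (a sketch; "german_credit" is the example task shipped with mlr3):

    library(mlr3)
    library(mlr3learners)

    # the example task spares us the manual TaskClassif$new() construction
    task = tsk("german_credit")
    task                 # 1000 observations, 20 features, binary target

    # logistic regression from mlr3learners
    learner = lrn("classif.log_reg")
    learner$train(task)
    learner$model        # the fitted glm object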
We usually don't want to do this on the complete data we have at hand; we want to divide it into a training and a test set. So this is what we are doing here: we sample some of the row IDs, here 80% of them, for the training set; the row IDs are just integer numbers, and we sample 80% of them. And then we say the test set is all the row IDs which have not been used in the training step; we just use the setdiff() function here, right? Now we can train the learner again and say: hey, only train on these IDs. And now we have a different model; you can't see it here, but this one is trained only on the training data. We can now predict on the test set by just saying: hey, learner, predict. We provide the task and say on which row IDs of the task we want to predict, and that's the test set. And we get back a prediction object. So this is all also covered here, plus some more commands. So we have now predicted with this logistic regression, and maybe let's just go through here for now, to get an impression of what is possible with the prediction object. And if we don't provide a measure to the score() function here, the default measure for classification is used, which is the classification error. If we don't want the classification error, but something fancier, let's say the area under the curve, we have to do something special beforehand: we have to tell the learner that we want probabilities as the predict type. So now I've constructed a learner which will give me probabilities. I trained it again, which is necessary, and now the prediction object also includes probabilities, and I can call score() with the AUC measure. And now I have the area under the curve here. Usually it's a good idea to compare against something different, right? For example a non-linear model, because we already have a linear model here; so, for example, against a random forest. First I construct the random forest, again using the short lrn() function here; I like the ranger package. And I will also tell it to predict probabilities. Then I can train it, copying some lines from above: the random forest learner's train() on the task with the same training set I used before, and then the prediction, also with the random forest learner, on the task with the same test set as before. And now I can also calculate the AUC, and it's 0.8, where it was 0.76 before. So, a little bit better. What I've now done is something like manual benchmarking, right? On a single split into training and test. The next step would be to do a proper benchmark on this data set, and this is what we want to do now. So I need to create a benchmark design with this benchmark_grid() function, and as you can see here, it needs a list of tasks, a list of learners, and a list of resamplings. Well, I have a single task, so I will just provide that here. The list of learners would be: logistic regression and random forest. And for the list of resamplings, well, why not cross-validation with three folds. I will also assign this here; this one is named design, so you can have a look; it's just a data table. And we can then call the benchmark() function, providing the design.
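A minimal sketch of the split, the AUC comparison, and the benchmark setup just walked through (a sketch; it assumes the ranger package is installed, and the seed is only there for reproducibility):

    library(mlr3)
    library(mlr3learners)

    set.seed(42)
    task = tsk("german_credit")

    # 80/20 split over the integer row IDs
    train_set = sample(task$nrow, 0.8 * task$nrow)
    test_set  = setdiff(seq_len(task$nrow), train_set)

    # probabilities are required for the AUC
    log_reg = lrn("classif.log_reg", predict_type = "prob")
    log_reg$train(task, row_ids = train_set)
    log_reg$predict(task, row_ids = test_set)$score(msr("classif.auc"))

    rf = lrn("classif.ranger", predict_type = "prob")
    rf$train(task, row_ids = train_set)
    rf$predict(task, row_ids = test_set)$score(msr("classif.auc"))

    # the proper way: a benchmark design with 3-fold CV per learner
    design = benchmark_grid(list(task), list(log_reg, rf),
                            list(rsmp("cv", folds = 3)))
    bmr = benchmark(design)
    bmr$aggregate(msr("classif.auc"))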
We get back the benchmark result. Maybe I can add one sentence; is that okay? Sure. This might be confusing to some people; I think you explained this before, but I just want to emphasize it a little more, also because we did this differently in mlr2. I guess some of you might wonder why there are actually two function calls, right? Why is there this benchmark_grid() and then this benchmark() call? The reason is that the input to benchmark() is this data table, which describes what should be computed. And very often you want every algorithm on every data set, basically with every resampling type; very often there's just one, I don't know, 10-fold cross-validation, but every learner on every data set. But what this API design allows is to also deviate from this: you can design very specifically what you want to compute on what, and control this by setting up this data table of experiments yourself. And we also exploit this a lot later on in tuning, in our AutoML packages. It really allows you to completely control what you want to benchmark and compute. That's the reason. And then I guess many people will very conveniently just call this benchmark_grid() function first, which computes the complete cross product of everything. Just to explain this, sorry. So you can also do this yourself: just define a data table here, and you can swap out some objects, but this would be the same design. Yeah, benchmark_grid() is basically an abbreviation of what Michel just showed there, yeah. Yeah. So the first row just means a resampling where you apply the first learner you provided on the first task you listed, using the first resampling, and the second row is another resampling defined the same way, right? So the benchmark has already executed, but you could also set a future plan; say you want to do this on multiple cores, this is all you have to do, and then call benchmark() again. Okay, the columns were not named correctly in my manual design. Okay. So if you run this in a distributed way, you don't get the nice log output, but everything else runs exactly the same way. To be fair, my computer is already under pretty heavy load because of Zoom and screen sharing, so I'll just cancel it here. So, I hope I still have the object from before. Yeah. So this is the benchmark result, and we can calculate the aggregated performance with aggregate() and say: hey, let's use the area under the curve again. And this is the output, right? The task german_credit for both experiments, with the learner IDs for logistic regression and ranger, and here the resulting scores. And as you can see, the random forest, ranger here, is a little bit better than the logistic regression. We could also create a plot of this again, using the mlr3viz package, and just call autoplot() on the benchmark result, right? And now compare the medians. Okay, there's much more in here, but I won't cover it, because we still have two other packages to come. And if there are not that many new questions on Slido... I did answer one or two extra ones. I think there are new ones, and I think there's nothing in the Zoom chat. Also remember that you can vote questions up, right? At the moment it wasn't really necessary, so I just answered everything, in random order. Okay.
Yeah, I guess then Bernd can take over with some slides on tuning, and then we will have a 15-minute break, right? Yes. Okay. Yes, I guess. Yeah, I'll stop my screen share. Yeah, we also started a little bit later, so I'll try to be a bit quicker. I'll try to screen share now. Let me remove this. Michel, you can see my screen, right? Yes. Everyone? Okay, good. I'll start now. So this concludes the first part, and I'll continue with the second part, which is on hyperparameter tuning with mlr3tuning. Again, there will be a first part covering the abstract concepts behind the package and some code examples on the PDF slides, and then we'll go through a second part, the German credit tuning use case from the mlr3 gallery, where I will basically redo the same thing on a concrete data set, with some more interesting outputs and results. What else should I say? I'm not sure how much we already covered the cheat sheets. If you feel a little bit overwhelmed later on by all of these new commands, really print out the cheat sheets: there's a cheat sheet for mlr3, a good one for mlr3tuning, and a pretty good one already for mlr3pipelines as well. Really print them out later if you want to see everything at a glance on one or two pages. Okay, one final organizational comment: I had some problems with the layout of the PDF, which forced me to hot-fix it, I think right in the first couple of minutes while Michel presented. I didn't change any content, but the last two or three slides now have a better layout, so you might want to download the PDF again; nothing else was changed. I pushed it to our GitHub repo, I don't know, 40 minutes ago or something like that. Okay. So, as most of you will know, many of the non-parametric, non-linear machine learning algorithms that we like so much, like support vector machines, boosting, neural networks and so on, depend a lot in their behavior on so-called hyperparameters: control parameters that are not learned during the training part of the learner, but which are basically an input to the training procedure. We as users have to set them, and very often we have no idea how to do that. And of course, we want to select these parameters so that our algorithms work well in terms of predictive performance. Good hyperparameters are very often data-dependent, so good defaults are very hard to come by, and this is the reason why hyperparameter tuning, or, as we often call it nowadays, AutoML, or we could also call it black-box optimization, has become such an important, hot topic in machine learning. What we are basically doing there is trying out different configurations, different settings of the learning algorithm, cross-validating the algorithm again and again on the data set at hand with respect to the performance metric we are interested in, and then trying to sequentially improve upon our results. We can do this in a very unstructured manner through grid search or random search, or we can do it in a more intelligent manner by using true black-box optimization techniques like evolutionary algorithms, racing, or Bayesian optimization and so on.
And mlr3tuning is there as an infrastructure package to combine all these methods, allow you to use them in any way you like, and connect them to mlr3. So the package is simply called mlr3tuning. We restructured this a little bit some time ago, and there's now also an infrastructure package in the background, which is called bbotk, for black-box optimization toolkit. I'll not really explain which functions live in which package; it doesn't really matter for understanding, especially for this tutorial, so just load the two. The reason why we did this is that we want to reuse some of the infrastructure in other parts as well, so we abstracted it away a little bit. So how does tuning work, from a conceptual point of view? First of all, we have a parameter space, and what our tuner basically does is, in a continuous loop, suggest hyperparameter configurations, so certain settings; in this case there might be two parameters, parameter one and parameter two. And while this tuning loop goes on, we take a suggested configuration and evaluate its performance using the resampling or benchmarking procedure that Michel already demonstrated; we call directly into this, and all of the result objects that we're going to create in mlr3tuning will look exactly the same as in Michel's mlr3 demonstration, maybe with a little bit of extra information. So you don't really have to learn too many new container objects, or at least not too many, nor new ways to access the data, which I think is quite good; that was also a lot worse in mlr2. So we evaluate the performance, which is now this blue X here, and of course we now want to feed that back into the tuning algorithm. The tuning algorithm hopefully learns something from this evaluation and suggests a different configuration; we try that out again through cross-validation, we get a performance value back, and so on and so on. We can also do this in batches: some of these tuning algorithms can, or want to, evaluate multiple points in parallel, because that obviously also enables nice parallelization during the tuning. Think, for example, of an evolutionary algorithm, something that's population-based: there you could evaluate the whole population in parallel. So what mlr3tuning assumes is that we always evaluate in batches, where batches can be of size one, and then all of the evaluated performance metrics are fed back into the tuning algorithm, and the tuning algorithm iterates. And of course, at some point we have to stop, so we also need a termination criterion; in mlr3tuning this is called a Terminator, also an object, that describes when you want to end.
So in order for us to evaluate performance, we need all of these objects here, which you already know from Michel's talk: of course there's a learning algorithm involved, of course there's a task involved, of course there's a resampling procedure involved, think cross-validation, and of course there's usually a single metric involved that you want to optimize for. And obviously we need to extend this a little bit to really define the tuning scenario. We also need to define our search space: what do we want to optimize over? We call this the search space, and I'll show how it is defined later on; this is something new. And we need an object that describes when we want to terminate; well, let's just call it the Terminator, that's a pretty simple thing. We now need to tie all of this together in mlr3tuning, and so we create something which is called the tuning instance; that's kind of the major object. The tuning instance is basically a bundling object that takes the task, the learning algorithm, the resampling procedure, the measure, the search space, and the terminator object, and bundles them together. You could call this the tuning scenario, or you could also call it, I don't know, the black-box function that defines how you map a certain configuration to an outcome value, a performance value, okay? And then there's also the tuner, and the tuner has an algorithm that acts on the scenario and iterates, iterates, iterates, until it has produced a final configuration which is good or optimal in a certain sense. And like I said, these four objects here Michel discussed already at length, so I'll now discuss these two here a little bit more, and of course these two as well. So, first of all, the search space. The search space describes what parameters we want to optimize over, in what ranges, and what data types these parameters have, okay? Our basic object for this is called a parameter set, which just bundles different parameters together. You have seen this before, because every learner also contains a description of what hyperparameters it has inside of it; here we use it in an alternative manner, a second time, to describe our search space for tuning. The parameter set is constructed by simply passing in a list of parameters, or parameter descriptions, and there are a couple of constructors for different data types: there's ParamDbl for doubles, there's a constructor for integer parameters, one for factor variables, which we also call categoricals, one for logicals, booleans so to speak, and ParamUty for 'untyped'. That one is pretty useless for tuning, because tuning usually needs to know what type a parameter is and what its ranges and constraints look like. For numerical parameters we usually have a lower and an upper bound, so the range you tune inside of; most tuning algorithms require knowing this. And of course, for categorical parameters we need to know the factor levels we want to tune over. Well, for logicals it's very simple, it's always true or false, so usually you don't need any extra information, and in a certain sense you might not need this constructor at all, because you could have also done it with the factor parameter; I guess it only exists for convenience and completeness' sake. And if you want to see how this usually looks, you have to load this extra package here, which contains the description language for the parameters and which is called paradox.
So you describe your search space by always saying ParamSet$new, and then you create a list and pass in description objects for all of the parameters you want to tune over, with all of their respective ranges. Next, we have to define the termination condition, so we create a Terminator object from the Terminator class. I'll go over this super quickly, because it's very, very simple. You have this trm() function here, which I might pronounce as the 'term' function; it's again syntactic sugar to create such an object. It's again a dictionary, and if you list the dictionary you can see all of the terminator objects that exist. So you can, for example, say: I want to stop after 50 evaluations. You can also stop with respect to clock time: you might want to stop at, I don't know, 8 p.m. in the evening, or you might want to say: I want to stop after 50 minutes of computation, or you want to stop after you have reached an accuracy of 0.95. It's all very simple; you just take the correct object. Or maybe you want to terminate when you have stagnated, when performance has not increased a lot. And it's also possible to combine multiple of these, for example: either after one hour, or when I have stagnated with respect to performance development; for that you can use this meta-terminator 'combo' here, where you can combine with 'and' or 'or' constructions. And it's as easy as just writing something like trm("evals", n_evals = 20), and then you stop after 20 evaluations. The next thing is that you have to choose a tuning method. For that you need the Tuner class from mlr3tuning. That's also quite simple: again there's a dictionary, and again there's a little sugar function, called tnr(), to create from it, and here you can list the tuning algorithms which are currently implemented. You can do something like a grid search, you can do a random search, you can run a simple, actually not simple, actually quite complicated, simulated annealing algorithm here, GenSA; there's also nlopt in there; or you can even manually specify exactly at which design points you want to evaluate. Of course, some of you might be missing some more advanced algorithms in this list. We already have a nice version of Hyperband on GitHub, and we'll very soon create a new version of mlrMBO, for model-based optimization, so Bayesian optimization; we're currently working on that, give us maybe one or two months until we have a prototype we are comfortable publishing, and then I think you'll have some pretty efficient techniques at your fingertips. So you load the tuner, which also loads all of the underlying packages you need, and you might have to set some control parameters. For example, we can create the grid search here and set the resolution parameter to three. Resolution means that for every numerical parameter in a grid search, we discretize it to that many values. So if I take my k from the k-NN and I want to optimize in the range from 1 to 20, resolution three means that I'm evaluating k = 1, k = 10 in the middle, and k = 20 at the upper end of the spectrum. And you can also see here some properties: what the tuners can do, what parameter classes they can work on, the batch size settings, and of course the strategy parameters that we have set here.
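A minimal sketch of these three building blocks (a sketch in the R6-style paradox API used at the time of this talk; newer paradox versions offer a shorter ps()/p_int() syntax):

    library(paradox)
    library(mlr3tuning)   # also loads bbotk

    # search space: tune k of a k-NN learner between 1 and 20
    search_space = ParamSet$new(list(
      ParamInt$new("k", lower = 1, upper = 20)
    ))

    # terminator: stop after 20 evaluations
    terminator = trm("evals", n_evals = 20)

    # tuner: grid search, each numeric parameter discretized to 3 values
    tuner = tnr("grid_search", resolution = 3)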
as well as, of course, the strategy parameters we have set. Except for that it's not very complicated: you basically just construct the thing, and that's it. There is usually this batch_size parameter, which is important for parallelization; I don't want to go into details here because that's a bit technical. Then you tie all of this together. First you create the tuning instance — basically one long call where you pass in all of the objects I've explained. Because I'm using grid search, and grid search always has a finite number of points that you usually want to search exhaustively, we can use the terminator "none" here: there is no specific termination setting, we just finish when the algorithm thinks it's finished. After we have created this instance — the tuning scenario, so to speak — we just call tuner$optimize() on it. That gives back, and also stores in the object, an archive of all evaluations, and it returns a data.table that contains the optimal settings found during the tuning process. The output looks a little complicated, so let me explain at least the middle part. The first columns are simply the parameters you are optimizing over — these are always scalar parameters, because paradox doesn't allow anything else — so they are regular columns in the data.table that you can easily access and work with. The last column is the performance metric associated with the optimal configuration that was found. You also have two different representations of this optimal configuration. Later on I will show you parameter transformations, where you can transform your parameters onto a log scale and so on; the transformed values are stored in a list column, possibly together with constant settings of the learner. What is in that list is what is actually evaluated on the learner — the transformed parameters plus any constant settings — while the plain columns are what the tuner, the optimizer, is actually searching over. If you don't understand this now, it doesn't really matter: in simple examples these two representations completely coincide, and we have technical information online that makes this clearer. Here I'm doing exactly what I showed before, just with a few more evaluations: I run my grid search with a resolution of 20, I create my tuning instance, I optimize it with the grid search, and I print out the results. You can also see how nicely you can access the instance after the tuning has run: there is the archive slot, which you can turn into a data.table. The archive contains every evaluation — each parameter setting with its associated performance metric. It looks exactly like the result table, just with more rows; every evaluation the grid search (or any other tuning algorithm) performed is in there. And because it's a data.table, we can now nicely
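Tied together, the whole loop looks roughly like this sketch; TuningInstanceSingleCrit is the class name in mlr3tuning 0.2+, while earlier versions used TuningInstance$new():

```r
library(mlr3)
library(mlr3learners)   # provides classif.kknn (needs the kknn package)

instance = TuningInstanceSingleCrit$new(
  task         = tsk("iris"),
  learner      = lrn("classif.kknn"),
  resampling   = rsmp("holdout"),
  measure      = msr("classif.ce"),
  search_space = search_space,    # the ParamSet from above
  terminator   = trm("none")      # grid search is finite, just let it finish
)

tuner = tnr("grid_search", resolution = 20)
tuner$optimize(instance)          # runs the tuning, returns the best row

instance$result                   # k | x_domain (list) | classif.ce
as.data.table(instance$archive)   # one row per evaluated configuration
```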
plot it. Here I'm plotting k against the associated classification error, so you can see which values of k work well and which work less well — the optimal setting on this data would be something like 13 or 14 for k. It's more or less a toy experiment, so it's not that interesting. There's a recap slide which I'll skip, because it contains nothing new — just the code on one page; it demonstrates how simple tuning is. What's not on it is the parameter set definition, which should probably be there: instance, tuning algorithm, optimization — that's it. Do I have a few minutes left? Then a couple of words on parameter transformations. Sometimes we don't want to optimize over an evenly spaced range of parameters, as I did before. As a motivation: k = 2 is twice as large as k = 1, so there is a big difference between the two and it's interesting to evaluate both — but do we really care about the difference between k = 101 and k = 102? There are a lot of parameters you want to optimize on a different scale. Other standard examples are regularization parameters — the C parameter of a support vector machine, or a lambda parameter in regularized statistical models — where you want to try values very close to zero as well as values very far away from zero. The usual trick is to optimize them on a log scale, and you can do this in a very general manner in mlr3: you define your parameter set and then attach a so-called transformation function to it. What I'm doing here is creating a kind of temporary parameter for k before the trafo — I've also just called it k; it doesn't really matter, it just makes this more explicit — and optimizing it from log(1) to log(50). The transformation function then applies exp() and rounds back to an integer. The effect is that we evaluate many values close to 1, with larger gaps between larger values of k. Another very common pattern, which you may have seen for the support vector machine: if you optimize the C parameter or a kernel width, you often optimize from -10 to +10, but what you actually mean is 2^-10 to 2^10 — from very small values close to zero up to very large values, with growing gaps. In that case you would write a ParamDbl going from -10 to 10, and your transformation function would just be C = 2^x. The nice thing about this design is that you can do any transformation you want: you can create artificial spaces for the tuner to act on, and the trafo converts them into the space the learning algorithm is actually run on. I guess that's more of an expert option for people who are a bit more experienced with hyperparameter
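A sketch of the two transformation patterns just described — the trafo receives the full list of proposed parameter values and must return a list the learning algorithm accepts:

```r
# k on a log scale: the tuner searches [log(1), log(50)],
# the trafo maps each proposal back to an integer k
search_space = ParamSet$new(list(
  ParamDbl$new("k", lower = log(1), upper = log(50))
))
search_space$trafo = function(x, param_set) {
  x$k = max(1L, as.integer(round(exp(x$k))))
  x
}

# the classic SVM pattern: search cost on [-10, 10], evaluate 2^x
search_space_svm = ParamSet$new(list(
  ParamDbl$new("cost", lower = -10, upper = 10)
))
search_space_svm$trafo = function(x, param_set) {
  x$cost = 2^x$cost
  x
}
```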
optimization. Another nice thing: the trafo acts on the complete parameter set. It's not a per-parameter setting; it's an R function that receives the complete configuration — the x here is a list of all parameter settings of the parameter set — so you can even do multivariate transformations instead of working parameter by parameter; you can map the complete configuration onto anything else. The only requirement — the contract, so to speak — is that x always comes from this parameter set, and that the list you compute must be acceptable by the learning algorithm. So what is our transformation doing? You can see how the tuner thinks it optimizes on this space, while in reality we are evaluating on that space — which is exactly what the log-scale construction does, as I explained. You call this in the same way as before; there's not a big difference, and we can plot it again. I'm now plotting the k before the transformation, which might be a bit confusing: it's this numerical, double scale going from 0 to about 4, while in reality we are optimizing k from 1 to 50. This is also why the x_domain is interesting: the plain columns contain what the tuner acts on, before the transformation, while x_domain contains the configurations after the transformation — which I guess are more interesting for later analysis — and I've shown both here. If you dislike that these parameters are hidden down in a list in the archive: many of the data.tables we supply with mlr3 have unnest functions or unnest options that pull everything out of the list and create separate scalar columns you can access easily, which is pretty convenient. Finally — no hyperparameter tuning talk is complete without nested resampling. I don't want to go into the theory much; I assume most of you know that we need nested cross-validation, or a train/validation/test setup, to estimate a learning algorithm's performance in an unbiased manner when we do hyperparameter tuning. If you tune on the same cross-validation you use for performance estimation, you get optimistically biased results, and we shouldn't do that. To achieve simple nested resampling in mlr3, we create what is called an AutoTuner object. The AutoTuner is basically a two-step procedure: the first step runs the tuning completely, until termination, and computes the optimal configuration; the second step trains the model on the passed-in dataset with this optimal configuration. And then, in a software-engineering manner, we tie a box around this and create a coupling
mechanism that runs the tuning first and then the learning algorithm — and this is what is called the AutoTuner. The nice thing is: if you cross-validate this object, you have an outer cross-validation around it and an inner cross-validation inside the tuning, and this automatically does everything correctly. You cannot cheat anymore, because in the outer loop fresh test data is always used. If you find this confusing and you don't know nested resampling, really pick up a good book on machine learning evaluation — I've published papers on this as well — and look it up, because it's pretty important for practical work and proper evaluation; many people have made mistakes here. The nice thing about mlr3 in software terms is that this is super simple. You create an AutoTuner, which takes your learning algorithm (you can even pass in some constant settings), an inner resampling that is used during the hyperparameter tuning, the metric you want to tune on, your search space, your terminator — the termination criterion — and your tuning algorithm. All of these objects you have seen before; they essentially define the tuning instance, except for the task, which we don't pass in here, because this creates an abstract learner that can be trained on anything; plus the tuner — and we just tie them all together in the AutoTuner. And what you can now do: this AutoTuner simply is an mlr3 learning algorithm. It is completely connected to what Michel has shown you, so you can run every method Michel taught you on this thing. For example, you can train it, and training runs this whole complex process: the tuning first, computing the optimal configuration, then fitting the model on the task with that configuration. I can then inspect the optimal settings that were computed, and with a somewhat lengthy call — I guess we should create an active binding for this — you can access the complete archive, so everything that led to this optimal configuration while tuning ran inside the AutoTuner. Or you can resample the whole thing: resampling the AutoTuner on iris with an outer resampling, storing everything that is computed, now really computes nested resampling. It gives you access to all of the optimal configurations and all of the archives of all outer loops, and so on. Some of these calls might be a bit lengthy in the beginning, or hard to remember, but this is all covered on the cheat sheets: take a look at the book to understand the underlying principles, or re-watch what I did here, and if you just want to remember what the API looks like, have a quick look at the mlr3tuning cheat sheet — it's one page, and it's all in there; the last parts of the sheet explain again how to access all of these results. And the performance metrics computed by this resample call are guaranteed to be unbiased and proper. Let me check how I'm doing in terms of time — a little bit slow, but nearly okay.
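A compact sketch of the AutoTuner pattern just described; slot names such as $tuning_result may vary slightly between mlr3tuning versions:

```r
at = AutoTuner$new(
  learner      = lrn("classif.kknn"),
  resampling   = rsmp("cv", folds = 3),   # inner resampling, used only for tuning
  measure      = msr("classif.ce"),
  search_space = search_space,
  terminator   = trm("none"),
  tuner        = tnr("grid_search", resolution = 20)
)

# the AutoTuner is just another mlr3 learner:
at$train(tsk("iris"))      # runs the tuning, then refits with the best config
at$tuning_result           # the optimal configuration found by the inner loop

# nested resampling: wrap an *outer* CV around the tuned learner
rr = resample(tsk("iris"), at, rsmp("cv", folds = 3), store_models = TRUE)
rr$aggregate(msr("classif.ce"))   # unbiased performance estimate
```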
Michel, did you answer some questions already — are there some I should take? — Yes, I answered some by replying to them. — I would suggest we answer all of them now and then go into the break. — Okay. So the first question: is there a rule of thumb for the tuning time, or the tuning budget? — That's actually a very difficult question, so: not really. We also have a hard time coming up with perfect termination criteria. I guess many people set this to however long they are okay with waiting, and it also depends on the search space and on how efficient your tuning technique is. I could give you rules of thumb for some more advanced algorithms which we didn't cover here, but that's a bit out of scope. So it really depends on your preference. In practical terms I would think about how much time I'm actually willing to wait, and then combine that with the stagnation criterion: for example, I'd be okay with 20 hours of tuning, but if it stagnates, please stop earlier. That would be my answer; the short answer is "it depends". — Another question, maybe more on the mlr3 part: what do we do if we want a custom implementation of some algorithm? — Do you mean the tuning algorithm? — I'm not sure; maybe a learner. — Maybe you can open the book, chapter six. — Yes, I will. — Let me start with the learning algorithm. I won't go through this in detail because it's a bit technical, but I'll keep it short by saying: it's simple. Have a look at chapter six, which is about extending; it explains how you can add your own learning algorithms, and there's a template piece of code — just use that. Also, talk with us, because we have infrastructure in place to test all of these things very nicely on GitHub and run standard tests on them, which actually helps a lot with discovering flaws and bugs in the underlying implementation — not the connector code, but the algorithm itself. The chapter also explains how to add new measures, which is even simpler: that's basically one or two lines that really look like the mathematical formula of the measure. The third thing, pipe operators — how to add operators for preprocessing, feature extraction and so on — refers to the third part of the tutorial. We haven't covered how to add new tuning algorithms; I guess we should put that into the book, but it's also not very difficult. You can just look at what we do on GitHub: go to mlr3tuning, go to the R code, and have a look at, say, GenSA — well, maybe not the best example, that part was just refactored; in that case it's probably easiest to open an issue on GitHub and request it. But just to show that this is not hard — we have implemented these here so you can use them in an even more general sense — this is how the code looks to connect an optimizer that already exists: about five lines.
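To illustrate the claim that a custom measure is essentially just the mathematical formula, here is a hypothetical accuracy clone following the R6 extension mechanism from the book's "Extending" chapter — class name and id are made up for this example:

```r
library(mlr3)

MeasureClassifMyAcc = R6::R6Class("MeasureClassifMyAcc",
  inherit = MeasureClassif,
  public = list(
    initialize = function() {
      super$initialize(
        id = "classif.myacc",
        range = c(0, 1),
        minimize = FALSE,
        predict_type = "response"
      )
    }
  ),
  private = list(
    # this is the part that "looks like the mathematical formula"
    .score = function(prediction, ...) {
      mean(prediction$truth == prediction$response)
    }
  )
)

# register it so msr("classif.myacc") works
mlr_measures$add("classif.myacc", MeasureClassifMyAcc)
```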
Another ten lines just describe the strategy parameters of the algorithm. You can also have a look at our super complex implementation of random search — which just samples points randomly; obviously that's also very simple in R. So what you can do is copy this type of code, exchange the inner parts with your own tuning algorithm, load or source it locally, and you have a new tuning algorithm in mlr3. If you're doing this for interesting stuff, open an issue and talk to us; but if the algorithm already exists somewhere, it's easy to connect it — that's the whole idea behind the package. — Any more questions, or shall we first have a 15-minute break so that I can sort the questions a little? — Yes, and we're also nearly out of time. — Okay, my suggestion is we meet in 13 minutes — that would be five minutes to seven, at least on my German clock. Everyone can mute their microphones, please, and I'd suggest not leaving the Zoom call, because I'm not sure whether you can get back in. Have a cup of coffee, take a deep breath — I think we've already covered quite a lot. See you again in 13 minutes; if you're in GMT-5, we'll be back around 11:55. [Break.] — Hello. It's hard to say whether we should wait longer or not; I guess we just go on — it's five minutes to seven, and I can hear your kids in the background. I answered more questions on Slido, except for one that I don't understand because it's very short, so I'll just continue now. — Okay, good. — Wait, Michel, can you see my screen? Just checking. — Yes. Can you increase the font size? That would be great. — Sure; I usually also use RStudio — I'll set this to 100. — Yep, that works. — Okay. So this here is the second gallery post: mlr3tuning on the German credit dataset. The HTML output is linked on our materials page, and I also posted it in the Zoom chat. If you click on the GitHub icon in the top right corner of the gallery post, you get to the GitHub repository where the source code lives, and you can download the Rmd file — so I'll just open the Rmd in RStudio; that makes it easier for me to copy and paste things into the R console. Michel already explained the first part very well: how to benchmark algorithms on the credit dataset. Of course we could also use the benchmark function to try out different algorithms and different configurations, see which works a bit better, and optimize on that — but (a) that's inconvenient, and (b) at some point it becomes cheating, because you're optimizing on the cross-validation. So I'll go through the motions again to show how to do this with mlr3tuning. Before I do, let me set some chunk options just for layouting, load some packages, and set some seeds. The first thing that might be interesting to you is this: every package in the mlr3 universe is driven
by the same logging mechanism: we use the lgr package, and you can always call lgr::get_logger() with a package name and then set the threshold. For example, I will now be calling into quite a few packages — especially Michel's mlr3 package, since internally I'll use the resample and benchmark functions, which create a lot of logging output. That's nice for normal benchmarking but becomes super annoying during lengthy tuning runs, so I set that logger to "warn" — I don't want to see it. For the black-box optimization itself I set the threshold to "info", so I do see which configurations are evaluated. In the gallery post this is also set to "warn" to reduce the output length a bit, but for the live tutorial I'll keep it at "info". I'll skip through the text: first we load a couple of packages, which isn't very interesting because you already know them — mlr3, a bunch of recommended learning algorithms, the tuning package, and paradox so we can talk about parameter spaces — and we set the theme for ggplot. Now we start. First, let's get our hands on the German credit example task and load it into the session. I will not enable parallelization on my laptop, but if you have a faster machine with more cores and just want to run things faster, enable future here with plan("multiprocess"). Before I start tuning, I have to specify how configurations are evaluated, through resampling. The German credit task is of medium size, so our standard here is 10-fold cross-validation. I'll also instantiate it, so all of the splits are actually created: since I'll try out different algorithms and different configurations later, I want to make sure they are all evaluated on the same splits, to reduce variance in the comparison. Most of our functions, like benchmark and the tuning functions, do this internally, so in most cases you don't have to worry if you forget it; but here, especially because I want to compare different algorithms, I want to be sure we always use the same splits for every comparison. I can even look at the instance and see the row IDs — which observations are in the first fold, and if I scroll a bit, the tenth fold, with the others in between. As before, I'll use a k-nearest-neighbors method as my classifier for German credit, via the kknn package — a nice, flexible implementation of the k-NN algorithm. I set predict_type to "prob", to always predict probabilities and not only hard class labels, and you can also see how I set a specific constant hyperparameter that I'm not tuning, at least in the beginning: the kernel of the kknn method, set to "rectangular". As the next step we have to decide which hyperparameters to tune, and if you don't want to read the kknn documentation, you can also ask mlr3 — Michel showed this already — which parameters exist in kNN, or in kknn I should probably say.
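The setup steps just described, roughly as they might appear in the gallery post — note the "bbotk" logger name applies to mlr3tuning versions built on bbotk; older versions log everything under "mlr3":

```r
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(paradox)

# silence per-iteration resample/benchmark output, keep the tuning info
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("info")

set.seed(7832)   # some seed, for reproducible splits
task = tsk("german_credit")

# optional: parallelize evaluations on a bigger machine
# future::plan("multiprocess")

# 10-fold CV, instantiated once so every comparison uses the same splits
cv10 = rsmp("cv", folds = 10)
cv10$instantiate(task)
head(cv10$instance)    # row ids and their fold assignment

learner = lrn("classif.kknn",
  predict_type = "prob",     # predict probabilities, not only hard labels
  kernel = "rectangular"     # constant hyperparameter, not tuned (yet)
)
learner$param_set            # which hyperparameters exist for kknn
```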
You see the k parameter — the number of nearest neighbors. You see a setting for the distance function, i.e. which norm to use: the L1 norm (Manhattan distance), the L2 norm (Euclidean distance), or something numerical in between — that's a double parameter. You see this kernel, the kernel function that weighs the examples in the distance computation, and a flag for whether to scale features before running k-NN. If you know k-NN from a more algorithmic or theoretically minded course, you know it's a pretty good idea to scale features first, and you'll also see this later in our empirical results. Now I have to specify my search space, and I want to keep it simple in the beginning: I'll only tune k, from 3 to 20 in this example, and the distance parameter — but only as an integer. I could have used a double here, since we can use an Lp-type distance metric where the exponent is a real value, but to keep it simple I only want to try exactly the values 1 and 2, as integers. So let me define this. Then I create the instance as before — I think I've mentioned this five times now: pass in the task, the learning algorithm, the resampling, specify the measure (we optimize the classification error), pass in the search space, and, because I'll use grid search at first, set the terminator to "none". Let's execute this. We've seen before how to set up grid search: tnr("grid_search"), with resolution 18 and batch_size 36. That might be interesting: what does resolution 18 mean? It means we use 18 different values for k; for distance we can only use two, because it's an integer parameter going from 1 to 2. So the cross product of this space contains 18 × 2 = 36 configurations. And because grid search is embarrassingly parallel — there's no interplay between the evaluations, it's such a simple algorithm — I set batch_size to 36, which means that if parallelization were enabled, everything would be evaluated in parallel: maximal parallelization. On my laptop it doesn't matter, because I'm just using one core — it's a small laptop. I can now look at the result slot of the instance, and what I get back is something super uninteresting: NULL, because I haven't run the tuning yet. So we run the tuner, with the optimize function, on the tuning scenario, and you can see it starts tuning and evaluating the 36 configurations. I just have to wait a minute or two to see the result — I did test this before; for the longer runs later I have pre-computed the results, but I'm a pretty impatient person, so let's see whether it finishes within 60
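The code for this first, small tuning run might look like this (same API caveats as above):

```r
search_space = ParamSet$new(list(
  ParamInt$new("k", lower = 3, upper = 20),
  ParamInt$new("distance", lower = 1, upper = 2)  # 1 = Manhattan, 2 = Euclidean
))

instance = TuningInstanceSingleCrit$new(
  task         = task,
  learner      = learner,
  resampling   = cv10,
  measure      = msr("classif.ce"),
  search_space = search_space,
  terminator   = trm("none")   # grid search terminates on its own
)

# resolution 18 for k, but distance has only 2 values: 18 * 2 = 36 configs;
# batch_size 36 would evaluate all of them in parallel if a plan is set
tuner = tnr("grid_search", resolution = 18, batch_size = 36)

instance$result            # NULL -- nothing has been tuned yet
tuner$optimize(instance)
instance$result            # best k, distance, and the associated classif.ce
```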
seconds. Is there anything else we can answer in the meantime, Michel? — No, nothing new. — Okay, it's finished, so I can explain the logging a little. You see timestamps; you see the logging level, this "info"; you can see that we're optimizing two parameters; we print out which optimizer we use and the terminator; and you can see we have to wait for 36 evaluations. And here we can peek into the results — hmm, something is wrong with the printing in my session at the moment, that shouldn't happen — but this is how it looks: the tuner chose k = 9, so nine nearest neighbors, with Euclidean distance, and this is the optimal classification error obtained with that configuration: about 25 percent misclassification. As I said, this is already a bit biased — it's not an unbiased estimator of future performance on new data. I also already showed you how to access the archive: you peek into the instance, access the archive, and convert it into a data.table with $data. You can see the unnested parameters we optimized over, k and distance — this is our grid. If you're wondering why the entries are ordered a bit weirdly: we randomize the order of the grid search for some technical reasons. And here you see the classification error in this column — all pretty similar. Again we can plot performance, and this time I'm not only plotting k on the x-axis: I'm also coloring by the distance parameter, so you can see whether it makes a difference; on the y-axis is, obviously, the classification error. You can see that on average the Euclidean distance works a tiny bit better. So let's make things a little more interesting: how about searching over a larger parameter space? I want to optimize k from 3 to 50, on a log scale as I explained before; I now also want to truly optimize distance — the parameter of the distance metric — as a real number from 1 to 3; I'm optimizing the categorical kernel parameter over rectangular, gaussian, rank, and optimal; and I'm optimizing the boolean, the logical parameter that controls whether to scale features. So let me define this: four parameters, a somewhat more challenging search space, and I also add the log-space transformation for k that I showed on the slides. Because the space is now a bit larger, you probably don't want grid search anymore: exhaustively evaluating everything might get quite expensive. The next best algorithm, if we don't want anything too advanced, is random search, where we completely control how many evaluations we do, because it just samples the search space randomly — also embarrassingly parallel. Now I really have to set a termination criterion: I set it to 36 evaluations, as before for the smaller search space, and batch_size to 36 again so this is completely parallelized. Let me execute this — and I'll skip ahead a little, because I'm not patient enough to wait another two minutes.
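The enlarged search space with the log trafo for k, tuned by random search — a sketch under the same API assumptions:

```r
search_space_large = ParamSet$new(list(
  ParamDbl$new("k", lower = log(3), upper = log(50)),  # searched on log scale
  ParamDbl$new("distance", lower = 1, upper = 3),      # now a true double
  ParamFct$new("kernel",
    levels = c("rectangular", "gaussian", "rank", "optimal")),
  ParamLgl$new("scale")
))
search_space_large$trafo = function(x, param_set) {
  x$k = max(1L, as.integer(round(exp(x$k))))  # map back to integer k
  x
}

instance2 = TuningInstanceSingleCrit$new(
  task         = task,
  learner      = learner,
  resampling   = cv10,
  measure      = msr("classif.ce"),
  search_space = search_space_large,
  terminator   = trm("evals", n_evals = 36)   # random search needs a budget
)
tuner_random = tnr("random_search", batch_size = 36)
tuner_random$optimize(instance2)
instance2$result
```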
I ran this before, in my dry run preparing the tutorial, and this is how the result looks: it optimized these four parameters, you can see the optimal configuration, and the classification error is now a little better — although we probably should have used a larger budget than 36 evaluations for four parameters. You can see the complete archive, spanning these four columns and all evaluations. As you can see, we also store all of the resampling results, all of the predictions, all of the evaluations; you can configure this and switch it off if you think it wastes memory, but if you want to peek into all the computed results, that's definitely possible — it's all in there. Michel taught you how to work with these objects; you just access the data.table and get things out of it. Here I'm unnesting again, for the transformation: if you're annoyed that k is stored on the log scale, you can unnest the x_domain and see it on the original scale — I briefly thought we would have to round this too, but no, silly me: once you unnest the x_domain, you see the transformed k on the original scale, and of course it's already an integer. And you can do nice things again: you can plot the archive with ggplot. (I should have pre-computed the next chunk — sorry, I'll skip that and go directly to the most interesting part.) I've shown the AutoTuner again in the post, but there's not a lot of new material in it; I think the most interesting part of the use case is in the appendix. There we run exactly the same task with exactly the same search space and so on, but with more evaluations — 3600, so the budget increased by a factor of 100. You don't have to pre-compute this yourself: we stored all of the results in an rds file; I'm not sure how long the run takes, maybe about an hour. Structurally the results look exactly the same, just with 3600 rows, but you can now see very nice effects if you simply use ggplot on the archive. For example, k versus classification error, with the points colored by whether we scale or not — and as you can see, scaling is obviously beneficial and helps. This is what we teach — what I teach my students when I cover k-NN in a university lecture: because k-NN is based on a distance function, usually the Euclidean distance, it's a good idea to become scale-independent when features are measured in different units and scale everything first. And here you can see it in the plot: results are much better if we scale. Of course you could argue, "I know this already — should I really tune over it?" You can shortcut there if you know; in other cases it's less clear, and at least in these toy examples it's nice to see it confirmed. I can also color the points by kernel, and here the picture is a little less clear — there seem to be some patterns in there.
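A plotting sketch for this kind of archive analysis — the exact archive column names (x_domain_k etc.) and the unnest helper's prefix argument may differ between mlr3tuning/mlr3misc versions, so inspect names(arch) first:

```r
library(ggplot2)
library(data.table)

arch = as.data.table(instance2$archive)
# unnest the x_domain list column into scalar columns such as x_domain_k
arch = mlr3misc::unnest(arch, "x_domain", prefix = "x_domain_")

# k (on the original scale) vs classification error, colored by scaling;
# geom_smooth() overlays the average trend per group
ggplot(arch, aes(x = x_domain_k, y = classif.ce, color = scale)) +
  geom_point() +
  geom_smooth(se = FALSE)
```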
To make this more visible, we now also use geom_smooth — and now I'm talking more about ggplot than about mlr3tuning. You can see these smoothed average lines, and if we zoom in, you can see there's apparently a certain dependency between k and the kernel: depending on which kernel you use, different values of k are optimal. There's also at least a rough dependency between distance and kernel. So from these very simple plots we can already extract a lot of information: scaling is very influential, and setting it to true is beneficial; the distance parameter seems to be the least influential; there's an interaction between k and kernel; and, to a certain, actually much smaller extent, one between distance and kernel. The only part I skipped here is how to do proper nested resampling with the AutoTuner, and the only reason I skipped it is that it looks exactly the same as on the slides. So I'll stop with this part and move on to what is, in my opinion, maybe the most interesting part of the tutorial: mlr3pipelines. Michel, have you confirmed that Martin is already here? — I've chatted with him over Mattermost; he should be here. — He's here. — Great. So I'll now move on to mlr3pipelines. What I should say first is that what I'm presenting for this third package is, to a very large degree, Martin's work — I'm not even the second author; that's Florian Pfisterer, another PhD student of mine — and that's also the reason why Martin is presenting the use case. So I'm proudly presenting his work, and like I said, I think this is at the moment probably the most exciting part of mlr3, because you can do so much interesting stuff with it and combine it with everything else. So what is mlr3pipelines about? It's all about machine learning workflows. — Sorry, we're losing you; could you repeat the last part? — Is it better now? I'm actually on a cable connection, so it should be pretty stable. — Yes, now we can hear you. — Okay, sorry, let me start over. mlr3pipelines is about constructing machine learning pipelines, machine learning workflows — complex ones; it's a kind of modeling language for machine learning workflows. We all know nowadays that applied machine learning is about a lot more than taking an already complete, externally preprocessed dataset, running a single algorithm on it, maybe tuning it a bit, evaluating it, and being done. The most important steps of applied machine learning nowadays often happen in feature extraction, feature preprocessing, and feature selection; data is quite dirty, and you have to do a lot with it before it goes into your ML algorithm, especially if you want to achieve optimal, or at least good, results. mlr3pipelines is a design language that allows you to do exactly that.
And if you want to squeeze out the last epsilon of performance, look at Kaggle and how people usually win competitions: they often use a second ingredient, namely ensembling — model averaging, model stacking, and so on — and this is also possible with mlr3pipelines. In fact, mlr3pipelines allows you to do a lot of very complicated things, because it's a quite abstract and very general workflow language geared towards machine learning. My running example for the beginning is a simple linear pipeline: we scale the features, then we encode factor variables because our ML algorithm can only handle numerics, then we do some imputation to handle missing values, and then we go into the learner. That's what I would call a linear pipeline, and it's what many people are perfectly happy with — but mlr3pipelines can do a lot more, because it actually works on graphs. How do workflows look from an abstract point of view? We basically have two building blocks. First, a computation that is executed on something — this is what we call a pipe operator, and it's what you see as the nodes in the graph. Each of these pipe operators takes one or multiple input objects, computes one operation on them, and outputs the result — usually one object, but it can be multiple. Second, the structure: the connections, the edges, which specify in which sequence things happen and how the computational steps are interconnected — they control how information flows. If you combine nodes and edges, what you get is a graph — in this case a computational graph — and that graph is what we call a pipeline. I guess we could also have called the package mlr3graphs, but "pipelines" is the more popular term. You could also call pipelines a kind of data-flow modeling or data-flow programming language, because what usually — not always, but most of the time — flows along these edges is some form of preprocessed data. So let's start with the pipe operators and how they work. In a certain sense you can think of a pipe operator as a generalization of an ML algorithm. You've already understood how an ML algorithm works: data goes in, a training procedure runs, a model comes out; then there's a predict procedure where data goes in again and a prediction comes out. Pipe operators are a generalization of that: they don't necessarily train models, but they do something on data, and in a certain sense you could say they also learn something from data. They are objects, with a constructor and again some nice sugar: you can use this po() function to construct your pipe operator. In the beginning I'll use the scale operator as a very simple example, because I think most of you understand how scaling works. Now the important thing, which makes everything a little more complicated: an operator is not a single function call. In machine learning it's unfortunately always two functions — training and prediction. What happens in training? Training data goes in, we scale each feature, and transformed, scaled data comes out. And while we are scaling the data, we also store the scaling factors.
The usual scaling in statistics — there are different versions of it — subtracts the mean and divides by the standard deviation for each column, and those means and standard deviations are what I call the scaling factors here. You could also call them learned parameters: in a certain sense we learn them on the training data, and we store them inside the operator, just as we store the model inside a learner. During prediction something slightly different happens: test data comes along — again a block of data, a data table — and we push it through the operator, but now we don't learn anything; we simply apply the scaling factors learned during training, and the transformed data goes out. So every pipe operator always has a training function and a prediction function. This type of operator — training data in, transformed data out during training; data in, data out during prediction — is what we call a preprocessing operator; that's the abstract base class. There are more complicated operators that take multiple inputs or produce multiple outputs; I'll show all of them later in more complex examples, so I don't want to explain them prematurely. Let's see how this actually looks in code. We construct our pipe operator for scaling, and we train it. Because operators can take multiple inputs, they are always trained on a list — and note that I'm showing the operator here as a single computation step so you can understand how it works; later you really wouldn't do this by hand, because you would connect the operators in a graph, in the pipeline. So the training function always takes a list, because there can be multiple objects; in this case it's simply a list containing one task. The output is also a list, again because there can be multiple objects, and here the output is simply the transformed data. You can see it's the iris task — I think most of you have seen it — with some weird-looking feature values, and the reason they look non-standard is that I've subtracted the mean and divided by the standard deviation. Each pipe operator also has a so-called state, and the state contains exactly those learned parameters from training. If you look at the state, you see a list — because we're using the scale() function from R, which I'm guessing most of you have also seen before — with one center vector holding the means of each feature, and the scale vector holding the standard deviations. You can peek into this and look at it; this is what we learned during training. During prediction, you use the operator's prediction function, again with a list — I've just created a smaller task here, and some of the extra code is only there so I can squeeze all the output onto the slides. This is basically just predict on a list containing the task, and you can see how the new data is transformed with the learned parameters.
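The single-step walkthrough just described, as a short sketch:

```r
library(mlr3pipelines)

po_scale = po("scale")                   # construct the PipeOp

# train: input and output are always lists, because PipeOps
# can have multiple input/output channels
out = po_scale$train(list(tsk("iris")))
out[[1]]$head()                          # the transformed (scaled) task

# the state holds the learned parameters: per-feature means and sds,
# as computed by R's scale() function
po_scale$state

# predict: apply the stored scaling factors to new data, learn nothing
po_scale$predict(list(tsk("iris")))[[1]]
```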
If you want to get a feeling for what's inside mlr3pipelines, it's actually best not to look at my slides but at Martin's webpage and our book. All pipe operators are, again, in a dictionary you can list — they don't even fit on the slide anymore; there's a lot of stuff in there, so I won't read it all out. There's a lot of preprocessing: scaling, Box-Cox transformations, PCA, ICA, a lot of imputation techniques, a lot of feature selection techniques (especially filtering methods), categorical data encoding — one-hot, treatment, and impact encoding — and sampling methods, for example for class balance corrections. There are also methods you'll see later: branching, a rather complex meta-operator that allows alternative data flows; feature union, which I'll also show later for ensemble construction; text processing; and date processing. And we have some things planned for spatio-temporal data, some model-related operators, and some outlier detection — that exists on GitHub in various stages of development, but isn't really in pipelines yet. Okay, let's make it more interesting and connect these things into full graphs. For connecting, the package gives you a couple of operators. The most important one is the %>>% pipe operator — a kind of arrow operator we created — which concatenates pipe operators, or graphs, in a sequential manner. In the most complex version, you have a partial graph whose right-hand side has nodes with outgoing edges not connected to anything, and another partial graph whose left-hand side has nodes with incoming edges not connected to anything, and %>>% connects the two. In the simplest case you just connect one node to the next, and that one to the next. But it also works on more complex partial graphs, which is obviously important: you can build more complex structures from simpler elements. Then there's the gunion operator, which is even simpler: it takes two partial graphs and creates a new graph out of them without connecting anything. And there's a little helper that's quite useful, for example for ensembling: the pipeline_greplicate function, which creates copies of a graph or partial graph and simply merges them together, calling the union operator on them — so you can get, say, ten copies of the same element in your graph, which is pretty helpful, as you'll see later in an example. And of course I should also mention how you get machine learning algorithms from mlr3 into pipelines: there's an operator called PipeOpLearner, and you can put any mlr3 learning algorithm inside it to get a little pipe operator that applies that algorithm. Finally, you can wrap each pipeline as a GraphLearner — basically just a little piece of connecting code; you'll see it in the examples very quickly — which turns the complete pipeline into a learning algorithm and allows you to use everything from mlr3 on it again, including resampling, benchmarking, tuning, and
nested resampling — everything we've shown you before. So how does this work with linear pipelines? Super simple: you write down a pipe operator for scaling, a pipe operator for encoding, always connected with this %>>% operator, then something for imputation — say we impute all NAs with the median of the column — and then a decision tree. This creates a linear pipeline, and we're basically done: you can now wrap it in a GraphLearner, and after you've done that you can call resample on it, tune it, benchmark it — done, you're now basically in Michel's world, so to speak, in mlr3. And of course I can train it. Just to make clear what happens when I train this: training data is pushed through; we learn the scaling parameters here, the factor levels here, the feature medians here, and of course the decision tree in the last step. In prediction, new data goes in; we scale it with the learned scaling factors, apply the same encoding for factors, impute data as we've learned it, and then apply the decision tree. Some extra comments on how to work with this in practice. Obviously you can set the hyperparameters of all of these pipe operators. The code for this is lengthy, but it's not difficult to explain: every graph — every mlr3 pipeline object — contains a list of pipe operators, and this list is named, because all pipe operators have IDs. They have standard IDs — the scaling operator is called "scale" — but you can rename it; call it "foo" if you like. You access this list with the ID of the operator; the operator has a param_set, and as with learning algorithms, you can access its values, read them out, and set them. It works in exactly the same manner as for mlr3 learners: access the list entry by ID, then simply work with the param_set. And since you can select each individual pipe operator from the graph, you can obviously also access its state, as I showed before, and read out the state of each operator after the pipeline has been trained. If you set a little debug option — not enabled by default, because it would waste memory — you can also access the result of each pipe operator. So if data is pushed through the pipeline, for example during training, getting transformed step by step, and you want to see what the data or the object actually looks like at some point — which makes a lot of sense if you want to debug a complex pipeline — set this debug option and then look at $.result: that gives you the output of that step, the object that is computed there and then flows into the next operator.
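The linear pipeline plus the hyperparameter, state, and debug access just described — a sketch; note that the GraphLearner clones the graph, which is why the graph itself is trained below to inspect states:

```r
graph =
  po("scale") %>>%
  po("encode") %>>%          # factor encoding (one-hot by default)
  po("imputemedian") %>>%    # impute NAs with the column median
  po("learner", lrn("classif.rpart"))

# hyperparameters of every pipe operator are accessible by its id
graph$pipeops$scale$param_set$values$center = FALSE

# train the graph directly to inspect states and intermediate results
graph$keep_results = TRUE    # debug option, off by default (costs memory)
graph$train(tsk("german_credit"))
graph$pipeops$imputemedian$state    # the learned column medians
graph$pipeops$scale$.result         # output that flowed out of this step

# or wrap it as a learner and hand it back to mlr3
glrn = GraphLearner$new(graph)
rr = resample(tsk("german_credit"), glrn, rsmp("cv", folds = 3))
```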
And we can create non-linear pipelines, which I think is the most exciting thing. Some people already asked on Slido: can I do ensembles with mlr3? The answer is yes — it's actually on this slide how you can build your own bagging method. Here I create a pipe operator for subsampling, randomly selecting for example 80 percent of the rows of my training data; then a decision tree; then I replicate this element three times and use the pipe operator for prediction averaging. That creates a graph that takes this subsample-plus-tree path, replicates it three times, and does majority voting at the end: my own bagged tree — nearly a random forest — created with two lines of code, and if you train it, it looks like this. Or you can do stacking. Here I'm already doing something pretty complicated: I take a linear model, I take a support vector machine, and I use the null operator, which I'll explain in a few seconds. And I'm not just creating a pipe operator for each learner — I'm creating a pipe operator that cross-validates the learner: each time data is pushed in, the learner is cross-validated and the cross-validated predictions are pushed out. Those of you who know stacking well know that you should build the ensemble on cross-validated predictions. The null operator simply passes data along, so it passes the original feature values through. Then I do a feature union, which cbinds the predictions of the first learner and the predictions of the second learner with the original features, so I can fit my second-level stacking model — I use a random forest. And what this actually does is not just combine the two models: it can combine them in a feature-dependent manner, which in the literature is sometimes called a gating model, a gated stacked model. Very complex — and that's this code here. You're completely free to define variants of this; we're not restricting you in any way, so use your own creativity and implement whatever you want through this mechanism. The last thing I want to mention is branching. Branching allows you to control the data flow in a complex graph. For example, you might have some kind of feature extraction or transformation and you're not sure whether you should do A or B — maybe either a PCA or an ICA to transform your features. So you create a branch: data flows either here, and we do the PCA, or there, and we do the ICA. You can use this ppl() helper — ppl stands for a partial pipeline, a small graph — as an abbreviation to construct this very quickly, and it gives you a computational element with a switch parameter: a control parameter, branch.selection, which you can set either to pca or to ica, and data flows accordingly. Why is that interesting — why this complicated construction instead of two linear pipelines, one with PCA and one with ICA? Because built like this, you now have a hyperparameter that controls which preprocessing is executed — and since it's a hyperparameter, AutoML and hyperparameter tuning techniques can figure out automatically what works best, even what works best in combination with which learning algorithm, and so on. So branching is a very nice mechanism to create complex AutoML systems from scratch in mlr3pipelines, and I have a gallery post online that describes this in a bit more detail than I can here. I'll skip the part that shows how to target specific columns with pipe operators: every pipe operator can be restricted to certain column subsets — for example via ID patterns, or by saying I only want to apply this to real-valued columns, or only to factor columns. That's all possible, and it's very simple.
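Two of these patterns as sketches — a tiny bagging ensemble and a branched preprocessing step (stacking works analogously with po("learner_cv"), po("nop"), and po("featureunion")); the ppl() argument names follow the sugar functions as I understand them, and po("ica") needs the fastICA package:

```r
# bagging: subsample 80% of rows, fit a tree, replicate 3x, majority-vote
single_path = po("subsample", frac = 0.8) %>>%
  po("learner", lrn("classif.rpart"))
bagged = ppl("greplicate", single_path, n = 3) %>>%
  po("classifavg", innum = 3)
bagged_lrn = GraphLearner$new(bagged)   # almost a random forest

# branching: data flows either through PCA or through ICA,
# controlled by a single switch hyperparameter
branched = ppl("branch", list(pca = po("pca"), ica = po("ica"))) %>>%
  po("learner", lrn("classif.rpart"))
branched$param_set$values$branch.selection = "pca"   # or: "ica"
```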
— Can you hear me again? — Yes, it's working again. — Nice. — But you dropped out for a few seconds. — I dropped out? For how long? Okay, a few seconds are fine. Most of what's on this summary slide we have already shown. We have the ppl() abbreviations for certain common patterns in pipelines, which create partial graphs. We haven't pre-built many of these yet, but we'll add more and more sugar elements, so that things everyone always wants to use don't have to be created from scratch — we'll store them in the package, and you get these partial graphs and can work on them directly. At the moment we have helpers, for example, for class balancing and chunking. I'll keep this short, because I'm probably already well over time. As I said, you can do a lot of nice things with mlr3 for AutoML — you can see mlr3pipelines as a design language for building AutoML systems, and we're very interested in doing that ourselves. You can create difficult structures and then have your hyperparameter optimizer figure out what works best. Very often you want to construct something like this: different preprocessing methods with their hyperparameters, different machine learning algorithms with their hyperparameters, different feature extraction methods — and you use branching to say either this, or that, or that. All of these pieces have hyperparameters, and at the end you get a joint space of all hyperparameters of the pipeline, and any tuner can work on it. You can use random search or grid search, but for these more complex structures you'd probably want something more efficient — Bayesian optimization and so on — to speed up the process in these high-dimensional tuning spaces. From a conceptual point of view, though, it's no different from what I showed you before in the tuning part. So let me go directly to the example, which will hopefully convince you that you can do pretty cool stuff with this. What I'm writing down here — there's also a nice plotting function for pipelines in the package — is the following: either a PCA for feature extraction, or nothing, so we keep the original features; then a feature filter using ANOVA to cut down my potentially high-dimensional data; then branching again, and I either use a support vector machine, gradient boosting, or a random forest — and then I'm done. To specify this pipeline, this is the code you have to write: branching for the PCA, your feature filter for the ANOVA filtering, branching for the three learning algorithms, with some IDs I define manually and some static settings. These three partial graphs I simply pipe together into my complete graph, put it into a GraphLearner, and I'm now in the mlr3 world — and I can use mlr3tuning on it, which is exactly what I want to do. So I can write down a very lengthy parameter set: I can say which feature extraction to use, and I can tune the parameters of my ANOVA filter, of my random forest, of gradient boosting, of the support vector machine.
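A condensed sketch of that two-page example, with just two of the three learners; the parameter ids below are illustrative — in a real graph they are prefixed with the pipe operator and learner ids, so check graph$param_set$ids() for the exact names:

```r
library(mlr3filters)

graph =
  ppl("branch", list(pca = po("pca"), nop = po("nop"))) %>>%
  po("filter", filter = flt("anova")) %>>%
  ppl("branch", list(
    svm    = po("learner", lrn("classif.svm")),
    ranger = po("learner", lrn("classif.ranger"))
  ), prefix_branchops = "lrn_")   # second branch needs distinct ids

glrn = GraphLearner$new(graph)

# the joint tuning space: which preprocessing, how many features to keep,
# which learner -- plus (not shown) each learner's own hyperparameters,
# with dependencies so they are only active in the matching branch
tune_ps = ParamSet$new(list(
  ParamFct$new("branch.selection", levels = c("pca", "nop")),
  ParamInt$new("anova.filter.nfeat", lower = 5, upper = 20),
  ParamFct$new("lrn_branch.selection", levels = c("svm", "ranger"))
))
```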
If you look at these two slides together, they already define a pretty complex AutoML system, and this is the hyperparameter tuning for everything. It probably takes a few minutes to come up with, but it's two pages of code, and it does something pretty cool: you can fully parallelize it, you can run nested resampling on it, and you can change it as much as you want; it's extensible and flexible. I know I was probably not fast enough, so apologies to Martin, but this ends my presentation here, and I will now mute myself.

And that is the beginning of my presentation, I guess. I still need to share my screen; let's see if this works. Can you see my screen? Great. Yes.

So, we've seen that mlr3pipelines can do a lot: you can go all out with stacking, alternative-path branching, and so on. One of the first reasons you would need pipelines, though, is preprocessing. Preprocessing is what you have to do when your data is in some way broken, or your learning algorithm cannot really work with it and you need to adjust the data to make it compatible. That is one step before enhancing your features, where you want to extract something from the data. In the example I'm showing, we're working with data that doesn't work with the machine learning algorithm we want to use. You can also follow this example online; you will get the link to the page where everything is already rendered, so you can follow the code. What's happening here is that we're still using the German Credit data that you've seen before, but in this code block, which you don't need to understand in detail, we introduce missing values: we set some of the data to NA. So if we look at the task we created, it now has missing values. I'm also going to set a seed and instantiate a resampling, so that when we compare things, we get a fair comparison.

Let's start with mlr3pipelines. We've already seen quite a few pipe operators, PipeOps, and the question is which ones we have. We can ask the mlr_pipeops dictionary, which lists basically all of them, or, if we're in a hurry, just type po() with empty parentheses, which prints the dictionary. So let's start: we have a learning algorithm we want to use, in this case the ranger random forest, and we want to train it on our task. But once we do this, we get an error message: there is missing data and our learner cannot handle it. So what do we do? We can impute. We ask our dictionary which of the stored operations we can use for imputation, and it gives us a list: we can impute with a constant value, with the mode, or out of range. You may want to look at these to make an educated decision, and the way to do that is to ask for help: question mark, then mlr_pipeops_ plus the name of the PipeOp. In RStudio this opens a very nice help page.
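In code, the dictionary lookup goes roughly like this; it's a sketch, with the histogram imputer mlr_pipeops_imputehist standing in for whichever PipeOp you want to read up on:

```r
library(mlr3pipelines)
library(data.table)

# The dictionary of all registered PipeOps ...
as.data.table(mlr_pipeops)

# ... which po() without arguments prints as well
po()

# Help page for one specific PipeOp
?mlr_pipeops_imputehist
```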
The help page explains what the operation does, what parameters we can set during construction, and everything else we might want to know about it. Here we're using out-of-range imputation, which we also get with the quick-access function that Bernd already mentioned: we just say po("imputeoor"), the PipeOp called "impute out of range", and this gives us a very nice object. This is our pipe operation, and it tells us some things about itself: its name, its hyperparameters and which of them are already set, and its input and output. This one accepts a task and outputs a task, so it's a preprocessing operation. As Bernd already said, such an operation is basically a generalized learning algorithm, so we can train it on data, and then we get the imputed data. Remember task$missings(), the command that tells us whether the task has missing values: before, it showed missing values; if we ask the same of the result of this operation (I can just run this), there are no missing values anymore. The values were imputed.

So what does out-of-range imputation do? For categorical features it adds a new category that is out of range of the old categories: a category for "missing". Before, credit_history was a factor feature with a few levels and had missing values; now there is an additional level marking missingness. This is different from having a missing value, which the learner doesn't know what to do with; it is an actual new category, and the random forest has no problem working with it.

We can now create a learner from this pipeline. We build the graph that first does the imputation and then fits a learner, and this graph we again have to turn into a learner. Here we are in the mlr3pipelines world: we have a graph, and we could, for example, add further operations to it. But we want to use the mlr3 tools, so we want to go back to the mlr3 world, and we do that by creating a GraphLearner. If we look at what we have: the graph tells us it consists of operations connected in certain ways, but what we want is a learner, and the GraphLearner is a learner, so we can use it, for example, to train. Now we don't get any errors, and we have a fitted learner. We can even resample this thing, and it achieves a certain performance; we could also run benchmarks with it and, for example, see which imputation methods work better or worse. And the nice thing, as a more advanced feature, is the branching that Bernd showed: the imputation method itself becomes a hyperparameter, so we can tune over the imputation method.
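A sketch of the steps just described, assuming task is the German Credit task with the artificially introduced NAs from the demo:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# The out-of-range imputer as a standalone PipeOp
imp = po("imputeoor")
imp   # prints id, hyperparameters, input/output channels

# A PipeOp trains like a generalized learner: a list of inputs in,
# a list of outputs out
imputed = imp$train(list(task))[[1]]
imputed$missings()   # all zero now

# Imputation plus random forest as one graph, wrapped as a learner
graph_lrn = GraphLearner$new(po("imputeoor") %>>% lrn("classif.ranger"))
graph_lrn$train(task)

# ... which we can also resample or benchmark like any other learner
rr = resample(task, graph_lrn, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))
```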
I'm going to skip the feature filtering and go straight to robustification, because it fits better into this narrative. Another way your learner can fail is that it is trained on features that are constant, or that it is asked to predict on a factor level that was not seen during training. I'm creating a task here that is a very small subset of the whole training task. Why am I doing this? It's a quirk of this specific dataset: a few of the features take basically the same value in the first rows. No, sorry, they are not constant; the problem is that some factor levels occur in the full task but are not seen in the first 30 rows. So we're trying to predict for a level that has not been seen before, and the random forest complains, and the logistic regression complains: we're asked to predict on a level we haven't seen during training.

And now the constant-feature case: suppose we have just two rows and try to train a model on them. Obviously this is destined to fail for various reasons, but one of them is that some features don't change at all, and we get the error message that there are constant features. Logistic regression doesn't like that, because it leads to a singular design matrix.

So what do we do? We collect all the operations we have that can fix the individual issues, and because this makes the learner more robust, we call it "robustified". We might have the problem of factor levels that occur on the prediction side but were not seen during training, where the learner would fail, so we use the PipeOp fixfactors, which drops levels that were not seen during training and turns them into missing values. Next we have to remove constant features, and there's a PipeOp for that as well; it's like the iPhone advertisement: for everything there is a small tool that does exactly what you want. And in the end we also have to impute, because fixfactors turned some values into missing values. Here we have to be careful: we cannot impute out of range, because that would just reintroduce unseen factor levels; we impute by sampling from the observed values instead.

A feature I'm showing now: we have this graph, and I printed it before, but the printed representation is not very convenient; you don't really see what's happening. So we have the plot function, which gives us a very nice representation, and here we see a linear graph. This can obviously get much scarier, as you've seen in Bernd's slides, with lots of parallel paths and so on. An even nicer way to display it is html = TRUE, which gives an HTML representation; this is JavaScript doing some fancy things, it just looks nicer, and you can play around with it and zoom.

So now we have this robustified pipeline that does fixfactors, removeconstants, imputesample and feeds the result into logistic regression, and we can train it on our completely broken task. This thing had only two rows, but we can train on it, because the constant features were removed, and we can predict without trouble, even though the prediction task has lots of values that were never seen during training. Whether training on a dataset with two rows gives you good predictive performance is a different question. So we've seen that you have to do quite a few things to get everything to work.
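A sketch of this hand-rolled robustification, using the PipeOp keys named in the talk (fixfactors, removeconstants, imputesample):

```r
library(mlr3learners)
library(mlr3pipelines)

# Fix unseen factor levels, drop constant features, impute by sampling
# from the observed values, then fit a logistic regression
robust_graph = po("fixfactors") %>>%
  po("removeconstants") %>>%
  po("imputesample") %>>%
  lrn("classif.log_reg")

robust_graph$plot()             # static plot of the (linear) graph
robust_graph$plot(html = TRUE)  # interactive JavaScript rendering

robust_lrn = GraphLearner$new(robust_graph)
```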
How nice would it be if the package gave you a way to build these things automatically, without having to remember everything required to make a learner robust? And here we actually have something; these are the partial pipelines Bernd already showed: a few prepackaged solutions for problems you might encounter, and one of them is, obviously, making things robust. How does this work? We can just ask: give us something to make stuff robust, and we get a pipeline. But it is a very big and complicated pipeline. Why? Because it tries to be ready for all eventualities: maybe my learner cannot handle missing values, so it has to impute; maybe my learner cannot handle factor features, so it has to encode them to get numeric features instead. We can also give it a few arguments: here is the task we want to handle, here is the learner we have. If we run it that way, we get a much smaller graph; it's not huge anymore, because it only includes the steps that are actually needed. For example, it no longer needs to encode anything, because our learner can handle factor features. And this one we can also train and predict with.

I think we're very close to finishing, so maybe there are a few questions we should answer. There was the question whether you can implement your own pipe operations, and I already answered with a link, but I think it's nice to show: all of this works with R6, so it is possible to implement your own operation just by inheriting from the PipeOp class, and the only functions you actually have to implement are the training and the prediction function. You can look at our documentation; it looks scary because it's so long, but that's because we give you lots of examples to learn from. The principle is very easy.

Any other questions? Maybe someone else can help me. Bernd, I think there were a couple on Slido; I see one, it says... well, Bernd is super busy answering questions. Okay, I see, I'm already answering everything by typing, sorry.

A few things to say, maybe. There is obviously not enough time to go through all the depths of what is possible, but I think it's nice to show a very cool feature we have, which is stacking, and how it actually looks in practice. All you do is decide which learners you want to stack and create the pipe operations for those learners; maybe you also want to do other things, like preprocessing. Then all you have to do is put them before a feature-union operator, and as before we can plot this and get the graph. We can add even more: we can do a second level of stacking. All we have to do is build the graph unit by unit and then put the units together, as in the sketch below. You can work with this like you would with Lego bricks: you just assemble everything, and in this way create your own machine learning algorithms from building blocks.
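Two sketches of what was just shown; the exact demo code is online, so this is only the shape of it, with task again standing for the demo's German Credit task and the learner choices picked for illustration:

```r
# Prepackaged robustify pipeline: given task and learner, it keeps
# only the steps this combination actually needs
learner = lrn("classif.log_reg")
rob = ppl("robustify", task = task, learner = learner) %>>% learner
rob_lrn = GraphLearner$new(rob)
rob_lrn$train(task)

# Stacking: cross-validated level-0 learners, glued by a feature union,
# with a level-1 learner on top
stack = gunion(list(
  po("learner_cv", lrn("classif.ranger")),
  po("learner_cv", lrn("classif.kknn")),
  po("nop")   # also pass the original features through
)) %>>%
  po("featureunion") %>>%
  lrn("classif.log_reg")
stack$plot()
```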
Martin, may I answer two questions from Slido? That would be easier to do orally. One question was whether it is safe or recommended to switch over from mlr to mlr3 right now, and my answer is that this is actually, I think, the perfect point in time to do it, because the package is mature now. The last big API change was about one or two months ago, together with mlr3tuning: that was basically renaming two or three functions and introducing bbotk. It did break the API a bit because of the renaming, but the concepts didn't change, and we won't do anything like that again. The packages are well tested; we have a lot of unit tests and we test very, very heavily. In my research group we have begun using it for our own experiments and our own research, and I would heavily recommend making the switch now, unless you have specific individual reasons not to, for example because you have hundreds of thousands of lines of code written against the old system.

We can also discuss what is worse in mlr3 compared to mlr2; maybe Martin and Michel have to help me here a little. From my perspective, I really hate that we don't have Bayesian optimization at the moment, but we work on that, and depending on how urgently you need it, Martin has written a little connector package that connects mlrMBO, which is mlr2-based, with mlr3pipelines. It's a little awkward how we do that, and in a certain sense suboptimal, because it mixes two different package generations; that is an existing downside, and I want to work heavily on it. We have a sprint next week, and getting Bayesian optimization into place will be an important topic there. Of course we don't have the 160 learning algorithms of mlr2 in mlr3 yet, but the mlr3learners package, plus the additional learner packages in the mlr-org on GitHub, already contain a lot of interesting stuff, and migrating a learning algorithm from mlr2 to mlr3 can be about an hour of work. I think the most important stuff is there; I doubt anybody has ever used all 160 algorithms exhaustively, and the important ones we already migrated. If you're missing something very specific, migrating it yourself, or talking to us, is also super simple. Beyond that, Michel and Martin, I'm not sure whether you can come up with something that mlr2 does a lot better.

Maybe it's not directly related, but it's very important to say this: you will run into trouble if you have both the old and the new mlr loaded in the same R session. That is something to watch out for; unfortunately you cannot use the one directly next to the other. Whether you run a single script or keep an R session open, you should load either the one or the other, otherwise you run into trouble.

You have the same problem if you load mlr2 and caret together, I guess, because some of the names are just identical: both packages implement a train function, and with identical names, one shadows the other.

And, I guess you're okay with me mentioning these downsides, is there anything else? Not much, right? And much is already a lot better: parallelization is a lot better, pipelines are a lot better. I was also asked what the most important feature of mlr3 is, the one that stands out if you compare it to caret, to tidymodels, and other packages, and my answer was basically: pipelines. I don't think many other packages have this super flexible system that we have, with its flexible combination with everything else. I know Andreas Müller from scikit-learn very well, and
we actually talked about this topic. I think if you tried to publish a paper comparing the pipelines you can currently build with mlr3pipelines to what you can do with plain scikit-learn, I would argue that we are not yet as mature and not as complete in terms of provided functionality, but in terms of flexibility and the underlying structure and mechanisms, I would say we are quite a bit more flexible.

In general, maybe another upside of mlr3 compared to the old mlr is that, because we use R6, it's very easy to find out what you can do with an object. In RStudio you type the object name and the dollar sign, and tab completion shows you everything the object offers: you have a task, you type the dollar, and you get the options of getting the data, of filtering, of selecting rows, and so on. In mlr you sometimes had the problem that you could do a lot with these objects, but you always had to remember what the function names were.

If I can add to that: I think Martin is completely spot on; I would just add one important point, and that's the container objects, what you get back from the functions. In mlr2 it was often a complicated nested list-in-list-in-list kind of thing. In mlr3 it's nearly always a data.table that you can program on directly, and it always has the same structure: everything basically looks like a resample or benchmark result object, with a task, a learner, maybe a configuration and a metric, and so on, so it immediately makes sense. If you go from resampling to benchmarking to tuning, more columns are simply added to the same structure, so you don't have to remember much; you just have to remember that structure and then, I don't know, be a good data.table user, I guess. Or dplyr, that's also fine; you have to know how to work with tables, and I would say every good R programmer should know how to work with tables. We think that's a really good interface, because in data science everything is kind of a table. Well, not everything; many things. Sometimes things are also graphs.
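A small sketch of that uniform table structure, using standard mlr3 calls; the task and learner choices here are arbitrary:

```r
library(mlr3)
library(mlr3learners)

# Benchmark two learners on one task; the result is a data.table-backed
# container that can be programmed on directly
design = benchmark_grid(
  tasks = tsk("german_credit"),
  learners = lrns(c("classif.ranger", "classif.log_reg")),
  resamplings = rsmp("cv", folds = 3)
)
bmr = benchmark(design)

as.data.table(bmr)                # one row per resampling iteration
bmr$aggregate(msr("classif.ce"))  # aggregated scores, again a data.table
```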
So I think we are over time already, right? Okay, we are kind of perfectly on time, because we started about eight minutes late. Yes, that's why.

So I just wanted to thank you all, Bernd, Michel, and Martin, for giving this webinar and for letting us host it as well. This webinar was recorded and it will be made available to you too. So thank you again, and if you have any questions, you can put them in the chat and I can pass them along to either of you, so we can answer them for you. Okay, so thank you very much.

Can I add two sentences? Of course you can pass the questions along, but we are also very active on GitHub and very active online, so I think the easiest way to get in contact with us is to go to the various issue trackers and talk to us there. We have a Mattermost that you can join, and you can ask us questions on Stack Overflow; it's all possible. And thanks for your patience: it was three hours, which was long, and a lot of material; I guess we could also have done five hours. So apologies for maybe overdoing it a bit, but we really wanted to show the full strength of the thing, and we have a lot of material online where you can go through all of this again in your own time. So apologies if this was quite stressful.

Oh no, it was great; at least, you know, it's a good starting point, and I'm sure many people will go through the recording again more slowly. Yeah, sorry. That's okay, thank you again. Do you have anything else to add, Michel? No, thank you, thanks for everything. You're welcome, thank you. So I will be ending the meeting now. Thank you very much. Thanks, bye. Bye bye. Thank you, bye bye.