The floor is yours.

Good afternoon, everybody. My name is Sašo Džeroski and I come from Ljubljana, which is really not far from Trieste, about an hour's drive. My primary affiliation is with the Jožef Stefan Institute, a large research institute in the natural and technical sciences. It has at least three departments working on artificial intelligence and at least five working on materials science, from nanostructured materials to ceramics and so on. So far there has not been much interaction between the two, and there have not been many applications of machine learning to materials in Ljubljana. Special thanks to the organizers for inviting me, because that has actually changed in the last month or so, since I was invited to give this talk.

I will talk today about some of the machine learning methods that my group has developed: methods that produce explainable models and work on relatively complex data, such as molecules, which are structured objects. You really need to deal with structured objects to learn predictive models on such data. But first things first. QSAR stands for Quantitative Structure-Activity Relationship modeling: we learn models for predicting one or more properties of chemical compounds. The relation to this workshop is that we can do pretty much the same for materials, and as you will see, we are actually doing it. To illustrate, the task of QSAR modeling is to predict a property of chemical compounds, for example activity against a particular pathogen, say tuberculosis. You can see on this slide that the compounds are naturally represented by graphs, and the property we want to predict might be a real-valued number, so this is regression on structured objects.

Now, when I say complex data, it is useful to have a reference: what is simple, and what are we comparing the complex data to? The simple data format I take as the baseline is the one most people know from statistics going back several hundred years: a single table of data, with a target property, or dependent variable, that you want to predict on one hand, and a bunch of independent variables that you want to use in the predictive model on the other. If all of the columns are real-valued, we talk about regression tasks; if the target column is discrete, say with just the values yes and no, we talk about classification problems. So this is our baseline, what I will call simple data.

When we talk about complex data, there are quite a few additional dimensions of complexity to consider. There has been a lot of talk about big data, and already there, there are several dimensions along which data can be big or complex. One is volume. You can have a large number of columns in the data table, in which case you need to do some selection on those, or estimate their importance. You can also have a large number of rows, massive data, and the extreme of this is the paradigm of data streams, where there is a potentially infinite number of data points: the data just keep coming, you cannot count on loading all of them before you start processing, and you really need to learn incrementally as the data arrive, discarding old data and focusing on what comes next. A minimal sketch of this style of learning follows below.
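To make the streaming setting concrete, here is a minimal sketch of incremental learning with scikit-learn's `SGDRegressor` on a simulated stream; the data-generating process is invented for illustration, and in a real stream the batches would arrive from outside.

```python
# Incremental learning from a (simulated) data stream: the model is
# updated batch by batch, and the raw data is never kept around.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # hypothetical ground truth
model = SGDRegressor(learning_rate="constant", eta0=0.01)

for batch in range(1_000):                       # stands in for an unbounded stream
    X = rng.normal(size=(32, 5))                 # one mini-batch arrives
    y = X @ true_w + rng.normal(scale=0.1, size=32)
    model.partial_fit(X, y)                      # incremental model update
    # X and y are discarded here; only the model state is retained
```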
Another important aspect is the variety of the data, and this is the dimension of complexity on which we move beyond the simple tables from the previous slide. We can have structured inputs as one type of complexity, and structured outputs as another. You have already seen this slide: the data are not just rows in a table but naturally have structure; for example, molecules can be represented as graphs. That is the case of structured inputs, and I will show other examples a bit later. But there are also structured outputs, where the value you want to predict is not a simple scalar: it can be a vector, or even a hierarchical structure, as in the case of hierarchical classification.

Here we have the two basic tasks of multi-target classification and multi-target regression, where you need to predict a number of discrete variables or a number of continuous variables simultaneously. A very natural example of such a task is weather prediction. If we just want to predict the outlook, whether it will be sunny or overcast, that is a classification problem; if we want to predict the temperature in degrees Celsius, that is a regression problem. But weather is a complex phenomenon: you should really look at the outlook, the temperature, the humidity, the expected quantity of precipitation, and the direction and magnitude of the wind. There is a whole set of interrelated variables that describe the weather, and it makes a lot of sense to predict them all together rather than each of them separately. This is the motivation for developing methods for multi-target prediction.

We actually have a whole taxonomy of multi-target prediction tasks. Besides multi-target regression and multi-target classification, you can have additional layers of structure on the target variables. An important case is hierarchical multi-label classification, where the labels you want to predict are related to each other and these relations are hierarchically structured. For example, suppose you want to predict the species present in a certain sample of water, say from the Adriatic Sea, just a few minutes from here. Living organisms are organized in a taxonomy, and each sample can contain more than one animal or plant. So this is clearly a task of multi-target prediction, and if we take the taxonomy of living organisms into account, it is a task of hierarchical multi-label classification, where multi-label means that each target is binary.

So here is an example of a multi-label classification task. The data in this table do not come from the Adriatic Sea; they come from rivers in Slovenia, real data measured in the course of monitoring water quality there. We have a number of descriptive variables that characterize the water: the water temperature, the concentrations of different pollutants such as nitrates, the concentration of oxygen, and so on. A small sketch of how a single tree can predict several targets at once appears below.
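As an aside on how multi-target prediction looks in code: scikit-learn's decision trees natively accept a two-dimensional target matrix and split on the per-target variance reduction averaged over the targets, which up to a constant factor is the summed variance used by the predictive clustering trees discussed later. The weather-style data below is invented for illustration.

```python
# One tree predicting three (hypothetical) weather targets at once.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # descriptive variables (invented)
Y = np.column_stack([
    2.0 * X[:, 0] + rng.normal(size=200),        # temperature
    -X[:, 0] + X[:, 1] + rng.normal(size=200),   # humidity
    0.5 * X[:, 2] + rng.normal(size=200),        # wind speed
])

tree = DecisionTreeRegressor(max_depth=3).fit(X, Y)
print(tree.predict(X[:1]))                       # one prediction, three targets at once
```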
Returning to the river data: on the target side, we have a number of organisms for which we want to predict presence or absence. And here we have an example of a hierarchical multi-label classification task, where the taxonomy of living organisms is taken into account.

An additional dimension of complexity, even though you might not think of it as such at first, is missing labels in the data. Normally, if you want to solve a regression or classification task and you have a data point whose label is missing, where you do not know the value of the dependent variable, you cannot use that data point in the construction of the model. In semi-supervised learning, the task is to use both labeled and unlabeled data points in the learning process. There are clear benefits to using unlabeled data, especially when the labels are not easy to acquire: expensive, laborious, and the like. A case in point is, in fact, structure-activity relationship modeling. If we want to test a chemical compound for activity against, say, tuberculosis or salmonella, we need to perform a lab experiment: we grow a cell culture, infect it with salmonella, and try different compounds to see whether they reduce the number of bacteria. Each additional compound we want to test means extra time, extra labor, extra money. This is why it is very desirable to also be able to use unlabeled data when constructing predictive models.

Of course, we face exactly the same problem not only in classification and regression, but also in multi-label classification, hierarchical multi-label classification and multi-target regression. In fact, there the problem is even more likely, because the labels are more complex and more expensive to acquire. It is obviously more expensive to test a compound for activity against tuberculosis, salmonella, leprosy and other pathogens than against a single one. So we are more likely to have data points where not all of the properties of interest are measured. You can expect the same for materials: you might synthesize a material and then measure only some of the properties that interest you, so some of the data points will have missing values. These target values might be missing for all of the targets, or only for some of them; here, for example, only the value of the first target is missing, while the other two have measured values, as in the small illustration below.
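Just to make the shape of such data concrete, here is a tiny, invented example of a multi-target label matrix with partially missing values:

```python
# A multi-target label matrix with missing entries (numbers invented).
import numpy as np

Y = np.array([
    [np.nan, np.nan, np.nan],   # fully unlabeled example
    [np.nan, 0.72,   1.30],     # first target not measured, other two are
    [0.15,   0.64,   0.98],     # fully labeled example
])

labeled = ~np.isnan(Y)          # per-target availability mask
print(labeled.sum(axis=0))      # number of measurements per target: [1 2 2]
```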
A further case in point: you have already had quite a few talks about neural networks, some of them today, including deep neural networks. Deep neural networks are very good at learning from complex data, including graphs; there are graph neural networks. But one problem they have is that they cannot easily explain their predictions, and another is that they are very data hungry: in principle they need lots of labeled data and cannot learn much from a few data points. Granted, there are approaches like transfer learning and all sorts of tricks to alleviate these problems, but the problems are inherent. And the kind of explainability I am referring to is not only explaining individual predictions: you might also want to look at the learned model itself and see the importance of the different variables, and possibly their interconnections and interactions. This is very important in medicine, and it is also important in many sciences, including materials science.

So in my group we have been focusing on learning explainable models, and very common forms of explainable models are classification trees on one hand and classification rules on the other. I will not go into the history of AI here, but the expert systems, the first artifacts that defined the public appearance of artificial intelligence, were built around if-then rules for representing knowledge. Much of the development of machine learning was motivated by the fact that humans cannot easily state all the rules they follow; rather, such rules have to be learned from examples of people's behavior and problem solving.

In any case, we can learn such models from tables of data like the ones we saw earlier. Here is another table of real-world data from Slovenia. Slovenia has a very large and lively population of brown bears; in fact, one morning there was a report on the news that a brown bear cub had been found just outside the Ljubljana Zoo, and it did not come from the zoo. We have been doing some work on modeling habitat suitability and also population dynamics for the brown bears. These are data collected by our forestry colleagues, describing different locations in terms of predominant land cover (forest, grassland), proximity to settlements, and forest abundance. From this we can learn very simple models, like the if-then rule at the bottom of the slide, which says that bears like forests, the forest should be dense, and bears like their peace and quiet: they feel comfortable about a mile outside of settlements. Of course, this rule is only for the male bears. Mama bears are very, very fussy: when it comes to the cubs, they want a specific type of forest; it cannot be just any forest, it has to be beech forest. But I digress. The point is that we have explainable models we can look at: I was just walking you through the bear model, and we could even make jokes about it.

In a very similar fashion, we can explain models for multi-label classification. Here we have a decision tree for multi-label classification. It looks at the concentration of nitrates and at the chemical oxygen demand, and if both are high, it predicts that Nitzschia palea and Tubifex are present. Nitzschia palea is an alga that is very tolerant to pollution, and Tubifex are worms that really like dirty water. So with high oxygen demand and lots of nitrates, we get bioindicator species that are indicative of exactly such conditions. Similarly, we can have decision trees that predict the composition and structure of the communities of living organisms found at particular sites, here for Slovenian rivers.
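To connect this back to the bear-habitat example: if-then rules of this kind can be read directly off a learned tree. Here is a minimal sketch using scikit-learn's `export_text`; the feature names, data and thresholds are hypothetical, not the actual forestry data.

```python
# Reading if-then rules off a small habitat-style classification tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
forest_cover = rng.uniform(0, 1, size=300)       # fraction of forest cover (invented)
dist_km = rng.uniform(0, 5, size=300)            # distance to nearest settlement
X = np.column_stack([forest_cover, dist_km])
# Invented target: bears present in dense forest, far from settlements.
y = ((forest_cover > 0.8) & (dist_km > 1.6)).astype(int)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["forest_cover", "dist_settlement_km"]))
```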
We can also learn such trees, predicting one or multiple targets, on relational data. Earlier I showed you data about chemical compounds; this next example comes from another domain: movies, actually from IMDB. We have movies, we have ratings of the movies made by users, and we have information on the users, their nations, and the regions those nations are in. From this we can learn models like this one, which says: if a movie has at most 213 ratings, it is a drama; otherwise, if the average age of the users who rated it is above 42.2, it is a comedy; else it is a thriller. These features have been constructed automatically by the decision tree learner. There was no separate process of building embeddings and then learning a tree on top; the decision tree natively operates on this relational representation, which is very powerful. In particular, for QSAR models we typically have structured data describing the atoms in each molecule, the bonds between the atoms, ring structures, and so on.

The last item I would like to mention here is that individual trees, while potentially easy to interpret, are not always the most accurate. So people quite often resort to learning ensembles of trees, combining the predictions of the different trees to obtain better predictions. But then, of course, we lose interpretability, which is why feature importance estimation methods used together with tree ensembles are very, very important.

In a nutshell, what we have been doing in my group is adapting these methods for learning decision trees and tree ensembles to, first, do structured output prediction: multi-target regression, multi-target classification, hierarchical multi-label classification, all of these tasks of predicting not just a scalar but a data structure. Second, we have adapted them to semi-supervised learning, the task of handling missing labels. Third, tree ensembles, and finally, producing feature rankings based on these tree ensembles, plus some other feature ranking methods. All of these are implemented and publicly available in a software package called CLUS, which can do all of these things, and in fact rather their Cartesian product: we can do semi-supervised hierarchical multi-label classification, with tree ensembles for that, and feature ranking for that. We also have a more recent software package that can learn relational classification trees and ensembles and produce rankings in the relational context. I will not talk more about the more complicated relational case here.

I will now very briefly give you an idea of how you learn these trees for multi-target prediction, and what the key things are that you need to do to extend the classical algorithm for top-down induction of decision trees to multi-target prediction or hierarchical multi-label classification. The algorithm for constructing a decision tree from a training set first checks a stopping condition: if all of the examples in the data set have a low variance of the target, meaning they are all very close together in the target space, then we can make a leaf and take the average of their targets, the prototype, as the prediction. In the case of a single target variable and regression, this is exactly the stopping condition used in ordinary decision trees.
Now, you will notice that this notion of variance can actually be generalized and defined over multiple targets. It is the same notion of variance that is used in clustering: in clustering, you want clusters that are compact, that have low within-cluster, or intra-cluster, variance. Predictive clustering trees use exactly this notion of variance from clustering, extended to the multi-dimensional case, as opposed to the variance of a single real-valued variable.

If the data are not sufficiently homogeneous, we have to select an attribute, an independent variable, to put in the root of the tree. We do this based on the reduction of variance achieved by putting that test in the tree: whichever test reduces the variance of the targets the most is selected, the data are partitioned according to the test, and we recursively construct subtrees for the resulting subsets. So this really is the same algorithm for learning decision trees, except that the notion of variance is generalized; and of course, the prediction placed in a leaf should be a multi-dimensional average, or some other form of prototype or representative object. As I was explaining, this notion of variance is really the intra-cluster variance from clustering, which is the average squared distance between each example and the centroid of the cluster.

There are differences, of course, between single-target predictive modeling, clustering, and predictive clustering. In predictive modeling, the clusters really need to be compact only along one dimension, the dimension of the target, because that is all you care about; along the other dimensions they can be arbitrarily spread. Here we have a very small decision tree that produces two such clusters, and each cluster comes with an explanation: if condition B is satisfied, we go to one; if not, to the other. In clustering, in contrast, we do not have a condition on which to split the data: we have the two clusters, and their defining feature is that they need to be compact along all of the dimensions considered, so we are roughly looking for hyperspheres of small volume. At the bottom, we have the notion of predictive clustering, which combines the two paradigms: you get a tree that partitions the data into clusters, and the clusters need to be compact along more than one dimension, should you so choose. In fact, predictive clustering trees are general: if you take only a single target dimension into account, you get ordinary decision trees; if you take all of the dimensions into account, you get standard clustering; and if you select a subset of the variables that you want to predict, you are somewhere in between. A compact sketch of the induction algorithm with this generalized variance is given below.
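Here is a toy reconstruction in Python of top-down induction of a predictive clustering tree for multi-target regression, under the simplifying assumptions that all inputs are numeric and candidate splits are single-feature thresholds. This is a sketch of the idea, not the CLUS implementation.

```python
# Toy top-down induction of a predictive clustering tree (multi-target).
import numpy as np

def sse(Y):
    """Generalized variance: sum of squared deviations over all targets."""
    return float(np.sum((Y - Y.mean(axis=0)) ** 2))

def best_split(X, Y):
    """Find the (feature, threshold) pair with the largest variance reduction."""
    best_feat, best_thr, best_red = None, None, 0.0
    total = sse(Y)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:      # never split off an empty side
            left = X[:, j] <= thr
            red = total - sse(Y[left]) - sse(Y[~left])
            if red > best_red:
                best_feat, best_thr, best_red = j, thr, red
    return best_feat, best_thr, best_red

def grow(X, Y, min_examples=5):
    feat, thr, red = best_split(X, Y)
    if feat is None or len(Y) < 2 * min_examples or red < 1e-9:
        return {"prototype": Y.mean(axis=0)}     # leaf: multi-target average
    left = X[:, feat] <= thr
    return {"feature": feat, "threshold": thr,
            "left":  grow(X[left],  Y[left],  min_examples),
            "right": grow(X[~left], Y[~left], min_examples)}
```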
So the algorithm is very much the same as the usual algorithm for learning decision trees, except for the variance function: for multi-target regression, you sum the variances along each of the targets, and this is your new variance. Your variance is now more complex than it used to be.

You might wonder why you would want to build one tree that predicts all of the targets at the same time; you could just as well build a single tree for each target, and you would not need to listen to my lecture here. The point is that if you try this in practice, then on average, across a number of domains, you get a clear advantage from predicting the targets simultaneously rather than each of them separately. Here we have examples from three domains, all from ecology: the first is Slovenian rivers, the second Danish farms, and the third Australian vegetation, and in each case there is quite a large number of targets to predict. For the Danish farms we had the smallest number, somewhere around 50 or 60 targets; for the Slovenian rivers maybe about 500; and for the Australian vegetation more than 3,000. Looking at the average results, the left-most column is single-target prediction, where we build a separate model for each target; the next is multi-target prediction, one model predicting all targets; and the third column is hierarchical multi-label classification, because these are all community prediction problems: the species are living organisms, they can be placed within the taxonomy of living organisms, so we can do hierarchical multi-label classification rather than just multi-label classification. There is a clear gradient of improvement as you go from single-target to multi-target, and from multi-target to hierarchical multi-label. And the last step is using ensembles on top of everything.

The key to why this happens is that you do not overfit as much when you build multi-target prediction trees. When you build a tree for a single target, you can adapt to that target a lot, especially if you have a lot of data: you can build very large, very branched trees. However, if you need to predict more than one target, then by overfitting one you will be underfitting another. Getting a tree that is balanced, in the sense that it predicts all the targets well, is a much more difficult task, and to do it, you must not overfit. We actually measured overfitting scores, shown here at the bottom, which demonstrate that it really is the prevention of overfitting that makes the multi-target prediction models perform better.

Next, I would like to say a few words about how we adapt these approaches to semi-supervised learning. It is actually very, very easy. In the function that calculates the variance, we do not take into account just the variance across the targets, the y's; we also include the inputs, the x's, the independent variables, with some weight. This weight can be tuned, and we usually tune it by internal cross-validation. With that, we can handle unlabeled examples very well, as the sketch below illustrates.
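A minimal sketch, again a toy reconstruction rather than the CLUS code, of this semi-supervised variance: a weighted combination of the variance over the targets, computed on the labeled entries only, and the variance over the inputs, to which every example contributes. The weight `w` would be tuned by internal cross-validation in practice.

```python
# Semi-supervised variance: w * target variance + (1 - w) * input variance.
import numpy as np

def semi_supervised_variance(X, Y, w=0.5):
    # Target part: per-target variance over labeled (non-NaN) entries only;
    # nansum also skips targets that have no labels at all in this node.
    var_y = np.nansum(np.nanvar(Y, axis=0))
    # Input part: every example contributes, labeled or not.
    var_x = np.sum(np.var(X, axis=0))
    return w * var_y + (1.0 - w) * var_x
```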
For the unlabeled examples, we can still compute the input part of this variance; of course, we cannot compute their contribution to the variance on the targets, since the target values are missing. You do not need to read this slide, but you will notice lots of red numbers, and red means that semi-supervised learning, which can take the unlabeled data into account, performs better than fully supervised learning. What is even more important is that by using semi-supervised learning you can learn smaller models that are just as accurate, or even more accurate, than those from fully supervised learning. I have a favorite example here: the tree learned with fully supervised learning uses only 50 labeled examples; it has 11 nodes and an accuracy of 81%. The tree on the right-hand side was learned using the exact same 50 labeled data points, but the learning algorithm also had some 2,000 additional unlabeled data points. By learning from both, we get a tree that is smaller, not by much, two nodes less, but much, much more accurate: the accuracy goes from 81% to 92%. This is why I like this example a lot.

The last thing I will mention here is that we can also produce feature rankings. Since we can learn trees and tree ensembles for multi-target prediction, once we build an ensemble we can compute feature rankings from it, using, say, the random forest score, the Genie3 score, or other scores adapted from single-target prediction. So we have feature ranking for all of the different types of multi-target prediction. This also works for ensembles of relational trees, and we were the first to consider feature importance estimation in the context of learning relational models.

Before concluding, I would like to show a couple of examples of how we use this for QSAR modeling, for virtual compound screening. The first was a collaboration with Leiden University, where we were looking at host-directed drugs for tuberculosis and salmonella. The second was a collaboration with another research institute here in Trieste, ICGEB in Padriciano, where we were looking for potential drugs to help recovery after heart attack. In QSAR modeling for virtual compound screening, you construct descriptive variables for the compounds. These can be of a structural nature, but they can also describe the compounds in terms of the proteins they target, from public databases such as PubChem. And you can see that this is a relational problem: you have the structure of the compounds in terms of atoms and bonds, but also functional groups and protein targets, all one-to-many relationships; this is not straightforward to convert into a single table.

As I mentioned, we were looking for host-directed drugs for tuberculosis and salmonella. Our colleagues from Leiden assayed a library of pharmacologically active compounds, more than 1,200 of them, and measured the reduction in bacterial load, which is essentially a measure of how effective a compound is at killing the bacteria. But of course, you have to worry not only about killing the bacteria: it is not OK to kill the host cell either. If you kill everything, both the bacteria and the host cells, that is no good; you want the host to survive. We therefore look at both bacterial load reduction and host cell viability, and we build multi-target regression models for predicting these two properties of the compounds. A hedged sketch of this kind of screening workflow is shown below.
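As an illustration of the workflow, here is a sketch using an off-the-shelf multi-output random forest in place of the actual models from the talk; all shapes, names and thresholds are hypothetical, and a real pipeline would compute descriptors for the library compounds with chemoinformatics tooling.

```python
# Virtual screening sketch: fit on ~1,200 assayed compounds, rank a library.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1200, 64))       # compound descriptors (stand-in data)
Y_train = rng.normal(size=(1200, 2))        # [bacterial load reduction, host viability]

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, Y_train)
print(model.feature_importances_[:5])       # random-forest feature ranking, as discussed

X_library = rng.normal(size=(100_000, 64))  # stand-in for a large compound library
pred = model.predict(X_library)             # two predicted targets per compound

safe = pred[:, 1] > 0.0                     # hypothetical host-viability cutoff
order = np.argsort(-pred[safe, 0])          # most effective against bacteria first
candidates = np.flatnonzero(safe)[order][:100]  # shortlist for the wet lab
```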
With these models, we were able to greatly increase the proportion of hit compounds. We built the models from these 1,200-plus compounds and then applied them to the whole of PubChem, to millions of compounds, and from those we got a small number of candidates, on the order of tens, maybe a hundred at most. And if you think our collaborators simply went for the top five and tested them, that is not how it worked. They went down the list and said: at position 12, oh, I have this one in the fridge, let's test that one; at position 27, the colleague from the other institute has that one; oh no, this one is expensive, let's go for this cheaper one instead. Even so, even though they did not take the best-ranked compounds we predicted, they still got a much, much better hit rate than from simply screening a large number of compounds.

The second example I mentioned was a case of so-called high-content screening, where the effect of a compound on a cell culture is assessed by taking a photograph of the culture under the microscope and extracting features from the image using image processing. The task here was to identify drugs that reduce fibrosis after myocardial infarction, that is, after a heart attack. Here we were quite successful: very recently we had a nice publication in Cell Death & Disease. We identified one compound that worked very well, and then colleagues in synthetic chemistry modified it to be better suited for delivery via the lungs. That was a very nice example of collaboration. And the other work, on tuberculosis and salmonella, was published in Nature Communications, so these were quite nice publications.

What I would like to say is that you can do more or less exactly the same for predicting the properties of materials. There, you typically have multiple properties of a material that you want to predict, and typically you have missing values of the targets for at least some of the materials that have been synthesized and characterized. We were able to experience this first-hand, and I am using this opportunity to invite you to visit the posters today: one where we looked at corrosion inhibitors, another on the properties of foam glass, and one more on learning to predict the properties of electro-catalysts. These will all be presented at the poster session today; I hope to see you there. And tomorrow, Cynthia has an oral presentation about the mechanical properties of tungsten-based composites, which are used in fusion reactors, where we also have very nice results. I hope you will be here for Cynthia's lecture tomorrow. Thank you so much.

Thank you very much for the nice talk. I see a question here already; I am going to give you priority.
Thank you for the talk. I will start with the size of the data. Do you hear me? Is it echoing? OK, I will repeat. The question is about the size of the data and the sensitivity of the model's predictions to it. In the medicinal sciences, from what I have seen, changes in some of the features and numbers do not affect the predictions much. What would be your guess if somebody tried to replicate this for materials science design?

The size of the data definitely matters. For some of the data sets we analyzed for the poster presentations tonight, the data sets were very small: we had maybe tens of compounds. There, of course, you cannot expect great results. But still, if you evaluate the results properly, if you test the predictive power of the models with cross-validation rather than just testing on the training set, and if you consider realistic scenarios of how the models will be used in practice and adapt the evaluation to that, I still think the machine learning methods can be very useful.

And what would be your guess about going down the nodes of a tree, trying to find unique patterns involving different variables or features, say in materials science design: how far down can we go before we start to lose the interpretability of the model, as we find new things we have not seen in the data before?

Of course, a tree that is 30 levels deep is not very interpretable, and the same problem arises with tree ensembles. But you have the feature importance estimates, which are actually very, very helpful. Our domain experts typically do look at the feature rankings and check whether they are meaningful to them, and this can, for example, influence their experimental design when synthesizing materials. We experienced this in one of the materials science applications that will be presented tomorrow, and in another one at the posters. So if you have a domain expert who understands the data well on one hand, and some interpretability on the other, and I am not claiming that with ensembles you get to the bottom of it, but the feature importance estimates do tell you quite a bit of the story, then you can get useful feedback for modifying the experimental design and exploring further alternatives.

If I am allowed one last one? OK. Could you go back to slide number 24, maybe one before; OK, I got lost; yes, this one. You talked about moving from STP to MTP. I see that for the Slovenian rivers, the Danish farms and the Australian vegetation, the different types of data set that you have, the increase from STP to MTP is not always that great.

Yes, and there is a very good explanation for that. The Danish farms have relatively few targets, and if you have only a couple of targets, there is only so much room for improvement. The room for improvement is also smaller when you look at ensemble models rather than individual trees.
So where there is room for improvement, multi-target prediction does improve things; where there is little or no room, you just get the small extra increment that is possible. It still really helps you squeeze out the most there is in your data.

Yes, it achieves the achievable, I get it. Thanks.

OK. I think we have no time left, but thank you very much again. Yes, thank you very much.