Welcome, everyone. My name is Olga Leshevska and I work as a postdoc in Ireland. I'm going to show you how machine learning can be applied in the sciences. In the previous talk, if you were here, we had a nice introduction to all kinds of ensemble methods, so here I'm going to show you one specific case using gradient boosting.

Here is the background of the problem. Over the past 60 years we have observed a decline in the size of fish by about four centimeters on average. If you think of herring, which is about 20 centimeters long, four centimeters is a lot of reduction. So we would like to find out what the problem is and why it is happening, and we are going to use machine learning to answer this question. Why is it a problem? Because herring is a very important species for consumption, and we know that if the fish get smaller, there are consequences for future stock production: there will be less fish in the future, so we can consume less. We don't know what is causing the decline, but we presume there is an interactive effect of various factors: sea surface temperature may be changing, zooplankton abundance may be changing, fish abundance may be changing, or there may be fishing pressure. I'm not sure what's happening.

To answer this question, I'm going to use data for the past 60 years, from 1959 to 2012, and the data is spread throughout the year. The data has been collected from commercial vessels, taking 50 to 100 samples at random at a time, and the total sample size is about 15,000 individual fish. So imagine a data set of 15,000 rows. The study area, where the data comes from, is the Celtic Sea, just to the south of Ireland, bounded by St. George's Channel and the English Channel, so you can imagine roughly where we are now and how big the study area is.

The objective is to identify the important factors which underlie this problem, and to answer this question I'm going to use gradient boosted regression trees, which is one of the ensemble algorithms available these days. Why an ensemble? Because we don't have just one tree, we have a collection of trees, and the final model is improved because that collection of trees is interlinked. As opposed to methods such as bagging or random forests, where the trees are independent, in this method all trees are dependent, in the sense that the residuals of one tree, the unexplained part of the model, enter as the input to the next tree. So we have a sequence of interconnected trees, which is a nice feature: it allows us to reduce variance and to reduce bias. The only problem is that, because the trees are interlinked and sequential, we can't parallelize the algorithm; they all depend on each other.

The advantages of gradient boosted regression trees are basically the same as those of other ensemble methods. Just to mention a few: we can detect nonlinear feature interactions, because of the feature selection that goes on inside the algorithm. It is resistant to the inclusion of irrelevant features, which means we can include as many variables as we like, and if they are irrelevant they simply won't be selected, so we don't care. It is also good when we deal with data on different scales: we don't have to standardize the data.
We may wish to standardize, but we don't have to, because the method is robust to that; if we used a normal linear regression instead, the model would explode, so in this case this is a really good advantage. It is also robust to outliers: if there are data points which don't fit the rest of the data, maybe because of a mistake or some special event, we don't care at all. It is more accurate, and we can use different loss functions, for instance least squares or others, which are implemented for gradient boosted regression trees, which is nice. The disadvantage is that it requires careful tuning; it takes a lot of time to get a good model. It is slow to train but very fast to predict. After I finish this part of my talk, I'll show you the implementation in a Python notebook and how I did it.

Now a little bit of equations. The formal specification of the model: it is an additive model, so we have a sequence of trees, and each tree is weighted; they all combine through a gamma weight, and each individual tree appears in the other part of the equation. We then build the additive model in a forward, stagewise fashion: as I said, we add each tree sequentially with a parameter epsilon, which is the shrinkage, also known as the learning rate. The learning rate controls how fast we descend along the gradient. Finally, at each stage the weak learner is chosen to minimize some loss function. In my case I took least squares, because it's a natural choice, but it can be any other function you can differentiate, and that part of the model is estimated via the negative gradient. I won't go into the details; that's all the formality in my talk.

Now the parameters I finally selected. In my case I needed about 500 iterations and a learning rate of about 0.05. These two parameters I refer to as the regularization parameters. They affect the degree of fit and therefore they affect each other's value, which is a bit complicated: if I increase the number of iterations, say by a factor of 10, it doesn't mean the learning rate will decrease by a factor of 10. It's not proportional, and that's why it gets tricky. The next parameter is the maximum tree depth, which in my case is six. For this particular algorithm it's known from theory and from various simulation studies that tree stumps, meaning a single split, often perform best, which is nice; we don't need deep trees. In some cases you may need four to six, or at most eight, splits. In my case it's six, which means my model can accommodate up to five-way interactions. The next parameter is the subsample fraction, in my case 75%. It's optional: if you specify anything less than one, you get a stochastic model, so we introduce some randomness. That can be nice because it helps to reduce variance and bias, and in practice I found it gave a better result, so I kept it. So, to be precise, my model is a stochastic gradient boosted regression tree. And the loss function is least squares.
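The slide with the equations is not reproduced here, but the model I just described can be written in the standard gradient boosting notation; take this as a reconstruction in the usual textbook form, not the exact slide. The ensemble of M trees h_m, each with weight gamma_m, is

    F_M(x) = \sum_{m=1}^{M} \gamma_m h_m(x)

and it is built forward, one stage at a time, with the shrinkage (learning rate) \varepsilon:

    F_m(x) = F_{m-1}(x) + \varepsilon \, \gamma_m h_m(x)

At each stage the weak learner is chosen to minimize the loss given the current model,

    h_m = \arg\min_h \sum_{i=1}^{n} L\bigl(y_i, F_{m-1}(x_i) + h(x_i)\bigr), \qquad L(y, F) = (y - F)^2

and in practice each tree is fitted to the negative gradient of the loss, which for least squares is just the residuals y_i - F_{m-1}(x_i).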
As I mentioned, least squares is a natural choice, nice to start with and easy to interpret, but it could be any other loss function; they are nicely implemented in scikit-learn and it's very easy to change.

Now, estimating the model. In this case I split my data into three parts. If I have enough time, I'll also show you how I did it with a two-way split; I have those results as well and they are very similar, which is nice, as it shows the robustness of my model. But in this case I split the data 50% for training, 25% for testing and 25% for validation. There is no particular reason for those exact proportions; because I have about 50,000 rows I can afford it. If you have less data you may choose, say, leave-one-out or cross-validation or some other method better suited to smaller data sets, but I have a big data set. You can see I have the MSE, the mean squared error, which is the measure of accuracy. It is rather low, so I'm happy enough with my model, and I can see that after some number of iterations the curve flattens out, so there is no big change in MSE anymore, which means I have enough iterations. The R-squared tells me the proportion of variance explained by the model. For the training set it is slightly higher, which may indicate a bit of overfitting, but the gap between them is not big, so I'm satisfied; the curves follow each other very closely, which means that on average my model is doing a good job. And if I reduce the variability in the data, I see that the R-squared goes up, so there is that effect as well.

Now a little bit of results. I plot here the length of the fish. On the x-axis you can see it runs roughly from 20 to 30 centimeters, and my model predicts fish from about 22 to 28 centimeters, so basically, on average, it gives the correct value; extremes that are too small or too big are not predicted correctly. That roughly 50% R-squared is what is reflected in this graph.

And if you want to find out which variables play a role in my model, which is what I wanted to find out, the way it works is that the importance reflects how each variable is used to split the trees: the more often a variable is used for splitting, counting the times it is used, the more important it is. I have a color coding here. The first one is the trend, basically the months, so we know there is some trend in the data, and once I included it I could see it was used in 100% of cases. After that we have sea surface temperature; I'll show you in the next graph how it acts, but basically there is some relationship. The other things are food availability, so whether there is enough food in the sea, and the abundance of fish, so how big the population is, et cetera. The most important message to remember here is that the trend is the important one, and after that we have sea surface temperature and food.
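As a side note, both this importance ranking and the partial dependence plots coming up next are easy to pull out of a fitted scikit-learn model. Here is a rough sketch with a recent scikit-learn; `model`, `X_train` and the feature names are placeholders rather than my actual variable names:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.inspection import PartialDependenceDisplay

    feature_names = ["trend", "sst", "phytoplankton", "abundance"]  # illustrative names only

    # Relative importances: how much each feature contributes across all tree splits
    importances = model.feature_importances_
    for i in np.argsort(importances)[::-1]:
        print(f"{feature_names[i]:>14s}: {importances[i]:.3f}")

    # One-way partial dependence for the top features, plus one two-way interaction
    PartialDependenceDisplay.from_estimator(
        model, X_train,
        features=[0, 1, 2, (0, 1)],   # indices into feature_names
        feature_names=feature_names,
    )
    plt.show()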
If we further visualize those three variables in partial dependence plots, the first row shows the one-way partial dependence plots, where I plot each feature against my dependent variable, the length of the fish. Strictly speaking these don't show a relationship; they show the degree to which the prediction depends on each feature. For this first one we don't really pick up any dependence, but we do pick up something in the two areas I have circled here. What it means is that around 14 degrees, so if the sea surface temperature is below about 14 degrees, there is a positive relationship and the fish get larger; the fish like temperatures up to 14 degrees in this case. If it gets too warm, there is a negative relationship, so it definitely shows some kind of dependence between the length of the fish and the temperature. I don't want to talk about climate change here because it's a very debatable issue, but you can imagine that if the temperature goes up with global warming, it may have an effect on the fish, and eventually on us, because we won't be able to consume the fish we like. So this is an interesting message. And finally, here is one of the food sources; in this particular case it's phytoplankton, which is what the fish eat. Why do I focus on this area and not over there? Because most of my data is concentrated here, as you can see: these little ticks are deciles, so they show where the data sits. The curve may go up over there just because I have some outliers, but I don't care, because I know my model is robust, so I don't interpret that part. Looking at the part where the data actually is, I don't see any dependence, and I think that's simply because food is not a limiting factor here. Obviously if there were less food it would have an effect, but in the Celtic Sea there is a lot of phytoplankton, so the fish don't depend on it. Then in the second row we have two-way interaction plots, where I plot the features against each other, just to see if I can pick up any interaction between them, and we see basically the same story: around a sea surface temperature of about 14 degrees, something is happening.

So what does this analysis tell me? Well, I know which features are important, but I can't really say why. The fact that the trend is important tells me that I might need to go on and use, say, time series modeling to find out how things depend on time. I can't answer those questions with machine learning; all I can do is pick these features out of a bunch of other features on a big data set, and that's as far as it goes. So there are limitations to how you can apply it. To conclude: there are three important features, which in this case are the trend, meaning the time trend, the sea surface temperature and the food availability. Something is going on with temperature, clearly at about 14 degrees, and there is a high degree of interaction between these features. And remember that with this method we can't find cause-and-effect relationships, but we do get the relative importance of the variables: from a bunch of variables, I picked out the ones which are most important, and I can take those with me into the next type of analysis.

Okay, so that is the first part of my talk, and I'm not sure how much time I have, but I would like to show you a little bit of how it was implemented. Okay, I have five minutes. The first part of the notebook is what I've shown you in the presentation, the three-way split of my data set, so I'll go a bit quicker here. Is it large enough to read? Okay, I'm sure this is all familiar to you: I import all the libraries, and because I work in science and want this to be reproducible, I set a seed, so I can run it again and get the same results. I read the data, and you can see I have about 50,000 rows and about 15 features in my case.
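In code, that setup stage looks roughly like this; the seed value and the file name are placeholders, not the ones from my actual notebook:

    import numpy as np
    import pandas as pd

    np.random.seed(42)  # fixed seed so every rerun gives the same results

    data = pd.read_csv("herring_celtic_sea.csv")  # placeholder file name
    print(data.shape)   # roughly 50,000 rows and about 15 columns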
I haven't discussed this, but I do check multicollinearity, which means checking whether I have two features which are strongly dependent. For a normal regression, or where I have only one tree, that will for sure blow up your model; you can't allow it. For an ensemble method, for this particular algorithm, it doesn't matter, but if you can detect multicollinearity it is still better to take out the variables which are collinear. This is how I do it here: I construct a matrix of Pearson product-moment correlation coefficients, and from it I can find out which variables are affected; the higher the collinearity, the more intense the color. There is no hard rule, but everything above 0.8 may indicate multicollinearity. So the variables you see here in red or dark colors are the ones I simply took out of my model, and this one as well. Okay, so I removed them, I do the three-way split, 50%, 25%, 25% for each part, and I fit my model.

These are the final parameters, but it certainly took me a few iterations to be satisfied with what I have. How did I find out how many estimators I need? The usual rule is to set the learning rate as low as possible and the number of estimators, the number of trees, as high as possible. If you do that your model will run forever, but you will for sure end up with something sensible, and then you can start playing around by reducing them. The way I found this 500 is by applying early stopping, which is available in scikit-learn; it comes a little bit later in the notebook. (Sorry, I'm having a bit of trouble with the controls here.) And here is the same graph again, you've seen it before. This is the early stopping I mentioned earlier. What I think is interesting is to quickly show the other part, where we do a two-way split, because a two-way split is, in my opinion, done more often than a three-way split. In a two-way split you only have train and test; you don't have a validation set. To identify the parameters for this part I used grid search. I specified a range of parameters; you can specify ranges for all the parameters you like, but I only took the regularization parameters, because those are the most difficult ones. So I specify the maximum depth here: I know I had six, so I took one step up and one step down, and I know from theory it shouldn't be higher than eight, so I don't go there. And for the learning rate I have 0.05, and I want to increase or decrease it and see how it works. What happens is that we get a matrix, not a confusion matrix, just a grid of different parameter combinations, and for each one we fit the model; eventually the combination which gives the highest accuracy is chosen, and it tells me which parameters I should take. You can see the output here, the best hyperparameters: it says the learning rate should be 0.1 instead of 0.05, and the max depth can be a bit shallower, but it's very close. And if I fit those parameters and keep all the other parameters the same, I get very similar results: again about 50% to 52% for the train and test data, which is good.
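For reference, here is a condensed sketch of both workflows, the three-way split with the final parameters and the two-way split with the grid search. The parameter values follow what I quoted in the talk; the variable names, the seed and the exact grid are only illustrative:

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Three-way split: 50% train, 25% validation, 25% test
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

    # Final model: stochastic gradient boosted regression trees
    gbr = GradientBoostingRegressor(
        n_estimators=500,      # number of iterations, chosen via early stopping on the validation set
        learning_rate=0.05,    # shrinkage
        max_depth=6,
        subsample=0.75,        # < 1.0 makes the boosting stochastic
        loss="squared_error",  # least squares ("ls" in older scikit-learn versions)
        random_state=42,
    )
    gbr.fit(X_train, y_train)
    print("train R^2:", gbr.score(X_train, y_train))
    print("test  R^2:", gbr.score(X_test, y_test))
    print("test  MSE:", mean_squared_error(y_test, gbr.predict(X_test)))

    # Two-way split version: grid search over the two regularization parameters only
    param_grid = {"learning_rate": [0.01, 0.05, 0.1], "max_depth": [5, 6, 7]}
    search = GridSearchCV(
        GradientBoostingRegressor(n_estimators=500, subsample=0.75, random_state=42),
        param_grid,
        cv=5,
    )
    search.fit(X_train, y_train)
    print("best hyperparameters:", search.best_params_)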
So again we see the same graph, which is good. It means I have the same algorithm but applied it to different types of data partitioning: one time I did a three-way split with early stopping to find the number of iterations, and the other time I split the data into two parts and used grid search to find the best parameters. I change the parameters and, I think I have to finish, my model still gives similar results, which is good; it means my model is robust. I think I will stop here because the machine is not doing very well, so thank you very much for your attention. Do you have questions?

Did you compare the results with any random forests? Yes. I ran some results with a normal random forest, and the mean squared error there is slightly higher; I didn't show it here because I was a bit stressed with all of this, but I also did a normal least-squares regression, and again that model is less accurate. So yes, I did compare.

And do you have the data or the notebook available? Yes, it's on my GitHub; you can see my surname over there. Thank you. Another question? No? Okay, maybe. So you said there was a link between the temperature and the fish? Yes. Does that help you then get another grant to do more research? Is that kind of the aim of this? Yes, basically this was a kind of preliminary study. Because it's time series data covering 60 years, which is very unique in terms of data collection, we wanted to find out which variables are most important, so that we can reduce our data set to the most relevant ones, and then I can take those and do some kind of multivariate time series analysis. No one else? Okay, thank you. I'm sorry for the quality of... Is that your fault? I know, it's the computer. Always blame the machines.