Hi, my name is Anna, and I'm going to talk to you about CatBoost today. CatBoost is a gradient boosting library, and it's open source. A few days ago, on the 18th of July, we had our first birthday, so we are one year in open source. One day before that, we reached 3,000 stars, and I was really happy about that. The project is growing: we have made many releases, we are working on it actively, and many people are starting to use it. First of all, let's start with the problems we are trying to solve with this library. Gradient boosting is a machine learning technique that works well for heterogeneous data. There is homogeneous data, like images, sound, video, and text; for that kind of data you need neural networks to get a good result. And there is structured data. For example, in credit scoring, predicting whether a person will repay a loan, you have a table where each row is a person and each column is a feature, and those features do not have much internal structure between them. In an image, two neighboring pixels have a lot of internal structure; here you do not have that. For this type of data, gradient boosting usually gives the best results. The next thing is that it's very easy to use. You can use a gradient boosting model as a black box: you give it your data, it trains a model and gives you a good result. You cannot do that with neural networks, because you really need to be an expert to build a good architecture. It also works well if you do not have a lot of data, and that happens often in real life: you do not have huge amounts of labeled data, but gradient boosting will still give you a good result. For this whole set of reasons, it's used in production in many companies.
It can be used in finance for credit scoring, in recommendation systems for finding the songs a person will like, or for sales prediction. It is also used heavily on Kaggle: there are many machine learning competitions there, and for this type of data the winning solutions are in many cases based on gradient boosting. Now, neural networks are very powerful, and it would be really cool to use them together with gradient boosting, and that is something we do at Yandex. Yandex is a very large Russian company: we do search, we have taxi, we have self-driving cars, we have many different technologies. It's like the Silicon Valley of Russia. For many tasks we use neural networks and gradient boosting together. For example, if you have a query and you want to select images to show to the person, you first compute neural features on the images, then you combine these neural features with other knowledge you have about each image, for example about the site the image was on, and then you give these features to gradient boosting. So it is a good idea to combine these methods; they are not contradictory, and they work very well together. Gradient boosting is an iterative algorithm, and it usually builds decision trees. First it builds one decision tree, after which the training error is still large; then it builds another decision tree so that the training error goes down, and it repeats this hundreds or thousands of times until the training error is very small and the model can capture complicated dependencies in your data. That is gradient boosting. Now about CatBoost. The main reason you should be interested in CatBoost is on the slide: it's the quality comparison. There are several gradient boosting libraries in open source; the main competitors are LightGBM and XGBoost, and there is also H2O.
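To make the iterative procedure I just described concrete, here is a minimal sketch in pure Python. It uses toy one-dimensional data and one-level decision stumps instead of full trees, so it illustrates the idea of fitting each new learner to the remaining training error; it is not how any of these libraries is actually implemented.

```python
# Minimal gradient-boosting sketch for squared error on toy 1-D data.
# Each stump is fit to the residuals left by the ensemble so far.

def fit_stump(xs, residuals):
    """Find the threshold split minimizing squared error of the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=100, lr=0.1):
    """Iteratively add stumps, each trained on the current residuals."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.8, 4.1, 4.0]
model = boost(xs, ys)
```

After a hundred rounds the training error is tiny, which is exactly the behavior described above: many small trees, each reducing the error a bit further.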
And as you can see on the slide, CatBoost wins against all these libraries on a set of publicly available datasets, by different amounts, sometimes by really a lot, like on Amazon. And that is the comparison after parameter tuning: you have the algorithm and you need to tune its parameters to get the best quality. We also have a comparison without parameter tuning, and CatBoost with no parameter tuning beats all the other algorithms with parameter tuning on these datasets in all cases except one, where LightGBM outperforms CatBoost by a little. So that is the quality comparison, and it's a good reason to try the library. Now I will dive into the differences between CatBoost and other libraries. The first difference is the kind of trees CatBoost builds. Different algorithms build different kinds of trees. LightGBM builds trees node by node and can get very deep, non-symmetric trees. XGBoost builds trees layer by layer; its trees cannot get very deep, but they are not symmetric either. CatBoost builds symmetric trees. You can see an example of a symmetric tree on the slide, and that is not an error on the second level: the feature is the same, it's the weight, in all the nodes on that layer. On the next layer, if the tree were deeper, there would be four nodes with the same feature. We observe that this type of tree helps a lot with hyperparameters: with symmetric trees, the resulting quality does not change much when you change the hyperparameters, so the algorithm is stable to hyperparameter changes, and because of that it gives very good results from the first run. You don't really need to put a lot of effort into parameter tuning. You just make sure the algorithm has converged, that you had enough iterations, and then you take the first model the algorithm gives you.
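To illustrate what a symmetric (oblivious) tree looks like: every node on a level shares one feature and threshold, so a tree of depth d is just d comparisons whose results form the bits of a leaf index. The features, thresholds, and leaf values below are made up for illustration.

```python
# Sketch of a symmetric (oblivious) tree: every node on a level tests the
# same feature/threshold pair, so evaluation is d comparisons plus bit ops.
# Note the same feature ("weight") can appear on more than one level.

levels = [("age", 30.0), ("weight", 70.0), ("weight", 80.0)]  # one test per level
leaf_values = [0.1, 0.4, -0.2, 0.3, 0.0, 0.7, 0.5, 0.9]       # 2**depth leaves

def predict(obj):
    index = 0
    for feature, threshold in levels:
        # each level contributes one bit, independent of the path so far
        index = (index << 1) | (obj[feature] > threshold)
    return leaf_values[index]

print(predict({"age": 25.0, "weight": 90.0}))  # bits 0,1,1 -> leaf 3 -> 0.3
```

Because the leaf index is computed with a handful of comparisons and bit shifts, with no per-node branching on different features, this structure is also very friendly to fast batch prediction.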
That is about hyperparameters, and the next thing is prediction. With this type of tree, prediction can be done very fast; I will talk about that later. So that is the first difference. The second difference is the type of data we are able to work with. There is numerical data, like height or weight, and it is clear how to work with numerical features if you are doing gradient boosting on decision trees: you put a feature into the tree, and if the value is less than some threshold you go to one side, and if it's greater you go to the other side. It is not that obvious how to use categorical features in decision trees. A categorical feature is a feature with a discrete set of values that are not necessarily comparable with each other by greater or less. An example would be occupation, and there are also high-cardinality categorical features, like user ID. High-cardinality categorical features are the hardest ones to work with in an optimal way. So what do we do with categorical features? The first thing we do is very simple: one-hot encoding, and that is something other libraries also do. What is it? Instead of one categorical feature, say occupation with values manager, cook, and engineer, you get three binary features: is the person a manager, is the person a cook, is the person an engineer. So instead of one categorical feature you have many binary features. You could do this during preprocessing, but then your dataset grows very large and the training time grows by a lot. The good way is to let the algorithm do one-hot encoding for you: you just say this is a categorical feature, please one-hot encode it.
And the algorithm does it for you. It will be better in terms of speed, and it will also be better in terms of quality; there are details of the algorithm that allow for that, but I don't have time to explain everything. So the first thing is one-hot encoding, and other libraries also do that. Then we have a whole set of more sophisticated things we do with categorical features, and these give a very large boost in quality. One-hot encoding we do for features with a small number of values. For high-cardinality features we do the following: we calculate statistics based on the label values of the objects with a given category value. The simplest thing you could do is this. Say you have the dataset on the slide, with a categorical feature, occupation, with two possible values: software development engineer (SDE) and PR. Instead of this one categorical feature, we introduce a new numerical feature equal to the average label value of all objects with that category value. Instead of SDE we will have three divided by four: there are three ones and one zero, so the average label value is 3/4. This is called target encoding. This could work, but the problem is that it doesn't, because it leads to overfitting, because it leads to target leakage. Here is an example where you can see that. Say you have a single object with some category value, say only one SDE in the dataset, and this SDE has label one. Then your new numerical feature value will be exactly equal to your label value. During training, the algorithm remembers that it has a very good feature that is equal to the target, and it makes all its decisions based on that. But during prediction you will not have this magic feature that is equal to the label. Because of that, you should not do this.
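The leakage is easy to see in a tiny sketch. With a hypothetical singleton category ("lawyer" below, added just for illustration), the naively encoded feature is exactly that object's label:

```python
# Sketch of naive target encoding and why it leaks: with a single "lawyer"
# row, the encoded feature equals that row's own label. Toy data.

rows = [
    ("SDE", 1), ("SDE", 1), ("SDE", 0), ("SDE", 1),
    ("PR", 0), ("PR", 1),
    ("lawyer", 1),           # only one object with this category value
]

def naive_target_encoding(rows):
    sums, counts = {}, {}
    for cat, label in rows:
        sums[cat] = sums.get(cat, 0) + label
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

enc = naive_target_encoding(rows)
print(enc["SDE"])     # 3/4, as in the talk
print(enc["lawyer"])  # 1.0, equal to the label: the model can "cheat"
```

During training this encoded column is a perfect predictor for the singleton category, but at prediction time no such magic feature exists, which is exactly the overfitting described above.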
So what do we do instead? We take a random permutation of all the data. Now the data is permuted, and when you look at the i-th object with some category value, you calculate the same average, but not including this object: you only look at the objects before it in the permutation. So for this object the new feature value will be two divided by three, because there are three objects with category value SDE before this one, and two of them have label one. What else can we do? We can use priors. For the first object there are no objects before it in the permutation, so you would get zero divided by zero. To avoid this, we introduce priors: we add a prior to the numerator and the denominator, as you can see in the formula. And it gives a boost in quality to try different priors and find out which prior is a good one for each particular feature. So we calculate those averages and enumerate different priors. What else could you do? You could try different random permutations. But you cannot use two random permutations to train one model, because that would lead to target leakage in the same way as averaging over the whole dataset. What you can do, and what we are doing, is train several models simultaneously. We train four models simultaneously, and on each iteration, when we are selecting the tree structure, we flip a coin and select one of those models. Each model has its own permutation, so we select one of these models with its permutation and use it to select the tree structure. Then we give this tree structure to all four models, and we calculate the leaf values based on one more permutation. This gives a good boost in quality, and the important thing is that you cannot do this during preprocessing; it is something you can only do inside the library.
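Here is a sketch of these ordered statistics on the same kind of toy data. The prior handling below, adding the prior to the numerator and one to the denominator, is a simplified illustration, not CatBoost's exact internal formula:

```python
import random

# Sketch of ordered target statistics: encode each object using only the
# objects that come before it in a random permutation, plus a prior, so an
# object's own label never leaks into its encoded feature value.

rows = [("SDE", 1), ("SDE", 1), ("SDE", 0), ("SDE", 1), ("PR", 0), ("PR", 1)]
prior = 0.5  # illustrative prior value

random.seed(0)
perm = list(range(len(rows)))
random.shuffle(perm)

sums, counts = {}, {}
encoded = {}
for i in perm:
    cat, label = rows[i]
    # statistics come from earlier objects only; this object's label is unseen
    encoded[i] = (sums.get(cat, 0) + prior) / (counts.get(cat, 0) + 1)
    sums[cat] = sums.get(cat, 0) + label
    counts[cat] = counts.get(cat, 0) + 1

print(encoded)
```

Note how the first object of each category gets the pure prior, and later objects get progressively better-informed averages; a different permutation yields a different (but equally leak-free) encoding, which is why several models with their own permutations are trained.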
The next thing you can do is look at feature combinations. What are categorical feature combinations? Say you have two categorical features, pet and color; a new categorical feature that is the combination of the two will have values like blue cat, black cat, blue dog, black dog. So it's a new categorical feature built from existing ones. The problem is that if you have several categorical features, the number of possible combinations grows exponentially with the number of features, so you cannot calculate statistics for every combination. Inside the algorithm we enumerate combinations in a greedy fashion: we do not enumerate all of them, but we try to pick out the best of them. When we are building the next tree, on the first level we try only combinations of size one, and on the next level we try combinations of size two, by adding features to the one we have already selected. We also calculate other statistics, like the frequency of the category in the dataset; that also helps. That is the big thing about categorical features. The next thing we do that gives a boost in quality is called ordered boosting. Classical boosting is prone to overfitting, which means the resulting model will lose quality. That is because when you are building a tree, each leaf value is an estimate of the gradient over all the objects that fall into that leaf, and in classical boosting this estimate is biased, because you make it on the same objects you built the model on. This is easier to see with the error: if you estimate the error in a leaf on the same objects you built the model on, the error will come out smaller than it really is, so it is biased. The same thing happens with gradients.
To overcome this problem we use the same idea we used for categorical features: random permutations. You take a random permutation, and when you are building the tree structure, for each object you make the estimates based on a model that has never seen this object, that is, based only on the objects before it in the permutation. This gives a boost in quality if you have a small or noisy dataset, in cases where you know there might be overfitting; there it really helps. So I have told you about the main algorithmic things in the library; now let me tell you about the modes the algorithm works in. There are three main modes: classification, regression, and ranking. These three modes exist in all gradient boosting libraries. The first is classification; there is binary classification and multiclassification. A binary classification problem is, for example, predicting whether a person will repay a loan: in your training dataset you have labels, one if the person repaid the loan and zero if not, or you might have probabilities there. Multiclassification is when you have more than two possible answers. For example, if you want to predict tomorrow's weather, say the type of clouds, and there are six or nine possible types, you can use multiclassification. Regression is when you want to predict a numerical value, for example taxi ride duration or an exchange rate. Those are regression problems. And there is also ranking, which is a little more tricky. An example of a ranking problem would be: for this particular city, give me the top N hotels. Let's say your input data has ratings.
So for each hotel you have a rating, except for some hotels where you don't; you need to predict those ratings first and then rank the hotels and select the top N. How would you solve this problem? One way would be regression: you really try to predict a rating for each hotel, then sort the hotels by rating and select the top N. But you don't need to do that. Say in city A all the hotels are really good, and in city B all the hotels are really bad. With regression you are forcing your algorithm to learn that every hotel in one city is worse than every hotel in the other, and that is not cheap, and you don't need it to find the top N here and the top N there. You don't need to compare hotels across cities, so you don't need to learn the real rating. Because of that, what you do is group the objects, here by city, and try to rank objects only inside each group. That is ranking. We use ranking a lot at Yandex because we have search, ads, recommendations for music and video; we have very many places where we need ranking. Because of that, we have many very powerful ranking modes which XGBoost and LightGBM do not have. The first kind is plain ranking, for the case when you have something like ratings in your dataset, or relevance: you have a search query and documents, and for each document an assessor writes a number, the relevance of this document. We have two modes for this, YetiRank and YetiRankPairwise, and the difference between them is that the first one is really fast and the second is really powerful but slow. In most cases we use YetiRankPairwise. The next mode is pairwise ranking, and that is the mode you use if you do not have any ratings.
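The grouping idea from the hotels example can be sketched like this: a model's scores only need to order hotels within the same city, so the top-N selection happens per group, never across groups. The cities, hotel names, and scores below are made up:

```python
# Sketch of group-wise ranking: select top-N per group, so scores are only
# ever compared within a group (here, within a city). Toy data.

hotels = [
    ("A", "Grand", 9.1), ("A", "Plaza", 8.7), ("A", "Royal", 9.5),
    ("B", "Budget", 3.2), ("B", "Hostel", 2.1), ("B", "Inn", 4.0),
]

def top_n_per_group(items, n=2):
    groups = {}
    for city, name, score in items:
        groups.setdefault(city, []).append((score, name))
    # sort and truncate inside each group independently
    return {city: [name for _, name in sorted(g, reverse=True)[:n]]
            for city, g in groups.items()}

print(top_n_per_group(hotels))
```

City B's best hotel scores far below city A's worst, yet it still appears in B's top-N, which is exactly why the model does not need to learn globally comparable ratings.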
You only have pairs of objects, and for each pair you know that the first object is better than the second, the third is better than the fourth, and so on; pairs are the only input. We have two modes for pairwise ranking as well, and the difference between them is the same. We also have three other modes. One is a mix between ranking and classification, which can be useful if you want to do ranking but have zero/one labels as input. Another is ranking plus regression, and one more is specific to the task of selecting the single best candidate: that is also a ranking task, but for the case when you are not interested in the top N, only in the top one. So we have this whole set of ranking modes, and if you are interested in ranking I strongly recommend you try them. They work really well, and they are used in production in different services at Yandex. Now, we have talked about the algorithm, how it works and which modes there are; next is speed. The important things are CPU training speed, GPU training speed, and prediction speed. On CPU, when we first released CatBoost it was really slow, and everyone told us that, so we worked a lot on speedups, and currently the situation is the following: on most datasets we will be two to three times slower than LightGBM, which is the fastest, and compared with XGBoost we might be the same or up to two times slower, something like that. So the difference is not that big, but we are still a bit slower than the other libraries. On GPU the situation is completely different: we are very, very fast. We are about 20 times faster than XGBoost and two to three times faster than LightGBM, and the important thing is that CatBoost on GPU is super easy to use, as opposed to LightGBM. You just pip install the library, there is a flag task_type=GPU, and you use it. So it's really fast and really easy to use.
An important thing about GPU is that the speedup grows with the amount of data: the more data you have, the bigger the speedup. For very large data, millions of objects, the speedup can reach 40 to 50 times, and even on an older GPU it will be about four times. About prediction time: we care about that too, and CatBoost prediction is 30 to 60 times faster than XGBoost and LightGBM. Not everyone cares about prediction speed, but we do, and we are proud to be that fast. I also wanted to mention a few other things, namely how to explore your model. If you want to understand what your model is doing, you can look at feature importances, that is, which features are the most important ones; you can look at feature interactions, which pairs of features work well together; and there are per-object feature importances: for a given object, which features mattered most. For that we use SHAP values, and there is the shap library that provides visualization for them. Each feature has an importance, which might be positive or negative: because of this feature the predicted cost grows, because of that one it goes down, and so on. There are different plots you can look at to understand more about your features. There is also a way to find the most important objects: say you have an object and you want to see which objects in the training dataset most influenced its result; there are influential documents for that. And there is also a way to check whether a feature is statistically significant; for that we have feature evaluation. We have tutorials for all of this, and I recommend you look into them. There is also a bunch of other functionality in the library besides training.
There is a lot of visualization: you can watch how the error changes during training in a Jupyter Notebook, in the standalone CatBoost viewer, or in TensorBoard. You can also train on a dataset with missing values. You can use cross-validation, and we have visualization for cross-validation as well, also running inside Jupyter Notebook. So there is a bunch of stuff to try. And I just want to mention a few important parameters: if you want to tune the algorithm to get the best quality, these are the parameters to tune, and we have tutorials and documentation on how to tune parameters for quality. We also just published a tutorial on how to change parameters if you want to get the most speed; I encourage you to look into that documentation. And here are a few links; the last one is the GitHub CatBoost tutorials repository. We have also published a tutorial with homework, so if you want to try gradient boosting you can go through it, do the tasks, and answer the questions. And we have tutorials for all the functionality I told you about, so if you run through them you will learn everything. With that, I am ready to answer questions. Questions from the audience? Can I also ask a question: how many of you had heard of or used CatBoost before this talk? Okay, that's about half. Hi, thanks very much for the talk. You mentioned at the start that it's possible to combine neural networks and gradient boosting; can you give us some example applications? First of all, we have a tutorial on how to use neural networks on text together with gradient boosting. The idea is the following: you train a separate neural network, for example one that compares an image with a text, and from that you get numerical features, like distances, and those distances you combine with other features.
You have the image and a lot of other information about it, not only what you see in the image: information from the site, how many people clicked it, and so on. You combine all of that together and give it to gradient boosting. Thank you. I have a question about the categorical features you mentioned: how do you use those statistics at prediction time, since you don't know the true label? I'm not sure I understood the question. So, how do we use those statistics during prediction? What we do is read the training dataset, and for each category we have seen, we compute a value based on the whole training dataset. We write this value into a hash table, and when predicting, conceptually you append the test object to the end of your training dataset. The feature value for this object is then the average over all objects before it, which means over the whole training dataset, and that is the value saved in the hash table. So basically you use the average value over the training dataset. Yes, exactly. Thanks a lot for the talk and for the library. I was just wondering, why would you use scikit-learn anymore if you have something like this? Well, scikit-learn has a lot of stuff, including gradient boosting. CatBoost, LightGBM, and XGBoost all work better than scikit-learn's gradient boosting, so if you want gradient boosting you are better off with a different library than scikit-learn. But there is a bunch of stuff in scikit-learn that is very useful. Yeah, that's true. But for anything like classification and regression you would only use CatBoost at Yandex, right? We use CatBoost for many, many different tasks; I don't think we train any scikit-learn classifiers for production purposes.
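The prediction-time lookup described in that answer can be sketched like this; the prior handling is again a simplified illustration rather than CatBoost's exact formula:

```python
# Sketch of prediction-time category statistics: for each category seen in
# training, store the full-training-set average in a hash table, then look
# it up for test objects, as if the test object were appended after all of
# the training data. Toy data; prior handling is illustrative.

train = [("SDE", 1), ("SDE", 1), ("SDE", 0), ("SDE", 1), ("PR", 0), ("PR", 1)]
prior = 0.5

stats = {}
for cat, label in train:
    s, c = stats.get(cat, (0, 0))
    stats[cat] = (s + label, c + 1)

# value for a test object = average over ALL training objects with its category
table = {cat: (s + prior) / (c + 1) for cat, (s, c) in stats.items()}

def encode(cat):
    # a category never seen in training falls back to the pure prior
    return table.get(cat, prior / 1)

print(encode("SDE"))  # (3 + 0.5) / (4 + 1) = 0.7
```

So the per-object ordered statistics are only needed during training; at prediction time a single precomputed hash table lookup per categorical feature suffices.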
But we don't only use gradient boosting; there are other useful algorithms too. There are neural networks, linear models, nearest neighbors, and we use all of them. Gradient boosting is simply a very important algorithm. I was just wondering, do you have any tools within the library to extract the prediction path of a particular instance, similar to the tree interpreter for random forests? Could you repeat, please? When you make a prediction for a given instance, do you have tools to extract the path this particular instance took through the trees? We are currently working on exporting the model as JSON, and that will be the way to inspect the paths. I have a quick question: how well do you integrate with scikit-learn? I've been using XGBoost for ranking problems, and it's insanely hard to go beyond the basic XGBoost interface if you want cross-validation or anything with ranking. So how well do you integrate with the rest of the ecosystem, or do I have to use CatBoost all the way? We are trying to integrate as well as we can. We are very much compatible with XGBoost, so if you are already using XGBoost it will not be that hard to switch. We do not support all the methods in scikit-learn; we know about one of them that fails, and we plan to fix it soon, and if you see anything that doesn't work, just open an issue on GitHub and we will fix it. Thank you for your talk. Do you have any experience applying CatBoost to natural language processing use cases or datasets?
Yes. For natural language processing we usually use neural networks, and sometimes we use gradient boosting on top of what the neural networks produce. For example, we have a dialogue assistant that generates answers: you ask something, a neural generator produces many candidate answers, and on top of that there is a reranker based on gradient boosting that extracts features from each answer and reranks them. But the core is usually based on neural networks. Thank you for the talk. In production we use neural networks and LightGBM; a couple of months ago we tried CatBoost. Basically, in our use case we create word embeddings and some other features, do one-hot encoding, and pass all that to LightGBM, and we spent a lot of time fine-tuning LightGBM. But when we tried to substitute CatBoost, the results weren't better. So how easy is it to fine-tune CatBoost? Do we need to spend two months on search again?
There is a set of parameters you can change to try to improve quality, and those parameters are listed here on the slide, so you can try to fine-tune them. Two of them, learning rate and the number of iterations, you don't really need to fine-tune; you just need to find the point of convergence. The other ones you probably do have to tune. With depth, the situation is the following: you don't need to enumerate all depths. You try depth six, which is the default, and for some datasets a bigger depth is important, so you try six, you try ten, and if ten is better, you fine-tune between eight and nine, something like that. Those are the things to fine-tune. One more thing about the one-hot encoding you mentioned: it can lead to slow training for CatBoost, so if you can avoid doing it during preprocessing and let the algorithm do it for you, that will probably be better. And one more thing about word embeddings: it usually does not work well to give an embedding to gradient boosting raw, as 300 numerical features; it's usually better to compute some distances from it instead. I don't do ML as part of my day job, but I've used CatBoost as a key ingredient in many machine learning contests with great success, so thank you for that. My question is related to the previous one: do you have any tips on hyperparameter optimization? The usual recommendation is grid search, but I usually use something a little more intelligent when optimizing hyperparameters for neural nets and things like that. Anything along those lines for CatBoost?
Yeah, there is hyperopt, which probably works better; I would recommend that library. I incidentally use hyperopt for neural nets; is that what you recommend for parameter tuning on CatBoost? It's not specific to CatBoost, it's quite generic. Thank you. Thanks for a nice talk and for open-sourcing the library. I have two quick questions. One of the bullet points you showed was that the library handles missing values, which essentially means you can feed in the NAs without imputation, which is very useful. Yes. And the other question: do you support sparse matrices? That is one thing we do not support yet, and we are currently working on it. The plan for the next few releases is the following: we will be adding sparse matrix support; we are adding distributed training on Spark, we are also working on that; we are adding multiclassification on GPU, which will come really soon; and we are adding the JSON model and improvements in our package. That is the plan for the next releases. But sparse data is a very large thing to do, so it's not ready yet; we are working on it. I found the quantile regression option in gradient boosting quite useful for my applications; do you have something similar in CatBoost which could provide prediction intervals? Yes, we do have quantile regression, so you could derive prediction intervals based on that, but we do not provide prediction intervals themselves: there is a loss function for quantile regression, but there are no built-in prediction intervals. We have time for a few more questions; you have the microphone, and I am very open to feature requests. Now about imbalanced data: there is always a problem with imbalanced data. We have the possibility of reweighting the objects, the scale_pos_weight parameter, the same as in XGBoost and LightGBM, and that is the only thing we do specifically for it. We know some datasets where it works really
good, like Amazon, which was on the slide, and some datasets where it does not work that well. Because your tree structures are balanced, is the algorithm very sensitive to imbalanced data? I don't think this tree structure is worse than other tree structures for imbalanced datasets, but we know there is this problem with imbalanced data, and we are trying to figure out how to fight it; other libraries have the same problem. Thanks for the talk, I have a question about those parameters there. Okay, that is a very good question, about bagging temperature. When you are selecting the tree structure, or selecting the next tree, you do some bagging. What you could do is Bernoulli sampling: select some objects and not others. But you can also do other kinds of sampling. By default, in regression and classification, we sample weights from an exponential distribution, and we want to balance between having no sampling at all and sampling from the exponential distribution or heavier; the bagging temperature controls that. If it's set to 0, all the weights are equal to 1; if it's set to 1, you sample from the exponential distribution; and in between there is this balance. Random strength is one more parameter. When we are selecting the tree structure, we try each possible split in the tree, and each split gets a score: how much this split improves the ensemble. We add to this score a normally distributed random variable, and random strength is the multiplier for this variable. This helps a lot against overfitting; it is one more surprising hack that helps to improve quality. I want to thank Anna again for this enlightening talk, and thank you very much to the audience.