In this video, I would like to introduce the paper I wrote together with Markus Jaeger, Stephan Krügel, Jochen Papenbrock, and Peter Schwendner, and in particular how we used the R language for the research.

First, let me say a few words about the project where this research started. It is called FinNet, a Marie Curie fellowship that allows you to carry out your research project inside a company. It started in Florence, and the project is now moving to New York. Before that, I was a researcher at the machine learning and optimization lab of a research institute in Romania. If you have any questions about this research, you can write to me in the comments.

In the paper, we propose a pipeline for testing heuristic strategies that goes beyond the usual way of testing strategies in finance. In particular, we wanted to exploit machine learning explanations to understand these strategies better. The strategies we studied are risk-based strategies, and we focused on two of them. The first is Equal Risk Contribution, a more traditional strategy that distributes risk equally among the assets. We compared it with Hierarchical Risk Parity, a machine-learning-based strategy that uses hierarchical clustering to distribute the asset allocation. On top of both, we also apply a volatility target.

How do we usually study strategies in finance? You backtest them: you take historical data and observe how the strategies behave on that data. In our case, we built a backtest engine in R that rebalances the portfolio monthly, applies the strategy, and records the performance. The portfolio universe is quite interesting: 18 futures forming a very long multivariate time series that covers 20 years and represents different asset classes. Backtesting the strategies on this portfolio gives the outcomes shown here: the standard measures used in finance, reported on an annualized basis over the 20 years.

Clearly, you want to know more about the behavior of these strategies; you do not want to test them only on one particular past. To do this, there is a widely used technique in finance called the block bootstrap. In practice, it is a data augmentation: from one time series you can construct as many time series as you like. How does it work? You take your multivariate time series, extract blocks of fixed length, sample them randomly, and compose new time series. We did this 100,000 times, creating 100,000 CSV files. In the empirical dataset, the original time series has a highly volatile period corresponding to the 2008 crisis; in the random bootstrap time series shown here, that period appears twice, so the second portfolio is more volatile. In this way you can study situations in which, for example, volatility is higher or lasts longer.

So how did we backtest all these time series? We used our backtest engine, which takes one CSV file and writes out the corresponding performance. To scale to 100,000 bootstrap files, we decided to assign a single CPU to every R process: each R process reads one CSV, produces its output, and uses only one CPU. In this way we could analyze all the bootstrap files, and to scale up we used AWS compute instances, in particular very large instances such as c5 with 96 CPUs. And we did not use just one of these large instances.
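To make the block bootstrap and the one-CPU-per-process backtesting concrete, here is a minimal R sketch, not the actual engine from the paper: run_backtest stands in for the monthly-rebalancing backtest engine, and the file names, the number of bootstrap samples, and the block length are placeholders I am assuming for illustration.

    library(parallel)

    # Block bootstrap: sample fixed-length blocks of rows from the original
    # multivariate return series and concatenate them into a new series.
    block_bootstrap <- function(returns, block_len = 21) {
      n        <- nrow(returns)
      n_blocks <- ceiling(n / block_len)
      starts   <- sample(seq_len(n - block_len + 1), n_blocks, replace = TRUE)
      rows     <- unlist(lapply(starts, function(s) s:(s + block_len - 1)))
      returns[rows[seq_len(n)], , drop = FALSE]
    }

    # Generate the bootstrap CSV files (100,000 in the paper; 100 here).
    returns <- read.csv("futures_returns.csv")   # 18 futures, ~20 years of returns
    dir.create("bootstrap", showWarnings = FALSE)
    for (i in seq_len(100)) {
      write.csv(block_bootstrap(returns),
                sprintf("bootstrap/boot_%06d.csv", i), row.names = FALSE)
    }

    # Backtest every bootstrap file, one R worker per CPU: each worker reads
    # one CSV, runs the strategies, and returns its performance record.
    files   <- list.files("bootstrap", full.names = TRUE)
    results <- mclapply(files,
                        function(f) run_backtest(read.csv(f)),  # placeholder engine
                        mc.cores = detectCores())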
We used 15 of them. So how did we distribute the work among the 15 instances? One option was to use a shared file system across all the instances, where all the bootstrap data would sit and where all the performances could be written. However, at the start of the run you have about 1,400 processes all reading CSV files at the same time, and this created a big bottleneck on the network bandwidth. A better strategy was to replicate the dataset on the 15 instances, each on its own local solid-state storage: every instance reads locally and writes locally to the EBS disk assigned to it. For every bootstrap CSV we recorded the performances and some features we engineered for that particular portfolio universe that could be interesting for the analysis. At the end, when all the processes have finished, we gather all the samples on a single machine and proceed with the analysis; all the following analysis is done on one instance.

The outcome is the following. Here we can see the difference in performance between Equal Risk Contribution and Hierarchical Risk Parity. In general, Hierarchical Risk Parity does better: it has lower volatility and slightly higher returns, and for more complex measures such as the Calmar ratio it is also slightly better. But this is just a statistical analysis. We want to go beyond that and see what drives these outcomes.

We focus on one measure, the Calmar ratio. The Calmar ratio is the ratio between the annualized return and the maximum drawdown, so it is a quite nonlinear measure. How do we study it? We have the performances on all the bootstrap data together with all the features constructed on top of those data. So we train an XGBoost model that maps these features to the difference between the Calmar ratio of Hierarchical Risk Parity and the Calmar ratio of Equal Risk Contribution. In this way, we can exploit machine learning explanations to understand the role of each feature.

What are machine learning explanations? It is a modern and quite interesting topic, and I will not go into the details. One quantity is quite promising: the Shapley value, which quantifies the contribution of a single feature to the prediction of the model. You can look at the references here or in the paper. If you take these values and average their absolute values, you get what is called variable importance: the variables that contribute the most to the outcome. Here we can see that traditional measures, for example the mean return of the equal-weight portfolio, are the most important for discriminating between Equal Risk Contribution and Hierarchical Risk Parity. We also introduced non-trivial, non-standard quantities, like a cluster coefficient that quantifies how clustered the correlation structure of the portfolio is; this is relevant in particular because Hierarchical Risk Parity uses hierarchical clustering. But we can go into more detail. Here we use the SHAPforxgboost library, and you can go beyond the overall contribution, break it down per feature, and see, for example, that a higher return pushes the model to reduce the difference in the Calmar ratio.
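As a rough illustration of this step, here is a minimal R sketch under assumed placeholders: a data frame features with one row per bootstrap sample, and vectors calmar_hrp and calmar_erc with the two strategies' Calmar ratios (in practice these could come from a function such as PerformanceAnalytics::CalmarRatio). It uses xgboost's predcontrib option, which returns SHAP-style contributions; the paper's plots are produced with SHAPforxgboost, which builds on the same values.

    library(xgboost)

    # One row per bootstrap sample: engineered features of the simulated universe
    # (mean return, volatility, average correlation, cluster coefficient, ...).
    # Column names here are illustrative, not the paper's exact feature set.
    X <- as.matrix(features)
    y <- calmar_hrp - calmar_erc          # target: Calmar(HRP) - Calmar(ERC)

    dtrain <- xgb.DMatrix(X, label = y)
    model  <- xgb.train(params = list(objective = "reg:squarederror",
                                      max_depth = 4, eta = 0.05),
                        data = dtrain, nrounds = 500)

    # SHAP contributions: one value per feature and observation
    # (the last column is the bias term).
    shap <- predict(model, X, predcontrib = TRUE)

    # Variable importance = mean absolute SHAP value per feature.
    importance <- sort(colMeans(abs(shap[, -ncol(shap)])), decreasing = TRUE)
    print(importance)

    # Dependence-style view for one feature: its value against its SHAP
    # contribution. Negative contributions push the prediction towards
    # Equal Risk Contribution. "mean_correlation" is an illustrative name.
    f <- "mean_correlation"
    plot(X[, f], shap[, f], xlab = f,
         ylab = "SHAP contribution to Calmar(HRP) - Calmar(ERC)")
    abline(h = 0, lty = 2)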
So in some sense this quantity pushes the outcome towards Equal Risk Contribution, and so on. You can also look, for example, at the correlation: when the correlation among the time series is higher, it looks like Equal Risk Contribution has better performance, or at least the effect goes in that direction. In this way we can inspect and understand which features of the portfolio dataset really matter for these strategies.

To summarize what we found: we used machine learning explanations to go beyond the single backtest of strategies that is usually performed in finance, and we constructed a pipeline that links the performance to the properties of the portfolio universe. In this way, we can better understand the situations in which a strategy behaves in a certain way and the situations that are better for other strategies. Thank you for your attention.