Efficient computation of sparse matrix functions for large-scale electronic structure calculations as implemented in the CheSS library

Okay, thank you very much for giving me the possibility to present our work. I will talk about the CheSS library that we developed during the past months and years, in collaboration with various people also from abroad, in particular from France and from Japan.

The outline of the talk: first I will talk a little bit about the motivation and the applicability of CheSS in electronic structure codes. Then I will briefly cover the theory behind CheSS, say a few words about sparsity and truncation, and then show several performance figures, namely accuracy, scaling, parallel scaling, and finally also a comparison with other methods.

As you all know, linear algebra is a very important topic in electronic structure codes; sooner or later every code has to deal with it. You can do it in the standard way, for instance with LAPACK or ScaLAPACK, but these standard libraries can become a bottleneck, because in the worst case they scale cubically, and so your code will hit a wall. It is even worse if you have sparse matrices: if you use these standard dense tools on them, you multiply a lot of zeros and waste a lot of compute time. And sparse matrices are actually quite abundant. You will find them if you have a basis set that incorporates localization, or if your matrices have intrinsic localization properties.

To deal with this we created a standalone library for sparse linear algebra, tailored in particular to electronic structure codes. We called it CheSS, which stands for Chebyshev Sparse Solvers. I will show you later how it works; to start, I can already tell you that CheSS works very well if your matrices exhibit a very small spectral width, it is really tailored for this purpose. So the question is: do we encounter such situations in electronic structure codes? Of course it depends on the basis set that you use, but it is possible. I show you here a few systems, all of them very large and very sparse, calculated with the localized basis implemented in the BigDFT code, which was the first code coupled with CheSS. For the overlap matrix I show the condition number, and you see the basis is really very nice and leads to condition numbers only of the order of two. And if you look at the spectral width of the Hamiltonians, in electron volts, we are of the order of 30 to 40, maybe 50 electron volts. In such a setup CheSS performs extremely well, as I will show you later.

The basic idea of CheSS is not new; it was developed about 20 years ago. The idea is to expand the matrix function in terms of Chebyshev polynomials, f(A) ≈ Σ_k c_k T_k(Ã), where Ã is the matrix you want the function of, scaled and shifted, since the polynomials are only defined on the interval from -1 to +1. The formula for the expansion coefficients c_k looks complicated, but it is not; it is the standard textbook formula. The most important point, and also the heaviest one, is the calculation of the Chebyshev polynomial matrices T_k(Ã). They obey a very nice recursion relation, T_{k+1}(Ã) = 2Ã T_k(Ã) - T_{k-1}(Ã), so you just have to iteratively apply a matrix to a matrix, or actually a matrix to a vector, because if you look at this formula it turns out that each column of the result is independent of the others.
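To make this concrete, here is a minimal sketch in Python of the scheme just described. It is only an illustration, not the CheSS API, and it assumes that spectral bounds lam_min and lam_max of the matrix are already known; the coefficient routine is the standard Chebyshev-Gauss quadrature formula mentioned above.

```python
import numpy as np
from scipy import sparse

def chebyshev_coefficients(f, n_coeffs):
    """Standard textbook Chebyshev coefficients of f on [-1, 1]:
    c_k = (2/N) sum_j f(cos(theta_j)) cos(k*theta_j),
    theta_j = pi*(j + 1/2)/N, with c_0 halved. f must accept arrays."""
    theta = np.pi * (np.arange(n_coeffs) + 0.5) / n_coeffs
    fvals = f(np.cos(theta))
    c = (2.0 / n_coeffs) * np.array(
        [np.sum(fvals * np.cos(k * theta)) for k in range(n_coeffs)])
    c[0] *= 0.5
    return c

def chebyshev_matrix_function(A, f, lam_min, lam_max, n_coeffs=100):
    """Approximate f(A) for a sparse symmetric A whose spectrum lies in
    [lam_min, lam_max], via T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)."""
    n = A.shape[0]
    I = sparse.identity(n, format="csr")
    # scale and shift so the spectrum lands in [-1, 1], where the
    # Chebyshev polynomials are defined
    sigma = 2.0 / (lam_max - lam_min)
    tau = 0.5 * (lam_max + lam_min)
    A_s = sparse.csr_matrix(sigma * (A - tau * I))
    # evaluate f on the unscaled variable by undoing the map
    c = chebyshev_coefficients(lambda x: f(x / sigma + tau), n_coeffs)
    T_prev, T_curr = I, A_s
    result = c[0] * T_prev + c[1] * T_curr
    for ck in c[2:]:
        T_next = 2.0 * (A_s @ T_curr) - T_prev   # the recursion relation
        result = result + ck * T_next
        T_prev, T_curr = T_curr, T_next
    return result
```

For example, an overlap matrix S with eigenvalues known to lie in [0.5, 2.5] could be inverted with chebyshev_matrix_function(S, lambda x: 1.0 / x, 0.5, 2.5, 60). Since each column of the result depends only on the corresponding unit vector, the recursion can equally well be run on vectors, which is what makes the parallelization described next so natural.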
So this is an algorithm that you can parallelize very well: you just assign several columns to each processor, and then every processor does its work independently.

As I told you, CheSS at the moment is tailored to electronic structure codes, so we implemented the functions that you need to perform density functional theory. In particular, this is the calculation of the density matrix, where the function that you want to expand is just the Fermi function. You can also calculate the energy density matrix, which is very similar, or matrix powers x^a; this is the most general function that we have, since the exponent a can be anything, also non-integer values, so you can calculate square roots, inverse square roots, and so on. But it is very easy to generalize this, so we could also include more functions: the only thing that has to change in the implementation is the calculation of the expansion coefficients, which is very easy to do.

As I told you, CheSS works with sparse matrices, and if you have sparse matrices you have to choose a sparsity pattern. We decided to work with a fixed sparsity pattern that the user defines: at the beginning the user has to specify the sparsity pattern, and then CheSS does the calculation within this pattern. We have three patterns in CheSS. The first is the pattern of the original matrix M, which is usually defined by your setup, by your basis functions. The second is the sparsity pattern for the matrix function, which is usually a bit larger, so that the function has space to spread out. And there is a third, auxiliary pattern for the multiplications, but that is a technical detail.

Here I show you how this looks in a typical setup. We have a sparse matrix, where the gray entries are the ones set to zero. On the right hand side is the inverse of this matrix, calculated without any constraints, just as a dense calculation. You can see the matrix spreads out a little bit, but there is still a lot of white space, meaning a lot of zeros, so it is justified to do this calculation within a sparse pattern as well. And that is indeed what we did here: this is the calculation with CheSS, where gray again marks our enlarged sparsity pattern, and as you can see we really capture the non-zero entries very well. This is the difference between the two results on a different scale; the typical error is of the order of 10^-10, so really, really small.

Now let's talk a bit about the accuracy. As you can imagine, there are two factors that affect the accuracy: first the error that you introduce by the sparsity pattern, and second the error that you introduce by the Chebyshev fit. It is not really clear what the correct solution actually is. The most straightforward way would be to define the correct solution as the one obtained from a calculation without any sparsity constraints, for instance with LAPACK, which you then truncate. But the problem is that this solution usually does not fulfill the identity: if you calculate the matrix function and then apply the inverse function, you should get back the original matrix, and with this approach that is not true. So we think the better way is to define the correct solution as the one that you calculate directly within the sparsity pattern, and then ask for the fulfillment of this identity. This is how we define our correct solution, because in this way we can get rid of the bias that is introduced by the sparsity pattern.
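The following sketch illustrates both ideas, the fixed patterns and the identity-based accuracy check. It is again only an illustration under my own assumptions: in particular, building the enlarged pattern from the square of the connectivity of M is a stand-in choice, not necessarily how CheSS constructs its patterns.

```python
import numpy as np
from scipy import sparse

def enlarged_pattern(M, power=2):
    """Boolean pattern with room for the function to spread out: here,
    for illustration, the pattern of M^power (neighbors of neighbors
    in the connectivity graph for power=2)."""
    P = sparse.csr_matrix((np.abs(M) > 0).astype(np.int8))
    Q = P
    for _ in range(power - 1):
        Q = Q @ P
    return Q > 0

def truncate_to_pattern(X, pattern):
    """Discard all entries of X outside the fixed boolean pattern."""
    return sparse.csr_matrix(X).multiply(pattern)

def identity_residual(F, F_inv):
    """Largest entry of F @ F_inv - I. Defining the 'correct' truncated
    solution through this identity, instead of by truncating an
    unconstrained LAPACK result, avoids the bias of the pattern."""
    n = F.shape[0]
    R = F @ F_inv - sparse.identity(n, format="csr")
    return np.abs(R).max()
```

So one would compute, say, S and its inverse within the same enlarged pattern and report identity_residual(S, S_inv) as the accuracy measure.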
One more slide about sparsity. As I told you, we work with a fixed sparsity pattern, but an alternative would be a dynamic approach, where during the calculation you set to zero all elements that fall below a given threshold. The advantage is that this is very flexible and you can control the error that you make. On the other hand, the sparsity of your matrix can decrease a lot, and then the cost of your calculation can explode. This is a test that we did with the purification method, which has such a dynamic sparsity control. You can see it starts with a sparsity of more than 98%, but then there is a very strong fill-in, and at iterations 10 and 11 the sparsity is down to only 93%. You can imagine that the cost at that point is much, much higher than at the start, and this is very hard to know beforehand. That is why we think it is better to work with a strict, predefined sparsity pattern; in this way you already know at the beginning what the cost of your calculation will be.

Now about the accuracy. I show you here two benchmarks, first for the inverse, with two errors. As I said before, the error down here is the one related to the Chebyshev fit, and you can see that it is negligible, of the order of 10^-11, so the Chebyshev fit is really, really accurate. Up here you see the error introduced by the sparsity pattern; it is a bit larger, but still of the order of 10^-7, which in practice does not matter at all. Looking at the density matrix, I calculate the energy as the trace of the density kernel times the Hamiltonian matrix, Tr(KH), and we see that the relative error is of the order of 0.01%, which in practice is also very small and negligible.

Now, thanks to the fact that we can exploit sparsity, our CheSS library scales linearly with respect to the number of non-zero elements. This is shown in this plot of the runtime to invert various matrices, from 6,000 x 6,000 up to 36,000 x 36,000. As you can see, the runtime does not depend on the total matrix size, but only on the number of non-zero elements. So whether you have a small matrix or a big matrix, the only number that actually matters is the number of non-zero elements, which shows that the library exploits sparsity really well.

I have already told you that CheSS is very sensitive to the spectral width, and this is shown here: the runtime and the polynomial degree needed to invert the matrix, for various condition numbers. On the left hand side we have a condition number of about 2, which leads to a polynomial degree of 50 and a runtime of a few seconds. But if you increase the condition number and go to the right hand side, where we have a few thousand, you can see that it explodes: the polynomial degree needed to approximate the function well is already of the order of 2,000 to 3,000, and the runtime is already of the order of 10 minutes. So if we are able to work in the first regime, this is a really efficient algorithm; if we have to work in the second, we had better use another approach.
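This growth of the polynomial degree with the condition number is easy to reproduce with a small experiment of my own; it reuses the chebyshev_coefficients sketch from above and is not a CheSS benchmark. We fit f(x) = 1/x on [1, kappa] and look for the first expansion coefficient below a tolerance, taking the trailing coefficient size as a simple proxy for the truncation error.

```python
import numpy as np

def degree_for_inverse(kappa, tol=1e-10, max_deg=5000):
    """Rough degree at which the Chebyshev expansion of 1/x on
    [1, kappa] reaches the tolerance: index of the first coefficient
    whose magnitude drops below tol."""
    # map t in [-1, 1] onto x in [1, kappa]
    f = lambda t: 1.0 / (0.5 * (kappa - 1.0) * (t + 1.0) + 1.0)
    c = chebyshev_coefficients(f, max_deg)
    below = np.nonzero(np.abs(c) < tol)[0]
    return int(below[0]) if below.size else max_deg

for kappa in (2, 10, 100, 1000):
    print(kappa, degree_for_inverse(kappa))
```

The required degree grows roughly like the square root of the condition number, which matches the trend in the talk: harmless for condition numbers around 2, painful in the thousands.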
The same behavior is also visible for the calculation of the density matrix: for a spectral width of 150 electron volts we need more than 2,000 polynomials, whereas with a spectral width of 50 eV we can bring it down to the order of a few hundred. For the density matrix another important point is the gap, the HOMO-LUMO gap. If you have a small gap, or a gapless system, the Fermi function is more or less a step function, which is very hard to approximate with a polynomial, so the cost becomes very high. The bottom line is that this library is very good for systems with a large band gap and a small spectral width.

Finally, before comparing with other methods, one word about parallelization. As I have told you, the algorithm of CheSS can be parallelized in a very efficient way, and we also put a lot of effort in this direction to reach really good scaling; a minimal sketch of this column-wise distribution follows after the comparison below. This is again the calculation of the inverse: you see the number of cores against the speedup, and even though the matrices are not very big, even for more than 2,000 cores we are very, very close to perfect scaling. On the right hand side are the runtimes: using many cores, we can invert matrices of the order of 30,000 x 30,000 within only a few seconds, which is not so bad. We also went further, because we wanted to see the extreme scaling, so we went up to 16,000 cores. Of course the curve now becomes a bit flat, but the matrix we looked at, 96,000 x 96,000, is not really huge, so using 16,000 cores to invert it and still getting a speedup is not bad. Note that this speedup is relative to the first point, which already used 1,500 cores, so overall we get a speedup of the order of 5,000 to 7,000, which is quite good.

Now, to conclude, we compare CheSS with other methods. First we compare it to selected inversion, which is another method that can calculate the inverse and exploit sparsity, and then also to two standard dense approaches, namely ScaLAPACK and LAPACK. We have five different matrices, from 6,000 x 6,000 to 30,000 x 30,000, and we did this benchmark for various condition numbers, from 2 to about 150. First, the dense solutions, the yellow and the blue lines: they are constant, so there is no dependence on the spectral width, and as you would expect their cost explodes, because they scale cubically. Selected inversion also shows no dependence on the spectral width, and you can already gain a lot compared to the dense approaches. But the purple curve is the CheSS library, and if you are working with small spectral widths it is the fastest method. Compared with selected inversion, the crossover with respect to the condition number is at about 150. So if you have a matrix with a condition number below this value, you can use CheSS to save some time; otherwise I think it is better to go to selected inversion.
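Here is the promised sketch of the column-wise parallelization. It assumes mpi4py and takes the scaled matrix A_s and the coefficients from the earlier sketch; it only illustrates the idea of distributing independent columns over ranks, and is not how CheSS itself implements its hybrid MPI/OpenMP scheme.

```python
import numpy as np
from mpi4py import MPI

def my_columns(n, comm):
    """Contiguous block of columns owned by this MPI rank."""
    rank, size = comm.Get_rank(), comm.Get_size()
    counts = [n // size + (r < n % size) for r in range(size)]
    start = sum(counts[:rank])
    return range(start, start + counts[rank])

def chebyshev_columns(A_s, coeffs, comm=MPI.COMM_WORLD):
    """Each rank evaluates its own columns of sum_k c_k T_k(A_s).
    Applying the recursion to unit vectors needs only sparse
    matrix-vector products, and no communication at all until the
    columns are assembled at the end (e.g. with comm.gather)."""
    n = A_s.shape[0]
    local = {}
    for j in my_columns(n, comm):
        t_prev = np.zeros(n)
        t_prev[j] = 1.0                      # T_0 e_j = e_j
        t_curr = A_s @ t_prev                # T_1 e_j = A_s e_j
        col = coeffs[0] * t_prev + coeffs[1] * t_curr
        for c in coeffs[2:]:
            t_next = 2.0 * (A_s @ t_curr) - t_prev
            col += c * t_next
            t_prev, t_curr = t_curr, t_next
        local[j] = col
    return local
```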
Finally, we also compared the calculation of the density matrix. Here we do not compare with LAPACK, because it is obviously worse, but we compare with PEXSI, a library that was developed here by Lin Lin and that can also exploit sparsity, and we look at two different setups.

Let's first look at the MPI-only version; we did this because PEXSI is very efficient with pure MPI parallelization. These are the blue and the yellow curves: for the small matrices PEXSI is the fastest in this setup, but later on there is a crossover and CheSS becomes more efficient, because CheSS really scales linearly, whereas PEXSI, since this is a three-dimensional system, has a quadratic scaling. On the other hand, if you use a hybrid scheme, MPI plus OpenMP parallelization, we see that in all cases CheSS is the most efficient method. This is not a problem of the PEXSI algorithm; it is just that PEXSI is not very well parallelized with OpenMP, and that is why we gain here. But you can see that with CheSS you can really calculate the density matrix, the density kernel, within a few seconds, even for very big systems.

Last slide. I have presented CheSS as a library for this kind of density matrix linear algebra, but it is not the only one; I have talked about PEXSI, and there are others as well, and sometimes it can be cumbersome for a code developer to interface with all of these codes. Imagine that every time one of them changes, you have to adapt your interface again, and so on. Therefore some people in the US started the ELSI project, whose idea is to unify all of these libraries: in your electronic structure code you only have to interface with ELSI, and internally ELSI interfaces with the various solvers. In this way it is much easier for a code developer, in terms of maintenance, to keep up to date with all the developments going on underneath. At the moment CheSS is not yet in ELSI, but we are working with them, and we hope that in the future CheSS will also be included; then, if your code already interfaces with ELSI, you will get CheSS for free, without any work.

Let me conclude. I have presented CheSS, a flexible tool for the calculation of matrix functions, tailored specifically to density functional theory, but it can also be extended with other functionalities for other purposes. It exploits the sparsity of the matrices and in this way reaches linear scaling with respect to the system size. It works very well for small spectral widths of the matrices, as it was specifically tailored for this purpose. And last but not least, it exhibits very good parallel scaling, with a hybrid scheme that works with both MPI and OpenMP. That's all, thank you very much for your attention.