So, let me start things off by thanking Martin for accepting our invitation to give a presentation on the FlexiBLAS library that he has been working on for quite a couple of years already. I'll let him do the introduction on what FlexiBLAS is. Martin works at the Max Planck Institute and is the main author of the FlexiBLAS wrapper library. Go ahead, Martin. Yeah, thanks for the invitation and for giving me the possibility to present something about our software library in your community. First of all, let me give a short introduction about myself, because when I see the list of participants there's nobody I know, so I think vice versa it's the same. I did my diploma in mathematics nearly 10, 11 years ago at the Technical University in Chemnitz, in Eastern Germany. Since 2011 I have been working as an HPC administrator at the Max Planck Institute in Magdeburg, so mostly administrator, partly researcher. Until now I have had to take care of two cluster machines. The first one was a Westmere/Magny-Cours system; now we have one with more than 100 nodes on a Skylake basis. We have several different strange machines with POWER8 and POWER9 CPUs, with GPU acceleration; I think there's one Xeon Phi still around, and all kinds of strange stuff. I'm trying to get my colleagues to do continuous integration for their scientific software, so I'm trying to provide environments for this. In my research I mostly work on efficient algorithms for generalized matrix equations and generalized eigenvalue problems. And I develop all my software mostly with respect to getting production-ready stuff out of my research, or developing tools, whatever helps me in my everyday workflow at work. That's mostly what I'm doing. As a short announcement: I handed in my PhD thesis, and hopefully by mid-spring there will be the defense, so until then I don't have time to give any other talks. 
So I'm the co-author of several software packages, most of all FlexiBLAS and the matrix equations sparse solver (M.E.S.S.) package, which I think will be integrated in one of the next versions of MATLAB. Then there's the UFget library, an SQLite database access tool for the UF sparse matrix collection. For those who are dealing with sparse linear algebra stuff, you can easily access and run your examples on all the matrices available there, just selecting them in an SQLite way. And together with a colleague I'm the maintainer of the qrupdate-ng package, a maintenance fork of the qrupdate package. That happened because we were no longer satisfied with the maintenance of this code; we found some bugs and nobody responded, and so we started off ourselves. If you search for me: I'm not so active on social media, but you'll find me on GitHub under this crazy dragon name. The remaining guys of our team are, first of all, Jens Saak. He is mostly responsible for testing and has some ideas, and when we started on this problem he was very annoyed about recompiling the whole software stack over and over again, every time he tried to explore a new BLAS library on a system-wide basis. Then Christian Himpe, who was also involved in the development, the testing, the ideas and the documentation. He joined us when he was searching for the fastest software stack for his Octave installation, and he mostly announces our new releases on his Twitter channel. Then for many years we had a student assistant, John Papenburg, who did some of the coding work, some special details about strange profiling stuff. And of course I'm involved myself as well. So what are we dealing with? I think most of the people here know what BLAS is: the Basic Linear Algebra Subprograms package. 
So it's more or less a mixture between a reference software implementation of some linear algebra operations and, on the other hand, a standard for how the API calls look. Most of the routines can be categorized into three levels: you can deal with vectors, with matrices and vectors, and with matrix-matrix stuff. So adding vectors, computing norms, computing matrix-vector products, computing matrix-matrix products, rank-k updates. Those are the really basic building blocks of all linear algebra stuff. LAPACK is then the first package which is built directly on top of these algorithms. And it plays a role in many software packages: Octave is using it, MATLAB is using it, I think nearly every FEM software is using it, all simulation stuff. I think there's even a cross dependency when you install the GNU Image Manipulation Program with some special plugins. So, the typical implementations: the Netlib implementation, that's the reference one, built on top of the stuff they published between the end of the 1970s and the end of the 1980s; it is the reference implementation and defines the API. Then there's the OpenBLAS library, formerly known as GotoBLAS; that's assembly-tuned stuff. Then there was the ATLAS framework, something to automatically tune everything. I think it's no longer under really active development, but it was the standard BLAS library used for many years. Then the rather new one, the BLIS library; it's BLAS-like stuff, but they also provide the same interfaces as the Netlib implementation. And there is a set of hardware vendor implementations, mostly done by Intel, by IBM, by Apple. The ARM crowd has the Arm Performance Libraries. AMD had the ACML math library, but that is by now discontinued; I think they are now only contributing to BLIS. But one can also ask: why do we need just another BLAS library? 
So typically, there should be something for everybody in this list. And it started in 2013, I think, with a problem where we started compiling our code, for example the C-M.E.S.S. library, and tried to debug something. In general, the final application somehow depends on BLAS and LAPACK; libumfpack is a sparse matrix solver, libarpack is a sparse eigenvalue solver, and those three libraries directly depend on the BLAS as well. Normally they have their shared-object dependency inside. And I link this application with a linking line like you normally would. When you take a look at it, everything works fine: every library is included as we want, nothing strange inside. I was working on a Debian- or Ubuntu-based system, so all system libraries like LAPACK, UMFPACK, libarpack depend on a BLAS managed by the update-alternatives framework. OK, then I start trying: just a quick test with another BLAS library, let's see how it behaves with OpenBLAS. I only changed the linking line, adding OpenBLAS instead of the standard BLAS library. And that leads to a symbol resolution conflict. I thought the application was running normally, and then I got a segfault. I started GDB, and I was seeing: OK, I linked OpenBLAS, but the symbol which got resolved actually came from ATLAS. What happened? A close look with ldd shows the problem: we mixed two BLAS libraries into the process. And we needed a reasonable solution for this problem. So that was the point where we were coming from. Then we took a look around at what's on the market, what people are using for this problem. The first option was: pull in everything you need with LD_PRELOAD and LD_LIBRARY_PATH. But that was not possible with multi-file implementations like the MKL or ATLAS. Then some people proposed: use static libraries. 
Then you don't have the dependency information inside the libraries, but it ends up that you have to track the library dependencies yourself. It's complicated linking, mostly painful in large projects where you depend on more than, let's say, two, three, four libraries. That was also not a possibility for us. Then, as I said, I'm using Debian or Ubuntu mostly. update-alternatives is a good framework, but it requires superuser privileges, and that's something you don't want to give to your users. And there are some restrictions with multi-library implementations, similar to the ones you have with LD_LIBRARY_PATH and LD_PRELOAD. For example, if you install the libmkl packages provided by the Ubuntu guys, they ask you if you want to use libmkl_rt as one of the BLAS providers. And this library is super dangerous, because all the other ways of linking MKL are not possible with update-alternatives directly. Other systems like Gentoo deal with the eselect or pkg-config system: same problem again, it requires superuser privileges, and it works at compile time only. So when you think of large projects on your computer, like Octave and so on, just trying OpenBLAS with another threading model, for example OpenMP instead of pthreads, means recompiling everything again. And then the BSD world has a problem as well: if you link against libblas, you get whichever implementation the ports system happened to install. So I think when you install Octave you have OpenBLAS underneath, and if you install something else you have the reference BLAS underneath, so it's not directly visible which one you are using. Then we encountered some other issues, like the transition between the GFortran-style and the GNU Fortran 77 or Intel-style interface. 
That concerns the way they return complex values: in the Intel calling style and the f2c/GNU Fortran 77 ABI, a complex return value is added as a new first function parameter, and the result is returned through a pointer there. In BLAS only four functions are affected by this problem, the complex dot products. But if you are not really sure what your BLAS library is doing and which interface it has, and you link it against your software, you can get into trouble. Then we encountered that there are some missing routines, for example the component-wise absolute value, the one-norm of a complex number: that's not in ATLAS, and old Apple Accelerate versions and AMD's ACML seem to depend on the same trick ATLAS is doing, so this library call is missing there as well. It's not an official routine of the BLAS standard, but it is provided by the reference implementation. The other really bad guy in this area is the ESSL from IBM. They say they have a full LAPACK implementation, and you take a look at the documentation, but they only provide a few of the routines, only the really high-end ones. The other ones you can find inside the .so file, but you are not able to call them. So we have to fill in the missing ones automatically. And then there is some new stuff happening: there are now some extensions, for example the alpha times a vector plus beta times a vector routine (AXPBY), or copy routines, that are not part of the official BLAS but are included in MKL and OpenBLAS. So we have to somehow deal with these compatibility issues, so that we can ensure users always link against the same library, and no matter what BLAS they are using in the back, they have all the routines available. Then, more or less for the people compiling software and administrating software packages: detecting the BLAS library is also not completely trivial. 
In my early days as administrator I found plenty of hand-written detection scripts, same as with the FindBLAS module from CMake; especially with old versions it was horrible. And especially in scientific software from people who don't care about tools like CMake or autotools, you find hard-coded library paths and non-standard library locations. So in the end we want to end up with only one path, one library to link, and that's the way we go with our FlexiBLAS library, which is meant to handle those problems. We had the initial idea in summer 2013; as I said, I was struggling with this linking issue, getting different BLAS implementations into one project although I only wanted to have one. Then in December 2013 we had the first version ready, with only BLAS and CBLAS in the beginning. Over the years we developed the second version, where we wrap all the LAPACK routines, which implies that we have an automatic code generator, and we allow switching the BLAS library from inside an application at runtime. Then in 2020 I was contacted by Iñaki Ucar from the Fedora project. He asked if FlexiBLAS was still under development, and that was at the time when I had started preparing version 3.0. There we added the idea of installing hooks around BLAS calls, so that we can get a handle on how they are called, to get analysis functionality in, to capture information about how the BLAS routines are used. And finally in October, with a set of changes required by the integration into Fedora, we became the default BLAS in Fedora Linux since version 33. We have interfaces for BLAS, CBLAS and LAPACK, and as I said, since version 2 we have an automatic code generator for the wrapper functions. 
We have an API interface for GNU Octave, so that you can do the switching even from inside your Octave scripts; a similar interface is available for R. Last October we released the latest version, 3.0.4. And for those who care about licensing: it's GPL version 3 code, but we include a linking exception similar to the one from GCC, which allows linking even proprietary software against FlexiBLAS without committing a GPL violation. So you can use it in whatever code you want; as long as you only link against it, everything is fine and you don't violate the GPL. I think that's important for people working mostly with permissively licensed software, MIT or BSD, so that the linking exception keeps them out of trouble. Now, how does this work? How do we get a solution to the linking issue, to the compatibility issue, and a way to switch the backends very fast? Our solution was to develop a plugin-like framework, employing only what is in the POSIX standard. The reason for sticking to POSIX was to be as portable as possible, at least in the Linux HPC world, and to not include any assembly hacks, no modifying the ELF libraries, no modifying the global offset table or the PLT. It should be portable and still understandable what's happening there. For those who are not familiar with it: there is the dl* family, standardized in POSIX.1-2001. It includes four routines: one each to open and close a shared object, one to search for symbols, and one for error handling. So it's really not that much. So we did it, wrote some code, created a first plugin framework, did some tests, and ran into some issues. The first one is that dlopen does not work like the normal linker: it only provides you a handle to the library and does not integrate anything at all. 
So we need to initialize every BLAS call separately. Good, something which can be done. The second thing is that runtime-loaded symbols cannot be resolved by linking the program: every symbol we need afterwards for normal linking, so that FlexiBLAS behaves like a BLAS library, already has to be available at compile time. And still we have the multi-file implementation issue which I mentioned with the MKL library or with ATLAS, for example. So those are the three basic issues we have to deal with. The initialization was solved via the typical way shared objects are loaded: one can specify a routine which has a special attribute (in former times you would use the _init function), and that is executed before even the main function of the program starts. So we can put everything we need for initialization there, for opening stuff, for looking up all the symbols, without the user changing any code. The user does not have to call something like flexiblas_initialize or so. In this way we get our initialization routine in without requiring users to change their code. The same thing works with a destructor routine that is executed after the main function exits. That's the point where we include the cleanup, print profiling data, finish analysis targets, whatever else we have integrated. That was the first problem. Then, as I said, we need all routines at compile or link time beforehand, so we have to create wrapper functions. The wrappers work this way: we start with the application; it calls, for example, the dot product routine; we have a wrapper inside; this wrapper takes a look at the backend selection, where backend here means the BLAS library you really want to execute. 
From there it gets a function pointer and directly passes everything on to the backend, which can be MKL, OpenBLAS, the Netlib reference implementation, or whatever. And once you have such a framework with wrapper functions, it's easy to integrate something we call overloaded functions, or hooks. There you can put stuff around the actual BLAS call; I will give you some examples for that later. We have done this for all BLAS and LAPACK routines: BLAS is around 130 routines, LAPACK is around 1800 at the moment. How does this work? Programmers are normally lazy guys: everything which is repeated work, which you do more than once, you try to automate. And we found the perfect tool for this in the f2py interface from NumPy. With it you can parse Fortran 77 and Fortran 90 function headers; that creates a JSON file with all the header information, and then we have a script which translates all this header information into Fortran-ABI-compatible C functions. And that looks like this: we have the subroutine written in Fortran; the generator extracts all the necessary meta information, for example there's an integer, a double precision value, an array, an integer value, an array, and an integer again, and it creates such a wrapper function out of it. That's done for all of the nearly 2000 routines. Inside this wrapper function we have the possibility to check whether there is a hook or not, and if there is one, this additional surrounding function is called. Then the only thing left over from the issue list is how we handle the multi-file implementations like MKL or old versions of ATLAS. When you take a look at what the MKL installs, there is a tools folder inside, and in there is a tool called the MKL builder. 
And that's a very tricky Makefile which employs ld to create a surrogate library which includes all BLAS symbols and references everything correctly in this multi-file implementation of the MKL. The same trick can be done with ATLAS and the like; we have it included for the MKL and also for other libraries, but you can also use the setup shipped with the MKL. So if FlexiBLAS detects at compile time that there is a BLAS library consisting of multiple .so files that somehow need to interact, it can create such a surrogate library whose only job is to reference all the symbols from the multiple .so files. One caveat: never use libmkl_rt. I already said before that this library is dangerous, because it allows changing the binary interface at runtime, for example the width of the integers, and that's something we cannot handle until now. I have some ideas how to handle this problem, but I think the way Intel made it, having this possibility to change for example the integer width at runtime, is not that good. So how is it used afterwards? Once somebody has installed it, it provides a tool called flexiblas, similar to the eselect stuff some of you might know from the Gentoo system. Using "flexiblas list" you can take a look at which BLAS libraries are installed on the system; those are mostly the ones found at compile time, but you can also add your own. Then you can say "flexiblas default" and set the backend, and that's the default BLAS library which is used for every subsequent program you execute. Otherwise, if you want to do some quick tests, you only have to change the FLEXIBLAS environment variable: you either give a name out of this list, or you can pass any BLAS and LAPACK library which is contained in a single .so file, and then it's loaded into your environment and used for all the computations afterwards. 
There's a bunch of config files: a system-wide one, a per-user one, and, because I'm mostly working on systems where I have a shared home directory between many machines, there is the possibility to have a per-host configuration file. So on every host where I execute my code, a different BLAS library can be loaded: for example, on AMD machines I load the BLIS library, on Intel machines I load the MKL, on my small desktop computer I load for example OpenBLAS, and I only have those three config files for the different systems; the host-specific one is chosen automatically. It helps when using and debugging software and doing performance tests. The behavior can be influenced with some environment variables: for example, you can make it a bit more verbose; you can turn off loading the LAPACK part of a backend, so that only the LAPACK from the Netlib reference implementation is used, which is sometimes good for debugging purposes; you can deactivate the color output, because the color codes can cause terminal artifacts; and you can specify different paths for the configuration file and add additional library search paths. Quite usual stuff. The more interesting question: it's a wrapper library, so there is some overhead, and the question is how large it is. For this I tested on my desktop computer; it's a relatively modern Core i5 system with a current Ubuntu setup, and I normally use OpenBLAS with OpenMP support. I measured the shortest return path you can have through the BLAS library which is still successful, i.e. which does not run into the XERBLA error message system. The shortest successful call is when you pass inputs of size zero, so operating on a vector of length zero or a matrix of dimension zero times zero. Then I read the RDTSC counter, the clock-cycle counter of Intel CPUs, and averaged over 100 million runs for three BLAS calls. In this 
table you see how long it takes until the BLAS library returns, with zero-size input, for the vector addition, the generalized matrix-vector product, and the generalized matrix-matrix product. The difference is only between five and 16 clock cycles, and I think there is a close connection between this difference and the number of function arguments you pass in. The longer the list of function arguments gets (for example, there are huge SVD routines where you have around 20 to 40 input parameters), the larger the difference; I think it would be around 20 to 30 cycles there. But when you think of what BLAS is normally doing, where the performance comes from, it's mostly the level-3 routines like GEMM, and these few clock cycles are nothing compared to the computational work done there. Then, as I said, we have the possibility to overload functions: you can install up to 255 hooks, which means you can include 255 different overloaded functions for each BLAS call. OK, that gets slow in the end; that's nothing somebody would want to do. But for what kind of problems did we include this? The first thing is that you can build easy profiling frameworks, measuring runtime and counting how often a function is called, without doing the -g or -pg stuff when compiling the code. We already did some experiments to dynamically offload calls to accelerators: you can check the function arguments, for example whether the matrices are large enough, and then directly move them to accelerator devices like NVIDIA or AMD GPUs, and if the matrices are not large enough, the call is passed back to the BLAS library you have loaded. For debugging purposes I already did experiments with faulty behavior, for example the one I present down there: by changing the result of the one-norm of a vector, the sum of absolute values of a vector, I could see what happens if there are some extreme rounding errors. 
That's something you can include. The whole thing, as I said, can be chained: I can set up many hooks calling each other until the final one is reached, and this one then executes the original BLAS library. And with some tricky additional stuff that is not completely ready yet, one can call the original BLAS library in every case; I also want to make it possible to call the Netlib implementation in every case via a separate pointer, and we will see later why this can be useful for debugging purposes. Now, how to use it: at the moment we only ship one of these hooks, for the easy profiling, so accumulating the runtime of each BLAS call and counting the number of each BLAS call. You activate it with the variable called FLEXIBLAS_HOOK: there you either give the .so file containing the hook, or select one out of the list. It's a colon-separated list, and the hooks are executed in the order you specify them: if you want to have 10 hooks loaded, you give 10 .so files, and they are executed in the order you gave them. The hooks are not required to provide all the BLAS library calls: if you only want to modify one of them, you only provide a hook which includes, for example, the GEMM operation and nothing more, and only this routine is modified. So, what have we done with this easy profiling? We found some misuse of a BLAS function inside Octave. For those who are a bit into numerical linear algebra: this is a short implementation of the conjugate gradient algorithm, solving linear systems with mostly sparse, symmetric positive definite matrices, and it consists only of matrix-vector operations. So A is a matrix, b is a vector, x is a vector as well; we have here a matrix-vector product, here a scalar product of a vector with itself, here a scalar product of two vectors, here a matrix-vector product, vector additions, vector scaling. And 
we plugged it into this profiling framework, and I took a look at what happened there, which routines are executed. We see DDOT, the scalar product, called 1,000 times. But I have one scalar product here, one here, and the loop runs for 1,000 iterations, so there must be at least 2,000 of them. OK, something is missing. The vector additions here and here appear nowhere. A set of matrix-vector products: OK, we have that 1,000 times, plus one in the preparation, that fits. And then we have this symmetric rank-k update (DSYRK). That's a matrix-valued operation, where you update a matrix with a symmetric rank-k update, and that should not be inside a CG code: if you implement CG yourself you will never call this routine, so why is it in there? A closer look: the symmetric rank-k update computes either the first or the second variant (C := alpha*A*A' + beta*C or C := alpha*A'*A + beta*C). And when we look at the function arguments passed to the symmetric rank-k update, we see it's the second case, with a result matrix of size one times one, so a scalar value, and the size of the update is one times 1,000. It's exactly the scalar product of the vector with itself. The correct function would have been one computing a squared two-norm there, and that would be faster, with less overhead from the side of the BLAS library. That was one bug we found. Then we ran into trouble with OpenBLAS when I implemented some sophisticated QR algorithm: I wanted to compute an orthogonal decomposition of a matrix, I implemented a high-performance algorithm, and then I stumbled over this DTRMV routine. It computes the matrix-vector product T*x, where T is a lower or upper triangular matrix, so only half of the matrix is filled with values and the other entries are zero. And I recognized that when I store this matrix with a leading dimension of 64, which means the next entry in the same row is 64 elements away in memory, then if 
the dimension of the problem is larger than 16, I get some corrupted values, and if the dimension of this matrix is larger than 32, it gets completely wrong. So what did I do? I wrote a hook which on the one hand calls OpenBLAS (I selected OpenBLAS as backend, so it calls the backend) and on the other hand executes the Netlib implementation with the same input and output data, and then I compare the outputs. That's what comes out here: the fourth argument is the size argument and then there is the leading dimension argument. We see that beginning with dimension 17 we have some error inside, with 32 the error is still there and clearly recognizable, and with 33 as dimension we have a completely wrong result. What we found afterwards, discussing with the OpenBLAS guys: there was a threading error inside, a race condition, and the bug seems to have existed for more than 10 years without anybody detecting it. After 60, 70 comments in the discussion we didn't have a solution, but we figured out that deactivating threading for this routine at least worked around it. And this race condition was in the high-level code, not in the kernels of OpenBLAS, because when we got our first POWER8 system I could reproduce the bug there, and the normal vector addition was somehow involved in the problem as well. If you have the possibility to directly build up something where you can compute the solution again with a well-known implementation, which I know is hopefully error-free, then you can easily detect such bugs. I think in my career I have had this case two or three times with OpenBLAS, where with some strange input parameters, with some sizes, there was a bug similar to this one. So that's basically what FlexiBLAS is doing. What we have planned for the next release: we have a metadata logging hook already available, but we don't build any cool tools out of it yet. Metadata logging means in this case that we collect 
all non-array arguments of all BLAS calls in an SQLite database, so you can say afterwards how many times, for example, the matrix-matrix product was called and with which arguments: the transpose flags, the dimension arguments, and even the leading dimensions. That can be done for all LAPACK and all BLAS routines. With this information we want, on the one hand, to develop a correctness and accuracy checking tool, doing what I presented for OpenBLAS in a larger way: you can directly say, OK, these were the calls the application did, there was an error, now do all the computations again. Mostly the problems come with the dimension arguments and BLAS optimizations, so do the same computations again with problems of the same size, with OpenBLAS and the reference implementation, or with MKL and the reference implementation. On the other hand, when you try to tune an application for high performance, you mostly have the problem that, as a developer of the BLAS library, you have some clue where and how to do your benchmarks, but they are mostly artificial. So we want to use the data we collect from the application to reformulate and tune our benchmarks in a way that uses the real problem sizes from the application, no longer the artificial ones we thought of while developing the software. Then we have macOS support already in testing. The problem was that macOS is not using ELF binaries, they are using Mach-O binaries, and although, as I said, we try to stick to POSIX, the linker behaves in a bit of a strange way there, and the large problem was to get all the linker sequences in the CMake scripts into a form that macOS accepts. Then we are planning to do suffixed symbols, that means adding, for example, a 64 suffix for the 64-bit-integer BLAS implementation. The Julia guys are doing this, but in a non-consistent way: for example, they add _64 to all routines on the 
symbol level, but not at the programming-language level, and so you get the problem that the way you call them from Fortran and from C diverges, for example with the CBLAS routines. And that looks a bit strange, because it ends up that you have to add an underscore to a C API function, which looks completely strange. Then we try to do per-application BLAS selection. As I said, until now we could do it system-wide, per user, and per host with these configuration files, and from Iñaki Úcar, the Fedora packager, came this idea of how to do it per application; now we're thinking about reading from the /proc file system to figure out which application is running. Then, as output of this application-driven performance optimization, we plan to do per-routine BLAS selection. When we see what the user is doing and which routines they execute with which problem dimensions, we try to check, for each routine, which BLAS library is the fastest on average. For example, it could be that matrix-vector products are mostly called with small data, so that using a threaded BLAS library there causes too large an overhead; on the other hand, the matrix-matrix product benefits very well from threading. So then one would use, for example, some of the matrix-vector products from a single-threaded library and the matrix-matrix products from a multi-threaded one.
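The per-routine selection idea can be illustrated with a small dispatcher. This is a minimal hypothetical sketch, not the FlexiBLAS API: the two "backends" are plain Python stand-ins for a single-threaded and a threaded BLAS, and the crossover threshold is an assumed number that in practice would come from the collected benchmark data.

```python
# Hypothetical sketch of per-routine, size-dependent backend selection.
# The "backends" here are Python stand-ins, not real BLAS libraries.

def gemv_single_threaded(A, x):
    """Matrix-vector product; models a single-threaded backend (low call overhead)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def gemv_threaded(A, x):
    """Same result; models a threaded backend (only worth it for large sizes)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

# Assumed crossover dimension; in reality this would be measured per machine.
THREADING_THRESHOLD = 256

def select_gemv(n):
    """Route small problems to the single-threaded backend, large ones to the threaded one."""
    return gemv_single_threaded if n < THREADING_THRESHOLD else gemv_threaded

A = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
backend = select_gemv(len(x))   # n = 2, well below the threshold
print(backend.__name__)         # -> gemv_single_threaded
print(backend(A, x))            # -> [3.0, 7.0]
```

The point of the sketch is that the selection itself is just returning a different function pointer; the cost sits entirely in deciding the threshold, which is exactly what the logged problem sizes would feed.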
Then we have no LAPACKE support at the moment; that's the C/C++ interface for LAPACK, a wrapper library so that you don't have to do the work of adjusting the pointers and the calling sequences between C and Fortran. And we already did some work on providing a proper way to handle the two error-handling routines from BLAS, called XERBLA and CBLAS_XERBLA. The problem is that the way they behave and the way they are linked comes from the time when static linking was quite usual; normally I would expect something like an API call, like set_error_handler, and that's missing there. I already did some experiments for CBLAS, and let's see how far I get with this. I think in this autumn, up to the end of the year, we will do a new release with at least some of the planned features hopefully included. Some early details about it are in a preprint, at least in the LAPACK Working Notes. And that was it from my side, so thank you for your attention. Further information can be found on our project website, and we have a mirror on GitHub. On the GitHub mirror you will not see the new development stuff, it's mostly a release mirror, but if you find an issue, write it down there and we can discuss it. Yeah, and that's the stuff from my side.
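The set_error_handler idea mentioned above, registering a BLAS error handler at runtime instead of replacing XERBLA at link time, could look roughly like this. This is a hypothetical sketch of the concept only; the names set_error_handler and xerbla here are illustrative and are not FlexiBLAS's actual interface.

```python
# Hypothetical sketch: runtime registration of a BLAS error handler,
# instead of overriding XERBLA at link time. Names are illustrative only.

def default_xerbla(routine, arg_index):
    """Mimics the reference XERBLA: report which argument of which routine was bad."""
    return f"** On entry to {routine} parameter number {arg_index} had an illegal value"

_error_handler = default_xerbla  # the currently installed handler, a plain function reference

def set_error_handler(handler):
    """Install a user handler; return the previous one so it can be restored."""
    global _error_handler
    previous = _error_handler
    _error_handler = handler
    return previous

def xerbla(routine, arg_index):
    """Entry point the wrapped BLAS would call on an argument error."""
    return _error_handler(routine, arg_index)

# A user-supplied handler, e.g. collecting errors instead of printing them:
errors = []
set_error_handler(lambda routine, idx: errors.append((routine, idx)))
xerbla("DGEMM", 3)
print(errors)  # -> [('DGEMM', 3)]
```

The design point is that switching the handler only swaps one reference, which is exactly what the link-time XERBLA mechanism cannot offer once everything is dynamically loaded.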
Okay, great, thank you very much, Martin. If there are any questions from the people in the Zoom, please raise your hand and we'll allow you to unmute so you can ask the question, or drop the question in the Zoom chat or in Slack. We already have a question from Thomas: how easy is it to mix and match different backends and functions, especially in cases where there's a really good but not complete LAPACK library? Can we fall back on another implementation for calls which are not in the first library, so can you chain libraries after each other? At the moment we realized it in a way that you have to specify a fallback library at compile time. If you don't specify it, the latest LAPACK version is used, but you can redefine this fallback library at compile time. It works in a way that first we load the fallback library and fill up all the symbols, and then the desired BLAS library and the LAPACK library are loaded, and all symbols that can be found there are updated. Okay, so you're stuck with a single fallback library? At the moment, yes, but that could be something we can integrate in the next version. This library was developed for the cases where I needed it, and so that's an idea I didn't have on the radar; a good thing to integrate. Maybe a follow-up question from me, since I don't see any others coming in: can you also do something like this in a hook, where from all the libraries that are supported by FlexiBLAS, or known by FlexiBLAS, you implement a hook yourself and, based on some conditions, pick which library is used to actually do the call? At the moment not; that's the stuff we want to do. But it's possible: you can access this API, which we have in Octave as well, where you can switch the BLAS libraries, and that's also available in the hooks. So that's actually something I had in mind: when we've been comparing OpenBLAS and BLIS, for example, on different architectures, on Skylake and on Rome, we see mixed
results. Especially on the Intel systems, where we're looking at MKL as well, MKL is really good and BLIS is fairly close for some stuff but a lot behind for other stuff, also depending on whether you're running single-threaded or with a couple of threads, staying within the same NUMA domain or chiplet, or whether you have lots of threads; we see big differences there as well. So I was wondering if you could have a more dynamic switching thing, where you look at how many threads you're running, which function you're running, and on what platform you're running, whether it's AMD or Intel, and if all of that would be possible in hooks; it sounds like it could be. Yes: the hooks provide an initialization routine as well, so if you have some really expensive initialization phase, for example recognizing which platform you're running on and so on, that's something you only need to do once when the program starts. You can place it there and pre-select the BLAS libraries you want to switch between afterwards, and the switch of the BLAS library is then done by changing only one pointer. Okay, so the overhead, even though we have additional logic, as long as you can do some of that logic at the very start, is still quite limited? Yeah; I think it could be that we have to include one or two special routines for this case to make it more efficient, but in general it's not too bad at the moment. The way I implemented it, it could even get faster; we would only have to include one or two routines there to get it even faster. But for, let's say, realistic small operations, say a 10-by-10 matrix and a matrix-matrix operation, whatever it is, how much overhead is there percentage-wise? Is it like one percent, is it five? It's still in the single-digit area. Okay, even for very small ones? Yeah, though I think when you go to the case of 4-by-4, which could be realized in five or six CPU cycles
with modern vector units, then no longer. But that's something where, from my research and my application development, I would say: calling BLAS with problems that could typically be done on a sheet of paper by hand, that's the case where you should directly implement the operation in your application and not call into BLAS, because then the overhead of preparing the function execution (putting everything on the stack, calling the function, coming back from the function call) is more than the computational work. But that's not really related to FlexiBLAS. No, that's a general problem when you're dealing with BLAS: you have to make sure you actually benefit from using it. Yeah, and when you write your code in Fortran, there is now the MATMUL intrinsic function, and with GCC's gfortran you can specify the switching point (the -fexternal-blas and -fblas-matmul-limit options) for when such a call should be directed to BLAS and when it should be expanded inline by the compiler. Then there's a follow-up question by Yuri: somebody still needs to prepare a landscape of architecture, routine, and problem-size combinations, I guess, to select the best choice on the fly, in terms of picking a different backend based on some conditions; he's asking if there are any concrete plans for doing this, and I'm guessing he's maybe interested in helping out. So yeah, the plan has been lying around in my head for a long time; that's why we already implemented this metadata logging. But for the remaining stuff I didn't find time in the last two years, due to trying to get my PhD thesis finished. I think when that is done I will take a closer look at it, because with the huge zoo of different computers we have in our institute, I'm also interested in this problem. Okay, so I guess people can reach out if they're interested in working on something? Yeah, drop me a line and we can discuss it. Okay, there's also a question, I mentioned it to you
as well, and you're fully aware of it: the Julia alternative, this libblastrampoline thing. Yeah, you sent it to me on, I think, Friday. This libblastrampoline, I took a look at how they made it. It's a similar approach; it would get an even smaller overhead than ours, but it's not portable, because they modify the procedure linkage table, and that's something I think is very dangerous. During the development of FlexiBLAS we found, on the PowerPC little-endian platform, two code-generator bugs in GCC which are related to the procedure linkage table, and I think it's still dangerous to move things around there. And as far as I saw the code, you need an assembly kernel for each new architecture that appears, and even for the old big-endian stuff they don't have any clue how to do it at the moment. Okay, let's see if there are other questions. Yeah, there's another one: batched BLAS calls. I understand some libraries, like PLASMA and MAGMA, have support for packing operations together before actually executing them, to improve performance a bit. Is that out of scope for FlexiBLAS? For PLASMA, I know that they map it directly to BLAS at the moment. Let's see when it appears in one or two of these efficient BLAS libraries; when they have an interface for it, and it sticks to one interface that I could call stable, then we can integrate it. So I would not say out of scope, but it's a relatively new operation and not included in the reference implementation, so we have to wait until there's a stable API around for this type of operation. Let's see if there are any other questions; people who have questions, do share them in the Slack chat, for example. I think I had one more. Okay, two questions related to installing and building FlexiBLAS. When you're building it, you can give it essentially a list of BLAS libraries it should build against, but
you can also specify stuff at runtime, at least for the single-.so BLAS libraries. Are there benefits in doing it at compile time? The compile-time detection is more or less for when somebody with a standard desktop Linux system wants to use it: it collects all the BLAS libraries they have installed on the system and preconfigures everything. From the maintainer or distribution point of view that's also a bit easier, but you can do it at compile time or at runtime, however you want. And in the system-wide configuration there is, like for many other software projects, a .d directory next to the configuration file, and all the files in there are scanned as well. So if you have new BLAS libraries, you only place one piece of configuration file there and it is picked up; the system administrator can add as many BLAS libraries as he or she wants and provide them to the users, even after compiling and installing. But there's no additional runtime overhead compared to not doing this? No, from the runtime point of view there's no difference; it's only more or less configure-and-make-install work, and it does not change anything inside FlexiBLAS whether you compile it only with the Netlib reference backend or with Netlib, OpenBLAS, BLIS, and MKL; from the binary point of view there's no change inside FlexiBLAS. Okay, but for stuff like MKL, can you also do it purely at runtime? Because there you have the problem with the different .so files, right? Yeah, there's this MKL builder tool, which is normally provided by the MKL installation; with it you create the wrapper library yourself, then move the resulting .so file to some place where it can be found, and then it works as well. Okay; I don't think you have that documented. Write it down on GitHub in an issue, and then with the next release we extend the documentation for this case. Okay, and then what about the
fallback to Netlib BLAS? Because that's a bad idea from a performance point of view, at least as I understand it. It's interesting, as you showed with one of the cases, that you can do comparisons and catch bugs and do things like this, but is it possible to disable that as well? FlexiBLAS essentially never falls back to Netlib. As I said, one can select another fallback implementation at compile time, and I think it could, or should, be possible at runtime as well. And this fallback really rarely happens: for the standard BLAS calls, only this one routine, the absolute value (1-norm) of a complex number, is affected, and that has no performance influence. With LAPACK, for example, you load an MKL library which only has LAPACK 3.7 integrated, but FlexiBLAS is compiled against the latest LAPACK version, 3.9 at the moment; then there are a few symbols missing, and if you call one of these routines that are not in MKL, it uses the LAPACK implementation out of the reference. But the BLAS calls inside this LAPACK implementation are still mapped to MKL. So a missing LAPACK function in the backend does not result in using the whole Netlib stack; you only use the LAPACK part, and the BLAS calls still go to the selected backend. And that's true for almost all of the highly optimized LAPACK routines in the vendor implementations: for example, when you take a look at OpenBLAS, they provide a LAPACK, but they only improve the performance of a handful of operations, and most of the LAPACK stuff stays untouched. I think that's also the case with MKL, because I don't think they adjust these 1,800 routines all by hand. Okay, so in practice there's no problem with having Netlib BLAS in there; it's only used as a last resort, and there's really no performance impact. Yeah, normally at least these roughly 140 BLAS routines are provided by all implementations. Okay, there's another question by
Sam: is the metadata logging hook available somewhere already, so can we find it? Not yet, but I think that's something I can make available in the next weeks. I want to move it to a separate project, such that I can develop the hooks independently of FlexiBLAS; when my student assistant wrote this thing, he moved it directly into the source tree, so I have to get it out of there first. Okay, I guess Sam can send you an email if he's interested. Yeah, of course, I'm fine with that. Okay, I don't see any other questions, so I guess we can wrap it up here. Thank you very much, Martin, this was very interesting, and it looks like a good fit for some of the problems we have in the EasyBuild community with switching between different BLAS libraries, especially with AMD systems now being back in view, which makes us look at other BLAS libraries. It looks very interesting. If you want to research in this way what the best BLAS library is, that's something I'm interested in as well; and if there are routines missing, or if you find something where we could accelerate the mechanisms inside FlexiBLAS, feel free to ask, and I think we will find a good solution and integrate the stuff quickly, such that we can get to nice results there.