Hi, welcome back, all. The next talk is from Field G. Van Zee from the University of Texas at Austin. He will be giving us an overview of the BLIS framework. Over to you, Field.

Yes, thank you. It's good to be here. Thank you for having me. My name is Field Van Zee and I'm from the Science of High-Performance Computing group at the University of Texas at Austin. Today I'm going to be talking about the BLIS framework. So let's jump right in.

What is BLIS? BLIS is an acronym. It stands for BLAS-like Library Instantiation Software. It is a framework for quickly instantiating high-performance BLAS and BLAS-like libraries. And you might be thinking: why BLAS-like? What's wrong with the BLAS? Well, we'll get to that in a minute, but for now just assume that BLAS-like is synonymous with BLAS.

So what are the BLAS? The BLAS are also an acronym, for the Basic Linear Algebra Subprograms. They were rolled out in three families, which we call levels: level 1, 2, and 3, which address different types of dense linear algebra operations. The level 1 operations were vector operations exclusively. Level 2 operations deal with mixed matrix and vector operands. And level 3 operations, which are the most computationally intensive, are operations where all operands are matrices.

So why are the BLAS important? The BLAS constitute the bottom of the food chain, as we like to say sometimes, for most dense linear algebra applications, as well as for libraries such as LAPACK and libflame. The idea is pretty simple: if the BLAS interface is standardized, or at least agreed upon, and if an optimized implementation exists for your hardware, then higher-level applications can portably access high performance. It was really a key contribution for its time.

Just to give you a sense of what operations are available in each of the levels: an example of a level 1 operation would be a vector copy, a dot product, or a so-called AXPY operation (y := alpha*x + y). Level 2 operations are things like general matrix-vector multiplication or Hermitian rank-1 update. And the level 3 operations, which are probably the most famous and certainly the most computationally intensive, are things like general matrix multiplication, triangular matrix multiplication, and triangular solve with multiple right-hand sides. I would say that the level 3 operations form the bread and butter of most dense linear algebra applications and related high-performance computing applications. There are plenty of implementations available, through both vendors and open source projects.
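To make those interfaces concrete, here is a minimal example through the standard C binding, CBLAS; the operands and sizes are invented purely for illustration.

    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        /* Level 1: y := alpha*x + y (AXPY), on vectors of length 3. */
        double x[3] = { 1.0, 2.0, 3.0 };
        double y[3] = { 0.0, 0.0, 0.0 };
        cblas_daxpy( 3, 2.0, x, 1, y, 1 );    /* y is now { 2, 4, 6 } */

        /* Level 3: C := alpha*A*B + beta*C (GEMM), 2x2 column-major. */
        double A[4] = { 1.0, 2.0, 3.0, 4.0 }; /* columns stored contiguously */
        double B[4] = { 1.0, 0.0, 0.0, 1.0 }; /* the 2x2 identity */
        double C[4] = { 0.0, 0.0, 0.0, 0.0 };
        cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                     2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2 );

        printf( "C[0] = %f\n", C[0] );        /* 1.0, since C = A*I = A */
        return 0;
    }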
So why do we need BLIS? There are actually two questions here. The first is: why do we need BLIS? And the second is: why should we want BLIS even if we don't need it? We'll look at the first question first.

Why do we need BLIS? Well, unfortunately, the BLAS interface, the API itself, is quite limiting for some applications. And this should not surprise anyone, because it was finalized 30 years ago. So I'll give you some examples of how the interface is limiting. In the interface itself, at levels 2 and 3, when you're dealing with an operation that takes a matrix for one of its operands, that matrix must be stored in column-major format. For those of you who don't intuitively understand column-major storage, it's simply a mapping of matrix elements to memory such that columns are contiguous.

There's also row-major storage, which is not directly supported by the BLAS, although it is supported by CBLAS, a layer that can sit on top of the BLAS. We want to be able to support that as well. We also want to be able to support so-called general stride storage. General stride storage, to put it succinctly, is storage where there is no contiguity in either dimension: all elements are non-contiguous with all other elements. And further yet, we want to be able to mix the storage formats within a single operation call. So maybe we want to multiply a column-major matrix by a row-major matrix, and then use that product to update a general-stride matrix.

By the way, why do we even need general stride storage? It's certainly not the kind of storage you would reach for on modern computers, given the emphasis on contiguous access. But suppose you have a three-dimensional tensor, which I've tried to illustrate here as a stack of column-major matrices, and we want to take an arbitrary slice, and the slice that we want happens to lie in this plane. Well, if you look carefully, all of those elements are non-contiguous. So how would we even refer to such a matrix in the BLAS? The short answer is: you can't. You would have to make a temporary copy of your matrix, and that copy would have to be stored in column-major format. Then you could proceed with your computation, and afterward you would have to update the original matrix, if that was the idea.
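To picture these storage formats in code: if a matrix carries an explicit row stride and column stride, one indexing expression covers column-major, row-major, and general stride alike, including the tensor slice just described. A minimal sketch in plain C (not BLIS code):

    #include <stdio.h>

    /* Element (i,j) of a matrix with row stride rs and column stride cs. */
    #define ELEM( a, i, j, rs, cs ) ( (a)[ (i)*(rs) + (j)*(cs) ] )

    int main(void)
    {
        /* Fill a 3x4x2 tensor, stored as two 3x4 column-major slabs. */
        double buf[24];
        for ( int p = 0; p < 24; p++ ) buf[p] = (double)p;

        /* Column-major m x n: rs = 1, cs = m (the only BLAS option).
           Row-major    m x n: rs = n, cs = 1.
           General stride: anything else. For example, fixing the first
           tensor index at 0 yields a 2x4 "matrix" whose rows step by 12
           (across slabs) and whose columns step by 3: no element of the
           slice is adjacent to any other. */
        int rs = 12, cs = 3;
        for ( int i = 0; i < 2; i++ )
            for ( int j = 0; j < 4; j++ )
                printf( "slice(%d,%d) = %4.1f\n",
                        i, j, ELEM( buf, i, j, rs, cs ) );
        return 0;
    }

The two stride integers are all it takes; this is exactly why carrying independent row and column strides lets a single code path serve all three formats, and mixtures of them.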
Another way the BLAS interface is limiting is that it has incomplete support for complex operations. This is getting really into the weeds, but to put it simply: there are no instances in the BLAS where you can implicitly conjugate an operand as part of the operation. You can transpose. You can Hermitian-transpose, which is a conjugation plus a transposition. But you can't just conjugate. And I cannot for the life of me imagine why that was left out, aside from the fact that it's not used very often. But for those who need it, that fact does not come with much comfort.

The BLAS API is also opaque. There's no agreed-upon way to access the low-level kernels; it's heavily implementation dependent. Maybe the library exports access to those kernels, maybe it doesn't, but there's no standard there. Some examples of why you might want to access these kernels: maybe you're trying to optimize a higher-level library, or maybe you're trying to implement a new BLAS-like operation without reinventing everything from scratch. Or maybe you're trying to conduct research on the kernels themselves, say, to perform some performance measurements.

Operation support also has not changed in these last three decades. It's really just the same BLAS operations that were available from the start that are present in modern BLAS libraries. There was a standardization forum around 2000 or 2001, I believe, that attempted to ratify some improvements, but those improvements were largely ignored by subsequent implementers, both open source and commercial. I say largely because I think there are some select operations that were picked up here and there, but certainly nothing standardized. And we think that this was simply because there was no reference implementation for all of the extensions that were proposed. But that's just our speculation.

So why does any of this mean we need BLIS? Well, the BLAS API is static and can't be improved. We can't gain access to a better API by building a better BLAS; we need something else altogether. When building a BLAS, the only choice you have is how you implement the API that you export. So this was one of the primary motivations for developing BLIS: we wanted to improve not only the implementation, but the interface as well.

BLIS addresses these interface issues in ways that correspond to the very shortcomings we discussed a few moments ago. We now have independent row and column stride properties, so we can store matrices in column-major, row-major, or general stride format with equal ease. Any input operand can be conjugated. You can access the low-level packing and computation kernels, if that is of interest to you. And operation support can grow over time as needed. This is why we refer to the B in BLIS as BLAS-like: we don't see ourselves as wedded to just what was in the original BLAS. This is why BLIS needs to exist: these features are, by and large, simply absent from other BLAS implementations. There is no other game in town if these are the features you need.

So now let's move on to the second question: why should we want BLIS, even if we don't need it? Well, you still might want to use it even if these features aren't absolutely critical to you. If you're an end user, you gain access to improved APIs. And if you change your mind and don't want to use those APIs, you can still use the compatibility layer, which exports the classic BLAS (and CBLAS) APIs. And as a developer, BLIS makes it easier to implement new libraries on new hardware: most of the work has been done for you. You just plug in a few missing pieces and parameters, and off you go.
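As a concrete taste of what that improved interface looks like, here is a sketch of a single call through BLIS's typed API that mixes all three storage formats and conjugates an operand implicitly. The function and constant names follow my reading of the BLIS documentation, so treat the details as approximate and verify them against your version:

    #include <blis.h>

    /* Sketch: C := alpha * conj(A) * B + beta * C, with A column-major,
       B row-major, and C a general-stride tensor slice. Three storage
       formats and an implicit conjugation, none of which the classic
       BLAS can express, handled in one call. */
    void mixed_storage_gemm( dim_t m, dim_t n, dim_t k,
                             dcomplex* alpha, dcomplex* a, dcomplex* b,
                             dcomplex* beta,  dcomplex* c,
                             inc_t rs_c, inc_t cs_c )
    {
        bli_zgemm( BLIS_CONJ_NO_TRANSPOSE, /* conjugate A, no transpose    */
                   BLIS_NO_TRANSPOSE,      /* use B as-is                  */
                   m, n, k,
                   alpha,
                   a, 1, m,                /* column-major: rs = 1, cs = m */
                   b, n, 1,                /* row-major:    rs = n, cs = 1 */
                   beta,
                   c, rs_c, cs_c );        /* general stride: anything     */
    }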
So how does BLIS make it easier to implement high-performance BLAS libraries? Let's start by looking at how general matrix multiplication, which is sort of the prototypical operation of the level 3 family, was implemented by Kazushige Goto in the GotoBLAS. For those of you who don't know, the GotoBLAS was the predecessor to OpenBLAS, and Kazushige Goto actually worked here at the University of Texas at Austin for a period of time in the aughts, I believe from maybe 2003 or 2004 until the end of that decade. This research was first published in an ACM TOMS article in 2008.

So Goto's algorithm, and we now call it the Goto algorithm, was built atop what he called an inner kernel, or sometimes just a kernel. That kernel consisted of three loops around a tiny outer product, and the whole thing was written in assembly. On top of that kernel were three other loops, which I've tried to illustrate here.

Unfortunately, the Goto algorithm, as performant as it was at the time, came with several drawbacks, especially when looked at through the lens of implementing an entire BLAS library, specifically the entire level 3 family. We'll go over each of these items in a little more detail, but just to run through them quickly: you couldn't take this one inner kernel for GEMM and easily recycle it for every other level 3 operation. The kernel was quite large: it consists of three loops, more or less (it's a little more complicated than that, but let's say it's three loops), and it's not fun to read. I invite any of you who are curious to open OpenBLAS or GotoBLAS and look inside the kernel, and you'll understand what I'm talking about. It's very difficult for newcomers to come along and learn how it works, or debug it, or anything like that. Edge cases also had to be handled explicitly. And you can't really parallelize anything in there, because it's all assembly. I suppose maybe you could, but I don't know anyone who likes to do parallelism inside assembly regions.

So let's look at the first item, recycling kernels. Why can't we recycle that kernel? It actually has to do with the differences between the level 3 operations. All of the level 3 operations, except for the last one, are matrix multiplication, but they're different kinds of matrix multiplication. For example, in the case of triangular matrix multiplication, the idea is to multiply by a triangular matrix, which means we have useful entries on and (in this case) below the diagonal, and zeros above. We want to multiply this matrix by another dense matrix, but we want to do so without computing with those zeros. We don't want to waste those computations, because we know anything multiplied by zero is zero. So in theory, this should take roughly half the time, half the flops, of a comparable GEMM. It turns out that we need variants for lower and upper triangular matrices, because you can store either triangle, and the triangle you store in determines what direction you move through the matrix. And it turns out that when you're multiplying with the blocks along the diagonal, the innermost loop bound in this direction depends on where you are in the loop. That adds extra complexity, and new blocks of logic, that simply are not part of the original GEMM.

Okay, let's look at the assembly code. Why is it so big? Well, it's three loops in assembly, there's lots of unrolling, and there's lots of edge-case handling, and it gets complicated in a hurry. I don't really know how else to convey that without going into gory details, but trust me, or look at the kernel yourself if you care.

Now, those edge cases. The inner kernel has a fundamental tile size: the register block sizes we call MR and NR, which are based on the register allocation used in the innermost loop, as well as the packing widths that are implied by it. On modern Intel and AMD architectures, you might see an MR of 6 and an NR of 8, so these are quite small. The reason these block sizes are germane to the edge cases is that once you define the blocking parameters, you have to ask: what happens when the matrix is not a clean multiple of those block sizes? You end up with leftover regions at the edges. The normal case with no edge I've illustrated here as a 6 by 8, but you need lots of logic to handle the cases where, say, your leftover in the NR dimension is 4, or maybe it's 5 and you handle it by composing a 6 by 4 with a 6 by 1. And of course the same thing can happen in the M dimension as well, and everything in between. Just ballparking it, I would venture to say that less than 50% of the Goto inner kernel belongs to the interior case; I think more than half of it is edge-case handling. And that, of course, results in a lot of assembly code bloat when it's all written out.
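To see where the bloat comes from, count the shapes a monolithic 6-by-8 kernel has to handle when each leftover size gets its own unrolled code path. A back-of-the-envelope sketch (not GotoBLAS's actual dispatch logic):

    #include <stdio.h>

    /* With register blocking MR x NR, a monolithic kernel faces the full
       interior tile plus every leftover fringe: (m % MR) x NR shapes along
       the bottom, MR x (n % NR) shapes along the right, and all the
       (m % MR) x (n % NR) corners. */
    int main(void)
    {
        int MR = 6, NR = 8;
        int interior = 1;
        int edges    = (MR - 1) + (NR - 1);    /* bottom and right fringes */
        int corners  = (MR - 1) * (NR - 1);    /* every corner combination */
        printf( "interior: %d, edge: %d, corner: %d, total: %d\n",
                interior, edges, corners, interior + edges + corners );
        /* prints: interior: 1, edge: 12, corner: 35, total: 48 */
        return 0;
    }

The interior case is one shape out of 48, which is consistent with the observation that well over half of such a kernel ends up being edge-case handling, even if some shapes are composed from smaller ones.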
Moving on to parallelization. This inner kernel is an indivisible unit. You can't really break it up for parallelism; you have to obtain your parallelism at a higher level, in the loops that surround the kernel.

Okay, so we've talked a little bit about the Goto approach and its drawbacks. Let's now talk about BLIS. How does BLIS make any of this easier? Hopefully it does. And in case you're curious, we published a paper in ACM TOMS in 2015 that introduced BLIS.

The first and biggest change in the BLIS implementation of the Goto algorithm (because it does use the same overall algorithm) is that it isolates a smaller kernel, which we call the microkernel. Once again, it's a single loop around that very tiny outer product. This microkernel design confers several benefits over Goto's inner kernel, and those benefits map onto the four drawbacks we just discussed. The recyclability is there because we now have a smaller kernel: the microkernel itself can be recycled between the various level 3 operations, which I'll get to in a bit. It requires fewer lines of assembly, because it encompasses a smaller unit of computation. It allows edge cases to be handled portably. And it exposes more opportunities for parallelism.

So let's go through these. These outer two loops were once buried in assembly code in the Goto approach. We've now factored those two loops out of the assembly region, and we express them in C, just regular C code. It turns out that virtually all of the differences between the level 3 operations existed within those two loops. So by factoring them out into C99, we no longer need different assembly kernels for different level 3 operations; all of those differences fall away. And so now our microkernel has shrunk to about 2,000 lines of assembly, whereas before the kernel was about 5,000. The two loops that we factored out are quite small: we're talking about 200 lines of C, and much of that is whitespace and comments. So this was really a no-brainer for us. It's a win, win, win.

And the microkernel consists of only one loop and no edge cases. So you might be thinking: if the edge cases aren't handled in the microkernel, how are they handled? Which is a good question. The BLIS framework only requires the kernel developer to focus on the main case, in this situation the 6 by 8. When an edge case is encountered in a level 3 operation, BLIS uses the following steps to handle it. It takes the small edge-case region and copies it to a full tile, zero-filling the entries that correspond to data that never existed in the original matrix. (These panels are zero-filled during the packing step.) It then computes with the microkernel normally, takes the result, and copies it back. This does come with a small performance penalty, maybe a couple of percent, and that should be pretty obvious: the copies are not free, the zero-filling is not free, and the extra computations with zeros don't buy you anything. You don't get any credit for those computations; they don't add to the flop count of the operation. But we do this anyway, because we think the trade-off between performance and productivity is worth it: it allows the developer to focus on the one microkernel, and that's all he or she needs to get up and running. We think that's a worthwhile trade.
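The steps just described are easy to sketch. Here is a minimal, hypothetical version of the zero-filling idea in plain C; BLIS's real packing routines are more general, but the principle is the same:

    /* Pack an (m_edge x k) sliver of A, where m_edge < MR, into a
       contiguous panel laid out as k micro-columns of height MR,
       zero-filling rows m_edge..MR-1. The microkernel then always runs
       its full MR-high case, and the framework copies only the m_edge
       valid rows of the result back into C. */
    #define MR 6

    void pack_a_edge( int m_edge, int k,
                      const double* a, int rs_a, int cs_a, /* source sliver */
                      double* panel )                      /* packed output */
    {
        for ( int p = 0; p < k; p++ )            /* each column of sliver */
        {
            for ( int i = 0; i < m_edge; i++ )   /* copy rows that exist  */
                panel[ p*MR + i ] = a[ i*rs_a + p*cs_a ];
            for ( int i = m_edge; i < MR; i++ )  /* zero rows that don't  */
                panel[ p*MR + i ] = 0.0;
        }
    }

Notice that the zero-fill happens during a copy the framework was going to perform anyway, which is why the penalty stays as small as it does.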
Moving on to parallelism. It turns out that this microkernel design, in exposing these two loops, also exposes more opportunities for parallelism. Whereas this whole unit used to be buried in assembly, now you can optionally parallelize either of these loops, in addition to the loops above them. And having more opportunities for parallelism is always better in dense linear algebra: it tends to lead to better load balance, smoother results, and better scalability.

Now, the microkernel: let's briefly talk about how it is implemented. I've depicted it here. We have our small tile, and we're multiplying it by these long panels. It's really quite straightforward. You drop into a loop, and within that loop you perform some vector loads on matrix A, loading its elements into a vector register or registers, and you similarly load elements of B into registers. You perform the outer product, and the result of that outer product is accumulated into a set of accumulator registers, which roughly correspond to the elements of the C matrix being updated. These are kept in registers as you iterate through the loop, and after the loop you take those values and write them back out to memory.
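In scalar C, that single loop looks roughly like the sketch below. This is a model of the computation only: the real microkernel keeps the MR x NR accumulators in vector registers and uses SIMD loads and fused multiply-adds, and the alpha/beta scaling shown at the end reflects the usual update C := beta*C + alpha*A*B.

    #define MR 6
    #define NR 8

    /* Scalar model of a gemm microkernel: one loop over k, each iteration
       an outer product of an MR-element micro-column of A with an
       NR-element micro-row of B. A and B are assumed already packed. */
    void microkernel( int k,
                      const double* a,     /* packed: k micro-columns, MR */
                      const double* b,     /* packed: k micro-rows, NR    */
                      double alpha, double beta,
                      double* c, int rs_c, int cs_c )
    {
        double ab[ MR * NR ] = { 0.0 };    /* the "accumulator registers" */

        for ( int p = 0; p < k; p++ )      /* the one and only loop       */
            for ( int j = 0; j < NR; j++ )
                for ( int i = 0; i < MR; i++ )
                    ab[ i + j*MR ] += a[ p*MR + i ] * b[ p*NR + j ];

        /* After the loop, scale and write the accumulators back to C. */
        for ( int j = 0; j < NR; j++ )
            for ( int i = 0; i < MR; i++ )
                c[ i*rs_c + j*cs_c ] = beta  * c[ i*rs_c + j*cs_c ]
                                     + alpha * ab[ i + j*MR ];
    }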
Some of you may have noticed that the algorithm diagram contains lots of blocking parameters. I've highlighted a dozen occurrences here, but there are really just five distinct parameters: NC, KC, MC, NR, and MR. The alphabet soup is decoded as follows: the M, N, or K refers to the dimension you are blocking over, and the C or R refers to whether it's a cache block size or a register block size. So NC, KC, and MC are cache block sizes, and NR and MR are the register block sizes.

How are they chosen? You might think that the first thing we would reach for is some sort of empirical search, and that is actually the approach ATLAS took for a long time. But one of our collaborators showed, in an ACM TOMS paper in 2016, that you can build an analytical model, something you scribble on a whiteboard or punch into a spreadsheet, that gives you very good approximations for these blocking parameters. Sometimes they can be tweaked beyond that, but it definitely puts you very close to what we believe is optimal.

I'm not going to spend too much time on packing, other than to say that BLIS handles packing of all the types of matrices found in the BLAS: general, symmetric, Hermitian, and triangular. The packing routines have to be highly parameterized, because there are different situations where you need to pack slightly differently, or the operation uses the packed data slightly differently. And it turns out that having the row and column strides allows us to collapse a lot of that complexity, so that's another way we know we made the right design decision with the strides.

In summary, BLIS factors as much complexity as possible out of the performance-sensitive kernel code, leaving only the microkernel. It significantly reduces the size and complexity of the kernels that have to be optimized. It provides generic, portable instances of the factored code, as well as the higher-level blocking algorithms, and it provides all the packing functionality, with no modification required beyond parameterization.

So for those of you still with me, I'm ready to show you some performance. Thank you for your patience. We were fortunate enough to have access to the EPYC 7742, AMD's top-of-the-line server part, in a 128-core (two-socket) system, and I'm going to show you results that I gathered on that system. But before I show you those results, I want to review how to interpret our graphs.

Here I have an example graph on the right. In this particular graph, the x-axis shows the problem size, where all matrix dimensions are equal, so we have square matrices. The y-axis is our preferred measure of performance, expressed in billions of floating-point operations per second, or gigaflops. And we scale the y-axis so that the top of the graph is the theoretical peak performance of the machine. We'll never hit the top of the graph, but the idea is to get as close as possible. We compare BLIS to some other implementations, provided by OpenBLAS, Eigen, and MKL. The nice thing about the way we scale the y-axis is that not only can we compare the implementations to one another and get a relative sense of their performance, but we can also compare each implementation to the theoretical peak and get an absolute sense of how well they're doing. And we're going to do this for a representative sample of the level 3 operations, on all four floating-point data types.

Yes, Bart? (Yeah, we can take questions now, or we can wait till the end, Field; that's up to you.) That's probably best. So Bart, if you don't mind, could you write down your question and hold it till the end? We'll have plenty of time then, I think. Thank you. (Yes, that makes sense.)

Okay, so now you understand how to interpret one of these graphs. And of course, in the alphabet soup here, the first character is the data type, the precision and domain of the matrices, and the rest is the operation name. So here we have single-threaded results: GEMM, symmetric and Hermitian matrix multiply, symmetric and Hermitian rank-k update, triangular matrix multiplication, and triangular solve with multiple right-hand sides, for single-precision real, double-precision real, single-precision complex, and double-precision complex. I put the legend in only one graph, just to declutter the whole picture for you.

It's probably common knowledge that MKL is a bit crippled on AMD hardware, and this appears to be intentional; I don't know why Intel made that decision. One exception is DGEMM: they seem to be okay with MKL's DGEMM doing well here, while being less okay with it doing well for the other three operations. And for the other data types, MKL completely falls back to generic kernels. We think it's simply because it interprets the CPUID instruction, and the first thing it sees is that the vendor field is not Intel, at which point it just sort of throws up its hands and pretends it's on a 2006-era SSE system, and uses SSE kernels. But the moral of the story for this graph is that BLIS performs consistently well across all operations, for all data types. There are no Achilles' heels with BLIS, no operation where you could point and say, oh, you forgot to optimize that one, because of our holistic approach, where we try to encode as much functionality with as little code as possible. A natural byproduct of that is that we hit everything equally; there are no favorites. We optimize every operation by virtue of the fact that we optimize the matrix multiplication itself.

Okay, so next, multithreading. You might be thinking: single-threaded is all fine and good, but show me multiple cores. BLIS definitely supports multithreading. We have four loops that are eligible for parallelism. The innermost loop is much, much too low-level to parallelize, so we forgo that one. And then this loop right here we have not parallelized yet, because it would require some synchronization: for those of you who are a little more familiar with dense linear algebra, each of these sub-problems updates the same matrix, so we would need a way for the threads to cooperatively avoid stepping on each other's toes when updating it. It's possible; we just haven't gotten around to it. You can also parallelize more than one loop at a time, and we do. To give you a sense, parallelism can be controlled simply by setting environment variables beforehand. We support both OpenMP and POSIX threads; we prefer OpenMP, because we get some thread-affinity functionality that is particularly useful.
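For example, the per-loop thread counts can be set through environment variables. The variable names below are from the BLIS multithreading documentation as I recall it, so verify them against your version; the sketch just shows the idea of setting them before the first BLIS call:

    #include <stdlib.h>

    int main(void)
    {
        /* Request 4 ways of parallelism in each of three loops; the
           total thread count is the product: 4 x 4 x 4 = 64. */
        setenv( "BLIS_JC_NT", "4", 1 );   /* ways in the JC loop */
        setenv( "BLIS_IC_NT", "4", 1 );   /* ways in the IC loop */
        setenv( "BLIS_JR_NT", "4", 1 );   /* ways in the JR loop */

        /* Alternatively, give BLIS one count and let it factor the
           parallelism across the loops itself:
           setenv( "BLIS_NUM_THREADS", "64", 1 );                 */

        /* ... then call bli_dgemm() / dgemm_() as usual ... */
        return 0;
    }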
So, for the multithreaded results, I'm going to use the same hardware as before. This is a two-socket system with 64 cores per socket. The top of the graph still represents peak performance, but we're now showing gigaflops per core; that is, we've normalized the y-axis. And we're comparing against the same implementations as before.

Once again, right off the bat, you can see BLIS dominates consistently, whereas the others struggle at times. Something else you might notice is that Eigen is present only on the GEMM graphs, and that is no mistake: Eigen does not appear, at least as of the last time I checked, to parallelize any of the other level 3 operations. So here we have 64 threads, which is one socket fully engaged, all cores. And here I have some alphabet soup that encodes how many ways of parallelism we got from each loop: four ways from the JC loop, four from the IC loop, and four from the JR loop. The total number of threads is the product of those ways of parallelism. I think the most interesting part of this graph is not MKL but rather OpenBLAS, because we consider OpenBLAS sort of our primary competitor, since we're on even footing: we're both open source projects, unlike MKL, which is a commercial product. Sometimes OpenBLAS does quite well, but other times it does less well, and I could not really venture a guess as to why they struggle with some of these operations. But it's interesting.

And now we have 128 threads, so both sockets fully engaged, all cores; the system is firing on all cylinders. In this configuration, the competition starts to fall away. The only one that gets close is MKL, for DGEMM. We've discussed already that they don't seem opposed to DGEMM performing decently on AMD hardware, but for the rest, I don't know. It may be that those other operations are not optimized; it could be a NUMA effect; I'm not really sure. But it's not difficult to get this performance with BLIS. You can certainly reproduce it using the information I've provided on the web.

To wind things down a little bit: we have multiple publications that report on many of the innovations I've touched on today. We have seven mainline BLIS papers, papers that deal directly with BLIS, and we also have a number of publications authored by friends and collaborators of ours. One of the great things about BLIS that we're quite proud of is that not only has it allowed us to create this new library, which by itself is quite useful to many people, but it has also facilitated related research in adjacent fields. So, for example, there's now a tensor BLIS library.
BLIS has been used to tackle some machine learning kernels, such as the k-nearest neighbors kernel, and it's also been applied to Strassen's algorithm, which many of you are probably familiar with. We've been fortunate to have generous funding from the National Science Foundation; we have several awards, and that has been the bread and butter of our funding. But we've also received gifts, grants, and hardware from industry over the years.

So the takeaway for today that I'd like you to have is that BLIS is more than just BLAS. It certainly can be used as a BLAS, but BLIS is really a superset of the BLAS. In some ways, I like to think that if we could go back in time and do the BLAS over again, BLIS is closer to what it would look like. It benefits even the most basic end users through a more flexible interface. It benefits developers by providing a portable framework that allows them to quickly instantiate libraries on new hardware, focusing only on a small, contained amount of assembly code that needs to be optimized, and it provides infrastructure for implementing new operations, if that's what they need. It also benefits researchers and experts by providing low-level access to kernels and a platform for experimentation and prototyping. We even have a foundation for mixed-domain and mixed-precision operations, which I haven't even discussed today. And BLIS benefits everyone by facilitating high performance and providing what we like to think is reasonably compact and readable code. And of course, it's free and open source software, available under a 3-clause BSD license, so there's no hiding things under the covers or trying to keep anything proprietary. It's open for anyone to use and for anyone to look at. And of course, we're very happy with the community that has coalesced around BLIS over the years.

So that's it for my talk. It looks like we have time. I also have backup slides that we can look at, maybe in the breakout room; there were lots of topics that were just too much for one talk. But certainly, I'm happy to continue talking about it later on. So with that, I'll open it up to questions. Thank you for your patience.

Thank you, Field. Yeah, we've got questions. I'm going to hand over to Bart first.

Yes, hello Field, I hope you can hear me. (Yes, I can hear you.) Okay, great. I have two questions. The first one is a bit more technical. I gather that by using less assembly in BLIS than in OpenBLAS or GotoBLAS, you also have less trouble that way, because one of the things we've been seeing with OpenBLAS is that as compilers get more aggressive, they expose bugs in the inline assembly kernels: they get the GCC inline assembly syntax wrong, they forget to save certain vector registers, things like that. So what I'm asking is: is BLIS more immune to that?

I suppose so; I would like to think so. First of all, I want to affirm that you're not the only one who has encountered these types of bugs. I recall one of our undergraduate students who worked on a POWER9 system. Last year (and we're actually quite proud of this) we had an undergraduate computer science major who, with the help of BLIS, wrote the highest-performance IBM POWER9 implementation of DGEMM that we are aware of; in other words, it beats the IBM vendor library.
But that little bit of bragging aside, one of the things he encountered was that he had to use the GNU compiler, because the IBM compiler had some sort of register allocation bug, where with the exact same code it would give us errors about not being able to use certain registers. So if that's the type of bug you're concerned about: we haven't encountered it much on Intel hardware at all. That was actually the first time I encountered it, in the IBM setting. But certainly, the fact that our code is simpler means, one could suppose, that there's less register allocation juggling for the compiler to do. And furthermore, because it's in assembly rather than intrinsics, the register allocation is mostly prescribed; it's really just a matter of the compiler accepting the registers we're trying to use as valid.

Yeah. What we've been seeing with OpenBLAS has been a bit frustrating in recent years: as we get newer compiler releases, they get more aggressive with register allocation, and that exposes bugs in OpenBLAS. There are some that have been there for more than 10 years, maybe even written by Goto himself. So it's really quite astounding. But that's only more of a benefit of reducing your amount of assembly language, because instead of having to fix a bug in one place, if it ever happens... (Yeah, absolutely.) ...it's 10 places in OpenBLAS sometimes.

Yeah, I agree. So rest assured that if those bugs ever come up, we will immediately try to fix them. But it has not really been an issue for us, thankfully.

That's good. And the second question is from a user perspective. We're seeing AMD advertising their own BLIS, AMD BLIS, but I also see that AMD has contributed back to upstream BLIS. Can you comment a bit on how these two relate to each other at the present time?

Yes, and forgive me, with this connection I couldn't hear every word you said, but it sounds like you're asking me to say a few words on the similarities and differences between vanilla BLIS and AMD BLIS. (Yes, exactly.) Great question. Let me make one thing clear: vanilla BLIS came first. We at the University of Texas, along with our collaborators, are the original authors of BLIS. Around 2015, AMD came to us. This was at a point when they were in a more difficult financial situation, by my understanding, and I think they had to let go of some of their people. They used to have a vendor library, a BLAS library, called ACML, and I believe it was completely scrapped in favor of open source solutions. I think the company changed its strategy: rather than trying to maintain all that code in-house, they wanted to build on open source software going forward. They weren't opposed to slight customizations, but they didn't want to build everything from scratch themselves; they wanted to take something open source as a starting point. So for the BLAS component of their software stack, they decided to use BLIS. And ever since then, they have been using BLIS, apparently to great effect, to optimize for their Zen architectures. Now, in my (granted, limited) experience, as I don't spend a whole lot of time using AMD BLIS, there is not as much forking or fragmentation between the two as you might think.
We try to keep the two roughly synchronized, in the sense that AMD likes to take major innovations that I push into the vanilla repository and integrate them into AMD BLIS, and similarly, big contributions to AMD BLIS I try to merge back into vanilla. So I think most casual users are, in my opinion, fine just using vanilla BLIS. The nice thing about vanilla BLIS is that we can provide support to you, as the authors of BLIS. We cannot provide support for AMD BLIS, so that's one consideration: if you find a bug and it looks like it's in an AMD-specific portion, we may not be able to help you. Does that answer your question?

Yeah, that kind of answers my question: they're rolling at slightly different speeds, so at any one time a particular sub-case can work better in AMD BLIS or in vanilla BLIS. But it's good to see that you have cross-contribution between the two.

Field, just something small to clarify there: it's not actually AMD contributing back to BLIS; it's you keeping an eye on what they are doing and pulling in what makes sense?

It's a little bit in between; that's a fair point. They do not have push privileges to the upstream repository, but they can submit pull requests. And the way it works is that I will review the pull request; I will often go through it, make changes, and really vet it, because I do have a high standard for the code that goes into BLIS. It's one of the ways I've kept it so neat, organized, and well maintained over the years. But certainly, thanks to AMD's support and involvement, they have pushed us to include certain things in BLIS that maybe would not have been included otherwise, or that would have taken longer on our own timeline.

It looks like there are other participants who have questions. Is that right? (Yes, I'll pass across to Åke now.)

Okay. First of all, is the performance pattern the same if you're doing matrices where M and K are not the same? Or maybe you haven't checked?

Good question. So it sounds like you're asking whether the performance signature, or the code path, is the same... (No, no: the graphs you were showing, would they look the same for elongated matrices?) Yeah, so the short answer is: it depends on how elongated. If we're just talking about a ratio of two-to-one or three-to-one, then we would expect the performance signature to roughly mimic what you saw in the presentation. But if we're talking about very skinny matrices, where you might have a small dimension of 10 and a large dimension of 10,000, then for those problem cases we actually have an entirely different code path that activates, because with these so-called skinny matrices it turns out that the packing step becomes prohibitive, or can be. So we have a different set of algorithms and kernels that don't require any packing at all. That was actually a project funded by AMD; it's a good example of one of those things where maybe we would have done it eventually, but they certainly prodded us to do it sooner. And I have a whole other set of performance graphs showing performance on these skinny matrices. That's another topic, but we're quite pleased with how BLIS performs under those conditions as well. In some cases we outperform MKL even on Intel hardware, so we're pleased with that.
It's highly dependent, though, on the specifics: is there one tiny dimension? Are there two tiny dimensions? How tiny is tiny? So it's really going to depend. But generally good, anyway.

The second question is: have you verified your code against uninitialized data use?

Uninitialized data. Okay, that's a great question. Let me clarify: when you say uninitialized data, do you mean for the matrices, like A, B, and C, or are you referring to having valid data in the operands?

One of the bugs I happened to find, back in GotoBLAS I think it was, was that one of the SIMD operations used a value that it shouldn't have; it loaded more data than it should.

Yes, I actually heard about that bug from a collaborator at my last place of employment. So, if it makes you feel better: I have run Valgrind, so in terms of things like memory errors and leaks, we know we're good there. We also run various test suites, including the official BLAS test suite that is part of LAPACK, and those run automatically through continuous integration on every commit, so they're always running. We think that's a pretty good safeguard against regression bugs, but it's always possible there's something out there of the kind you've described. Generally speaking, though, we're quite confident the kernels are doing exactly what they're supposed to do.

Yeah, that's usually one of the things I like to test, using the Intel compiler with its uninitialized-variable check flag turned on, for instance.

That's really funny. (Not so funny on most codes, actually.) Another thing that we're careful about is updating uninitialized matrices: A times B overwriting C, that is, when you set your beta to zero. The naive approach would be to just multiply C by zero, but if C contains infinities or NaNs, you can get propagation, of course. So in the case where beta is zero, you have to overwrite C, clobber it altogether, and we're careful to do that as well. (Good.)
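To illustrate the hazard just described: scaling by a literal zero does not erase a NaN or infinity already sitting in C, so the beta-equals-zero case must overwrite C rather than scale it. A tiny demonstration:

    #include <math.h>
    #include <stdio.h>

    /* Why beta == 0 must mean "overwrite C", not "multiply C by 0":
       if C holds uninitialized memory that happens to contain NaN or
       Inf, the naive scaling propagates it into the result. */
    int main(void)
    {
        double c    = NAN;     /* stand-in for an uninitialized C element */
        double ab   = 42.0;    /* freshly computed A*B contribution       */
        double beta = 0.0;

        double naive   = beta * c + ab;          /* NaN * 0 = NaN: poisoned */
        double guarded = ( beta == 0.0 ) ? ab    /* clobber C entirely      */
                                         : beta * c + ab;

        printf( "naive: %f, guarded: %f\n", naive, guarded ); /* nan, 42.0 */
        return 0;
    }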
And over to Victor for the next question.

I have two naive questions, from someone who doesn't understand linear algebra that well. First: would there be any cuBLIS, rocBLIS, or hipBLIS version, for GPUs, or is there only a CPU version of BLIS?

That's actually a great question, Victor. You've hit on a really key point, which is that a GPU implementation of any sort of BLAS-type library would potentially be quite different from the CPU implementation. We at the University of Texas at Austin like to focus on what we know and what we know we're good at, which at this point in time is the CPU-only side of things. We do have collaborators out there who are on the GPU side, in academia, granted, so they're not as focused on creating commercial-grade products. We've gotten lots and lots of questions: when are you going to do GPU BLIS? Can you do GPU BLIS? Can we give you money so you can do GPU BLIS? But for now, we have a very, very small group at the University of Texas, and we're not comfortable taking on those extra things at this time. So yes, the presentation I gave you was all about the CPU side only.

My second question, again a naive one: do you have any real need to write assembly code? Is there any way you could wrap these things up in some sort of SIMD library, or some other abstraction, that would let you write at a different level and make it even more portable?

Yeah, that's another great question, Victor. One approach would be a light approach: basically do the same thing we're doing now, except write intrinsics instead of assembly. Intrinsics are a little bit easier to use, right? But the problem with intrinsics that we found in the past is that if your compiler happens to regress in terms of its register allocator, you can get unnecessary spilling into memory: you have enough registers, but the compiler isn't using them properly; it doesn't realize that there exists a register allocation that would allow it to avoid spilling onto the stack. We've observed that firsthand, and it's awful. And the horrible thing about it, of course, is that you have no control over it. It's a bug in the compiler; you did everything right on your end with the intrinsics. So that's one of the reasons we like to write in inline assembly: there's no room for the compiler to trip up.

Now, you could go further and say: what if you wrote a tool that would output the assembly for you? We actually have a collaborator who has looked into that: Richard Veras, at Carnegie Mellon University (at least, he was at CMU at the time). He wrote a little tool whose idea was to automatically generate the microkernel code based on a few parameters you would input. Of course, it has to intrinsically know about x86 assembly, so if you were going to take it to a new machine, you would have to teach it the syntax and instructions of that architecture. All of that aside, it was a pretty neat concept, but I'm not sure he ever took it to completion. So people have looked at it, but I don't know that we have a tool that's ready to use day-to-day for the purpose of writing microkernels.

If I can make a comment about that: there is the code GROMACS, and they have some sort of tool for writing what they call their non-bonded kernels, so that when they support a new architecture, they just describe how the instructions should be emitted, and then they generate the kernels. So they can support a new architecture very quickly by leveraging this tool they've written for themselves. It's very specific to molecular dynamics applications and the kernels they implement, though, so it's not so generic that it could be used by you.

I don't see any other raised hands, so if you allow me, I have a third question. (Sure, go ahead.) There's the problem of supporting all these architectures, right? You showed results for Zen 2, but do you support everything, like ARM, and also all the other Intel x86 architectures, or is support slower-paced?

Yeah, so this is just a rough summary of what we support currently. I would say that we do effectively support Zen 3, although we don't have it officially supported yet; there's nothing about Zen 3, to my knowledge, that prevents BLIS from running on it, in case that was something your eyes zoomed in on. But yes, the support is quite wide.
Most of it is level 3 support, so we may be missing some AXPY kernels or some kernels for matrix-vector multiplication, and that's really just a function of not having the manpower, because we have to maintain the framework itself, among other things, aside from providing the architecture-specific kernels. But over the years we've cobbled together quite a bit of support.

Okay, thank you very much. (Certainly, yes.)

I see Åke is raising his hand again. We're 10 minutes from the next talk, so I suggest we move this into the breakout room if people have more questions, if you're okay with that, Field. (Oh, absolutely. Just tell me which breakout room to go into.) There's only one; I'll send you over there. (Oh, great.) Okay, I can pull you in there. So if anyone wants to listen in on the follow-up with Field, feel free to jump to the breakout room.