Hello everyone, and welcome to "Toolchains for People in a Hurry." My name is Victor Rodriguez, and here is the presentation I have for you. The agenda for today: the dynamic linker for multiple architectures, new errors and warnings detection, and the new ISA for deep learning that we also support in the latest versions of glibc and GCC.

Let's start with the dynamic linker for multiple architectures. The dynamic linker provided in GNU C Library version 2.33 loads optimized implementations of shared objects from the glibc-hwcaps subdirectories on the library search path. The basic idea is to check what hardware your application is running on and then find the proper path where the optimized library lives.

This story started four years ago. Here is a slide that I presented at the Open Source Summit North America, in September, in Los Angeles. At that moment we were presenting this implementation in glibc 2.26, and the idea was that we were using dl_platform and the old dl_hwcap CPU features so that the dynamic linker could transparently determine the platform, build an array of hardware-capability names, and add search paths when loading shared objects. In our minds it was like a smart octopus: the octopus receives a process, an application to run, and says "let me check what kind of shared objects I need," then searches the specific path where each library is supposed to be according to the hardware. In the example we used at that time, the application used CBLAS, which of course needs libopenblas.so. The linker would first search, at that moment, /usr/lib64/haswell/libopenblas.so, and if it was not able to find it there, it would try another place on the path, as you can see in the orange box, going one level up,
trying and failing, trying and failing. How did we organize the folders? It was simple at that moment. We had /usr/lib64/haswell, and inside haswell we could also have avx512_1, where we could put libraries compiled to use the ZMM registers and AVX-512. What is that? That is the kind of instruction set used by platforms like Skylake and more recently released CPUs. The important thing here is that this was a search path: glibc was trying to say, "am I standing on a platform that supports these features? Yes? OK, let's search in that directory. Am I on another platform, like Haswell? OK, then it's going to be /usr/lib64/haswell, and try to open the library there." Here are some examples of the implementation we did at that time: we put the libraries in the correct place, so when the application runs, the linker finds the library matching the platform and loads the shared object with the proper instructions for that specific hardware. Great. You can find more about this in a blog post called "Transparent use of library packages optimized for Intel architecture." So, what could we improve from there? What was one of the things that really needed improving?
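Not from the slides, but to make the try-and-fail cycle concrete, here is a minimal sketch of a most-specific-first fallback search like the one the pre-2.33 linker performed. The paths and the `pick_library` helper are mine for illustration; the real dynamic linker stats files on disk, while here a `present` list stands in for the set of paths that exist:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: return the first candidate path (ordered from
 * most specific to most generic) that actually exists.  "present"
 * stands in for the files on disk. */
static const char *pick_library(const char *const candidates[], size_t ncand,
                                const char *const present[], size_t npres)
{
    for (size_t i = 0; i < ncand; i++)        /* most specific first */
        for (size_t j = 0; j < npres; j++)
            if (strcmp(candidates[i], present[j]) == 0)
                return candidates[i];
    return NULL;                              /* not found anywhere */
}
```

If only /usr/lib64/libopenblas.so exists, the two more specific lookups fail first; that repeated failing is exactly the extra file-system traffic glibc 2.33 set out to reduce.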
Well, the cycle was pretty much: try to find the optimized library, fail; try to find the optimized library in another path, succeed; then go on to the next library. As you can see, when we don't have libopenblas in, say, the haswell subdirectory on a Haswell-class platform, it fails and goes one level up, one level up, until it says "/usr/lib64/libopenblas.so is where I found the library that at least satisfies my needs; at least it runs." So glibc would never compromise the functionality of the application: it tries to find the optimal shared object, and if that fails, it goes one level up. But that permutation of subdirectories increased the complexity of providing libraries optimized for a specific micro-architecture. That was the first problem. The second one: more overhead at process initialization due to the multiple file accesses. This came up, as always, in the open source community, and people tried new solutions. The solution in glibc 2.33 is simpler: a simplified search path that does not create sub-paths. That's important. We now go into the glibc-hwcaps directories. Imagine that I'm going to run the libopencv example I was using before, on top of a machine with AVX-512. Here is an example of an strace; by the way, the source code is available at a link I put here in the presentation, together with a Dockerfile that has the latest version of the compiler and the latest version of the linker, so you can go and play on whatever platform you have and see how it behaves. It's going to try first /usr/lib64/glibc-hwcaps/x86-64-v4, then x86-64-v3, then x86-64-v2.
What about these version numbers? The glibc-hwcaps levels for x86 were also defined in this latest version of the toolchain. Version 2, which is close to Nehalem, requires instructions like SSE3, SSE4.1, SSE4.2, and so on. Version 3 requires AVX, AVX2, BMI1, BMI2, FMA, and a few more. And x86-64-v4 requires most of the AVX-512 ISA. This is a much more natural way to cluster the instructions we are using, and a much more natural way to manage future additions. Here is part of the patch that was added to the compiler, because this matters for building code that matches the glibc hwcaps. One of the most used flags in the compiler is -march=, where you specify the CPU type: you can specify haswell, skylake, nehalem, icelake, whatever platform you want as -march on x86. Now, to match the glibc-hwcaps levels, the compiler also gained the capability to say: compile with -march=x86-64-v2, -march=x86-64-v3, or -march=x86-64-v4. With that in mind, you can say "I want my code optimized for this cluster of instructions." How does the generated objdump look? With gcc -O3 -ftree-vectorize -march=x86-64-v3, which as we remember is close to Haswell, the compiler tries to optimize for AVX2 and FMA, and in this code you can see that this loop is targeted for that kind of optimization. And yes, we can see it using the YMM registers and instructions specific to Haswell-class hardware. So when we compile directly for Haswell, the question is: am I losing something in the objdump?
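If you want to try this yourself, a small loop like the following (my own example, not the one from the slides) is the kind of code the vectorizer targets. Compiling it with `gcc -O3 -march=x86-64-v3` (assuming GCC 11 or newer, where the x86-64-v* levels exist) should emit YMM-register FMA code in the objdump, while the C semantics stay identical on any build:

```c
#include <stddef.h>

/* A classic multiply-accumulate loop: with -O3 -march=x86-64-v3 the
 * compiler is free to turn this into AVX2/FMA vector code; the result
 * is the same either way. */
static float dot(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```

Comparing `objdump -d` of the -march=haswell build against the -march=x86-64-v3 build is exactly the experiment described next.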
Is it doing something different? Well, here is a basic example. Of course there will be other cases, but at least for this basic test it is not: the exact code that we compiled for Haswell is the same code generated when we specify the corresponding cluster of instructions, x86-64-v3. Pretty much the same thing. So feel free to test it and use it. What is the next thing to do with this? As software developers here in Linux, it is also our responsibility to provide, or at least try to provide, optimized versions of our software according to the platform where it's going to run. The toolchains provide more and more tools that help us do that. The dynamic linker makes it possible to detect "hey, I'm standing on a platform with this kind of ISA; let's go and get the shared objects that are optimized for it." It is our responsibility to put the optimized libraries in the right place, and the compiler makes that easy for us with a nice flag: "compile for this, and put the result over there in that path," so the linker can easily grab it and load it. What about compatibility? The legacy subdirectories, like the Haswell one from before, are still present to keep things working, but they are deprecated and going to be removed in future releases. We don't know the specific date; it will be interesting to watch the glibc mailing list for that. But yes, they are going to be removed,
so it's necessary to move to these new paths.

Now, the other part I want to tell you about is new errors and warnings detection. Last year I had the pleasure to present -fanalyzer for the first time, and thanks a lot to the whole GCC compiler team, and especially to David Malcolm, because he and his team were responsible for adding, since last year, the first static analyzer incorporated as part of the compiler. So we don't have to go and worry about searching for other tools; it is now included in the compiler. Why is static code analysis important? It helps us check the sanity of our code before running the binary, before executing smoke tests or functional tests or stress tests. We can check the code and see if it actually matches what it's supposed to do, and discover security bugs, functional bugs, or even performance problems in some cases. Last year's presentation (there is a link for it) covered some of the flags that were introduced, but this year there are important updates: first, -fanalyzer was completely rewritten; second, good news, most of the bugs that were submitted have been addressed and fixed, which is great, thank you, David; and third, four new warnings were added for our convenience, and by the way they are enabled by default with -fanalyzer. The first one, which is very fun, is about shifting: shifting a value to the left or to the right. It's very useful; algorithms based on shifting for multiplication, division, whatever your application needs.
Shifting is something we might use in our code, and sometimes we might make the mistake of shifting by a negative number. When we compile just with plain gcc, it will not detect this kind of thing; it will let it go, run, and fail during execution. But now, with -fanalyzer, we are able to detect it, and it actually tells you where in the code the mistake is and the value involved, and says: "you're attempting a shift with a negative count, so please fix it." This is available in GCC 11. I also provide here a link to a Dockerfile with the source code of these examples and the compiler, so you can test it with your own code, or create another image based on it; up to you, and feel free.

Here's another one, also for shifting, but now for an overflow: an attempt to shift by a count greater than or equal to the precision of the operand type. Here we are doing something very, very ugly, which is shifting a char left by hundreds of thousands of bits; of course that's an overflow. The analyzer detects that and says "something is very wrong and very non-portable here; please fix your code." Of course the thing still compiles, unless we specify that warnings are to be treated as errors, but it compiles while giving you these warnings.

This is another one: diagnostics for when we attempt to write through a pointer to a const object. Here we have a const object and a pointer to it, and we're trying to write to that constant through the pointer.
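As an illustration, here is a minimal reproducer of my own (not the exact slide code). With `gcc -fanalyzer` in GCC 11, a bad shift count that reaches the shift along some path is reported (the analyzer warnings are named along the lines of `-Wanalyzer-shift-count-negative` and `-Wanalyzer-shift-count-overflow`), while the valid shift compiles silently:

```c
/* Shifting by a negative count, or by a count >= the bit width of the
 * type, is undefined behavior in C.  With -fanalyzer, GCC can report
 * it at compile time when a bad count provably reaches the shift. */
static int shift_by(int x, int count)
{
    return x << count;
}

/* Buggy caller, shown for illustration and never executed: the count
 * ends up negative, which the analyzer flags at the shift above. */
static int demo_bad_negative(void)
{
    int offset = 3;
    int count = offset - 5;   /* -2 */
    return shift_by(1, count);
}

/* Valid caller: shifting left by 3 multiplies by 8. */
static int demo_ok(void)
{
    return shift_by(5, 3);
}
```

At run time only `demo_ok` is safe to call; the point of the analyzer is that you find out about `demo_bad_negative` before any test ever executes it.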
Yes, -fanalyzer is going to tell you that what you're trying to do is not good coding practice, so please fix it. Also, this is something I have done before and I'm not very proud of it: trying to write through a pointer to a string literal. We declare a string literal and try, for example, to modify its first character through a pointer. The compiler is now going to tell you, "hey, you're trying to write to a string literal here," and that's not cool.

Now, the new ISA for deep learning. We have two new ISAs for deep learning here: VNNI and AMX, which we will cover in this presentation. The first one is AVX-512 VNNI, the Vector Neural Network Instructions. It's an x86 extension, part of the Intel Advanced Vector Extensions, part of AVX-512. It's enabled on Ice Lake, which was released a few months ago, so you're free to go and test it on the latest Xeon. It provides four new instructions: vpdpbusd and vpdpbusds, which work on 8-bit integer values, and vpdpwssd and vpdpwssds, which work on 16-bit integer values; the operands can be signed in one case or unsigned in the other, and the "s" variants saturate. These are the four instructions that were introduced. But where do these instructions come from, and how do we start to work with them?
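Here is a minimal reproducer of my own for the string-literal diagnostic. With `gcc -fanalyzer` the write through the pointer in `broken_edit` is reported as a write to a string literal, while the safe version copies the literal into a writable buffer first:

```c
#include <string.h>

/* Unsafe pattern, shown for illustration and never called at run time:
 * "hello" lives in read-only storage, so writing through p is
 * undefined behavior, and -fanalyzer reports the write. */
static void broken_edit(void)
{
    char *p = (char *)"hello";
    p[0] = 'H';
}

/* Safe pattern: copy the literal into a mutable buffer, then edit. */
static char safe_edit_first_char(char *buf, size_t n)
{
    strncpy(buf, "hello", n - 1);
    buf[n - 1] = '\0';
    buf[0] = 'H';
    return buf[0];
}
```

The same idea applies to the const-object diagnostic: the fix is always to write to memory you actually own.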
The first thing: the major motivation behind the AVX-512 VNNI extension is the observation that many convolutional neural networks require two things: the repeated multiplication of 16-bit or 8-bit values, and the accumulation of the results into a 32-bit accumulator. In a nutshell, we multiply a0 times b0, a1 times b1, and then add those together; the instruction actually extends this to a0, a1, a2, a3 and so on. As you can see, with small values accumulating into 32-bit results, you can pack many values into a 512-bit register, and the good thing is that you can use 512-bit, 256-bit, or 128-bit registers. But the instructions are only available on specific platforms, like Ice Lake, which supports all of those register widths.

With this in mind, you might ask how this actually maps to something like a convolutional neural network algorithm. And here's a disclaimer: my field is not machine learning or deep learning, but I'm going to try to explain this with a basic example. In the figure we have an RGB image that has been separated into three color planes: red, green, and blue. Of course, this is something very simple, a tiny four-by-four matrix, but you can imagine how computationally intensive things get with images of real dimensions, like 4K. The role of the convolutional neural network here, the important thing to understand, is to transform this into something much easier to process, but without losing the important information, the features that are crucial for image processing, image prediction, face recognition, or anything else you want to do with your image.
So that's the idea. Now, this is an animation that I find very illustrative, because here we have the red, the green, and the blue planes, and we have kernels for channels one, two, and three, which are nothing more than a kind of filter. We apply the kernel to the matrix, and the results for red, green, and blue are added together; we also have a bias, which is just a number added as an offset (what a proper bias value would be is definitely a question for a deep learning expert). The important thing to notice is that, as compiler people, we can see there is a pattern here: you do a matrix multiplication, then add up the values, then add another value that you defined, generating another matrix as a result, as you can see in the animation. That's pretty interesting. How can we turn that into a simple instruction? It's pretty much what the instruction does: a multiplication, accumulation of the results of each one, and you can also add another value that you define, like that bias, to get the result. So wait a minute: you're changing the whole inner loop of the algorithm into something like this? Pretty awesome.

So how do we handle this new instruction in my own code, in my own library? Here it is. The first point is to create the arrays that hold the data; the second is to load the data from the arrays into the registers, and this is important because the instruction is a register-to-register instruction, not a memory-to-register one.
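To see why this pattern maps so well onto a fused instruction, here is a plain-C model (my own sketch, independent of the slides) of one output element of that convolution: elementwise multiply of kernel against image patch, accumulate across the three channels, then add the bias:

```c
#include <stdint.h>

/* One output element of a 3-channel convolution with a 3x3 kernel:
 * for each channel, multiply the kernel elementwise against the image
 * patch and sum, then add the shared bias.  This is exactly the
 * multiply-accumulate pattern that VNNI/AMX-style instructions fuse. */
static int32_t conv_element(const int8_t patch[3][9],
                            const int8_t kernel[3][9],
                            int32_t bias)
{
    int32_t acc = bias;
    for (int c = 0; c < 3; c++)        /* red, green, blue */
        for (int i = 0; i < 9; i++)    /* 3x3 window */
            acc += (int32_t)patch[c][i] * (int32_t)kernel[c][i];
    return acc;
}
```

Every multiply feeds a 32-bit accumulator, which is precisely the shape of computation the new instructions accelerate.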
Actually, when we read the documentation of the VNNI instructions, it says they take three inputs from three different registers, and the registers can be 128, 256, or 512 bits. So we need to load the data we have in our arrays into the registers. There are specific intrinsics the library provides for us, like loading from a specific pointer, a specific location in memory, into a vector register, which is good. We need to be careful about range and size, and about alignment, and things like that, but nothing impossible. Here is a link to a GitHub repository where I have the source code, the Dockerfiles, the Makefiles, and the compiler; you can use the one we were using previously, or, if you want to compile from master, put that into your own Dockerfile. It's important to highlight that, to check the objdump and also to finish the compilation, you might also need the latest binutils; it's a full set, we need the compiler and binutils for this.

How do you compile it? That's the fun part: pretty straightforward, gcc -O3 -march=icelake-server and the source code. The compiler immediately finds the intrinsics and says, "OK, I need to translate this into a specific instruction." After the lexical, syntax, and semantic passes, the compiler can mostly skip guessing during optimization for this part, because the intrinsics library has already nailed down for the compiler exactly which instruction this needs to be translated into. So when we do the operation at step two, it executes, in this case, a vpdpbusd. In this example we have 8-bit values: a value of eight, a value of one, and an accumulator value of three.
So let's go and see what result we get. According to the documentation, the instruction multiplies a0 times b0, a1 times b1, a2 times b2, a3 times b3, and then adds the four of them together, so four times eight in this case, 32, and then adds whatever we had in the source accumulator, which is three, giving the result 35. As you can see, that's exactly the operation we performed. I put exactly the same value in every lane on purpose, to see that it matched; it's a basic example, just for the didactic purpose of proving that I can manage the data and get the result I need. I know it's very basic, but it was just for that purpose. And yes, as I said, you just compile with -march=icelake-server and it works fine. You might wonder how to test it if you don't have the hardware; at the end of the slides I'm going to talk about that.

Now, this is a timeline of the micro-architectures and toolchain support we have seen: for example, AVX-512 VNNI, which is what we just covered; then there was another milestone with the Intel Deep Learning Boost technology, which is AVX-512 BF16; and then AMX, which is what we're going to cover now. The Intel Advanced Matrix Extensions are a new x86 extension, and Intel AMX introduces two new things: new instructions, and, to handle something different, new registers. This is the first time we get a two-dimensional register file. I know, super strange, but let's go see it: those registers are called tiles. What is a tile, what is an Intel AMX tile? Pretty much it's like a tile in your kitchen or on your patio,
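Since the intrinsic itself only runs on VNNI-capable hardware, here is a scalar C model (my own, following the documented per-lane semantics) of what `vpdpbusd` computes for one 32-bit lane: four unsigned-byte times signed-byte products, summed and added to the accumulator. With every a byte set to 1, every b byte set to 8, and the accumulator at 3, the lane ends up as 4·(1·8)+3 = 35, matching the example from the slides:

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of vpdpbusd: multiply four unsigned
 * 8-bit values from a with four signed 8-bit values from b, sum the
 * four products, and add the running 32-bit accumulator src. */
static int32_t vpdpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4])
{
    int32_t acc = src;
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

The real instruction simply performs this for every 32-bit lane of a 128-, 256-, or 512-bit register at once.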
or on your roof. It's a two-dimensional array. The matrix register file is a set of tiles, and the developer has the capability to manage eight of them, from tmm0 to tmm7, each having a maximum size of 16 rows by 64 bytes per row, which gives us a one-kilobyte register. One kilobyte per register, and eight kilobytes in total for the eight registers we have. Now, as a programmer you don't have to use the full one-kilobyte register; as a developer you can configure the tile to whatever size you want. For example, in this case a tile was configured as two by four, or three by four. And the accelerator instructions that are also provided in the new ISA can do the dot-product multiplication as naturally as we do it on a piece of paper; it happens in the hardware now.

What kind of operations can we perform? That's a really great question. The first thing is that we need to configure the tiles with ldtilecfg, which takes a configuration structure: the developer configures the size of each tile register, what its dimensions are going to be. I provide an example here; the link is in the same place, with the way to build that configuration yourself. Alongside configuration, you also have specific instructions: these are for configuring the tiles, and you can also load a tile, store a tile, zero a tile, and release the tile state. Those are for tile configuration and data movement; the others are for the tile operations themselves. OK, now that you have configured a tile, what do you want to store in it: int8 values or bfloat16 values? And once you store the proper ones, what do you want to do with them?
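The configuration block that `ldtilecfg` consumes is a 64-byte structure. The sketch below follows the layout used in the Intel and GCC intrinsics examples (palette id, start row, per-tile column widths in bytes, per-tile row counts); the struct and field names are mine:

```c
#include <stdint.h>

/* 64-byte memory layout consumed by ldtilecfg (field names are mine;
 * the byte offsets follow the documented format):
 *   byte 0      : palette id (1 selects the initial tile palette)
 *   byte 1      : start_row, used when resuming interrupted tile loads
 *   bytes 2-15  : reserved, must be zero
 *   bytes 16-47 : colsb[i], bytes per row of tile tmm_i (16-bit each)
 *   bytes 48-63 : rows[i],  row count of tile tmm_i (8-bit each)   */
struct tile_config {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];
    uint8_t  rows[16];
};
```

Setting, say, `rows[0] = 2` and `colsb[0] = 16` would shape tmm0 as a 2-row tile of 16 bytes per row, which is the "configure only the size you need" idea from the talk.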
Because you don't configure the data type when you configure the tile, you just configure its shape. So, what kind of operation do you want to perform: AMX-INT8 or AMX-BF16? For int8 you can have signed-signed, signed-unsigned, unsigned-signed, or unsigned-unsigned pairs, according to the dot product you want to perform. That choice does not apply when you have bfloat16 tiles, because there you simply perform the dot product of bfloat16 data. So we have configuration, data movement, and operations.

Now, this is a full example in C that actually performs this, and here I have multiple things: bfloat16, signed-signed int8, and so on, with some basic configurations. The thing is very, very basic; it's actually taken from the GCC main branch, as part of the tests that were uploaded, and you can take it as a reference for your own examples.

Now the main question: how do I test this if I don't have the hardware support? My hardware is very old, I don't have those instructions; how can I develop for them? Don't worry, there is something called the Intel Software Development Emulator, and it's very easy to use. How easy? Well, you download it (here is the link), you decompress it, and you use it. So here I have my SDE, and this is the binary that it provides, sde64. You specify the platform, and you might wonder:
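To make the tile dot product concrete without AMX hardware or the emulator, here is a scalar model of mine (not the GCC test, and not real intrinsics) of what `tdpbssd` computes on small tiles: A is M rows of 4·K signed bytes, B is K rows of 4·N signed bytes, and each 32-bit accumulator in C gains the sum over k of the four-way byte dot products:

```c
#include <stdint.h>

#define M 2
#define K 2
#define N 2

/* Scalar model of tdpbssd on tiny tiles: for each C[m][n], accumulate
 * the 4-way dot product of A's byte group k in row m against B's byte
 * group n in row k, summed over all k.  The hardware does this for a
 * whole pair of tiles in one instruction. */
static void tdpbssd_model(int32_t C[M][N],
                          const int8_t A[M][4 * K],
                          const int8_t B[K][4 * N])
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
            for (int k = 0; k < K; k++)
                for (int i = 0; i < 4; i++)
                    C[m][n] += (int32_t)A[m][4 * k + i]
                             * (int32_t)B[k][4 * n + i];
}
```

With every input byte set to 1, each accumulator receives K·4 = 8 products of 1, which is an easy sanity check in the same spirit as the VNNI example earlier.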
AMX tiles, where are those going to be? Well, they're going to be software-emulated, wrapped in an emulation of an upcoming x86 platform, and I can start to test my code and take advantage of those features. So here is the binary that I created, and we pass sde64 the flag for the target platform that has AMX, followed by the binary; it runs, taking the binary and emulating the instructions that are not provided by your hardware, and you will be able to see how those instructions behave. And if you want to see the internals, how many times those instructions were executed, from what specific point they were called, and much more, you can use the -mix option, which generates an output file. That file gives us, for example, the number of times each instruction was executed. It's very useful, because we can compare against our loop and say, "hey, how many times did you call this thing?" You can put it in a loop of, say, a thousand iterations,
and see that it was called a thousand times, so it makes sense; and there is more analysis you can make with it.

Now, this is the end of "Toolchains for People in a Hurry," the latest features that were added. I want to thank you so much, and as a closure, these pictures are here on purpose. I want to highlight that the toolchains are no longer just a tool, a hammer to help us create new software. They also help us discover what is coming in terms of new ISAs and new features we will be able to use: to provide our software in a proper way, optimized, and loaded immediately in an optimized form for the user; or even to detect when we have a bug, in security, functionality, or performance, by static analysis with the same compiler. There are many more features that I encourage you to go and check in the release notes of the latest versions. The communities are out there to help us, and they are working very hard to provide new and cool features. So I encourage you to go, like in this picture, and search for what is new, and enjoy the journey of using the newest and latest tools. Thank you so much.