Hello everyone, my name is Victor Rodriguez and today I'm going to be presenting "Innovating with Toolchains in 2023." What are toolchains in the first place? Well, toolchains are a topic that I really enjoy talking about. A toolchain is the set of tools that developers use to transform source code into an executable machine language program. The code that we write as developers needs to pass through all of these tools. Imagine the tools as in the picture: we have a hammer, we have a screwdriver, all the things we need to turn raw materials into a robot, a piece of furniture, an electronic device. In the same way, the toolchain is responsible for taking source code and transforming it into the proper executable machine language program. The first part of the presentation is going to cover the new features that the GNU toolchain provides in 2023. There is a bunch of exciting features that we're going to talk about in today's presentation. But before we start, the first topic is the compiler. The compiler is responsible for transforming C code into an assembly program. It is also responsible for checking whether there are errors in our source code: lexical errors, syntax errors, semantic errors. And not only that — if we desire, it can optimize. There have been previous blogs and presentations about the kinds of optimizations the compiler has gained over the years: loop unrolling, loop interchange, eliminating parts of the code that are never reached or used, and vectorization using new instructions for specific new hardware features. It's a bunch of things. Now, that's the job of the compiler.
In the end, it generates a set of instructions in an assembly program that the assembler is going to take and transform into an object machine language module — the foo.o file. But that's not the only thing. The linker then takes that object file and joins it with the libraries, in machine language form, into the executable. That generates an a.out. That's the default name that we get, and it's very funny, very interesting — somebody asked me, why a.out? It's because it stands for "assembler output"; that's where the name comes from. The executable machine language program is now ready for the loader to take and put to work as a process. Now it's the job of the kernel, the job of the operating system, to say: I'm going to transform an executable into a process — and there are much better presentations about memory management and scheduling outside the scope of this talk. Now, the first exciting new feature that has been added in this latest version of the toolchain is a wrapper for an interesting system call that glibc did not expose a few years ago: process_mrelease. It releases the memory of a dying process. Better memory management for applications has been an interesting topic forever, right? And that is something that we cannot ignore. One of the fundamental invariants of computing is that regardless of how much memory is installed in the system, it's never enough. And these are not my words — these are the words of Jonathan Corbet in a very interesting article about this feature published in 2021. Now, killing a process must be very simple, right? You say, hey, I'm going to kill this process, and the memory must be released immediately, right away. Well, in the real world, it doesn't work that beautifully, that straightforwardly.
However, we must keep that idea in mind: trying to optimize the use of memory is important for all kinds of systems, small or big. Whether it's an embedded system or a big high-performance computing system, all of them require proper, very optimized use of memory. That's for sure. Reclaiming resources for the users by killing a process doesn't work as quickly as expected. In this very basic representation or simulation of what I'm trying to say, imagine that you have a set of memory, and at a specific time you say: I'm going to kill this process. Well, it doesn't free the memory right away, in that specific microsecond. There is a ramp-down in the freeing of the memory. That ramp-down is something we as developers assume is necessary, but it's not. Now, the dying process itself is responsible for cleaning up and freeing its resources — something very interesting to know. However, if the dying process finds itself blocked in an uninterruptible sleep, that cleanup work could be delayed forever. There are other factors that can slow down the freeing of memory, including how busy the relevant CPU and the entire system are. For example, if the system is super busy, it might not have time to go and clean up right away — it's trying to do other high-importance tasks, just saying. Also, if the CPU is running at low frequency or in a low-power state, that could also affect the speed of the memory cleanup. The latest glibc release has added a wrapper for the proper new system call, process_mrelease. The process_mrelease function has been added to release the memory of a dying process regardless of the caller's CPU affinity, priority, or the CPU usage at the time of the call. And here is the article I mentioned that covers process_mrelease — it's very fascinating.
And here is an example that I would like to talk about in this part, which is: how can we test this? This is a test that is already part of the glibc master trunk. It's very interesting, and it's straightforward. We fork a child process, and the child is going to call a subprocess function. The subprocess function creates a timer, and after the expiration of the timer, it kills the process by itself. Now, in that killing of the process, we already have the process ID of the process that was created as a child. So we say: hey, I'm going to use the new system call to free up that memory at that specific moment. It's a basic test, but it's very useful to illustrate how to use it. The other recently added system call wrapper in glibc is process_madvise. The process_madvise system call is used to give advice or direction to the kernel about the address ranges of another process or of the current process. The goal of such advice is to improve system or application performance — very useful. There is also a very interesting article about it, like the one we have about process_mrelease. And let's talk about the taxonomy here in this specific part of glibc. On GitHub you can find a specific test case for process_madvise that the glibc developers added along with the support for these new features. Now, let's see the taxonomy of how you provide the advice. The first argument identifies the process to whom you're going to give the advice — a pidfd, a file descriptor referring to that process. The second is the iovec pointer, which points to an array of iovec structures; there you specify the starting address and the length of each region. The iovec structure describes an address range beginning at a base address and spanning the length you want to cover.
It specifies the region — the iovec describes the exact address range where you want to apply the advice. After that, the next thing that we specify is vlen, which is the number of elements in the iovec array. This value must be less than or equal to IOV_MAX. Advice — how do I provide the advice? Well, the advice argument is one of the two following values: MADV_COLD or MADV_PAGEOUT. MADV_COLD deactivates a given range of pages. It makes the pages a more probable reclaim target. This is a non-destructive operation. The advice might be ignored, of course, if some pages in the range are not applicable. The other one is MADV_PAGEOUT: instead of deactivating, it reclaims a given range of pages. This is done to free up the memory occupied by those pages. If a page is anonymous, it will be swapped out. If a page is file-backed and dirty, it will be written back to the backing storage. And the advice might be ignored, of course, if some pages in the range are not applicable. The advice might be applied to only a part of the iovec: if one of the elements points to an invalid memory region in the remote process, no further elements will be processed beyond that point. That's very interesting to know. On success, process_madvise returns the number of bytes advised. This return value may be less than the total number of requested bytes if an error occurred after some iovec elements were already processed, so the caller should check the return value. Of course, minus one is the error return. Here is another example, the test of process_madvise in the glibc open source project. Another interesting feature is hardware-tuned libraries — how we can have tuned versions of a shared library. And I like this picture because we have a set of tools that are pretty much the same thing, just in different colors. Imagine the same in this situation.
Every year, hardware vendors provide new hardware capabilities for the incoming CPUs — either a new microarchitecture or new accelerators with new instructions. Now imagine this situation. You have a system with an old version of the architecture that at the time got a specific new instruction that was very, very useful. But a few years later, we have a much better version of that same instruction, with better accelerators, new registers, a shiny new microarchitecture that improves things even more. Now we have two instructions, one architecture: one released, let's say, five years ago, and another one released this year, both from the same family. A few years ago, our software team worked very hard to optimize a library for that older architecture. Now we have to optimize the library for the new ISA. But wait a minute — we might have some users with the previous version of the hardware, and some other users that are going to get the new instructions of that architecture in the latest CPUs. How can we as software developers, as operating system distributions, serve both of them in a transparent way, so that developers and especially users don't have to care about it? Well, this is possible with hardware capabilities — hwcaps — in glibc. Recently, more and more distributions have adopted this elegant solution for x86-64 microarchitecture level support. And this is thanks to the dynamic linker. The dynamic linker is going to load optimized implementations of shared objects from subdirectories called glibc-hwcaps. These directories are part of the library search path. The linker checks what platform it is running on, and based on that, it dynamically links those specific optimized libraries when needed.
So let's say you have the old version of the hardware; under the glibc-hwcaps directory you will find specific subdirectories. The subdirectories are called x86-64-v2, x86-64-v3 and x86-64-v4, searched in that priority order. The dynamic linker is going to search: hey, I'm running on this specific platform — do you have something for this platform under this directory that I could use for dynamic linking? Yes, I do. Okay, so I'm going to link against that. The initial support covers the directories for versions two, three and four of x86-64. Now, there is a previous presentation that we did at LinuxCon 2017 about the birth of this idea, many years ago — as you can see, six years ago. The implementation was different, but the idea was exactly the same: have specific subdirectories, build the library with the compiler optimizing for that specific architecture with the new instructions, and put those libraries where the dynamic linker is going to find them. It still works in the latest version of glibc. Now, the backbone of this is the logic followed by the new feature in glibc that enables libraries to use new CPU features. If a library is substantially faster on a specific new ISA with a specific accelerator, the developer provides different versions of the library — it's just a matter of building different versions and putting them into the specific directories. One version uses the new feature and is super fast; one uses the previous feature and might be a little bit slower compared to the new hardware — but hey, if somebody does not have the new hardware, it still works. And the best part of it is that it's completely automatic. The only job the builder has to do is to provide the compiled library versions in the proper subdirectories for x86-64.
glibc will automatically load the appropriate libraries at dynamic link time, matching the version required by the specific hardware that you have. And you don't need to do anything else. Example — let's make a quick example. You can take Clear Linux — docker pull clearlinux — and inside you will see that under /usr/lib64/glibc-hwcaps you have x86-64-v3 and x86-64-v4. It's straightforward; you can repeat the same experiment. Now, I made a very, very basic piece of code that does an operation on two arrays — adding two arrays many times. It's a very basic and simple stress test, one that is going to call some specific math parts of glibc. Perfect. When we run strace on that, we are able to see that, hey, wait a minute — it's actually doing the openat system call for /usr/lib64/glibc-hwcaps/x86-64-v3/libc.so.6. Hey, that's interesting. So it's reading the libc that is under glibc-hwcaps/x86-64-v3. When we run objdump on that specific binary, we can see the binary is optimized: it uses the ZMM registers, and the ZMM registers, as you can see in our presentation, are 512 bits long — very, very large registers, available with the AVX-512 instruction sets — like, for example, on the Skylake machine that I was using for this experiment. Now, another feature that is very interesting in GCC 13 is an improvement in libstdc++. We're going to put libstdc++ on a diet this year, for the sake of our review. What is libstdc++? It's the standard C++ library. It is needed for most of the things that we have in compiled C++ code — even the simplest hello world, which you'll find in every first C++ program.
#include <iostream>, then in main, std::cout << "hello world", and return 0. But what actually is #include <iostream> in C++? iostream stands for standard input/output stream, and the iostream header declares the objects that control reading from and writing to the standard streams. In other words, iostream is the library that provides input and output functionality using streams. And what is a stream? A stream is a sequence of bytes. You can think of it as an abstract representation of a device like the terminal or the keyboard. You can perform I/O operations on the device through this abstraction, and you must include the iostream header to interact properly with these kinds of devices — like writing to the screen. Now, one of the many enhancements that we have in GCC 13 — and it's very well described in the blog post from Patrick Palka about a leaner <iostream> in libstdc++ for GCC 13 — is that it puts <iostream> on a diet. In the current latest version, GCC 12, including <iostream> has a problem: the translation unit (TU) introduces a global constructor into the compiled object file — into the result of the compiler that we saw at the beginning of the presentation, the .o file — one that is responsible for initializing the standard stream objects at program startup. In contrast, in GCC 13 that will not be the case anymore: the initialization of the standard stream objects moves into the shared library, which is available for everyone. The benefit, of course, is reduced executable size, reduced link times, and reduced startup times — in particular for programs that make heavy use of iostream. Now, using Compiler Explorer, you can see that when I have a very basic hello world, I can compile with trunk (master) and you see a very tiny amount of assembly after compilation; then you use the latest release, GCC 12.2...
...and it has much more, because it includes all the standard stream object initialization and program setup that, in GCC 13, moves to the shared library. What about the new x86 ISA support? That's going to be super interesting. Every year, when hardware vendors introduce new instructions, the GCC team works in collaboration with them to enable those new instructions in the compiler, so all developers can take advantage of them and it becomes possible to use them in our applications. This year, GCC 13, in terms of x86, has new enablement for Raptor Lake, Meteor Lake, Sierra Forest, Granite Rapids and Grand Ridge. This presentation doesn't have the complete scope; we will describe only Granite Rapids and Sierra Forest. Now, Granite Rapids is the evolution of Sapphire Rapids, which is close to release; the Granite Rapids hardware is coming later. The interesting part about the new instructions is that AMX-FP16 was added. For Sierra Forest, others were added: AVX-IFMA, AVX-VNNI-INT8, AVX-NE-CONVERT, and also CMPccXADD. Let's go with Granite Rapids first. Granite Rapids gets a dot product of FP16 — not BF16 — so you know the impact on precision. Now, in a previous presentation we talked about what a tile is and what AMX is. Just as a summary, a tile is a two-dimensional array. Yes, it's a very fascinating two-dimensional register, since what we traditionally have are one-dimensional registers — the ones we've been seeing, like XMM, YMM, ZMM; yes, they grew from 128 bits to 256 bits to 512 bits. The tiles are two-dimensional arrays. In the presentation that I link over here, I describe more about the microarchitecture of the tiles and the new instructions, which in this case are AMX.
AMX has a bunch of new instructions for doing interesting matrix operations, and one of those, of course, is the matrix multiplication dot product. Now, in the recent GCC 13, we are adding a new instruction for the incoming Granite Rapids which does a matrix multiplication of FP16 elements from tile 2 and tile 3 and accumulates the packed single-precision result into tile 1. Now, that's fantastic if we code in assembly, but most of us don't write our applications that way. What if I want to code in C? Well, for that we have something called intrinsics. The intrinsics are very simple to use: you include immintrin.h and then you can use the specific function — for example, _tile_dpfp16ps — and you pass the destination, the position of tile A and the position of tile B. Of course, there are specific instructions for moving data from memory into tiles and from tiles into memory, so that you can access them. Now, how can we test this? There is a very interesting example already in the GCC master trunk from when they added support for this. It was requested, of course, to validate that it actually works, and it shows us as developers how to use it. The example is very interesting because they load exactly the same values into the tiles, then they write a plain calculate-matrix function that returns a reference result, and at the same time they use the new instruction that will be available on Granite Rapids — and the results must match exactly. If they don't match, of course, it reports an error. That's the way it is tested. Now, let's move to Sierra Forest. Sierra Forest is going to be very interesting because it has a new set of instructions, and one of those is VPMADD52LUQ — L for low; there is also another version for high. What is it going to do?
It's a packed multiply of unsigned 52-bit integers, adding the low 52 bits of the 104-bit product into 64-bit qword accumulators. Now, where is the application of this? In the previous case, it was very straightforward to think about an application of a dot product of FP16 tiles: it maps to high-performance computing, weather forecasting, you name it, right? Well, this kind of instruction is very useful for cryptography, and there are other presentations on other topics that describe the use of these instructions — there are good papers about the use of this instruction in cryptography and how to take advantage of it. Now, the good news is that it's going to be available from the intrinsics perspective in GCC 13. The interesting part is that developers can now take advantage of that instruction. How? Well, using the intrinsics: _mm256_madd52lo_epu64 — there is also a 128-bit version — which does this operation: multiply the packed unsigned 52-bit integers, take the low (or, in the high variant, the high) 52 bits of the product, and add them to the accumulator. Now, there is also an example over here, already in the master trunk: it takes a destination and two sources, src1 and src2, and compares the result against the one produced by the plain calculation function that the GCC team wrote to validate it. What about AVX-VNNI-INT8? It's very interesting. Sierra Forest, you should know, will not have support for AVX-512. However, it will still have support for specific instructions like VNNI for INT8 precision that are very, very useful for machine learning. Now, what is the core of this thing?
Well, as we have seen in a previous presentation about VNNI that we did years ago, the idea — the core mechanism behind VNNI — is to implement in hardware the same operation that convolutional neural network algorithms perform. It performs that same operation, but in a single new instruction it will multiply and add, with or without saturation. And it is a very interesting instruction because we have multiple versions of it: we can do the multiply-and-add with signed and signed, unsigned and signed, or signed and unsigned operands, and in the end it can be saturating or non-saturating. And here is the description of why it's not just a simple FMA: it multiplies groups of four pairs of signed bytes in one register with the corresponding signed bytes of another register, sums those four products, and adds the result into xmm1. When we use it from C, we see: multiply groups of four adjacent pairs of packed signed 8-bit integers in A with the corresponding ones in B, producing four intermediate signed 16-bit results; sum those four results with the corresponding 32-bit integer in W. So it has three inputs: A, B and W, right? And when we compare it, as we explained, with the logic of how a convolutional neural network works, you will see that in the end it's still necessary to apply the final activation to the result of the dot products. There is a nice example, also already in the GCC master trunk, that does the calculation by itself and compares the result with the one from the new instruction. The next one is AVX-NE-CONVERT. This instruction is going to load BF16 elements and convert them into FP32 elements, with broadcast. Now, it's a simple translator.
It takes the BF16 elements and converts them to FP32. This is for compatibility, and it's also going to be available on Sierra Forest. There is a nice example — very smart: what they do is convert from FP32 to BF16, then ask the new instruction, please transform BF16 back into FP32, and the results of both operations must match for the same specific number. The last one is CMPccXADD. It's not just one instruction, it's a bunch of instructions encapsulated under a single mnemonic. It will compare and add if a condition is met, and multiple conditions are available — here's the thing. For example: compare the value, and if it's below or equal, add the value into m32 — that's compare-below-or-equal and add. Then we could have compare-below and add, which compares the value in the first operand with the second operand; if it's below — but not equal — it adds the value from the third operand into m32 and writes the new value into m32. I just put a small set of those here; you can find more in the Intel Software Developer's Manual, where you will find all the different kinds of conditions you could have. And there is, of course, a very good example in the GCC master branch that you could use as a reference for coding. How can developers play with these new functions? That's the interesting part from the developer's perspective: there are many features available — over here is the link with a summary of multiple blogs and talks about how to use the latest GCC 13 or the latest glibc 2.37 that are available first in these distributions, including the Clear Linux open source project, which is very simple to docker pull, and you will have the latest and greatest version of the tools.
Now, having a better understanding of these innovations is basic and fascinating, because you can think about new things you can develop with your own source code. You can improve performance, improve security, improve sizes — we have seen in another presentation, where we talked about -fanalyzer, how it helps you to create much more secure, much cleaner code. You have seen how to use new instructions — without that toolchain capability it would be very frustrating, because you would have a piece of hardware with a shiny, amazing new instruction but no way to use it, since the toolchain would not provide the connection between your source code and that new instruction. With a very basic new flag added to the compiler — boom — you can generate a new version of the executable that uses the accelerator. We have also seen in this presentation how they put a very basic library on a diet, moving the initialization work from the object file into the shared library, which improves the size of the binary very much, and also the startup performance. It's going to be fascinating. Well, thank you so much for your time. I hope that you have enjoyed the presentation. If you have any questions, please don't hesitate to ask me. Thank you so much, and thanks for letting me speak at the Open Source Summit North America. Thank you, have a great day.