Hello, welcome to the third day of the conference. My name is Philip Müller, and I'm talking today about the Orange Pi Neo, a gaming console based on Linux. So let's go down memory lane. The first gaming console of this kind, not made by Nintendo, was the Ouya. It came out in 2013 and ran Android, so it was the first console where people had the opportunity to build and ship their own games. But it failed: only about 2,000 units sold and only 1,250 supported games. In 2020, someone said, hey, why not put Linux on it? Because by then it was dead hardware. The next one was by Valve, the Steam Machines. They ran Linux and sold around 500,000 units, but the games were fewer, only about 1,000. So Valve had to rethink, and a couple of years later, in 2021, the Steam Deck came out. It still had around 1,000 games in the beginning, but the form factor was different, and so far they have sold around 3 million units. After that, in 2023, the Steam Deck OLED came out. What changed? Well, the first batch sold out, so it was a hit for them, and there are now 13,981 games you can play on this mobile device. So we came along in 2024 with the Orange Pi Neo. We intend to do an initial batch of 50K and then produce more on demand, and there are around 14,000 games you can play on this device. So how did we port Linux to it? There is always the chicken-and-egg question of which comes first. To ship the device, we have to port Linux to it, of course; but if the device is not produced yet, we need another device to stand in for it, and ideally hardware whose makers are pursuing the same goals. So what was on the market, on the electronics shelves? The ASUS ROG Ally. When I was back in Munich, Germany, I went to the nearest Media Markt and bought that device to start porting my operating system, which is Manjaro. So what is the hardware of this device? It's similar to ours: the AMD Ryzen Z1 Extreme, the same graphics, a similar screen, 16 gigabytes of RAM, 512 gigabytes of storage, and Dolby Atmos. A lot of problems come with this hardware design. It ships with Windows 11, the ASUS software, and a lot of Windows-only drivers. So what didn't work? Part of the controls worked, but not every button — that was a problem. Graphics worked, great. The screen worked. What didn't work was sound, so you couldn't play any games with audio. That was a challenge. We had to support the gyro, the sound card, and some buttons, of course. And Windows: if you have Windows, you have updates. If you start the ASUS Ally for the first time, you are greeted with the Windows setup. You have to create an online account, do updates, then driver updates and launcher updates. So you stand there wanting to play a game, and it takes about an hour and a half before you can even start — and then you still have to download the game. And if you look here, there are two different models. What's the difference? The UI. Every manufacturer changes the UI, so you don't get the same user interface. How did we change that? Well, we put Linux on it. We have Manjaro, KDE Plasma as the desktop UI, some parts of ChimeraOS, which is a gaming operating system, and a bunch of other things, including Steam and other launchers. With this, we can start and get exactly the UI we want.
So the first thing we have to do is testing on the device: is your operating system working? The second is giving the vendor feedback about what they have to change, and then iterating on testing and feedback with the vendor until everything works. What worked? On my device, part of the controls worked, which is good; graphics worked, the screen worked, and audio worked. So we were already one point ahead of where we started on the ASUS Ally, and the touchpads worked as well. We only had to work on suspend, on four buttons that were not working with the default drivers, and on the gyro. How did we do that? First the basic functionality, so the operating system works. Then we fixed issues in firmware and hardware — a lot of devices came across my desk, and if a button didn't work, you have to change that in the firmware. And the last step is polishing the experience, which is what we're doing now. What is important? Power management, of course — you want long battery life. Cooling is also something you have to work on. Some people like RGB, some fancy coloring around the device. For the controls, there are a lot of input devices; XInput is the standard everybody wants to use, so we aim for that. The gyro is also something people like to use: you can turn the device upside down and the game will react, or in a flight simulator you can do some turns. And what is also clear: build quality. You want a good device, so all the buttons should work and so on. And you have to stay curious and adapt to changes. So this is mainly what you do when porting to a device. Are there any questions regarding the device itself? I have the device here. You want to test it? You want to test it? OK. So no questions regarding how to port. — Morning, everyone. I have a question: is this software available for any other Linux distribution, or just Manjaro? — The device will start shipping with Manjaro, but all the drivers are there, so you can port your Linux system to it if you want to. If you get it from the manufacturer, it has Manjaro, but we open source everything, so it can be ported to different operating systems as well. The drivers are there, and we did the heavy lifting so others can adopt it — and maybe that will bring a lot of people to use it. — Do we have root access to it, or is it locked? — Yes, you have root access. You can fully change everything. — What made you choose the AMD platform rather than another, say ARM, as an example? — AMD was already established, so you can simply port everything. SteamOS is based on AMD hardware, and therefore it's easy to port — it's just new hardware, a faster CPU, so the work is less. If you look at Intel, there is the MSI Claw, and they have a harder time because the graphics card is new: if you have a game, it doesn't start; you have to emulate and say, hey, this is an AMD graphics card, and then it starts. So there are a lot of things, and we want the first product to be easy. That's why we are using AMD. Any more questions? Five minutes. Well, we will try to start sales hopefully in May or June, depending on manufacturing. First we have 300 units for reviewers and for testing the build quality, and after that we go to mass production, the 50K. If that works out, we will ramp up, about 50K or 200K more, depending on demand.
So if people want to buy it, we tell the factory to make more devices. The factory can produce about 30K units in a week, and if we need more, we can employ more people in the factory to make that faster. Do you have a question? No. — Did you do a licensing agreement with Steam, with Valve? — Well, I'm working with Valve. Valve wants to open up SteamOS for other vendors, so I'm in talks with them, and the first thing we initiated was the legal side. On the Steam Deck you have the easy first-use setup; if you start my device, it is set up in two minutes instead of the 120 minutes on Windows. You simply power it up, log in with your Wi-Fi credentials, and then you're in and can enjoy the game. For that, we have to clarify the legal issues with Valve, which we are currently discussing, but they also want to open up to other devices. So SteamOS might in the future run on different devices as well. Different devices, yeah. Okay, then that was all from my side. You can look around — I'm here at the convention and you can try out the device.

Okay, so I guess it's 8:45. Welcome to my talk. I'm Piotr Król and I would like to introduce you to Dasharo, an open source firmware distribution. I'm the founder of 3mdeb, a Poland-based consulting company specializing in embedded firmware — mostly open source BIOS, but also embedded Linux. I'm also an instructor at OpenSecurityTraining2, which offers some fancy free stuff that I will advertise at the end. So maybe first a short exercise: how many of you know this logo? Raise your hand. Whoa, almost no one. So there are two people who know what coreboot is. coreboot is an open source firmware alternative to the BIOS, and maybe you don't know it, but every Chromebook runs coreboot — it is the BIOS of every Chromebook. So maybe it will be easier with this one: how many of you know this logo? Ever seen it? Also not many — oh, there are a few more. This is UEFI. The UEFI Forum is an organization that defines the specification for the interaction between the BIOS and the operating system. Based on this specification, there is a reference implementation maintained by an organization on GitHub called TianoCore: a huge C code base you can use to develop a BIOS for your computer, for your device. I guess it will be harder from here, because these are even more niche projects. Who knows this logo? No one. This is LinuxBoot, a very interesting open source firmware project, where Ron Minnich, the creator of coreboot, decided to put Linux inside the SPI flash chip of the computer. So when you power on your device, there is no BIOS — or there is a little bit of BIOS — but then the Linux kernel starts immediately. Why did he do that? Because he figured there is no need to write the drivers twice. Typically on a system, one set of drivers is in the BIOS, and then of course another in your operating system. So he said: why should I write drivers twice? I will reuse the Linux kernel drivers in the BIOS for initialization of the system. And the last logo is a maybe very niche project, but a very interesting, very security-focused one: Heads. This is a coreboot downstream distribution, very specific, using hardcore cryptography — GPG, a USB token — for checking the integrity of the firmware.
So I really encourage you to Google for "coreboot Heads"; you will find the GitHub repository and you can read about this really secure firmware distribution. What I will talk about is Dasharo. Dasharo is another downstream distribution of coreboot, made by 3mdeb, by my company. The goal of Dasharo is to provide an openly developed, validated, platform-security-focused firmware distribution with a UEFI interface, so any modern operating system can boot on top of it. So what does the status of the project look like, and what kind of impact has Dasharo already made? First of all, we have already merged about 16,000 lines of code into coreboot upstream. That is maybe not much, but another 20,000 are in progress, and we also have a POWER port of around 50,000 lines of code which will also need merging. In terms of hardware compatibility, we have 17 platforms supported right now, with five more in progress. We have published over 500 releases and run over 40,000 tests on those releases. All the tests are published, all the source code is published, so you can build those images and flash them — but we also provide the binaries, which you can use immediately on supported platforms. In terms of market share of the devices: we support IBM POWER9, especially the Raptor Computing Systems Talos platform; we did quite a lot of Intel; but most of what we did is for AMD, especially for firewalls from the Swiss company PC Engines. As for community, we have over 250 people; the communication channel is on Matrix, with around 2,000 messages a month. Not much, but if you are interested, you can join and see how things move along. We were highlighted in the news in various places. Tom's Hardware wrote about us when we did the port for the MSI Z690. There are a bunch of other outlets; Phoronix likes us very much — Phoronix is an open source news outlet, and they like what we're doing very much. PC Gamer wrote about us. So you can see this is not just some unknown niche project: there is real impact and we are growing. I wanted to show you a comparison between various distributions. The most important are the two last rows. Maybe let's go through the columns first. The rows are the various open source firmware distributions, plus one row which is not open source firmware: the closed-source, proprietary vendor BIOS. Let's start with the first column: openness. The question is whether the source code of your BIOS is open, and you can see that typically the source code of the BIOS on your computer is not open, and you typically don't know exactly what you're dealing with. There may be malicious code there, there may be malware, there may be bugs which are never fixed. There may be outdated microcode, which leaves you vulnerable to Meltdown- and Spectre-like bugs. So that's what's different from the typical vendor BIOS. Then the question is whether a given project provides binaries, or whether it is source code only and you have to compile it yourself and deal with the toolchain problems and all the technical difficulties of building something. In most cases you get binaries. The exception is coreboot. coreboot is like Linux: Linux provides source code, it does not provide binaries — distributions provide binaries, yeah?
So the point is that coreboot does not provide binaries; you have to compile it yourself, and you have to decide which options to choose while compiling. The next column is public long-term support. The question is whether the platforms enabled in a project will be supported long term or will be dropped. Most of the distributions have long-term support, because they support hardware that is even over ten years old. But coreboot has a policy that if an old platform causes problems for further development, they make a branch and remove that platform from the head of the tree in the repository. This has happened multiple times. And of course, with vendor BIOS, you never know. Typically consumer hardware gets maybe two years of BIOS updates, maybe three. Server hardware from serious companies may get much longer support, and industrial embedded hardware may too, but typically you do not get long support from a BIOS vendor. In terms of commercial support: some may need commercial support for the BIOS. Heads has it. For Libreboot, I didn't find the information on the website, but I guess if you call them, they will probably provide it. For coreboot, there is a dedicated page: you go to coreboot.org, you find the consulting section, and there is a list of companies that will support you. With vendor BIOS, it depends, but mostly yes, you will get commercial support — the point is that if you are a small vendor or an individual, it's unlikely; you have to be reasonably big to get that support. And with Dasharo, you get the support. As for public community size: Heads and Libreboot, which are maybe not so well known here, already have quite established, reasonably big communities. coreboot has a huge community in comparison to Heads, Libreboot, or Dasharo. For vendor BIOS, you can go to the forums of various vendors, but there is only some small group, maybe 50 people, interested in BIOS modifications, and it typically does not last long. Dasharo is growing, so I would say it's between the vendor BIOS and Heads. Going further: transparent validation. The question is what was validated. Sometimes you get a BIOS update with very "descriptive" release notes: performance improvements, bug fixes, something like that — which says nothing. With Dasharo, you get transparent validation: you get exact information — this test was executed with this procedure and this is the result. If there is a bug, there is a link to the bug on GitHub; everything is transparent and open, and you can check it. Heads and Libreboot have small testing communities; this is not backed by big organizations, so they typically rely on a GitHub issue popping up when a bug appears, and it gets discussed there. There is no formal testing and no transparent validation list of what was validated in a given release. With coreboot, there are no binary releases, so there is nothing to validate. And with vendor BIOS, sometimes there are tests — in particular, hardware compatibility lists are published — but you don't really know what they validated: which operating systems, which version, which service pack, and so on. With Dasharo, you get all of that.
As for validation scope: how big is the test scope, how many tests do they run? With Heads, I'd guess it is very minimal — the Heads developers run tests themselves, and there is CI that does the builds. With Libreboot, I guess a little less. With coreboot, there is CI that just does builds, nothing else: if it doesn't build, they consider it broken; if it builds, they don't test anything further. Vendor BIOS teams validate a lot — they have huge labs and a lot of various hardware, and they can really do a lot of validation — but they do not provide the results of that validation. That's the problem. With Dasharo, we have quite a lot of tests: for some platforms even 500, and we're still growing. As for NDA docs — non-disclosure-agreement documentation — this means access to the silicon vendor documentation, which gives you the ability to fix potential problems in the BIOS or report problems related to hardware. Heads developers typically do not get that, because they don't sign NDAs, and Libreboot don't sign NDAs either. Some coreboot developers work for corporations, so they have access. Vendor BIOS of course has access. 3mdeb and Dasharo have access. As for the size of public documentation: most of the projects have very little, and the proprietary vendors have very little. You can go to docs.dasharo.com and see how much documentation we have — a tremendous amount, a lot of information you can find there. So that's the comparison. I also wanted to quickly pitch something called the Dasharo root of trust. Essentially, it is a mechanism for making the platform boot only the firmware you want — firmware signed by you. That's typically what Lenovo, HP, or any OEM does: they blow fuses in the processor with their public key, so no other firmware can be booted, only theirs. We also offer this possibility for those who want it. Those fuses sit deep inside the silicon, in silicon-vendor technology, which typically means they are used by something called the Management Engine, or the Platform Security Processor on AMD; it could be a BMC on server platforms or a boot ROM on ARM platforms. From there, the root of trust of the platform starts and the boot process continues. It can continue through various firmware — which could be Dasharo, coreboot, TianoCore, U-Boot, Trusted Firmware — and then of course control passes to the operating system. This is the logo of Zarhus, another project we are doing. And in the operating system, you do whatever you want with your technology. Okay, so you may ask: this is all open source, so how do you earn money? What's your business model? The business model is quite complex, as on this diagram, but essentially for Dasharo there are three main streams. The entry subscription is a subscription-based model: someone pays us to get the binaries delivered to their email. Of course, the code is open, so they can compile it themselves; it's up to them whether that is a waste of their time or fine for them. Someone can buy a support package, or there are dedicated products that go to customers. Okay, I got the signal that I should end, so let me give you this slide. If you want to buy Dasharo hardware, those are the links. The slides will be published by the organizers.
This is my contact. If you want to talk with me, please feel free to reach out. And I wanted to give you some free stuff. Together with Xeno Kovah, a former Apple security engineer who was responsible for the secure boot of the M1 MacBook Pro, we created this portal for free security education. If you go to ost2.fyi, you will get free trainings, and many of them are very low level: x86 architecture, coreboot, UEFI, debuggers. You can learn a lot from that for free — and then get a job. And if you want to work for us, feel free to send us your CV. There is a dedicated page, 3mdeb.com/careers, where you can check which positions we have open. That would be it from my side. Thank you very much. We have time for questions. No? Okay, so if you want to talk with me, please catch me in the lobby and I will be happy to answer any questions. Thank you, bye-bye.

Hello everyone, welcome to my session. Today I'm going to talk about our work on ARM processors — welcome to the world of ARM. A key feature of ARM AArch64 processors is that they support many cores; at this moment we are still leading the industry there. In the server domain there are usually 64 or 128 cores in a single silicon chip, so that's an advantage for ARM processors: we have more compute units. At the same time, vendors are trying to add more hardware accelerators into the same chip to accelerate specific tasks. So there is a point to this: we need to balance our jobs between the CPU cores and the specific hardware accelerators. Today I'm going to talk about the work we have done so far in this area. People say that the longer your title, the less attention you get, so I tried hard to make it short. I'm going to touch on SVE, which is the ARM vector solution in the server domain, and on hardware accelerators. A brief introduction of myself: I come from Linaro. Linaro is an organization that works on the ARM software ecosystem. Today's content, as mentioned: a bit about the software solutions, the hardware accelerator solutions, and the work we did in OpenSSL and the JDK. If in this session you learn something about SVE, hardware accelerators, or OpenSSL, I will be happy. So why is crypto important? First, we get more and more data these days. Second, governments: there are a lot of regulations, rules, and laws enforced by governments all over the world. I just checked that Vietnam has such laws, and so do China and the US, both at the government level and industry-wide. They require us to apply encryption to our data. That's big: think of every file, every byte we store on our servers and disks — you need to encrypt it, and when you want to fetch it, you need to decrypt it. That's a lot of work. The same goes for storage space: we need to compress and decompress. To serve this need, we have the two methods I mentioned: the CPU core method and the specialized hardware method. So let's first check the ARM software solutions. In the software solution, most of the time we are talking about instructions. In the ARM world we have specialized features, specialized instructions, for example for SM4 and for AES. You can find dedicated assembly instructions in the ARM ISA to process this type of job, and people use these specialized instructions to optimize their libraries — for example, the erasure-code or SM4 encryption implementations in OpenSSL.
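As a taste of what those dedicated instructions look like from C, here is a minimal sketch using the ARMv8 Cryptography Extension intrinsics from <arm_neon.h> — the intrinsics are standard; the wrapper function is our own illustration:

```c
// Minimal sketch: one AES encryption round using the ARMv8 Cryptography
// Extension intrinsics. Compile with e.g. -march=armv8-a+crypto.
#include <arm_neon.h>

// AESE performs AddRoundKey + SubBytes + ShiftRows; AESMC performs
// MixColumns. A full AES-128 encryption chains ten keyed rounds like this.
uint8x16_t aes_round(uint8x16_t block, uint8x16_t round_key) {
    block = vaeseq_u8(block, round_key);
    return vaesmcq_u8(block);
}
```

Libraries such as OpenSSL chain these instructions in hand-written assembly so that a full AES block takes only a handful of cycles.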
Another software solution is to use vector instructions — single instruction, multiple data. There is the AVX solution in the x86 world, and in ARM we have SVE. SVE stands for Scalable Vector Extension; I will come back to it. ARM also recently introduced the matrix extension, which is especially useful for AI processing. So what is SVE? SVE is vector processing, and it is scalable: the register size can be defined by the vendor. Each chipset vendor can pick a proper vector length, anywhere from 128-bit up to 2048-bit. And the beautiful thing about SVE is that, as a software programmer, when you write your code you don't need to care much about the actual vector length — whether it's 128-bit or 512-bit long — because the instruction set provides a mechanism to cover this for you. Going into a bit more detail: SVE has 32 vector registers, and as mentioned, the vector length can vary from 128-bit to a maximum of 2048-bit. SVE also provides P-registers, predicate registers. A predicate register is like a mask. In vector processing we treat the data as lanes, and the predicate register masks the lanes: you provide the mask, you issue the instruction, and only the lanes with a proper mask bit — a one — are executed. You don't need to move the inactive lanes somewhere else; you keep them where they are, put a zero in the predicate register for those lanes, issue the instruction, and everything is done for you. Here is an example with a multiply-and-add instruction: with this mask, the lanes whose mask bit is set are processed and changed, and the others pass through untouched. That makes things much simpler for programmers. Another important SVE feature is being vector-length agnostic. That means, especially in loops, you don't need to know whether the specific chip running your software has 128-bit or 256-bit vectors; it's implied by the instructions. For example, here you have the INC instruction. Traditionally, with NEON or other architectures, you have to know that when you increment the index, the vector length is 16 bytes — in one loop iteration you process exactly 16 bytes. But in the SVE world, if you are running on a 128-bit machine, it will increment the index by 16 bytes, and if you are running on a machine with 256-bit vectors, it will increment the index by 32. When you write your program, you don't need to care about the actual vector length; that's what it means. Okay, beyond that — really? Okay, I'll go quickly. ARM extended SVE into SVE2 for more domains, and there is also the matrix solution. With the matrix solution, they introduce a bigger register file, so you can do complex operations like the outer product of two vectors.
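To make the vector-length-agnostic idea concrete, here is a minimal sketch of such a loop using the standard ACLE SVE intrinsics (the function itself is our own illustration; compile with something like -march=armv8-a+sve):

```c
// y[i] += a * x[i], written once and correct for any SVE vector length.
#include <arm_sve.h>

void saxpy_sve(float a, const float *x, float *y, long n) {
    for (long i = 0; i < n; i += svcntw()) {       // svcntw(): 32-bit lanes per vector
        svbool_t pg = svwhilelt_b32(i, n);         // predicate masks off the tail lanes
        svfloat32_t vx = svld1_f32(pg, &x[i]);     // predicated load
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        vy = svmla_n_f32_m(pg, vy, vx, a);         // vy += vx * a, under the mask
        svst1_f32(pg, &y[i], vy);                  // predicated store
    }
}
```

The same binary processes 16 bytes per iteration on a 128-bit implementation and 32 bytes on a 256-bit one, exactly as described above.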
Okay, so now let's go to the hardware accelerators. A hardware accelerator is another solution: we know that CPU frequency and capability eventually reach a limit, so there is a chance to use specific hardware for specific computations. Take the Kunpeng 920 chipset, manufactured by HiSilicon: inside this chip there are four accelerators working on compression, security, and more. I'm going to explain the security part: it can do AES encryption and decryption, and it can accelerate SM4 encryption and decryption as well. To support that, Linaro worked with HiSilicon to develop the UADK framework. This is a software framework whose purpose is to support the hardware accelerators. It is built on SMMU technology: with the SMMU, you get the same virtual address space between the device and your program, so you just hand the virtual address to the device and, through the SMMU, the device can find the proper physical address. There will be another session today in this room from my colleague, so if you want to learn more about UADK and the underlying technology, you are welcome to join his session. Okay, we have mentioned software acceleration and hardware acceleration. Now, how can we get the most efficiency and capability out of them? We need to find a balance, so we built a balancing strategy: we consider how, and under which conditions, we dispatch jobs to either the software implementation on the CPU cores or the hardware accelerator. We consider the CPU load, we consider the memory bandwidth — when we define the strategy, we need to consider all these criteria. At Linaro, we built this solution on OpenSSL 3.0. There are fairly new features in OpenSSL 3.0 — it was released recently, and the previous 1.x versions are obsolete by now. The new feature is that each application can define its own library context. Within the context, it can load its own providers; a provider is an execution unit, an implementation. You can register multiple implementations into the same library context. It also provides a method store: when the user application wants to fetch an implementation — a processing unit — it fetches from the method store first. So that's new. Based on that, we do the balancing by implementing a load-balancing provider. That's a special provider: your application registers just this single provider, and then you load the other providers into your application's context. The load-balancing provider keeps a mirror of the real providers, so it has the information about all the other providers you loaded. With this strategy, implemented inside the load-balancing provider, your application can choose among all the child providers to find the best candidate for its job.
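To make that concrete, here is a minimal sketch of the OpenSSL 3.0 side of this design, using the standard library-context and provider APIs; the provider name "loadbalance" is our own stand-in for the load-balancing provider described in the talk:

```c
// Sketch: an application-private library context, providers loaded into it,
// and an algorithm fetch that goes through the context's method store.
#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/provider.h>

int main(void) {
    OSSL_LIB_CTX *libctx = OSSL_LIB_CTX_new();      /* per-application context */
    OSSL_PROVIDER *sw = OSSL_PROVIDER_load(libctx, "default");
    OSSL_PROVIDER *lb = OSSL_PROVIDER_load(libctx, "loadbalance"); /* hypothetical */

    /* The fetch consults this context's method store; a balancing provider
     * can then route each operation to a software or hardware backend. */
    EVP_CIPHER *aes = EVP_CIPHER_fetch(libctx, "AES-128-CBC", NULL);
    printf("fetched: %s\n", aes ? EVP_CIPHER_get0_name(aes) : "(none)");

    EVP_CIPHER_free(aes);
    if (lb) OSSL_PROVIDER_unload(lb);
    if (sw) OSSL_PROVIDER_unload(sw);
    OSSL_LIB_CTX_free(libctx);
    return 0;
}
```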
Then we brought this to Java: most applications in the big data world today — Hadoop, HBase, and much more — are written in Java, so we needed to export this load-balancing capability to the Java world. Our method is to implement another component which, in the Java world, is also called a provider. Sorry about that; people like this name, I guess. The key provider in Java calls the load-balancing provider, and the load-balancing provider helps the application assign tasks to either the hardware provider or the software provider. It's also easy to use: you just need to change one line in your JDK. Bisheng JDK is a version derived from the OpenJDK project and open sourced by Huawei. And here are some measurements — sorry, I just need two more minutes, okay? Thank you. We can see that with the load-balancing feature, at the same CPU utilization (the red line), we get higher bandwidth; similarly here, same CPU utilization, higher bandwidth; and here, to reach the same bandwidth with balancing, we need lower CPU utilization. Here is another demo we did on HDFS, the Hadoop file system: with our method, we achieve about 30% performance gains, with jobs assigned to both the hardware accelerator and the CPU software implementation. Okay. As with any open source software, your contributions are always welcome. Here's our project; it is incubating under the openEuler Big Data SIG. You are welcome to join the community, check the code, and find ways to utilize your hardware better. So again, I'm from Linaro; Linaro works on ARM software. Feel free to ping us with any service request. Thank you.

Xin chào, morning everyone. My name is Andrew Wafaa and I'll be talking about how you can get the benefits of working on ARM and get help with your ARM hacking. A quick overview of who I am: as I said, Andrew Wafaa, I head up ARM's open source program office. I look after our upstream interactions, community engagements, and such like. I also wear many, many hats, as you can see here. I'm on the UXL Foundation — Guodong was talking about accelerators; that's all about accelerators — and the Xen Project, for a hypervisor. I'm on the steering committees at Linaro and the FreeBSD Foundation, as well as chair of the Yocto Project, and I'm also the chairman of OpenUK. So, how many people know about ARM? Anyone? So we've got a couple. Cool. Just in case you don't know who ARM is: we're a semiconductor design company. We don't make anything; we come up with IP and ideas and we get paid for those ideas, which is brilliant. Our technology and our designs are now in 270 billion devices, with about 33 billion shipped to date this year, and there are over 15 million people developing on ARM in one shape or another. ARM in Vietnam: at the end of last year, our education and university team hosted a delegation from the Vietnamese education organization, and I'm pleased to say that we have signed an agreement with the Vietnamese government to join the Semiconductor Education Alliance. What does that mean? It means we're building a future on ARM in Vietnam, and we're very pleased that, hopefully in the very near future, the curriculum for computer science and other subjects will be based on ARM, or at least have ARM included. So, the first stop when you want to start developing on ARM, whether from a hardware or software perspective, is our developer hub. We've spent a lot of time and effort redesigning how we get information into the hands of developers, and we've created the hub to collate all of that. It's very simple: what's the area you want to develop in?
You go there, you've got videos from ARM experts, both internal and external; you can find out where the next ARM-based event is, et cetera; and there are training modules available, some of them free. As part of the developer hub, if you want to work out "okay, I want to develop on X, how do I do that, does that software work?", we've just launched the ecosystem dashboard, which you can get to from the developer hub. It lists all the various software: the earliest version that supports ARM, and its current status. There are commercial products and open source products, so you can find out early on how much work you're going to have to do to get something working. As part of that, we also have our developer program, and we strongly encourage you to participate. It's relatively lightweight: a program based around a Discord server where we have staff from ARM, staff from our partners, ambassadors, and regular everyday developers, all participating, and there are a number of meetups and such like. Get on board — you can get some nice swag and other bits — but it's a great place to find other people working on ARM platforms. Say you're working on Kubernetes: what's the best platform to use for Kubernetes? You can find out there. What are the various quirks or differences that may arise? You'll find out very easily, and you can have a nice social conversation with somebody and actually build a community around what you're trying to do. Pretty sure somebody's tried it before, so learn from them and that'll help you move forward. As part of the developer program, we've created something called Learning Paths. It's GitHub-based web content, so we've got, as you can see, a number of topics, and you're able to edit and submit pull requests if you find mistakes or issues. If you've got questions, it feeds back into the developer program, so you can learn, and you can contribute: if you've found a way of doing something that is not listed there, you can create a learning path and submit it, and that helps the whole community at the end of the day. We've recently released a new product called Performance Studio. One of the complaints we've heard from developers, both large companies and individuals, is that there's no real tooling to analyze the performance of your software stack — how can we verify the claims ARM is making about how good our hardware is, right? So Performance Studio came out. It's effectively a suite of tools targeting CPUs and GPUs, primarily designed for application developers, not low-level hardware or firmware developers, and it's completely free: we're not going to ask you for any money, and you can download it from the developer hub. It may seem familiar if you install it, because we used our Mobile Studio as the basis for it. For those that don't know, ARM's primary market segment has been mobile for quite some time — over 90% of phones run our designs, and the same with tablets — so we had a tool set for that segment and we've expanded it. We've added Frame Advisor for the graphics side, as well as RenderDoc, and we've ensured that it's not just focused on the likes of Android, but general-purpose Linux as well.
The workflow is a fairly standard performance-analysis workflow: monitor your workload, see what's happening, get reports on what's actually going on — what's failing, what's not meeting the thresholds you thought you needed — and then optimize that workload. As part of Performance Studio we've got Streamline, and as part of the Streamline suite there's Performance Advisor, to help automate your profiling, I should say; the reports it generates are designed for monitoring what you're doing, so you can quickly and easily digest what's coming out of them and respond faster. We've also got the sampling profiler piece of the Streamline tool set: see what's happening on the CPU, see what's happening on the GPU, and see how they interact if you've got a workload moving from one to the other or calling on another processing unit — you can see that in real time. It has relatively low overhead for both your CPU and your GPU; with any profiling there is going to be some form of overhead, but we've kept it to a minimum. So with that, you can see clearly what's going on, what's triggering what, and it helps move things along. Frame Advisor is a new addition, focused primarily on the graphics side. There are more and more ARM-based platforms coming out with multiple types of GPUs now. My machine here is a Lenovo ThinkPad X13s, an ARM-based platform running a Qualcomm chipset. There are new platforms coming out from Lenovo, Dell, and HP, and there will be more different chipsets there — Qualcomm, MediaTek, there's talk of NVIDIA and a number of others — so hopefully that market will change. Microsoft also has its own Surface Pro platform. Frame Advisor is a relatively new component, so we are keen to hear back from users: what's working, what's not working, what's missing, what would you like to see, how do you use things, et cetera. Please email us at performancestudio@arm.com; we'll be super keen to get your feedback. And as part of that, we've added RenderDoc. RenderDoc is widely used within the industry as a graphics debugger. It has full Vulkan support now, and we've ensured that it works well on ARM and targets our GPUs as well. We're trying to be good citizens with upstream: there are going to be times when upstream RenderDoc can't accept some of our changes, for licensing issues or whatever else, or they just don't want to because it's too specific to a particular problem and not on the maintainers' roadmap. We will push as much as we can upstream; some things will have to remain downstream, but it will all be accessible through Performance Studio — and again, Performance Studio is free, we're not charging anything for it. And Mali is ARM's family of GPUs, so we've got the offline compiler there, providing shader syntax checking and performance information. Just to reiterate: you can get Performance Studio from developer.arm.com, and we'd love to hear feedback on how you're finding it — drop performancestudio@arm.com an email. I think I may have tried to recoup some time a little too quickly there. So, any questions? Yes. Thank you.
— It is very interesting that people from ARM come here and talk about ARM. ARM is the most used processor in the world, right? But there is RISC-V. I want your opinion, or maybe ARM's view of the future, regarding RISC-V. I mean, RISC-V is open source — how do you see things developing? — So, the future with RISC-V: I'm not sure how much I can say officially, but as we've seen, if we look back and learn from history, there's going to be space for a number of architectures; it depends on how successful they are, et cetera. One of the crucial advantages ARM benefits from is an expansive software ecosystem. We've spent a lot of time, money, effort, and manpower on ensuring the ecosystem is where it needs to be, and we're continuing to work on it. You can have an open piece of hardware, but if you don't have software, all silicon is is expensive sand, right? You need a solid software story for your hardware. Conversely, without hardware, software is just an empty idea — you're not going to see anything. So there is a symbiotic relationship, but you need that software to make the hardware work, and I think ARM is in the lead in that respect. We've spent a lot of time, money, and effort, we're expanding our investment, working with our partners, customers, and the wider developer community. — Possibly. Having an open instruction set is one thing; controlling that instruction set to ensure it doesn't fragment is a different story, and I think it is a significant risk — no pun intended — that an open instruction set allows people to diverge from the central idea. When it comes to actually supporting the software, it might run on a particular platform, but because there are so many different implementations, it's not necessarily going to run well everywhere. That might be okay from a community perspective, but if you want traction with commercial vendors — for instance the cloud vendors or major ISVs — they hate change, they hate choice, in so much as they want to write their software for one architecture, one platform, and they will tune and harden that to the ends of the earth. But it's very limited. If you look at, say, SAP as an example, and you want to deploy SAP on hardware, you will find the list of platforms you can deploy it on is very small, because they have to concentrate on tuning, security hardening, et cetera. When you diverge at the hardware level, it makes it much harder for those software vendors to get the most out of that hardware. So there are always going to be certain use cases where one architecture is better suited than another. What we are doing is providing choice to the ecosystem. With regards to x86 versus ARM: whilst raw compute power may be better, if you look at performance per watt and single-thread performance per watt, you will see a difference. x86 does have slightly better multi-threading — we are closing that gap rapidly — but from a performance-per-watt perspective, we deliver far more while using a fraction of the energy. And moving forward, that is one of the big issues: in the age of AI that we are in now, the power consumption for training on all this data is just going through the roof; it's ridiculous.
Companies like OpenAI need power stations, not just data centers, to power all the training they are doing. So I think the focus is going to change from raw power — we want it as fast as possible — to being as economical as possible, because it's going to cost not just the environment, it's going to cost the companies big dollars in paying for the power and cooling of all that hardware. — I think from an application perspective, as you said with databases, there are some databases that will, for the most part, always work okay on one architecture or another, but they will operate very well on some architectures; that's always going to be the case. NoSQL databases may operate better on ARM versus a traditional RDBMS, but keep watching — keep watching is all I want to say; I can't say one way or another. We're always looking at advancing compute, and traditionally ARM has been in the shadows. We've been quiet. We don't have a jingle with the advertisements or whatnot — yet. We'll see what happens, but we are trying to make sure that people are aware that we are here for the long run and are investing heavily in the software community at large. — Only for ARM GPUs? No, no — you can use this for the CPU as well. GPUs are heavily driver-driven, so if we can access the driver code, then yes. It is perf-based, yeah. Thanks very much.

Hello, I'm Chen Na-shun, a technical researcher from the Key Laboratory for Embedded and Network Computing of Hunan Province, Hunan University. I'm here to share on behalf of the openEuler community, and my topic is ZVM, a Zephyr-based virtual machine manager. Let me give a brief introduction of myself: I'm a PhD student at Hunan University, I work in the open source community on openEuler and ZVM, and I'm the president of the open source student association at Hunan University. At the same time, I'm a Zephyr project and openEuler developer. So, what is ZVM? ZVM is an open source embedded hypervisor based on the Zephyr RTOS. You may have heard of Zephyr: it's a popular open-source embedded operating system hosted by the Linux Foundation, and ZVM is developed on top of this excellent open source embedded operating system. When we talk about ZVM: first, the demand for virtualization in embedded scenarios is growing fast — for example, intelligent transportation systems may use a hypervisor to run multiple OSes on one platform. Second, there are few hypervisors built on a real-time operating system, so it may be meaningful for someone to have one. And third, students or developers who want to learn how a hypervisor works: ZVM may be a simple sample to help them understand how to design a simple hypervisor and how to use one. Now, how do you deploy ZVM? We can deploy ZVM on a virtual platform or on a real hardware platform. For example, I picked a virtual platform on the QEMU emulator, and we can also deploy it on a developer board like a Rockchip RK3568 system chip. If you are new to ZVM, I recommend using the QEMU emulator, because it's simpler than real hardware. ZVM maintains compatibility with Zephyr's build system, which means you can use the west command to build ZVM just like you build Zephyr, and then you can use the QEMU emulator to start ZVM. This is the start command.
You use the qemu-system-aarch64 emulator to start up ZVM, and finally you get ZVM running in QEMU — for example, the ZVM host shell — and you can use this shell to create and run virtual machines. That's how to deploy ZVM. You may ask why we need ZVM. Some embedded systems need to mix the deployment requirements of a rich-feature OS with real-time constraints — for example, the intelligent control system in a car, or an intelligent industrial control system. Such a system may need two types of OS. One is the rich-feature OS: in the car, that's the intelligent cockpit, with virtualization and human-machine interaction. The other is a real-time OS for intelligent driving: charging control, power control, and other motor control. The two of them can be considered a twin-star coexistence system, as in the figure: the rich-feature OS is the main star and the real-time OS is the second star. To give an example, the main star may be openEuler and the second star may be QNX or another RTOS. In such a mixed-criticality system, you have a single hardware platform with multiple OSes, multiple VMs, and various applications for different functions. That's why we decided to develop such an embedded hypervisor. Now let me introduce the framework of ZVM. As you can see in the picture, ZVM can be divided into four layers: the hardware layer, the hypervisor layer, the guest OS layer, and the applications. First, the hardware layer: we support ARMv8-architecture CPUs, including core devices such as the processor, memory, the interrupt controller, et cetera. The virtualization layer contains the core virtualization extension modules — the vCPU, memory, and device models — plus the virtio framework, shared memory, and other functions; underneath, we reuse the Zephyr kernel and its modules. Then there is the guest kernel layer: ZVM can create multiple machines running Zephyr, openEuler, or other ARMv8 Linux systems. And the app layer is the application layer: we can run real-time or non-real-time applications for different functions. Then, ZVM in openEuler Embedded: you can see in this picture that openEuler Embedded needs embedded virtualization to put several OSes together, and ZVM, as part of the openEuler Embedded ecosystem, provides the virtualization capability to run multiple OSes. The main functions of ZVM include managing the virtualized resources — for example, the processor, memory, and device resources — and providing APIs to applications, such as hardware-access APIs so a guest OS can control real-time hardware. It supports multiple operating systems, including openEuler Embedded, Ubuntu, and Zephyr, and in the future we will support Android and FreeRTOS. It lives in the openEuler Embedded ecosystem, and in the future we are trying to merge ZVM into the Zephyr mainline, into the Zephyr ecosystem, to become part of that international community. Now let's look at the technical features of ZVM. First is system configurability.
ZVM supports flexible configuration and can switch between Type 1 and Type 2 as needed. Second is dynamic resource management: dynamic memory allocation, dynamic device allocation when a VM starts, and adaptive task scheduling. Third is device virtualization, which includes device pass-through, full device virtualization, and para-virtualization with the virtio framework. Fourth is inter-VM communication: we support communication between VMs based on an IVC framework and shared memory. Fifth is the real-time features: we reuse the real-time capabilities of Zephyr and support VM priority or deadline scheduling strategies. And sixth is edge-side inference: support for Paddle Lite, an edge-side AI inference framework, running models like ResNet-50 and others. Now the technical details of ZVM. First, system configuration — how to configure ZVM. There are two parts. The first is how to configure the VMs: we use the device tree to configure each VM individually, according to a device-tree description file like this one. With it, we decide how many devices a VM gets, which modules it gets, and which subsystems are passed through to the VM. The second part is the Kconfig system: just like Linux, we use Kconfig to configure ZVM itself — for example, to decide which subsystems are built into ZVM and which VM functions or other modules are compiled in. With this feature, ZVM can switch between being a Type 1 and a Type 2 hypervisor: as a Type 1 hypervisor, only a few subsystems are built into ZVM itself; as a Type 2 hypervisor, most of the kernel devices and drivers are built into ZVM. The second feature is dynamic resource management, which provides flexible use of hardware resources. It also has two parts. The first is device allocation: ZVM unifies all the devices on the hardware board and hands a device to a specific VM when the VM starts up. Then there is memory allocation: ZVM uses a doubly-linked list to maintain the memory space of each VM, and builds a page table for each VM to translate the VM's virtual addresses to physical addresses. ZVM can also provide different page tables for different OSes — a small page table for a Zephyr VM, or a bigger one for a Linux VM — and, based on Zephyr's heap memory management, it supports rapid memory allocation, which makes VMs easier to use.
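As an aside, the per-VM memory bookkeeping just described might look roughly like the following; this is an entirely hypothetical sketch for illustration, and none of these names come from the ZVM source:

```c
// Hypothetical sketch: each VM keeps its memory regions on a doubly-linked
// list, and a stage-2 page table translates guest addresses to physical ones.
#include <stdint.h>
#include <stddef.h>

struct vm_mem_region {
    uintptr_t guest_base;            // guest-physical base address
    uintptr_t host_base;             // real physical base address
    size_t    size;
    struct vm_mem_region *prev, *next;
};

struct vm {
    struct vm_mem_region *regions;   // head of the region list
    void *stage2_pgtable;            // per-VM translation table
};

// Link a new region into the VM's list; a real hypervisor would also map it
// in the stage-2 page table so guest accesses reach host_base.
static void vm_add_region(struct vm *vm, struct vm_mem_region *r) {
    r->prev = NULL;
    r->next = vm->regions;
    if (vm->regions)
        vm->regions->prev = r;
    vm->regions = r;
}
```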
Then device virtualization: ZVM uses a different virtualization method for each device type. For the core devices — the interrupt controller, the clock controller, the PCIe bus controller — we use full virtualization, because improper access to such core devices may cause a fatal error in the system. We also support para-virtualized devices such as virtio-blk, based on the virtio framework, which is common in other virtualization systems like Linux KVM or Xen; with the virtio framework, we can easily add virtual devices to our system. Then there is the pass-through device model, which is based on the Zephyr device model. With this device model, we initialize the devices and link them into an idle device list when the system boots. When we want to start a VM, we instantiate the virtual device and use the virtual device driver to attach it to the VM. When the VM wants to access the device, ZVM establishes an MMIO mapping for that VM, so it can access the hardware directly without a syscall or other mechanism. For interrupt management, ZVM handles device interrupts through the driver API and redirects them to the VM. Next is inter-VM communication. VM-to-VM communication enhances resource sharing and collaboration between virtual machines — for example, a VM running Zephyr and a VM running Linux can communicate with each other. With this communication framework, we implement data sharing between two VMs based on shared memory; the shared-memory nodes are organized into an AVL tree for quick lookup. The framework is highly portable and suitable for multiple operating systems: it is deployed on Zephyr and openEuler, and can easily be ported to Ubuntu or other operating systems. With this shared-memory method, it supports real-time, high-speed communication between VMs.
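The flavor of that shared-memory communication can be sketched as below; again, this is a hypothetical illustration in the style described, not ZVM's actual API:

```c
// Hypothetical sketch: two VMs map the same physical region; the sender bumps
// a sequence counter, the receiver polls it. A real hypervisor would deliver
// a virtual interrupt instead of busy-waiting.
#include <stdatomic.h>
#include <string.h>

#define SHM_BUF_SIZE 4096

struct shm_mailbox {
    atomic_uint seq;                  // incremented when a new message is ready
    unsigned int len;                 // valid bytes in buf
    unsigned char buf[SHM_BUF_SIZE];
};

void shm_send(struct shm_mailbox *mb, const void *msg, unsigned int len) {
    memcpy(mb->buf, msg, len);
    mb->len = len;
    atomic_fetch_add_explicit(&mb->seq, 1, memory_order_release);
}

unsigned int shm_recv(struct shm_mailbox *mb, void *out, unsigned int last_seq) {
    while (atomic_load_explicit(&mb->seq, memory_order_acquire) == last_seq)
        ;                             // spin until the sender publishes
    memcpy(out, mb->buf, mb->len);
    return mb->len;
}
```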
Yes. No, no — it's just for embedded systems. Yes, yes. It cannot be used for video games. It cannot. Why not? Because it's a different architecture from ARMv8, and if you wanted to run it on Intel, it would have to be ported — that work hasn't been done. Yes, only for... Okay, just to summarize: Zephyr is a small OS, and it supports multiple architectures — ARM64, x86, RISC-V, a lot of architectures — but it's mainly for embedded scenarios, for example automobiles, robots, industrial controllers. For other scenarios, like servers or the cloud, it's recommended to use KVM in Linux; there are already a lot of solutions in the IT world. ZVM is mainly aiming at embedded systems. Right? Yes. Yes, we designed it to replace the role of Linux plus KVM there. So, what's the difference? Because, you know, Zephyr has the real-time, safety and security abilities. If you want to pass certification — for example, in some scenarios like automobiles, all the software must pass certification; it must be safe — Linux cannot pass that certification, because Linux is too large, too complicated. Right? So if you want to go into aerospace or automobiles, Linux cannot be there, at the root of real-time, or at the root of safety or security. Just like our cell phones: your face and fingerprint data is not stored in Android or iOS — there is a security island, handled by a small OS like Zephyr. So, yeah. No, no — because big machines require performance, throughput, and Zephyr is not for that scenario, for complex machines, because it's simple. Simple is better. Of course you can use it, no problem — for example in some domain control or central control, it's not complicated. But for that scenario, the cost is more than the benefit. Yeah, that's right, stability is no problem. That's right: it can guarantee the latency, but it cannot guarantee the throughput, right? For big machines, people require high performance and high throughput, and that is done very well with Linux plus KVM. Yeah. Yeah, that's good.

All right, everyone, welcome to our topic. I'm very excited to be here at the FOSSASIA Summit, sharing our topic, an introduction to openEuler Embedded and its innovative features. My name is Yong Mao. I am a committer of the openEuler Embedded repository, responsible for developing the MICA framework and reviewing code from developers when they upload it to our repository. Okay, in the following half an hour, I would like to introduce two main topics. The first one is an introduction to openEuler Embedded, including its key features, its general architecture, and its application scenarios. Then I would like to introduce the mixed-criticality system framework.
It's a management framework for supporting multiple OSes running on a single embedded hardware platform.

Okay, so firstly — thanks to Dr. Xiong's presentation about openEuler, we may already have a better understanding of what openEuler is. As we know, openEuler covers four main scenarios: server, cloud, edge, and embedded. openEuler Embedded focuses on providing competitive operating system technology for resource-limited hardware platforms. And because of the need to build customized OSes, we chose Yocto as our build system: it's an open-source project from the Linux Foundation, and it provides an easy way to tailor our operating system. Okay, so after seeing the relationship between openEuler and openEuler Embedded, I would like to give a definition of what it really is. We define it as an open and comprehensive embedded software platform. It's not just embedded Linux — it's like a solar system. The sun is the embedded Linux: it provides the unified build, rich features, an ecosystem, and standardized interfaces. And we also have some non-Linux planets which provide rich features: they can provide security, an RTOS can provide hard real-time, bare metal can provide extreme performance, while embedded virtualization provides isolation and resource management.

Okay, so let's take a deeper look at openEuler Embedded. The blue parts are the scope of openEuler Embedded. For the hardware, we support more diversified hardware than the cloud and server scenarios do. For the applications, we mainly support operational technology applications, and we would like to attract users and vendors to build their own applications on top of openEuler Embedded, to provide more interesting features. At the bottom, we have FusionDock, which is a collection of technologies to support multi-OS running, including virtualization, containers, and so on. Above that, we have the MICA framework, which is the management framework for lifecycle management of the multiple OSes; it also provides the communication framework for the different OSes to talk to each other. Above the MICA framework, we have various types of OS. For the Linux kernel, we have configuration and optimization for different scenarios, and above Linux we have many different kinds of interesting features. For the RTOS, we don't have a single best option, so we support different kinds of RTOS: firstly UniProton, which is a very interesting RTOS, also from openEuler; then Zephyr, which is a well-known operating system worldwide; and RT-Thread, a very popular RTOS in China. Also, reliability is our focus, so we provide performance tuning, debugging, and tracing methods — for example, the MICA framework provides debugging for the RTOS through a GDB stub. And the tools and infrastructure are also very important: they are the backbone of our system, and they support our daily work.

Okay, so let's move on to the key features of our system. We have a "one plus X plus one" feature set. The two "ones" are the invariants — the two legs that support our operating system, that is, the embedded Linux and the infrastructure — with the X features in the middle. For the embedded Linux, we have a high-quality maintained kernel, which is the same kernel as the other openEuler distributions.
Above that, we do scenario-specific optimizations and configurations: for example, we apply the PREEMPT_RT patch for soft real-time applications, and we do fine-grained configuration and tuning to reduce binary size and memory consumption. As for the infrastructure, we chose Compass-CI as the foundation technology for our CI/CD system, and we chose Yocto as our build framework, as I said before. And we store all of our sources in Gitee, which is something like GitHub, but more popular in China. In the middle, we have DSoftBus, which is a collection of technologies to manage and connect different node devices in a local network. And we have our mixed-criticality system: it builds on foundation technologies like hypervisors and containers, and we provide a management framework above them. We also support an embedded robot runtime — we integrate ROS 2 into our system. And we also want to provide more support for embedded edge and embedded AI in the future: we will integrate KubeEdge and K3s into our system, so that we can improve the collaboration between cloud, edge, and embedded, and we will integrate mainstream AI frameworks, like TensorFlow, PyTorch, or MindSpore.

Okay, so after the key features, you may be wondering what it looks like when openEuler Embedded is running. Actually, we have four running modes. The first one is the typical mode: we run just one Linux on the embedded system. We get the advantages of the Linux ecosystem, but the cons are obvious: there is no guarantee of real-time behavior or high reliability, which are the advantages of an RTOS. As a result, in the following three modes we run more than one OS on a single embedded platform. In AMP mode, the asymmetric multi-processing mode, we deploy openEuler Embedded and an RTOS on different CPUs — for example, openEuler Embedded on an entire ARM core cluster and the RTOS on another ARM core. The hardware resources are statically allocated, so the limitation is that the number of OSes is bounded by the hardware resources: if we only have four CPUs in the embedded system, we cannot run five OSes on it. To resolve this problem, we introduce the virtualization mode, which uses virtualization technology to provide, in theory, the chance to run an unlimited number of operating systems on a resource-limited hardware platform. However, because support for heterogeneous cores is poor, a hypervisor can only run on a homogeneous core cluster. As a result, in the end we have the fusion mode: we run AMP mode across the heterogeneous cores, and virtualization mode on the homogeneous cores.

All right. So we can actually deploy our operating system as three different kinds of systems. Firstly, we can deploy it as a server, so that it controls different kinds of node devices over the Internet. We can also deploy it as an edge system for edge computing. And we can deploy it as what it was originally designed for — the embedded system — where it controls different kinds of peripherals. Okay, so I want to make a small summary of our ecosystem: openEuler Embedded is an open and comprehensive software platform.
It includes the embedded Linux part, which takes its kernel from the openEuler community — the same kernel as the other openEuler distributions. We use Yocto as our build system, which is different from the other distributions. And we also have some non-Linux parts, like the RTOS and the hypervisor. For the applications, we support industrial IoT, robotics applications, the energy industry, and BMC, which stands for Baseboard Management Controller. And we have some applications of it: we have applied our operating system in industrial controllers, unmanned vehicles, and some interesting robots. Last year at the openEuler Summit there was a very interesting robot: if you stand in front of the camera, the painting robot paints a picture of you. It also runs openEuler Embedded inside.

All right, I would also like to introduce the development board. EulerPi is a series of development boards designed for running openEuler Embedded, and HiEulerPi is one of them. Its SoC is the SD3403, designed by HiSilicon — that's why we name it HiEulerPi. It has very powerful hardware and can run complex software, and its usage is mainly industrial control and robotics applications. It supports various types of communication protocols and peripherals.

Okay, so secondly, I would like to introduce the mixed-criticality system framework. At first, I would like to make a small metaphor to help you understand what a mixed-criticality system is. In the past, we only had small flower pots: in each pot, we could only plant one single type of flower, in very limited quantity. But as time went by, we developed more mature pot-making technology, and nowadays we have very big flower pots: in each pot we can plant various types of flowers, and each type of flower is isolated — if one bunch of flowers dies, it does not affect the other bunches. Okay, let's move back to the field of computer science. In the past, we only had very simple hardware, like tiny MCUs, so we could only run something like bare-metal applications or a small RTOS on them. Nowadays, we have very complicated SoCs with multiple cores, even heterogeneous cores, so we can run more than one OS on a single SoC — Linux plus an RTOS plus some bare-metal applications. But to ensure safety, we need to provide an isolation mechanism, so that if one OS dies, it does not affect the other OSes.

Okay, so maybe now you have a general understanding of what MCS is; now I would like to provide a clearer, stricter definition. As its name says, MCS stands for mixed-criticality system — literally, a system with components of mixed criticality. So what does criticality stand for? It mainly refers to safety, but it can also extend to other notions, like real-time, security, power, et cetera. We mainly focus on three parts of MCS: deployment, quarantine, and scheduling. Firstly, we need to deploy different kinds of OS onto the platform, so that we have the mixed criticality. But that is not enough, because to ensure safety we need to provide quarantine between the different OSes. And to achieve better overall performance, we need to implement scheduling, so that — like in virtualization — if one OS is idle, its resources can be allocated to the other OSes. And we believe this is the future trend of embedded systems, for a few reasons.
From the server side, the hardware is evolving from distributed to centralized. For example, the ECUs in automotive vehicles used to do only very simple tasks, but nowadays they are evolving into zone controllers that control a large area of hardware, so we need more complex software to do the control. And from the client side, we are getting more and more complex node devices, like today's drones: they have to keep their balance while taking photos and transmitting the data back to the cloud, all at the same time. They are doing many complex things at once, so we may need more than one OS for the different applications.

All right. So I would like to introduce our mixed-criticality deployment framework. Why do we call it a deployment framework? Because the quarantine and scheduling mechanisms are implemented by the foundation technologies, like virtualization and containers, which are included in the FusionDock part. Our MICA framework mainly includes four parts: lifecycle management, cross-OS communication, the service framework, and the multi-OS build infrastructure. For the lifecycle management, we provide a unified interface for multi-OS management, so that we can hide the differences between the foundation technologies in FusionDock. As for the cross-OS communication, we provide a shared-memory-based channel to support efficient communication. As for the service framework, we define service interfaces, since different OSes provide different services. And in the end, we also provide a unified resource description, so that different OSes can dock into our MICA framework more easily, in a more standard way.

So we can take a deeper look into our framework. The lifecycle management framework needs to control the hardware, so it is actually platform-dependent; we will take the ARM architecture as an example. Firstly, if the MICA framework wants to power up the remote processor, it sends an SMC call to the bottom-level firmware, and the bottom-level firmware does the actual work of powering up the CPU for the remote operating system. In our framework, we mainly use the remoteproc framework as the base technology for most of the lifecycle management. Remoteproc is a framework in the Linux kernel, and it provides a standard way to abstract remote operating systems: in the framework, a remote operating system is a remoteproc instance. So in our MICA framework we have many remoteproc instances, and below them we have a unified resource manager to manage all kinds of resources in the MICA framework. Each remoteproc instance calls the remoteproc interfaces to load the OS binary into the destination memory, parse the resource table, and so on.
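MICA's lifecycle management builds on remoteproc, and the kernel already exposes that through sysfs. As a minimal illustration — this is the generic Linux remoteproc interface, not MICA's own tooling, and the instance number and firmware name below are board-specific placeholders — starting and stopping a remote RTOS from user space can look like this:

    #include <stdio.h>

    /* Write a string to a sysfs attribute; returns 0 on success. */
    static int write_sysfs(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (f == NULL) {
            perror(path);
            return -1;
        }
        fputs(value, f);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        /* Pick the firmware (an image under /lib/firmware) for the
         * remote core, then boot it. "zephyr.elf" and "remoteproc0"
         * are hypothetical; both depend on the actual board. */
        write_sysfs("/sys/class/remoteproc/remoteproc0/firmware",
                    "zephyr.elf");
        write_sysfs("/sys/class/remoteproc/remoteproc0/state", "start");

        /* Later, the same attribute shuts the remote OS down: */
        /* write_sysfs("/sys/class/remoteproc/remoteproc0/state", "stop"); */
        return 0;
    }

Under the hood, writing "start" triggers the sequence described above: the kernel loads the ELF image, parses its resource table, and, on ARM platforms, ends up in the SMC call to the firmware that powers up the remote CPU.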
So, the resource table: it's a very interesting and special data structure inside the ELF-format firmware. Each entry of the resource table describes one type of resource shared between the two OSes. As for our service framework, for now we mainly depend on the RPMsg framework from the kernel as well. If we have one service between the two OSes, it establishes one specific endpoint for it. But as an abstraction, a service is a different, more abstract notion than an endpoint. For example, we can have a pseudo-terminal service — the PTY service — but between the two OSes there may be more than one PTY connection; as a result, for a single service we may have more than one endpoint mapped to it. We can also provide self-defined services, like debugging the RTOS with a GDB stub, where the bottom level is not implemented with the RPMsg framework but with our own defined data structures (a small user-space sketch of RPMsg endpoints follows at the end of this talk).

All right, then I would like to introduce the history and the future of the MICA framework. In the past, we only implemented the AMP-based MICA framework, so we could only support the deployment part, but not quarantine and scheduling. Now we are working with many excellent virtualization technologies — hypervisors such as ZVM and Rust-Shyper — and we also want to use lightweight containers to enrich our ecosystem. As a result, in the future we want to support both heterogeneous and homogeneous cores, and even support the collaboration between the node devices and the cloud systems.

Okay, so I would like to show our vision of the future framework. Firstly, we will deploy many OSes on the homogeneous core cluster with a Type 1 hypervisor. Even though with a Type 1 hypervisor the different virtual machines are isolated, there is actually a management VM for managing the whole virtualization system. And because the management VM is responsible for controlling the communication into and out of the system, if two VMs want to send interrupts to each other, they have to go through the management VM; but to improve the efficiency of data transmission, the two VMs can directly establish their own data transmission channel. Secondly, we can also deploy our operating systems on heterogeneous cores, where the communication happens in physical memory. At the physical layer we have the shared memory, with the virtio protocol as the link-layer protocol; above that, besides RPMsg, which acts as a transport-layer protocol, we also want to support other protocols, like UDP and TCP; and above that, we want to establish our own RPC framework, that is, a remote procedure call framework. And certainly we also want to modify iSula — iSula is a lightweight container technology, something like Docker, but from the openEuler community — so that users can deploy an operating system through K3s and iSula onto the MICA framework, to here or to here, onto the embedded device.

All right, so lastly: welcome to engage in our openEuler community, and these are the resources for you. Okay, thank you for listening. If you have any questions, you are welcome to ask.
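As an aside on the cross-OS communication part: on the Linux side, RPMsg endpoints like the ones described above are exposed to user space through the kernel's rpmsg_char interface. Here is a hedged sketch — the control-device path, the service name, and the /dev/rpmsg0 node are placeholders that vary by board and kernel configuration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/rpmsg.h>  /* struct rpmsg_endpoint_info, RPMSG_CREATE_EPT_IOCTL */

    int main(void)
    {
        /* Create an endpoint through the RPMsg control device. */
        int ctrl = open("/dev/rpmsg_ctrl0", O_RDWR);
        if (ctrl < 0) {
            perror("open rpmsg_ctrl");
            return 1;
        }

        struct rpmsg_endpoint_info ept = { .src = 0, .dst = 0xFFFFFFFF };
        strncpy(ept.name, "my-service", sizeof(ept.name) - 1); /* hypothetical name */
        if (ioctl(ctrl, RPMSG_CREATE_EPT_IOCTL, &ept) < 0) {
            perror("create endpoint");
            return 1;
        }

        /* The endpoint appears as its own char device; messages to and
         * from the remote OS are plain read()/write() calls. */
        int ept_fd = open("/dev/rpmsg0", O_RDWR);   /* device name varies */
        if (ept_fd < 0) {
            perror("open endpoint");
            return 1;
        }

        write(ept_fd, "ping", 4);
        char buf[64];
        ssize_t n = read(ept_fd, buf, sizeof(buf));
        if (n > 0)
            printf("remote answered %zd bytes\n", n);

        close(ept_fd);
        close(ctrl);
        return 0;
    }

A MICA-style service such as the PTY or GDB stub would then be a layer on top: one named service, possibly multiplexing several endpoints like this one.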
Please — how is the quarantine aspect of MCS and MICA enforced when you've got operating systems on the same hardware? — The quarantine technologies for our MICA framework are mainly implemented through the virtualization technologies or container technologies, so we do not implement them by ourselves, because there are some mature technologies for it. Yeah. Okay, thank you. — So what are you using for a BMC? Are you using OpenBMC or an existing project? — Actually, community members implement their own systems; they are not using the OpenBMC image, they are using their own company-developed BMC applications. However, our community has implemented an OpenBMC image, so you can have a try. Yeah. So, anyone else? Okay. One more thing to mention: Dr. Ren is the chief architect of openEuler Embedded, and if you have more questions, you are welcome to ask him as well. Okay, thank you.

I am from China, and it's my first time coming to Vietnam — I'm very excited and honored; there are so many young faces. My topic is the accelerator framework UADK. I have worked for Linaro for over 10 years and have been collaborating with HiSilicon for over 10 years. Now I maintain the kernel submodule UACCE and also maintain areas of UADK, like the OpenSSL and DPDK parts; we are also upstreaming the SPDK UADK component. Here is the agenda: first UACCE, then UADK and some components of UADK, and lastly the performance.

As we know, there are now many accelerators — GPUs, NPUs, TPUs, and so on — and more and more user-space drivers, which avoid the memory copy from user space to kernel space. So it's better to directly provide a user-space driver. And lastly, security: we all know the kernel is safe, we can trust the kernel; but if the driver is in user space, can we trust it? This is our UADK overview. UADK is based on the IOMMU, so it provides boundary checks and permission checks. For example, your driver cannot access other areas of the accelerator; you can only touch your own accelerator. So it is much safer. And the IOMMU provides the SVA feature, which means shared virtual addressing: in user space, we can directly use virtual addresses, and since it is a user-space driver, we can do DMA to the accelerator directly from user space. What's more, with this SVA feature — usually we would have to use physical addresses, but with SVA we can use virtual addresses directly — it is very convenient. And thanks to the PASID, we can use the accelerator from multiple processes, because the PASID can distinguish them.

Then there are several modules: the kernel module and the user-space module. This is the general framework. At the bottom is the accelerator. It provides several queues, and each queue can provide a service, so many queues can work simultaneously. Now, how does user space use those queues? The accelerator has to register with UACCE. With the help of UACCE, multiple applications can directly open UACCE to get access to a queue. For example, on the left is PASID one, application one; on the right is PASID two, application two. They can run at the same time — the PASID distinguishes them. Also, within application one we may want to access two queues, or many queues, for multiple threads and better performance, so you can open it multiple times. UACCE uses the IOMMU for the bind: the IOMMU binds the device to the current mm. The code is already merged in the kernel, in 5.7. Is there any question about this picture? This is the general framework of UACCE.

Okay, next: this is SVA, general SVA. We all know the CPU has an MMU with a page table, so the application uses virtual addresses. The device also has an MMU — it's called the SMMU, or IOMMU — and it also has a page table. So there are two page tables. If so, why not make them the same?
So the SVA feature uses the same page table: the MMU and the SMMU share one page table — only one page table — so they know about each other. With this, DMA can directly use virtual addresses. And they also support page faults, similar to the MMU: for example, in SMMU v3, when a page fault happens, there is a handler that calls handle_mm_fault, the same as for the MMU, and it will provide the memory for you. Is that okay?

This is how an accelerator uses UACCE. In the accelerator driver, in the probe path, it has to allocate UACCE and then register with UACCE. UACCE uses the IOMMU to enable the I/O page fault feature and the SVA feature, and after that it provides a char device. With that char device, the user application can get access to the accelerator. In the remove path, the driver does the UACCE removal. The driver also provides some ops at registration, like get queue, put queue, start queue, stop queue, and mmap. After the registration, the application can directly open UACCE — it is a char device, so you simply open it. With that open, UACCE binds the device — the accelerator — to the current mm. After that bind, virtual addresses are directly recognized by both the application and the accelerator, and UACCE gets a queue from the accelerator. After the open, the application mmaps the MMIO region, which is the doorbell: you prepare the DMA and send the doorbell to tell the accelerator to start the DMA. You also mmap another area for the DMA range. So after the preparation, you just prepare the data and send the doorbell to tell the accelerator to start. After that, you can poll the fd, or just keep reading. And when everything finishes, you put the queue and close the fd at the end. This is how a user application uses UACCE (a rough sketch follows below); that was the kernel module, UACCE.

This is the whole picture of UADK. On the right is a picture: at the bottom is the accelerator, then the kernel driver, UACCE; then in user space there is the library, the UADK library, where we provide the user-space driver. But be careful — there are two requirements. Firstly, the IOMMU has to provide the SVA feature. Secondly, the device has to support it too, because of the PASID: to use the accelerator from multiple processes, the device has to recognize the PASID. If the accelerator does not support SVA, you cannot use it. On the HiSilicon platform, we currently use the crypto accelerators, which support this feature. We provide compression — zlib, gzip, deflate, and zstd — and also asymmetric crypto, like RSA, DH, ECC. We know asymmetric crypto is very costly: it takes a long time, unlike symmetric crypto, which is simple and much faster. Something like RSA or DH takes a long time if you do it on the CPU, so it is very suitable for the accelerator; we provide RSA, DH, and ECC. Also digests — hashes like MD5, SM3, et cetera. And now we have hooked UADK up to DPDK and also SPDK, so you can use it there directly.
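Before the performance numbers, here is what that user-space flow — open the char device, mmap the doorbell and the DMA area, kick the device, poll the fd — might look like. Everything here is a placeholder for illustration: the device path, the mmap offsets, and the doorbell layout are invented, not the real UADK/UACCE API:

    #include <fcntl.h>
    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Opening the char device binds the accelerator to the current
         * mm through the IOMMU (SVA), so this process's virtual
         * addresses become directly usable for DMA. Path is made up. */
        int fd = open("/dev/hypothetical-acc", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Map the MMIO doorbell page and a queue region; the offsets
         * here are placeholders — the real driver defines the layout. */
        volatile uint32_t *doorbell =
            mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        uint8_t *queue =
            mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 4096);
        if (doorbell == MAP_FAILED || queue == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Thanks to SVA there is no copy into a kernel DMA buffer: the
         * request can reference ordinary pointers. */
        const char src[] = "data for the accelerator";
        memcpy(queue, src, sizeof(src));   /* stand-in for a descriptor */
        doorbell[0] = 1;                   /* ring: tell the device to start */

        /* Wait for completion by polling the fd. */
        struct pollfd p = { .fd = fd, .events = POLLIN };
        poll(&p, 1, -1);

        close(fd);
        return 0;
    }

The real UADK library wraps all of this behind its algorithm APIs (compression, RSA/DH/ECC, digests), so applications normally never touch the queue directly.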
Oh, sorry — this is the performance. The green line is using the CPU; the brown color is the hardware. For the first line, SM4, hardware is about four times better than software, and the CPU usage also goes way down. zlib and gzip are almost 20 times better, and RSA/DH is about three times better. This is the performance data, tested on the HiSilicon Kunpeng 920 platform. This is the host; this is the guest. In the guest, we have to use the virtual SMMU, which needs nested translation — that is still being upstreamed. We use some techniques to get better performance in the guest. We know the guest pays some penalty, because of the nested translation: a virtual address in the guest has to be translated to the guest physical address, and then to the host physical address. So we use a technique: huge pages, two-megabyte huge pages, to decrease the TLB misses. With this, the performance in the guest is almost comparable to the host — about 85% or 90% of the host. That was the guest.

This is the UADK test for OpenSSL. The blue line is software; the yellow line is the hardware accelerator through UADK. You can see the sign operation, on the left: signing is much better, about four times; verify is a little bit better. And SM3 with big data: hardware usually does better with big data, big packet sizes, because of the setup cost — with small sizes, the CPU is better, because there is no preparation. SM4 is the same: with big data, it is better. Here is DPDK: DPDK is already supported, since 22.11, and we are now upstreaming the SPDK part, so in SPDK and DPDK you can use the UADK components directly. We see the same pattern: for small packets, software is better, because there is no preparation and you can use the CPU at any time; but for big packets — the bigger, the better — hardware wins. You just hand all the data to the accelerator and then you don't have to care.

Also, since UADK provides crypto and compression: right now, to use UADK you have to have the accelerator, but sometimes you may not have one. So we are investigating using CPU instructions. This is a test with the NEON CPU instructions; you can see that the instructions are also better than the plain software, which does not use those CPU instructions. So, as a total solution for UADK, we are also adding CPU-instruction backends like CE and SVE — CE means Cryptographic Extension, SVE means Scalable Vector Extension; they are both Armv8 features, so if you have an Armv8 CPU, you can use these two instruction sets. We are adding both to our UADK solution — see the blue line. Also, for the openEuler AccLib, we have bi-weekly meetings; if you are interested, you can join — that is the GitHub link. This is a short description of Linaro. We come from Linaro, and Linaro has many maintainers; we have always been a top-three contributor to the Linux kernel for over 10 years, since 2011. We also provide some special features, like OP-TEE, which is still developed and maintained by Linaro. So if you have interest, especially in the Arm ecosystem, you can contact us. Yeah, that's all. Thank you. Any questions? Okay, thank you.

So I guess we'll start. Hi everyone, you are in the vector processing with the RISC-V Vector extension talk. I'm Rémi Denis-Courmont, and I happen to be — well, you don't really care, but I happen to be a software security research engineer; that is not what I'm talking about today. And, closer to this presentation:
I also happen to be one of the main developers of the VLC media player, for those who know it, and I have about 30 years of experience in this field. But first, let's talk about optimizing programs — calculations. When I started computing, I was still a kid, and I was using my mother's computer. Back then it ran at 0.008 gigahertz, so it was pretty slow. But back then, to make your program faster, you just waited a few years, got a new computer, and it was just faster. In 2003 I got my first own computer at university, and it ran at 2.8 gigahertz — several hundred times faster in just ten years. But then something happened: we got close to the physical limits, and we couldn't increase the clock frequency anymore; instead, we increased the number of cores.

So, how do you speed up your program? Well, obviously, first we tried increasing the clock frequency. Then we used multiple threads — symmetric multi-processing and simultaneous multi-threading, which are kind of similar techniques. The problem with those is that you need to make your program multi-threaded, and that's not always easy; it doesn't even always work. Or you can use specialized processing units, like a GPU, VPU, ISP, and so on. And then there's a trick that has also been used for general-purpose computing: superscalar execution. Superscalar execution is when your processor does multiple things at the same time within a single thread. For instance, let's say I want to calculate the sum and the difference of two values, A0 and A1, at the same time. I will have those two instructions to run on the processor. They're independent instructions, so the processor is able to run them in parallel and save a lot of time, without increasing the clock frequency. Of course, you need to increase the processor size. Now, the problem with this: you need to decode multiple instructions, and you have to look for dependency hazards, or data hazards, between the instructions. For instance, if I want to sum A0 and A1, and then add A2 to the result, the second instruction depends on the first one. I can't run them in parallel, because I need the result T0 of the first one to calculate the second — otherwise the calculation would obviously be incorrect.

So here comes vector processing, sometimes also called SIMD, meaning single instruction, multiple data. The idea is to apply the same instruction to multiple data elements at once. It's also called vector processing because you can think of it kind of like linear algebra, if you've done first-cycle maths. This was popularized in the 90s by the Pentium MMX. We basically tell the processor explicitly that we want to execute multiple calculations in parallel and that it is safe to do so. Of course, it means that we, as programmers, take the responsibility of guaranteeing this to the processor, to enable faster speeds. So today we're talking about this on RISC-V. RISC-V, as many of you probably already know, is a new instruction set architecture which is an open standard — royalty-free and without patent encumbrance. RISC means reduced instruction set computer, which means it's very easy to decode the instructions, so the processor can be dedicated to calculating things and not waste time decoding instructions, as would be the case on an Intel processor.
This also means that anyone can design their own processor: you don't need a patent license like for x86, and you don't need to buy an expensive license from a company like Arm. And you have vendors from China, the US, Europe — and I don't know about Vietnam, but there might be some too. So RVV, the RISC-V Vector extension, is an open standard. There are already some open-source designs for it, and there's also some real hardware — we'll come to that. The reason I'm talking about it is also that it's a lot easier to explain in 25 minutes than x86. And an interesting aspect, compared to x86 especially, is that it's scalable: it's basically write once, run anywhere. I think you've heard this story already in the talks in this track today. As for availability, there's currently one piece of hardware that's widely available, and it's actually quite cheap — about one and a half million dong. It's credit-card sized. But I think we're going to get much better hardware within months or years. You can also emulate it on QEMU, of course: on Linux, you can just install the binfmt QEMU wrapper, and it will literally run your RISC-V programs directly on your x86 or Arm Linux system. This is really convenient for testing purposes.

So anyway, that was the introduction. Let's have an as-simple-as-possible — still not that simple, but motivating — example. Obviously, in 25 minutes we are not going to go through the 190-plus different instructions in the instruction set of the vector processing unit. The classic example for vector processing is the SAXPY function, which means single-precision A times X plus Y. If you know C — but even if you don't know C — it's fairly easy: we calculate the product of a vector X by a scalar (constant) value A, and then we add a vector Y, and that gives us a vector. This is obviously linear algebra, and it's a fairly simple example of vector processing, because we're actually doing vector calculations. Now I'm going to rewrite that function a little bit, in a way that is closer to how the computer would actually want it, or how the compiler would optimize it for you. This is essentially just removing the iterator and instead directly incrementing the pointer addresses as we go — it's actually faster to do it like that, and in practice your compiler would do it for you, but it helps to understand what happens. So essentially, at each iteration of this loop, we decrement the number of elements that we still have to process, and we increment the three pointer addresses as we go down the vectors — one element at a time. But obviously, one element at a time is not very fast. So let's write it in a way that is even closer to how the computer would actually see it: we don't have while loops in the computer, we only have conditional branches. So in practice — well, this is our main function; I've renamed the variables to use the register names, so that we can map it to the code that's coming on the next slide. A3 is the count value. A0 is the first argument; it's the target output vector's memory area. FA0 is the constant — this means floating-point argument number zero. A1 is the second non-float argument; it's the X pointer address. A2 is Y, and then A3 is the counter. So this is what it would actually look like, in more pseudo-code form — how the compiler would probably implement it in practice.
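The slide itself is not reproduced in this transcript, but the function being described is the classic SAXPY loop. Here is a plausible C version — my reconstruction, with a separate output vector to match the register mapping just given (A0 = output, FA0 = a, A1 = x, A2 = y, A3 = count):

    #include <stddef.h>

    /* out[i] = a*x[i] + y[i], the classic SAXPY loop. */
    void saxpy(float *out, float a, const float *x, const float *y, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = a * x[i] + y[i];
    }

    /* The pointer-incrementing rewrite the speaker describes: drop the
     * iterator and advance the three pointers as the count goes down. */
    void saxpy_ptr(float *out, float a, const float *x, const float *y,
                   size_t n)
    {
        while (n != 0) {
            *out = a * *x + *y;
            out++; x++; y++;
            n--;
        }
    }

The second form maps almost one-to-one onto the register-level pseudo-code: the three pointer increments become the address updates, and the decrement-and-test on n becomes the conditional branch at the bottom of the loop.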
Of course, it's eventually going to be actual assembler instructions, but we'll get to that. And we of course have to have an explicit return at the end. I don't know how many of you have done assembler before, but if you want to optimize down to the instruction level, this is what we get — and for this one, it's actually fairly simple. So now we're switching to assembler, and we're going to write this at the top of our source file — a .S file instead of a .c. First, ".option arch" is a directive for the assembler which says that we are enabling the vector extension. Essentially, we are telling the assembler that it's okay to use vector instructions, because I know my processor will have vector support. And likewise, we enable the Zba extension; it doesn't matter much here. Then we say ".text": this means we are now writing code, which is "text", so to speak, for the computer. You could also write ".data", and then it would be constant or modifiable data, but ".text" means it's a program. Then we say we have a global function, or global symbol, which we call sxpy — and yes, there's no A in there; it's not a typo in my slides, it's actually like that. And then we say the sxpy function starts here, and we write our code.

So after this, how do we express this with vector instructions? If you have never seen assembler, this is going to look a little cryptic, so let's go through it step by step. First, we have to tell the processor the vector length — the number of elements that we have in our vectors. In this case, that was the count variable, which is now the A3 register, if you recall the previous slide. So here we tell it: we have A3 elements, whatever that value happens to be, and each element is 32 bits, because we have floats, and a float is a single-precision floating-point value in this example. So that's 32 bits per element. We could put 64 or 16 or 8, depending on the data. If you want to do AI, for instance, it's going to be half precision: you could say E16 to have 16-bit floating-point calculations. And then, what we have here: A1 and A2, the X and Y vectors, are pointers — addresses in memory. The data is in memory, not in the processor. So we need to move the data — transfer it, copy it — from the memory into the processor, where it can do calculations. This is called loading. So we write V for vector, L for load, and then E32, meaning elements of 32 bits, and we say we load from the address in A1. And we need to put it somewhere: the processor has 32 vector registers, named V0 to V31 — not very original — and we need to pick one; it doesn't really matter which, as long as we remember which one contains what. Here I took zero, so this loads what was at address A1 into vector register V0. Then we do the same for Y, from A2 into V8. So now we have copied the data of X and Y into the processor. We need to multiply X by the constant A. The constant A is a parameter, so it's already in the processor; we don't need to load it, and we can directly do the multiplication. This is V for vector, F for floating point — if you don't put F, it would be integer — and MUL for multiplication, obviously. Then a dot, and then we say what kind of inputs we have.
So VF means one vector and one floating-point value — one scalar, one single value, if you will. In the next one, you have VV: this means we are operating on two vectors. So here, VF means we have one vector and one scalar, and this calculates FA0 times V0 for each element in V0, and puts the results back into V0. In assembler, the result usually goes in the first parameter. And then we do the add: V8 plus V0, also into V0. This adds V0, which is X times A, to V8, which is Y — so we have our result. But now the result is in the processor, and we need to transfer it back into memory, where the calling code, the program, expects it to be. So we do the opposite of a load, which is a store: V for vector, S for store, E32, because again we have 32-bit elements. And we say we are transferring the contents of vector register zero to the memory location — the pointer — in A0. This kind of looks like this in C, or in Rust, or in C++.

All right, but there's a problem. The processor doesn't have infinite capacity to store infinitely large vectors. If we have one million elements in our vector — a one-million-dimension vector in linear algebra terms — this is just not going to work. So there's a trick here: this T0 will contain the number of elements that the processor is able to process at a time. It's going to be more than one, but it's not going to be absolutely huge, so we are going to have to iterate this set of instructions several times, and every time we will operate on T0 elements, until everything is calculated and we can exit the function. So in total — okay, it gets a little more compact here — we again have the set-vector-length instruction, and we count down: A3 is the counter, so we subtract. We are going to process T0 elements, so we subtract T0 from A3 and update A3. And here at the end, we check: if A3 is not equal to zero, then we go back to label one. Then we have our loads — this is X, this is Y — this is the multiplication, this is the add, and this is the store. And these SH2ADD instructions are just updating the pointers, incrementing them across loop iterations. The way this works — actually, I think it might be A1, T0, A1 — SH2 means shifted by two bits to the left. So this shifts the value two bits to the left — multiplies it by four, in other words — and then adds it. So this updates A1 to A1 plus T0 times four. And the reason we do T0 times four is that we have 32-bit elements: four-byte elements. In C, the compiler does this adjustment, this multiplication, for you, but in assembler, pointers are always byte values, so you have to adjust for the size of the elements you're operating on. If you think about how a pointer is actually represented in memory — if you've ever printed a pointer in C — you'll see that it's always a byte value. So we have to manually multiply by four, because our elements are four bytes. Of course, if we had two-byte elements, because we were doing AI stuff, then it would be times two, and this would be SH1ADD, because to multiply by two, you shift by one bit to the left. And that's it.
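Putting the pieces together: here is the whole loop expressed in C with the RVV compiler intrinsics rather than raw assembler — an alternative rendering, not the speaker's slide. It assumes a toolchain that implements the RVV intrinsics with the __riscv_ prefix (recent GCC or Clang) and compilation with something like -march=rv64gcv. The m8 group multiplier used here is the "unroll factor" the talk discusses next:

    #include <riscv_vector.h>
    #include <stddef.h>

    /* out = a*x + y, strip-mined over vl elements per iteration. */
    void saxpy_rvv(float *out, float a, const float *x, const float *y,
                   size_t n)
    {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m8(n);            /* vsetvli t0, a3, e32, m8 */
            vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl); /* vle32.v v0, (a1) */
            vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl); /* vle32.v v8, (a2) */
            vx = __riscv_vfmul_vf_f32m8(vx, a, vl);         /* vfmul.vf v0, v0, fa0 */
            vx = __riscv_vfadd_vv_f32m8(vx, vy, vl);        /* vfadd.vv v0, v0, v8 */
            __riscv_vse32_v_f32m8(out, vx, vl);             /* vse32.v v0, (a0) */
            x += vl; y += vl; out += vl;                    /* the sh2add updates */
            n -= vl;                                        /* sub a3, a3, t0 */
        }
    }

The compiler takes care of the tail/mask policy and of the byte-scaling of the pointer updates that the talk walks through by hand.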
So at the end, we check whether A3 is now down to zero, and if it is not equal to zero, we branch to label one — the B suffix means backwards, so back up. We could put F for forward, but here, of course, it's backward. So yes, that was the main part; now we can rest our brains a little bit. I know it's a little bit complicated — and it's actually much more complicated on other architectures. So let's go back to this set-vector-length instruction. As I said, we have the element size, 32 bits in this example. A3 is an input: it says I have this many elements that I want to calculate with. And T0 is the output: the processor tells us — the program — conceptually, how many elements it can process in one iteration. On current hardware, which I mentioned earlier, we have 128-bit vectors in practice. So with 32-bit elements in a 128-bit vector, we're going to get four — assuming, of course, that A3 is larger than four, T0 is going to be equal to four — which means we get a speedup of up to roughly four times. Of course, it depends on the hardware. And this is significant: if you pay per minute of VM runtime on some cloud service, multiplying your performance by four is actually a huge cost saving. But we can do much better than this.

There are additional parameters we can add here: M8, TA, MA. M8 is officially called the group multiplier, but it's essentially an unroll factor — let me explain what it means. If we go back — you can, I have a URL to the slides, you don't need to take photos — if we go back to this, I'm only using V8 and V0 here, and I have 32 registers. I'm only using two out of them; that's not a very efficient use of my hardware. So I'm like, well, I only need two vectors for this calculation, so I'll just pretend that I have fewer, bigger vectors. The maximum you can use is eight. What this means is that V0 becomes a representation of all the vectors from V0 to V7 — V1, two, three, four, five, six, seven — and V8 becomes eight, nine, up to 15. And if we used V16, it would be all the way to 23, and then 24 to 31. So by doing this, we end up with only four vectors, but they are eight times larger than they would be with the normal M1. And this again improves the performance quite a bit: with this, we would typically get like five to eight times, maybe. Of course, it doesn't scale infinitely. And we can use other values: if we need more vectors, we have M4, M2 — M1 is the default — and you can even go fractional: half, quarter, and eighth. TA means tail agnostic. It basically means that if we have fewer elements than the maximum, we don't care what happens to the stuff after the end — that's why it's called tail agnostic; the tail is that stuff. Say you have a vector capacity of four elements in your processor, and you only have three elements that you actually want to calculate with: you are saying that for the fourth element at the end of the vector, you don't care what the CPU does with it — it can do the calculation, it can ignore it, whatever. Essentially, you're telling the processor: please do whatever is fastest for your implementation; I don't really care.
And similarly, MA is mask-agnostic mode. This is about masking; I'm not going to go into the details of masking — if you're interested, you can read the specification — but again, it's another way of saying that we don't need to worry about masking in this calculation: do whatever you want, whatever is fastest. So that's pretty much it — and we're on time. References: well, obviously, we only used about five vector instructions here, out of the roughly 190 different instructions. You have subtraction; you don't have division, but you have all sorts of integer operations — XOR and OR and add and sub and reverse-sub and whatever — and even more complex things, like the average, which is (A plus B) divided by two. So I can't go through all of them. Of course, in a real case, if you want to put this to use, you're going to have to go through the RISC-V "V" vector extension specification, which has the list of all the instructions, nicely grouped into different families: integer, float, fixed-point, and all sorts of things. The unprivileged RISC-V specification is the baseline RISC-V instruction set, for doing the basic things like subtraction and branching, for instance — again, you might need another branch condition here; this is just an example, and you can go to the unprivileged specification for that. What's nice to note is that they have separate specifications for unprivileged and privileged, and if you're just writing normal software, you can use just the unprivileged one. Privileged is if you want to write kernel code or hypervisor code, which does memory management, that sort of thing. Here we don't care about that, so there's half of the specification we don't need to care about. And then you have the ABI spec. That's the spec that tells you which registers your parameters and your return values go into, and how all the other data is laid out in memory. It's essentially the mapping of C and C++ and Rust and D and all these natively compiled languages onto the processor, for this specific architecture — and there are ABI specifications for every architecture: there's an Arm ABI, there's even an Arm 64-bit ABI and a 32-bit ABI, et cetera, so there's a RISC-V ABI. You can find all of these on the RISC-V GitHub. The slides are there too, so if you need to take one photo, it's this one; there are also full working sources, with the Makefile and everything. And that's it. Yes?

So, in my experience — obviously it depends a lot on the implementation, and because RISC-V is an open specification and not hardware as such, it will depend on the vendor. The hardware I have, which is currently the main available hardware, comes from Canaan, but the core is really made by T-Head, which is a subsidiary of Alibaba. I think for this loop it gives us maybe six to eight times faster, but it depends on the loop. And compared to x86 or Arm — I think compared to Arm it's roughly the same, typically. Of course, it depends on what processor you have; usually you can't choose. Yeah, so x86 is AMD and Intel.
For x86, it's a bit different, because on x86, on the face of it, using — well, not MMX, but SSE or AVX — will give you bigger improvements than this. But the reason, I think, is more that on Intel the CPU is really slow for non-vector, scalar calculations: because of the complex instruction set, it has to do a lot of work decoding and speculating execution and all sorts of things, and because of this, the scalar execution is actually quite slow. So it gets a lot more benefit from vector calculations than a simpler RISC architecture like Arm or RISC-V does. And for this reason, I think, you will see that AVX gives you better performance relative to plain x86 than RISC-V vectors give you relative to plain RISC-V — and the same if you compare Arm NEON or Arm SVE to plain Arm. Yes?

So, I don't make hardware; I'm a software engineer. I think this year several vendors have announced that they will release hardware, but the problem when they announce is that you never know: did they just finish the hardware design — which means you have one or two years to wait before the CPU comes — or do they actually have a board coming that you can put in an order on? Currently, this one is available; it has been available since October last year, and if you order it today, you'll get it in a week. But there will be better hardware from a variety of companies. You have T-Head (Alibaba) in China, and also StarFive; in the US, you have SiFive and Tenstorrent, I think — and there's a whole bunch of them, because it's an open instruction set, so there's a lot of competition. This vector specification, version 1.0, was released about two and a half years ago, approximately. They're not touching the base spec; they're adding encryption extensions to improve the speed of encryption algorithms, and maybe some other stuff — I don't actually know everything they're adding. Yes?

How many — what? So this hardware has one vector-capable core and one non-vector-capable core: it has two cores, only one of which has vector support. But there are some boards already available with four cores or eight cores, and somebody has even made a kind of supercomputer with a very large number of cores. RISC-V is just an instruction set, so you can design whatever you want; I don't think there's a fundamental difference in how many cores you can have — it doesn't really matter which instruction set you have, you can make the cores really large or really small. RISC-V scales up and down. And as I said, if you have multiple sockets, you could have a really large number of CPUs in one such piece of hardware. I don't remember — I think Milk-V, or something — a Chinese company has made real hardware with a really large number of cores, but they don't have vectors yet. It's going to take a few more years to get high-performance vector cores, I think. So currently, obviously, at that price, you're not going to get a supercomputer. This is good for early testing and benchmarking, but I think in a matter of a year or two, we're going to get really performant — more server-class or mobile-phone-class — cores coming. But I'm not a vendor, so I can't really say. And if that's all, then I guess I'll release everyone for lunch. All right.