Hello. Thank you for coming to our presentation session. I'm Seungwoo from Samsung Electronics, and this is my colleague Chanwoo. Today, we will introduce our 64-bit kernel bring-up on our Samsung TM2 board with the ARMv8 architecture Exynos 5433 SoC.

Before starting the main subject, I want to introduce ourselves first. Chanwoo is the kernel maintainer of the extcon driver (the external connector), the Exynos clock driver, and devfreq (device frequency). In our Tizen kernel team, Chanwoo is one of the maintainers of the Tizen kernel. And I, Seungwoo, am the kernel maintainer of the Exynos DRM driver. In the Tizen kernel team, I am also one of the maintainers of the Tizen kernel.

OK, this is today's agenda. First, I will introduce what the Samsung TM2 board is, and I will share some of our work on the 64-bit kernel. After that, Chanwoo will share more information about what we did and our experience with some issues on the 64-bit kernel. And then maybe we will have some time for Q&A and discussion.

OK, then what is the Samsung TM2? It's simple: Tizen Mobile 2, Samsung Tizen Mobile 2. It is the mobile reference board for Tizen. It uses the Samsung Exynos 5433 SoC. The SoC is based on the ARMv8 architecture, and of course it supports 64-bit.

Here is a short history of our development on Tizen Mobile 2. In the early stage, Tizen started supporting 64-bit on the ARM Juno board. It's a rather big board, so it did not work well as a mobile reference for Tizen. So we started mobile reference board development in September 2014, when the Exynos 5433 SoC was announced. Just after that, we started supporting everything for Tizen. Nowadays, on Tizen 3, the TM2 board supports almost all functionality except the modem. The modem has a lot of proprietary code, so it is not supported yet. And it's based on the Tizen Git repository.
But we are also working on the mainline tree, and the first patch was posted to mainline just after we started the development. The first patch was posted by Chanwoo. These days, the patch for the TM2 board itself has also been posted to mainline, but it is not yet merged.

OK, this is information about our Tizen Git repository. The kernel is based on version 4.1. On top of 4.1, more than 1,300 patches are applied for our Exynos 5433 SoC and the TM2 board. Actually, that number was captured one month ago, so it's almost 1,400 nowadays. Also, more than 250 patches have already been posted to mainline, and more than 100 patches have been merged.

OK, here is short information on our Exynos 5433 SoC. I think it is the first SoC from Samsung to support the ARMv8 architecture. It's octa-core with the big.LITTLE architecture. Its display supports up to WQHD and WQXGA, so we have two board types, the TM2 and the extended TM2, the TM2E. And its GPU is a Mali Midgard.

OK, then I need to introduce something about Tizen. I think you already know about the Tizen project. It resides in the Linux Foundation. Tizen is a project that provides a software platform for multiple device categories, including mobile, wearable, TV, and IVI. These days, we are also starting on the IoT area. Each device category supports its own range of APIs, based on a common API set. If you are more interested in Tizen, please check the Tizen web page.

OK, this is a screenshot from the TM2 with the Tizen mobile platform. One is a simple menu screen, and the other is a HERE Maps screen. Our 64-bit kernel shows better performance than a 32-bit kernel on the same architecture: more than 20% improvement on the CPU benchmark with the Dhrystone benchmark tool. That is almost the same as ARM's reference information.

And then what did we really do for the ARM 64-bit kernel? OK, first, here is some information about ARMv8.
It, of course, supports a 64-bit memory address space, so its memory map is 64-bit. And on ARMv8, for firmware control, it supports the PSCI interface. That is the difference.

OK, first, we support 64-bit memory address mapping. If you are using a recent controller, then you don't need to care about this. But a legacy one may just pass a 32-bit address map, so it has to be fixed to 64-bit. In our case, we needed to fix it.

OK, we also needed to fix drivers written for the earlier ARM architecture to support 64-bit. Many of them already support 64-bit, but some of them have issues. For example, on the Exynos side, the IOMMU, the SysMMU, originally supported only 32-bit, but version 5 of the hardware supports a wider address space. On the hardware side it is done simply, by shifting the old page entry values by four bits. So the driver needed to be updated to support version 5. That is already merged to mainline.

The other case is the S5P MFC driver, which is the video codec of the Exynos SoC. Maybe the original code was bad, but it had a fixed base offset for the MFC core, so we needed to fix it. The other thing is, I think, a usage mistake: it used fixed-size pointer casting with 32-bit, so we needed to fix it.

The next one is not really a kernel bug. In the early stage of ARM64, there was a GCC bug in right shift. In a for loop with a right rotation, the result was not what we expected, so we needed to fix it on the kernel side with some workaround code. But I checked again recently, and fortunately it is already fixed on the GCC side.

OK, and after development, I applied two runtime checkers, KASAN and UBSAN, the sanitizers. They originally started at Google for user space. On the kernel side, the former Samsung engineer Andrey Ryabinin applied them to the x86 architecture. And thanks to the Linaro team, they are applied to ARM64 also.
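
As a small illustration of the two driver fixes mentioned above (the fixed 32-bit pointer cast and the version 5 page entry shift), here is a minimal Python sketch. The four-bit shift comes from the talk; the flag layout, the helper names, and the example address are assumptions made only for illustration, not the actual SysMMU driver code.

```python
MASK32 = 0xFFFFFFFF
V5_ENTRY_SHIFT = 4   # from the talk: v5 shifts page entry values by four bits
FLAG_BITS = 0xFF     # assumption: low bits of an entry are reserved for flags
PAGE_SIZE = 4096


def truncate_to_u32(addr: int) -> int:
    """Model what a fixed 32-bit cast does to a wider address."""
    return addr & MASK32


def make_entry_v5(phys: int, flags: int) -> int:
    """Pack a page-aligned physical address into a 32-bit entry (hypothetical layout)."""
    if phys % PAGE_SIZE != 0:
        raise ValueError("physical address must be page aligned")
    entry = (phys >> V5_ENTRY_SHIFT) | (flags & FLAG_BITS)
    if entry > MASK32:
        raise ValueError("address does not fit even after shifting")
    return entry


def entry_to_phys_v5(entry: int) -> int:
    """Recover the physical address from a shifted entry."""
    return (entry & ~FLAG_BITS & MASK32) << V5_ENTRY_SHIFT


if __name__ == "__main__":
    phys = 0x8_8000_0000  # a 36-bit, page-aligned example address
    print(hex(truncate_to_u32(phys)))            # upper bits are lost
    entry = make_entry_v5(phys, flags=0x2)
    print(hex(entry), hex(entry_to_phys_v5(entry)))
```

The point of the sketch is only that a plain 32-bit cast silently drops the upper address bits, while shifting the entry lets a 32-bit page entry describe an address wider than 32 bits.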
KASAN, the Kernel Address Sanitizer, can detect out-of-bounds accesses, use-after-free, and other memory address related bugs. Here is one example we detected with KASAN. An array in some initialization code needs a terminating NULL element, but there was no NULL element, so the code accessed the out-of-bounds area, and KASAN detected it. Fortunately, the kernel already has a guard area after the array, so the original code did not really cause anything bad, but anyway, I fixed it. The other one is a use-after-free detection from KASAN. There was a typo in the original code's condition, so there was a bad memory access, and I needed to fix the condition.

OK, the other runtime checker is UBSAN, the Undefined Behavior Sanitizer. It detects undefined behaviors like out-of-bounds shifts or overflows. Here is an example from dw_mmc. There was a shift exponent that was too large for the 32-bit type, and I needed to fix it. That is already merged into mainline. The other case is also a shift exponent: the code had a wrong offset into an array, so the code was fixed. This one is fixed in our Tizen kernel. From now, Chanwoo will introduce more things.

So my name is Chanwoo Choi. I will continue the presentation with what we did for the ARM 64-bit CPU cores.

First, PSCI. ARM provides PSCI to support power management. PSCI provides a standard, common interface for power management. In a legacy device driver, we have to know the specific, non-standard code that depends on each vendor's SoC, and that makes it difficult to support power management. So if we use PSCI, supporting power management becomes easier than before with the legacy device driver. PSCI supports the following scenarios: suspend, idle, CPU hotplug, and system shutdown and reset. But PSCI does not support DVFS, such as CPU and GPU frequency scaling. Also, PSCI deals with the exception levels. Both the AArch32 and AArch64 architectures have exception levels.
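
To make the PSCI scenarios just listed a little more concrete, here is a minimal Python sketch of how PSCI function IDs are composed under the ARM SMC Calling Convention, as I understand it from the PSCI 0.2 and later specifications. The helper name `psci_fn` is made up for this illustration; in the kernel these values are predefined constants, not computed at run time.

```python
FAST_CALL = 1 << 31        # fast (atomic) call
SMC64 = 1 << 30            # 64-bit calling convention
OWNER_STANDARD = 4 << 24   # "standard secure service" owner, used by PSCI


def psci_fn(number: int, use_64bit: bool) -> int:
    """Compose a PSCI function ID (hypothetical helper for illustration)."""
    fid = FAST_CALL | OWNER_STANDARD | number
    if use_64bit:
        fid |= SMC64
    return fid


# One entry per scenario named in the talk.
PSCI_CPU_SUSPEND = psci_fn(1, use_64bit=True)    # suspend / idle
PSCI_CPU_OFF = psci_fn(2, use_64bit=False)       # CPU hotplug (off)
PSCI_CPU_ON = psci_fn(3, use_64bit=True)         # CPU hotplug (on)
PSCI_SYSTEM_OFF = psci_fn(8, use_64bit=False)    # shutdown
PSCI_SYSTEM_RESET = psci_fn(9, use_64bit=False)  # reset

if __name__ == "__main__":
    for name in ("PSCI_CPU_SUSPEND", "PSCI_CPU_OFF", "PSCI_CPU_ON",
                 "PSCI_SYSTEM_OFF", "PSCI_SYSTEM_RESET"):
        print(f"{name:18s} = {globals()[name]:#010x}")
```

Because every vendor's firmware answers the same function IDs, the kernel side needs no SoC-specific code for these operations. There is no ID for frequency scaling, which matches the remark that PSCI does not cover DVFS.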
Usually, the Linux kernel operates at exception level 1, EL1. So sometimes we need to send a command to another exception level, and for that we use the SMC or the hypervisor call. SMC is the Secure Monitor Call. For example, if we want to turn on the secondary CPU cores, we have to use the SMC. In a legacy device driver, we have to handle the SMC call directly. But if we are using PSCI, there is nothing specific to do. We just call the cpu_on function, and the CPU is turned on.

So I added an example of suspend with PSCI. In a legacy device driver, to support suspend, we have to add a specific function, for example the Exynos suspend function. This function includes different, non-standard code that depends on each vendor's SoC. But PSCI provides a standard function, and we just use this function without any specific code. It makes power management support easier than before.

Next, I added a CPU hotplug example with PSCI. In a legacy device driver, we have to add specific functions to the smp_operations structure, which means that each function includes SoC-specific code. That makes it difficult to support power management, and we need more time to support it. But PSCI provides only standard functions, and these functions do not include any specific code. For example, when I was developing the TM2 board with the Exynos 5433, I tried to turn on all eight cores on the latest kernel. I just added the device tree node according to the PSCI documentation. There was no specific code at all. Just after kernel boot, all eight cores were turned on successfully. It is very simple, and we can save a lot of time in supporting power management.

Next, how can we support the big.LITTLE cores? Recent flagship models have big.LITTLE cores to support high-performance scenarios, such as VR, AR, and gaming.
But the Linux kernel includes only IKS, the In-Kernel Switcher, to support big.LITTLE core scheduling, and the In-Kernel Switcher does not use all cores at the same time. So we should use HMP. HMP is provided by Linaro. If we use HMP scheduling, we can use all cores at the same time, which means we can support the high-performance scenarios on the hardware. But HMP is only supported on kernel version 3.10, and HMP was verified on 32-bit ARM. So in our project, I tried to support HMP on kernel version 4.1 and on ARM64.

So I added a simple test for HMP. This sysbench benchmark created six threads, and I used the ARM DS-5 profiling tool. In the first case, without HMP, the result showed that a performance regression happened due to CPU contention. CPU contention happens when there are not enough CPU resources. But in this case, the hardware has eight cores, and the benchmark creates only six threads. It means that the mainline kernel does not use the big.LITTLE cores effectively. Then I applied the HMP patches and tested again, and there was no CPU contention issue anymore. It means that we can use all cores at the same time to support specific scenarios such as, as I keep saying, gaming and VR.

Next, this chapter: what are the challenges on TM2? This chapter includes two subjects. The first one is 32-bit processes running on the 64-bit kernel. As we know, the 64-bit kernel supports all features for 32-bit, which means the kernel has backward compatibility. But sometimes issues happen when we use a 32-bit user process on the 64-bit kernel. So I will explain the issues from my project and how to fix them.

First, the 32-bit personality setting. As I said, the 64-bit kernel supports all features for 32-bit. But the 64-bit kernel provides different information to user space according to the type of the running process.
That is, according to whether it is a 32-bit or a 64-bit user process. Sometimes the user process needs this information provided by the kernel, such as which CPU features the kernel supports. So I added some examples on the next page.

On this page, the Unity engine on the Tizen platform checks whether Thumb mode is supported or not. I executed the cat /proc/cpuinfo command on the 64-bit kernel, and the result did not include the Thumb feature, even though the 64-bit kernel does support it. It means that to get the same information as on a 32-bit kernel, the 32-bit user process should use the PER_LINUX32 personality.

So I added some test results. In this case, I used the 64-bit kernel and a 32-bit user process, and I executed the uname command and cat /proc/cpuinfo. You can see the result here. Then, in the same setup, I tested again with the PER_LINUX32 personality, and we get a different result: when I execute cat /proc/cpuinfo, the result includes Thumb mode.

Frankly, I had planned to skip this page because it is more detailed knowledge, but I will explain it. This page shows how Thumb mode is supported on the 64-bit kernel. If the 32-bit user process uses the PER_LINUX32 personality, the kernel checks it. The COMPAT_PSR_T_BIT macro refers to the T bit of the AArch32 CPSR. When a 32-bit user process with the PER_LINUX32 personality runs Thumb code on the 64-bit kernel, the COMPAT_PSR_T_BIT is set, and if this bit is set, the kernel takes the Thumb instruction into account when it handles the process on the AArch64 side. That is how the Linux kernel handles Thumb mode.

Also, the 64-bit kernel has the compat configuration. When we enable this configuration, the 64-bit kernel supports almost all features for 32-bit user processes. But sometimes issues happened when we were developing the TM2 board.
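
The Thumb check described above can be sketched in a few lines of Python. This is only an illustration of the kind of check a user-space engine might do: the two sample texts are made up to resemble /proc/cpuinfo output with and without the 32-bit personality, not captured from the TM2, and the function names are hypothetical.

```python
def cpu_features(cpuinfo_text: str) -> set:
    """Collect the feature flags from every 'Features' line of cpuinfo-style text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            features.update(value.split())
    return features


def supports_thumb(cpuinfo_text: str) -> bool:
    """True if the reported feature list advertises Thumb mode."""
    return "thumb" in cpu_features(cpuinfo_text)


# Illustrative samples only (assumed, not real logs from the board).
SAMPLE_DEFAULT = "processor\t: 0\nFeatures\t: fp asimd evtstrm crc32\n"
SAMPLE_LINUX32 = "processor\t: 0\nFeatures\t: half thumb fastmult vfp edsp neon\n"

if __name__ == "__main__":
    print("default personality:", supports_thumb(SAMPLE_DEFAULT))
    print("PER_LINUX32        :", supports_thumb(SAMPLE_LINUX32))
```

On a real system the text would come from reading /proc/cpuinfo, and the personality would be switched before starting the process, for example with the personality() system call and PER_LINUX32. A check like this fails on the 64-bit kernel unless that personality is set, which is the problem the Unity engine ran into.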
First, if we want to handle a 32-bit ioctl on the 64-bit kernel, we have to add the compat_ioctl function. The 64-bit ioctl is handled by unlocked_ioctl. The next issue: sometimes, even if we use the same ioctl from a 32-bit and a 64-bit user process, the ioctl has a different size according to the type of user process. So I added some examples on the next page. This page just shows the patch information on how to fix it, and on the next page I added some code.

In the first example, I added the compat_ioctl function pointer to support the 32-bit user process. In the next example, even though we use the same ioctl on both 32-bit and 64-bit, the size is different, so we have to handle that. We added a new ioctl, but it does the same operation. Next, sometimes an alignment issue happens, so we have to fix it with alignment and padding.

Next, sometimes mmap issues happen when we are using a 32-bit user process. To fix this, we should use the _FILE_OFFSET_BITS=64 C flag. If we add this C flag, mmap in the user process uses the mmap64 function. For example, Mesa and Wayland always use this C flag to avoid the mmap issues.

In this chapter, I explain the issues related to the mainline kernel. Some issues depend only on the 64-bit kernel. Other issues depend on both the 32-bit and the 64-bit kernel.

The first issue is the FPSIMD segmentation fault. This problem still remains in mainline. When I was developing the TM2, a segmentation fault happened after the CPU woke up from the suspend state. But this fault only happened with the 64-bit kernel plus 32-bit user processes, such as [unclear] user processes. So I tried to find the cause. When we reviewed the -O2 option, there was no issue there. With the -O2 option, the compiler uses the FMOV assembly instruction instead of the MOV instruction. FMOV is a floating point instruction.
FMOV is used for the FPSIMD features on ARM 64-bit. So FMOV is the floating point assembly instruction, and FMOV uses specific registers, the D registers, which are only used on 64-bit. So I dumped the D registers to find the cause of this segmentation fault. In the failed case, I got this result: the D registers contain garbage data.

So I looked at the code that handles the FPSIMD registers in mainline. The fpsimd_thread_switch function handles the FPSIMD registers. This function covers two cases. In the first one, the red line, when a context switch happens but there is no change to the current thread pointer, the function just keeps the data in the D registers. In the other case, the function always loads the data for the D registers from memory. It means that the FPSIMD segmentation fault happened in the first case. But this issue does not happen every time, only sometimes. It means that this function does not consider the suspend-to-RAM situation when we use a 32-bit user process.

So in our project, as a temporary solution, we always load the data for the D registers from memory, and that fixed the issue. But for a fundamental solution, fpsimd_thread_switch should consider the suspend-to-RAM situation for 32-bit user processes. This issue does not happen with 64-bit user processes. It depends only on 32-bit user processes.

The next issue is the ARM generic timer issue that happened on 64-bit. The Linux kernel needs a clock source for timekeeping, so the Linux kernel uses the clocksource and the sched_clock. The clocksource provides a timeline for the system, and sched_clock is used for scheduling and for the timestamps of printk. ARM provides a timer IP for this. The IP name is the ARM generic timer, or the ARM architected timer. This timer has two counters, the physical counter and the virtual counter. You can use both counters.
But the ARM 64-bit kernel recommends using only the virtual counter, because only the virtual counter is exposed to user space through the vDSO. When I tried to use the virtual counter, I got some issues: the timestamps were not synchronized between the CPUs when printing the log. You can see the issue here. So I tried to find the cause. If we want to use the virtual counter on the 64-bit kernel, the counter offset register has to be initialized in the bootloader. If this register is not initialized, each core gets a different, random offset, so we get timing issues. In our project, we just do not use the vDSO, so we use the physical counter. But in the mainline kernel, we cannot use the physical counter, because some patches removed the physical counter usage, so we need to add some code. Maybe this is a different case on other hardware.

Next, the mainline kernel has a deadlock issue between regmap and the common clock framework. When we use the regmap subsystem together with the clk_prepare functions, the deadlock happens. This issue still remains in mainline. So I added an example: the Samsung S2MPS MFD device. The MFD device includes the PMIC device and the clock device. The PMIC and the clock device share one I2C line, but the PMIC and the clock device use different locks.

First, on CPU 0, the PMIC device locks the regmap mutex, mutex A, to write to the I2C bus. Then a context switch happens. On CPU 1, the clock device calls clk_prepare, and clk_prepare uses the single global lock of the common clock framework, mutex B. Then the clock device tries to lock mutex A to write to the I2C bus, but this mutex is already locked, because the two devices share one I2C line. Then, after a context switch on CPU 0, that side tries to lock mutex B of clk_prepare, because clk_prepare uses only one global lock. So the deadlock happens.

So, to fix the deadlock issue, we call clk_prepare only at probe time.
And at runtime, we use the clk_enable function to handle the clocks, because the clk_enable function uses a spinlock instead of the mutex. The following two patches were posted to fix this issue in mainline. But it is not a fundamental solution, just a temporary one.

I also checked the regmap side. The regmap subsystem has three bus types: MMIO, I2C, and SPI. The regmap MMIO driver uses only a spinlock, for speed. The regmap I2C and SPI drivers use a mutex for locking. If we use just regmap MMIO, there is no issue. So if we change the locking mechanism to a spinlock, there is no issue. But that is also not a fundamental solution. As I said, just using a spinlock does not fundamentally fix this deadlock, because there are many similar cases in the mainline kernel.

So we need a new locking idea. These patches were posted to the mailing list some weeks ago, and they have a new locking idea. As I keep saying, the common clock prepare function has only one global lock. With this patch set, each clock controller uses its own mutex lock instead of a global lock. So after applying these patches, maybe this issue will be fixed.

Next, there is a dependency between the power domain and the common clock framework. When I was developing the TM2 board, the graphics hardware module did not work after the CPU woke up from suspend-to-RAM. The reason is that the graphics-related clocks are reset to their initial values after wake-up from the suspend state. The common clock framework already supports handling for suspend-to-RAM, to save and restore the clock registers. But the common clock framework does not prevent this issue, because sometimes the hardware's clock domains belong to a power domain. For an ordinary device, we can already register the device to the power domain as a child device. But in mainline, the common clock framework does not support runtime PM.
So it means that the clock controller is not able to express a dependency on the power domain. This diagram shows the clock hierarchy in the Exynos 5433. As I said, if a child clock controller belongs to a power domain, the clock controller has to handle its clocks when the power domain state changes. In other words, when the power domain state changes, the child clock controller has to know about that state change. In our case, as a temporary solution, we use a notifier chain. But it is just a temporary solution. To really fix the issue, the common clock framework should support the runtime PM interface for the power domain. Marek is one of our team members developing the TM2, and he posted the following patches to fix these issues. After applying these patches, maybe this issue will be fixed.

Also, the ARM 64-bit kernel does not support the self-decompression feature, so we cannot use a zImage. But if we want to use a compressed image, we can use the decompression feature of the U-Boot bootloader. So in my project, we use the decompression feature of U-Boot.

So I have explained some of the issues. This page lists the remaining issues for supporting the TM2 in mainline. I am posting the patches for the Exynos 5433 and the TM2. Also, you can check the whole history and the patches for the TM2 in the Tizen Git repository.

I prepared a short demonstration of the TM2 with Tizen. This is just handling the TM2 board. It shows the schedule page, and this is our presentation. And then these are the WebGL samples with the Aquarium: with 40 fish it runs at between 40 and 50 FPS, and with 4,000 fish, it's not showing well, but it's seven or eight frames. And then this is last year's ELC YouTube video from our team, played with the hardware video codec. And the last item shows the path from Tegel Airport to the Maritim Hotel in Google Maps. OK, let's finish the demo. The Exynos 5433 has a Mali Midgard GPU.
And we use the Mali user-space binaries.

OK, thank you for listening to our presentation. If you have any questions, please let me know. Any questions or any discussion points?

Actually, TM2 boards are being sent to Korean universities for research support, but there is no plan for sending boards to others. Maybe you can contact our Open Source Group about getting devices. Any other questions? OK, it seems there are no questions. And, OK.

You mean the best way to use PSCI for hacking? So yeah, PSCI, as I already said, provides the standard interface. The latest kernel supports three PSCI versions: 0.1, 0.2, and 1.0. If an SoC fully supports 1.0, it is simple to support power management. But frankly, the Exynos 5433 does not fully support PSCI. For example, as I already said, I tried to turn on all eight cores, and this SoC fully supports PSCI for CPU hotplug. But when I tried to support suspend-to-RAM on the Exynos 5433, even though the SoC supports PSCI, some specific code was still needed to support the power management. So I think that if the hardware architecture fully supports PSCI, we can save a lot of time on power management.

OK, our time is already finished. We are still here, so if you have any other discussion points, please let us know during the conference. Thank you for listening to our presentation.