All right, hello everyone. Thanks for joining the Automotive Linux Summit Europe. Our next talk will be by Ishii-san from Panasonic Automotive Systems. Welcome, Ishii-san, and enjoy the talk: debugging tools and techniques for virtualized automotive systems. Take it away.

Thank you. Hello, everyone. Thank you for joining this session. My name is Hiroyuki Ishii. I'm working at Panasonic and I'm also a member of the AGL project. Today I will talk about debugging tools and techniques for virtualized automotive systems. First, I'd like to apologize for a minor change in my talk's title: I have removed the containerization part from today's topic to make the talk fit into the time frame. I have already uploaded the slides to Sched, so you can download them anytime if you are interested.

Here's the agenda for today. First, I will provide a brief introduction. Next, we'll quickly explore various debugging tools useful for virtualized systems. Then we'll take a deep dive into a practical example where we'll analyze vhost-net, which is a typical component in virtualized environments. Finally, I'll wrap up the talk.

Now let's get started with the introduction. Let me introduce myself again. My name is Hiroyuki Ishii. I have been working at Panasonic Automotive Systems for more than 15 years and have experience in Linux-based software development for in-vehicle infotainment products and cockpit domain controller products. My primary areas of expertise are the Linux kernel, performance engineering, virtualization, and cloud-native technologies. In 2021 I joined the AGL project, where I serve on the steering committee and as a system architect, and I also contribute in various roles, such as virtualization expert. So let's take a look at some background trends in automotive software.
In recent years, various requirements have arisen around software-defined vehicles, such as high-performance computing, hardware scalability, workload distribution, mixed criticality, and many more. In response to these needs, there has been growing demand for, and initiatives around, virtualization as a key technology for each of these requirements. For example, the AGL project has been actively working on virtual-machine-based demo integration using KVM to address these needs, and other projects such as SOAFEE, Eclipse SDV, and Android Automotive are also actively contributing virtualization solutions for automotive use cases.

However, there are big challenges associated with the complexity of virtualized systems. These include complex interactions between host and guest components, increased system footprint, complicated integration, and limited capabilities and visibility within guest environments. As a result, debugging and performance engineering become increasingly difficult, and it becomes harder to understand the overall system behavior. To address these challenges, there is a growing need for specialized tools and techniques for virtualized systems.

All right, so let's proceed to the next section. In this section, we'll review various tools available for virtualized systems, including perf, trace-cmd, BCC, FlameGraph, and debuginfod. These tools can help you analyze, debug, and optimize your virtualized systems.

The first tool is perf, which is the profiling and tracing tool for Linux systems, as many of you may already know. perf acts as a front end for various tracing technologies in Linux, such as perf events, ftrace, and kprobes. It is developed within the Linux kernel source tree, which ensures it is stable and reliable at a certain level. However, this also means that perf has a strong dependency on the kernel version, which may impact its compatibility across systems.
Additionally, some hardware events supported by perf are platform-dependent, which may limit its usability in certain cases, for example on the Arm platform.

The next tool is trace-cmd, which is the front-end tool for ftrace, developed by Steven Rostedt. trace-cmd provides features quite similar to perf, but with a simpler and more user-friendly interface, in my opinion. One notable thing is that some recent additions to trace-cmd are designed for embedded and virtualized use cases, including the trace-cmd agent and related features. Unfortunately, I haven't been able to test these features this time, but I plan to try them out soon.

Next is BCC, which is a BPF-based tracing toolkit designed for performance analysis and debugging. BCC offers a rich set of features, including Python support and various practical example scripts, which really help simplify the creation of custom tools. However, there are also some drawbacks to using BCC, including its relatively large dependencies, for example on LLVM and Python, which may not be ideal for some embedded use cases. Additionally, BCC has limited support for the Arm platform; unfortunately, we were unable to use some good scripts in BCC, such as kvmexit and others, this time due to platform dependencies.

Next is FlameGraph, or flamegraph.pl, which is a visualization tool for performance data samples. This tool helps users understand the workload overview and identify bottlenecks within the system, as shown in this picture.

The last tool is debuginfod, which is a bit different from the others I mentioned. It's a server daemon for handling and managing debug info files. debuginfod is developed as part of the elfutils project, and it's compatible with numerous popular debugging tools, including GDB, perf, trace-cmd, BCC, and many others.
debuginfod is especially beneficial for embedded systems, I think, as it allows offloading debug info files from the target system, which solves storage limitations. Additionally, debuginfod resolves path conflicts of debug info files within virtualized systems. A big advantage of debuginfod is its seamless integration with the Yocto build system, which enables you to utilize it with a single command.

So let's dive into practical examples using these tools. The goal of this section is to gain knowledge of debugging tools, their usage, and techniques in virtualized environments through practical examples. As the example, we'll investigate the behavior of vhost-net, which is an acceleration of standard virtio-net, by comparing it with the standard one. Then we'll explore how and why it can improve performance, as well as any potential side effects that could exist.

Now, let's look at an overview of the difference between standard virtio-net and vhost-net. The key difference lies in the implementation of the virtio-net device emulation. When vhost-net is enabled, the emulation shifts from the QEMU user space to a dedicated vhost-net kernel thread. This change can result in improved performance and reduced overhead, but I will not go into too much detail here, as we'll analyze it with the tools later in this section.

OK, so before we dive into the analysis, let's quickly go over the preconditions. First, I'm using the AGL KVM demo platform, built from the AGL UCB master branch, as a base image, and I have installed additional tools into the base image, including vhost-net support, iperf3, SSH, and the debugging tools mentioned earlier. If you need more details, please refer to the appendix of the slides after downloading them. My test environment is running on the AGL reference hardware, which has a Renesas R-Car H3 SoC with eight physical CPU cores, and it hosts a single guest VM using QEMU with -smp 4, which means the guest VM has four virtual CPU cores.
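For reference, a guest like the one described might be launched with a QEMU command along these lines. This is only a sketch under assumptions: the machine options, image path, and network backend shown here are hypothetical placeholders rather than the exact AGL demo configuration, and the script only prints the command for review.

```shell
#!/bin/sh
# Sketch of a QEMU invocation matching the setup described above:
# 4 vCPUs (-smp 4) and a virtio-net device backed by a tap interface.
# Set vhost=on/off to switch between vhost-net and standard virtio-net.
# Image path and machine options are hypothetical placeholders; the
# command is printed only, so it can be reviewed before running.
IMAGE=/var/lib/vm/agl-guest.img   # hypothetical guest image path
CMD="qemu-system-aarch64 -machine virt -cpu host -enable-kvm \
-smp 4 -m 2048 \
-drive file=$IMAGE,if=virtio \
-netdev tap,id=net0,vhost=on \
-device virtio-net-device,netdev=net0"
echo "$CMD"
```

Toggling `vhost=on` to `vhost=off` on the `-netdev` line is what switches the device emulation between the vhost kernel thread and QEMU user space.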
During the analysis, I'm generating network traffic between the host environment and the guest environment with iperf3 to load the network virtualization stack, and to ensure accurate results, I halted all other workloads prior to the investigation. I will be comparing the metrics and traces between vhost-net and standard virtio-net, in other words with and without vhost-net support, by utilizing tools such as perf, trace-cmd, and BCC.

So let's begin with a simple bitrate measurement using iperf3. With standard virtio-net, the bitrate was 1.66 gigabits per second, while with vhost-net it reaches 3.26. This shows that vhost-net is about two times faster than standard virtio-net, which clearly highlights the benefit of using vhost-net.

Now let's start the deeper analysis with KVM event statistics using perf stat. Please note that we are limiting the amount of transfer here to 100 MB to provide the same load to the different environments. We can see various events recorded here, but the key observation is the significant reduction in WFx events, which is the CPU standby event on the Arm platform, and the vCPU wake-up event counts on the guest vCPUs show the same tendency. From this, we can guess that this may have some relation to the performance improvement in vhost-net.

Next, let's focus on the VM exit events using perf kvm stat. Interestingly, while we again observe the decrease in WFx events, the VM exit events themselves have actually increased with vhost-net, mainly due to data abort (DABT) events and IRQ events. The common belief is that a higher number of VM exit events leads to poorer performance in virtualized systems, but this may not always be true; here, we can guess that WFx exits could be more costly than other VM exits.

Now, let's change the approach and focus on monitoring I/O operations on virtio, using virtiostat from BCC.
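To make these measurement steps concrete, here is a hedged sketch. The iperf3 and perf invocations are only printed (they require a live host/guest pair), and since the exact options used in the talk are not stated, the flags here are assumptions about a typical setup; the speedup ratio is computed from the bitrates quoted above.

```shell
#!/bin/sh
# Sketch of the measurement described above. The traced commands are
# printed only; the ratio is computed from the figures quoted in the talk.
STEPS=$(cat <<'EOF'
# On the host:   iperf3 -s
# On the guest:  iperf3 -c <host-ip>
# KVM event statistics while the traffic runs (host side):
perf stat -e 'kvm:*' -a sleep 10
# Per-exit-reason statistics:
perf kvm stat record -a sleep 10
perf kvm stat report
EOF
)
echo "$STEPS"
# 3.26 Gbit/s (vhost-net) vs. 1.66 Gbit/s (standard virtio-net):
RATIO=$(awk 'BEGIN { printf "%.2f", 3.26 / 1.66 }')
echo "speedup: ${RATIO}x"
```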
Note that we have to run virtiostat from the guest side instead of the host side. The buffer lengths handled by virtio-net are shown here, here, and here: the left side is the in-buffer length and the right side is the out-buffer length. As a result, we can observe that the traffic handled by virtio-net is almost consistent between standard virtio-net and vhost-net, which implies the behavior on the guest side should not change, at least at the interface level.

Next, let's try to take performance samples across host and guest using perf kvm record. Before starting, it's necessary to mount the guest file system from the host using SSHFS, like this. Afterward, we can pass the mount point to the perf command, like this.

This shows the result of perf kvm record in the standard virtio-net environment. By using the --showcpuutilization option, you can visualize whether the workload belongs to the host or the guest: these rows are the guest side and these rows are the host side. For example, the top line indicates that this guest kernel receive function consumes about 20% of the CPU in the guest context. Please note that a process name starting with a colon, like this, represents a child thread of another process; in this case, all of them are virtual CPU (vCPU) threads of the QEMU process. From this, we can observe that two out of four vCPUs are working in this use case. As shown here, you can analyze high-load functions in detail across both host and guest at once.

However, this function-level detail might be too much for this analysis, so let's summarize the previous result using the perf kvm report sort option. This shows the result summarized by process. In this view, you may gain a better understanding than before. However, it's important to note that the overhead metrics in this view include idle functions, so you need to identify the idle functions and manually exclude them from each process.
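The sampling and summarizing steps just described might look like the following. This is a sketch under assumptions: the guest hostname and mount point are hypothetical, and the script only prints the commands.

```shell
#!/bin/sh
# Sketch of cross host/guest sampling with perf kvm, as described above.
# Hostname (guest-vm) and mount point are hypothetical; printed only.
STEPS=$(cat <<'EOF'
# 1. Mount the guest filesystem so perf can resolve guest symbols:
mkdir -p /mnt/guest
sshfs root@guest-vm:/ /mnt/guest
# 2. Sample host and guest together for 10 seconds:
perf kvm --guest --guestmount=/mnt/guest record -a -- sleep 10
# 3. Report with the host/guest CPU utilization split:
perf kvm --guest --guestmount=/mnt/guest report --showcpuutilization
# 4. Or summarize per process instead of per function:
perf kvm --guest --guestmount=/mnt/guest report --sort comm
EOF
)
echo "$STEPS"
```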
The result after exclusion is shown at the bottom. As you can see, the overall CPU consumption has significantly decreased in the vhost-net environment for each process. This clearly indicates the efficiency of vhost-net.

OK, then let's try to look at the function-level workload again using a flame graph. The basic usage of FlameGraph is outlined here; if you're interested, please take a look later. Here, we present some tips and tricks for virtualized systems. First, you can record samples on both the host and guest systems at once using an SSH command like this. Second, you can merge the two outputs into a single SVG file like this, which makes it easier to compare and analyze the data. But please note that these procedures are designed for convenience and may not be strictly correct.

And these are the results. Although the font might be small, we'll zoom into each part later. A flame graph allows you to analyze system-wide stack traces, in other words the function call hierarchy. The horizontal axis represents the total duration of each function across all samples, and the stacks are displayed from bottom to top, with the bottom-most element indicating the process name. This visualization makes it easy to understand performance bottlenecks and the flow of execution at once.

Now, let's take a look at each part. First, the guest side has a nearly identical workload shape between standard virtio-net and vhost-net. Likewise, iperf3 on the host has quite a similar workload. Regarding the swapper on the host, the workload appears somewhat similar, but the idle time is slightly longer in the vhost-net environment. Finally, we can observe that the primary difference lies in QEMU and the vhost kernel thread on the host. We can also see that the same function load, tun_get_user, appears in different contexts, QEMU and the vhost kthread, which clearly visualizes the shift of the network device emulation from QEMU to the vhost kthread.
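The record-and-merge procedure mentioned above could be sketched roughly like this, assuming the FlameGraph scripts (stackcollapse-perf.pl and flamegraph.pl) are in PATH and the guest is reachable over SSH as guest-vm, a hypothetical hostname; the script only prints the commands.

```shell
#!/bin/sh
# Sketch of recording on host and guest at once and merging both profiles
# into a single flame graph SVG, as described above. Printed only.
STEPS=$(cat <<'EOF'
# Record host and guest in parallel (15 s each):
perf record -a -g -o host.data -- sleep 15 &
ssh root@guest-vm 'perf record -a -g -o /tmp/guest.data -- sleep 15'
wait
scp root@guest-vm:/tmp/guest.data guest.data
# Fold each profile, concatenate, and render one SVG:
perf script -i host.data  | stackcollapse-perf.pl > host.folded
perf script -i guest.data | stackcollapse-perf.pl > guest.folded
cat host.folded guest.folded | flamegraph.pl > merged.svg
EOF
)
echo "$STEPS"
```

As the talk notes, concatenating the two folded files is a convenience: the host and guest clocks and sample counts are not strictly aligned, so the merged graph is for visual comparison rather than exact accounting.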
Additionally, we can identify an important difference here, which is the usage of system calls. In the case of standard virtio-net, since QEMU is a user-space process, it must utilize system calls, such as writev, to invoke kernel functions like tun_get_user. In contrast, in the case of vhost-net, there is no need for system calls because vhost is a kernel thread, and this results in a difference in overhead. In the case of standard virtio-net, we can see that many intermediate functions are called before reaching tun_get_user. This leads to a loss of CPU time, as shown by the time difference between the bottom-most caller function and the actual load function. In conclusion, we can say that QEMU generates significant overhead during network packet transfer, primarily due to the use of system calls.

All right, so let's move on to the final exercise. We will try to trace the interactions between the host and the guest using trace-cmd record. To effectively analyze the context switches between the host and the guest, I recommend tracing both KVM and sched events at once, like this. Additionally, kprobes can be utilized to trace almost any kernel function; this is an example of creating and tracing a kprobe event for the tun_get_user function.

Here is the result of such tracing with standard virtio-net. I know that this may look complicated, so let me explain it step by step. First, note that two CPU cores are working here: one runs the guest vCPU, and one runs QEMU's main thread, also known as the I/O thread. Then let's examine the sequence. The starting point of the sequence is this MMIO event initiated by the guest vCPU. This prompts the host, in other words the emulator of the network device, to transfer the next packet; as a result, the QEMU I/O thread awakens. Afterward, the guest vCPU enters the sleep state using WFx and awaits the next event, which is the completion of the network transfer.
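The tracing setup just described, KVM and sched events together with a kprobe on tun_get_user, might be invoked as follows. The event names and tracefs path are assumptions about a typical setup, and the script only prints the commands.

```shell
#!/bin/sh
# Sketch of tracing KVM + scheduler events plus a kprobe on tun_get_user
# with trace-cmd, as described above. Commands are printed only.
STEPS=$(cat <<'EOF'
# Create a kprobe event (on older kernels use /sys/kernel/debug/tracing):
echo 'p:my_tun_get_user tun_get_user' > /sys/kernel/tracing/kprobe_events
# Record KVM, sched, and the kprobe events at once while traffic runs:
trace-cmd record -e kvm -e sched -e kprobes:my_tun_get_user sleep 10
# Inspect the interleaved host-side timeline:
trace-cmd report
EOF
)
echo "$STEPS"
```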
Then the QEMU I/O thread invokes the tun_get_user function. Next, QEMU raises an IRQ event to inform the guest of the completion of the network transfer, and the guest vCPU is woken up again from the sleep state. Finally, the next MMIO event is invoked, which marks the end of a single sequence.

So let's evaluate the durations here. The total duration of one sequence with standard virtio-net was 411 microseconds, and we can infer that the root cause of this duration is likely the vCPU's sleep and wake-up, as well as the use of system calls.

With vhost-net, the sequence becomes obviously shorter than with standard virtio-net. Let's take a look at this sequence as well. As you can see, it's essentially not so different from standard virtio-net, but there are two notable differences: the vhost kernel thread is running, and the vCPU doesn't invoke WFx. The total duration of one sequence in the vhost-net environment becomes 187 microseconds. So, in conclusion, we have observed more than two-fold time efficiency with vhost-net, as seen in the sequence trace, and these results are highly consistent with the bitrate measurement observed using iperf3. The essential reasons for this improvement would likely include the removal of QEMU user space, which utilizes system calls, from the critical path, and also the minimization of vCPU sleep and wake-up occurrences.

So let me wrap up this talk. In summary, during this presentation we have confirmed the value of vhost-net through behavior analysis using specialized tools for virtualized environments. vhost-net is twice as fast as standard virtio-net, with no significant side effects observed, especially in terms of CPU utilization. Additionally, various methodologies and tools were covered in this talk, and I believe they can be applied to a wide range of development scenarios for virtualized systems. These include analyzing performance improvements and regressions, as well as debugging or walking through virtualized systems.
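As a quick cross-check of the figures in this section, the per-sequence latency ratio and the iperf3 bitrate ratio both come out at roughly 2x:

```shell
# Cross-check the measurements quoted above: sequence latency
# (411 us vs. 187 us) and iperf3 bitrate (3.26 vs. 1.66 Gbit/s).
LAT=$(awk 'BEGIN { printf "%.2f", 411 / 187 }')
BW=$(awk 'BEGIN { printf "%.2f", 3.26 / 1.66 }')
echo "latency ratio: ${LAT}x, bitrate ratio: ${BW}x"
```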
As for next steps, I will contribute materials to AGL, which will likely include some debugging tools as well as vhost-net support. I will also explore some additional tools, such as porting BCC scripts to the AArch64 architecture, developing custom tools using BCC, and trying out the trace-cmd agent and related features. And I will also investigate containerized environments and cloud environments.

So that's all for today's talk. Thank you. Any questions? If you have questions, please wait for the microphone. So thank you again. All right, thank you.