Hello, I'm P3, responsible for analyzing and improving the performance of the in-vehicle infotainment platform at Hyundai Motor Company. Today, I would like to introduce various platform analysis features and automatic performance monitoring capabilities using an open source program called Guider. First, let's briefly discuss performance issues, and then I'll introduce performance analysis tools, including Guider. Next, I'll showcase features that enable automatic performance monitoring and generation of analysis reports when issues arise. Finally, I'll share how performance data generated in vehicles can be collected on a server and put to good use. As we develop our product, we encounter various performance issues. The perceptible ones are things like slowness, instability, and stuttering. Analyzing performance issues at the level of a single app is not very challenging, since most platforms provide good debugging and analysis features for apps. However, when the scope of the problem extends to the platform level, such as the kernel, analyzing and improving it becomes much more difficult. The reason is that the areas you must keep in mind broaden, the amount of study required increases, and the number of analysis programs to be used multiplies. In the case of the infotainment system developed by our team, various apps and services within the system are interconnected to implement complex and sophisticated specs, and considering features that interact with external systems like the head-up display, the cluster, ADAS, and others, these systems become extremely complex. Performance issues arising from such complex systems are truly diverse and often challenging to analyze. To quickly analyze issues and pinpoint their causes in such complex systems, the best approach is to swiftly select and adaptively use the optimal tools. So which performance analysis tools should be used? Analysis tools can be broadly categorized into three types.
Monitoring programs show the current state in real time. Profiling programs collect and summarize information over a certain period. And tracing programs display all detailed operations. As performance analysis tools are diverse for each system area, users need to choose them wisely according to their priorities and experience. However, when actual performance issues arise, knowing where to start and which tools to use first can be quite challenging without prior experience. Having worked in performance analysis for several years, I have come to realize that many of these tasks are inconvenient, inefficient, or even impossible with the tools available. Over time, I found myself in situations where I had to analyze systems without access to source code or a compiler, relying only on a root shell. In light of these challenges, I decided it would be more practical to create my own tool, leading to the development of an open source program called Guider. Guider, as mentioned earlier, provides visualization features along with monitoring, profiling, and tracing features. All these features operate immediately on the target without the need for source code or a compiler. Developed for comprehensive performance analysis, Guider currently comprises around 150 commands and various options, the details of which will be explained later. It is licensed under GPLv2, with the official repository hosted on GitHub, and it also supports installation via pip and OpenEmbedded. Guider can be executed with basic Python without a separate build, installation, or environment setup. It supports popular CPU architectures and requires about two megabytes of storage. During system monitoring, it consumes about two or three percent of one CPU core, and RAM usage is around 17 megabytes. Guider operates standalone on all Linux-based platforms including Android, and partially on macOS and Windows as well. There are two ways to install Guider.
One is installing it using pip, and the other is downloading it from GitHub. I'm sorry about the color and size of the font; it's a little difficult to see. First, let's try installing it using pip. It's done. Next is the method of downloading from GitHub. It's done, and Guider is ready to use. Guider currently incorporates approximately 150 built-in commands. They are implemented directly, without relying on external execution binaries, ensuring freedom from dependencies. In addition to the monitoring, profiling, tracing, and visualization features you see in the diagram, Guider also supports more functions: experimental analysis features, capabilities to control systems and tasks, communication features in a server-client structure for remote control, system load testing, and various utility features. Due to time constraints, I will provide a brief overview of the most commonly used ones today. Let me guide you through the frequently used features in Guider with a video demo. I hope this helps you understand Guider. First, let's explore the help and examples. Guider has about 150 commands supporting various operating systems. You can check the help for each command by running it with the -H option. At the top of the command help, you will find descriptions, options, and examples for the respective command. Before analysis, let's create a workload. Guider supports the creation of a CPU workload with the cputest command. This command creates a task called GuiderWorker that uses 100% of one CPU core. Similarly, you can allocate and monitor system memory every second with the memtest command. There is also the iotest command for repeatedly reading from or writing to a file and measuring the elapsed time. Now let's check how much CPU the workload task is using with the top command. This shows that the GuiderWorker task is using around 100% of one core. If you want to focus on a specific task, you can use a filter with the -g option, like this.
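To make the installation and workload steps above concrete, here is a hedged sketch (the pip package name and repository URL are the publicly known ones for Guider; the script path and exact option spellings may vary by version):

```shell
# Route 1: install from PyPI; route 2: clone from GitHub
pip install guider
git clone https://github.com/iipeace/guider

# Create a CPU workload that burns 100% of one core, then watch it.
# -g filters tasks by name or ID (regular expressions are accepted).
python3 guider/guider.py cputest &
python3 guider/guider.py top -g Guider
```

Either route gives you the same single-file Python entry point, so no separate build or environment setup is needed.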
It also accepts regular expressions. You can save this information to a file with the -o option and visualize the profiled result with the draw command. Then an SVG file is created; let's open it with the Chrome browser. This way you can create a resource usage graph with detailed information about processes using a lot of memory. Next, let's check information about files or sockets opened by tasks with the ftop command. Without a filter option, it shows the number and types of file descriptors and sockets for all tasks. To apply a filter, we can use the -g option with a target task name or ID. It then shows details about the file descriptors and sockets, including position, open options, protocols, and more. Now let's move on to function monitoring commands. To monitor in real time which functions of a particular task are using CPU or taking a long time to complete, we can use the utop command. Let's create a workload in the background like this and start monitoring the functions of the task dynamically with the utop command. This shows, in real time, functions that are waiting or taking a long time to execute, and you can save this information to a file with the -o option. After that, I converted the monitoring result into a flame graph file with the draw command. The name of the file is a little long. Open the generated SVG file with the Chrome browser, and you can visualize the profiling result as a flame graph. Clicking on a specific function provides an expanded view of the related call stack, and you can search for specific functions using Ctrl+F with regular expressions; matches are then highlighted in red. Next, let's monitor system calls for specific tasks in real time. First, create a workload in the background again and start monitoring the system calls of this task with the systop command. This shows the system calls being invoked in real time. Saving the result and converting it into a flame graph file is also available, as before. And similarly, let's monitor Python functions.
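The function-monitoring flow just described can be sketched as follows, a hedged outline using the command and option names from the talk (the saved file name and the script path are assumptions; exact spellings may differ by Guider version):

```shell
# Monitor user-level functions of a target task in real time,
# saving the samples to the current directory with -o.
python3 guider/guider.py utop -g GuiderWorker -o .

# Convert the saved monitoring result into a flame-graph SVG,
# then open it in a browser to explore call stacks interactively.
python3 guider/guider.py draw guider.out
```

The same save-then-draw pattern applies to the systop output as well.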
First, run the iotop program, which is written in Python and is very famous for monitoring system I/O. Next, start monitoring the Python functions of this task using the pytop command. Saving and converting the result into a flame graph file is also available, as before. Now let's explore profiling commands at the system level. First, profile the activity of threads with the rec command in the background, and create a workload for a short time. Then stop profiling and convert the profiling data into a report file. Yes, it's done. Let's open the output file. This report file summarizes resource usage for the system and tasks. It also shows scheduling latency, scheduler-level stats, block operation size and count, and more. There are many stats. Below, you will find histograms for these stats and detailed information on block operation patterns, such as sequential or random access, and file operations with file paths and access sizes. Now let's visualize the scheduling information of tasks based on the profiling data on a timeline. We can use the draw command to create an SVG file. After drawing, open it in the Chrome browser. Then we can view task scheduling on each core. Mouse over each bar to see task names and scheduling-related information. The black color at the top of each bar indicates the blocking state, and red indicates the preemption state in the scheduler. Next, let's profile all system calls invoked in the system. Start profiling with the sysrec command in the background and generate a workload for a short time. Then stop profiling and create a report file with the report command again. Open the generated report file to see stats on all invoked system calls: type, duration, call count, error count, and time statistics at the system level. You can also check similar stats for each task. Now let's move on to tracing commands. The first one is system call tracing with the strace command.
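The record-report-draw sequence above can be sketched like this (a hedged outline using the command names from the talk; the recorded file name and the way the background recorder is stopped are assumptions):

```shell
# 1. Start recording scheduling activity in the background.
python3 guider/guider.py rec -o . &

# 2. Generate a short workload while recording.
python3 guider/guider.py cputest &
sleep 5
# (stop the recorder here; the exact stop mechanism is not shown in the talk)

# 3. Summarize the recorded data into a text report,
#    then convert it into a per-core timeline SVG.
python3 guider/guider.py report guider.dat
python3 guider/guider.py draw guider.dat
```

The same workflow applies to system-call profiling, with the sysrec command replacing rec in step 1.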
It's very similar to Linux strace. You can trace all system calls for specific tasks, with backtraces, in real time. I think this video's frame rate is a little bad. Next, let's trace user-level native functions with the btrace command, like this. If there are too many function calls, you can apply a function filter with the -c option. The btrace command supports additional sub-commands for handling events when a specific function is called. This is the list of sub-commands for function call handling: when the target function is called, you can read or manipulate the values of specific registers or memory, introduce delays, call the function again repeatedly, or redirect to other functions. For example, to read the data to be written during a write operation, you can use this command. You can read the buffer data from the target task's memory like this. Similarly, let's trace the Python functions of the iotop process executed before, with the pytrace command, like this. Now let's move on to the signal tracing command. First, create a test program source code that uses an out-of-bounds array index. If you run the following trace command, sigtrace, the process receives SIGABRT due to stack smashing detection, as it uses an out-of-bounds array index. And finally, let's explore the memory tracing command. First, create an example program that allocates about 100 megabytes of memory and keeps it. Run the leaktrace command. The output file contains a table at the bottom displaying the call stack, allocation size, and count for functions with unreleased allocations. Additionally, it shows file-level information. Open the generated SVG file, and you can view a flame graph of these functions. Unlike servers, embedded systems with significant resource constraints always need to achieve maximum performance with minimal resources. To achieve this, strict validation, analysis, and optimization of performance must be conducted over a considerable period.
However, the modern SDV development process requires greater flexibility and the development of more features within shorter cycles. As new features with complex interdependencies are continuously added, fixed test cases alone cannot cover all user scenarios. Furthermore, the previous Guider commands could only be used in person, when the problem state persisted or was reproducible after an issue occurred. Therefore, there is a need for the capability to monitor vehicles from the development stage onward and to automatically analyze and report performance issues when they occur. So I updated Guider to serve as an automated performance monitoring daemon. The Guider daemon automatically monitors the performance of the system and generates a performance report based on the collected data in case of any issue. All of these operations are performed automatically, based on various threshold values defined in a file beforehand. The objects that can be monitored include a wide range of items, from the physical devices of the system to logical resources, various logs, diverse types of functions, and even IPC. Everything can be monitored automatically. Not limited to the system level, monitoring at the task level and function level is also possible. To effectively utilize this capability, careful preparation of monitoring settings, conditions, and strategies is crucial. Each monitoring item is detailed with these monitoring values. The automatic performance monitoring function is executed by Guider as follows. Initially, Guider reads the configuration file to understand the monitoring targets and conditions, such as thresholds, and then it begins monitoring. When an issue occurs, Guider automatically executes the corresponding command list sequentially. The command list may involve directly handling the problem, collecting data for performance analysis, or sending notifications to external systems.
If there is a command in the list that generates a performance report, Guider creates the performance report as a file in local storage. This continuous performance monitoring function must require minimal system resources. Next is the most crucial part of the setup. During initialization, Guider reads the configuration file to understand the monitoring targets, scope, and conditions. Then, when the conditions are met during monitoring, the registered commands are automatically executed sequentially. The left side of the screen represents the file format defining these events, and the right side shows corresponding example values. At the top level, you define the monitoring targets shown earlier, such as CPU. At the next level, you specify the scope: if it's for the entire system, it should be entered as "system"; if it's for a specific task, the task name or ID can be entered. At the subsequent level, you define the event conditions and processing method. This includes the applicability of the condition, threshold values, continuity conditions, and a list of commands to be automatically executed when an event occurs. The most critical aspects here are the threshold and the list of commands. When the conditions meet the threshold, the commands from the list are automatically executed sequentially. Summarizing the example on the right side: it monitors CPU usage at the system level, and if the total CPU usage remains above 95% for five seconds, it automatically executes the following two commands. The commands may be built-in Guider commands starting with "save", which automatically generate a performance report based on the collected data, or external shell commands and executables. I'll explain more about the content of the performance report later. Guider operates as a monitoring daemon in the background, so there is a need to control it externally for various purposes. To facilitate this, the following control commands are provided for use in the shell.
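As an illustration of the target → scope → condition structure just described, a threshold file might be sketched like this (the attribute names, JSON layout, and the external script path are illustrative assumptions, not the exact Guider schema; only the nesting follows the talk):

```shell
# Illustrative threshold configuration: monitor system-wide CPU usage
# and, when it stays above 95% for 5 intervals, run the registered
# commands: a built-in "save" report command plus a hypothetical
# external notification script.
cat > threshold.conf <<'EOF'
{
    "cpu": {
        "system": {
            "apply": "true",
            "total": 95,
            "interval": 5,
            "command": ["save_report", "/usr/bin/notify_external.sh"]
        }
    }
}
EOF
```

Adding another key next to "system" with a task name or ID would scope the same kind of condition to a single task instead of the whole system.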
These commands enable you to change settings or control the operation of Guider at runtime, without restarting. Let me explain the report files generated automatically at the time a problem occurs. Guider continuously collects system performance information from the start of monitoring and stores it in a fixed-size internal ring buffer. When the save command is executed upon encountering an issue, Guider generates a performance report file that summarizes the system state for a specific duration around the time of the issue. It includes various details, but I'll specifically mention the most crucial summary information and snapshot details. Firstly, the top command info table provides a line-by-line summary of the collected system information. It includes CPU, memory, I/O, network, and system event information gathered at regular intervals. Below that, there are tables summarizing the resource usage changes for tasks in each resource unit. The resource usage of tasks is displayed over time on the right side. In addition to this CPU usage summary table, named top-CPU info, there are also summary tables for delay, scheduling priority, GPU, VSS, RSS, block, storage, network, and cgroup. This summary information is extremely useful for analyzing changes in resource usage, for either the system or tasks, at a glance. However, since it is summarized information, it can be challenging to examine detailed information at a specific point in time. In such cases, by reviewing the snapshot information included in the report file, you can access more specific system information. The system status is displayed in the format of the top command output. At the top, the total amount of system events and physical resources is shown. Below that, information such as system latency, resource usage, and the number of events is displayed. Of course, usage information for supported GPUs is available.
In the last section, resource usage and event information for each task are displayed. At the very bottom, special task information, including new, terminated, and abnormal tasks, is presented. Based on experience, setting Guider's monitoring buffer size to 3MB results in a report file capturing approximately 15 minutes of system information. In reality, analyzing such a large amount of text manually is almost impossible, due to the extensive numbers that cannot easily be accommodated on a single screen. In such cases, the report file needs to be visualized as shown on the screen. Guider allows you to convert the text-based report files to SVG files. When opened in a browser, this produces an image that provides a comprehensive view of the entire period. At the top, you can see system CPU usage, GPU usage, and task-specific CPU usage. Below that, I/O and memory-reclaim-related statistics for storage and network are displayed. At the very bottom, the system's memory usage is shown. You can also display changes in task memory usage, I/O usage, and more. Additionally, it's possible to visualize system logs together with the report file. When using the conversion command, you can input the log file path and define log information as events with additional options. Then the output file will display the occurrence time and context of system logs at the top of the screen. Since the platform context and system context are displayed together, the analysis becomes much more manageable. When there is a performance problem, a report file is created in a specific directory with a predefined naming format. This naming method avoids creating duplicate files and makes it simple to find reports based on the order they were created, not the system time. You can also set a limit on the directory size to automatically delete the oldest reports when it gets too full. This way, storage will not fill up endlessly, preventing possible problems in production.
Let's start the demo video. First, let's take a look at the configuration file. At the top, you will find the categorized monitoring targets, followed by descriptions of each event scope. The font is very small. Attributes describing the event conditions are listed like this. Then you will see variables related to file naming and, finally, the built-in commands, like this. Let's move on to the section describing performance events for CPU. The first-level "cpu" represents the monitoring target, followed by "system" as the monitoring scope, and then attributes for each event. This condition means that if the overall CPU usage remains above 95% for more than five ticks, it will automatically execute the save command and create a report file. Now let's activate events for testing the conditions. Here, if a task whose name starts with "yes" maintains CPU usage above 98% for more than three seconds, the utop command registered for the yes event handles function profiling for the two threads showing the highest CPU usage in the process. Save it. Let's start Guider for automatic performance monitoring with this command. It has started, with logs listing the enabled events. Now let's create a workload in another shell. The first workload runs the yes program, making the yes task utilize 100% of one CPU core. After three seconds, automatic function profiling for the yes thread begins, and the results are stored in a file. Yes, it's finished. If you open this file, you will see the functions that consume a significant amount of CPU in the yes thread. The performance information, including the performance reports collected through the previous automatic monitoring, is specific to a single vehicle and exists only within that vehicle. To effectively analyze and utilize performance data for a large number of vehicles, it needs to be collected on a server. Now let me talk about this in more detail.
The real-time performance data generated by Guider can be continuously updated in a separate JSON file. Here is an example of some of the information in this format; details at the system and task level for each resource unit are represented. At the top left, you will find threshold events, that is, a list of the performance issues, called events, that have occurred since booting, along with their counts. At the top right, you will find peak figures showing the maximum usage of each resource. We can transport this data to our analysis servers to analyze the performance of driving cars. While it would be impressive to collect, process, and display real-time information for all devices on the server, I think real-time collection and processing may not be effective considering cost, stability, and availability. For your information, Hyundai Motor Company currently operates approximately 10 million connected cars. So the scenario I have chosen involves aggregating performance data not in real time, but at the moment when a vehicle stops driving and turns off, and analyzing it on a daily basis. This processed data is then used to provide a dashboard overview of the general performance of a large number of vehicles. In situations where issues arise, detailed analysis is also conducted using the performance reports. As potential data candidates to collect on the server, we have the list of performance issues, peak resource usage, and performance reports. As demonstrated earlier, with the performance report data alone, not only can we obtain specific snapshots of the moment when an issue occurs, but we can also visualize extended periods before and after the problem arises. Additionally, you can use violin graphs to visualize peak resource usage data collected in large quantities. This enables us to categorize and observe trends in resource usage among vehicles based on factors like platform, time, and version.
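To picture the JSON status file described above, an illustrative shape might be the following (all field names and values here are invented for illustration; the real Guider output differs):

```shell
# Illustrative shape only: events-since-boot with counts (top left of the
# slide) and per-resource peak figures (top right), plus task-level detail.
cat <<'EOF'
{
    "event": { "CPU_HIGH": 3, "MEM_LOW": 1 },
    "peak":  { "cpu": 97.5, "memAvailMin": 412, "ioWait": 12.3 },
    "task":  { "yes(1234)": { "cpuPeak": 99.0 } }
}
EOF
```

A compact file like this is cheap to upload at ignition-off and easy to aggregate server-side across a fleet.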
For example, we can monitor resource changes after deploying new versions, track corner cases in the operation of complex specs, and identify abnormal system behavior. Detailed analysis of performance reports also becomes possible by linking the detected unusual data with the performance reports using timestamp, vehicle, and platform information. Beyond peak data, utilizing load average data enables us to analyze user overhead in our system. So far, I have explained some of the useful Guider features. There are more features besides the ones I described, but I couldn't explain them all because of the time limitation. As for the performance monitoring daemon features I introduced, they will be expanded further so that the system can manage and analyze performance issues itself. For specific details, please refer to the README file on GitHub. And if you have any questions, please contact me by email or on GitHub. Thank you for listening.