Hi, all. As the title says, today we are going to talk about how to synchronize host and guest traces in a combined tracing session, and we will discuss a method to infer how well the algorithm performed. But let's start by introducing ourselves. My name is Stefano. I'm a student at the University of Turin, studying computer science, and I'm currently carrying out an internship during which we worked on what we are presenting today in the second part of the presentation.

Hello, my name is Tzvetomir Stoyanov, and I'm part of the VMware Open Source Technology Center, where my team and I work mainly on ftrace and its user-space ecosystem: tracing libraries, trace-cmd and KernelShark. Before joining VMware, I worked on the design and implementation of system and network software for various network devices, so I have a lot of experience in that area.

This talk is about host and guest tracing using ftrace, so I will say a few words about ftrace for those of you who have never heard of it. Ftrace is the official tracer of the Linux kernel. It has been part of the kernel since 2008, started by Steven Rostedt as part of the kernel's real-time patch. Ftrace is enabled by default in the kernels of most Linux distributions. It's very popular and widely used by kernel developers, and it allows you to look inside every corner of a live running kernel: almost any function can be traced, plus a few thousand predefined events in the kernel.

So, we have a Linux host running one or more Linux guest virtual machines. Combining traces from all running kernels can be a powerful approach for analyzing host-guest interactions, race conditions, or any hard-to-debug complex problem where host and guest kernels are involved. There is ftrace in each kernel; what more is needed? We need infrastructure for configuring ftrace on each kernel, starting the trace simultaneously, collecting the traces and saving them in files, and displaying them all in a way that makes it easy to view and correlate the events. Ftrace uses local clocks for timestamping the events, so in order to merge them, all events must be in the same time domain. That is, the events' timestamps must be synchronized, and the accuracy of this synchronization is essential: the timestamps have nanosecond precision.

There is already a user-space tool designed to configure ftrace, control the tracing, collect the data and save it in a file: trace-cmd. Trace-cmd is easy to extend to support host-guest tracing, but there are a few challenges that have to be solved: fast transfer of huge amounts of tracing data between guest and host, and synchronizing the events' timestamps. Depending on the trace configuration, the number of traced CPUs and the load of the machines, gigabytes of trace data can be generated each second. These data must be transferred to the host during the trace and saved in files. A few methods were tested and compared, each with its own pros and cons. The results in that table were measured on my laptop, an eight-core Intel i5 CPU with 16 GB of RAM. Using TCP/IP is the most generic and universal solution, which can be used in any environment and does not require any support from the hypervisor, but obviously it is too slow, as the data passes through the whole IP stack of the kernel. This solution also has security concerns, as anyone could access the open TCP port. FIFOs and vsockets have similar performance, though the FIFOs are slightly faster, and neither has the security concerns of the TCP/IP solution.
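As a side note on why vsockets fit this use case, here is a minimal sketch of a guest-side connection over AF_VSOCK. It is illustrative only: the port number is arbitrary, and trace-cmd's actual agent protocol and port negotiation are not shown.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/vm_sockets.h>

    int main(void)
    {
        /* vsock bypasses the IP stack; the host is always CID 2 */
        int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_vm addr = {
            .svm_family = AF_VSOCK,
            .svm_cid = VMADDR_CID_HOST,  /* the hypervisor host */
            .svm_port = 8400,            /* arbitrary port for this sketch */
        };
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }

        /* the trace data would be streamed here */
        write(fd, "trace data\n", 11);
        close(fd);
        return 0;
    }

From the guest's point of view this looks just like TCP socket code, but no IP configuration or open network port is involved, which is where the security and throughput advantages come from.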
FIFOs look like the best choice, but they have a few limitations. They support only one-to-one, unidirectional communication, so a FIFO is needed for each direction. They cannot be used for nested virtualization use cases, where the inner guest cannot establish a FIFO to the host. And FIFOs are supported only on KVM hypervisors. Vsockets seem to be a more broadly applicable solution: they support one-to-many, bidirectional communication, work for nested virtual machines and have wide hypervisor adoption. Vsockets were chosen as the default transport for guest-to-host trace data transfer, but there is an option to use FIFOs on KVM hypervisors, as they give better throughput.

Synchronizing the events' timestamps is critical: concurrent events might happen within an interval of a few nanoseconds. The more generic approach is to use a PTP-like algorithm between host and guest. Clock synchronization is already supported by the hypervisors: there is a paravirtualized clock, which uses a shared memory page between host and guest, and the KVM clock is implemented on top of this PV clock. But all of these are used to synchronize the guest clock with the host's. Our goal is not to modify the guest clock just for the sake of tracing; we want only to find the offset between the timestamps of the traced events. If the clocks are already synchronized, we can simply use them.

The PTP protocol is designed for high-precision clock synchronization, in the sub-microsecond range, on a local network. To achieve such accuracy, two important conditions must hold: the packet transit time must be equal in both directions, from client to server and from server to client, and the PTP packets must be timestamped in hardware. Obviously, these conditions are not met in the host-guest scenario: the round-trip time is not symmetric, and there is no hardware to record the timestamps. But these issues can be overcome. Ftrace itself can be used to record the timestamps, using trace markers: each time a packet is sent, an ftrace event is generated, and the timestamps of these events are used to calculate the offset. It is not as good as hardware timestamping, but it is good enough to achieve reasonable accuracy. The guest is the server side in this PTP packet exchange; that way, the inaccuracy caused by the scheduling of the virtual CPU is bypassed. The PTP packet exchange is performed hundreds of times in a row, the best matches are filtered out, and the average is calculated. Our tests show that with about 500 PTP packet exchanges, the inaccuracy caused by CPU caches, preemption and task migration can be mitigated.

But the guest clock is virtual, just like the whole guest machine: it is calculated by the hypervisor. If there is an easy way to access these hypervisor internals from user space, the synchronization task is not a problem anymore. We have it for KVM: the TSC clock offset, scaling ratio and fraction bits are exposed through the debug file system, and a simple formula can be used to calculate the guest clock. But it is not possible to use this approach in all cases: the information is available only for the KVM hypervisor, and ftrace must be configured to use the TSC clock source — ftrace supports multiple clock sources for timestamping the events, and x86-tsc is one of them. If all these conditions are met, very high accuracy can be achieved. For the other cases the PTP-like algorithm is available, and it gives good enough results for most use cases. All this is already available: the latest release of trace-cmd, 2.9, has support for host and guest tracing.
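For reference, a sketch of what that guest clock formula could look like, assuming the per-vCPU values have already been read from KVM's debugfs entries (on recent kernels, files such as tsc-offset, tsc-scaling-ratio and tsc-scaling-ratio-frac-bits under /sys/kernel/debug/kvm/; exact names and availability depend on the kernel version):

    #include <stdint.h>

    /*
     * Convert a host TSC timestamp into the guest TSC domain,
     * following KVM's scale-then-offset scheme. 128-bit
     * intermediate math avoids overflowing the multiplication.
     */
    static uint64_t host_to_guest_tsc(uint64_t host_tsc,
                                      uint64_t scaling_ratio,
                                      unsigned int frac_bits,
                                      int64_t tsc_offset)
    {
        unsigned __int128 scaled = (unsigned __int128)host_tsc * scaling_ratio;
        return (uint64_t)(scaled >> frac_bits) + tsc_offset;
    }

With this, a host event timestamped with the x86-tsc clock can be mapped directly into the guest's timestamp domain, with no packet exchange at all.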
Also, the logic is implemented in libraries — libtraceevent, libtracefs and libtracecmd — which can be used to build your own tools. There is also a Python interface to these libraries, still in development, but all this functionality will be available from Python as well. The latest KernelShark release, 2.0, can visualize host and guest traces in a nice and clear way. Here on this screenshot, each guest CPU is mapped to the corresponding host task that runs that virtual CPU. The horizontal bars in the host task represent the time slot between the VM-enter and VM-exit hypervisor events, in which the physical CPU is given to the guest. The guest events are visualized on top of that bar. It is easy to visually correlate the events and see them below in the list. Okay, that was all from me. Now Stefano will describe his work on evaluating host and guest traces. Stefano?

Thanks, Tzvetomir. Let's suppose we have just finished a host-guest tracing session. How can we know if everything went well? How can we know if the generated traces are trustworthy enough? This is the main goal of this second part of the presentation: to find a way to check how well the synchronization algorithm performed — in this case PTP or KVM, but it could be any kind of algorithm. This main goal can be analyzed under two different aspects: validation, that is, whether the synchronization process worked properly at all, and synchronization evaluation, in order to estimate the accuracy and the error rate itself.

For the validation process, two important events are tracked: kvm_entry and kvm_exit. As we know, the former reports the start of execution of virtual CPU instructions, and the latter the opposite. So, in order to say that the synchronization algorithm did its work properly, we expect to see only guest events inside a kvm_entry/kvm_exit section, and vice versa for all the events outside; otherwise, some problem occurred in the synchronization process. Keeping this behavior in mind, here we have an example of a valid trace: as we said, there are no guest events after a kvm_exit, and no host events after a kvm_entry. Instead, here we have an example of an invalid trace: as we can see, there are guest events four positions after a kvm_exit, which is clearly a synchronization error. For the case of host events inside the section, not all of them are errors. In fact, what we've seen so far cannot be applied strictly, since there are events that are allowed to be where they are in the trace, like the ones visible in this example. Since this kind of situation happens frequently inside a single trace, the tool has to handle this discrepancy, and we'll see how.

For the evaluation process, we can have multiple paths to follow. The one we decided to pursue is based on looking for specific sequences of events in the merged trace. The idea is to find events that, if seen on one side, host or guest, must lead to another specific event on the other side. Since we know that this always has to happen, we can use this alternation to infer the accuracy by measuring it. If the sequence could change, it's clear that we cannot use it, since we cannot rely on seeing the expected event on the other side. And if a sequence that we know must not change actually changes, we have found a synchronization error, already detected by the validation phase before. To sum up, the two requirements are consequentiality and no changes in any scenario.
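To make the validation idea concrete, here is a minimal sketch of the check, assuming a merged, time-ordered list of records where each one carries an event name and an origin flag. The structure and field names are hypothetical, and a real implementation would track the window per CPU and whitelist the legitimate host events mentioned above:

    #include <stdbool.h>
    #include <string.h>

    struct event {
        const char *name;   /* e.g. "kvm_entry", "sched_switch" */
        bool from_guest;    /* true if recorded by the guest kernel */
    };

    /* Count guest events falling outside a kvm_entry/kvm_exit window. */
    static int count_sync_errors(const struct event *ev, int n)
    {
        bool in_guest_window = false;
        int errors = 0;

        for (int i = 0; i < n; i++) {
            if (!ev[i].from_guest) {
                if (strcmp(ev[i].name, "kvm_entry") == 0)
                    in_guest_window = true;
                else if (strcmp(ev[i].name, "kvm_exit") == 0)
                    in_guest_window = false;
            } else if (!in_guest_window) {
                errors++;  /* guest event outside the entry/exit window */
            }
        }
        return errors;
    }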
The first of these sequences that we found is related to how a timer is actually armed in virtual environments. On bare-metal systems, it consists of simply writing a value to a register, the TSC deadline MSR, and when the timer expires, an interrupt is generated. As we know, in virtual environments this mechanism can't work directly, since a virtual machine, unlike a native one, isn't allowed to write to a physical MSR register. Thus, in order to really write the value to the register, or rather to permit the host to handle the request, the host has to take back control of the execution flow, leading to a VM exit, directly visible in a trace by looking for a kvm_exit with reason MSR write.

Thanks to this, we already have two points that could be related: the writing of the MSR to arm the timer, and the kvm_exit with reason MSR write. The last thing we need is to find an event that actually leads to the use of the MSR. As one might expect, we noticed that the hrtimer subsystem uses the TSC deadline MSR when in TSC-deadline mode, and this can be tracked down by looking at events like hrtimer_start, hrtimer_cancel and hrtimer_expire_exit. So now we have a viable sequence to use. We can see it, on the exit side, as visible in the trace example, in the hrtimer_start event on the guest leading to the kvm_exit with reason MSR write; and on the entry side, in the kvm_entry event following the write to MSR 0x6e0, which is the TSC deadline MSR address.

The second sequence can be found by narrowing down on the idle loop and observing how it is handled in a virtual context. As we know, when a CPU goes idle, a special task, the idle task, gets selected by the scheduler and starts executing. The work of the idle task consists of the execution of the idle loop, composed of two main portions: the selection of the idle state by the governor, and the actual transition to the selected state. The transition itself is also visible in the traces through a specific tracepoint, which reports its start along with the upcoming state. This also happens when a guest CPU goes idle, but there's a substantial difference: the CPU is virtual, which means it cannot really go into a C-state. Thus, the host has to handle this process, and since the execution flow is held by the guest, that leads to a VM exit. And again, this alternation of host and guest states is really precious for the analysis we are performing, since we now know that a C-state change on the guest side and the upcoming VM exit are strictly correlated. Let's check the sample and summarize what we have said so far: when we see a cpu_idle event on the guest side, we know that we will see a kvm_exit with reason HLT. Some configurations can make the sequence invalid, like idle=poll, because that leads to non-consequential events — the virtual CPU doesn't voluntarily leave the physical CPU — but this is a really borderline case, and the halt-based idle is usually selected by default.

Okay, now we finally have two sequences that can be used, and the last step is to actually select a way to retrieve information from them that is useful for estimating the accuracy of the tracing session. What we decided to do is to calculate the duration of each sequence found in the traces, by subtracting the timestamps of the events inside it, and to treat each result as a sample. So we are, in effect, creating event intervals, as in the sketch below.
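Here is a sketch of that sampling step for the first sequence, reusing the hypothetical event representation from the previous sketch with an added timestamp field. The real tool also checks the exit reason and handles CPUs separately; this shows only the pairing and the subtraction:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    struct tevent {
        const char *name;
        bool from_guest;
        uint64_t ts_ns;     /* synchronized timestamp, in nanoseconds */
    };

    /*
     * For each guest hrtimer_start, find the next host kvm_exit and
     * record the time difference as one duration sample.
     */
    static int collect_samples(const struct tevent *ev, int n,
                               uint64_t *samples, int max_samples)
    {
        int count = 0;

        for (int i = 0; i < n && count < max_samples; i++) {
            if (!ev[i].from_guest || strcmp(ev[i].name, "hrtimer_start"))
                continue;
            for (int j = i + 1; j < n; j++) {
                if (!ev[j].from_guest && !strcmp(ev[j].name, "kvm_exit")) {
                    samples[count++] = ev[j].ts_ns - ev[i].ts_ns;
                    break;
                }
            }
        }
        return count;
    }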
After collecting all of them, the mean and the standard deviation for each sequence type are computed, and the combination of the two estimates the general performance of the algorithm used. Another idea could be to infer at runtime a baseline of how long the entire sequence should take and calculate the deviation from it, which is an idea that could be developed in the next version of the tool.

And here we come to the final tool. By now, this version of it, which is a prototype, can handle more than one guest and more than one CPU. The -n option was included in order to exclude from the validation process events that are not related to synchronization problems, as we saw in the validation discussion before. And the -s option will dump all the samples found for every sequence. The tool itself is implemented using the libtracecmd library, which provides a series of essential functionalities for handling trace files and makes the entire processing transparent. The tool could also be useful for refining the positioning of some tracepoints, like the one that generates host events after a kvm_entry.

The output of this first version of the tool reports the statistics related to the validation phase — the events found outside and inside the kvm_entry/kvm_exit sections — and the mean, the variance and the standard deviation of all the samples found, divided by the two sequences of events. As we said before, the -s option will dump all the samples in the format visible here, meant for tools like gnuplot; it creates two different files, one for the timer event set and one for the other event set. In the case of two or more guests, the output presents the same general statistics section as before, but also specific information for every guest involved, making it easy to detect which of the guests performed better and which performed worse.

The last part of the presentation introduces a real case where the tool can be used to answer a question related to what Tzvetomir said before: which performed better, the PTP algorithm or the KVM one? In order to answer it, different scenarios were set up. We run the analysis on an idle system, but also on a stressed system, using stress-ng: the --matrix 0 parameter is used, which starts a worker for each CPU that performs various matrix operations on floating-point values. The entire analysis consists of 30 tracing sessions per scenario, of 20 seconds each — not a random number, since we want a good balance between the number of samples and the lost events — which generates around 8000 samples for the first sequence and 3000 for the second one.

The table above shows the results obtained in the validation phase: the percentage of guest events found outside the kvm_entry/kvm_exit sections. What we see is self-explanatory, since KVM never generated this kind of synchronization error. For the evaluation phase, instead, the table above sums up all the partial results obtained by the tool, by calculating the mean of each of them for every parameter. The results are divided by scenario — no stress, stressing only the host, stressing only the guest, and stressing both — and also by sequence of events. As we can see, in every situation KVM is far better than PTP. We also have to note that the HLT sequence cannot be used on a stressed system, since the stress doesn't permit the CPUs to actually go idle.
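For completeness, a sketch of the statistics step mentioned at the start of this part, computing the mean and the (population) standard deviation over the collected duration samples; the real tool's exact estimator may differ:

    #include <math.h>
    #include <stdint.h>

    /* Mean and standard deviation of n duration samples, in ns. */
    static void sample_stats(const uint64_t *samples, int n,
                             double *mean, double *stddev)
    {
        double sum = 0.0, sq = 0.0;

        if (n <= 0) { *mean = *stddev = 0.0; return; }

        for (int i = 0; i < n; i++)
            sum += (double)samples[i];
        *mean = sum / n;

        for (int i = 0; i < n; i++) {
            double d = (double)samples[i] - *mean;
            sq += d * d;
        }
        *stddev = sqrt(sq / n);
    }

A lower mean indicates tighter pairing between the guest event and the host event, and a lower standard deviation indicates more consistent, and therefore more trustworthy, synchronization.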
To better appreciate this performance gap, as an example, the two charts represent, respectively, the mean and the standard deviation of the first sequence of events. As we can see, the mean distance for KVM is always far below the PTP one, and the results obtained with KVM are also more trustworthy, as we can deduce from the standard deviation chart.

In conclusion, host and guest tracing is possible only if the traces can be synchronized with each other, and trace-cmd offers two synchronization mechanisms for that purpose. PTP is much more complex to implement compared to KVM and, as we've seen in the analysis before, is not as accurate, but it is hypervisor-agnostic, which means it doesn't need any information external to trace-cmd itself. On the other hand, we have KVM, which has a simpler implementation and is much more accurate compared to PTP, as we've seen, but relies on debugfs entries. And this dependency can be a problem: is debugfs a stable enough ABI? And what happens if debugfs is not present at all on a system, for example for security reasons or on production systems? The questions then arise: are there other ways to access the needed information? For example, a new system call, or maybe a hypercall from the guest to access the data of the host? Or what else can be done? If you have any feedback or ideas, feel free to contact us. Here are the links to the tools discussed in the presentation and the acknowledgments to all those who helped us during this work. And that's all. Thanks for your attention.