Okay, so let's start. Good morning, everybody. I'm Mitsuhiro Tanino. I'm working at Hitachi, Ltd., Japan. Thank you for participating in this session, and also thank you for giving me this valuable presentation opportunity. Today, I'd like to talk about how hardware error handling can be improved for the Linux KVM hypervisor.

Before entering the main topic, a short self-introduction. I'm Mitsuhiro Tanino, a Linux engineer, working at Hitachi, Ltd., Japan since 2004. My working area is RAS features for the KVM virtual environment, mainly the availability and the serviceability features; these two features are important for mission-critical environments. My currently focused area is improving the features associated with MCA recovery, in order to apply KVM to highly available virtual environments. I also have development experience with a virtual machine manager for heterogeneous cloud systems. This software can manage AIX, HP-UX, vCenter, HP VM, Windows, Linux; it can manage many hypervisors, so it was interesting software to develop.

So, here is what I will talk about. At first, I talk about the background of the hardware error handling improvements. Next, I talk about the expectations for hardware error handling. Third, I talk about the issues of uncorrected error handling and the issues of corrected error handling. And last, my future work.

So, first, the background. As you know, it is common that companies drive their business using private clouds or public clouds. In these virtual environments, a hardware error makes a greater impact on the system, because a few dozen virtual guests are working on one physical server. In previous environments, there were many servers and each server ran an individual service, so if one server went down, the other servers could keep their operations. Currently, however, the services and virtual machines are working on one piece of hardware and one hypervisor, so if a hardware error occurs, all guests are affected by the hypervisor error. The impact of hardware errors has become much bigger recently.

So one of the key features is hardware error handling. In order to minimize the affected area when a hardware error such as a machine check is detected, isolating the failing hardware and shutting down only the affected guests is required. Commercial Unix platforms such as AIX and HP-UX have isolation features for hardware errors when an uncorrected error occurs at a CPU or memory. On the other hand, Intel 64 and IA-32 servers did not have isolation features for CPUs or memory; therefore, a kernel panic was the only way when an uncorrected error occurred at a CPU or memory. However, with Nehalem-EX and the Intel CPU generations after Nehalem-EX, MCA recovery is supported. MCA recovery can handle uncorrected memory errors by detecting the specific error and informing the operating system about it. So this feature is already enough to be able to handle uncorrected memory errors.

There are three key features for hardware error handling. One is pre-failure detection, the next is failure isolation, and the third is continuity of isolation. As for pre-failure detection, this means detecting and isolating error regions of hardware that frequently exhibit corrected errors. If corrected errors occur frequently in a region, that region is likely to go down in the future, so isolating it is important.
Next, as for failure isolation, this means isolating an error region of hardware that exhibits an uncorrected error before the operating system or an application uses the region, using features such as MCA recovery. And the third point, as for continuity of isolation, this means retaining the information about hardware errors, such as corrected errors, permanently, until the broken hardware is replaced.

In the next section, I will talk about the expectations for hardware error handling. First, what is hardware error handling? For example, when the operating system receives a platform error, such as a memory 2-bit error, the system isolates the memory from the usable area, or recovers the error based on the hardware specification, to keep the system running as much as possible. Generally, a server has hardware such as CPUs, memory, and an I/O subsystem, and each piece of hardware has its own error handling features.

Also, as for errors in general, there are corrected errors (CE) and uncorrected errors, and the uncorrected errors have two types. The first is UC: this error cannot be handled by the operating system and cannot be recovered. UCR means an uncorrected recoverable error: for this error type, the hardware can be isolated through collaboration between the hardware and the operating system. A fatal error, again, means the system cannot continue.

This table shows the expectations for hardware error handling for each of the CPU and memory subsystems. For example, for a CE of a CPU, the expectation is that the operating system or an application monitors the error occurrences and isolates the CPU if possible. As the current action, mcelog, which is the software that monitors machine checks, monitors the CE occurrences, outputs logs, and isolates the CPU if possible. As for an uncorrected error of a CPU, the expectation is that the operating system handles the UC and isolates the error region before the operating system or an application uses that region. Currently, however, the processor only brings the system down, because the processor does not have a feature for handling the UC; we are waiting for such a feature from Intel. As for a CE of memory, the expectation is that the operating system or an application monitors the error occurrences and isolates the memory region if possible. Currently, mcelog monitors the CE occurrences and the region can be isolated, but this feature is not enough. As for the UCR type of memory, the expectation is that the OS handles the UCR and isolates the error region before the operating system or an application uses the region. As the current action, the kernel receives the UCR from the hardware and tries to isolate the error region if possible; if that is not possible, the hypervisor keeps running and ignores the error, but this behavior is not enough. So today, I focus on the UC and UCR of memory.

Here are the detailed improvement points for memory error handling. As for pre-failure detection of CEs, the improvement point is that mcelog does not retry an isolation that failed because the memory was in use or because the memory migration failed. As for continuity of isolation of CEs, the improvement point is that mcelog does not keep the CE occurrence information permanently, for example on a disk or in NVRAM; the information is stored in memory, so it is lost after rebooting the server, and a feature keeping it on permanent storage is required. As for failure isolation of UCRs, there are two improvement points. The first point is that kdump currently fails to get a memory dump after the isolation of a UCR, because the second kernel of kdump touches the error memory region and causes a kernel panic; so kdump fails, and we need to fix this. The second point is that a data loss problem occurs when an uncorrected recoverable error occurs on a dirty page cache, because the page is truncated and the data in the page cannot be written to disk anymore.
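As a small aside on observing the results of isolation: on kernels built with CONFIG_MEMORY_FAILURE, /proc/meminfo exposes a HardwareCorrupted counter with the total amount of memory the kernel has poisoned and withdrawn from the usable area. Here is a minimal sketch of reading it, just as an illustration of what is already available today:

```c
#include <stdio.h>
#include <string.h>

/* Print the amount of memory the kernel has marked as hwpoisoned.
 * The field exists when the kernel is built with CONFIG_MEMORY_FAILURE. */
int main(void)
{
    FILE *fp = fopen("/proc/meminfo", "r");
    char line[256];

    if (!fp) {
        perror("fopen /proc/meminfo");
        return 1;
    }
    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "HardwareCorrupted:", 18) == 0) {
            fputs(line, stdout);   /* e.g. "HardwareCorrupted:   0 kB" */
            break;
        }
    }
    fclose(fp);
    return 0;
}
```

Of course, this counter only covers the running system; as noted above, nothing persists across a reboot.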
This slide shows the types of machine check and the error handling rules. In the table of machine check types: CE means an error corrected by the hardware. UC means the hardware could not correct the error; the processor context is corrupted and the system cannot continue to operate. For UCR, there are three types: one is SRAR, the next is SRAO, and the next is UCNA; they are different errors. As for SRAR, software recoverable action required, the error is detected when the processor has already consumed the broken memory, so a recovery action is required, otherwise shutting down the system is recommended. As for SRAO, software recoverable action optional, some data in memory is corrupted, but the data has not been consumed, and the system can perform a recovery action and continue to operate. As for UCNA, some data in memory is corrupted, but the data has not been consumed, and the system may continue to operate.

Now look at the bottom figure; I will explain the UCR handling rule and the CE handling rule. If a machine check occurs, there are two types, UC and UCR. When a UC occurs, a kernel panic is the only way out. If the error type is UCR, there are the SRAR and SRAO types. If the error type is SRAO, there are two cases, depending on whether the error page belongs to the kernel side or the user side. If the error page belongs to the kernel side, the page cannot be isolated, so the error handling feature of the operating system ignores the error page and the operating system continues to operate. If the error page belongs to the user side, such as a process, the operating system kills the process and continues to operate. If a machine check occurs and the type is CMCI, the corrected machine check interrupt, this type is a corrected error: the operating system can offline the error page and keep the system running.

In the next section, I will talk about the issues of uncorrected error handling. This slide shows the SRAO isolation mechanism in Linux. There are six steps for SRAO error handling. At steps one and two, if a memory error occurs, the CPU finds the memory error while doing a patrol scrub or an explicit write-back; this is the MCA recovery feature of the CPU. The CPU raises an SRAO machine check exception into the running operating system. At step three, do_machine_check(), which is the machine check handler of Linux, catches the machine check and registers the error into the MCE ring buffer if the error type is SRAO, because SRAO is a recoverable error. At step four, the memory_failure() function, which is the hwpoison handler, is called in process context after the MCE occurs, and the handler keeps processing the SRAO entries in the MCE ring buffer until the ring becomes empty. At steps five and six, the hwpoison handler sets the PG_hwpoison flag in the page struct of the SRAO-affected page; the flag marks the broken physical memory areas. Then the hwpoison handler tries to isolate the target pages. The hwpoison handler has several error isolation actions; I will explain the details on the next slide.

So there are four types of isolation actions in the hwpoison handler. One is isolate: it isolates the error page from the usable area of the operating system, so other processes cannot use that error page. Next is kill: if the page belongs to a process and the process cannot recover, the process is killed. Next is SIGBUS: if the process has a signal handler for SIGBUS, the process handles the error using its own signal handler and continues to operate (a small demonstration of this action follows below). And ignore means the error page belongs to the kernel; that page cannot be isolated, so ignoring is the only action.

This figure shows the SRAO isolation mechanism between the host OS and a Linux guest OS. If the page belongs to a qemu-kvm process, a SIGBUS is sent to QEMU at steps six and seven. Then QEMU sends a pseudo SRAO to the guest at step eight. After that, the MCE handler inside the Linux guest receives the MCE and handles the SRAO in the same way as on the host. So the error page can be isolated inside the Linux guest as well.
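To make the SIGBUS action concrete, here is a minimal userspace sketch, assuming a kernel built with CONFIG_MEMORY_FAILURE and CAP_SYS_ADMIN privileges. It injects a software-simulated memory error with madvise(MADV_HWPOISON), which drives the same memory_failure() path as a real SRAO, and then recovers from the SIGBUS delivered on access. Whether the handler sees BUS_MCEERR_AO at injection time or BUS_MCEERR_AR at the next access depends on the early-kill policy (prctl PR_MCE_KILL); QEMU installs a SIGBUS handler following the same pattern in order to forward the error into the guest.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf env;
static volatile int bus_code;

/* The hwpoison handler's "SIGBUS" action lands here. */
static void on_sigbus(int sig, siginfo_t *si, void *uctx)
{
    (void)sig; (void)uctx;
    bus_code = si->si_code;
    siglongjmp(env, 1);           /* do not re-run the faulting access */
}

int main(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    struct sigaction sa = { .sa_sigaction = on_sigbus,
                            .sa_flags = SA_SIGINFO };
    char *page;

    sigaction(SIGBUS, &sa, NULL);
    page = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }
    page[0] = 1;                  /* back the mapping with a real frame */

    if (sigsetjmp(env, 1) == 0) {
        /* Simulate an uncorrected memory error on this frame. */
        if (madvise(page, psz, MADV_HWPOISON)) {
            perror("madvise(MADV_HWPOISON)");   /* needs CAP_SYS_ADMIN */
            return 1;
        }
        page[0] = 2;              /* touching the poisoned page -> SIGBUS */
    } else {
        printf("SIGBUS with si_code %s\n",
               bus_code == BUS_MCEERR_AR ? "BUS_MCEERR_AR" :
               bus_code == BUS_MCEERR_AO ? "BUS_MCEERR_AO" : "(other)");
        puts("page isolated; process recovered and keeps running");
    }
    return 0;
}
```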
So next, I will talk about the problems of the hwpoison handler for SRAO errors. There are two problems after SRAO error detection. One is that kdump fails to get a memory dump after the isolation of the SRAO; the other is data loss and data corruption in the guest when an SRAO is detected at the guest.

At first, I talk about the kdump case. When the kdump process takes a memory dump, the kdump kernel does not know which memory region has an uncorrected error of the SRAO type. Therefore, the kdump kernel may touch the memory page that has the uncorrected error, and as a result, a machine check kills the memory dump process, so kdump fails to get the memory dump.

In order to fix this problem, my approach is as follows. makedumpfile, which is one of the commands of kdump, already supports a feature that identifies types of unnecessary pages, such as free pages and so on: it records the pages in a bitmap, and those pages are excluded from the memory dump file. My approach is to exclude the hwpoison pages from the memory dump by adding the hwpoison pages to this dump bitmap, as shown here: a checked bit in the bitmap means the page is included in the memory dump, and an unchecked bit means the page is excluded from the dump file. Two fixes were necessary for this approach: on the kernel side, export the PG_hwpoison flag so that makedumpfile can use it, and on the makedumpfile side, add a new page type in order to exclude the hwpoison pages. I proposed a patch set for this; it was accepted by the maintainers and merged into makedumpfile 1.5.2 and the 3.9-rc1 kernel, so this upstream activity is complete.

This figure shows the kdump result. As you can see, there are some page types that are not included; these pages are excluded from the dump file: for example, pages filled with zero, cache pages, private cache pages, free pages, and, added by my patch, hwpoison pages. In this instance, makedumpfile reports 20 hwpoison pages, these pages are excluded from the memory dump, and kdump succeeds.
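The makedumpfile-side change can be pictured with a simplified sketch like the following. This is not the actual makedumpfile code: the function and parameter names (exclude_hwpoison_pages, hwpoison_bit) are hypothetical, with hwpoison_bit standing for the PG_hwpoison bit number that the kernel-side patch exports through VMCOREINFO, since the numeric value of that flag depends on the kernel configuration.

```c
#include <stddef.h>
#include <stdint.h>

static inline void clear_dump_bit(uint8_t *bitmap, uint64_t pfn)
{
    bitmap[pfn / 8] &= ~(1u << (pfn % 8));   /* excluded from the dump */
}

/* flags[i] is the `flags` word of the page struct for pfn `first_pfn + i`,
 * as read from /proc/vmcore.  Any page carrying PG_hwpoison is dropped
 * from the dump bitmap so the second kernel never reads the broken frame. */
void exclude_hwpoison_pages(uint8_t *bitmap, const uint64_t *flags,
                            uint64_t first_pfn, size_t npages,
                            int hwpoison_bit)
{
    for (size_t i = 0; i < npages; i++)
        if (flags[i] & (1ULL << hwpoison_bit))
            clear_dump_bit(bitmap, first_pfn + i);
}
```

The point is simply that the dump tool consults the page-struct flags recorded by the crashed kernel, instead of touching the poisoned frames themselves.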
Next is the data loss and corruption case. This figure shows the isolation actions of the hwpoison handler; the handler has different actions depending on the page type. If the error page belongs to the free pages, the handler isolates the target page and keeps the operating system running. Slab means a kernel page; in this case, the hwpoison handler cannot handle it, so it only ignores the page. Also, if the page belongs to the clean page cache, the page can be isolated while keeping the host OS or guest OS running. But in case number five of this figure, the dirty page cache, there is a difference between the current isolation action and the expected action, so let me explain the details.

This slide shows the data loss and corruption problem. At steps one and two, an application writes some data in write-back mode: the data is stored in the page cache, and the write system call returns instantly, so the write from the application completes right away. At step three, a memory error occurs on the cached page: MCA recovery raises a machine check into the kernel, and the hwpoison handler catches the error, sets the PG_hwpoison flag on the target page, and also sets the AS_EIO flag on the file's address space. Then, because the target page is dirty, the hwpoison handler truncates the target page at step four. As a result, the cached page is lost, and we have a data loss problem, because the cached data was never written to the disk. Next, if the application reads the data using the read system call at step five, old data is read from the disk at step six. Therefore, if the application continues processing using the old data, a data corruption problem occurs at step seven. Note that if the application uses fsync, the system call gets an EIO error the first time, but the AS_EIO flag of the file is cleared after that first fsync, so the next read silently gets the old data from the disk. This is the problem in detail.

The next slide shows the impact of the data loss and corruption problem in a user's KVM environment. In the first case, the impact of truncating the page cache is low, I think, because the qemu-kvm processes are the main processes on the host: if KVM uses cache=none for the disk, qemu-kvm submits I/O as direct I/O, and the I/O is not cached on the host side. In the second case, a cached mode, the impact of truncating the page cache is high, because I/O from the applications inside the guest is cached at the host file system layer. If that cache is truncated by the hwpoison handler, the data of the application inside the guest is lost. And also, if the application uses the old data read from the disk in its program, a data corruption problem occurs. So the required isolation action is to stop the guest OS instantly and save the user data before the data loss or data corruption problem happens. Currently, I am proposing a patch and discussing it upstream: I proposed a kernel panic knob in the hwpoison handler, some maintainers commented on my patch, and we are discussing what is the best way to handle the data loss problem.

In the next section, I talk about the issues of corrected error handling. At first, I explain the mcelog features. mcelog is a daemon for logging machine check errors; its features are error logging, error accounting, bad page offlining, and cache error handling. This figure shows a result of mcelog: if a memory error occurs, mcelog outputs a log like this, which indicates that the error occurred at CPU 0, that the error type is an uncorrected error, SRAO, and the address of the error here. This is another result of mcelog: it means five corrected memory errors occurred during 24 hours, and also one uncorrected error occurred during the 24 hours.

So next, I will talk about bad page offlining. If corrected errors occur frequently in a specific memory region, the region is likely to go down in the future. mcelog can track the corrected error trend in order to take a corrective action, or let the system administrator take one, when the error rate exceeds a threshold. This table shows the expected features for bad page offlining; a minimal sketch of the offlining interface follows the list. Number one: isolate the target page; in the case of a free page, this is currently supported. Number two: retry the isolation when the page becomes free; this feature is not currently supported by mcelog. Number three: keep the memory error database on a disk or in a non-volatile type of memory until the hardware is replaced; this feature is also not supported by mcelog. Number four: isolate the error pages again, using the memory error database, after rebooting the server; this feature is not currently supported either.
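For reference, the isolation itself goes through a sysfs interface: mcelog's bad page offlining writes the physical address of the page to /sys/devices/system/memory/soft_offline_page, which asks the kernel to migrate the page's content away and withdraw the frame from the usable area. Here is a minimal sketch of that interface, assuming root privileges and a kernel built with CONFIG_MEMORY_FAILURE:

```c
#include <stdio.h>

/* Ask the kernel to soft-offline one page: its content is migrated away
 * and the frame is removed from the usable area (PG_hwpoison is set). */
static int soft_offline(unsigned long long phys_addr)
{
    FILE *fp = fopen("/sys/devices/system/memory/soft_offline_page", "w");

    if (!fp) {
        perror("open soft_offline_page");  /* needs root + CONFIG_MEMORY_FAILURE */
        return -1;
    }
    fprintf(fp, "0x%llx", phys_addr);      /* physical address of the page */
    return fclose(fp) ? -1 : 0;            /* write error: offlining failed */
}

int main(int argc, char **argv)
{
    unsigned long long addr;

    if (argc != 2 || sscanf(argv[1], "%llx", &addr) != 1) {
        fprintf(stderr, "usage: %s <physical-address-in-hex>\n", argv[0]);
        return 1;
    }
    /* A persistent variant would also append addr to a database file here
     * and replay that file at every boot, before the KVM guests start. */
    return soft_offline(addr) ? 1 : 0;
}
```

A daemon extended along the lines of the final comment, replaying the saved addresses early at boot, would point toward requirements three and four below.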
As for the requirements of numbers one and two, retrying after corrected error detection: mcelog isolates the failing pages when the error rate exceeds the threshold. However, mcelog fails to isolate a page when the page is in use by an application or when the page belongs to the kernel. In the former case, the page could be isolated from the usable pages after the application has finished using the page. So the required feature is that mcelog should retry isolating the page when the page is freed by the application.

As for the numbers three and four case, continuity of isolation after corrected error detection: mcelog actually manages the error pages using a memory error database, but this database is temporary, inside memory only. So the information about error pages, the pages which exhibited corrected errors, exists only in memory, and this information is flushed by a reboot, such as the daily system maintenance or the weekly system maintenance. The required action for number three is to store the memory error database in a file on a disk, or to output the errors into NVRAM using UEFI API functions, in order to keep the data until the hardware is replaced. And the required action for number four is to isolate the error pages again, using bad page offlining with the error data of number three. The error pages should be isolated from the usable pages before the boot-up of the KVM guests, because KVM guests allocate a large amount of memory.

So this is the summary of my presentation. Recently, it is common that enterprise servers, especially for cloud environments, have a large amount of memory. This means the possibility of memory failures is increasing. With the improvement of the hardware, hardware error handling has been provided on Intel 64 and IA-32 servers, and the hardware error handling of Linux has been continuously evolving; however, more improvements are necessary on these three points. The first point is pre-failure detection: improve the detection of frequent corrected errors, and improve the isolation which marks the error page as hwpoisoned and removes it from the usable area. As for failure isolation: isolate hardware uncorrected errors, providing isolation actions that users can select to meet their needs. And the third point, the required action for continuity of isolation: store the hardware error data permanently on a disk, and use that data for preventive maintenance in order to prevent future errors. So, as future work, there is work remaining for pre-failure detection, failure isolation, and continuity of isolation. Thank you very much. If you have a question, please ask me.