Hi, my name is Mauro, and I'm here to talk about reliability, availability and serviceability. [inaudible] So people created the concept of high availability, and they decided to implement a series of measures in order to be able to evaluate, between one system and another, which system is a better target for high availability.

The first concept is reliability, which is basically the probability that a system will produce a correct result, that it will provide a correct output. The associated measure is called mean time between failures (MTBF), which basically measures how long the system runs without having any problems. What we should do when we want to improve reliability is to provide mechanisms to detect, to avoid, and to help repair those faults.

The second concept is availability, which is the probability that the system will be operational at a given time. It is generally measured as a percentage of downtime over the period in which you are measuring. For example, we talk about "two nines" (99% availability) when the system is unavailable for at most 3.65 days per year. We may measure per month or per hour, but per year is the usual way of presenting this kind of measure. Five nines is about five minutes of downtime per year. Of course, if I can detect hardware failures and correct them, I will increase the availability of the system.

The last concept is serviceability, or maintainability, which refers to the time that you need to repair the system. It is generally measured as the mean time to repair (MTTR). Of course, if I have better support, the system will be more serviceable.

Well, there are several ways to improve the availability of a system: I can use hardware measures, software measures, and support measures. For example, in the case of the CPU, I can detect errors inside the caches, inside the buses, and inside the processing blocks themselves. In the case of memory, I can detect memory errors in storage and add checksums on the buses, and so on. So there are several ways of improving the hardware by adding checksums and similar mechanisms. I can also improve the availability of the system by improving the support from the IT team or from my vendor. One thing that has been used a lot lately is virtual machines: if the physical machine crashes, I can move my virtual machine to another physical machine and the services will stay available.

The most important thing, which is the focus of the rest of this talk, is that I can use predictive analysis. I can detect problems in advance, before the hardware actually crashes, and then take preventive measures in order to make my system more available over time. For that, the hardware needs to provide some way to detect and correct errors and to detect when components are degrading. And of course I need user space tools that allow me to analyze the data provided by the hardware, so that I can improve the system and schedule preventive stops at times when the service won't be affected.
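To make the "nines" arithmetic concrete, here is a small illustrative C program (my own example, not shown in the talk) that converts an availability percentage into the maximum allowed downtime per year:

    #include <stdio.h>

    int main(void)
    {
        /* Minutes in a (non-leap) year: 365 * 24 * 60 */
        const double minutes_per_year = 365.0 * 24.0 * 60.0;
        const double availability[] = { 99.0, 99.9, 99.99, 99.999 };

        for (int i = 0; i < 4; i++) {
            double downtime =
                minutes_per_year * (100.0 - availability[i]) / 100.0;
            printf("%7.3f%% -> %8.2f min (%.2f hours) of downtime/year\n",
                   availability[i], downtime, downtime / 60.0);
        }
        return 0;
    }

Running it shows that two nines (99%) allow about 5256 minutes, i.e. 3.65 days, per year, while five nines (99.999%) allow about 5.26 minutes per year.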
Well, in the case of the Linux kernel, since the beginning we have had some sources of measures to work on high availability. The first ones were probably in the storage block layer, which has CRCs and other measures that help improve the availability of storage systems. But recently we started to add more things. Probably the most important one happened in kernel 2.6.32, with the addition of support for the Machine Check Architecture (MCA). This is something that Intel introduced, initially on its Pentium-class machines. It consists of some blocks inside the CPU that detect errors in the CPU and in the components the CPU talks to directly. It can provide information about memory errors, bus errors, CPU caches, and the instruction fetch itself. There is one user space tool, called mcelog, that records that information, allows you to decode it, and detects what is happening, provided that the error doesn't crash the machine. Because if it crashes, the user space tool won't be called and there is nothing that can be done.

The thing is, the interface between the kernel and mcelog is based on a very weird API: basically, what the kernel exposes to user space are the values of some special MCA registers inside the CPU, and it's up to user space to decode that information. The decoding depends on the CPU model and the CPU stepping, so mcelog needs to know everything about that specific CPU model. The data also needs to be decoded on the same machine, of course, otherwise you may not be able to interpret the error properly.

Another feature that was added to the Linux kernel is called EDAC. EDAC basically does error detection on memory and PCI buses. The PCI part is for really old hardware; newer hardware doesn't have EDAC support for PCI, only for memory. The EDAC interface sits at a higher level: it passes along information about which DIMM, or which part of a DIMM, was affected by a problem. Both MCA and EDAC have a trace interface too; they report errors via kernel trace events. This interface is very interesting, and I will explain a little more about it in the next slides. In the case of EDAC, even if the memory error is fatal and crashes the machine, since the decoding of the error occurs inside the kernel, it provides consistent information to user space before the machine crashes. So user space can know that a certain DIMM had a problem and that maybe it's time to replace that memory module.

There is one detail about EDAC: in most cases we talk directly to the memory controller, but newer systems and a few unusual architectures don't allow you to access the hardware directly. On those we need to talk to the BIOS, and the BIOS will then report the error; that's what we call "firmware first".

A very recent addition to those hardware capabilities is PCIe errors; these were added about six months ago. So right now you may also get PCIe errors from Linux via trace events. This way you can also detect when the bus is not working, when the communication between the main board and the PCIe device is not fine (maybe bad contacts), and you can work proactively to improve things. PCIe is able to correct errors too, so even if the device is still working you may have trouble there, and it may mean that you need to replace that hardware at some point, or take the machine down and check the contacts and things like that.
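As a rough illustration of the kind of information EDAC exposes, here is a small sketch (my own example, not from the talk) that reads the corrected and uncorrected error counters that the EDAC driver publishes in sysfs; the paths assume a system with at least one memory controller registered as mc0:

    #include <stdio.h>

    /* Read a single numeric counter from an EDAC sysfs attribute. */
    static long read_counter(const char *path)
    {
        long value = -1;
        FILE *f = fopen(path, "r");

        if (f) {
            if (fscanf(f, "%ld", &value) != 1)
                value = -1;
            fclose(f);
        }
        return value;
    }

    int main(void)
    {
        /* ce_count: corrected errors; ue_count: uncorrected errors. */
        long ce = read_counter("/sys/devices/system/edac/mc/mc0/ce_count");
        long ue = read_counter("/sys/devices/system/edac/mc/mc0/ue_count");

        printf("mc0: %ld corrected, %ld uncorrected errors\n", ce, ue);
        return 0;
    }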
The thing is, we have those features inside the kernel, but when looking at the other side, there are very, very few user space applications actually getting that information and doing something useful with it. Until about three months ago there were basically two open source tools: mcelog, which focuses only on the Intel MCA type of errors, and edac-utils, which is the user space counterpart of the kernel EDAC support and works with the DIMM-label types of errors. With that situation in mind, I decided to create another tool that consolidates all those sources of errors in just one tool and provides one consistent way of getting those errors out of the system: rasdaemon.

Rasdaemon collects the errors from all three subsystems: MCA, EDAC and the PCIe error subsystem. It gets that information via trace events: it hooks into the kernel tracing subsystem, listens to what is happening there and, depending on what happens, it reports and logs into a database. That gives you persistent storage for those errors, so you can later analyze them and use whatever you need when it is time to do some intervention on those machines. We also worked on adding newer features to improve that support inside the kernel, and rasdaemon is able to use all those features, which are very neat, very nice.

So, looking at what happened inside the kernel: in 2.6.32 what we had was basically the EDAC subsystem, but the only thing EDAC did at that time was write things to the kernel log, so you needed to take a look at it and see if something bad had happened. Also, the way EDAC worked at that time was rank-based. Let me explain a little more about that. This is how memory works in practice: we have a matrix of rows and columns where the data is stored; those cells are grouped into banks, and the banks joined together form what we call a rank. Typically, a DIMM has just one rank. When people started to need more and more RAM, they started to use the two faces of the DIMM: on one face they put one rank, on the other face another rank; those are called dual-sided DIMMs. And right now we even have some DIMMs with four ranks inside.

Well, the EDAC subsystem at that time was rank-based, but I cannot replace just one rank. If the DIMM fails, I need to replace the entire DIMM; if it has two or four ranks, all of those ranks get replaced together. Yet the system identified errors rank by rank, and it's not easy to associate a rank with the physical DIMM it sits inside. That's one of the troubles with the EDAC subsystem back in 2.6.32. Mcelog also had its own crappy interface that basically passes raw registers to user space, with all the decoding of the error done in user space only. In 2.6.32 we actually added support for kernel tracing of MCA events, but there were no tools using those trace events; it was there, but nobody used it because there was no user space counterpart.

In kernel 3.5 we rewrote the entire EDAC subsystem to allow it to support the new types of architectures that started with recent Intel chips. Now we can see errors DIMM by DIMM, and not rank by rank, depending on the memory controller. And we also added the same kind of support that was already in the MCA code, which is the trace events mechanism.
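To illustrate the rank/bank geometry described above, here is a toy model (purely illustrative, with made-up example numbers; real DIMM geometries vary) of how cells, banks and ranks add up to the capacity of a DIMM:

    #include <stdio.h>

    /* Toy model of DRAM geometry: cells are addressed by row and column,
     * grouped into banks; a set of chips operating together forms a rank. */
    struct dimm_geometry {
        unsigned ranks;          /* 1 = single-sided, 2 = dual-sided, ... */
        unsigned banks_per_rank;
        unsigned rows;
        unsigned cols;
        unsigned bits_per_cell;  /* data width read per column access */
    };

    int main(void)
    {
        /* Example numbers only. */
        struct dimm_geometry g = {
            .ranks = 2, .banks_per_rank = 8,
            .rows = 65536, .cols = 1024, .bits_per_cell = 64,
        };

        unsigned long long bits = (unsigned long long)g.ranks *
            g.banks_per_rank * g.rows * g.cols * g.bits_per_cell;

        printf("capacity: %llu GiB\n", bits / 8 / (1024ULL * 1024 * 1024));
        return 0;
    }

With these example numbers, a dual-rank DIMM comes out at 8 GiB; the point is that a failing cell is located by rank, bank, row and column, while the replaceable unit is the whole DIMM.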
This is just a simplified slide to show what changed between the original PC architectures and modern systems. Originally the PC looked like the upper picture there: we had a south bridge and a north bridge, and the memory was connected directly to the north bridge. In most cases I was not able to get any error information from that. Then memory controllers started to appear, as a chip outside the north bridge or, in some cases, inside the north bridge, and we started to have some control over the error correction mechanisms for the DIMMs. On modern architectures the entire north bridge went inside the CPU, and that's what we have right now on Nehalem, on Sandy Bridge, on Haswell, on those new architectures, and on ARM64. There are some special kinds of big machines with lots of CPUs that also have an extra chip that talks to the memories, called the SMB.

Well, in kernel 3.9 we also got a new driver covering those cases where we cannot talk to the north bridge or to the memory controller, because there may be other chips in between, such as SMBs, or because the firmware is already talking directly to the memory controller, and if we tried to do the same at the same time we would race with the BIOS. So we needed to add this kind of driver that goes through the BIOS, and the BIOS then provides us the information about the errors. We also added the trace events for PCIe AER, and finally in kernel 3.10 we added a new series of useful features for trace events.

Basically, a lot changed inside the tracing facilities. Before kernel 3.10, if we opened a trace event device node we needed to keep polling all the time, because I could not, for example, call read() and wait for the data to arrive; there were some issues inside the tracing subsystem of the kernel that didn't allow that kind of thing. So we had to poll constantly, which consumes power and makes the CPU spend cycles. That's not a good solution, so we added to the kernel a way to wait for an event to happen instead of polling all the time.

We also allowed the trace interface to be used by more than one application. On any kernel before 3.10, if you are using the trace interface, nobody else can use it, because there is just one pipeline; if you change the configuration of that pipeline, all other applications are affected. After kernel 3.10, I can create a special pipeline, a tracing instance, just for monitoring the RAS events.

Another problem we used to have before kernel 3.10 is related to the timestamps of the events. The trace interface was originally added for measuring the performance of applications and of the kernel, so it has very precise timestamps, but they were not referenced to the system clock: I could not associate that information with, for example, the system timestamp. Starting with kernel 3.10 we added to the tracing facility a way to tie the trace clock to the system uptime, so I can now correlate events with the clock time of the machine. The rasdaemon tool can use all those features provided by those subsystems.
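To give an idea of how a monitoring tool can use the kernel 3.10 tracing features just described, here is a rough sketch of my own (simplified; rasdaemon itself is more elaborate) that creates a private tracing instance, enables the EDAC memory error event in it, and then blocks reading the trace pipe instead of busy-polling:

    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define TRACEFS "/sys/kernel/debug/tracing"

    /* Write a short string to a tracefs control file. */
    static int write_str(const char *path, const char *s)
    {
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        ssize_t ret = write(fd, s, strlen(s));
        close(fd);
        return ret < 0 ? -1 : 0;
    }

    int main(void)
    {
        char buf[4096];

        /* Creating a directory under instances/ creates a private
         * tracing pipeline (kernel 3.10+), so we don't disturb other
         * users of the main trace buffer, such as perf. */
        mkdir(TRACEFS "/instances/my-ras-monitor", 0755);

        /* Enable the EDAC memory error trace event in our instance. */
        if (write_str(TRACEFS "/instances/my-ras-monitor"
                      "/events/ras/mc_event/enable", "1") < 0) {
            perror("enable mc_event");
            return 1;
        }

        /* trace_pipe blocks until an event arrives: no busy polling. */
        int fd = open(TRACEFS "/instances/my-ras-monitor/trace_pipe",
                      O_RDONLY);
        if (fd < 0) {
            perror("open trace_pipe");
            return 1;
        }

        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf) - 1);
            if (n <= 0)
                break;
            buf[n] = '\0';
            fputs(buf, stdout);  /* a real daemon would parse and log this */
        }
        return 0;
    }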
OK, let me go into a little more detail about firmware first and how it works. As I said before, in several cases I can go directly to the memory controller and get the information there. What happens nowadays is that BIOS manufacturers are very lazy about changing the BIOS information between two similar systems. So I may have two systems that look similar: the motherboard layouts are a little different, they use the same BIOS, but the number of DIMMs is different, the number of DIMM sockets is different, and yet the information installed in the BIOS is the same. So the BIOS is not reliable enough. That's why the decision, when people started the EDAC subsystem, was to go directly to the hardware, because there I know the information is 100% correct. It's also faster, because I'm going directly to the hardware.

But there are some systems where the BIOS is also reading information from the memory controller. And the registers used to store the error information are cleared on read: when I read the information from such a register, it is cleared. So if the BIOS and the operating system are both trying to read the same information from the memory controller, either the BIOS or the operating system will get it. Worse than that, one will interfere with the other: I may read part of the information, then get preempted by the BIOS, and the BIOS will read another register. So neither I nor the BIOS will have the complete picture, and the error will be lost. So on systems where the BIOS talks directly to the memory controller, I need to shut down the direct hardware interface and rely on the BIOS to get the information. That's what happens with lots of vendors: they have their own proprietary mechanisms to get that information, and those interfere with the kernel. On the other hand, the information provided by the BIOS is generally incomplete right now; in several cases we cannot track an error down to a single DIMM, because the BIOS doesn't provide that information. So it is a trade-off between the two approaches. That's why we have both inside the kernel.

With that in mind, we started the new tool, which is rasdaemon. It's currently available on Fedora, since Fedora 18. We expect it to be picked up by other distributions too, but that will take some time; the tool is only two months old, so it is brand new. Basically, it hooks into the kernel trace events, creates its own tracing instance there, and waits for errors to occur. When an error occurs, the tool reports and logs it. It also allows associating the labels that are printed on the motherboard with the addresses inside the memory controller. And there are also some tools included for testing the subsystem and the tool itself.

This is an example of kernel 3.10 running on a Sandy Bridge machine, this specific machine. This machine has only two dual-rank DIMMs, each with 8 GB of memory. So if I call the control tool from rasdaemon and ask for the layout, it will print that information. Basically what it is saying here is that at channel 0, slot 0, I have one DIMM on the first CPU (memory controller 0, in this case, is the first CPU), and on the second CPU I have just one DIMM, also at channel 0, slot 0. That's what I have physically on my machine.

I can also check how the kernel knows the label information for that memory. I can ask it to print the labels; I'm not sure if this is visible. And it will show, in this case, that it doesn't know the names yet, so it will just show the same information that I had: CPU 0, CPU 1, channel 0, DIMM 0. But then I can load that information from a special file by registering the labels; it will read the information from that file. And if I call it again, it will now give me the real names: DIMM_A1 and DIMM_B1. That's what is printed on my motherboard.
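The commands being demonstrated here are presumably rasdaemon's ras-mc-ctl tool (with its --layout, --print-labels and --register-labels options). Underneath, the labels live in sysfs; as a minimal sketch (assuming the post-3.5 EDAC layout, where each DIMM appears as a dimmN directory with a dimm_label attribute), one could read them like this:

    #include <stdio.h>
    #include <glob.h>

    int main(void)
    {
        glob_t g;
        char buf[128];

        /* Each detected DIMM exposes its label in a dimm_label file. */
        if (glob("/sys/devices/system/edac/mc/mc*/dimm*/dimm_label",
                 0, NULL, &g) != 0) {
            fprintf(stderr, "no EDAC DIMM entries found\n");
            return 1;
        }

        for (size_t i = 0; i < g.gl_pathc; i++) {
            FILE *f = fopen(g.gl_pathv[i], "r");
            if (!f)
                continue;
            if (fgets(buf, sizeof(buf), f))
                printf("%s: %s", g.gl_pathv[i], buf);
            fclose(f);
        }
        globfree(&g);
        return 0;
    }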
So if an error occurs on one of those DIMMs, I am able to tell on which DIMM the error actually occurred. That's what I'm doing here: I started the daemon and used the fake error injection mechanism to emulate errors occurring on that machine. In this case I have two corrected errors that were fake-generated: the first one at DIMM_A1, the second at DIMM_B1. Here is the location: memory controller 0, channel 0, slot 0 of that memory controller. I can of course generate a summary of the errors and, as all those errors are stored in a database, I can later compute more statistics and do more things with them.

We are right now in the phase of adding new features to rasdaemon. We are negotiating with Intel and other companies to add some analysis engines there, to improve the analysis and provide something more helpful for user space. As I said before, this tool is brand new; we want help and we want to add more features, so feel free to send us contributions, suggestions and so on, so that we can improve the tool and make it better for everyone.

That's what I have prepared for this presentation. I am now open for questions. I would prefer if you could ask the questions at the microphone; we have one microphone there. Questions, doubts, complaints?

Let me try to repeat the question. He is asking what kind of trouble may happen on a kernel before 3.10 if I use this tool, and what other software could interfere with rasdaemon. What happens is that currently the trace interface is there on the system, but it is not actually used by any other tool, except if you, for example, want to measure the performance of an application. If you try to run perf, perf will open the trace interface and change things there: it will reconfigure the events that are being tracked, and it will also read those events using the same device node as rasdaemon. So if rasdaemon is running, perf will interfere with it, because it changes what is being monitored. That's why you cannot use perf while running rasdaemon on those machines. After 3.10, we create a new tracing instance, and perf will not use that instance, it will use the general one, so it won't interfere.

Yes, let me see if I understood the question. You are asking what methods exist to let someone evaluate whether a new system is actually providing the right error reports, right? This is something that we didn't directly address; it would require some sort of hardware error injection. Actually, in the case of Intel, we worked together with them on some drivers so they could double-check, using hardware error generator tools, that this kind of information is right. One thing that has been added recently on Intel machines, though it may be disabled by the BIOS, is what we call the error injection feature. Modern Intel CPUs are able to inject errors, and when you inject an error this way, it is injected just as if the error had originated from the device itself. With this kind of feature you can reproduce very precisely the way an error would be produced by the hardware itself. If it is enabled in your BIOS, you can actually test the end-to-end error detection solution.
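On machines where the BIOS exposes that capability, Linux makes it available through the ACPI EINJ debugfs interface. Here is a minimal sketch (assuming the einj module is loaded and the platform supports it; the address is just an example value) of injecting a corrected memory error:

    #include <stdio.h>

    #define EINJ "/sys/kernel/debug/apei/einj/"

    /* Write one value into an EINJ debugfs control file. */
    static int einj_write(const char *file, const char *val)
    {
        char path[128];
        snprintf(path, sizeof(path), EINJ "%s", file);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int ret = fputs(val, f) == EOF ? -1 : 0;
        fclose(f);
        return ret;
    }

    int main(void)
    {
        /* 0x8 = memory correctable error (see available_error_type). */
        if (einj_write("error_type", "0x8") ||
            einj_write("param1", "0x12345000") ||          /* phys address */
            einj_write("param2", "0xfffffffffffff000") ||  /* address mask */
            einj_write("error_inject", "1")) {
            perror("einj");
            return 1;
        }
        puts("corrected memory error injected; watch the RAS trace events");
        return 0;
    }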
As I said before, most BIOS vendors disable that feature in their BIOS, so you may need to ask them to put the BIOS in developer mode for you to be able to enable it and do that sort of testing. The idea is to let the daemon run from boot time. It won't be consuming CPU, because it will just be waiting for those specific hardware events; if the hardware is OK, it should not consume anything but a little memory, since it is running there all the time. And since you start it at boot, it will be running all the time; it starts really fast. Any more questions? Okay, thank you.