Good morning, everyone. It's very nice to have you join this presentation, especially this early, so have a nice day. The topic is enabling SR-IOV in PowerNV mode on the Power7 platform. It is very platform and hardware specific, but I hope you can still enjoy the session.

This is today's agenda. First I will show some basic concepts of the I/O architecture on the Power7 platform. Then I will show the restrictions this platform introduces for enabling SR-IOV, and our solutions to them. Last is the future work.

First is the concept of the partitionable endpoint. A partitionable endpoint (PE) is a separately assignable I/O unit. It could be a single function or multiple functions of an I/O adapter (IOA), or multiple IOAs, possibly including upstream switches and bridges. The PE concept is defined in the Power Architecture Platform Requirements, called PAPR, which you can download from power.org. The PE concept is very important on the Power7 platform. In an SR-IOV environment, for example, when we want to give a virtual function (VF) to a guest, we need to put that function into a PE so that we can assign it to the guest. That is the requirement for passing a VF through to the guest.

This is an example of what a PE could look like. This is a general architecture: a CPU attached to a PCI host bridge, which could have some PCI bridges below it with PCI I/O adapters attached. The first kind of PE, the bus PE, basically includes many PCI devices, so we can group a lot of PCI devices into just one PE. Another kind is called the device PE, where we put just one individual function into the PE. It depends on what kind of function you have.

On the Power platform, in order to form a PE, we need to give the PE several resources. There are the PE number and the bus/device/function (BDF) range, which determines which devices the PE includes.
Then there are the MMIO (memory-mapped I/O) address range, the DMA address range and the MSI address range. That is the general concept of the PE.

Why do we have PEs on a Power system? The reason is that the PE is an isolation unit. When an error occurs, the system needs to know which PE is in an error state, so that we can do the recovery based on the PE number. This is a feature of the Power system: when an error happens, it is confined to one particular PE, and the recovery is done for that PE.

To make the PE understood by the hardware, the Power I/O architecture has several hardware tables that must be configured. The purpose of these tables is to map the corresponding resources, such as the BDF range, the MMIO range, the DMA range and the MSI range, to a PE number, so that the system knows that a given I/O transaction came from this PE and does the recovery for this particular PE.

The first table is called the PELTM, the PE lookup table. It is a brief one. Each PHB has one such table with 128 entries, and each entry represents one PE and the BDF range that PE can have. Each PE has one and only one entry in the PELTM, so the mapping is one to one. When an error happens, the system gets a BDF number and the hardware scans the table from top to bottom. The first entry whose BDF range matches returns its PE number to the system, which then knows that this PE is in an error state.

The next one is called the MDT. It could be 32-bit or 64-bit; there are two tables that map different ranges. It covers the MMIO space and tells us which PE a given MMIO address belongs to. During the boot stage, the system gives a range of MMIO space to the kernel, and the kernel takes that range and divides it equally.
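The equal division of the MMIO range can be pictured with a small model. This is only a sketch under stated assumptions: the class name `MmioSegmentTable`, the window base and size, and the PE number are all hypothetical, and the 128-entry table size is the figure quoted in the talk rather than something checked against hardware documentation.

```python
from typing import List, Optional

NUM_SEGMENTS = 128  # table size as quoted in the talk; treat as an assumption


class MmioSegmentTable:
    """Toy model of the MDT: one owning PE number per equal-sized segment."""

    def __init__(self, base: int, size: int, num_segments: int = NUM_SEGMENTS):
        if size % num_segments:
            raise ValueError("window size must divide evenly into segments")
        self.base = base
        self.size = size
        self.segment_size = size // num_segments
        self.owner: List[Optional[int]] = [None] * num_segments

    def segment_of(self, addr: int) -> int:
        """Index of the table entry that covers addr."""
        if not self.base <= addr < self.base + self.size:
            raise ValueError("address outside the MMIO window")
        return (addr - self.base) // self.segment_size

    def assign(self, pe_number: int, start: int, length: int) -> None:
        """Give every segment touched by [start, start + length) to one PE."""
        first = self.segment_of(start)
        last = self.segment_of(start + length - 1)
        for seg in range(first, last + 1):
            self.owner[seg] = pe_number

    def pe_of(self, addr: int) -> Optional[int]:
        """PE that owns the segment containing addr, or None if unassigned."""
        return self.owner[self.segment_of(addr)]


# Hypothetical 1 GiB window, which yields 8 MiB segments.
mdt = MmioSegmentTable(base=0x8000_0000, size=0x4000_0000)
# A PE with a 24 MiB MMIO requirement ends up owning three entries.
mdt.assign(pe_number=5, start=0x8000_0000, length=0x0180_0000)
print(mdt.pe_of(0x8100_0000))   # 5
print(mdt.pe_of(0x8180_0000))   # None
```

The point of the model is that ownership is tracked per segment, not per byte, so whatever shares a segment necessarily shares a PE.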
There are 128 entries in the MDT, and the range is divided equally among them, so each entry of the table represents one 128th of the MMIO range. This table differs from the previous one in that one PE can have several entries, whereas in the PELTM every PE gets exactly one entry. So if a PE has a very large MMIO space requirement, it can have several entries in this table.

The next one is the TVT, the TVE validation table. This table is used to validate the DMA address and to do the IOMMU mapping from the DMA address to the physical memory address. Each entry in this table represents a range of the DMA space, all of the same size, and the minimum size on the Power platform is 128 megabytes. In this table too, a PE can have several entries. For example, if a PE requires more DMA resource than one entry provides, it can have several entries mapped in this table. Each entry has a pointer to the IOMMU table that does the address mapping.

The next one is for the MSI address and is called the MSI validation table, the MVT. The MVT maps the MSI addresses across the existing PEs, so that when a device writes data to an MSI address, the system knows which PE that MSI address belongs to.

So we have many hardware tables to help us find out which error happened in which PE and to do the recovery. But on the current Power platform these hardware tables have some restrictions for SR-IOV cards.

The first one is the limited number of PEs. As we saw, the PELTM has just 128 entries, which means the total number of PEs is 128 and we cannot exceed that. This worked in the previous stage, when VFs were not involved, because several PCI devices or several buses could be included together in one PE. So it seemed to be no big problem.
But now VFs are involved. As I mentioned before, on the Power platform, when we want to assign a VF to a guest, we have to put the VF in a PE, because the PE is the minimum assignable unit on the Power platform. So each VF has to go in an individual PE; otherwise we cannot assign it to the guest. The current idea is to leave the original design as it is and put each VF into an individual PE.

Even though we can do the assignment as on the previous slide, it introduces some problems. This is what the original PELTM looks like: we assigned the PE numbers from 0, 1 and so on. If we just put the VF PEs behind the bus PEs, it will not work, because, as I mentioned on the PELTM slide, the scan goes from top to bottom. If a VF has an error, that VF is of course included in the BDF range of its parent bus. So when the table is scanned, even though just one VF has a problem, the scan hits the parent first. The system then thinks the whole bus has an error, and it resets the whole bus. This is not what we want to see. So we reorder the whole assignment. Previously we assigned the bus PEs from 0; now we assign them from the bottom, so that during the scan we hit the VF first. This is what we had to change.

Next is the MMIO assignment. As we know, when we read the config space of an SR-IOV device, it has a special VF BAR. The VF BAR is a contiguous range, and the size of the MMIO range of each VF is not decided by the system. This part, in deep yellow, is the MMIO assignment. In the MDT we have 256 entries in the system. At boot time we divide the whole system MMIO range equally, and each entry in the MDT represents an exact piece of the MMIO range.
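The PELTM ordering problem described above can be reproduced with a toy first-match scan. This is a minimal sketch, not the hardware behavior in detail: the table is modeled as a dictionary keyed by PE number and scanned in ascending order, the device addresses are hypothetical, and placing the bus PE at number 127 is just one illustration of "assigning from the bottom".

```python
from typing import Dict, Optional, Tuple

BdfRange = Tuple[int, int]  # inclusive (first, last)


def bdf(bus: int, dev: int, fn: int) -> int:
    """Pack bus/device/function into the usual 16-bit PCI form."""
    return (bus << 8) | (dev << 3) | fn


def lookup_pe(table: Dict[int, BdfRange], rid: int) -> Optional[int]:
    """Scan entries in ascending PE number; the first matching range wins."""
    for pe_number in sorted(table):
        first, last = table[pe_number]
        if first <= rid <= last:
            return pe_number
    return None


BUS1 = (bdf(1, 0, 0), bdf(1, 31, 7))  # every function on bus 1
PF = bdf(1, 0, 0)                     # hypothetical physical function
VF0 = bdf(1, 0, 1)                    # hypothetical virtual function

# Original layout: bus PE numbered first, VF PE appended behind it.
original = {0: BUS1, 1: (VF0, VF0)}
# Reordered layout: bus PE moved to the far end, so the VF entry is hit first.
reordered = {1: (VF0, VF0), 127: BUS1}

print(lookup_pe(original, VF0))   # 0: the bus PE matches, so the whole bus is reset
print(lookup_pe(reordered, VF0))  # 1: the VF's own PE matches
print(lookup_pe(reordered, PF))   # 127: the PF still resolves to the bus PE
```

Because the narrower VF range is nested inside the bus range, only the order of the entries decides which PE an error is attributed to.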
Suppose we want to give a VF its own piece of the MMIO range. Look at the second line, for example: this is the VF BAR size that we read from the PCI config space. But the piece each VF occupies does not exactly match the MMIO segment alignment. So we cannot assign each VF to exactly one PE. Look at the bottom, in the MDT: for segment zero we would have to say that it belongs to the PE for VF1 or the PE for VF2, and for the second segment we would have to say that it belongs to the PE for VF2 or the PE for VF3. This is a conflict. We cannot do that on the current hardware.

We have several options. The first option is a "VF stripe" that shifts the VF BARs. IBM has contacts with some particular hardware device manufacturers, and we had met this kind of problem in a product, the PowerVM platform. So we told the manufacturer that we have this kind of issue and that we need a special capability in the PCI config space to fix it. On an ordinary PCI device, the VF BARs are contiguous. But when we set this VF stripe, the VF BARs are no longer contiguous: we can give each VF BAR a shift, so that it exactly meets the MMIO alignment. Then we can say that segment zero belongs to the PE for VF1 and segment one belongs to the PE for VF2. But this solution is not good, because only some cards support it. For example, I know that the Emulex Lancer card supports this capability, but not every card does. So after discussion, we think this is not a very good solution for this issue.

Then we have option two, which is to set the system page size to meet the alignment. There is a field in the SR-IOV capability called the system page size, and the VF BAR size must be an integer multiple of the system page size. This is what the specification says.
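The alignment conflict can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, addresses are offsets from the start of the MMIO window, VFs are numbered from 1 to match the slide, and the sizes are chosen to reproduce the pattern on the slide rather than taken from real hardware.

```python
from typing import Dict, List

MIB = 1 << 20


def segment_owners(bar_base: int, vf_bar_size: int, num_vfs: int,
                   segment_size: int) -> Dict[int, List[int]]:
    """Map each MMIO segment to the VFs whose BAR slice touches it."""
    owners: Dict[int, List[int]] = {}
    for vf in range(1, num_vfs + 1):
        start = bar_base + (vf - 1) * vf_bar_size  # VF slices are contiguous
        end = start + vf_bar_size - 1
        for seg in range(start // segment_size, end // segment_size + 1):
            owners.setdefault(seg, []).append(vf)
    return owners


def conflicts(bar_base: int, vf_bar_size: int, num_vfs: int,
              segment_size: int) -> Dict[int, List[int]]:
    """Segments that more than one VF would need to own."""
    owners = segment_owners(bar_base, vf_bar_size, num_vfs, segment_size)
    return {seg: vfs for seg, vfs in owners.items() if len(vfs) > 1}


# Per-VF slice smaller than a segment: neighbouring VFs share segments.
print(conflicts(0, 4 * MIB, 3, 6 * MIB))   # {0: [1, 2], 1: [2, 3]}
# Per-VF slice equal to the segment size: every VF gets its own segment.
print(conflicts(0, 6 * MIB, 3, 6 * MIB))   # {}
```

Both options discussed here aim at the second case: either shift each VF slice or grow it until it lines up with a segment boundary. The sketch also assumes the BAR base itself is segment aligned.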
So the idea is to set the system page size to the MMIO alignment, so that each VF BAR expands to meet this alignment. The result would be that we can put each VF into its own PE. But unfortunately, even though the specification says so, I think most devices or firmware do not support this feature. So we cannot rely on it.

Then we come to our current choice. I don't think it is good, so if anybody has a good idea, please let me know. Thanks. It is not good, but it makes things work. Our solution is that, from the MMIO point of view, the PF and the VFs are in one PE. Just like a bus PE, we put all the VFs and their parent PF into one PE. On the BDF side it behaves just like a bus PE when the table is scanned, so it does not conflict with the scan described before. And the MMIO window really does belong to the same range. So when we have segment zero and segment one, we say that these MMIO segments belong to this one PE. And when we do the recovery, when some VF is in an error state, we do the recovery for the PF and all the VFs. We know this is not good, because when one VF is in an error state, the PF and all the VFs are recovered. Suppose something goes wrong in one guest: all the other guests that have these VFs will see the error too. It is not good, but it does enable the VFs. So this is our solution, but if you have a better idea, please let me know.

Yeah. The question is: if something goes wrong in a VF that belongs to one of the guests, will the recovery affect the other guests? Yes, this is the drawback of this solution. Yes, the other guests will see this error. It's not good.

Another one is the DMA space limitation.
The first restriction is that, as I said, each entry in the TVT represents a range of the DMA space, and the minimum size is 128 megabytes. To some extent that is too big, because a VF may not need so big a range. Also, due to the space limitation, not all entries are valid. This means that even though we have 256 entries in this table, so apparently we could have 256 DMA ranges for the PEs, on the platform I am doing the development on only a limited amount of DMA space is available. In this example only four entries are valid in the whole system, which means we can enable just four PEs on the system. This is a limitation, because we cannot imagine an SR-IOV card having just one or two VFs. Otherwise customers would not buy this kind of card, because they would get no great benefit from it. But this is the limitation on our current platform.

For example, we have a list of bus PEs. Every PE gets a DMA range to make its devices function. Say we have just four valid entries in this table. So we could have PE0, PE1 and PE2, for example, and then a PF that has three VFs. We could enable only the first VF, VF0, and give it PE number three. The other VFs could not be enabled. They can be assigned a PE number, but they cannot be enabled.

So the solution is just like the previous one: we merge the PF and all its VFs into one big PE. And this solution has the same drawback as the previous one. If one of the VFs gets something wrong, the other guests that have VFs in this PE will see the error. This is the drawback of the solution.

Currently we have only enabled 32-bit DMA addresses. The Power platform supports 64-bit DMA too. So we may take a look at this, and we think it will give a big range for DMA.
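The four-entry example above can be worked through in a short sketch. It assumes, as the example does, that every PE needs at least one valid TVT entry before it can be enabled; the function name and the PE labels are hypothetical.

```python
from typing import Dict, List, Tuple


def enable_pes(pes: List[str], valid_entries: int) -> Tuple[Dict[str, int], List[str]]:
    """Hand out one TVT entry per PE, in order, until the valid entries run out.

    Returns the PEs that got an entry (with its index) and the PEs left without one.
    """
    enabled: Dict[str, int] = {}
    blocked: List[str] = []
    for pe in pes:
        if len(enabled) < valid_entries:
            enabled[pe] = len(enabled)  # index of the entry this PE receives
        else:
            blocked.append(pe)
    return enabled, blocked


# One PE per VF: only VF0 fits into the fourth valid entry.
separate = ["PE0", "PE1", "PE2", "VF0", "VF1", "VF2"]
print(enable_pes(separate, valid_entries=4))
# ({'PE0': 0, 'PE1': 1, 'PE2': 2, 'VF0': 3}, ['VF1', 'VF2'])

# PF and all its VFs merged into one PE: everything fits.
merged = ["PE0", "PE1", "PE2", "PF+VF0+VF1+VF2"]
print(enable_pes(merged, valid_entries=4))
# ({'PE0': 0, 'PE1': 1, 'PE2': 2, 'PF+VF0+VF1+VF2': 3}, [])
```

Merging trades isolation for capacity: all VFs become usable, but they share one DMA range and one error domain.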
After we enable 64-bit DMA, we may have more space for the PEs, so we can revert the solution on the previous slide. We can assign every PE an individual DMA range, so we would have a better solution. But this also has a problem, because not every device supports 64-bit DMA addresses. Some devices may work and some may not work with 64-bit DMA. So this is also under investigation.

This slide shows the disadvantage of the merged PE. In this picture there are PF0 and PF1, and PF1 has several VFs. All the VFs belong to the same PE as PF1, so the last five functions belong to one PE. Green means that everything is fine. Now, for example, VF2 gets into an error state, and all the functions that belong to this PE go into a recovery state. This will be seen not only by the host but also by the guests. So this is not good.

This is another example. The difference is that PF0 and PF1 belong to the same bus PE, for example when we have several physical functions on one PCI card. It is the same: VF2 gets an error, and all the functions in this bus PE go into a recovery state.