getting the device based on an entry in the device page. At the end of the day, what needs to happen is that a device is created on the virtio bus, and once that is done the data path works directly between the device emulation code and the virtio device. The host driver supports a device-create ioctl that lets you create a device entry carrying the location of the vrings, interrupt information, and a status field, as per the virtio spec. Once the device entry is created, the virtio MIC module on the card side creates a virtio device, and then you are in business: the rings are functioning, network packets can flow, and so on. (There is a rough, illustrative sketch of this flow below.)

If you are familiar with how virtio works in the case of QEMU and KVM, the virtio device is a PCI device that gets discovered using the normal PCI discovery mechanisms; the virtio PCI code then sets up the rings, and the data transfer is set up between the virtio-net driver and the network backend. In the case of the coprocessor OS and the coprocessor driver, the virtio MIC block plays the same role as virtio PCI: it is in the business of creating the device and setting up the rings while working with the host side. Once the rings are set up, the data path flows directly between the backend and the driver.

The next part of the driver is high-performance data transfer over PCIe. For that we invented an API called SCIF, the Symmetric Communications Interface. The "symmetric" stands for the fact that the API is symmetric between the coprocessor and the host: they are peers, nodes on the PCIe network. The primary goal of this API was to provide PCIe transfer bandwidth as close as possible to what the hardware can deliver, which is about 7 gigabytes per second. As a data point, the TCP/IP implementation gave us about 400 megabytes per second. The next goal was that we had multiple programming models, at least two, and we wanted common infrastructure and code to support them. In terms of data paths, SCIF supports transfers between the host and the coprocessor and between coprocessors, which covers the offload model as well as the native and symmetric models of computation. In the native model, as I said, communication between coprocessors on different platforms goes through an InfiniBand adapter. SCIF itself is not involved in that data movement, which happens through the InfiniBand adapter, but SCIF is involved in setting up that communication, so that is the third data path SCIF supports.

Sure, well, because it is PCIe, you will effectively get the same bandwidth between the host and multiple MIC cards.

Well, there are several factors, and one of them is that the network stack itself runs on the MIC's simpler x86 cores; I am not sure exactly, but there is some serialization in the TCP/IP stack and so on. If you implemented a more sophisticated network driver you could probably get some parallelization there, you are right.
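To make that device-create flow concrete, here is a rough sketch of what the user-space side might look like. The descriptor fields and the ioctl number below are hypothetical, invented purely for illustration; the real ABI lives in the kernel's uapi headers for the MIC driver and will differ.

```c
/*
 * Illustrative sketch only: the struct layout and ioctl number are
 * hypothetical, not the actual upstream MIC ABI.
 */
#include <stdint.h>
#include <sys/ioctl.h>

struct example_virtio_dev_desc {
    uint8_t  type;          /* virtio device type: net, block, console, ... */
    uint8_t  num_vq;        /* number of virtqueues */
    uint64_t vring_addr;    /* location of the vrings */
    uint32_t interrupt_db;  /* doorbell/interrupt used to kick the card */
    uint8_t  status;        /* virtio status byte, per the virtio spec */
};

/* Hypothetical ioctl number, for illustration only. */
#define EXAMPLE_VIRTIO_ADD_DEVICE _IOWR('s', 1, struct example_virtio_dev_desc)

static int add_virtio_device(int host_dev_fd, struct example_virtio_dev_desc *desc)
{
    /*
     * The host driver creates the device entry; the virtio MIC module on
     * the card side then registers a virtio device on its virtio bus, and
     * from that point the data path runs directly between the host-side
     * device emulation and the card-side virtio driver.
     */
    return ioctl(host_dev_fd, EXAMPLE_VIRTIO_ADD_DEVICE, desc);
}
```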
So, to achieve that performance, the PCIe bandwidth, we have the concept of RDMA over PCIe, essentially, and in addition we support plain send and receive for two-sided communication, which is very similar to sockets. For the lowest-latency message transfer between coprocessors, or between the host and a coprocessor, it is actually possible to simply mmap memory on a remote process, and that gives you sub-microsecond message latency.

Before I continue with the description of SCIF, there are a couple of concepts here. One is the concept of a SCIF endpoint. It is very similar to a socket: it is essentially a pipe to a PCIe node, it also operates in a loopback fashion, and it ends up being bound to a port ID. You have messages traveling between these endpoints, and two endpoints form a connection. Once you have a connection formed, you can do RDMA over that connection, mmap memory over the connection, and so on.

With respect to the API, the functional grouping is as shown here. You have the connection-setup APIs, which are almost equivalent to sockets. Once you have the connection set up, there is send/receive messaging on those endpoints, and then there are the RMA APIs. But before you can do RMA (popularly known as RDMA), you need some mechanism to register memory, that is, to pin down memory and expose it to the remote side. The RMA operations are asynchronous, so they can proceed in parallel with computation, and that means there needs to be a way to know for sure when an RMA operation has completed; that is the RMA fencing API. Lastly, we have the remote memory mapping API, which is just mmap.

Well, yes, the API provides its own control plane through the send/receive messages, so you can send small data at, hopefully, a lower latency than TCP/IP. That is what is used in the messaging layer here, but the RDMA itself is just a DMA between process address spaces.

So the SCIF API has two versions. It is the same code, but it is exposed at two levels, user space and kernel space. In the kernel-space implementation, the RDMA stack, the OFED stack, is layered on top of the SCIF API so that you can have RDMA over PCIe, but you can also use it directly if you want; for example, in the offload case we have a library that does RDMA from user space, and that makes use of the user-space interface.

With respect to connection and send and receive, it is very similar to sockets. The implementation at this point is just a character driver, so all of this happens through ioctls into the character driver, and these interfaces are available in both kernel mode and user mode. Essentially, the nodes open endpoints; one node puts its endpoint into the listen state, the other one sends a connect message, and as soon as the connection is accepted a new endpoint is created. Once you have the new endpoint you can do send/receive, RDMA, mmap, and so on. (A small sketch of this connection flow follows below.)
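As a rough illustration of that connection flow, here is a minimal user-space sketch using the SCIF calls from the MIC software stack (scif_open, scif_bind, scif_listen, scif_accept, scif_connect). The port number is arbitrary, error handling is omitted, and the exact header and struct/flag names should be checked against the installed scif.h.

```c
#include <scif.h>           /* SCIF user-space API from the MIC software stack */
#include <stdio.h>

#define EXAMPLE_PORT 2000   /* arbitrary port picked for this example */

/* Listening side, for example running on the host (SCIF node 0). */
static scif_epd_t listen_side(void)
{
    scif_epd_t lep, nep;
    struct scif_portID peer;

    lep = scif_open();                      /* create an endpoint */
    scif_bind(lep, EXAMPLE_PORT);           /* bind it to a local port */
    scif_listen(lep, 1);                    /* put it into the listen state */
    scif_accept(lep, &peer, &nep, SCIF_ACCEPT_SYNC);   /* wait for a connect */
    printf("accepted connection from node %u port %u\n",
           (unsigned)peer.node, (unsigned)peer.port);
    return nep;                             /* use this endpoint for send/recv/RMA */
}

/* Connecting side, for example running on the coprocessor. */
static scif_epd_t connect_side(void)
{
    scif_epd_t ep = scif_open();
    struct scif_portID dst = { .node = 0, .port = EXAMPLE_PORT };  /* node 0 is the host */

    scif_connect(ep, &dst);                 /* connect; returns once accepted */
    return ep;
}
```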
So that is the connection API. The next slide is a brief description of how it actually works. At the point of creating a connection, we create a receive queue on both sides of the connection. In this case, process P0 wants to send a message; P1 is on a different PCIe node, so it is blocked waiting for the message, and P0 specifies node 1 as the destination of the message. There are drivers on both sides (this is the kernel implementation of SCIF), and they talk to each other through a ring-buffer implementation. The scif_send implementation itself goes through an ioctl interface: the message gets transferred to the remote endpoint's receive queue, an interrupt gets sent, and the message is copied over. These messages have relatively low latency for sizes up to roughly 1 KB. The PCIe network itself is assumed to be reliable, so there is no checksumming or retransmission or anything like that; it is just a message sent over a ring buffer, essentially. So you have two sides that connect up and can then exchange these messages. In the RDMA case, for example, these messages could just carry RDMA control information, so it is very similar to how RDMA is done on the OFED stack over InfiniBand and so on.

Next I move on to memory registration. As I said earlier, we want to do RMA over PCIe, and the goal there is to transfer data from a buffer in one process to a remote process. For that to happen, the buffer has to be exposed to the remote side: the physical page locations have to be sent over to the remote node so that the remote node can DMA to buffer 0. Memory registration is the process that enables that. The pages are pinned because you want to access them from the local DMA engine and because the remote side is also going to be accessing them; those are the two reasons you need to pin the pages. At a high level, this is RDMA using a local DMA controller, so what we need are the physical addresses for the local side and the remote side.

The next slide explains some of the steps in memory registration. At a high level, we need some identifier for a buffer once it is registered, and these buffers are identified using something called the registered address space. In the previous example, buffer 0 was registered with the SCIF driver, and what is returned is an offset. These offsets are per connection: every connection has two address spaces, a local registered address space and a remote registered address space. When the registration happens, the physical page locations are sent over to the remote node, so they appear at offset 0 in the remote registered address space of the peer process P1. Similarly, process P1 registers its own buffer, an offset is returned, and that offset 1 is reflected into the remote registered address space on the other side. Once you have this identifier (the offset is nothing but an identifier for the pinned buffer), you are able to do RMA operations between these buffers. This is the prototype for scif_register.
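For reference, this is the registration call as it appears in the published SCIF API, plus a small usage sketch. The wrapper and its choices (letting SCIF pick the offset, read/write protection) are illustrative; check the installed header for the exact prototype and flag names.

```c
#include <scif.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * off_t scif_register(scif_epd_t epd, void *addr, size_t len, off_t offset,
 *                     int prot_flags, int map_flags);
 *
 * Pins the pages backing [addr, addr + len) and exposes them to the peer.
 * The returned offset identifies the buffer in this connection's
 * registered address space.
 */
static off_t register_example_buffer(scif_epd_t ep, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf;

    /* Registration works on whole, page-aligned pages. */
    if (posix_memalign(&buf, (size_t)page, len))
        return -1;

    /* Let SCIF choose the offset (no SCIF_MAP_FIXED) and allow the peer
     * to both read and write this buffer. */
    return scif_register(ep, buf, len, 0,
                         SCIF_PROT_READ | SCIF_PROT_WRITE, 0);
}
```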
Essentially you can register a virtual address, and you can also specify the protections, whether you are allowing the remote process to write to that memory and so on. Once the registration has happened (in this case buffer 0 is registered locally, and the remote end registered buffer 1 and sent its offset over via the send/receive API), you can use the RMA API to do a transfer from offset 0 to offset 1, from the local offset to the remote offset. The flags allow you to specify whether the transfer is synchronous, and there are some other things we had to do for OFED, for example ordering so that the last cache line arrives last, and so on. That is the API that is based entirely on offsets. Since on the local side we can also just use the virtual address, we added a "v" prefix to the write API, and then you can pass in buffer 0 directly. That has the overhead of pinning the pages at run time and unpinning them once the DMA transfer has completed, so for buffers that are used over a longer period it is preferable to register them up front and keep reusing the registered offsets in the RMA API.

Since the RMA APIs are asynchronous, we need some mechanism to synchronize with their completion, and we have two: a blocking mode and a non-blocking mode. The non-blocking mode essentially involves writing a value to a memory location. For example, in this picture we have issued RMA 1 and RMA 2, and at that point we issue a fence signal, which is going to write a value to an offset in a registered address space. That registered address space could be local or remote; most probably it is going to be remote, so that the local end issues the DMAs and the remote side waits for them to complete. The fence signal guarantees that the value appears on the remote side only after the RMA transfers issued prior to the fence signal have completed. RMA 3 has no relation to this particular fence, so it can complete later, but the fence signal guarantees the ordering; in that way it is somewhat equivalent to the fence instructions found on CPUs. That was the non-blocking, or polling, synchronization. Blocking synchronization is very similar, but the APIs are a bit different: you mark a particular set of RMAs, you can continue with computation or issue further RMAs, and then you wait on that mark; if the marked RMAs have not completed at that point, the thread blocks until they do. (A sketch of both mechanisms follows below.)
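Here is a minimal sketch of one RMA write followed by both synchronization styles, assuming the RMA and fencing calls from the published SCIF API (scif_writeto, scif_fence_signal, scif_fence_mark, scif_fence_wait); treat the exact flag names and error conventions as approximate.

```c
#include <scif.h>
#include <stdint.h>

/*
 * One asynchronous RMA write plus the two completion mechanisms described
 * above.  loff and roff are local and remote registered-address-space
 * offsets obtained from scif_register(); signal_roff is a remote registered
 * offset that the peer polls.
 */
static int rma_write_and_sync(scif_epd_t ep, off_t loff, off_t roff,
                              size_t len, off_t signal_roff)
{
    int err, mark;

    /* Asynchronous DMA from the local registered offset to the remote one.
     * (scif_vwriteto() is the variant that takes a local virtual address
     * and pins/unpins the pages around the transfer.) */
    err = scif_writeto(ep, loff, len, roff, 0);
    if (err)
        return err;

    /* Non-blocking / polling style: once all RMAs issued so far on this
     * endpoint have completed, SCIF writes the value 1 at signal_roff on
     * the remote side, where the peer is polling. */
    err = scif_fence_signal(ep, 0, 0, signal_roff, 1,
                            SCIF_FENCE_INIT_SELF | SCIF_SIGNAL_REMOTE);
    if (err)
        return err;

    /* Blocking style: mark the RMAs issued so far, then wait on the mark;
     * the calling thread blocks until the marked RMAs have completed. */
    err = scif_fence_mark(ep, SCIF_FENCE_INIT_SELF, &mark);
    if (err)
        return err;
    return scif_fence_wait(ep, mark);
}
```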
That brings us to the last API here, which is just mmap. Going back a couple of slides, we have the local buffer registered here, and it appears in the process virtual address space as well. The remote side has registered a buffer, so it appears in the remote registered address space for this particular connection. Now we want to be able to mmap that buffer so that we can do some sort of synchronization over shared memory. This is nothing but an mmap call: you specify offset 1 as the offset within the connection, it automatically resolves the remote registered offset, and the buffer appears in the local virtual address space. That is how mmap works on SCIF.

Finally, this is an example of SCIF usage where the Linux RDMA APIs are layered on top of the kernel-level SCIF API. The stack starts from an MPI application in user space and goes all the way down to the SCIF interface in the kernel. This particular driver provides a software-emulated HCA: it provides the send/receive queue pairs, completion queues, and so on, essentially emulating the queue pairs you find on an InfiniBand hardware adapter. This is used for communication with another coprocessor or between the host and the coprocessor; it sets up a connection with the remote SCIF driver, and once that connection is set up, it can exchange control messages using send/receive and use the RMA APIs to do the transfers. What actually happens on real platforms is that most HPC platforms using the MIC coprocessor today also have an InfiniBand adapter. Since the low-latency messages here actually go through the kernel, the small messages still end up going via the InfiniBand adapter if one is available on the system, while the high-bandwidth transfers happen through the SCIF driver. So it is a mixed approach on systems that already have an HCA: different paths for small messages and for bulk DMA transfers.

This is an example of SCIF RMA performance. When the DMA is initiated from the host, we get pretty close to the available PCIe bandwidth at maybe 1 megabyte, or maybe it is 2 megabytes, but quite early on, at 256 KB, we are already at about 6 gigabytes per second. The DMA transfer bandwidth is a little lower when the DMAs are initiated from the coprocessor, mainly because the coprocessor is built from simpler x86 cores: all the code needed to get from the offsets to the physical pages and program the DMA controller gets in the way, so the effective DMA bandwidth you get is lower.

Finally, with respect to code status and future plans: the patches have been submitted for the OS state management and the virtio device support, and we expect inclusion in the 3.13 kernel. Future patches will cover the DMA engine on the MIC coprocessor and its usage in the virtio devices, and then the entire SCIF API, both the kernel-level SCIF API and the ioctl interface.
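Going back to the remote memory mapping API for a moment, here is a minimal sketch of mapping a peer's registered buffer into the local address space, assuming the scif_mmap call from the published SCIF API; the exact prototype and error convention should be checked against the installed header.

```c
#include <scif.h>
#include <sys/types.h>

/*
 * Map the peer's registered buffer, identified by its remote registered
 * offset, into the local virtual address space.  Loads and stores then go
 * straight across PCIe, which is what gives the sub-microsecond message
 * latencies mentioned earlier.
 */
static void *map_remote_buffer(scif_epd_t ep, off_t remote_offset, size_t len)
{
    return scif_mmap(NULL, len,
                     SCIF_PROT_READ | SCIF_PROT_WRITE,
                     0,                  /* no special mapping flags */
                     ep, remote_offset);
}
```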
In summary, the MIC driver enables an all-Linux platform: you have Linux running both on the host and on the coprocessor. The new features we have introduced are virtio for PCIe endpoints, an idea borrowed from virtualization, and the SCIF communication layer for high-bandwidth bulk data transfer. A lot of this code is actually machine independent; much of it is just messaging over ring buffers, translating virtual to physical addresses, and sending page lists over to the remote side. So in general that kind of code should be adaptable to other pieces of hardware that have a similar usage model, and maybe there is hope for a common subsystem for these kinds of PCIe devices in the future. That ends my talk. Thank you.

The driver handles it; the driver is doing what KVM does, basically, so the interrupt relay and so on happens through the driver. For virtio itself we are using the tap interface, but it goes through user space. We could use vhost, and we are looking into that. As for the slowness, we have also implemented a native virtual Ethernet driver that just does transfers through the kernel, so it is basically native, and it also maxes out at about 400 megabytes per second. What we think, in general, is that the basic interrupt routing goes from the DMA engine to one processor on the MIC, so it is generally the same core taking the interrupt. Yes, that is another option, if you can do multiple queues; we have not added multiqueue support yet, but hopefully we can add that.

There is actually a plan to work on a direct socket interface over SCIF, so that is similar to what you are suggesting; we have not done it yet. At the base level it is a PCI device; it exposes an aperture into all of the MIC memory. Not really: in the case of the InfiniBand adapter there is some setup code involved, but once that setup code gets out of the way, the MIC processor is talking directly to the InfiniBand adapter. Yes, you can do that; something like SR-IOV may simplify it, but maybe that is not needed. But keep in mind that when the NIC is pulling data, it is a PCIe read across the bus, and peer-to-peer reads have not performed as well as writes, so that is another angle to keep in mind.

There is a backup slide with links to the entire stack. We are upstreaming this driver into the kernel, so it is open source, but parts of the driver are not yet available in the Linux kernel. The cores have their own ring interconnect that talks to the local memory; it is not QuickPath, but it is equivalent. Well, the lowest latency is actually if you directly mmap memory, but that does not happen in the OFED stack; the way the OFED stack works, because the IB transport is also available, it ends up choosing the IB transport for the smaller messages. Outside-the-box communication is via InfiniBand; there is no NIC on the device itself.

Well, the OS is Linux, so any time you make a system call the OS is going to get invoked. There are some system services, some daemons and so on, that are bound to a particular core, but in terms of the OS it is just Linux, so whenever it is invoked it will run on the other cores. Yes, it is symmetric; it is a shared-memory configuration. Hopefully, yes. For example, if the card OS, the coprocessor OS, crashes, we have mechanisms to pull that crash dump over to the host side, similar to kexec creating a crash dump, so that customers can send the crash dump to us and we can debug it. Yes, we propagate it to the host driver; RAS errors, for example, are one class.
There is some persistent logging of the error, but it is also propagated to the host driver. We also have to make sure, when those kinds of errors happen, that the other coprocessors interacting with this particular coprocessor have their communication shut down; then we can go through some kind of reset process for this one and bring it back up again. The host driver also has a keep-alive mechanism and things like that to make sure the coprocessor OS has not crashed.

Actually, I have not read the public documentation myself, but any write you do from the host side will invalidate data in the coprocessor CPU's caches, so you do not need any special fencing or anything like that, and when the coprocessor writes to system memory, that will also invalidate data sitting in the host CPU's caches. So it is essentially just like shared memory. I do not understand your question completely, but are you referring to direct cache access or something like that? What happens is that the coprocessor memory itself is mapped write-combining into the host CPU, so it is like a graphics adapter in that sense.

On my first slide I had something about how at least the current generation of MIC provides, I think, four times the performance per watt of the Xeon processors. So yes, it is power savings; performance per watt is the main vector here. These actually do fit in with that philosophy of smaller processors: they are much smaller than the Xeon processors, and the majority of the logic in them is taken up by the SIMD instruction pipeline. It has x86 for compatibility and things like that. I think that is in the works: the next generation will actually be available in both form factors, bootable as well as coprocessor, and in that case the coprocessor form factor is a migration path for folks who are already on that model. Thanks. Thank you.