Ladies and gentlemen, thank you for waiting. We are delighted to have you all here. We are now ready to resume our seminar program for this NEC Breakout Speaking Track. We would like to invite Mr. Shinji Abe, Product Architect, IT Platform Division, NEC; Mr. Masahiko Takahashi, Principal Research Engineer, Green Platform Research Laboratories, NEC; and Mr. Kohei Kaigai, Lead of the PG-Strom Project, NEC, to talk about OpenStack as a Resource Disaggregated Platform Controller. Mr. Abe, Mr. Takahashi, and Mr. Kaigai, please.

Thank you. In our session today, the three of us will explain OpenStack as a Resource Disaggregated Platform Controller. This slide shows the agenda. First of all, I will explain the ExpEther architecture: the system we are talking about is based on the ExpEther technology, which was developed by NEC to build resource disaggregated systems. The next topic is orchestrating a resource disaggregated architecture with ExpEther using OpenStack; Takahashi-san will explain an actual implementation of the resource controller. And finally, Kaigai-san will explain one application, PG-Strom, on top of the ExpEther architecture.

I'm Shinji Abe. I'm in charge of the development of the ExpEther system in the IT Platform Division. I have said the word ExpEther many times; simply put, ExpEther is a PCI Express switch over Ethernet. ExpEther can extend PCI Express beyond the confines of a computer chassis via Ethernet, and the most important property is transparency. Here you can see the server and the ExpEther NIC. The NIC has an ExpEther engine, which converts PCI Express packets into Ethernet packets. On the other side is an I/O expansion unit with a PCI Express card. This unit also has an ExpEther engine, which converts the Ethernet packets back into PCI Express TLPs. So we can place the I/O device on the remote side. This is the basic concept.

This slide shows a comparison between an actual PCI Express switch and the ExpEther system. The upstream port of a PCI Express switch is equivalent to the ExpEther engine on the CPU side, so this engine is shown as a PCI Express bridge, and the downstream port is equivalent to the ExpEther engine on the I/O side. A PCI Express switch has an internal PCI bus to route packets; in ExpEther, we use an Ethernet switch to route packets instead. Even if there are a large number of Ethernet switches between the CPU and the I/O device, the behavior is no different. So ExpEther is one implementation of a PCI Express switch, and it is fully compatible with the PCI Express specification.

This slide shows how to configure the system. ExpEther can support multiple hosts and a large number of I/O devices; for example, this slide shows four hosts and ten I/O devices. To connect a host and an I/O device, the ExpEther manager software assigns a grouping ID to each host and I/O device. In this case, device A gets ID 2, B gets 1, and C gets 3, and as a result four systems are configured. The ID can be set from 1,000 to 4,000, and it is used as a VLAN tag to isolate the network among hosts.
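To make the grouping mechanism concrete, here is a minimal Python sketch of the concept just described: hosts and I/O devices that share a grouping ID form one logical PCI Express system, and the ID doubles as the VLAN tag that isolates each group on the Ethernet fabric. The class and method names are illustrative only, not the actual ExpEther manager API.

```python
# Illustrative model of ExpEther grouping (not the real ExpEther manager API).
# Endpoints sharing a group ID behave as one PCIe system; the group ID is
# reused as the VLAN tag that isolates each group on the Ethernet fabric.
from collections import defaultdict

class Fabric:
    def __init__(self):
        self.group_of = {}                  # endpoint name -> group ID

    def assign(self, endpoint: str, group_id: int) -> None:
        if not 1000 <= group_id <= 4000:    # ID range quoted in the talk
            raise ValueError("group ID out of range")
        self.group_of[endpoint] = group_id

    def systems(self) -> dict:
        """Group endpoints by ID: each group is one logical PCIe system."""
        groups = defaultdict(list)
        for endpoint, gid in self.group_of.items():
            groups[gid].append(endpoint)
        return dict(groups)

fabric = Fabric()
fabric.assign("host-1", 2000)
fabric.assign("gpu-A", 2000)    # GPU A joins host-1's system
fabric.assign("host-2", 1000)
fabric.assign("nvme-B", 1000)   # NVMe B joins host-2's system
print(fabric.systems())  # {2000: ['host-1', 'gpu-A'], 1000: ['host-2', 'nvme-B']}
```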
This slide shows an ExpEther-based resource disaggregated system, which is mainly used for cloud systems in data centers. Recently, some customers want to use GPGPUs or FPGA accelerators to speed up analytic processing. It is difficult for a server vendor to prepare this kind of system, because one customer may want only one GPGPU while another wants three or four GPGPUs together with some NVMe devices. For a data center vendor, it is difficult to prepare such a variety of servers. One solution is to prepare an almighty server with four GPUs and four FPGAs plus some NVMe devices, but the efficiency of that kind of system is very low, because most of the GPGPUs and FPGAs go unused. ExpEther solves this kind of problem because all I/O devices are disaggregated from the compute node, so we can assign GPGPUs, FPGAs, and NVMe devices to a CPU in accordance with each customer's requirements. This is a solution for the data center.

To control the system configuration, we will release the ExpEther management library and SDK. The library has three kinds of API: a C API, a Java API, and a REST API. The REST API is used by the OpenStack controller.

We have already shipped ExpEther as a product. This is an actual system at Osaka University, already deployed, with 64 servers and 70 I/O devices, including a large number of GPGPUs. This is our product lineup: we have already shipped 1G and 10G NICs and the I/O expansion unit box, and we are now developing a 40G version. Each NIC has two ports, so in total it can support 80 Gbps, which covers x8 PCI Express Gen3. We are also developing an expansion unit with four PCI Express slots. We will ship them early next year.

Hi, my name is Masahiko Takahashi. I'm going to talk about the actual implementation of the OpenStack controller for ExpEther. I'll skip this slide. As Abe-san said, ExpEther realizes resource disaggregation, where the resources are compute nodes and devices such as remote PCI Express devices. We think a resource disaggregated architecture in the cloud achieves high availability and cost efficiency. And this is not only NEC: Intel Rack Scale Architecture and SeaMicro servers follow the same architectural concept.

So, to control a resource disaggregated architecture with OpenStack, which module is suitable? What do you think? I think maybe the bare metal machine controller, yes, which is Ironic. This picture shows the actual resource disaggregated architecture under OpenStack. At the bottom of the picture, there is a resource pool connected by ExpEther. Through the ExpEther manager, Ironic sends and receives commands to control ExpEther. Ironic still sees an actual bare metal machine, the same as the original Ironic, but the machine spec is different; I mean, it is not constant, because when resources are attached or detached, the machine spec changes.

The original Ironic has the classes chassis, node, and port. In our Ironic, we added the Fabricator, which contains the device and device pool classes, and the ExpEther driver is a driver for the Fabricator. An EESV (the ExpEther engine on the server side) maps to the node class, and an EEIO (the ExpEther engine on the I/O side) is related to the device class. The Fabricator controls the connectivity between EESV and EEIO. Our current implementation is based on Icehouse, but we are now porting it to Kilo and Liberty. The Fabricator is a kind of abstraction layer: we developed the ExpEther driver because we have ExpEther, but we think Intel Rack Scale Architecture and SeaMicro drivers could be implemented in the same manner.
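As a rough illustration of this abstraction layer, the sketch below models the Fabricator idea in Python, the language Ironic itself is written in. The class and method names are my own illustration of the design, not the actual code of the NEC Ironic extension, and the REST endpoint is assumed.

```python
# Illustrative sketch of the Fabricator abstraction layer described above.
# Names are hypothetical; this is not the actual NEC Ironic extension code.
from abc import ABC, abstractmethod

class FabricDriver(ABC):
    """One interface, many fabrics: ExpEther today, and potentially
    Rack Scale Architecture or SeaMicro drivers in the same manner."""

    @abstractmethod
    def attach(self, node_id: str, device_id: str) -> None: ...

    @abstractmethod
    def detach(self, node_id: str, device_id: str) -> None: ...

class ExpEtherDriver(FabricDriver):
    """Would talk to the ExpEther manager over its REST API."""

    def __init__(self, manager_url: str):
        self.manager_url = manager_url      # e.g. the manager on localhost

    def attach(self, node_id: str, device_id: str) -> None:
        # Conceptually: give the device the same group ID as the node so
        # that both join the same VLAN-isolated PCIe system.
        print(f"{self.manager_url}: set group of {device_id} to {node_id}'s group")

    def detach(self, node_id: str, device_id: str) -> None:
        print(f"{self.manager_url}: clear group of {device_id}")

driver = ExpEtherDriver("http://localhost:8080")
driver.attach("node-0001", "gpu-00")
driver.detach("node-0001", "gpu-00")
```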
These are the additional Ironic client commands we made. Most of the commands are related to the device pool and device classes, and some commands are related to the node, because a node needs to be attached to the devices. This is an example of the Ironic client. The first one is a device-pool list: in this environment there is one device pool, which uses the ExpEther driver and is connected to the ExpEther manager by its IP address; here, the ExpEther manager runs on localhost. The second one is a device list: in this environment there are four devices, two of which are network interface cards while the rest are GPUs. When you look at the detail of a device, you can see a group tag, which is the group ID of the ExpEther. We mentioned that ExpEther is controlled by the group ID, but you do not need to care about the group ID when using the OpenStack API, because the ExpEther manager takes care of it.

The second example is node operations. A node has a node spec, and to attach and detach devices you actually have two choices. One is specifying a new node spec to satisfy: in this example, compared to the original node, you add a network interface card and an SSD. The other choice is specifying a device UUID directly: in this example, you attach one specific device.

We are now developing on-demand device attachment based on the I/O load of the machine, using Ceilometer and a Heat template; this is what we call device auto-scaling. A typical use case is when web servers suddenly get tons of requests and you need to add another network interface card. We also support a GUI using Horizon: the upper browser shows the node and device connections, and the lower browser shows a device list sorted by category.

As future work, we are going to develop a framework for safe operation, because, for example, if you remove a storage device before unmounting it, the data might be corrupted; you need to unmount before removing the storage. We also want to support shared devices such as SR-IOV devices. Thank you.

Okay. In the next part, I'd like to introduce how ExpEther works and how valuable it is from the standpoint of application usage. What I will introduce is PG-Strom, which is an extension of PostgreSQL that allows SQL query workloads to be offloaded to GPGPU devices.

First of all, let me introduce myself. I'm Kohei Kaigai of NEC. My main role is to productize PG-Strom, and I have also contributed to various open-source communities like PostgreSQL, SELinux, and so on. The PG-Strom project was initially launched as my personal development project, and then NEC decided to fund it to build new businesses. Now we are trying to create new business opportunities using this technology.

What is PG-Strom? There are two core technologies. One is SQL-to-GPU native code generation, which allows users to use the GPGPU without any special syntax in their SQL. The other is asynchronous, pipelined execution in the SQL engine, which hides the latency of transferring data from main memory to the GPU device. The brief flow of execution is this: once the user submits an SQL query from the client, PostgreSQL parses the query into an internal data structure and makes a query execution plan for better and cheaper execution. Then it executes the query, scanning relations, joining input streams, and so on. PG-Strom works within PostgreSQL as an extension module.
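Before going into the internals, here is what that transparency looks like from a client, as a hedged sketch in Python: the same SQL runs with PG-Strom on or off, and only the query plan changes. The connection string is illustrative, and the `pg_strom.enabled` setting and GPU plan node names are taken from PG-Strom's documentation as I understand it; treat the exact names as assumptions.

```python
# Sketch: the same SQL runs with or without PG-Strom; only the plan changes.
# Assumes a PostgreSQL server with the PG-Strom extension installed.
import psycopg2

conn = psycopg2.connect("dbname=test")    # illustrative connection string
cur = conn.cursor()

query = """
EXPLAIN SELECT count(*)
          FROM t1 JOIN t2 ON t1.id = t2.id
         WHERE sqrt((t1.x - t2.x) ^ 2) < 10.0
"""

cur.execute("SET pg_strom.enabled = on")  # GUC name per PG-Strom docs
cur.execute(query)
for (line,) in cur:
    print(line)   # expect custom nodes such as GpuJoin in the plan

cur.execute("SET pg_strom.enabled = off")
cur.execute(query)
for (line,) in cur:
    print(line)   # falls back to the built-in CPU plan (e.g. Hash Join)
```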
PG-Strom mediates between the world of PostgreSQL and the world of the GPGPU: once PG-Strom injects its execution plan into PostgreSQL, it hooks part of the query execution plan to run on GPUs.

In more detail: as I introduced, once an SQL query is given, the parser makes a parse tree and the query optimizer makes a query execution plan. Usually the query executor runs this plan to scan relations, join relations, aggregate, sort, and so on. The upcoming PostgreSQL version 9.5 has an epoch-making feature: the custom scan/join interface, which allows an extension module to replace a part of the query execution plan, and PG-Strom works as a custom scan/join provider. Once PG-Strom replaces part of the query execution plan, it takes over part of the responsibility of PostgreSQL itself: the replaced logic executes dynamically generated CUDA code, and PG-Strom's own execution engine runs that CUDA code over data chunks read from the database. All of these operations are hidden from the user's perspective, so the user can run SQL queries as usual.

This is one example of SQL-to-GPU native code generation. It is automatically generated GPU code, and this part of the code includes a formula expression that comes from the SQL query, equivalent to the portion in the red box. Because it is generated automatically, there is nothing the user needs to care about; PG-Strom compiles this CUDA code on the fly to run it. On the other hand, once we turn PG-Strom off, PostgreSQL runs its built-in query execution plan as usual, fully executed on the CPU sequentially, so we cannot utilize the parallelism of the GPU.

When we use GPGPU computing, people usually say the GPGPU has much higher computing capability. However, there is a bottleneck in copying data from the host side to the device side: a CPU has more than 100 GB/s of bandwidth and a GPU has more than 300 GB/s of bandwidth to its own RAM, but the bandwidth of PCI Express is much lower than these RAM accesses. So we usually need to care about this bottleneck. How does PG-Strom tackle it? Pipelining.

When we scan a relation, it contains a massive number of rows, so we split the relation into chunks of multiple rows; usually a chunk contains about 100,000 rows. In step one, the CPU reads a chunk from storage or the shared buffer into the DMA buffer. In the next step, the CPU kicks a DMA send and proceeds with the next buffer read; since the DMA transfer is an asynchronous process, the CPU can advance the buffer read concurrently. In the next step, the GPU kernel executes on the data transferred in the previous step, the DMA for chunk i+1 is kicked, and the CPU concurrently reads the next chunk. This process is iterated multiple times, and we can see that only the first step pays a tax for data transfer; the later transfers are hidden within the buffer read process on the CPU. So we can consider the DMA transfer cost as part of the entire execution process; there is no need to synchronize DMA send and receive once we design the software carefully.
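The overlap just described can be summarized in a few lines. The Python sketch below only mimics the pipeline structure, reading chunk i+1 while chunk i is in flight; the real PG-Strom implementation uses CUDA streams and asynchronous DMA, not Python threads, so take this purely as a model of the idea.

```python
# Toy model of the pipelined execution: reading the next chunk overlaps with
# the (asynchronous) transfer and processing of the previous one.
# Illustrative only; PG-Strom itself uses CUDA streams and async DMA.
from concurrent.futures import ThreadPoolExecutor

def read_chunk(i):        # step 1: fill DMA buffer from storage/shared buffer
    return f"chunk-{i}"

def dma_send(chunk):      # step 2: asynchronous host-to-device copy
    return chunk

def gpu_kernel(chunk):    # step 3: run the generated CUDA code on the chunk
    return f"result({chunk})"

results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    inflight = None                        # DMA + kernel of the previous chunk
    for i in range(8):
        chunk = read_chunk(i)              # CPU reads while DMA/kernel proceed
        if inflight is not None:
            results.append(inflight.result())
        inflight = pool.submit(lambda c=chunk: gpu_kernel(dma_send(c)))
    results.append(inflight.result())

print(results)   # only the first chunk's transfer is not hidden behind a read
```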
Now let us show how PG-Strom performs against the existing PostgreSQL. This diagram shows the query response time when we join multiple tables: the horizontal axis is the number of tables to be joined and the vertical axis is the query response time, so smaller is better. The blue bar shows the result of plain PostgreSQL, the red bar shows PG-Strom on ExpEther, and the green bar shows PG-Strom connected over physical PCI Express. It shows that PG-Strom makes the whole table-join process multiple times faster than the existing PostgreSQL. There is a small difference between the red and green results: the ExpEther result has a slight disadvantage compared to the physical connection, but it is not so large, and we can still enjoy the other benefits of ExpEther.

It is not only table joins: we can use PG-Strom to run more complicated workloads, such as a scan with a complicated qualifier like a numerical formula, or an aggregation that builds a simple histogram. In both cases, PG-Strom accelerates the query response time, and ExpEther provides almost equivalent results to the physical connection; a well-designed software architecture hides the DMA transfer latency.

How do we draw out the capability of the GPU? Inside the GPU, we run the table join in parallel. In this diagram, GPU cores run a hash join on five parallel cores, but in reality this GPU hash join is executed by thousands of cores in parallel. That is the reason why PG-Strom accelerates real-life query workloads.

Eventually, how will PG-Strom solve real-life problems? One example we would like to solve is OLTP and OLAP integration. In usual enterprise systems, two databases are typically deployed. One database, for OLTP, handles a massive amount of write workloads: the key workload of OLTP is updating records. The other kind of database is for OLAP, whose core workload is analytics. The reason we have to keep two separate databases is that typical OLAP workloads, which include table joins and aggregations, are at a disadvantage on the normalized datasets that OLTP databases usually use. But once we can accelerate table joins and aggregations using the GPGPU, there is no reason to keep separate databases. That is the future vision of PG-Strom, and we are trying to improve the capability and functionality of the software for this purpose.

For this purpose, the GPGPU is a prerequisite component, and ExpEther allows us to deploy GPGPUs within a cloud environment without the typical restrictions, such as the power supply or the physical form factor of the GPGPU card. So we are happy to integrate our database solution on top of the ExpEther solution.

This is our future project roadmap. We stand on the era of PostgreSQL version 9.5. The current target dataset size is hundreds of gigabytes, but future improvements of PG-Strom will catch up with existing high-end commercial solutions rapidly; within a few years, we will catch up with the solutions that handle 100 terabytes. Thank you.

If someone has a question? At this moment, no production system uses PG-Strom; it is under development, and the highest priority is to improve software quality and debugging for production adoption. Does that make sense for you?

It uses CUDA. In fact, I tried to use OpenCL, but each vendor releases its OpenCL driver individually, and their software quality and detailed behavior are not compatible, which was a headache for us, so we switched to CUDA.

Any questions? Okay. Thank you. Thank you all for participating.