Hello, thank you for joining my session. My name is Hidetsugu Sugiyama, Chief Technology Strategist at Red Hat, based in Tokyo. I'm here to talk to you today about hardware evolution: composable compute on OpenShift, Kubernetes, and Red Hat Enterprise Linux. By leveraging domain-specific hardware accelerators such as the NVIDIA Data Processing Unit (DPU) and the Intel Infrastructure Processing Unit (IPU), the future of disaggregated infrastructure will shift away from the traditional CPU-centric architecture.

I'm sure many of you have heard of Red Hat before, but I'd like to quickly introduce Red Hat from my perspective and share why I think our values matter for you. The foundation of Red Hat is open source. I like to say open unlocks the world's potential; this is our vision statement. At Red Hat, we have been using the open source development model to build software for almost 30 years now. In that time, we have used the power of open technology, open standards, and open collaboration between people and organizations to empower our customers and to advance the state of innovation for everyone. I often describe open source as an innovation engine for global industries: today, cross-industry collaboration in open source development communities delivers innovation together. If you look at GitHub, you'll see that there are over 150 million software repositories, maintained by over 15 million individual developers, with more than 2 million businesses collaborating. That is an important point. Of course, we know these numbers include forks and duplicate accounts, but they give you an idea of the order of magnitude. They include companies that have traditionally been called users, collaborating with vendors to drive their own features into the open source projects they use.

The open source development model is much more efficient than any kind of traditional waterfall approach to vendor software development. We live in a world where every business depends to some degree on IT, and increasingly, enterprises implement their business value in software, whether as part of the supply chain, back-office services, customer interactions, or the various products and services they provide to their customers. "Software is eating the world" is one of the catchphrases often used to describe this. Open source enables capturing value with software more efficiently and at lower risk than any other model. It offers access and a guaranteed right to the code, along with the ability to customize and differentiate the system. This is the key point of this development model: it creates collaborative synergy, even between competitors, while you stay in control. At Red Hat, we collaborate with competitors all the time to advance the underlying components. Open source empowers you as a business to stay in control of your software-implemented business value and avoid lock-in. Red Hat acts as a catalyst in this model, helping you to maximize that value.

As we wire the cyber-physical world with software, the complexity of our software stacks increases, deeply intertwined with the complexity of the physical world. In response, we are seeing evolution along certain patterns. Hybrid cloud has become the predominant operational paradigm for modern IT across all footprints, from the public cloud over the data center to the edge. Linux, containers, and Kubernetes are becoming the preferred way to deploy and manage this hybrid cloud computing.
The second pattern is a shift into distributed computing, also known as edge computing, driven by the growing interconnection between software and data that is distributed across the cyber-physical world and generated under low-latency constraints. Data and AI are particularly interesting here because they affect all software: applications become intelligent applications that learn from data to improve their behavior. Another way to think of it: it's not just that software is eating the world; AI is eating software now. Data and AI have become the driving themes for IT investment.

I'd like to share a concrete example of how Red Hat's open source approach benefits customers: Open Data Hub. Open Data Hub offers a deployable open source reference architecture for managing data and AI workloads on OpenShift or other Kubernetes variants. In this project, Red Hat collaborates with open source projects, hardware and software partners, as well as customers. Today, we are driving integration and interoperability between the different components in the broader data and AI space across the data center core and the edge. The outcome is the ability for customers to construct their own optimized platform by picking the best-of-breed solutions for their purpose from a broad ecosystem. This platform combines enablement for data science tooling with management of the underlying software stack, supporting intelligent applications end to end. Red Hat is operating this reference architecture framework for community use as part of the Operate First initiative; we are expanding the concept of open source collaboration into the operational aspects, beyond just the software itself. The results of this work flow into Red Hat's downstream offerings and services, as well as those of the partners contributing to the Open Data Hub project. At the MLConf online event this October, my colleagues will introduce the Enterprise Neurosystem running on Open Data Hub, together with the Stanford SLAC team. My feeling is that many players, such as SLAC, face the same pain points with current infrastructure due to the lack of scale in AI data management, and I agree on the need to bridge edge to core for exchanging real-time data. I'm actually contributing to the development of a new data-centric computing architecture in the IOWN Global Forum for more scalable data management across edge and core; it can help the Enterprise Neurosystem project.

To meet the demands of AI data management, we need an evolution in hardware system architecture, and that is the main theme of my session today. The challenge driving the change in computing architecture is the end of Moore's Law and Dennard scaling: CPUs are not getting faster anymore. At the same time, we face an increasing need for speed, growing data and AI requirements, and concerns over power consumption. So we cannot just keep scaling up the CPU. This is why we need domain-specific hardware acceleration; that is the underlying trend in the industry. We will see a shift from CPU-centric architecture to new architectures that increase network and storage performance with domain-specific hardware accelerators.

So in this session, I'll cover three things. First, I'll give an update on current network innovation. Then I'll share with you a new computing architecture leveraging domain-specific hardware accelerators. Lastly, I would like to discuss the possibility of further change in the software stack and network stack to realize AI-integrated communication. By the way, this is a virtual event.
If you have any questions, please add them to the chat box so that I can respond online during this session. With that, let me give an update on the current transformation on the network side.

This slide shows the 5G New Radio roadmap based on 3GPP. In 3GPP, the 5G specification was standardized as Release 15, and commercial services in Japan were launched in March 2020. Worldwide, 5G network services have been rolled out, mostly for phones, PCs, and CPEs. Release 16 standardization was completed in June 2020, enabling V2X, IIoT, and more. 5G will cover communication business in areas such as smart manufacturing, smart cities, smart agriculture, automotive edge, healthcare, and more. Releases 17 and 18 are a prologue to 6G. We are exploring beyond 5G toward 6G, and we are now starting to work on the next implementation challenges on that path, such as end-to-end network slicing, including the radio access network, and real-time data analytics for the digital twin world.

The 5G specifications for the core network and the radio access network are actually affected by the change in computing architecture design, or perhaps were defined based on the trends in computing architecture. Compared to the previous 4G evolved packet core, the 5G system architecture defines its architectural elements as cloud-native network functions, bringing intelligence, flexibility, and agility for new scalable services. The architecture model, which adopts principles like modularity, reusability, and self-containment of network functions, enables development with the latest cloud-native container technology and Kubernetes-based software technology. A service-based principle is applied to the 5G control plane, separating it from the user plane function. That means the user plane function (UPF) can be offloaded to a domain-specific hardware device while the 5G control plane runs in a cloud-native environment such as a Kubernetes cluster. Currently, the UPF can be offloaded from the CPU to SmartNICs and P4-based switches, in addition to DPDK. The 5G system architecture also allows compute to be separated from storage, storing session context in a data storage function that can likewise be offloaded; we can think about adopting persistent memory for lower-latency data access in the UDSF, the unstructured data storage function.

So through the 5G evolution, network function workloads are transforming. The 5G control plane workloads in the service-based architecture are shifting to Kubernetes-based cloud-native network functions, while the UPF and the data storage function will be offloaded to domain-specific hardware devices, which can be integrated through the Kubernetes device plugin feature. In the 5G and 6G era, the UPF and data plane traffic will need to be offloaded to domain-specific hardware devices, because CPU-centric architecture is no longer enough to meet real performance requirements. We are contributing to several activities in the community, such as the O-RAN Alliance, TIP, Magma, and OpenAirInterface, with a focus on cloud platform development that adopts domain-specific hardware device technology to deliver software-defined functionality. It's more complicated than the traditional CPU-centric architecture if we set up each network function manually, especially for GPU implementations. And without accelerator cards such as FPGAs, GPUs, or ASICs, we cannot provide massive MIMO scalability in a public service.
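To make the device plugin integration concrete, here is a minimal sketch in Go using client-go. It assumes a hypothetical device plugin already advertises SmartNIC virtual functions on each node under the made-up extended resource name example.com/smartnic-vf; the image name is also illustrative, not a real UPF. The point is simply that an offloaded network function is scheduled like any other pod, with the accelerator requested as a countable resource.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "upf", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "upf",
				Image: "registry.example.com/upf:latest", // hypothetical UPF image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Hypothetical extended resource exposed by a SmartNIC
						// device plugin; the kubelet places the pod on a node
						// with a free virtual function and passes it through.
						"example.com/smartnic-vf": resource.MustParse("1"),
					},
				},
			}},
		},
	}

	created, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created pod:", created.Name)
}
```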
In the 5G New Radio public network, we will need to support more scalable massive MIMO. Currently, most RAN vendors implement a CPU "look-aside" mode with an accelerator card, which still requires some level of CPU consumption. To increase massive MIMO scalability and access speed further without increasing CPU consumption, several RAN vendors are now trying to implement an inline mode on a combination of xPU-based domain-specific hardware devices. I will explain this accelerator evolution later.

In the middle of this hardware evolution, we are seeing the everyday NIC becoming smarter, turning into a domain-specific hardware device. Let's look at the redefined NIC capability areas. At the fundamental level, offload features are now broadly supported: TCP/UDP segmentation offload and encapsulation, XDP and PTP timing, SR-IOV, and others. Then you have crypto offload with a crypto engine on the chip: you can do inline IPsec offload and TLS acceleration, and you can use eBPF redirection for TLS. Then there is QUIC (HTTP/3), where HTTP runs over UDP with TLS acceleration; 3GPP is also considering QUIC for the interconnection between the 5G core control plane functions in the service-based architecture. Programmable flow processing can also be offloaded to ASICs and FPGAs; this is how you program and manage a vSwitch or vRouter to provide SDN. You can have P4 or OpenFlow control over the vSwitch and provide services like on-the-fly encapsulation and decapsulation of various protocols. In addition, newer protocols such as segment routing for MPLS and SRv6, plus port mirroring, bonding, and QoS, all the features we have in the Linux kernel, with matching on five-tuples, can now be offloaded to the NIC. And with a programmable FPGA, you can now run services such as a broadband network gateway, the 5G user plane function, and, for the 5G radio access network, the eCPRI interface and FEC. These are available today from NIC vendors.

This slide shows an example of the current feasibility of a disaggregated radio access network implementation based on designs in the O-RAN Alliance. The diagram at the top of the slide shows the network function workloads for the radio access network and the edge. 5G can deploy the user plane function at the edge for multi-access edge computing, so the user's data can actually be terminated at the edge at the UPF deployment. In 5G New Radio, the baseband unit functionality has been split into the CU (central unit), with separate control plane and user plane, and the DU (distributed unit). The CU handles IP packets on the midhaul and backhaul, and the DU handles Layer 2 traffic on the midhaul and fronthaul, supporting eCPRI. The diagram at the bottom of the slide illustrates the device on which each workload is implemented. The UPF and the CU user plane can be offloaded onto the SmartNIC; the CU control plane can run on the CPU; the DU Layer 1 functionality runs on an accelerator card with the DPDK BBDEV library, while Layer 2 runs on the CPU. Compared to 4G, the 5G control plane is shifting to cloud-native workloads on the CPU-centric architecture, while non-cloud-native workloads such as the UPF and the CU user plane are offloaded to domain-specific hardware devices. Through the 5G evolution, some network functions are transforming into cloud-native workloads running on CPU-centric architecture, and some network functions are being offloaded to domain-specific hardware devices to meet network performance requirements. The 5G evolution advances telecom edge computing, but it's not our end goal.
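As a small illustration of the XDP capability mentioned above, here is a hedged Go sketch using the cilium/ebpf library to attach a pre-compiled BPF program to a NIC. The object file xdp_fw.o and the program name xdp_prog are hypothetical; on NICs that support it, passing link.XDPOffloadMode instead would ask the driver to push the program onto the NIC hardware itself.

```go
package main

import (
	"log"
	"net"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load a pre-compiled BPF ELF object (hypothetical file name).
	coll, err := ebpf.LoadCollection("xdp_fw.o")
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	prog := coll.Programs["xdp_prog"] // hypothetical program name
	if prog == nil {
		log.Fatal("program xdp_prog not found in object file")
	}

	iface, err := net.InterfaceByName("eth0")
	if err != nil {
		log.Fatal(err)
	}

	// Attach at the XDP hook. XDPDriverMode runs in the NIC driver;
	// link.XDPOffloadMode would target supporting SmartNIC hardware.
	l, err := link.AttachXDP(link.XDPOptions{
		Program:   prog,
		Interface: iface.Index,
		Flags:     link.XDPDriverMode,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer l.Close()

	log.Println("XDP program attached; press Ctrl-C to detach")
	select {} // keep the process (and the attachment) alive
}
```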
We have already started to look beyond 5G toward 6G, which will cover end-to-end AI-integrated communication. We know there is demand for service providers to offer AI-related services based on the behavior of end-user devices. In order to deploy each domain function to the right place based on business demand, we will need more composable compute in this disaggregated infrastructure across user devices, the telecom edge, the telecom core, and the cloud.

The network interface card is becoming smarter, so let me give an update on the evolution on the computing architecture side. Cloud providers have spent the last decade trying to convince us that all compute cycles are equal, while the industry has focused on scale rather than performance. Virtualization, in that sense, is based on abstraction from the hardware into a generalized representation, which is really the opposite of specialized hardware for acceleration. If you go to a public cloud, you will find over a hundred different machine types representing hardware specialization and sizing options; I'd say that is a strong indicator of the current trend toward domain-specific hardware device deployment, which needs a composable compute architecture for this disaggregated infrastructure.

Computing architecture is now transforming from the hypervisor-based, CPU-centric architecture of the old server toward a composable compute architecture for disaggregated infrastructure. The left diagram shows the traditional model of software-virtualized hardware with a hypervisor and CPU-centric architecture. The center diagram shows an example of SmartNIC integration: in this model, we have the software control plane running on the hypervisor on the CPU-centric architecture, and many network functions, the actual low-level network processing, can be offloaded to the FPGA in the SmartNIC. The right diagram shows the new xPU-based composable compute architecture, such as the NVIDIA DPU and Intel IPU, which are the result of SmartNIC innovation. This is a modern architecture that can fit the disaggregated infrastructure across end-user devices, telco edge, telco core, and the cloud.

We are now looking at a setup model where the full network stack, including the control plane, moves from the hypervisor to an independent xPU subsystem. It essentially moves the same functions to the subsystem and therefore enables you to use them not only in the hypervisor virtualization model but also in a bare-metal model. In this case, the lifecycle of the virtual device functions becomes independent from the tenant workload cluster, removing binary dependencies. This is tightly connected to the growing preference for bare-metal deployment of container platforms, moving away from virtualization, which is also driven by the increasing demand for speed and efficiency versus the diminishing returns of the overhead of software hypervisor-based virtualization. As we traditionally used virtualization for security isolation, as in the left diagram, the move to direct hardware access and bare-metal deployment creates issues with the trust boundary, as in the center diagram. Simply put, once you try to run bare-metal services, too many people, applications, and services will have privileged root access to your machine, for example when different teams are responsible for the application platform, networking, and storage respectively. Too many control domains are conflated.
With xPU subsystems such as the DPU and IPU, as in the right diagram, we gain the ability to move the functions we traditionally implemented in the software hypervisor onto a processor on the device, such as an ARM processor. Because this achieves hardware-level isolation, replacing the software-level isolation of traditional virtualization, it can offer significantly higher security at higher performance and a better control point, with stronger isolation than the existing virtualization model. This also allows us to define more stable and cleaner interfaces between services that utilize domain-specific hardware devices and to isolate the dependency on hardware-specific code. And while we were already able to offload the actual data plane processing to a traditional SmartNIC, the xPU in addition allows us to offload whole subsystems and more complex processing. This means that we no longer have to spend CPU cycles on management tasks and control planes, execution endpoints, advanced firewalling, or, for example, deep packet inspection at scale. The xPU subsystem can be operated in a standalone, appliance-like approach running Linux, or in an integrated and dynamically orchestrated model using OpenShift.

The new system architecture can be exemplified by composable compute with xPU SoC subsystems, such as the NVIDIA DPU and Intel IPU. In this architecture, the CPU main system and the xPU subsystem can be at peer level, each controlling its own computing system. The DPU or IPU, in addition to domain-specific hardware functions and acceleration, has its own compute capability and its own independent processor, such as an ARM processor. This enables software-defined device functions, and thus hardware virtualization without a software hypervisor stack. It also enables security isolation at the hardware level, even in the bare-metal use case, and enables offloading of whole software subsystems rather than just specific tasks.

Let me add more about the advantage of this new composable, heterogeneous compute architecture from the cloud service provider angle. The cloud data center used to look pretty much like the CPU-centric architecture shown in the top diagram, just like the classic enterprise data center. But that has changed, and it is shifting to the composable compute architecture in which xPU subsystems, such as DPUs and IPUs, run separately from the CPU main system, as shown in the bottom diagram. We are starting to see these two architectures diverge, and the reason is that in the classic data center, everything is owned by one party. The top diagram is a typical server in the classic enterprise data center: the physical infrastructure, the hypervisor, and the application are all owned by one party; in this case, it's a bank. In the cloud, the workload and the system are owned by different parties, the tenant and the cloud service provider. In the bottom diagram, the tenants still run on the CPU, but the servers are built for a cloud infrastructure where a different architecture has emerged: they have dedicated processors that run the infrastructure functions of the cloud. That is the new category of processor, the xPU subsystem, such as the DPU and IPU. The cloud service provider's software runs on the xPU subsystem, and the revenue-generating guest software runs on the CPU main system. So, for example, a bank's financial application running on the CPU will be cleanly separated from the cloud service provider's infrastructure software running on the xPU subsystem.
This xPU-based composable architecture has several major advantages. First, the strong separation of infrastructure functions and tenant workloads allows the tenant to take full control of the CPU, so an enterprise guest or a digital service provider offering an AI business can fully control the CPU main system while the CSP maintains the xPU subsystem to manage the infrastructure. Second, the operator can offload infrastructure tasks to the xPU subsystem; this helps maximize utilization of the CPU for the cloud service, which also helps maximize revenue. And third, the xPU subsystem allows for a fully diskless server architecture across the cloud and end-user edge devices. Let me explain each of these further.

In a server with an xPU subsystem, infrastructure and tenant workloads are clearly separated: the tenant workload runs on the CPU, and the infrastructure software runs on the xPU subsystem. The immediate result is much better isolation between the two. For example, if there is a spike in the infrastructure load, it will no longer cause performance issues on the CPU side. That is a very good property, but more importantly, it now allows the tenant to take full control of the CPU. For example, the tenant can bring their own hypervisor or Kubernetes worker and run it on the CPU, while at the same time the xPU subsystem can still confine the tenant to a virtual network segment or specific storage volumes. That allows a more flexible architecture.

The second advantage of the xPU composable architecture is infrastructure function offload. Modern applications today are often structured as microservices, which incur real communication overhead; in some cases, the majority of all CPU cycles are actually spent on infrastructure overhead, and this offloading technology can help reduce that CPU consumption. The xPU accelerators can process these functions very efficiently. This is not only a performance optimization: if you are an infrastructure operator, you can now take 100% of the infrastructure cycles off the CPU and rent the CPU out to guests, which helps maximize your revenue for the overall system.

The third advantage of the xPU subsystem is that it can enable migration to a fully diskless server architecture. This is a big architectural change. Traditionally, in the cloud data center, you would have disks attached to every single server. As tenant demand for disk space is hard to predict, you have to over-provision each of the disk servers, basically attaching more disks than you really need, and you end up with a lot of stranded capacity, capacity that cannot be utilized in a good way. With xPU subsystems such as the DPU or IPU, you can move to an entirely diskless model: all storage lives in a central storage service, and when a customer starts a workload on the server, the provider creates a virtual volume on the storage service via the management network. Then, based on that virtual volume, the xPU subsystem, such as a DPU or IPU, creates a new virtual NVMe SSD. This virtual NVMe SSD shows up on the PCI Express bus just like a regular SSD, so it works with most operating systems, and we can even boot from that SSD. Regarding performance, the actual storage traffic between the storage server and the workload on the server takes a fast path, meaning there is no involvement of any CPU cores: it's low latency and high throughput with maximum flexibility. It's a very powerful solution.
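To make the diskless workflow concrete, here is a deliberately simplified Go sketch of the provisioning sequence under stated assumptions: the StorageService and DPU types, their methods, and all names are hypothetical stand-ins for a provider's management API, not any real SDK. It only illustrates the order of operations described above: create a virtual volume centrally, then have the DPU expose it as a virtual NVMe device.

```go
package main

import "fmt"

// StorageService stands in for the provider's central storage service API
// (hypothetical; a real system would use NVMe over Fabrics or similar).
type StorageService struct{}

func (s *StorageService) CreateVolume(name string, sizeGiB int) string {
	fmt.Printf("storage service: created volume %q (%d GiB)\n", name, sizeGiB)
	return "vol-" + name // hypothetical volume handle
}

// DPU stands in for the management interface of an xPU subsystem.
type DPU struct{}

// ExposeAsNVMe asks the DPU to present the remote volume on the host's
// PCIe bus as a virtual NVMe SSD; the tenant OS sees an ordinary disk.
func (d *DPU) ExposeAsNVMe(volumeHandle string) {
	fmt.Printf("dpu: presenting %s as a virtual NVMe device on PCIe\n", volumeHandle)
}

func main() {
	storage := &StorageService{}
	dpu := &DPU{}

	// 1. The provider creates a virtual volume over the management network.
	vol := storage.CreateVolume("tenant-boot", 128)

	// 2. The DPU maps that volume onto the PCIe bus as a virtual NVMe SSD.
	//    Storage traffic then takes the fast path through the DPU, with no
	//    host CPU involvement, and the server can even boot from it.
	dpu.ExposeAsNVMe(vol)
}
```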
So, with the strong separation of tenant workloads and infrastructure, adding xPU-based domain-specific hardware devices allows us to efficiently offload infrastructure functions, and this capability enables the move to a diskless architecture. I believe this xPU-based composable architecture will be a common element of the future distributed infrastructure across the customer edge, telco edge, telco core, and the cloud. This slide shows a geographically distributed data pipeline running AI/ML-related workloads on top of diskless servers across the telco edge, telco core, and the cloud. We can implement xPU-based domain-specific hardware devices supporting diskless functionality and remote direct memory access reaching across multiple sites.

The last issue I'd like to highlight is the mismatch between hardware and software abstraction, even when running bare-metal containers rather than virtual machines. The challenge is that the traditional concept of virtualization with a software hypervisor increasingly conflicts with the trend toward distributed infrastructure with xPU-based composable architecture. The contradiction between domain-specific hardware devices and virtualization forces us to punch holes into the virtualization abstraction, leading to the often-observed interdependency of, for example, passing a PCI device through into your virtual machine, at which point you are not virtualizing hardware anymore. I would argue that we were increasingly using the hypervisor only for partitioning the CPU and for security isolation. The key idea of containers is to encapsulate the dependencies of the software in its own namespace to reduce conflicts between the different components of a complex modern application stack. However, container boundaries and specialized hardware enablement often don't align, for example with user-space support for GPU compute. We often find ourselves in situations where the container abstraction breaks down at the hardware interface, creating unwanted dependencies. So we will need better architectural compartmentalization of hardware capabilities.

A solution to this problem comes with HyperShift, and a related development called KCP, which allow separating the worker pools from the control planes and running the different control planes as managed services on a consolidated management cluster. HyperShift and Open Cluster Management make it possible to manage, in isolation, the resources of the domain-specific hardware devices running infrastructure functions and of the CPU main system running tenant workloads. This can offer clean compartmentalization of the domain-specific hardware device platform, addressing the mismatched abstraction layers that compound the already exploding complexity of software stacks. Specifically, this means moving to higher-level APIs and abstractions at the service level. It matches the hardware architecture to the concept of orchestrated, encapsulated services that containers and Kubernetes have already generalized on the application software side; you could look at this as the containerization of domain-specific hardware. The result is a trend to disaggregate the CPU-centric system and move toward compartmentalization and a composable compute model where compute capabilities are software-defined.

Lastly, let me wrap up with future possibilities in the middle of this hardware evolution. This slide shows a device stack combining network infrastructure functions and AI-related tenant workloads.
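As a rough illustration of the hosted control plane idea, here is a hedged Go sketch that registers a HyperShift HostedCluster object through the Kubernetes dynamic client. The group/version and the single spec field shown follow my understanding of the HyperShift CRD, but the schema varies between HyperShift releases, and a real HostedCluster needs many more fields (platform, networking, pull secret), so treat this purely as a shape sketch under those assumptions.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Resource coordinates for HyperShift's HostedCluster CRD (the API
	// version is an assumption; it differs between HyperShift releases).
	gvr := schema.GroupVersionResource{
		Group:    "hypershift.openshift.io",
		Version:  "v1beta1",
		Resource: "hostedclusters",
	}

	// A minimal, incomplete HostedCluster: its control plane runs as pods
	// on the consolidated management cluster, while worker nodes (e.g.,
	// the CPU main system next to an xPU) join it separately.
	hc := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "hypershift.openshift.io/v1beta1",
		"kind":       "HostedCluster",
		"metadata":   map[string]interface{}{"name": "edge-cluster-1"},
		"spec": map[string]interface{}{
			"release": map[string]interface{}{
				// Illustrative release image reference.
				"image": "quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64",
			},
		},
	}}

	created, err := client.Resource(gvr).Namespace("clusters").
		Create(context.TODO(), hc, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created HostedCluster:", created.GetName())
}
```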
Domain-specific hardware devices can support infrastructure functions for AI/ML tenant workloads as well as network service workloads for the 5G edge and radio access network, and the control plane of the network service can be offloaded to the domain-specific hardware device. What we are missing is the interconnection interface between tenant workloads and network service workloads. At this stage, we still need a back-to-back connection or an additional Ethernet switch to exchange traffic between the network infrastructure and the guest tenant workloads, because a CXL-based disaggregated computing architecture is not ready yet. The CXL Consortium is now working on the next-generation PCIe bus, which will open the possibility to disaggregate entire classes of system components, such as DRAM and persistent memory. We will be able to exchange data through memory, such as persistent memory, between the CPU main system and the xPU subsystem in the near future.

Okay, let me conclude. In the current industry trend, we are seeing an emergent change in system architecture, from CPU centricity to independent, intelligent subsystems with their own composable, specialized compute capability. This new composable architecture gives you the benefit of enhanced security by isolating the CPU main system and xPU subsystems such as the NVIDIA DPU and Intel IPU, delegating resource management of the composable compute cluster to each owner. It also helps increase network and storage performance through aligned abstractions for composable compute with its own domain-specific hardware accelerators. This leads to a hardware system architecture that matches the container and Kubernetes model of orchestrated services. I talked about HyperShift and Open Cluster Management for Kubernetes multi-cluster environments: HyperShift provides a minimal Kubernetes control plane that can manage the resources of each composable compute element, and together with Open Cluster Management it improves lifecycle management and allows access to a broader ecosystem and faster innovation.

We are still in the middle of the hardware evolution, and there is one missing piece, which is memory access technology at lower latency. The CXL Consortium is spending time on the next-generation PCIe spec, including enabling direct access to memory attached to domain-specific hardware accelerators. We are also seeing several developer community activities, such as RDMA over Converged Ethernet version 2 (RoCEv2) and remote persistent memory over RDMA, to exchange data via memory between the two sides of a network. Software stacks and network protocol stacks are designed around the available hardware environment, and so far there is no direct memory access technology to exchange data across the network in a stable manner. This is one of the reasons why we need so many network interfaces to carry user data across the telco edge and core. Currently, there are many overlapping network protocol stacks between the network domains, such as the 5G radio access network and the 5G core network, because of the traditional computing hardware environment. If we can enhance memory access technology to bridge specific memory from edge to core and exchange data directly, instead of through network protocol stacks, I think we will technically be able to eliminate the overlapping network protocol stacks and directly exchange data in specific memory with inference engines or intelligent applications at the telco edge for AI services.
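To give a feel for what "exchanging data through memory" could look like on one side of such a link, here is a small hedged Go sketch that memory-maps a file on a DAX-mounted persistent memory filesystem and writes a record into it. The path /mnt/pmem/shared.dat and the record contents are hypothetical, and a real edge-to-core setup would additionally need RDMA or CXL to make the region visible to the remote site.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical file on a DAX-mounted persistent memory filesystem.
	f, err := os.OpenFile("/mnt/pmem/shared.dat", os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	const size = 4096
	if err := f.Truncate(size); err != nil {
		panic(err)
	}

	// Map the region; with DAX, stores reach persistent memory directly,
	// bypassing the page cache.
	mem, err := unix.Mmap(int(f.Fd()), 0, size,
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(mem)

	// Write a record straight into the mapped region (illustrative data).
	copy(mem, []byte("sensor-42: latency=0.8ms"))

	// Flush to guarantee persistence (a dedicated pmem library would use
	// cache-line flush instructions instead of msync).
	if err := unix.Msync(mem, unix.MS_SYNC); err != nil {
		panic(err)
	}
	fmt.Println("record written to persistent memory region")
}
```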
The IOWN Global Forum addresses these challenges of memory access technology to realize AI-integrated communication over the All-Photonic Network with data-centric, disaggregated computing. I'm expecting further innovation from community activities such as what the IOWN Global Forum is working on, to realize AI-integrated communication and the digital twin computing world. That's all for my talk. Thank you for your attention. Here are the reference links, which you can access for further information. And if you have any questions, please add them to the chat box now. Thank you.