Hello there. My name is Vatsan Kasturi, and I'm giving this KubeCon tech talk, along with Nivedita Vishwanath, on high-performance networking for distributed DL training in production Kubernetes (K8s). This presentation shares our experience building an 800-GPU cluster out of 100 DGX-1 nodes using the RoCE protocol over an Ethernet network. I will cover the first section, and Nivedita will go into the second section, concluding with the performance numbers.

The agenda of this tech talk is broadly split into two parts. The gray boxes on top indicate the control-plane operations and setup we have to do. The green boxes on the bottom cover the network, the speeds and feeds, how we integrate RDMA and SR-IOV, and the performance numbers that came out of this exercise. Obviously the stack has to be laid out first for the compute environment; then we launch the DL applications as containers in the K8s environment and collect the telemetry metrics to monitor how the cluster is performing, along with fairness in a multi-user, multiplexed environment so that users get a fair share of the compute.

Multi-node data center considerations go into the design of the cluster itself and depend on the heating and cooling and the rack layout, along with the five networks listed here: what we call the east-west (GPU-to-GPU), north-south (Kubernetes control), IPMI, storage, and internet networks. EDR, HDR, and NDR are bandwidth designations: EDR is 100-gig, HDR is 200-gig, and NDR is 400-gig links. We will look at how GPUs actually communicate GPU-to-GPU in the east-west direction, how we integrate the NICs, RDMA, SR-IOV, and the CNI into the Kubernetes cluster, and then conclude with the production Kubernetes experience and the performance results in the second half. I'm not going to go into the details of this listing, but you can see the five networks listed here, the speeds-and-feeds discussion, and other design considerations in the data center. Nivedita will cover the production Kubernetes cluster, multi-node workloads, and the CNI control flow.

So what is the multi-node cluster, MN-K8s? MN-K8s is a 100-node DGX-1 V100 cluster connected by Ethernet links for GPU-to-GPU communication over the RoCE v2 protocol. We use RDMA extensively to move data, and the cluster is set up to provide horizontal scaling of DGXs in the data center, subject to rack space, power, cooling, and heating constraints. The orchestration is done by the K8s controller, while the jobs are time-shared, controlled by a custom batch scheduler for fairness both in the GPUs allocated and in the time spent on the GPUs. Data scientists are the users running the ML jobs; these could be PyTorch or TensorFlow jobs inside containers, which get launched by K8s onto the worker nodes.

When you set up a cluster like this, there is an expectation around KPIs, and here is what we have experienced and what we consider good KPIs for ML clusters. We need to design low-latency MN-K8s clusters, ideally with a one-hop network, to keep the cluster design simple. We also observe that instead of 16 GPUs on one node, running on 1,400 GPUs gives an improvement factor of about 70x, and this is specifically on ML performance.
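As a quick sanity check on those speeds and feeds, here is a small back-of-the-envelope Python sketch (my own illustration, not something from the talk) that totals the GPU count and east-west bandwidth implied by the cluster description: 100 DGX-1 nodes, 8 V100 GPUs per node, and 4 x 100-gig RDMA NICs per node.

```python
# Back-of-the-envelope sizing, assuming the configuration described in
# the talk: 100 DGX-1 nodes, 8 V100 GPUs per node, and 4 x 100-gig
# (EDR-class) RDMA NICs per node for east-west traffic.
NODES = 100
GPUS_PER_NODE = 8
EW_NICS_PER_NODE = 4
LINK_GBPS = {"EDR": 100, "HDR": 200, "NDR": 400}

total_gpus = NODES * GPUS_PER_NODE                       # 800 GPUs
ew_per_node_gbps = EW_NICS_PER_NODE * LINK_GBPS["EDR"]   # 400 Gb/s per node
ew_cluster_tbps = NODES * ew_per_node_gbps / 1000        # 40 Tb/s aggregate

print(f"GPUs: {total_gpus}, east-west per node: {ew_per_node_gbps} Gb/s, "
      f"cluster aggregate: {ew_cluster_tbps:.0f} Tb/s")
```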
Faster model convergence for the data scientists means better performance numbers, which translates into increased productivity and faster convergence to accurate states when you have 8 billion parameters to work with. The network needs to be matched to the multi-node cluster's GPU targets so that it does not become a bottleneck.

This picture has two images side by side. The left image is the front of the cluster; you can see the perforations on the floor from which the cold air comes up and gets pushed through the racks. The right picture is the back side, what we call the power side, where the heat comes out and leaves the data center. There are basically six switches here, each of them a 128-port, 100-gig switch. Four of them carry the rings we'll talk about in the subsequent slides, and two of them are used as leafs for the north-south traffic going up.

There are terminologies used in this tech talk, which I have listed here. I'm not going to go through them, but you can refer back to this list for any terminology you don't understand.

In the multi-node cluster, these are the building blocks, moving from left to right. There is the data center, with power, cooling, and racks, and the placement of the DGX and CPU nodes, which is typically done by the site-ops and networking teams. Then there is the perimeter network, and the production network right here, which has the K8s and BGP layout in the switching along with VLANs at the lower end. You have monitoring, metrics, and logging associated with both the networks and the user jobs. This block is where the users actually interact, using the CLI and GUI to launch container jobs on this infrastructure. K8s controls the worker nodes, and there is, of course, an API server associated with this.

This picture shows the five different networks that make up the traffic inside the cluster. There is IPMI to control the BMC and the host, and the K8s control traffic going north-south, where the API servers can be located anywhere. The green one, the GPU-to-GPU network, is a short-span, in-cluster, bandwidth-heavy network where all the EDR, HDR, and NDR traffic actually flows. It is very sensitive to the RTT between nodes, and you want it to be non-blocking to ensure performance on this network. It also needs to interact with the storage network, which can sit at the leaf or the spine, to fetch data and to write back results and logs. And obviously the data center needs access to the internet for pulling images, datasets, and so on.

This orbital layout has 10 blocks, which are listed here. Each block represents a rack, and each rack has about 10 DGXs; they all go into the inner ring. There are four rings, the donuts shown here in different colors. These are called NCCL rings, and they connect to the NICs in the DGX-1s: each ring connects to one of the NICs in each DGX, and together they form a ring, so each node is only one hop away. There is an uplink associated with this, about 6.4 terabits, if you need to go north-south to reach storage or for other needs. But when the GPUs are communicating with each other, they are only one hop away with low round-trip latency. Many of you are aware of the core-spine-leaf model.
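To make the one-hop ring idea concrete, here is a small Python sketch (an illustration under my own assumptions, not code from the talk) that models the four rings: each ring uses one NIC per node, and consecutive nodes on a ring are direct L2 neighbors on the same VLAN, so ring traffic stays within one switch hop.

```python
# Illustrative model of the four east-west rings: ring i uses NIC i on
# every node, and consecutive nodes on a ring are direct L2 neighbors on
# the same VLAN, so ring traffic stays one switch hop away. Labels such
# as "node0:nic2" are hypothetical, not real interface names.
NODES = 100
RINGS = 4  # one ring per RDMA NIC on a DGX-1

def ring_neighbors(node: int, ring: int) -> tuple:
    """Return the (previous, next) peer interface of `node` on `ring`."""
    prev_node = (node - 1) % NODES
    next_node = (node + 1) % NODES
    return (f"node{prev_node}:nic{ring}", f"node{next_node}:nic{ring}")

# Example: node 0's neighbors on each of the four rings.
for r in range(RINGS):
    prev_peer, next_peer = ring_neighbors(0, r)
    print(f"ring {r}: {prev_peer} <- node0:nic{r} -> {next_peer}")
```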
This is basically the ToR switching for north-south, and the hundred nodes sit here in the racks, each with six NICs: four of them used for east-west and two of them for north-south. As I indicated earlier, GPU-to-GPU traffic is significant, and you want this to be a high-performance, low-latency network. This is the back side of one of the units, a DGX A100, just to give you a sample of how the east-west will connect; there are four ports here on each side of this box. In the DGX-1 there are four such links, and the DGX A100 has eight. The storage network moves north-south through these links, and then the traditional IPMI network is listed here.

This is the seven-layer stack most of you are familiar with, comparing InfiniBand and RoCE. As you can see, only the lower layers are a little different; from L4 onwards it is identical, so one can use either RoCE over Ethernet or InfiniBand and reconfigure the NICs accordingly. Obviously, the switches have to be different between the two networks. This is a slightly expanded version of the same picture, with the stack here going through UDP/IP in RoCE v2 and the InfiniBand network layer in RoCE v1.

I discussed this slide earlier: EDR, HDR, and NDR at 100, 200, and 400 gig, and how they relate to the PCIe bus. Do they saturate the PCIe bus? As the bandwidth and line rate move up, they saturate PCIe 3.0, and you have to move up to PCIe 4.0. So this is a good slide to relate what the nodes can do in terms of PCIe standards, the NIC bandwidth, and what the expectations of the GPUs are when you operate the cluster.

This is another view of the network as seen from inside the box. There is the motherboard and lots of memory, and the kubelet and the containers run here. The control and storage traffic goes through this particular NIC, which Kubernetes operates, along with the metrics we collect here. This complex is the east-west complex, where the switching happens; you can use IB or RoCE for the GPUs on each node. We then extrapolate this to 100 nodes in the cluster.

So what are the design considerations for a multi-node cluster? Align with the Ethernet switching model; interoperability and the ubiquity of Ethernet make it the vehicle on which you can build this. Avoid complex placement logic: fundamentally, ECMP and other two-layer routing schemes will have some hiccups, so we prefer a one-hop L2 network, and we avoid that complexity while operating the cluster. This particular one is a 100-node cluster, but with longer cables you can expand it to more racks. As we indicated in one of the pictures, this is a 16-rack structure, but you can expand it to 25 or 50 racks as long as the switching structure supports it. Align with and leverage NCCL, the NCCL ring structure in the libraries, to avoid congestion at the switch output ports; NCCL builds its rings around the orbit to ensure that traffic flows smoothly when you bring up the cluster. Avoid mixing generic storage traffic with the RDMA and NCCL rings, so that we have a clean network on the east-west side. That then integrates with SR-IOV and the CNI, and the RDMA interfaces are exposed inside the container. So this is a traditional switching structure which most of you are familiar with.
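As a rough illustration of the PCIe saturation point mentioned above, here is a hedged Python sketch (my own arithmetic with nominal figures, not numbers from the talk) comparing usable PCIe x16 throughput against EDR/HDR/NDR line rates.

```python
# Approximate usable bandwidth of a x16 PCIe slot vs. NIC line rate.
# PCIe 3.0: 8 GT/s per lane; PCIe 4.0: 16 GT/s per lane; both use
# 128b/130b encoding. These are nominal figures; real throughput is lower.
def pcie_x16_gbps(gt_per_s: float, lanes: int = 16) -> float:
    return gt_per_s * lanes * (128 / 130)  # Gb/s after encoding overhead

pcie3 = pcie_x16_gbps(8.0)    # ~126 Gb/s
pcie4 = pcie_x16_gbps(16.0)   # ~252 Gb/s

for name, rate in {"EDR": 100, "HDR": 200, "NDR": 400}.items():
    fits3 = "fits" if rate <= pcie3 else "saturates"
    fits4 = "fits" if rate <= pcie4 else "saturates"
    print(f"{name} {rate} Gb/s: {fits3} PCIe 3.0 x16, {fits4} PCIe 4.0 x16")
```

The output matches the point made on the slide: an EDR 100-gig NIC fits within PCIe 3.0 x16, but HDR and NDR rates force a move to PCIe 4.0 and beyond.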
So: edge, spine, leaf, and ToR. Storage can sit very close to the leaf if you want to keep the traffic local, or move up a little into the spine. And the IPMI network connects to the BMCs and to the switches so they can be managed. I'm going to pause here and hand this off to Nivedita, who will talk about the control plane and conclude with the performance. Thank you.

Thank you, Vatsan. Hello, everyone. I'm Nivedita, and I'm glad to be co-presenting this session. Vatsan talked about the best practices and network protocols to consider while designing a high-performance multi-node cluster in a data center. I would like to talk about how we leveraged those design decisions to enable high-speed networking in Kubernetes.

We have an on-prem production Kubernetes cluster that is mainly used by our internal users. Most of these internal users are research scientists running deep learning applications that often require hundreds of GPUs. We have enabled support for running multi-node workloads through a custom job controller, the upstream MPI operator, and a custom batch scheduler. The custom job controller registers a custom resource definition, or CRD, for a multi-node job with the API server. Users can submit multi-node jobs through the CRD, either through the CLI or the UI. When a multi-node job is submitted, the custom job controller watches for these events and creates an MPI job for it. The MPI operator then creates the pods for the MPI job, and these pods form a gang, which is then scheduled by the custom batch scheduler. The container image has the necessary libraries for the framework, MPI, and NCCL, and when all the pods are running as a gang, they communicate through NCCL.

The production cluster we have is a shared cluster. As Vatsan mentioned, it is made up of 100 DGX-1 nodes. Since this is a shared cluster, we have implemented features like gang scheduling, starvation handling, and backfilling, as well as support for user quotas, DRF, and dynamic job priority through our custom scheduler. We have a logging and monitoring pipeline that feeds into dashboards and alerts and helps with day-to-day operations.

Now, in order to run distributed deep learning applications, high-speed networking is extremely crucial. But before we go into how we enable high-speed networking, let's take a high-level look at the topology of a DGX-1 node. Each DGX-1 has eight V100 GPUs interconnected by NVLink. In our cluster, it has four Mellanox 100-gig NICs for RDMA and dual copper 10-gig Ethernet NICs. The four RDMA interfaces form a RoCE network, or ring, that is used by NCCL for GPU-to-GPU communication over the one-hop VLAN switching fabric. NCCL can detect the fast interconnects between GPUs, both within and across nodes, and the NCCL ring formation also helps with output-port congestion in the cluster. This process is described in detail in the link shared on the slide. The dual Ethernet NICs on the DGX-1 node are primarily used for storage and Kubernetes control traffic. So this gives us two distinct networks, one for RoCE and one for storage and control, resulting in an isolated environment for performance-centric workloads.

To enable high-speed networking for multi-node jobs, we need to make the multiple interfaces on the host available inside the Kubernetes pod. This is achieved through a combination of the SR-IOV device plugin and the Multus CNI.
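To illustrate the submission path described above, here is a minimal Python sketch using the Kubernetes Python client. It submits an MPIJob using the upstream Kubeflow MPI operator's CRD as a stand-in for the custom multi-node CRD mentioned in the talk (which is not public); the image, job size, and the SR-IOV resource name are hypothetical placeholders.

```python
# Sketch: submit a multi-node MPI job as a custom resource. Assumes the
# upstream Kubeflow MPIJob CRD (kubeflow.org/v1); image and resource
# names below are placeholders, not the cluster's real configuration.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

mpijob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "MPIJob",
    "metadata": {"name": "bert-pretrain-demo"},
    "spec": {
        "slotsPerWorker": 8,  # one slot per GPU on a DGX-1
        "mpiReplicaSpecs": {
            "Launcher": {"replicas": 1, "template": {"spec": {"containers": [
                {"name": "launcher", "image": "example.local/bert:latest",
                 "command": ["mpirun", "-np", "32", "python", "train.py"]}]}}},
            "Worker": {"replicas": 4, "template": {"spec": {"containers": [
                {"name": "worker", "image": "example.local/bert:latest",
                 "resources": {"limits": {
                     "nvidia.com/gpu": "8",           # all 8 V100s on the node
                     "example.com/rdma_vf": "4"}}}]}}},  # hypothetical SR-IOV resource name
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="mpijobs", body=mpijob)
```

The worker pods created from this spec are what the custom batch scheduler treats as a gang, and the SR-IOV resource request is what triggers the device plugin and CNI flow described next.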
The SR-IOV device plugin discovers the RDMA interfaces on the host and registers them with the kubelet. The RDMA interfaces can then be managed by the kubelet as allocatable resources. Once a pod that requests these RDMA interfaces in its resource spec gets scheduled onto a node, the kubelet hands off the work of moving the network resources from the host namespace into the pod namespace to the registered CNI plugin. Since the DGX-1 node has multiple types of network interfaces, we use Multus as the CNI plugin, which delegates its calls to the SR-IOV CNI plugin and the Flannel plugin. The SR-IOV CNI configures the RDMA interfaces in a pod, while Flannel configures the default eth0 interface. All the components in this flow are upstream components that we use with minor customizations.

For each physical RDMA interface on the host, we create two SR-IOV interfaces, or virtual functions, so a DGX-1 node with four RDMA interfaces ends up with eight virtual functions. Each full-node pod only uses four of these eight VFs, one corresponding to each physical interface, and the remaining four VFs are used for monitoring and storage network creation. The affinity of any VF to a particular GPU is decided at runtime by NCCL. The first image on this slide highlights the four physical RDMA interfaces on the host, and the second image highlights the four SR-IOV interfaces that are created from the physical interfaces and renamed inside a pod.

Since this is a production cluster, we constantly monitor the traffic on the RDMA interfaces. This is done by capturing the values of the transmit and receive counters, and the image shown here shows the traffic you typically see while running a machine learning application. We often see a sawtooth-like pattern, which is a result of the two phases that generally occur in every epoch of an ML application: the compute phase and the communicate phase. In the compute phase, the GPU workers are fetching from storage, performing computation, and waiting at MPI barriers, and there is no communication with the other GPU workers, hence the dip, or valley, in the traffic. In the communicate phase, the GPU workers share the results of the previous phase with the other workers, and this produces the peaks in the pattern that you see. Ideally, we want to maximize the compute time and minimize the communicate time in an algorithm to improve epoch training efficiency.

In addition to monitoring the RDMA traffic, we also run the NCCL all-reduce routine and other collective primitives as part of a pre-flight check prior to running multi-node workloads. The all-reduce test captures the average bus bandwidth, and it is run with different configurations where we vary the MPI ranks per node, the number of threads per rank, the number of GPUs per thread, and so on. From the table and graph, we see that even with an increasing number of nodes, we get a fairly consistent average bus bandwidth of about 45 GB/s, where the theoretical line rate is 50 GB/s. We have also captured the performance of a PyTorch BERT job run in two different phases, and we see that the throughput scaling, measured in sequences per second, is close to the ideal value even when using up to 256 concurrent GPUs in a shared multi-node cluster.

In conclusion, we wanted to share some of our findings and observations for DC placement and design.
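For context on the 45 vs. 50 GB/s figures, here is a small Python sketch (my own illustration, following the bus-bandwidth convention used by the upstream nccl-tests) showing how the algorithm bandwidth of an all-reduce is converted to bus bandwidth and compared against the 4 x 100 Gb/s per-node line rate; the measurement numbers in the example are made up.

```python
# Bus-bandwidth convention from nccl-tests: for an all-reduce over n ranks,
# busbw = algbw * 2*(n-1)/n, where algbw = data_size / elapsed_time.
def allreduce_busbw_gbs(data_gb: float, seconds: float, ranks: int) -> float:
    algbw = data_gb / seconds                # GB/s as seen by the application
    return algbw * 2 * (ranks - 1) / ranks   # GB/s actually crossing the links

# Per-node line rate in this cluster: 4 RDMA NICs x 100 Gb/s = 400 Gb/s = 50 GB/s.
line_rate_gbs = 4 * 100 / 8

# Hypothetical measurement: 1 GB reduced across 64 ranks in 44 ms (made-up numbers).
measured = allreduce_busbw_gbs(1.0, 0.044, 64)
print(f"bus bw = {measured:.1f} GB/s vs line rate {line_rate_gbs:.0f} GB/s "
      f"({100 * measured / line_rate_gbs:.0f}% of peak)")
```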
Following the best practices will result in an environment that gives good performance for AI/ML workloads where bandwidth is concerned. A comprehensive understanding of the network flows provides guidance for designing multi-node clusters without congestion at EDR and at future HDR/NDR rates. What should be kept in mind is that HDR/NDR rates apply to ToR switches as well, not just to DC-to-DC connectivity. Regarding the Kubernetes control plane, we found that the upstream components and plugins in the community are easy to adopt and customize, and by implementing features like gang scheduling, starvation handling, and quotas in our custom batch scheduler, we were able to achieve seamless scheduling of multi-node workloads while maintaining fairness for end users. I would like to thank the entire team at NVIDIA for their help and support, and I would also like to thank KubeCon for giving us this opportunity to show our work. Thank you.