Good afternoon, everyone. Thank you for joining our session. I'm amazed that so many people are here. Today, I'd like to talk about our challenge to provide a simple container as a service on a GPU cluster for our in-house users.

First of all, please let us introduce ourselves. I'm Tsuyo Kama, a software engineer working at NTT Communications, a telecom company in Japan. I'm Yoshihumi Sumida. I'm also a software engineer working at NTT Communications.

OK. This is the outline of our presentation. Firstly, we'll talk about our goal, motivation, and requirements. Next, why we provide GPU resources as containers. Then we compare several container-related tools; as in the title, we surveyed and evaluated various open source container tools. Finally, we'd like to tell you how we realized our GPU container as a service, and you'll see a demo of how our users use the environment.

OK, let's move to the main part. First, our goal, motivation, and requirements. Our goal is to provide a simple container as a service on a GPU cluster for in-house users. The GPU cluster is composed of a master node and slave nodes like this. The master node manages some GPU machines as a cluster and schedules GPU containers to the slave nodes efficiently. On the other hand, users can access GPU resources via the master node. In our challenge, we focus on easy management of GPU resources and easy deployment of GPU containers as a service.

OK, next, our motivation. We had some GPU servers but managed them individually, which wasn't efficient. So we'd like to manage them as a unified resource cluster. Our GPU cluster should be able to include different GPU series, which could be a problem because of NVIDIA driver version differences. On the other hand, more and more in-house users would like to use GPU resources for machine learning, data analysis, and so on. They want to focus on their own tasks, so provisioning of GPU resources should be as easy as possible for them. Therefore, we need to provide our GPU cluster as a cloud service: for example, an on-demand service that shares the GPU resources efficiently.

OK, to achieve our goal, we have to consider some requirements from both the user side and the provider side. From the user side, one is that users want to deploy GPU containers easily. For example, all they have to do is specify the number of GPUs they want. They don't care about the NVIDIA driver version, the necessary driver files, or GPU resource management. Another is that they want to use Docker, because they're familiar with Docker images and Docker CLI usage.

On the other hand, from the provider side, first, they want to assure GPU isolation. Our container scheduler should avoid attaching busy GPUs to new containers. We'd like to bind one or more GPUs to one container and let its container process see only its bound GPUs. We don't consider sharing one GPU among multiple containers, because it's difficult to share GPU cores and GPU memory efficiently. Our goal is easy deployment for providing a GPU cloud service.

Providers also want to distinguish container lifecycles according to task types. There are two task types. One is the temporary batch task: the task is executed once, for example, training tasks in machine learning. In this case, what is the container lifecycle? Our container should kill itself soon after the task in the container is completed. On the other hand, there is the long-running service task: for example, using Jupyter Notebook. In this case, our container process should keep itself alive until users kill it manually. It could be more efficient for GPU resource management to utilize these task types, because then no container process remains running in vain.
Okay, next, I'm talking about why GPU, and why we provide GPU resources as containers. Here, I think I don't have to show you the importance of GPUs in detail. Nowadays, GPUs are used in many fields and workloads, and some public cloud providers have begun to offer GPU instances.

Okay, next, let me talk about the reason why we provide GPU resources as containers. First, I'll talk about our first try to provide a VM (virtual machine) based GPU cloud. We had provided GPU instances on our OpenStack private cloud. We utilized KVM PCI passthrough and attached GPUs to VMs directly. However, this approach has three problems. The first problem is about the NVIDIA driver: users have to install the appropriate driver version every time they create a VM. The second problem is that we cannot monitor GPU status using the NVIDIA Management Library (NVML), because KVM PCI passthrough requires binding a dummy driver to the host machine's GPUs. The third problem is that once users create a specific environment in a VM, it's difficult to run various applications: a CUDA or NVIDIA driver version mismatch may cause applications to break.

Okay, now, let's move on and talk about our solution using containers. Containers, for example Docker, can resolve the previous problems. Here, we selected Docker because of our users' requirements.

Okay, next, how Docker actually solved our problems. As I said, in the case of VM instances, users have to install the appropriate NVIDIA driver for the GPUs every time they create a VM. When using Docker, on the other hand, they don't have to care about it. Once the provider installs the driver on the host machines, all users have to do is create and destroy containers. Users just manage the container lifecycle and don't have to consider the driver at all. And providers can monitor GPUs using NVML, because no dummy driver is bound to the host machine's GPUs.

Next, about the version-match problem among applications, the CUDA toolkit, and NVIDIA driver files. Docker images, in this case, can resolve this. The application and the CUDA toolkit can be consolidated into the image beforehand, and the NVIDIA driver files can be injected inside containers as a volume. However, in addition to this, we have to care about the compatibility between the NVIDIA driver and the CUDA toolkit version. Next, I'll show you an efficient solution.

Okay, NVIDIA Docker is a useful tool for GPU usage. Let me talk about NVIDIA Docker. It's a Docker wrapper tool to use and isolate GPUs inside Docker containers. There are ready-to-use images of CUDA and various deep learning frameworks for NVIDIA Docker. Then, let me show you the internals of NVIDIA Docker. It just wraps the docker run and docker create commands, adding Docker CLI options to mount the needed NVIDIA driver files, like this. The nvidia-docker-plugin can detect the NVIDIA driver files for these options: it can find all the NVIDIA driver libraries and binaries on the host. I'd also like to talk about how NVIDIA Docker solves the version-match problem between the image's CUDA version and the NVIDIA driver. It uses a special label in the Dockerfile, like this. For example, if the driver is too old for the image's CUDA version, an error occurs before starting the container, like this.
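As a rough sketch of what the wrapper does (the driver version 367.57 and the image tag here are just examples, not our actual environment), nvidia-docker expands a run command into a plain docker run with device and volume options, and the CUDA version label it checks can be seen with docker inspect:

    # What "nvidia-docker run nvidia/cuda nvidia-smi" roughly expands to:
    docker run \
      --device=/dev/nvidiactl \
      --device=/dev/nvidia-uvm \
      --device=/dev/nvidia0 --device=/dev/nvidia1 \
      --volume-driver=nvidia-docker \
      --volume=nvidia_driver_367.57:/usr/local/nvidia:ro \
      nvidia/cuda nvidia-smi

    # The special label that nvidia-docker compares against the installed
    # driver before starting the container:
    docker inspect --format \
      '{{ index .Config.Labels "com.nvidia.cuda.version" }}' nvidia/cuda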
Okay, next, I'll show you the comparison results among some container tools. We surveyed and compared some container-related tools and decided how to provide our container service on the GPU cluster. Actually, we surveyed and verified the following three functions.

This is the point of view of our comparison. I'd like to pick up some points and talk about them. First, specifying the number of GPUs: this means that users can specify the number of GPUs and allocate multiple GPUs, not only a single one, to containers by themselves. Next, GPU isolation: this means letting each container process see only its own GPUs, and GPUs which are already used by containers will not be attached to new containers. Finally, exiting batch tasks: this means that a container process can be killed automatically when the user's tasks in the container terminate successfully.

Then, okay, let's go through the evaluation. At first, NVIDIA Docker. The first point is how to specify the number of GPUs. You can use the environment variable NV_GPU, like this. Okay, and next, GPU isolation. For example, I specify the GPU IDs as on this slide using NV_GPU, create a container, and check the GPUs both on the host and in the container. On the host, we can see all four GPUs on the machine. On the other hand, in the container, we can only see the specified GPUs, IDs 0 and 1. So NVIDIA Docker can isolate the specified GPUs in a container. However, when I created two containers in the same way, both containers got the same GPUs. That's not fine, because users may create containers attached to busy GPUs. As a result, the isolation of NVIDIA Docker alone is not enough for our requirements.
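For reference, a minimal sketch of that NV_GPU check (the image is just an example):

    # Bind only GPUs 0 and 1 to the container:
    NV_GPU=0,1 nvidia-docker run --rm nvidia/cuda nvidia-smi

    # On the host, nvidia-smi lists all four GPUs;
    # inside the container, it lists only GPUs 0 and 1.
    nvidia-smi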
Okay, next, let's move to Docker Swarm, swarm mode. NVIDIA Docker itself doesn't have a function to manage a cluster, so our next approach was to survey Docker's native clustering tool. Docker Swarm is a native clustering tool for Docker, and Docker 1.12 got a built-in orchestration feature, swarm mode. However, for now, Docker itself cannot manage GPU resources the way it manages CPU or memory. Also, it isn't supported by NVIDIA Docker, so it cannot inject the necessary NVIDIA driver files into containers automatically as NVIDIA Docker does. To conclude, it doesn't satisfy our requirement of GPU resource management.

Okay, next. As we saw, NVIDIA Docker or Docker's native clustering tools alone couldn't satisfy our requirements. As I told you before, at first we provided a GPU cloud on OpenStack. In this context, we tried to use container-related OpenStack components. For example, Magnum: Magnum is just a way to deploy a container orchestration engine (COE) on OpenStack. On the other hand, OpenStack Zun is the container management service, and it's a relatively new OpenStack project. So we selected Zun and verified whether it can satisfy our requirements. However, for now, GPU resources are not supported. It's a fatal problem for our requirements. As a result, it doesn't satisfy our requirements.

Okay, next, let's talk about Apache Mesos. So far, we dealt with NVIDIA Docker, Docker-related tools, and OpenStack Zun. From here, we'd like to show our survey and evaluation of the major players, Apache Mesos and Kubernetes. First, Apache Mesos. It's a cluster manager and provides efficient resource isolation and sharing. It has master and slave nodes, and you can control its master with various Mesos frameworks.

Okay, next, let's see Mesos' GPU support status. Mesos version 1.0 added NVIDIA GPU support: it got a function to manage GPU resources the same as CPU, memory, and disk. However, NVIDIA GPU support in Mesos is only available for Mesos containers, not Docker containers. I'd also like to talk about Mesos frameworks' GPU support. Mesos itself has GPU support as on the previous slide, but Mesos frameworks also must have GPU support. However, none of the frameworks support both GPU and Docker for now.

Okay, next, let me show how to specify the number of GPUs and GPU isolation in Mesos. We used Mesos and Marathon versions like this. To specify the number of GPUs, you define the number in the application definition, like this. I actually created containers from this file and checked the GPUs in the containers and the GPU isolation. NVIDIA Docker had the problem that the same GPUs were attached to different containers. In Mesos, on the other hand, it is able to assign different GPUs to different containers, like this figure. So Mesos can assure GPU isolation.

Okay, next, Mesos' Docker support. As I said before, for now NVIDIA GPU support is only available for the Mesos containerizer, not the Docker containerizer. The Mesos containerizer supports Docker images; however, you cannot use the Docker API or Docker CLI. It's a regrettable point for our requirements: our users want to use the Docker API because they are familiar with it.

Okay, finally, let me show Mesos' roadmap status. Mesos may support the Docker containerizer with GPUs in the next version, 1.2, as we can see in these GitHub commit logs. However, GPU support with Docker in Mesos frameworks like Marathon seems not to be progressing.
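To give a concrete flavor, here is a minimal sketch of a Marathon application definition that requests GPUs (the Marathon host, app id, and image are hypothetical; the gpus field assumes a Marathon version with GPU support and the Mesos containerizer):

    # Submit an app that requests 2 GPUs via the Marathon REST API:
    curl -X POST http://marathon.example.com:8080/v2/apps \
      -H 'Content-Type: application/json' \
      -d '{
            "id": "gpu-test",
            "cmd": "nvidia-smi && sleep 3600",
            "cpus": 1,
            "mem": 1024,
            "gpus": 2,
            "container": {
              "type": "MESOS",
              "docker": { "image": "nvidia/cuda" }
            }
          }'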
Okay, now let's move to Kubernetes. Kubernetes is one of the major container orchestration engines, from Google. Kubernetes has various features for orchestrating containers: for example, cluster management, auto-scaling, auto-healing, storage orchestration, and batch execution.

Kubernetes has some original concepts for container management. It manages containers as a group of one or more containers, and this concept is called a Pod. A Pod shares its storage and the options about how to run its containers. When creating containers, you have two ways to deploy them: simply deploying a Pod, or using a controller. Firstly, let me talk about using a Pod. A Pod is the simplest deployment method in Kubernetes. As I said before, a Pod is the minimal unit for managing containers, and deploying a Pod simply creates a group of one or more containers. On the other hand, controllers can define how to create and manage Pods with more general functions. There are various kinds of controllers: for example, Job, ReplicaSet, Deployment, and so on. In this talk, I'll touch on the Job controller later.

Next, let me show how to manage the container lifecycle. When users create containers, they generally prepare a manifest file in YAML or JSON format. In the manifest file, users define the specification of the containers: for example, the Pod's kind, the container's name, the image name, and so on. Users can create or delete containers from the command-line interface or the web UI, which submits the manifest to the Kubernetes master node through the REST API.

Now, let me talk about whether Kubernetes' GPU support satisfies our requirements. Before version 1.6, Kubernetes was not mature enough. But version 1.6 and above support GPU scheduling well, and it satisfies our requirements. In other words, Kubernetes can assign multiple GPUs to one container, and each container can occupy its own GPUs. Next, I will show you the results using this version. To specify the number of GPUs, define the number in the manifest file, like this. Here, it creates one container, and we can see the requested GPUs inside the container. What about GPU isolation? In NVIDIA Docker, the same GPUs were attached to two different containers, but Kubernetes can distribute different GPUs to different containers properly, like this.

Next, how about the batch task? A batch task runs once; when the task is completed successfully, the container is automatically terminated. In Kubernetes, we can use the Job controller to realize batch tasks: just set the kind to Job in the manifest file. Like a Pod, we have isolated GPUs in the containers. A user can check the status of the task to see whether it's completed, and we can see that the container is terminated automatically.
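As a minimal sketch of such a manifest (the image, command, and file names are just examples; in version 1.6 the GPU resource is requested with alpha.kubernetes.io/nvidia-gpu, and the NVIDIA driver volume mount is omitted here and covered in the provider tips later):

    # gpu-job.yml: a batch task that requests 2 GPUs and exits when done.
    cat <<'EOF' > gpu-job.yml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-training
    spec:
      template:
        spec:
          containers:
          - name: trainer
            image: tensorflow/tensorflow:latest-gpu
            command: ["python", "/train_mnist.py"]
            resources:
              limits:
                alpha.kubernetes.io/nvidia-gpu: 2
          restartPolicy: Never
    EOF

    kubectl create -f gpu-job.yml
    kubectl get jobs        # shows whether the task has completed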
Okay, we have seen the results for these five tools, and I've summarized them in this table. You can find that Kubernetes is the better choice from many points of view. In this result, Mesos is comparable to Kubernetes, because Mesos also provides GPU isolation, specifying the number of GPUs, and so on. But Kubernetes is superior to Mesos in terms of Docker support.

In the last part of this talk, I will talk about our decision on how to realize our GPU container as a service. Based on the results of the comparison, what we chose is Kubernetes. At the point when we submitted our proposal for this talk, Mesos had advantages over the other tools in satisfying our requirements. That is to say, although Mesos didn't support Docker, its GPU isolation was better than the other tools'. However, the recent release of Kubernetes, version 1.6, has both Docker support and pretty good GPU isolation, so it made us change our minds.

This figure shows the architecture of our environment briefly. Our Kubernetes cluster has five nodes for now, including one master and four GPU slave nodes. Users control the cluster from their own machines or by logging into the master node. We also provide an NFS server as external storage for users.

Based on our experience of GPU cluster deployment, let me share some tips for providers. At the beginning, I'll talk about how to enable GPU containers. We need to do three things on each slave node: firstly, install the NVIDIA driver; secondly, install NVIDIA Docker; and at last, when running Kubernetes, add a certain parameter to it. Next, I'll explain the first one and the second one in detail.

Firstly, install the NVIDIA driver on each slave node. The NVIDIA driver includes various libraries and kernel modules. In particular, the CUDA driver library and the NVIDIA driver libraries are needed in the containers, so they will be injected into containers later. Please note: do not install the CUDA toolkit on the slave node, because it will be included inside the official Docker image by NVIDIA; this is to avoid version-dependency problems.

Secondly, install NVIDIA Docker on each slave node. To enable containers to use GPUs, the provider needs to pick up the necessary driver files and inject them into the containers. But providers would have to know which driver files are necessary and where they are, because these files exist in many directories. NVIDIA Docker helps us do this more easily: nvidia-docker-plugin can detect the necessary files to be injected and aggregate them under one specific directory, like this. In this slide, we call this directory the NVIDIA Docker volume. After that, just by mounting the NVIDIA Docker volume into a container, users can utilize GPUs inside the container. Please note that NVIDIA Docker itself can automatically detect the NVIDIA Docker volume and mount it into a container. However, when using the NVIDIA Docker volume on Kubernetes, the user needs to specify the path of the NVIDIA Docker volume explicitly in the manifest file.

NVIDIA Docker is nice, but there is still one version problem, because the path of the NVIDIA Docker volume includes the NVIDIA driver version number. Like this, two nodes can have different paths. In particular, this will be a problem when your cluster includes multiple kinds of GPUs. One solution is to unify the path name by creating a symbolic link on each slave node. Then you can use the same path for the NVIDIA Docker volume in the manifest file.

After deploying the GPU cluster, there are several additional tasks for providers. Sometimes a user wants to select a specific GPU. For example, when users want to run heavy workloads like machine learning or artificial intelligence, they may want to use a Tesla P100, which is a high-performance GPU. The provider attaches a label, like a GPU type, to each slave node according to its GPU series. Then users can specify the label in their manifest file.

Besides the tasks around the initial deployment, providers have to do monitoring during daily operation. Monitoring is needed because, if few GPUs are remaining, the provider needs to add GPU resources, or advise users to release unnecessary GPU resources by terminating or deleting Pods. We looked at two methods for monitoring. At first, Kubernetes itself provides GPU resource monitoring on each node: by running kubectl describe node, the provider can see the number of GPUs on each node. But this function doesn't work well currently. Second, the provider can monitor the GPU resources on each node with NVML. NVML gets some metrics about GPUs: for example, the GPU utilization rate, the used GPU memory rate, and so on. Therefore, our choice is to use NVML to monitor GPU utilization for now.
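Putting those tips together, here is a rough sketch for one slave node (the driver version 375.66, node name, and label value are just examples; the kubelet feature gate assumes Kubernetes 1.6):

    # 1) Enable GPU scheduling on the kubelet (Kubernetes 1.6):
    #      kubelet --feature-gates=Accelerators=true ...

    # 2) Unify the NVIDIA Docker volume path across nodes that have
    #    different driver versions, using a symbolic link:
    ln -s /var/lib/nvidia-docker/volumes/nvidia_driver/375.66 \
          /var/lib/nvidia-docker/volumes/nvidia_driver/current

    # 3) Label the node by GPU series so users can select it with a
    #    nodeSelector in their manifest ("gputype" is our own label name):
    kubectl label node slave-node1 gputype=p100

    # 4) Monitor GPU utilization and memory through NVML via nvidia-smi:
    nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv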
I've shown some tips for providers. Next, let me talk from the user's side. We assume that users have two ways to execute their tasks: that is, there are the service task and the batch task, as mentioned in our requirements. In the next demonstration, I will show you the steps of each workflow.

So now let's move to our demonstration. I'll show you our demo movie, and during this movie, I'll talk about how users use our Kubernetes cluster. Let's start the movie. This is a figure of our environment: one master node and four slave nodes. First, I log into the master node. As you can see, Kubernetes shows one master and four slave nodes. Next, I log into one slave node, which has two GPUs, and you can see two Tesla GPUs in this node.

Next, I'll show you how users use our environment. There are two kinds of user tasks: the service task and the batch task. First, let me show you the batch task. Once the batch task is finished, the container will be terminated. To run a batch task, the user creates a manifest file like this. For example, I set the kind to Job, request two GPUs (Tesla P100), and run a training command using MNIST inside a TensorFlow container. Next, I create the TensorFlow job. As you can see, the job is created. Next, I confirm that my two requested P100s are assigned properly. After a while, I check my job status and see it's completed. So now I can check the training results. You can see the batch task results in this way.

The other type of user task is the service task. It's long-running and keeps the container alive. Similar to the batch task, the user creates a manifest like this. I set the kind to ReplicationController for managing the Pod, with a container port bound to a host port. I request one P100 GPU with the DIGITS container image. You can also use the Kubernetes web UI to create containers: just upload the manifest file. You can see the DIGITS Pod is created. Now I can access the DIGITS UI in my browser. Here, I register the MNIST dataset. The dataset is being created, and then the training process using Caffe will begin. Then you can see the real-time training state and the GPU status. The demonstration ends here. Thank you.