Today I am going to talk about some of the experience I gained from setting up a large, well, not so large, but sizable AI server infrastructure at my university. We know that AI is going to change the world, because we are moving fast from the first, second, and third industrial revolutions to the fourth industrial revolution, which involves using AI, big data, and the cloud to drive new innovation and change the way we live, play, and conduct business.

Here are some examples of research being conducted at Kasetsart University. We have a professor working with a company on damage prediction: when you have a car accident, how do you quickly use a mobile phone to take a picture and assess the damage? We also have a very good Faculty of Agriculture: if we take drone images and count the size and number of plants, we can estimate crop production for a whole province in one or two days, instead of sending people into the field to count how much the plants will yield. Or face recognition: this is from real video. We can pass the video through the AI and find at which minute Bella and Pope appear, and also their emotions, and this can be used to drive advertising for a company. The last one is food classification, which can be used in the tourism industry. All of this research needs the power of AI: we need the immense power of an AI system to drive research in robotics, machine vision, engineering, genomics, food and agriculture, and nanotechnology.

Finally, last year I managed to get two funded projects to build two AI infrastructures, one at the main campus in Bangkhen and another at the Sriracha campus for the EEC. The first system is up and running, and the second is almost up and running, in testing. We got funding from the Thai government and partners, about 50 million baht. The Thai government funding is also being used to upgrade the network for AI, and I am pleased to tell you that KU now has a 100-gigabit-per-second campus backbone, so we are moving fast toward a super infrastructure.

So how do you build a high-performance HPC AI system? I had to sit down with vendors and system integrators, because in Thailand we do not have that much knowledge of how to really build such a system, and we identified three major challenges. First, system optimization. Second, how to pick the right mix of open-source software and handle hardware installation and management. Third, we have to deal with various kinds of workloads that somewhat conflict with each other: traditional genomics, bioinformatics, and computational workloads on one side, and the new deep learning training and inference on the other, all served by a single set of machines.

If we look at deep learning, which is the key application we want to support, these are the normal processing steps. You get the data from your storage, and it moves from the solid state drive to memory. Most data scientists and AI researchers use Python, and Python pulls the data from the solid state drive into memory. Then you hand a data frame from Python memory to lower-layer libraries such as TensorFlow or Caffe, and that library pumps the data into the GPU. The GPU crunches the numbers, because deep learning training is basically a huge amount of vector mathematics: you do vector, or tensor, calculations. Finally, you update the model again and again and again, pull it back out to memory, and save it to the solid state drive. This is, for example, the process when you want to distinguish cats from dogs, or whatever people are doing.
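To make those steps concrete, here is a minimal sketch of such a pipeline in Python with TensorFlow. The paths, model, and class count are illustrative assumptions, not our actual setup:

```python
# Minimal sketch of the pipeline described above: storage -> SSD -> memory -> GPU.
# All paths and model details are hypothetical, for illustration only.
import tensorflow as tf

# Stream training images from the (assumed) local NVMe staging area into memory.
dataset = tf.keras.utils.image_dataset_from_directory(
    "/nvme/train_images",          # hypothetical path on the node's SSD
    image_size=(224, 224),
    batch_size=64,
).prefetch(tf.data.AUTOTUNE)       # overlap disk reads with GPU compute

# A small CNN; the GPU does the heavy tensor math during training.
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g. cat vs. dog
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# The model is updated again and again, epoch after epoch,
# then written back out to solid state storage.
model.fit(dataset, epochs=10)
model.save("/nvme/models/cat_dog_classifier.keras")
```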
Looking at this from the system perspective, we need to read from remote storage, because most training data involves millions of images; a large sensor data set, for example, has to be stored not on the AI machine but on fast solid state storage somewhere else. We have to move it onto the local solid state drive, then into memory, and back and forth to the GPU and back again. The model, although quite sizable, is still much smaller than the data you use to train it.

So when we sat down with the vendors and design engineers to talk about this, we decided on a few things. First of all, we need an ultra-fast, high-speed network to facilitate the quick movement of all that information. There are two competing technologies, InfiniBand and 100 Gigabit Ethernet, and on the main campus we chose 100 Gigabit Ethernet. Why? Because we want to plug the whole infrastructure into the core network, so that when users bring data in, they can move it from the core network into the AI machine at 100-gigabit speed. InfiniBand is also good, for low latency, and on the second site we did decide to use InfiniBand, because on that site we have a parallel file system as the back end for big data, and we have no need to connect the system to the core network; the core network over there is only about 10 gigabit.

Of course, each node of the AI machine must have a solid state drive, and we use NVMe-based drives for speed, although their capacity is much smaller than the main storage. For computing capability, deep learning is very friendly to GPUs, so the GPU is the only choice. But don't forget that although the GPU is important, the CPU cores of the server are important as well: when we move data from memory down into the GPU, the movement has to be facilitated by the CPU, and many cores on the CPU make every process much faster. We ended up with 64 cores per node, using Intel Xeon Gold processors.

Another consideration is how much memory we should have on the system. The rule of thumb we learned from NVIDIA, whom we talk with all the time here in Asia-Pacific, is that for this kind of project you need at least twice the total size of the GPU memory. So you count the cards, multiply by the memory per card, and double it to get the size. But my researchers reported that the more memory the better; we really need more memory than that minimum.
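As a back-of-the-envelope illustration of that sizing rule, here is the arithmetic in Python; the card count and per-card memory are assumed values, not our exact specification:

```python
# Host memory sizing per the NVIDIA rule of thumb mentioned above:
# at least 2x the total GPU memory in the node.
# The numbers below are illustrative, not the actual KU configuration.
gpus_per_node = 4        # e.g. four V100 cards (assumed)
gpu_memory_gb = 32       # memory per card in GB (assumed)

total_gpu_memory_gb = gpus_per_node * gpu_memory_gb      # 128 GB
min_host_memory_gb = 2 * total_gpu_memory_gb             # 256 GB

print(f"Total GPU memory per node: {total_gpu_memory_gb} GB")
print(f"Minimum host RAM (2x rule): {min_host_memory_gb} GB")
# In practice our researchers asked for even more than this minimum.
```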
On the software side, we decided to use containers: the whole system is managed as a container cloud using the open-source OKD, which enables you to build such a cloud. But we need resource management as well, because without it a user can lock up the hardware and never return it to the pool, and then one person can hog the system and control everything. We don't want that; we want freedom for everyone. So we have a scheduler as a mediator to control the resources of the system. And there are various kinds of workloads, from CFD to deep learning; how do we accommodate all of them? For example, if you are running JupyterHub to teach students and suddenly some guy comes in and runs a huge deep learning training job, your students will freeze up, right? That renders the system unusable.

So we looked for a solution, and finally we decided, hey, let's just give each workload its own machines. We have four servers in the main system at the Bangkhen campus, with V100 GPUs; each GPU card gives about seven teraflops, so we give them about 30 teraflops to train deep learning. We have two GPU servers for AI training and two for traditional computational work, plus a management node. It comes out something like this, and the final system is up and running now. Amazingly, we found our researchers managed to keep it busy full time; we need even more power than this. We have one node as a management node, with OKD and Slurm installed, and the others running workloads. I am not going to go into detail, because you are free to access the PowerPoint.

The second system has a different design. This time we decided to go for a packaged product, the DGX-1 from NVIDIA. The DGX-1 has eight V100 cards in a single system, much more powerful than the four, because you can use the whole 512 gigabytes of memory and then, bam, train on it. It is running now, but we do not have benchmark results yet. And we have storage nodes, using GlusterFS I think, to feed it data. When you plan a deep learning machine, you have to remember that what makes AI powerful is big data, so the system has to accommodate big data and provide a way to pump that humongous data into the AI serving machines. If you need to do inference as well, you add another tier of inference machines, but we don't have that here, because our system assumes that once you have trained the model, you can do inference somewhere else, for example by putting your model on Amazon or Google Cloud and running it there.

One of our researchers benchmarked the system, comparing it with Google Colab using a flower identification benchmark with two complex CNNs of about 23 million parameters. We also ran a GAN, using image generation to change a horse into a zebra. We tested these on our system and found that we can deliver much better performance; look at the times, where the purple bars are Colab: our system beats it, much, much faster. Another researcher is working on optimizing SqueezeNet, a small CNN for edge and embedded systems. This is totally different, because nowadays you can have a device like a camera and actually do the inference quickly on the device itself, but you need a proper model and parameters. The optimization work usually ran for four days, and it has now been cut by about a factor of two; it is really heavy computation. And these are only two projects; there are ten projects or more, but I don't have their information yet.

We also found something interesting that I don't have the answer to yet. Our researcher trained the model using both the local server and the remote server, and we found that feeding data from the remote server gave more speedup than the local server, which, interestingly, goes against our instinct. We got about a three-times speedup on four GPUs, 2.94 times to be exact, which is quite good; it means our setup is quite scalable.
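For illustration, this is how such a speedup figure is computed; the timings below are made-up placeholders chosen only to reproduce the reported 2.94 figure, not our measured data:

```python
# Sketch of how multi-GPU training speedup and efficiency are computed.
# Timing values are hypothetical placeholders, not the real benchmark data.
def speedup(single_gpu_time: float, multi_gpu_time: float) -> float:
    """Speedup of a multi-GPU run relative to a single-GPU run."""
    return single_gpu_time / multi_gpu_time

t_1gpu = 400.0   # assumed seconds per epoch on one GPU
t_4gpu = 136.0   # assumed seconds per epoch on four GPUs

s = speedup(t_1gpu, t_4gpu)
efficiency = s / 4              # parallel efficiency over four GPUs

print(f"Speedup: {s:.2f}x")     # prints 2.94x with these placeholder times
print(f"Parallel efficiency: {efficiency:.0%}")
```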
But there are many remaining challenges. I could talk with you about them much more, but I have only 15 minutes, so I'm sorry this is a little bit fast. First of all, how do we properly manage GPU resources for multiple users? We are still learning how to properly set up the scheduler and containers: should we launch the container first, or should we have the scheduler launch the container and only then let the user in? That kind of thing. There is also the data staging and configuration: how do we make it as automatic as possible for the researchers? Otherwise they have to understand the structure of the system, which makes it difficult to use. And we have to tune the compute, the GPU data transfers, and the containers. But what we think is that learning is fun: we make it work first, and we make it work fast later. At least the system is up, running, and giving service, and we learn along the way. That is the spirit of exploring, and we can change anything.

In the future, we are exploring how to build a federated, grid-like connection between the two sites, because KU is connected by a 10-gigabit link, and we want to build a single view of the system. The fun thing is that I am participating in an ASEAN-level project that tries to link the supercomputers of ASEAN countries together, and I am one of the key representatives from Thailand working on the design. Singularity is another thing to explore: it is a new thin-layer container that allows you to use GPU resources more efficiently, and it is popularly used in national research labs in the US, so we are going to try it out. We are also building a research community and building research programs around the system; this Friday, I am going to talk to the omics group.

Finally, it has been a lot of fun and a great learning experience, and it is all based on free and open-source software. Thank you very much.

Thank you so much, Dr. Putchong. Thank you very much.