Hello, everyone. Welcome to this Python conference. My name is Ki-Hwan Kim, and I am from Korea. The title of my talk is "A Global Atmospheric Model Using Python on GPUs". Recently, I have been working on technology to take advantage of various modern processors for large-scale scientific computations.

My presentation has four main parts, as follows. First, I will introduce some modern processors, focusing on CPUs and GPUs. Second, I will introduce a new, simple Python module named PyMIP, which I developed for multi-platform computing. Third, I will present the results of applying PyMIP to a global atmospheric model. Finally, I will talk about the summary and future plans.

Let's start with the first part. As you already know, if you are interested in computer hardware, there is a variety of processors in the computer market today, in addition to traditional CPUs. The most representative of the modern processors is the graphics processing unit, the GPU, developed by NVIDIA and AMD. GPUs were originally developed for dedicated graphics processing, but now they allow general-purpose computations. The Many Integrated Core architecture, named MIC and developed by Intel, integrates over 60 CPU cores and is aimed at the GPU market. In recent years, FPGAs have been rapidly evolving to improve performance by directly programming large arrays of logic gates to suit a specific numerical algorithm. One common feature of these modern processors is high computational performance with high energy efficiency. They are also massively parallel.

This graph shows the trend of computing technology over the last 30 years. Conventional computer performance was proportional to the clock speed of a CPU core, and for a long time the clock speed had been doubling roughly every 24 months, following Moore's law. However, since the mid-2000s, the clock speed has hardly increased, due to power consumption limitations. Instead, as you can see in the bottom lines, the number of cores is increasing rapidly. For reference, recent GPUs have thousands of cores with about a one gigahertz clock speed. Personally, I expect this trend to last a long time.

Let me introduce the differences between GPUs, representing the modern processors, and traditional CPUs. First, the percentage of transistors assigned to the arithmetic logic units, the ALUs, is very different. While the CPU allocates more transistors to cache and flow control than to the ALUs, the GPU allocates most of its transistors to the ALUs. Therefore, from the point of view of the FLOPS index, which indicates how many arithmetic operations can be performed per second, GPUs are much higher than CPUs. The second difference is that the memory bandwidth of GPUs, which use GDDR5 or HBM memory, is much higher than that of CPUs, which use DDR3 or DDR4 memory. However, in order to take full advantage of a GPU's performance, a highly parallelized algorithm is required; otherwise, the performance may be lower than a CPU's.

These tables show the specifications of some CPUs and GPUs sold on the market. Please note that these are not very up to date. Take a look at the CPU and the GPU released at comparable times, marked in red boxes. The memory bandwidth differs by more than four times, and the FLOPS differ by more than eight times in single precision and five times in double precision. I did not make a direct comparison in the table, but I heard that the difference is even greater on the most recent NVIDIA Pascal GPUs. If you have listened this far, I might look like an NVIDIA employee. No, I'm not.
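To put the FLOPS comparison in perspective, here is a back-of-the-envelope sketch in Python. The core counts, clock speeds, and FLOPs-per-cycle figures below are illustrative assumptions, not the numbers from the slides.

```python
# Theoretical peak throughput: cores x clock x FLOPs per cycle.
# All hardware figures below are illustrative assumptions,
# not values from the presentation slides.

def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak performance in GFLOPS."""
    return cores * clock_ghz * flops_per_cycle

# A CPU core does many FLOPs per cycle via wide SIMD units;
# a GPU core does few, but there are thousands of them.
cpu = peak_gflops(cores=16, clock_ghz=2.6, flops_per_cycle=16)
gpu = peak_gflops(cores=2880, clock_ghz=0.875, flops_per_cycle=2)

print(f"CPU ~{cpu:.0f} GFLOPS, GPU ~{gpu:.0f} GFLOPS, "
      f"ratio ~{gpu / cpu:.1f}x")
```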
The GPU simply has its advantages and disadvantages. In short, GPUs are very well suited to data-parallel computations, with high performance and high energy efficiency. However, if your algorithm is not a data-parallel computation, then the GPU may be useless for you. Also, since the GPU is connected through the PCI Express bus on the motherboard, data communication with the host memory might become a serious bottleneck.

Okay, my important question about the modern processors is this: how can I utilize these various modern processors in our model? The model currently being developed at my company is written entirely in Fortran, so directive methods such as OpenACC and OpenMP are considered more natural. As you can see in the examples below, if we insert the appropriate directives before a loop block, the compiler generates the code for the target machine. These methods look elegant and cool. However, in light of my personal experience, the directive methods are easy to start with, but it is difficult to achieve high performance with them. On the other hand, I think that each processor has a native programming language designed to suit it: for example, CUDA for NVIDIA GPUs, OpenCL for AMD GPUs, and ISPC for Intel MICs. Since these languages are parallel-oriented, they seem to be more beneficial for maximizing performance on the processor. So my question has changed to this: is there a way to integrate code written in various native languages such as CUDA, OpenCL, and ISPC?

So I made a new, small Python module named PyMIP. PyMIP stands for Python-based Machine-Independent Platform. The goal of PyMIP is to switch the processor and the language easily. PyMIP provides three components: a runtime environment, a build system, and a generalized array variable. PyMIP is powered by PyCUDA, PyOpenCL, and the Python ctypes module. The ctypes module is used to wrap Fortran, C, and ISPC libraries.

This shows the schematic layer structure of PyMIP. The main code of an application program is written in Python. It includes the parts that have little effect on computational performance, such as configuration, flow control, pre- and post-processing, and so on. The heavy computation parts are written in low-level language codes for the target processors. PyMIP compiles the low-level codes and imports them as Python functions. PyMIP also manages the array variables depending on the target processor, since GPUs and the Intel MIC have dedicated memory spaces.

This is a very simple example of how to use PyMIP. It calculates the equation g = a*x + y, where x, y, and g are vectors and a is a scalar. The code below is a simple Python version using the numpy module. Now let's convert this code to compiled languages: Fortran, C, CUDA, and OpenCL. First, these are the Fortran and C versions, and the OpenCL and CUDA versions are as follows. Note that these versions do not have an explicit loop statement, compared to the Fortran and C codes.

Finally, this is the Python main code using PyMIP to integrate the previous four low-level codes. The setup part initializes the data. The following line specifies the processor and language types; in this example, CPU and Fortran are specified. The next part defines and initializes the generalized arrays for the target processor. The next part compiles and imports the functions defined in the previous low-level codes; the Python function named func is a wrapper of those functions. The next part calls the functions with the generalized array arguments.
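The slide code itself is not captured in the transcript. Below is a minimal reconstruction of the numpy version of g = a*x + y, followed by a hypothetical ctypes wrapping of a compiled saxpy-style library, in the spirit of how PyMIP is described as wrapping Fortran and C; the library name, build command, and symbol are assumptions for illustration.

```python
import numpy as np

# numpy version of g = a*x + y (x, y, g are vectors, a is a scalar).
n = 2**20
a = 1.5
x = np.random.rand(n)
y = np.random.rand(n)
g = a * x + y  # the whole computation, vectorized by numpy

# Hypothetical ctypes wrapping of a compiled C (or Fortran) version,
# e.g. built with:  gcc -O2 -fPIC -shared -o libsaxpy.so saxpy.c
# assuming: void saxpy(int n, double a, double *x, double *y, double *g);
import ctypes
from numpy.ctypeslib import ndpointer

lib = ctypes.CDLL("./libsaxpy.so")  # assumed library name
vec = ndpointer(np.float64, flags="C_CONTIGUOUS")
lib.saxpy.restype = None
lib.saxpy.argtypes = [ctypes.c_int, ctypes.c_double, vec, vec, vec]

g2 = np.zeros(n)
lib.saxpy(n, a, x, y, g2)
assert np.allclose(g, g2)
```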
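PyMIP's actual API is likewise not shown in the transcript. The following self-contained sketch imitates the main-script structure just described, using a toy stand-in class; every name here (Platform, machine_type, code_type, get_function) is an assumption for illustration, not PyMIP's real interface.

```python
# A self-contained sketch of the PyMIP usage pattern described in the talk.
# The real PyMIP API is not shown in the transcript; names are assumptions.
import numpy as np

class Platform:
    """Toy stand-in for PyMIP's runtime: selects processor and language."""
    def __init__(self, machine_type, code_type):
        self.machine_type = machine_type  # e.g. 'cpu' or 'nvidia_gpu'
        self.code_type = code_type        # e.g. 'f90', 'c', 'cu', 'cl'

    def array(self, data):
        # A real implementation would allocate on the target device
        # (e.g. via pycuda.gpuarray); this toy stays on the host.
        return np.asarray(data, dtype=np.float64)

    def get_function(self, name):
        # A real implementation would compile the low-level source for
        # the target and import it (ctypes / PyCUDA / PyOpenCL).
        def saxpy(a, x, y, g):
            g[:] = a * x + y  # host fallback: g = a*x + y
        return saxpy

# setup: initialize the data
n = 2**20
a = 1.5
x_host = np.random.rand(n)
y_host = np.random.rand(n)

# the one line that selects processor and language
plat = Platform('cpu', 'f90')  # change to ('nvidia_gpu', 'cu') for CUDA

# generalized arrays for the target processor
x = plat.array(x_host)
y = plat.array(y_host)
g = plat.array(np.zeros(n))

# compile/import and call the low-level function
func = plat.get_function('saxpy')
func(a, x, y, g)

# check the result against reference data
assert np.allclose(g, a * x_host + y_host)
```

The point mirrors the talk: only the single constructor line changes when moving between processors and languages.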
At the end, the result is checked against reference data. This is what I want to emphasize here: if I change the processor to GPU and change the language to CUDA, this program will run on the GPU without modifying any of the other parts. This is my goal, to change the hardware easily by simply modifying the options in this one line.

A simple performance test was performed using the two-dimensional wave equation, which is defined with the Laplacian operator, as in the equation written out below. I discretized this equation using the central finite difference method. Using this, a simple circular wave can be simulated, as shown in the animation. Programs for the two-dimensional wave equation were written in pure Fortran and C, and in PyMIP-based Fortran, C, CUDA, and OpenCL, respectively. The red box shows that the performance of PyMIP-based Fortran is about 4% lower than that of the pure Fortran code. While PyMIP-based CUDA performs well on NVIDIA GPUs, PyMIP-based OpenCL does not perform as expected on the Intel Xeon CPU. In any case, PyMIP did not suffer a big decrease in performance compared to pure Fortran and pure C, and I confirmed that it shows good performance on NVIDIA GPUs.

I will now introduce the results of applying PyMIP to the global atmospheric model being developed at my company. A global atmospheric model lays a grid, horizontally and vertically, over the Earth, as shown in the figure, and solves the atmospheric equations of motion at each grid point. The atmospheric model consists of three parts: the dynamical core, the physics processes, and data assimilation. I applied PyMIP only to the dynamical core of these three parts.

Rather than giving a complicated explanation of the global atmospheric model, it is much more effective to view a movie clip published by NASA in the United States. This movie clip shows a simulation result of the GEOS-5 model developed at NASA. Oh, I'm sorry, can you help me? It's absolutely tiny here; you can see nothing. Thank you so much. This movie clip shows a simulation result of GEOS-5, which is a model developed by NASA. It is interesting how the clouds come and go, and there are some typhoons near my country, Korea. A global atmospheric model simulates the atmospheric flow on the Earth like this. You can find this movie on the NASA homepage.

The governing equations of the dynamical core consist of the following conservation equations. The prognostic variables in these equations are the horizontal and vertical wind speeds, temperature, pressure, entropy, and water vapor. The latitude-longitude grid had been widely used as a grid on the Earth, as shown in the left figure; however, the higher the grid resolution, the lower its parallel efficiency. So we use the cubed-sphere grid, as shown in the right figure. The cubed-sphere grid consists of rectangular elements and their internal Gauss quadrature points, as shown in the figure. For the numerical methods, we use the spectral element method for the spatial derivatives and a third-order Runge-Kutta method for the time derivatives. The spectral element method has excellent parallel scalability.

I counted the code lines of the model being developed at my company. The total number of lines is about 239,000, and about 144,000 excluding comments and blank lines. Note that the dynamical core, which accounts for about 60 to 70% of the model's wall-clock time, is about 5,600 lines, which is only about 4% of the total. I changed only this part into CUDA and OpenCL codes and integrated them using PyMIP.
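The wave equation itself appeared only on a slide; the following is a standard reconstruction consistent with the talk's description (central finite differences on a uniform grid with spacing $h$ and time step $\Delta t$):

\[
\frac{\partial^2 u}{\partial t^2}
  = c^2 \nabla^2 u
  = c^2 \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right),
\]

\[
u_{i,j}^{\,n+1} = 2u_{i,j}^{\,n} - u_{i,j}^{\,n-1}
  + \left( \frac{c\,\Delta t}{h} \right)^{2}
    \left( u_{i+1,j}^{\,n} + u_{i-1,j}^{\,n} + u_{i,j+1}^{\,n} + u_{i,j-1}^{\,n} - 4u_{i,j}^{\,n} \right).
\]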
Many scientific computations, our model included, spend most of their computation time in a small portion of the code. That's why I think the methodology using PyMIP is useful for many scientific computations. There are about 45 subroutines used in the dynamical core, originally written in Fortran. I converted these subroutines into CUDA and OpenCL codes. This diagram shows schematically the workflow of the dynamical core using PyMIP.

This figure compares the wall-clock time of the dynamical core using PyMIP on an Intel CPU, NVIDIA GPUs, and an Intel MIC, respectively. The horizontal resolution of the model is about 100 km, which is very low resolution, and the forecast time is one day. The wall-clock time on the two-socket Intel CPU, which has 16 cores, was 3 hours 47 minutes. Based on this time, running the OpenCL code on the Intel MIC was about 2.9 times faster. Running the CUDA code on one NVIDIA GPU was about 7.4 times faster. Using two, three, and four GPUs with MPI, it was 5.9, 8.2, and 11 times faster, respectively. The specifications of the processors used in the experiment are shown in the table below. I think that using OpenCL for the Intel MIC was a bad idea; I plan to try ISPC instead of OpenCL later. Note that I don't want to say from this experiment that NVIDIA GPUs are better than Intel CPUs. I would like to show that using PyMIP makes it easy to use various modern processors.

That is the end of the main presentation. I recently found a new project named OCCA, which is similar to PyMIP. While PyMIP requires that the main program be written in Python, OCCA allows the main program to be Fortran, C, C++, or Julia as well. So if OCCA turns out to have more advantages than PyMIP, I might switch from PyMIP to OCCA. I also heard from a colleague about a metaprogramming project named Loo.py. Loo.py generates optimized target code for the modern processors from a source code that defines the loop structure and dependencies. The nice thing about Loo.py is that I can apply various optimizations, such as loop tiling and loop unrolling, without modifying that source code at all, as sketched below. If I could combine OCCA and Loo.py as shown in the figure, it would be great.

Let me summarize. I proposed a methodology to utilize the modern processors for large-scale scientific computations. I made a new Python module named PyMIP to integrate low-level codes such as CUDA and OpenCL, and I applied it to a very big, realistic problem, a global atmospheric model. I have two main future plans. One is to apply OpenACC directives to the Fortran model; my boss wants it. The other is to rewrite the model using metaprogramming such as Loo.py. I'll also check whether OCCA can replace PyMIP. Thank you so much for your attention.
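Assuming the project transcribed as "RUPI" is indeed Loo.py, the loop-based code generator from the same ecosystem as PyCUDA and PyOpenCL, here is a minimal sketch in the style of its documented tutorial; treat it as an illustration of the idea rather than a tested snippet.

```python
import loopy as lp

# Define the loop domain and the statement once, declaratively.
knl = lp.make_kernel(
    "{ [i]: 0 <= i < n }",
    "out[i] = 2 * a[i]")

# Optimizations are applied as transformations on the kernel object,
# without touching the definition above: here, split the i loop into
# chunks of 128 and map the pieces to GPU work-groups and work-items.
knl = lp.split_iname(knl, "i", 128, outer_tag="g.0", inner_tag="l.0")
```

Loo.py can then emit target code (for example OpenCL) for the transformed kernel, which is the "optimize without modifying the source" property mentioned above.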
The problem that I see when you have a high-level language that is compiled down to a low-level language like CUDA or Fortran is that when there is an error, you have to debug it. How good is the debugging functionality? For instance, do you get the right line numbers where the error is? How does debugging work in this system?

I'm sorry, I don't understand your question.

Let me rephrase it. When there is a mistake, an error, a problem, you need to debug the generated code. Does that work? Do you have an easy relationship between the generated code and the original code? What tools do you use?

In my presentation, I did not use metaprogramming, so I converted the Fortran code to CUDA, OpenCL, and C manually. I did not use metaprogramming, but I have a plan to use it.

Do you have any ideas on where we can use this technology outside of the scientific domain? Outside of, let's say, large-scale data processing like you have here with an atmospheric model? Any other industries where this might make sense? Where do you believe this technology can be used outside of the scientific world?

Actually, I think this method is a general method. But outside of scientific programming, there are alternative methodologies; I think there are good alternatives. For example, I have looked at Numba. Numba is a good choice for other applications. But many scientific programs are written in Fortran, and that's the problem: we cannot throw Fortran away, because many scientists use only Fortran. So I integrate Fortran with other low-level codes.

It's not a question, but maybe an answer to the question you just got. In data science, I see the best use case outside the scientific environment. In the data science industry, where you crunch a lot of data, many use cases are based on NumPy and similar libraries, and recent libraries in the deep learning industry, like TensorFlow, are very relevant to this kind of usage. So maybe it's worth trying your library for data science outside of the scientific industry.

Okay. Thank you for your comment.