for GPU. It consists of 12 NVIDIA V100-equipped servers. In all cases, our systems are equipped with InfiniBand interconnection. Here you see specifically the Avitohol supercomputer from the front and from the back side. It's a system with 150 servers for computations, equipped with non-blocking InfiniBand interconnection, and we also have some management and storage servers. About our system with GPUs, with NVIDIA V100 GPU cards: it consists of 12 servers from Fujitsu Primergy with 128 GB of RAM for each server, dual CPUs, 25 cores. They have SSDs and some hard disk drives, and they are equipped with InfiniBand.

And here the big data system: it has for computing eight servers which have 3 TB of RAM each, and they have four CPUs each. Each of these CPUs has 22 cores, so in total 88 cores per server, 3 TB of RAM per server, and 32 TB of SSDs for each server. Mellanox InfiniBand HDR at 200 Gb/s is the interconnection here. And this system of eight servers is connected with the other data servers, as you see here. These servers provide a Lustre file system to all the others, so they are not accessible for computing, only for providing storage. But in total we have more than five petabytes of storage, as you see here. This is provided by the disk enclosures, where each enclosure provides a total of 1.6 petabytes of disk storage. And we also have quite a lot of fast SSD storage here. But as I said, my talk will be concentrated more on using the Avitohol system.

Here you see the peak performance from the CPUs, in total 50 TFlop/s from the CPUs. We also have the peak performance from the accelerators, which are Intel Xeon Phi, 362 TFlop/s in total, so we get a total peak performance of more than 400 TFlop/s. And we have the real measured performance achieved using the LINPACK benchmark, as the system is visible in the Top500; this was 264 TFlop/s in total. The maximum power usage is approximately 250 kW. What we observed during the LINPACK test was something around 200 kW, probably a little bit more than that. In normal operation we reach something around 100 kW, probably because the applications usually do not stress the system that much.

Here you see more detailed hardware features of the system: 150 servers, almost identical, with dual CPUs of this type, Intel Xeon E5, 8 cores each. Hyper-threading is enabled; you can use it if you gain something from it. Our experience was that most applications gain something from using hyper-threading, even if only a little, from 15 to 20 percent; applications like LINPACK do not gain from hyper-threading. For the RAM, we have 64 GB per node of CPU-accessible RAM. As for the coprocessors, we have 300 coprocessors in total, Intel Xeon Phi 7120P. Each of these coprocessors has 61 physical cores, but the hyper-threading there is 4x, so in total we get up to 244 threads; these are logical cores on the coprocessors. Here you see a summary of the total cores on CPUs with hyper-threading, and also from the accelerators and from the accelerators with hyper-threading. We reach quite a large number when we use the accelerators with hyper-threading enabled, and our experience has shown that 4x is maybe too much, because the memory is 16 GB for each of the coprocessors. So one should be careful not to exceed this amount, because this can lead to a crash of the accelerator and even a restart of the whole server. But nevertheless we have quite a high number of threads, and in most cases it is usable and advisable to use maybe 120 or 122 threads per accelerator.
The total amount of RAM, as you see, is a combination from the CPUs and the accelerators. We have somewhat limited disk storage in this system, but as you noticed, the other system has petabytes, so it's not a problem to use that storage for storing files, backups and so on. The interconnection is non-blocking, which means that the communication between any two servers is not affected adversely by the communication of other servers, let's say. The latency is around 1.1 microseconds between any two servers in this system, and this is the bandwidth, the line speed of the interconnection.

So here I provide, for those of you that are interested in using this system, which is free for researchers, especially Bulgarian researchers: you can go to this address and see our policies and the forms which need to be filled in and provided in order to get access to the system. And here is some idea about the software environment. It's Red Hat Enterprise Linux. The execution is governed via a batch scheduler, so it is assumed that your jobs are submitted to a queue. And here you see a non-exhaustive list of the software that is available on this system. The open-source software is installed under the opt/software directory. We also have Intel Cluster Studio for development, and we are in the process of acquiring some new software that may be quite useful in the future.

So this was my brief presentation of our systems and their features from, let's say, the outside. Usually when somebody obtains access to this system, then one can see what exactly is installed, one can check, and so on. And I'm going to show you how this can be done. So I will stop this sharing and start another sharing; I'll share this screen now. So, of course, the system is running. The access happens through some SSH client; it can be accessed from anywhere. And once you get inside the system, it's advisable to go to this software directory, and you can see what kind of software... I hope you see my screen. You can see what kind of software, what versions and so on are available. And here we have some directory with documentation, which may be useful for you. Here we have some tarballs from previous trainings, where you can see examples. And in the documents directory you can see this Avitohol best practice guide. This guide was actually prepared for PRACE, for some previous version of the PRACE projects. And this guide is quite exhaustive about how to use the system, how to develop, or how to run existing codes in a more or less optimal way. So I think that this guide is very useful for anybody that is just starting to use the system, to see how exactly to submit jobs.

Well, I'm sorry to interrupt you, Itzanela. There is some writing in the chat from the participants that they cannot see. Is it possible to increase the font size a little bit? Okay, okay. I will try to increase the font here; I will try to increase the font of the screen. Is it better now? Yes.

So okay, as you can see, this is the directory where this guide is located. And here you can see what kind of open-source software is installed. Also, there is the Intel compiler available; many different versions are installed, and we always try to upgrade and buy the new versions, and they can be used. So regardless of what is available in the guide, I will try to give you some idea of how to make use of the system. So I copied here some examples that we can use, what I usually do when I do some testing and development. So I prepare some script which just does some sleep here.
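To make this concrete, here is a minimal sketch of such a do-nothing test script, assuming a PBS/Torque-style batch system (the ppn= syntax used later in the talk suggests this); the job name, queue name and wall time are placeholders, not the exact values used on Avitohol:

    #!/bin/bash
    #PBS -N sleep_test            # job name (placeholder)
    #PBS -q your_queue_here       # placeholder: the queue you are told to use after approval
    #PBS -l nodes=4:ppn=16        # request four full nodes, all 16 cores on each
    #PBS -l walltime=00:30:00     # short wall time, since this is only a test

    cd "$PBS_O_WORKDIR"           # switch to the directory the job was submitted from
    sleep 600                     # keep the allocation alive so the nodes can be inspected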
So I can try to submit it now. It goes to this big climate queue, because when I was dealing with this script I was using this queue. But for you, it will be some queue related to computational chemistry, dynamics or something like this. The queue is something that you will obtain information about, which queue to use, after your form is approved. But here you see that we request four nodes, and we say how many cores from these nodes. It's advisable to...

Excuse me, please, but people are asking if it is possible to increase it even a bit more. To me, it's also a bit difficult to see what's on the screen. My screen is very big, that's why I see it too well. Let's try even more. Is it better now? This is much better. Okay. So let's stay with this. This is okay.

So here the most important thing is always to use full nodes. It is possible to use partial nodes and to define that you are going to use just one CPU core and so on, but this leads to sharing of nodes between different users, which is very problematic. And if a job is not going to use more than, let's say, three cores, this is maybe not a job for the supercomputer, actually; this is a job that should be run somewhere else. So even if you are going to use less than the maximum number of cores, please always request full nodes. And for real high-performance usage, we expect that people will use many nodes, even far more than four nodes. Our system has 150. It is difficult to obtain the whole system for yourself, 150 nodes; it is quite rare to achieve a situation where you can really use these 150 servers. But, let's say, using half the system, 64 nodes, is quite normal, and one can expect that such a job will pass and will finish in some reasonable time. Of course, in some cases there are problems, because people run very long jobs, and then such a long job takes all the servers for a long time, and so they cannot be used for jobs from other people. But in any case, 32 to 64 servers should be possible to use, but always use all the cores in the servers. So requesting ppn=16 here means take all the cores from these nodes.

And once the job starts running, you can see here what is happening; lots of jobs are running. I can see my own job here. So I ran three jobs here; this is the last one. If I want to get more detailed information about this job, I can do it in this way, so I can see here where this job is being run. There is this situation that we are trying to prevent people from logging into nodes that are used by other people. So, for example, I see that somebody is using server 108. If I try to go into this server with SSH, this should not work. Although if I try to go to one of the servers that I am using, like this one, this should work; as you see, I was able to go there. And because the job is just a sleep script... if you were running something meaningful inside this script, then you would be able to see what exactly is happening with your job. You can do things like top and htop; maybe htop is more interesting, it shows more powerful and useful output. But because the script was empty, we do not see anything here.

However, my idea was to demonstrate some actual usage, so I prepared here something that is maybe useful for users from your domain. For example, here in this script, we have how to run GROMACS. So we have GROMACS installed somewhere; we can see that here from this command.
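Before moving on, a hedged recap of the submit-and-inspect workflow just shown, using standard PBS/Torque commands; the job ID and node name are purely illustrative:

    qsub sleep_test.pbs      # submit the job; prints a job ID such as 12345.master
    qstat                    # list running and queued jobs, including your own
    qstat -f 12345           # detailed information about the job, including the assigned nodes
    ssh cn042                # log in to one of *your* assigned nodes (name illustrative)
    htop                     # watch what is actually running there (or plain top)

Now, back to the GROMACS example just mentioned.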
So, in order to make sure we can use this executable that has been compiled, we should put into the path the libraries from a certain gcc version and also from the Intel compiler, and then we should do this. So we are going to have in the path a gmx_mpi. And if you are preparing a job script that will actually do the running for you, you have here: this is just the name that you assign to this job; how many nodes, this script was for two nodes; walltime, this is not required. But the queue that you are using, like this queue, has a larger limit, it allows jobs that run for more time. So if you specify 24 hours, this simplifies the work of the batch system to prioritize jobs and also enables us to see that your job is going to finish faster than the maximum for the queue. So if you know in advance that your job is going to finish in 24 hours, then you can specify something like that. It's advisable to specify a bit more: if you expect it to run exactly 24 hours, specify something more than that, to be sure.

So here we have some tricks related to how, from inside the job, to see where the job is running, and then to prepare a nodes file, actually a list of the nodes that are in the job. I did this already here, so I have in this file the list of hosts where one of the jobs that I submitted is running. Once you have this list, you can use it inside the script or, as we are going to do now, outside of the script. So here you can see some line that was considered to be useful to run GROMACS inside this job. However, I'm going to modify this line. So I will run this: I will first go onto one of these servers and go to this directory. As you see here in the script, it's important to switch to the directory from where the job was submitted; otherwise, it will start from the home directory, and this is not a very good idea because your data is not there. So you should have some command to switch to the correct directory.

So once we are inside, we are going to create some useful line to start this GROMACS. However, I think I don't have this active now, so I will repeat them: I export what needs to be exported, and now presumably I have access to this executable, so I can run it in any way that I need. I already had some lines here, right? So first of all, we should specify where to run; this is our list of hosts. How many MPI processes, so how many can I run? Let's say four: I have four servers, so if I run four processes, I can use all these servers. I need to specify also how many per server, because sometimes the system will put all the processes onto one server. You should always be careful not to do this, because I have had situations with people where they just do mpirun and specify something like 100, but without having proper access to the list of nodes. And if you do not specify, and if the system itself doesn't recognize where your nodes are and what's happening, it's possible that it will launch 100 processes on the same server. And because the servers are in general powerful, they may even be able to cope with these processes, and you're going to see a very high load on the server. Maybe sometimes the server will crash because of using too much memory and things like that. So that is why it's always important to properly specify which hosts we have access to and how many per host, so as not to have too many MPI processes per host, because they divide the memory.
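A hedged sketch of what this part of the job script might look like; the installation paths are hypothetical stand-ins for the real locations under the software directory, and Intel MPI's mpirun is assumed:

    # hypothetical paths: point the loader at the gcc and Intel runtime libraries
    export LD_LIBRARY_PATH=/path/to/gcc/lib64:/path/to/intel/lib:$LD_LIBRARY_PATH
    export PATH=/path/to/gromacs/bin:$PATH     # so that gmx_mpi is found

    cd "$PBS_O_WORKDIR"                        # run where the input data lives, not in the home directory
    sort -u "$PBS_NODEFILE" > nodes.txt        # the list of hosts assigned to this job

    # four MPI processes, one per server (-ppn 1), spread over the hosts in nodes.txt
    mpirun -machinefile nodes.txt -np 4 -ppn 1 gmx_mpi mdrun -deffnm my_run

The -ppn option is what controls how many ranks land on each host, which matters precisely because of the memory sharing discussed next.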
So if you have one host with 64 gigabytes of RAM and you launch one MPI process per host, this MPI process will use up to 64 gigabytes. However, if you launch two processes per host, they divide the 64 gigabytes and use 32 gigabytes each; okay, they divide it if they use it in some symmetric way. So the best way to use the parallelism that is available is to combine MPI and OpenMP. The theory and the practice seem to agree that this is the best way: to combine MPI between servers and OpenMP within the same server. So here, this server has 16 physical cores, so your best idea would be to use these physical cores through OpenMP and to use multiple servers through MPI. So in this case, four MPI processes and then for OpenMP up to 16, or 32, and so on. However, in some cases this is not exactly the case. I have tested and observed situations where GROMACS was best using two processes per node, and if you have two MPI processes, then maybe 16 OpenMP or 8 OpenMP threads per process, and so on. I'm not a specialist in GROMACS, but it seems that GROMACS controls the number of OpenMP threads with this parameter here, n-t-o-m-p (-ntomp), which I'm setting here to four. And then there are some other options: this number of steps, I will put 10,000 steps, and these are things specific to GROMACS.

So this line should be able to start GROMACS on these four servers. Let's see if it works. So it says using four MPI processes, from the first parameter that we have here, -np 4, and then four OpenMP threads, which was from the second parameter here. This numactl is not truly necessary, but it doesn't hurt; one can launch gmx_mpi directly here as the executable. And we can also see what is happening on these servers, but I will stop this sharing and share from the other console so that you can see exactly what is happening. So I will stop this sharing and share the other console. I think you're seeing this one. Now I will need to increase the font here also.

So I'm logged on to that node. As I explained, you have the right to log on to nodes that are yours, that are running your jobs. So you can see here what is happening if you run htop. You see four threads are running here. The memory is not heavily used, so we have free memory. And we have everything in green here, which means that the processes are not waiting for something to happen, for I/O, things like that. So if you see here less than 100%, then you should consider what exactly is happening. Swap is very undesirable; it is expected that the jobs that are run on the supercomputer are not going to be swapped. It's just available for some extreme situations. It is there, but you should try to make sure that the swap is almost not used. Here, some small amount of swap is used for some system reason, but we have a lot of free memory.

So, in this way, if I stop this one... You can do anything you like here: you can go to the directory where your job is running, you can see what is happening, you can see the log, things like that. But you're not required to do that. If you submit the job through the batch system with these lines inside the job, then the job is simply going to run and you don't have to look at it at all. So here, we can again see what is happening with these jobs. You can see the status; currently the status of all these jobs is running, but sometimes they are queued. When a job is waiting in the queue, there are probably some reasons, such as there not being enough resources for the job.
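Going back to the GROMACS launch that was just demonstrated, a hedged sketch of that hybrid MPI plus OpenMP invocation might look like this; the input file name is a placeholder, and -ntomp and -nsteps are the GROMACS options mentioned above:

    export OMP_NUM_THREADS=4                       # OpenMP threads per MPI process
    mpirun -machinefile nodes.txt -np 4 -ppn 1 \
           gmx_mpi mdrun -ntomp 4 -nsteps 10000 -s topol.tpr

On a 16-core node one would more typically go up to 16 OpenMP threads with one rank per node, or two ranks of eight threads each, depending on what benchmarks best for the particular input.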
Sometimes the job enters some bad state and sometimes it needs to be cleaned. In my experience, what happens sometimes is that there are applications that make use of some system resources that in some cases are not cleaned up properly. For example, the semaphore situation: there are applications that use semaphores, and if the application crashes, the semaphores still remain marked as used, but they cannot be used by anybody else, and so even your next job will not start properly. So if you observe something like that, you can try to clean the semaphores yourself, or you can just submit a request for us to clean these semaphores manually. There is a command that cleans them, but it has to be issued on the proper node. So this is just something that sometimes happens. If some node becomes defective, let's say, in this way, it can create problems. If you cannot do anything about it and your job lands on a node that is in some strange state, then you can just submit a request to deal with this situation, or you can submit a job that just takes the node without doing anything and then try to obtain different nodes. Like you see here, I obtained these four nodes, and in a previous job I had these four nodes. In our work, what we have done before was to have testing jobs that test whether the nodes respond properly. For example, you can use simple tests that do some simple MPI communication in order to see whether everything works. Here, I have some example program like this hello MPI C program.

I just don't know how good you are at programming and how interested you are in this, because you could as well use ready-made applications. As I showed, we have a lot of applications from the domain of computational chemistry installed here. Not all of them are accessible to everybody; there are some that are under license, and then they are accessible only to the group of people that have the appropriate license. But for the open-source applications here, it's no problem to use them, and you don't need to be a specialist. Nevertheless, even if you are not a specialist in how to compile applications or how to develop software and so on, even if you are only a user, there is still a lot that can be done. You saw here some options that were used on the command line, and here, in the GROMACS launching, you don't see too many such options. This pin option is an option for the application GROMACS to pin, I suppose, processes to cores. But there are lots of options that are available to you when launching an MPI job. In the manual, you can see how exactly you can launch. But this Intel launcher, I think it is a script, maybe a Python executable, mpiexec.hydra, has many options. And these options can be very useful, and they can impact not just the correctness of the execution. Currently, we have correct execution with four MPI processes and four OpenMP threads. But here I have some... unfortunately, it is divided into several lines for you. You can see here, apart from this option, which just specifies which nodes to use, there is an option to set the stack size for each of the OpenMP threads, so that it is not too big, and also to specify what type of affinity will be used. These are options specific to the Intel compiler, in this case, because it is the compiler that was used; I think for this executable it was the Intel Fortran compiler, combined with the Intel MPI library. So this is an option for how to deal with these OpenMP threads inside.
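For illustration, a hedged sketch of such a launch line, with the stack size and affinity settings passed through to every MPI process via -genv; the variable names are the standard Intel OpenMP and MPI controls, but the values are only an example, not the exact settings used for the executable shown here:

    mpiexec.hydra -machinefile nodes.txt -np 4 -ppn 1 \
        -genv OMP_NUM_THREADS 16 \
        -genv OMP_STACKSIZE 64M \
        -genv KMP_AFFINITY "granularity=fine,compact" \
        ./my_application          # placeholder for the actual executable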
And there are many possibilities; this is just one possibility that apparently was good for this executable. This option, I don't even remember what the meaning of it was, and so on. So these options, specified with this set-environment-variable flag for all of the MPI processes, were related to OpenMP. After that, you see certain options related to the Intel MPI library; they are specific to it. And here, we specify the best possible way to use this library and the available fabrics. In this case, we have InfiniBand, and we have shared memory inside the node. So if I'm launching more than one MPI process per node, which I was doing when testing GROMACS, I can use shared memory for communication between them. And this is better than using, for example, TCP, which would be much slower, with less bandwidth, higher latency, and so on.

And in the Intel library there are even options to specify which algorithm to use for which kind of collective communication. I don't know how knowledgeable you are in using MPI, but MPI uses either peer-to-peer exchange of messages or collective communications. Some collective communications in some systems are implemented even in hardware, but sometimes the library provides these collective communications as some combination of peer-to-peer communications, and for this, some type of algorithm is used. And Intel, in this case, provides a setting to change which exact algorithm is to be used for which collective communication. In this case, we change the algorithm for the gather vectors (allgatherv) collective communication to be algorithm number two, because I probably tried all of them and saw that this one was best, in this particular application even. Here again, we have OMP_NUM_THREADS set to one in this case, but this is very specific to the type of application that you are seeing here: it was using mainly MPI, multiple processes per server, but this is just for this application. Usually, OMP_NUM_THREADS should be 16 or 32, rarely less than 16.

And there are many other options for the Intel MPI library, for which there are some very useful documents where it is described what you can try, and you can freely try this; it's quite useful. I will share here one document which is available at CSC, which is a partner in PRACE, I think, also. And there are many other such documents which describe how exactly you can tune an MPI application. So here actually it's an Intel document, which is available on the site of this partner. Here you can see what kind of options you can use. And I showed you that I have tuned the gather vectors collective, but how did I know that it was important to tune even this one? Through the use of the built-in profiling: with this option, I_MPI_STATS, it was enabled, and I saw that this type of collective communication was used a lot, and so I tried to change the algorithm to see what happens. And here you see many other options that can be used. Sometimes, if the program is crashing, you can enable debugging with this Intel MPI debug option, I_MPI_DEBUG, with a higher level, so that you can see why it's not progressing. There were many situations where we had to specify something in order to avoid failure of some applications when using lots of processes, lots of threads. For example, there was some scalable progress option; this one was perhaps the best option even on our system, and you can see here that for this one it was important whether it's enabled or disabled; enabled, I think, was the good choice.
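A hedged sketch of these Intel MPI controls as they might be set before a launch; the variable names are real Intel MPI settings, but the particular values are just illustrations of what one might try, not a universal recommendation:

    export I_MPI_FABRICS=shm:dapl        # shared memory inside a node, InfiniBand (DAPL) between nodes
    export I_MPI_ADJUST_ALLGATHERV=2     # use algorithm 2 for the allgatherv (gather vectors) collective
    export I_MPI_STATS=3                 # built-in statistics, to see which collectives dominate
    export I_MPI_DEBUG=5                 # more verbose output when a run misbehaves
    export I_MPI_SCALABLE_PROGRESS=1     # the scalable progress option mentioned above
    mpiexec.hydra -machinefile nodes.txt -np 8 -ppn 2 ./my_application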
But these are not always required, and they are not always useful, but they are possible. Here, if you are interested, you can check; setting these options is probably not going to crash the system, so you can experiment, you can see what happens, what works and what doesn't work. For example, this spin count means how many times, when waiting for a message, the process will just loop and check again and check again, that is, making use of the CPU. Sometimes this is not good; you can make it check only once and then switch to a different way of waiting for the message, which will not use the CPU, but then the latency is increased while the use of CPU is decreased. This may be useful if you use hyper-threading; if you are not using hyper-threading, it's probably not worth playing with. So there are many options that are not that important; they don't always improve things, but you can hope to get maybe a 20% improvement by playing with these options, or you can just ask what is suggested for you, or what is known to work for some particular applications. I will stop sharing this one and switch back to the other screen.
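As an aside on the spin count just mentioned, a hedged sketch of what such a setting could look like with Intel MPI; these variables exist in the older Intel MPI releases of this era, but whether they help depends on the application and on whether hyper-threading is used:

    export I_MPI_SPIN_COUNT=1     # check for a message only once before giving up the CPU
    export I_MPI_WAIT_MODE=1      # then wait without spinning (lower CPU use, higher latency)
    mpirun -machinefile nodes.txt -np 4 -ppn 1 ./my_application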