Hi everybody, my name is Frédéric Wautelet. I'm a system engineer in charge of the management of the HPC infrastructure at the University of Namur, and I'm very pleased to give you a quick overview of what we are doing with EasyBuild at CÉCI. First of all, what is CÉCI? CÉCI stands for the Consortium des Équipements de Calcul Intensif. It's a consortium of the high-performance computing centers of the five universities in the French-speaking part of Belgium: UCLouvain, UMons, the University of Namur, ULiège and ULB. CÉCI is funded by the F.R.S.-FNRS and the Walloon Region, and all the facilities of the consortium are accessible to all the researchers of the member universities. There are about 450 active users at CÉCI, and you can see here a quick overview of the fields of application that run on the CÉCI clusters.

There are five clusters, one in each institution, and the CÉCI infrastructure is in fact designed to accommodate a large diversity of workloads and needs from the researchers. For example, Hercules and Dragon1 have been designed to cope with sequential jobs, and the three other ones are designed to run parallel jobs. Currently the CÉCI infrastructure is being upgraded: Lemaitre3 was deployed last year, and this year we will upgrade Hercules and Dragon1. You will see that for Namur, we will install AMD EPYC processors, and in Mons they will install Skylake, as well as several nodes with NVIDIA Tesla V100 GPUs.

CÉCI also provides a common storage that is visible from all the frontends and the compute nodes of all the CÉCI clusters. Each cluster has its own gateway to access this distributed storage, and the technology behind it is GPFS, renamed Spectrum Scale by IBM. The main purpose of this storage is to allow the users to transfer data between the clusters, but also to store the scientific applications, or part of the scientific applications, that we deploy at CÉCI.

In terms of building the scientific applications, there are several objectives. The first one is to allow reproducibility of the science, by allowing the users to reuse older versions of the software. We also aim to provide optimized versions of all the software we install. We try to version all the build recipes as well as the infrastructure code we use. We also want to install all the scientific applications in an automated way. Finally, we would like a system that is flexible enough to cope with new hardware and new node architectures, and also with the new needs of the users.

There are different challenges. The first one is that CÉCI is a multi-site consortium, with five HPC teams and heterogeneous hardware with several architectures, and each team has its own workflow, its own hardware, etc. We also plan to set up a cluster federation that allows a user to submit a job from one frontend to any compute node at CÉCI. And we have a bunch of commercial applications, such as MATLAB, Gaussian and VASP, so we must take the license limitations and restrictions into account.

As for the design choices of the setup, the first one is that all the software installed at CÉCI is optimized for the architecture of the nodes. We mainly use compilation farms to compile the applications, or the compilation is done directly on the compute nodes using EasyBuild's `--job` option through Slurm. We then generate packages using FPM; these packages can also be installed on the distributed storage.
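To make that flow concrete, here is a minimal sketch of building with EasyBuild on the compute nodes and then packaging the result with FPM; the easyconfig name, package name and paths are made-up examples, not the actual CÉCI ones.

```bash
# Build a (hypothetical) application with EasyBuild, submitting the
# compilation as a job on the compute nodes instead of the frontend:
eb GROMACS-2018.3-foss-2017b.eb --robot --job

# Package the resulting installation directory as an RPM with FPM,
# so it can later be installed, upgraded or uninstalled cleanly
# (package name, version and paths are made up for the example):
fpm -s dir -t rpm \
    -n ceci-gromacs -v 2018.3 --iteration 1 \
    --prefix /opt/ceci/soft \
    /tmp/eb-install/software/GROMACS/2018.3-foss-2017b
```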
And the use of packages allows us to cleanly uninstall any application without breaking something, and all the generated packages are archived, so we can downgrade an application if needed. Also, as I already said, a large part of the applications are installed on the distributed storage. And of course we use EasyBuild to install almost all the applications at CÉCI.

In terms of directory layout, you can see here an example of the current layout on the distributed storage. The first thing to explain is that, because we have commercial applications with restrictive licenses, we need to install some applications locally on each cluster, to avoid problems with the legal teams or something like that. And the rules here are: freely available applications, mainly open-source software, are installed on the distributed storage, with the exception of the Intel and Portland compilers and Java; all the restricted software, i.e. commercial software or software that requires registration, is installed locally and is only available locally on each site. In terms of dependencies, applications installed locally can depend on applications installed on the distributed storage, but not the other way around.

Let's take a closer look at the first level here. There is a naming convention. The first part identifies the distribution with its version, and is in fact based on the Ansible facts, as Ansible is the tool we use to automatically deploy the nodes, but also the compilation farms. So for example, you can have a RedHat 7, maybe a 9, or something like that. The middle part is the identification of the processor, in terms of processor family, model and stepping. And the last one is the interconnect. So, for example, these are the architectures currently installed on the common storage; 'none' is used for gigabit-Ethernet-connected clusters. There is a special case, noarch, for architecture-independent software; typically, at CÉCI, it is mainly used to store databases such as NCBI BLAST or SwissProt.

At the second level, we have the installation framework. You can see that currently we use only two installation frameworks: EasyBuild and manual installation. Manual installation is used for applications for which we don't yet have recipes, or for example to install the BLAST databases, etc. We only support EasyBuild, but the directory layout is set up to accommodate another installation framework if needed.

The third level is based on the EasyBuild release. In each subtree here, you have all the modules of software compiled with the same version of the toolchains; so for example here, you have all the applications compiled with the 2017b foss and intel toolchains. There is a default release set CÉCI-wide, so all the clusters have the same default release. This default release is updated once a year, and each release is independent of the others. The users always have access to the older releases, and can test future releases. And there is a special case for what we call TIS modules, for toolchain-independent software. It's mainly designed for applications that need a specific version of a compiler; for example, Gaussian is only compiled with a specific version of the Portland compiler. It's also the case for software bundles shipped as binaries, such as MATLAB, Mathematica, Crystal, etc. We upgrade these modules every three or four years.
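The talk doesn't show how that first-level path component is computed, but based on the convention just described (distribution plus version, CPU family/model/stepping, interconnect) a minimal sketch could look like the following; the exact path format, the vendor label and the `ib`/`none` detection are assumptions for illustration only.

```bash
#!/bin/bash
# Hypothetical reconstruction of the first-level naming convention:
#   <distro><version>/<vendor>-<family>-<model>-<stepping>-<interconnect>
source /etc/os-release                                # provides $ID, $VERSION_ID
distro="${ID}${VERSION_ID%%.*}"                       # e.g. "rhel7"

family=$(lscpu   | awk -F: '/CPU family/ {gsub(/ /,"",$2); print $2}')
model=$(lscpu    | awk -F: '/^Model:/    {gsub(/ /,"",$2); print $2}')
stepping=$(lscpu | awk -F: '/Stepping/   {gsub(/ /,"",$2); print $2}')

# Assume InfiniBand if an IB device is present, gigabit Ethernet otherwise
if [ -d /sys/class/infiniband ]; then ic="ib"; else ic="none"; fi

echo "${distro}/intel-${family}-${model}-${stepping}-${ic}"   # vendor hard-coded here
# e.g. rhel7/intel-6-79-1-ib
```

In the real setup this information comes from the Ansible facts gathered when the nodes are deployed, rather than from a script like this.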
As for the EasyBuild releases, the older releases are always available, and the future release can always be used as a beta. So this is the view of the storage from the nodes: every node mounts noarch and, depending on the architecture of the node, the correct subtree. This means that on every node, the path to a given application is the same, but it points to a version optimized for that cluster. This allows us, when possible, to run a parallel application on nodes with different architectures using the same path. You know that several users love to hard-code paths in their submit scripts; in this case too, the path to an application is always the same on every node.

As far as the build procedure is concerned, there is no common workflow between all the CÉCI sites for the moment. For example, on one site, the build is done directly on the HPC nodes using `eb --job`. At UNamur, we set up a compilation farm with virtual machines: each virtual machine emulates one of the architectures present on the HPC clusters. At UNamur we have four different architectures: two in production, one for the test cluster and one for the students' cluster. All the compilation is done in parallel, each build on the machine that emulates the correct architecture. We then generate RPM packages; those RPM packages are archived, then installed on the common storage or on the local storage, depending on the license. We keep all the RPMs: if we need to upgrade a software, we increase the version of the RPM, so we keep every version of the build in case we have to downgrade the software. So at UNamur, we have one build node (a virtual machine) per architecture, and one install node per architecture.

From the user's point of view, we plan to set up Lmod 7 on all the CÉCI clusters. Currently, we only support a minimal number of toolchains, namely foss and intel, as well as the Portland compiler, which we need to compile Gaussian. We use the default EasyBuild module naming scheme; there is no plan to use another module naming scheme for the moment. And all the users have access to meta-modules; I will show you later how to use them. The principle is to avoid `module avail` listing a big pile of modules: when you run `module avail` on the cluster, you only see the meta-modules, the TIS modules and the packages built with the default release. If users want access to an older release or a future release, they have to load the meta-module corresponding to the release they want; the default release is then unloaded, and `module avail` only shows the older release in this case. So the user can switch between releases, and `module spider` always shows all the modules available in every release. That way the user knows which meta-module to load to access a specific version of an application. It's quite important to allow the users to use older versions of the applications, and normally the older versions are frozen: we don't change anything, so the user can redo his simulations.
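A minimal sketch of what that release switching looks like from a user's shell; the meta-module name `releases/2016b` is a hypothetical example, not necessarily the actual CÉCI module name.

```bash
# Default view: only the meta-modules, the TIS modules and the
# software built with the default EasyBuild release are listed
module avail

# Load the meta-module of an older (or future) release; the default
# release is unloaded and module avail now shows that release instead
module load releases/2016b    # hypothetical meta-module name

# module spider searches across every release, so users can find out
# which meta-module to load for a specific version of an application
module spider GROMACS
```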
In terms of future work, we plan to set up generic builds based on a generic architecture; it's just a way to provide non-optimized builds. For example, if we set up new nodes with a new architecture, we can use them while the optimized builds are not yet available. It is also useful to provide users with packages to install on their local workstations: at UNamur, users want the same software installed on their workstation as on the HPC cluster. For a future release, we plan to enable RPATH linking; the work is almost done. We also plan to update the documentation automatically after every build, so that users have on the web page a table telling them on which cluster they can find which version of an application.

So, as a conclusion: at CÉCI, we provide optimized builds for all the architectures on the systems; we allow the users to load older versions; we give access to future releases, for which the user just has to load a meta-module; and we support the special case of applications that only depend on a specific toolchain. Thank you. Any questions?

Thanks for your talk. Could you put back your slide with the `module avail` screenshot? I'm just wondering, because you use the flat naming scheme, but it seems you work with a software hierarchy.

No, no, no, it's only based on releases. Each release is independent, and when you use, for example here, `module load release/2016b`, you switch the release. It's not a hierarchy, it's just based on the release.

It actually is a hierarchy, you just don't know it. You only have one level in the hierarchy, which is the toolchain. And when you're loading a different release, what Lmod sees is a change in the module path: it knows it has to reload all the modules, or it will change the modules later. It actually is a hierarchy, but it's probably not managed through EasyBuild.

No, no, no, not at all, yeah; same as the build stages they have at Jülich, yeah, exactly.

More questions? Very nice talk, very impressive, and I have two questions actually, brief ones. One is: why do you keep your license-restricted software on different file systems? Why don't you protect it just with ACLs, for instance?

Yes, there are some ACLs; we use Unix permissions to block users from using license-restricted software. And these ACLs are per site; there is no global ACL, so each HPC team has its own ACLs to allow users to use the commercial applications on its site. For example, at UNamur we have a MATLAB installation, and MATLAB is only available for the local users.
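As an illustration of that kind of per-site restriction, here is a minimal sketch using a Unix group and POSIX ACLs; the group name and path are made up for the example.

```bash
# Restrict a licensed MATLAB tree to members of a local group
# (hypothetical group and path names):
groupadd matlab-users
chgrp -R matlab-users /opt/local/soft/MATLAB
chmod -R o-rwx /opt/local/soft/MATLAB

# The same idea with a POSIX ACL, which also works for extra groups:
setfacl -R -m g:matlab-users:rX /opt/local/soft/MATLAB

# Grant access to a user covered by the local license:
usermod -aG matlab-users alice
```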
I was wondering, because on our side we had the idea that we have a module which is available for every user: every user can load the module, but can't execute the respective software, because the module just shows a message on loading saying that the software is restricted, and in the case of, let's say, an MD code, a user has to print out their license, send it to us, and then we enable it; it's just less work. That's why I'm asking. Also, the second question: you compile on VMs, and I'm wondering, that's kind of dangerous, because then you have to maintain two software stacks, one locally on your actual hardware, and it always has to be in sync with the respective VMs; is that actually necessary? Because in our case we compile with `--job` on our machines, and if a build fails we know: if it's failing due to a missing system library, then perhaps the node wasn't installed correctly, or we made a mistake, and then we see it immediately.

So, the VMs used in the build farm are an exact mirror, in terms of installation, of the production nodes. Because we use Ansible and packages, we can easily rebuild nodes by just reinstalling all the RPMs, and if we break something, we can go back and reinstall from a specific checkpoint.

Okay, thanks. Thank you again.

Okay, so hello everybody. My name is Luca Marsella, and I'm going to present the CSCS site update. CSCS is the Swiss National Supercomputing Centre, located in Lugano, Switzerland. The talk is based on information provided by my colleagues Theofilos and Guilherme as well, from the scientific computing support and the computing and data services support groups at CSCS. We will talk about the usage of EasyBuild at CSCS, the timeline of this usage, and the use cases we have on two production systems called Kesch and Escha, which are used by the Swiss national meteorological service, MeteoSwiss. Then the usage of EasyBuild on the main production system, which is called Piz Daint and is a Cray XC system. Then we will talk about the usage of Jenkins to deploy the CSCS software stack, using pull request integration with GitHub and the use of Jenkins pipelines. And then the final remarks, with some future plans for adapting EasyBuild to our use cases.

This is the timeline of the usage of EasyBuild at CSCS. It started in 2015; then, with the help of Kenneth and Petar, it went on to providing Cray software stacks as well, Cray toolchains, on the system, in collaboration with Guilherme. Over the years the usage increased, and it is now fully deployed: after the major upgrade at the end of 2016, we deployed the whole software stack using EasyBuild, with a few exceptions on some systems, like the MeteoSwiss system, for software that is somehow quite specific; we are still working on deploying that within EasyBuild. There is also the need to convince some users about the usage of EasyBuild once you actually deploy some recipes. In our repository, the number of pull requests for new recipes is now over a thousand, so it is widely used by our users as well.

The HPC systems that we have at CSCS are listed here. At the top you have Piz Daint, the main production system, which consists of two types of architecture. The first type, called Cray XC50, has nodes with a Haswell processor and a Pascal GPU; the second architecture, called Cray XC40, consists of nodes with two sockets with Broadwell processors. Other systems are listed here: I already mentioned the MeteoSwiss systems, Escha and Kesch, which are Cray CS-Storm systems with multiple GPUs in a single node. And then we have listed here two, let's call them, general-purpose systems: Leone, which is a large-memory system dedicated to a company, and Fulen, which is a virtual system built with OpenStack, based on Intel Broadwell and Skylake architectures, with Skylake for the fat nodes, which also come with GPUs. For the moment, I think Fulen is our test bed, I'd say, for using generic toolchains, because on the Cray systems we were limited to actually using the Cray toolchains most of the time. We played a bit on the MeteoSwiss system with the gmvolf toolchain, but now Fulen is the one using the standard intel and gmvolf toolchains. We use gmvolf because, in our MPI tests, MVAPICH2 performs better than OpenMPI; that's why we are sticking to that.
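For reference, a quick sketch of how such toolchains can be inspected and used from EasyBuild's command line; the easyconfig name and toolchain version below are hypothetical examples.

```bash
# List the toolchains known to EasyBuild and check that gmvolf
# (GCC + MVAPICH2 + OpenBLAS + FFTW + ScaLAPACK) is among them:
eb --list-toolchains | grep -i gmvolf

# Build an application against the gmvolf toolchain, resolving
# its dependencies automatically (hypothetical easyconfig name):
eb GROMACS-2018.3-gmvolf-17.02.eb --robot
```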
This is the architecture description of the MeteoSwiss system: as I said, it's a Cray CS-Storm system, with nodes with two sockets with 12-core processors and 8 NVIDIA Tesla GPUs. The EasyBuild software stack was deployed in production initially by Guilherme in 2015, and initially we were forced to build also basic things like the GCC compiler, because the compiler initially provided by Cray did not employ vectorization correctly, as it was supposed to.

Piz Daint, the main production system, is based on Intel Haswell processors and Pascal GPUs on the XC50 nodes, and the XC40 compute nodes, as I mentioned before, have Broadwell processors. At the moment we are improving things a bit, so the numbers might change a little in terms of total scratch capacity; for the rest it's more or less the same. So Piz Daint, which as I mentioned underwent a major upgrade at the end of 2016, after which we fully deployed the software stack with EasyBuild, is now number 5 in the Top500, so it's still the fastest in Europe, and number 18 in the Green500; this is according to the November 2018 list, after Supercomputing, and it refers to the Cray XC50 partition only.

In terms of software that we release on the main production system, we have on the left software that is based on recipes already available in the EasyBuild repository, which you can find in the easybuilders GitHub organisation. The Python recipe here is actually based on cray-python, so it's not exactly the stock EasyBuild recipe. We do that because Cray already provides two Python modules, and Cray separates Python 2 and Python 3, so we follow this separation, and we will continue separating Python 2 and Python 3. This is simply to avoid duplicating things that are already present: NumPy and SciPy, for instance, are already there in the base installation provided, so we will stick to that. On the other systems, we use the standard EasyBuild way of building Python.
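EasyBuild can pick up such Cray-provided modules as external dependencies through a metadata file; here is a minimal sketch of that mechanism, with a made-up module version, prefix and easyconfig name rather than CSCS's actual configuration.

```bash
# Describe what the Cray-provided module delivers, so EasyBuild can
# treat it as an external module (version and prefix are made up):
cat > cray_external_modules.cfg <<'EOF'
[cray-python/3.6.5.1]
name = Python
version = 3.6.5
prefix = /opt/python/3.6.5.1
EOF

# An easyconfig can then list cray-python as an external module
# dependency, and EasyBuild resolves it using this metadata:
eb MyApp-1.0-CrayGNU-18.08.eb --robot \
   --external-modules-metadata=cray_external_modules.cfg
```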
CP2K and GROMACS, Boost, GSL... well, for GROMACS, I think we use a custom easyblock that was provided by Victor; I don't know if it has been pushed back. For CP2K, yeah, I remember I had some discussions with Kenneth about CP2K as well; I think we still use a custom easyconfig, just because it is a bit more flexible. Sometimes we need modifications requested by users, because users are not satisfied with a single installation: they always want to modify it a bit. And to let them modify it, since we have custom easyconfigs modified by users, it's easier to provide them with a modified easyconfig instead of modifying the easyblock each time. CP2K is software that is continuously updated from SVN, so users still want to rebuild it, and sometimes they really need some custom features, so it's really difficult to stick to a standard version.

In terms of usage, you see on the right side the XALT statistics of the system: CP2K and GROMACS are actually widely used, among the most used software at CSCS. "User code" means that we cannot identify it, so it's something that is maybe compiled directly by the user; it may be some known code, but it doesn't fit into our name mapping of executables and modules, so we cannot identify it immediately and it needs further inspection. We also have software like Quantum ESPRESSO or VASP that is already built using EasyBuild as well. ANS is a materials science software; I don't think we have a recipe for that, and usually those users tend to build their own version anyway. Users of Quantum ESPRESSO, VASP or CP2K, as I mentioned, want to rebuild anyway, so we provide a module (we use the standard TCL modules framework), but they generally tend to use their own builds.

So what we provide for them is, anyway, a system to modify the recipes that we actually build for them. When users ask for something new, a modification of an existing software that we provide or some new software, we provide them with a modified easyconfig file: we give them access to our GitHub repository, and we point them to our local EasyBuild installation and to the documentation on our user portal, where we explain how to compile code in general on our systems, and in particular how to build within the EasyBuild framework. The local EasyBuild setup uses a module to load the EasyBuild environment, and then users can use their own recipes, or our recipes, to rebuild the software that we support. This has been working for more than a year now, so I think it's pretty stable.

In terms of our software deployment, it is based on Jenkins. I'm now going to describe a bit how this is arranged in our GitHub repository, because we also use it to deploy the standard software stack as well as to test new recipes. Jenkins is the basic tool that we use as a continuous integration tool, for testing new recipes and deploying the software stack of the packages that we maintain and support with our easyconfig and easyblock files in the EasyBuild framework. So we have three different main projects: the testing project, to test the recipes coming with a pull request to our repository; the production project, which continuously builds the software according to the software list that we support; and a regression project on Jenkins, which is meant to test the build of our software stack from scratch every week, in order to check, for instance, download links and compatibility with updates and modifications on the system.
Recently we have moved to pipelines, which means high flexibility to perform the activities within Jenkins in parallel. Each project has a separate Jenkinsfile, which contains the configuration of the project and is under version control, so every modification is tracked, and you can easily run the continuous integration tool in parallel in order to better exploit the resources that we have.

To submit a pull request to our production repository, these are the rules that we advertise on the GitHub page: it's common sense, plus some slight customization due to our architectures (GPU, non-GPU) and the names of the systems. I think not so many users are submitting pull requests, but I'm pretty sure that many users are actually using their own custom files, so I think it's more or less the same as what happens for EasyBuild in general: users tend to modify the easyconfig files, but sometimes, due to time constraints maybe, they do not contribute back.

And this is an example showing how it works. Here, "gpu" means the architecture with the actual GPU accelerator: you are submitting a pull request for Caffe2, which is a data science software, so a new easyconfig file. After the pull request is submitted, it gets to the CI and goes to the daint-gpu machine, so the architecture label daint-gpu, and you see on the right side how the Blue Ocean interface of Jenkins is dealing with it in a pipeline. In case you had multiple entries here, like dom-gpu or dom-mc (multicore) for another system name, then you would see multiple dots on the right side, like in this case; that just means all the possible architectures on this system, and then you see the two different architectures as the two green dots over there. And for the regression test that I mentioned, which weekly tests all the recipes that we have in our software stack in order to ensure backward compatibility with the modifications that happen over time on the systems, you can see that all the systems are built; here it was a very lucky case, all green. Anyway, it's quite useful to have this interface, the Blue Ocean interface, and the pipelines on top of it, which are quite flexible.

Now, to conclude with the final remarks: we have been testing lately building EasyBuild recipes within Docker containers. The reason for that is to have reproducible builds of legacy software stacks, so outdated software that maybe cannot be built any more on recent updates of parts of the operating system; and then it's also due to resistance to change from users: they are used to a certain software that maybe builds only on, I don't know, RedHat 6, and that's the reason to use a container. Successful attempts so far: CentOS 7.x and RedHat, with the caveat that you need a subscription to use the yum package manager, because otherwise you need to download all the packages yourself, which is a bit annoying, because of course containers in general do not come with what you would normally get if you installed the system yourself. I tried with Ubuntu and it's much easier, although it's less useful because we don't have it in production, so it was just for testing. And the Cray Programming Environment as well: the MeteoSwiss group tested that too, producing an image, but one that can only be used on a Cray system with the license for the software.
The required steps: as I mentioned, you need to manage all the dependencies yourself in order to fully bootstrap EasyBuild, like Python or Lmod, and compilers like GNU and Intel; and then you need to create users and deploy the Docker image to the users, to let them access the image you have produced. That is still work in progress, but it's probably one of the major things we will have to do in the near future.

Then, moving to EasyBuild officially at an HPC site takes some time, as you also mentioned before, Kenneth: the learning curve and the resistance to change; in this case it's not the resistance of the users but the resistance of the management, which is even stronger. And then easyconfigs versus easyblocks. The advantage of easyblocks is that they are reusable, so it's great for stable software, but there's an overhead if you have, let's say, bleeding-edge software, something new needing some workarounds in order to be installed. Then the reproducibility, and how to keep track of changes: if you need to hack the installation, it's a bit difficult, because you are supposed to submit a pull request every time. At a certain point there is a trade-off with having a local easyblock: as you said, if you have more than five things to override, it's too many, but sometimes it's really difficult to continuously modify it and have it accepted in a pull request. On the other hand, easyconfigs provide the advantage of being self-contained; those are the pluses, but of course you duplicate a lot of things, because you need to copy and paste all the time, although you can modify them easily.

Then the wishlist. I think there is already an ongoing discussion, as far as I know, between Alan and Victor about a new command-line option allowing a try-style update of dependency versions, so that dependencies get updated to a given version as well; still work in progress. And then Guilherme mentioned having an extended dry run with some more information in terms of logs and command lines, maybe a separate log for each command; and a backup of custom easyblocks, to ensure that builds are reproducible.

What do you mean with the second thing, exactly?

Have a separate log file for each command that runs in the extended run. When you run a build, it would be nice to have a separate log with only the commands that are run, so that you can easily copy and paste them. Right now in the logs you have the output, you have what EasyBuild does, plus the output of make, so you need to go and pick out the things you need; and the extended dry run is not exactly what EasyBuild runs, so you cannot copy and paste it and be 100% sure. If you had a log with only the actual commands that EasyBuild issues during the build, someone could just take that, copy and paste it, and run it in the terminal, for debugging or for reusing the recipe.

This has come up before: people want to have, I think that's what you mean, EasyBuild dump a shell script such that just running the shell script does what EasyBuild does. That's possible, but it's not very trivial, because there's more to it than just running commands: EasyBuild is changing directories, applying patch files; most of that stuff is not done with shell commands but with Python code. So you would have to... yeah, actually, there are already wrappers for changing directories, dedicated functions; you could make those spit out the corresponding cd commands, but getting it 100% right is probably very difficult.
Still, making it work in 90% of the cases is probably fairly doable. And what the extended dry run already provides now is that you can copy-paste all the commands it runs to set up the environment: that's done in such a syntax that you can just select it, copy-paste it, drop it in a shell, and it will drop you in the same environment that EasyBuild uses. For Cray, yeah, that's a special case; it may be more complex there, because there's module swap stuff happening, a bit of magic. Maybe open an issue on this with some more details, so we know exactly what you want.

To comment on the next point, the backup of custom easyblocks: that's implemented now. Basically, in the installation directory there is this reprod directory, and in there are the easyconfigs, but also the easyblocks and the patch files, and there's more in there; I don't remember everything, but it should catch everything, I think. Yeah, it catches the hooks as well, so it should catch everything needed to be able to reproduce the build; tell us if there's something we missed. Okay.

And then, about external modules, something I actually also ran into sometimes: improve the error reporting for missing modules, and allow versionless entries in the metadata file for external modules. That might be, let's say, error-prone sometimes, but it's much easier, because then you don't have to duplicate entries all the time in the metadata file; of course you need to take care, because, as with skipping the sanity check, you need to know how to use it, at your own risk.

Latest news: we deployed an HPC OpenStack cluster, Fulen, the one I mentioned before, with EasyBuild, mainly using gmvolf. I also tried a bit with Intel, but I had to invent a new custom toolchain, Intel with MVAPICH2, so "imvmkl" or something like that, which is difficult to pronounce as well, and so far it has not been giving good performance; but anyway, that is a problem of Intel, not of EasyBuild, and the deployment was fine.

In terms of useful links: at the top, the EasyBuild user documentation that we provide at CSCS, plus the CSCS EasyBuild repositories on GitHub with our custom easyconfigs and easyblocks; and at the bottom the wrappers, which are mirrored anyway under the easybuilders organisation, so easybuilders/CSCS, and there you have the common tree structure. Thank you for your attention.

Let's see. I have a small question, not exactly related to EasyBuild but to XALT, just curiosity: when did you start using XALT?

Well, we used that already well before: it was ALTD, then it became XALT, so for a long time. After the latest upgrade of Piz Daint, in 2016 I think, we migrated to XALT, and Victor and my colleague JG were always fine-tuning it to get a better report, because, as you know, you need to match the names of the actual modules and libraries. When users use our provided modules, it's quite easy, I think; when they use custom software, Victor and JG managed to have a mapping of the names of the executables, in order to report that usage in jobs as well.

Okay. Is the setup public, or not?

It is not, probably.

Okay. Any plan of making it public?

Well, they just said that they don't know yet whether that's going to be made public; that would be a question for them, maybe. He said that it's above his pay grade.
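To illustrate the reprod directory mentioned a moment ago, here is roughly where that reproducibility data ends up; the application name and path are made up for the example.

```bash
# Every EasyBuild installation keeps a 'reprod' subdirectory with what
# is needed to reproduce the build (hypothetical application path):
ls /apps/software/MyApp/1.0-CrayGNU-18.08/easybuild/reprod
# -> the easyconfig used for the build, the easyblocks involved,
#    the patch files, the hooks, ...
```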
And related to that as well: do you have an idea of the size of the data you have gathered with XALT?

60 gigabytes of data every few months.

60 gigabytes; and I assume that's already heavily tuned and filtered. One question about the regression testing that you do: what do you actually test, exactly?

It's for testing the build of our software stack from scratch. In general, when you build, EasyBuild downloads the packages and then starts building; so, for instance, sometimes links get outdated, and you want to test that from scratch. Then we also have continuous updates of either the operating system or the programming environment, which change the environment a little bit, so some builds might fail. If the module is already present, our production build does not reveal that, because by default EasyBuild does not redo an existing build; so the regression rebuilds everything, in a separate folder of course, to ensure that, if you want to rebuild the software now, you are able to do that with the current recipes we have in our repository. So that's what the regression testing is for.
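A minimal sketch of what such a from-scratch regression run could look like with EasyBuild's command line; the list file, paths and options layout are made up for the example.

```bash
# Weekly regression: rebuild the supported stack from scratch in a
# separate location, so existing production modules cannot mask
# failures, and with a fresh source path so downloads are re-checked
# (file name and paths are hypothetical):
while read ec; do
    eb "$ec" --robot --rebuild \
       --installpath=/scratch/regression/$(date +%Y%m%d) \
       --sourcepath=/scratch/regression/sources
done < supported-easyconfigs.txt
```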