So our next speaker is Damien François, and he's going to tell us about the HPC cluster they have at UCLouvain, which was my university, by the way.

Thank you. Hi, I'm Damien, as was just said. I'm an HPC systems engineer at the Université catholique de Louvain, in the Center for High Performance Computing and Mass Storage. So I'm located in Louvain-la-Neuve; on this map, you are here and I am here. Well, at the moment I'm here too, but you know what I mean.

I just want to show you a bit about one of our clusters. It's a small cluster, so if you're wondering whether it's in the Top500: it is not, don't bother looking. It's a cluster that has grown organically. It started with as few as 20 or 30 nodes and kept gradually growing as more scientists came to us with money: we bought some hardware and added it to the cluster. That makes it a bit different from the other clusters we have, which we buy as a whole; a whole cluster comes with a full stack for provisioning it, running it, and so on. Here we had to build everything from scratch.

We actually started manually. All the information was in our heads, so when we added a node we would think, okay, what do I need to do to add this node to the cluster, and everyone would bring some information from their heads. Then we started thinking it would be better to have that written down somewhere, so we started writing documentation, and we gradually moved towards automation: we made the documentation actionable through scripts, we read books about DevOps and about infrastructure as code, and we decided to use configuration management systems to manage the whole thing.

So we looked at what tools were available and settled on three: Cobbler, Ansible, and SaltStack. I know that Ansible and SaltStack are often seen as mutually exclusive alternatives: you choose either Ansible or Salt, and you can start flame wars just by saying that one is better than the other. We believe they work together very nicely, and that is what I would like to show you in this talk.

To show you that, I will walk through the journey of a new node that must be added to our cluster. Once we have unboxed it, we put it in the racks and label the machine and the cables. We choose a name and an IP address, gather the MAC addresses, and enter all that information into Cobbler, which takes care of installing the operating system. Then there is a second stage, where we integrate the node into our infrastructure, and that is where we use Ansible. The third stage is configuration, where every non-user piece of software is installed and the configuration files are propagated. After that, the node is ready for jobs. I will spend a bit more time on each of these steps.

I don't know how many of you know Cobbler. It is advertised as a tool for deploying machines: basically a wrapper around the PXE, TFTP and DHCP servers that lets you manage operating system images and machine profiles. We use it to install the operating system, and also to set up hardware-specific configuration, such as disk partitioning or IPMI. It also sets up the minimal configuration that the other tools need to work: Ansible needs SSH keys, and we also use SSH a lot ourselves during maintenance, so we deploy SSH keys, and we install a Salt minion, the agent of the SaltStack software that runs on the machines.
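To make the Cobbler step concrete, here is a rough sketch of how registering a new node might look on the command line. This is illustrative only: the profile name, MAC and IP addresses are made-up values, not the cluster's real configuration.

```console
# Add a new system under an existing OS profile (all values are placeholders)
cobbler system add --name=node042 --profile=centos7-x86_64 \
    --hostname=node042 --interface=eth0 \
    --mac=AA:BB:CC:DD:EE:FF --ip-address=192.168.1.42 \
    --netboot-enabled=true

# Regenerate the DHCP/TFTP/PXE configuration so the node can network-boot
cobbler sync
```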
When the node is deployed and the operating system is up, we use Ansible to integrate it into our infrastructure. I see Ansible as a shell script on steroids: shared, with built-in safety and idempotence, and with a lot of ready-made modules for external APIs. We use it for one-off operations. For instance, we use Zabbix to monitor the whole infrastructure, so when we add a new node to the Manneback cluster, the Ansible playbook takes care of registering the node in Zabbix so that all the alerts are activated for that node.

We also use an inventory system named GLPI, for Gestion Libre de Parc Informatique. It is software we use to keep a list of all the compute nodes and all the other machines we have bought, and it holds the issues, tickets and so on. When we have a new node, it needs to appear in that inventory, and that is done through the Ansible playbook.

The Ansible playbook also registers the node with Salt. I said that Cobbler installs the Salt agent; that agent then talks to the Salt master and asks to be registered, and accepting that registration is taken care of by the Ansible playbook.

We also use the Ansible playbook in a way that may be a bit unusual: to build configuration files. This is a compute cluster, so we have a job manager, in our case Slurm, and Slurm needs to know what resources are available on which compute nodes. In the slurm.conf file we need one line per compute node saying that it has that many cores, that amount of memory, and so on. So we use the Ansible facts gathered by the playbook to build that slurm.conf file. The slurm.conf file is not propagated to the compute nodes by Ansible; Ansible only builds it. We do the same, for instance, with /etc/hosts for the dnsmasq server, and with the ssh_known_hosts file for host-based SSH authentication.

Once those files are created, we use SaltStack as the central configuration management server. So basically Ansible creates the configuration file, and Salt propagates it. Salt also installs the software we need — not the software for the users, for that we use EasyBuild of course, but the other software is installed with SaltStack — and it is used to mount the proper file systems and so on.
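To illustrate the slurm.conf generation described above, here is a minimal sketch of what such a fact-driven template could look like. The group names, paths and playbook layout are assumptions for the example, not the actual files.

```jinja
{# slurm.conf.j2 -- illustrative fragment: one line per compute node,
   filled in from the Ansible facts gathered on those nodes #}
{% for host in groups['compute'] %}
NodeName={{ host }} CPUs={{ hostvars[host]['ansible_processor_vcpus'] }} RealMemory={{ hostvars[host]['ansible_memtotal_mb'] }}
{% endfor %}
```

A playbook could gather facts on the compute nodes and then render the template into the Salt file tree:

```yaml
# Illustrative playbook fragment
- hosts: compute            # fact-gathering pass over the compute nodes
  gather_facts: yes
  tasks: []

- hosts: saltmaster         # hypothetical group containing the Salt master
  tasks:
    - name: Build slurm.conf from the gathered facts
      template:
        src: slurm.conf.j2
        dest: /srv/salt/slurm/slurm.conf   # where Salt will pick it up
```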
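On the Salt side, the state that propagates the generated file and keeps the daemon in sync might look roughly like this. Again a sketch: package names and paths vary by distribution.

```yaml
# /srv/salt/slurm/init.sls -- illustrative sketch
slurm-package:
  pkg.installed:
    - name: slurm                         # distribution-dependent name

/etc/slurm/slurm.conf:
  file.managed:
    - source: salt://slurm/slurm.conf     # the file Ansible generated
    - user: root
    - group: root
    - mode: 644

slurmd:
  service.running:
    - enable: True
    - watch:
      - file: /etc/slurm/slurm.conf       # restart when the file changes
```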
So we have three steps: deploying the operating system, integrating the node into our infrastructure, and then configuring the node and installing the software.

Once that is done, we still need to check whether we have a new CPU architecture, in which case we need to recompile every piece of user software so that users find the same set of software on all CPU architectures. Another step we need to take is that if the machine was bought by a specific group, and that group needs access to it with a specific quality of service, we need to integrate that into the Slurm setup. That is done by a tool we developed, called Sluffle, which I will talk about a bit more later. And then the node is ready for jobs.

More generally, we believe those three tools can be used in contexts more general than just an HPC cluster. If we simply replace Cobbler in the first step with OpenStack or with Vagrant, the whole chain still works. Typically, when we develop or test new things on our laptops, we use Vagrant and VirtualBox, and then we use Ansible to install a temporary cluster in VirtualBox. In that case we skip the configuration management system: I don't want my VirtualBox machines to appear in the central Zabbix monitoring. That is in contrast with the staging mini-cluster, a small cluster we use to test things on: there we do not redeploy the operating system every time we try something, but we do like to keep its configuration in the configuration management system.

The point is that whatever development or production stage we are in, the set of playbooks we use is the same, and the Salt server for configuration management is the same whether we are in production or in staging. So we can really test what we do, either in VirtualBox machines or on the small cluster, and if we know the same playbooks work there, we know they will work in production. That is our way of testing things.

Some features overlap between Cobbler, Salt and Ansible; installing a package, for instance, can be done at any of the steps. But we have a simple recipe. If the software is specific to the development stage, for instance the VirtualBox guest additions, we install it in the Vagrant provisioning step. Everything related to hardware, such as drivers, is installed by Cobbler, in the kickstart. When a piece of software is used in both production and staging, for instance the Zabbix agent that monitors the health of the system, we use Salt to install it. And if a piece of software is needed in all three stages, we use Ansible. For instance the job scheduler: I need to install it on the production machines, of course, but I also often need it on the test cluster, to test a feature or a new version, and I want to be able to test it in virtual clusters on my laptop, where I can break everything and nobody notices.

There are some gotchas, though, in using Salt and Ansible at the same time; in your head, it can be confusing. For instance, when you want to upload a file, Ansible expects the keyword src, short for source, but Salt requires the full word source. And it is all the more confusing because for installing a package it is the opposite: Ansible expects the full word package, while Salt prefers the shorter pkg.
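Side by side, the two dialects look roughly like this (names and paths are illustrative):

```yaml
# Ansible: file upload uses the short 'src', packages use the full 'package'
- name: Upload a file
  copy:
    src: files/motd
    dest: /etc/motd

- name: Install a package
  package:
    name: htop
    state: present
```

```yaml
# Salt: file upload needs the full 'source', packages use the short 'pkg'
/etc/motd:
  file.managed:
    - source: salt://files/motd

htop:
  pkg.installed: []
```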
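Going back to the development stage for a moment: a minimal Vagrantfile that spins up a couple of throwaway nodes and runs the same Ansible playbooks could look like the sketch below. The box name and playbook path are placeholders.

```ruby
# Vagrantfile -- illustrative sketch of a throwaway test cluster
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"

  (1..2).each do |i|
    config.vm.define "node#{i}" do |node|
      node.vm.hostname = "node#{i}"
      # Reuse the same playbooks as on the real cluster
      node.vm.provision "ansible" do |ansible|
        ansible.playbook = "playbooks/integrate.yml"
      end
    end
  end
end
```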
Nevertheless, we love both of them, not always for the same reasons. What we love about both is that they are both Python, both based on YAML, and both use Jinja for templating. We love the fact that they ship with a lot of very useful modules.

We like the fact that Salt is our single source of truth: we have a Salt master somewhere in our architecture, with syndics, as they are called, on our main clusters, so on one machine we have the configuration of every single computer in our infrastructure. We also have the history of the configuration changes that were run through the Salt master, so we can trace what happened. We like the fact that Salt is highly scalable: with one master in a small virtual machine in our OpenStack, we can handle the three or four hundred machines we have in the data center. The fact that it offers a second entry point is also interesting: when you accidentally kill all the SSH servers on your compute nodes, you are very happy to have Salt at hand to restore them.

By contrast, we like the fact that Ansible is very simple to grasp and very simple to build with. An Ansible playbook is very easy to read, even for people who do not know Ansible beforehand, so it is easy to share: if I build a playbook to do something, I can share it with people from other universities and they can use it without having to replicate our whole infrastructure; the playbook is rather self-contained. We have also often used Ansible to fix things. When you start playing with Salt, precisely because Salt is so powerful, a little mistake can break everything on your cluster, and I find it much easier to repair what was done with Ansible than with Salt, and then to recreate the correct configuration in Salt.

We love Ansible so much that we used it when we had to write some software to connect our LDAP system to SSH and Slurm. When users register with our system, they provide SSH public keys, which are stored in our LDAP, and we needed a way to let SSH know that the public keys are in LDAP. Connectors for that exist now, but when we started the project nothing was mature yet. We also developed a bit of software to register users from LDAP into Slurm and to create the various file systems: when a user appears, we need to create their directory on the home file system, a directory on the scratch file system, and so on. So we developed a tool named Sluffle that basically monitors the LDAP and triggers playbooks when things change: it makes the link between the information in LDAP, when it changes, and playbooks written by the system administrators, and it makes the information from LDAP available to those playbooks.

We also played a bit with Salt itself. For instance, we developed custom grains, which we can share with you, that allow us to write rules in the Salt top file depending on which Slurm partition a machine belongs to; in Slurm terminology, a partition is the equivalent of a queue in SGE or PBS. So here you see two rules saying that if the node belongs to the ZOE partition, this state needs to be active as well.
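A custom grain of that kind could look roughly like the sketch below, assuming the compute node can run sinfo; this shows the general shape, not necessarily the exact implementation used on the cluster.

```python
# _grains/slurm_partition.py -- illustrative sketch of a custom grain
import socket
import subprocess

def slurm_partitions():
    """Expose the Slurm partitions this node belongs to as a grain."""
    node = socket.gethostname().split('.')[0]
    try:
        out = subprocess.check_output(
            ['sinfo', '--noheader', '--format=%P', '--nodes', node],
            universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        return {}
    # Drop the '*' marking the default partition, deduplicate, sort
    parts = sorted({p.strip().rstrip('*') for p in out.splitlines() if p.strip()})
    return {'slurm_partitions': parts}
```

The top file can then match on that grain:

```yaml
# top.sls -- illustrative: activate a state only on nodes in partition 'zoe'
base:
  'slurm_partitions:zoe':
    - match: grain
    - zoe_specific_state
```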
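Going back to Sluffle for a moment, the general idea is small enough to sketch: poll LDAP, detect changes, and hand them to a playbook. The loop below is only an illustration of that idea; the server, base DN, attributes and playbook name are placeholders, and the real tool is more elaborate.

```python
# watcher.py -- minimal sketch of the Sluffle idea, not the actual tool
import subprocess
import time

from ldap3 import ALL, Connection, Server   # pip install ldap3

BASE_DN = 'ou=people,dc=example,dc=org'

def snapshot(conn):
    """Return a {dn: attributes} map of all user entries."""
    conn.search(BASE_DN, '(objectClass=posixAccount)',
                attributes=['uid', 'sshPublicKey', 'shadowExpire'])
    return {e.entry_dn: e.entry_attributes_as_dict for e in conn.entries}

conn = Connection(Server('ldap.example.org', get_info=ALL), auto_bind=True)
previous = snapshot(conn)
while True:
    time.sleep(60)
    current = snapshot(conn)
    changed = [dn for dn in current if current[dn] != previous.get(dn)]
    changed += [dn for dn in previous if dn not in current]
    if changed:
        # Hand the changed entries to a playbook written by the sysadmins
        subprocess.run(['ansible-playbook', 'on_ldap_change.yml',
                        '--extra-vars', 'changed_dns=%s' % ','.join(changed)],
                       check=False)
    previous = current
```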
So my main message here is that Ansible and Salt work very well together: I see them as complementary, and they use the same building blocks, all Python, Jinja, YAML and so on. And if you add Cobbler to the team, you have a nice trilogy of tools for managing a small Tier-2 cluster, and they all integrate very nicely into our full stack of open source software. One piece of that stack I like a lot is sshuttle; I don't know if you know it, but it is an SSH-based, VPN-like piece of software, very easy to set up, and very handy when you are a sysadmin on the go: in a hotel room or at home, just run it and you feel like you're at the office. If you want to know more about all this, you can visit our website, where we have a small "behind the scenes" page on which we basically list every open source software we use and explain how we use it. Thank you for your attention.

Could you repeat the question, please? So the question is, basically, whether we ever delete users. We do not: someone who entered the LDAP system always stays in the LDAP system. But logins and accounts expire: in our system we require users to renew their account every year, and if a user does not renew after one year, the account is marked as expired via the shadowExpire field in LDAP, so the user cannot connect any more, but we keep the history. It is something we are thinking about, especially in the context of the GDPR; maybe we will anonymize that information or delete it in some way. But we often have researchers who stay for four years, go away, and come back four years later expecting to retrieve the same data as before, so we keep the data and the login for a long time.

So the next question is about very large clusters, which we do not have: copying a file to a large number of nodes can take a lot of bandwidth, and you can sometimes run into network issues. Salt was made specifically to be highly scalable; it is basically written around a message queue. Some people say that Salt is actually an event-based system that can also be used for configuration management, just like I have heard people say that Emacs is a Lisp interpreter that can also be used as a text editor. The nice thing about Salt in this respect is that you can ask whether a job has completed: you can use the Salt job runners, and for every node you know whether the configuration was deployed or not, so you can see the state of every node at any moment. But we only have maybe 200 nodes, so we don't actually have that issue.
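For reference, querying job results on the master looks like this; the job ID below is made up.

```console
# List the jobs the master knows about
salt-run jobs.list_jobs

# Look up the per-minion results of one job by its ID
salt-run jobs.lookup_jid 20180101123456789012
```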
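And going back to sshuttle, a typical invocation is a one-liner; the user, host and subnet here are illustrative.

```console
# Tunnel traffic for the campus subnet through an SSH gateway
sshuttle -r myuser@gateway.example.org 10.0.0.0/8
```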
Any more questions? So the question is how Cobbler compares with Rocks. Cobbler has a much narrower scope: it just wraps the DHCP, TFTP and PXE servers, and it is used outside the HPC context as well. It does not have a lot of features, and we basically use the basic ones, while I believe Rocks is a much more complete ecosystem that also lets you deploy the full cluster. Cobbler just deploys the operating system and then says, okay, the rest is not my job.

Further questions? No? Okay, let's thank our speaker.