Go ahead. Good afternoon everyone. First of all, thank you so much for allowing us to come and talk to you, although from far away, about our work here at ACCRE with EasyBuild. I'm Davide; most of you probably already know me. I'm an application developer here at ACCRE, at Vanderbilt University in Nashville, Tennessee, and I'm also one of the EasyBuild maintainers, so that's how you may have seen me before. Here with me today is also Eric Appelt, one of our senior system administrators and one of the two people who worked on setting up the CVMFS file system and moving the stack onto it, so he accepted my invitation to join this call and talk with me about the work we did with CVMFS. So Eric, go ahead.

Okay, thank you, let me move this. Just to add a little about me with this slide: before joining ACCRE I was a large-collaboration physicist at Vanderbilt, working in the CMS collaboration at the Large Hadron Collider, and then I changed career tracks into software engineering and systems administration. The point of saying this is that, before this project, I was already used to having CVMFS around, and that's where this work comes from. The team of three that worked on this was, first of all, Andrew Melo, currently a physicist at Vanderbilt, who is really in charge of the CVMFS deployment and the synchronization scripting that we'll describe. Davide already introduced himself; he develops and maintains our cluster software stack. And I worked on some of the configuration management for this, and on testing and benchmarking the system.
When we originally wrote this talk, the target audience was someone who is a user of Lmod and EasyBuild, which I believe is everyone in this room, but somewhat unfamiliar with CVMFS. The key problem we encountered at ACCRE, and that we're really discussing here, is that we want to deploy a software stack that has been built with EasyBuild and that uses Lmod to an HPC cluster, and really outward from that cluster to the cloud, and do that at scale: to be able to transparently execute users' jobs on an elastic cloud with a minimal setup and maintenance footprint, so it's not too much of a struggle for us to keep track of it all. At ACCRE, all the software in our stack is installed only via EasyBuild, the environment modules are managed by Lmod, we have 130 unique software modules visible to users, and several hundred easyconfig files in their own GitHub repository, with several toolchains.

So the question is how to deploy this. Classically, I think there are two general options. One is to put it on a parallel file system, i.e. NFS, GPFS, HDFS; at ACCRE we use GPFS. The good part is that it's easy to do: you build the software, you put the files on the file system, and you point the users at it, with no additional maintenance required, in theory. The downside is that this is heavily affected by parallel file system performance: when there's heavy write contention, or nodes are leaving the cluster, this can really cause a problem. Additionally, this can be hard to mount on cloud instances. If you want to burst out into the cloud and, like us, you're using a commercial GPFS file system, that can be pretty much a non-starter: you don't want to add random cloud instances to your file system cluster for security reasons, and there may be licensing costs, so this is difficult.
The second option for deploying the stack is to repackage it and deploy to the local disks on the nodes. This means you have to do the extra work to produce RPMs, debs, or whatever sort of packaging you're using, so that's a lot of scripting work for cluster administrators, and if your software stack is large it may impose a significant disk requirement on the nodes themselves; if you're using something like diskless nodes, this could be a non-starter.

We've been pretty happy with option one in the past, but new problems have come up in the last couple of years that caused us to rethink this model. One, as I said, is trying to burst into the cloud: it's not feasible or legal for us to extend our network file system into cloud instances. We also have the problem that our software stack contains both licensed software and open-source software; the licensed software we can't legally put into the cloud, but we can put the open-source software there, and we would like to do that. Another new problem is graphical interactive use. Traditionally our large-scale computing facility has been used for batch processing, but more and more we have research groups that want to run graphical applications directly on compute resources, and software such as Open OnDemand from the Ohio Supercomputer Center, which we've started using, makes it easy for users to request a compute node, create a desktop, and use it in their browser. Once you do that, software startup times become extremely important in a way that they usually are not with batch processing.
For example, MATLAB can, depending on write contention and the state of the network file system, take several minutes to start up on a cluster. The reason is that it has lots of tiny files, and it needs to read them all, and reread them several times. In batch processing, where users submit jobs that run for several hours, whether that software takes three minutes or ten seconds to load makes little difference to the user. But when a researcher wants to make plots and use the computing resources interactively, waiting three minutes versus ten seconds is the difference between usability and a system that is not considered usable.

With those new problems in mind, we looked at CVMFS. It is used extensively by the high-energy physics community; it was developed at CERN as part of the CERN Virtual Machine project, and it is a file system optimized specifically for deploying software and nothing else. This is really how the large CERN physics collaborations are able to deploy their custom analysis and event-reconstruction software to a variety of compute clusters around the world. What CVMFS does is act as a read-only file system using an immutable data model, and this gives it a lot of advantages for distributing software over a general-purpose file system. Because the data is immutable, you can get away with tricks such as content addressing: if two files have the exact same content, they can effectively be considered the same, two links to the same file, so you only have to distribute that data once to a compute node, and you really only have to read it once into the system page cache.
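The content-addressing idea just described can be sketched in a few lines of shell. This is only an illustration of the deduplication principle, not CVMFS's actual on-disk object layout; the temp directories and file names here are made up for the demo.

```shell
#!/bin/sh
# Toy content-addressed store: a file's "address" is the hash of its content,
# so two files with identical bytes collapse into a single stored object.
# (Illustrative only -- not CVMFS's real object format.)
set -e
store=$(mktemp -d)   # stands in for the deduplicated object store
work=$(mktemp -d)    # stands in for the software tree being published
printf 'same shared library payload\n' > "$work/libfoo-copy1.so"
printf 'same shared library payload\n' > "$work/libfoo-copy2.so"
for f in "$work"/*; do
  h=$(sha1sum "$f" | cut -d' ' -f1)          # content hash = object name
  [ -e "$store/$h" ] || cp "$f" "$store/$h"  # store each unique blob once
done
echo "objects stored: $(ls "$store" | wc -l)"
```

Two distinct files go in, but because their contents hash identically, only one object is stored, and only one copy would need to be transferred and cached.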
It uses off-the-shelf technologies: a standard FUSE mount on a Linux system, with everything distributed over HTTP, and you can use Squid proxies to scale up the distribution of the software stack to many systems. A little bit of a confession: I used CVMFS for years behind the scenes as a physicist, and I never really knew what it was or what it did; I just knew that your software had to be in the CVMFS directory. I was in charge of writing comprehensive tests of our workflow on the cluster to make sure all of our components worked, and I didn't know what CVMFS was. That says something: CVMFS worked quietly and reliably for the CMS collaboration for several years on our system, and I, who was in charge of making sure everything worked, never had to do anything about it, which is pretty cool. So, based on that history and the new challenges we were facing, we decided to see if we could make CVMFS work not just for CERN software but for our own software stack.

The basic architecture of CVMFS is this: you have what is called a Stratum 0, an authoritative system to which new files are written. This feeds one or more servers which form the Stratum 1 layer, which is where clients read files from; this provides redundancy and high availability. If the Stratum 0 goes down you cannot make changes to the file system, you cannot write new files and directories, but as long as at least one Stratum 1 is up, all clients can read files as necessary. On top of that, Squid proxies are added: these are standard HTTP proxies that simply cache Stratum 1 data and provide scalability. If you only have a handful of test machines you can read directly from the Stratum 1 servers, but as you scale out and want to deploy widely, you can put any number of Squid proxies out front and scale to thousands or tens of thousands of machines very easily.
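On the client side, wiring a node into this Stratum 1 / Squid architecture is mostly a matter of one small config file. A minimal sketch of `/etc/cvmfs/default.local` might look like the following; the repository name `accre.example.org` and the proxy hostnames are placeholders, not ACCRE's real values:

```shell
# /etc/cvmfs/default.local -- minimal CVMFS client configuration (sketch).
# Repository and host names below are hypothetical placeholders.

# Repositories this node should be able to mount under /cvmfs/
CVMFS_REPOSITORIES=accre.example.org

# Site-local Squid proxies; clients fail over between ";"-separated groups
CVMFS_HTTP_PROXY="http://squid1.example.org:3128;http://squid2.example.org:3128"

# Size of the local disk cache the client fills on demand, in MB
CVMFS_QUOTA_LIMIT=20000
```

With a handful of test machines you could set `CVMFS_HTTP_PROXY=DIRECT` to read straight from a Stratum 1; pointing the fleet at Squid proxies instead is what lets the same setup scale to thousands of nodes.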
So yeah, go ahead. Yeah, I'll take over here. I think it's worth giving you an overview of how our stack-management pipeline is designed here at ACCRE, because in some ways it shapes how we deploy the software, not only on the cluster but also on CVMFS. We adopt the classical three stages of development: dev, QA, and prod.

The dev stage is, let's say, our contribution to the EasyBuild community and project. It is done by a developer like me, or Fenglai, or whoever else in our group wants to do it: we simply develop the easyconfig files on our local machines and then contribute them back to the easyconfigs repository. The advantage of that is, first, to contribute back, and second, to get feedback from the EasyBuild community: sometimes you're building software that you've never seen before, and let's try not to reinvent the wheel; somebody may have good hands-on advice on how to avoid issues.

Once the easyconfig files are merged into the EasyBuild easyconfigs repository, we have a single QA machine that mimics the cluster and contains a copy of the cluster stack. There we pull the dev easyconfigs repository, extract whatever easyconfig files we need, modify them depending on internal needs, and then inject them directly into our own easyconfigs repository. In this way we have the flexibility of adapting the installation to our needs while keeping constant track of all the changes we make to easyconfig files, to assure reproducibility and avoid future errors. We also use issues internally on that repository to keep track of problems we had with a piece of software. Before we adopted this way of deploying software, it was typical to hear: "users are having issues with this software" and "yeah, I remember, I saw that a year ago or something like that, but I don't remember how I fixed it." So we use issues internally on that repo to keep track of what is going on.

Now the easyconfig files are merged into our internal repository and ready to be brought into prod. The way we do that is with a single dedicated user, shared by all the admins, which is the only account allowed to build software on the stack via EasyBuild. The first advantage is that it simplifies the file permissions on the stack: independently of which admin is actually building the software, the owner is always the same user. Another advantage is that it prevents accidental stack modifications: for example, if I build software on the stack as myself and at some point I do a pip install and forget to use --user or to turn on a virtual environment, I would end up polluting the software stack. The other advantage of having a third entity that builds software is that it's easier for the user-support team to debug and troubleshoot the problems that users have with the stack, because we have the same permissions as users with respect to the stack.

So, as "Bob the builder", we pull from our easyconfigs repository and use the Slurm integration in EasyBuild to launch jobs that build on our debug queue. One of the problems we have on our cluster is that we have multiple CPU architectures, so our debug queue contains nodes of all the different architectures, and we build the software once on each architecture: if I'm building something, I'll build it four times for the four architectures. The debug queue is connected directly to the stack on GPFS, so the debug nodes can read and write the stack on GPFS. Once the software is built on GPFS, we
have an automatic system, developed by Andrew, that nightly syncs GPFS with our CVMFS server, and once the sync is done the software is available both to the cluster and to the cloud. That, in a nutshell, is how our pipeline works.

Yeah, so the next thing is how easy this is to do, and it's actually really easy, because we started from the pipeline Davide described, where we only wrote to GPFS and the whole cluster read it back from GPFS. To make it transparent to the user, so that the debug nodes see the stack under the GPFS file system while the rest of the cluster pulls the stack from CVMFS, basically all we needed was some clever symlinking. We created a directory, accre_arch, that on a build node is just a symlink into the network file system, and we set that symlink according to the node architecture. So software built against accre_arch on, let's say, an older Intel Sandy Bridge node would automatically link to that architecture, and the software would all naturally get installed under accre_arch. On the compute nodes, our configuration management simply relinks accre_arch to CVMFS, where the directory structure otherwise looks the same; this also lets us fall back to GPFS if the configuration management software notices that CVMFS is down. Again, the symlink is set to match the node architecture. This means the software is available in the same place, transparently to the user in terms of the file system hierarchy, but it's really being served from CVMFS or GPFS depending on where you are in the cluster. Just to show you an example: here I am on a standard node and I can look at two symlinks, accre_common and accre_arch. accre_arch is a symbolic link to a path within
CVMFS. We use the OASIS server on the Open Science Grid to serve our open-source software, and you can see it links into a directory which has "sandybridge" in the path, as this is a Sandy Bridge machine. We also have a separate common directory, which we didn't mention before, that serves certain binary-distributed software for which it doesn't make sense to compile different versions for the different architectures; that all goes into the same directory. Now, if I go to one of our debug nodes and look at the symlinks, accre_common points to the accre_common directory on gpfs22. So if you're the builder user you can write to and update that directory, and overnight it gets synchronized so that the other nodes see exactly the same thing. Again, if I look at accre_arch, it points into the internal GPFS read-write file system, but here you can see we're linking to the Skylake architecture.

The other issue we had to solve in distributing this is differentiating between open-source software and licensed software. For the open-source software we actually didn't need to set up our own CVMFS server: as we're part of the Open Science Grid, we were able to use their OASIS system, which provides this as a service to other clusters. Again, in GPFS everything works normally, all software is built into the same directories, and then we can use symlinks on CVMFS to separate the private from the public software. In the nightly sync we just flag certain module directories in CVMFS as private, and those become symlinks to another location. If you're out on the cloud, these private links are broken, and broken links are ignored by Lmod. All the licensed software is hosted on a completely separate CVMFS instance that is only available on the internal network; there's no external access.

Question from the audience: how do you handle software that, at install time, resolves those symlinks from accre_common or accre_arch down to the GPFS
file system and writes the resolved paths into its bash files or whatever? The symlinks sit at the top of the directories, but some software, when it installs, does symlink resolution and puts the actual path into its scripts.

Yeah, I see what you mean, that's a good question. As of now we haven't hit that issue. You haven't? Have you looked closely? Well, everything seems to work. No, I mean, that's definitely a good point; it's something we haven't hit yet, so maybe we're just lucky. So, going on with that good luck:

I can show you this symlinking. If I take an open-source package, say Anaconda3: this is a directory on our build node. And if I take one of our licensed software modules, like MATLAB: on the build node both of these are directories. Now, if I exit the build node and go back to the cluster, you can see that Anaconda3 is still a directory, but the licensed software is a symbolic link, and it goes to a different CVMFS server, our private one. So on a cloud instance MATLAB will just appear as a broken link and be ignored completely by Lmod, while if you're on the cluster you'll actually see it when you run a module avail command.

Here's the big chart of our overall distribution model. Everything highlighted in blue is internal and on-premises at ACCRE. We have our GPFS file system, and nightly it gets synced to our private Stratum 0 server; from there it goes forward through an internal architecture where we can set up as many Stratum 1 servers or Squid proxies as we want inside the internal network, and cluster nodes read from those. The open-source software is synced out to the Open Science Grid OASIS server. If you're not an Open Science Grid member, you could also set up a separate internal server to be your Stratum 0 for that (sorry about the typo there), and
then that architecture has Stratum 1 servers and external Squid caches. What happens is that our internal Squid caches read from both the internal and the OSG Stratum 1 servers, collecting together all of the software, while an external user workstation or a cloud instance reads from the OSG and external Squid caches, which read only from those public Stratum 1 servers and give you just the open-source software.

Okay, and we have a little demo for the cloud instance. I created an instance in AWS a couple of minutes ago, before I started this presentation, and I haven't logged into it yet, so I'm going to do that now. What I'll do is run a simple script to enable Lmod and connect it to our open-source software stack. Just to show you, I've never logged into this before, and this is a stock CentOS 7 instance from the AWS collection, right off the shelf. So I'm going to grab this simple script, and from it you could create your own AWS image or containerize these instructions. I'll just run it, and in a minute or two it will set up and get the software stack ready to go on this new image. Any questions while we're installing software? No questions so far, okay. So now we're just populating an entirely empty environment? That's right, yeah; this is just demonstrating that it's pretty simple and fast on a stock CentOS 7 image. Okay, now everything's installed; I should be able to log out, log back in, and type module avail. Again, this has to do an initial read from CVMFS and populate the local cache, and there we go: I can see some compilers and binary-distributed software, and I'm ready to go. I can load whatever module I want and start doing work. What's really nice about this, in my opinion, is that a researcher can create a
container that links into CVMFS and dynamically gets updates to the whole scientific software stack in a cloud image, whereas if you were to try to put all of this software into one Docker image, that would be kind of a pain and you'd have to do updates; this short-circuits that step. Just to add a similar use case: we now have a few users with pipelines that run half on their local workstations and half on the cluster. The problem for them is how to have a consistent software stack that allows them to seamlessly move part of their pipeline from the workstation to the cluster and vice versa. So one of the things we are doing now, obviously under our supervision, is to allow users to mount part of our stack, the GCC and foss-toolchain part, directly onto their workstations. That way they can build software against the same stack and have the same software on both sides. Yeah, so this is an exciting new way we can distribute.

The other issue is interactive startup time, and here's where we really had a problem with a cold cache. This is one data point that I'm showing right now: the startup time of some packages. Python, importing (I think) NumPy and scikit-learn; loading MATLAB; and loading R and running an empty script. When we used the GPFS file system, on a regular day with lots of workflows running and the file system under some regular stress, it could take as much as ninety seconds to load all the Python modules and several minutes to load MATLAB. When we switched over to CVMFS, those load times dropped by an order of magnitude. Now, once the cache is warm, once the data for those files is
on the compute node, the difference matters much less; in fact GPFS may be a little faster there, due to how it uses user-space caching versus the FUSE mount for CVMFS, but in terms of user experience the two are pretty similar. Just to demonstrate what our users are doing: we have an Open OnDemand system that gives users in-browser desktops they can launch and ask for (oops, I shouldn't clean this up), and a user might want to use this desktop to, let's say, load and run MATLAB. Here's the step where we really had a lot of user complaints: if there were any significant difficulties or stress on the file system, this could take several minutes. With CVMFS, all the little files that go into loading MATLAB can be put together as a block and shipped over HTTP, and if there are multiple files with the same content they only need to be read once. So this now shows up for the user in less than a minute pretty reliably, which is still slow for any application these days, but it's a lot better than loading MATLAB over a read-write network file system. And there we go, so we did it.

The basic conclusion, before we go to questions and further discussion, is that we found this to be a reliable and reasonably simple method to distribute scientific software, one that we can easily extend out to the cloud, or to other nodes in our researchers' work groups that we don't want to connect, for whatever reason, to our cluster file system, and it's also very well suited for interactive use. Thank you very much. Yeah, go ahead. No, I was just saying that if you have any questions on that side, we are happy to answer.

Question: you said you have dedicated build nodes, correct? How large is that system, and how many users do you have? No, it's just... so we currently have five CPU
architectures, but one of those is the GPU architecture, so for the four non-GPU architectures we have four nodes in our debug queue. I understand your point, in the sense that dedicating nodes just for building is kind of a waste of money and cycles, so the way to use them as much as possible is to make them available through our debug queue on the cluster. They are both accessible to users for debug purposes and, at the same time, available to us for building. That doesn't slow our process down too much, because the maximum job that users can launch on the debug queue is 30 minutes, so worst case, if we have to build some software, we wait 30 minutes. That's how I justified to our managers giving me four nodes, one per architecture. And if that's a problem, with virtualization you could reserve, you know, create VMs of just one or two cores on a node and release the others for general-purpose computing too. Yeah. Okay, any other questions for Davide?

This is John Dey; I'm curious about how you are using Squid in AWS. Is that a standard image that you're getting from Amazon? Oh no, the Squid servers are dedicated machines on premises, so that's on-prem. And you have a dedicated connection to the AWS cloud? So, for the software: we have a compute node in the cloud here that can access the software stack, and it's using the public-facing Squid proxy. We could run our own, but this one is actually hosted by the Open Science Grid, so literally anyone in the world can talk to that Squid proxy. We could conceivably make restrictions based on IP address or other qualifications, but as it stands this is basically exportable to everyone. Thank you. So yeah, the cloud node doesn't need any special setup or any specific
credentials to talk to the Squid server; we just stand it up and it immediately begins communicating. Okay, great, thanks very much. Thank you guys. Okay, I'll stop the streaming briefly so we'll set up the next talk.
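The cloud-instance bootstrap shown in the demo was a site-specific script that isn't reproduced in the talk. As a rough, hypothetical sketch of the same steps on a stock CentOS 7 image (the repository name, module path, and proxy setting below are placeholders, not ACCRE's real values):

```shell
#!/bin/sh
# Hypothetical sketch of the cloud-node bootstrap from the demo: install the
# CVMFS client and Lmod, point CVMFS at a public repository, and put its
# module tree on MODULEPATH. Run as root; names below are placeholders.
set -e

# 1. CVMFS client from CERN's cvmfs-release repo, plus Lmod from EPEL
yum install -y https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest.noarch.rpm
yum install -y epel-release
yum install -y cvmfs Lmod

# 2. Minimal client configuration (placeholder repository, no local proxy)
cat > /etc/cvmfs/default.local <<'EOF'
CVMFS_REPOSITORIES=accre.example.org
CVMFS_HTTP_PROXY=DIRECT
EOF
cvmfs_config setup

# 3. Expose the repository's Lmod module tree to login shells
echo 'export MODULEPATH=/cvmfs/accre.example.org/modules/all' \
  > /etc/profile.d/stack.sh
```

After logging back in, `module avail` would then list whatever modules the repository publishes, as in the demo; the first invocation is slower because it populates the local CVMFS cache.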