Hi, it's time now for our next talk. This talk is from Jakob Blomer from CERN. He will be talking about CernVM-FS, giving us an overview of it and a roadmap for its future. Jakob, I'll hand over to you; about 30 minutes for the talk and then some time for questions.

Yes, thank you. Thank you very much for the invitation. It's my first time at the EasyBuild User Meeting, and I'm happy to tell you a bit about the CernVM File System (CVMFS): what the system can do, and some ideas on what we plan to work on, let's say, in the next year. So, some words about the software distribution challenge first. Then I would like to introduce the system and its main components, and specifically speak about containers and CVMFS, containers obviously being one of the big, important technologies for software distribution, and then a little bit of an outlook and a summary.

So this picture shows the CVMFS deployment that we run for the LHC experiments, for high-energy physics. In a nutshell, what CVMFS does is deliver scientific software, containers, and auxiliary data (I will say a little later what exactly we mean by auxiliary data, but mainly scientific software and containers) to a worldwide distributed network of loosely coupled data centers. This is what we run for high-energy physics, which we call the grid: a federation of data centers. As you can see, the server infrastructure, so data sources and mirror servers, there are about 10 of them. We call them stratums: stratum 0 is a data source, stratum 1 is a replica, a mirror server. And then there are all the data centers that read software from those servers, and there we run about 400 site caches for efficient access. So we are aware of more than a billion files under management, obviously serving high-energy physics and the LHC experiments, but over the years we have also attracted some other users: there's LIGO for gravitational waves, Euclid, the European space mission, and of course EESSI; if I'm not mistaken, there's a talk tomorrow about the EESSI project.

So CVMFS is a file system. If you have the CVMFS client installed on a Linux or macOS machine, it provides uniform, consistent, versioned, POSIX file system access to this area /cvmfs. In this example, for the CMS experiment at the LHC, you can see it presents the tree of all the different software versions that are built and available and that you might need on a machine to run CMS experiment software. Of course, you pick one of them; at any given point in time, you only need one of those subdirectories. And that you can do on grids and clouds, on HPC systems, and also on end-user laptops.

So that is for reading, but of course someone also needs to write, and CVMFS, in contrast to many other file systems, is a highly asymmetric file system: there are many readers but only few writers, and we call them publishers. Typically the publishers are software librarians; this is usually a small team in a scientific collaboration that takes care of building the software stack and then writing it to CVMFS. Whatever is written there is cryptographically signed, and it's written in a transactional manner. So on the file system you will never see half a software version: you write all of your new software version, you commit and push, and then the clients atomically see the new version being available.
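For illustration, a minimal sketch of what the read side looks like on a machine with the client installed; the repository name is the one from the talk, and the setup script is just one example of how a release gets picked:

```
# Browse the repository like any POSIX tree; only metadata is fetched here:
ls /cvmfs/cms.cern.ch
# A job then picks exactly one release, e.g. via the experiment's setup script:
source /cvmfs/cms.cern.ch/cmsset_default.sh
```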
And, as I said, there are only a few publishers, so typically you have a dedicated place where you stage new content into the system, and this is what we call the stratum 0.

Right. So why do it via a file system, why a global shared software area? Part of the explanation is simply that the scientific software stack, at least in our case, is quite large and complex. This is again the CMS experiment, where you see the typical software stack: starting from the operating system, then some base libraries and compilers; then common high-energy physics libraries, say for math, machine learning, geometry and so on; on top of that an experiment-specific framework, so every detector has its own software framework; and then at the very top comes individual analysis code. All of this comprises a software release to which, in our case, often hundreds of developers have contributed. It can span something like several tens of thousands or even 100,000 files per release, and that is for every production release. If you look at nightly builds, you produce much more, because you test different compilers, different combinations, platforms and so on; at the moment that is typically something like a terabyte per night of nightly builds. We distribute this to about 100,000 machines worldwide. And something that is a bit special about scientific software is that releases need to remain available: it's not like you always update to the newest, you want to be able to go back and redo the same analysis with the software release that you used three years ago.

Right. And if you look at common tools for sharing data center applications, they are all bundle- or package-based: we have obviously Docker, but also more traditional packaging like the Windows installer, or say RPM and Debian packages. The applications get bundled ahead of time and then installed everywhere. That is an efficient way to structure the build process, but it is actually a quite inefficient way to do the large-scale rollout, and we see on the next slide why this is. This is an example with R. Let's say I pull the Docker container for the R statistics toolkit; I get close to a gigabyte, a few hundred megabytes of image data. Then I use it to run something, say the fitting tutorial, and it turns out that of my close to a gigabyte of container image, I have actually only used 30 megabytes; that is what I really needed. So what we would ideally want is a container for the isolation part, but for the distribution it would make much more sense to have something that loads on demand, that loads only those 30 megabytes that happen to be needed for any given job, for any given processing task.

And, especially in the HPC environment, we do of course know about shared software areas. The problem often is that if we put them on general-purpose distributed file systems, it's very easy to break them, and the reason is the high metadata rate. Software typically consists of very many small files, and when we start a job in the data center or on the HPC system, all of the nodes want to access the same set of small files at the same time, open the same Python files to start TensorFlow.
We can easily have a megahertz request rate and a kilohertz file-open rate per compute node, and that can bring even powerful distributed file systems to their knees. If you look a bit closer: distributed file systems are usually optimized for data delivery, and if we use them for software we see very quickly that the requirements are very different, for some of the typical metrics many orders of magnitude apart; for instance file sizes and access patterns. Software, from this point of view, is massive, but not in volume; in volume it's very small. It is massive in the number of objects and in metadata rates. And that was the motivation to build a purpose-built file system for software delivery, which became CernVM-FS.

So it is purpose-built, and it works well for a specific set of application areas; these are the four areas where it has proven to work well. First, production software: a software stack for, say, the CMS or ATLAS experiment, as I've shown before; relatively stable repositories, used by a large number of machines for data processing jobs. Then we have integration builds. Integration builds are a little different: we publish much more, because usually all the different combinations and tests are being published, but we access it usually only from developer machines or some test nodes. So integration builds are really about publishing throughput, much more than about read scalability. Then container images, which I will speak about a little later: CVMFS is a good fit to deploy container images at scale; the only trick is to distribute container images not as one big blob, but to unpack them, to explode them into the file system, and then you can use them essentially as a chroot entry point. And lastly, what about data? There are some data sets that can be distributed very well on CVMFS; I call them auxiliary data sets. That is not clearly defined; it's a type of data that goes well along with your software, that has a similar access pattern to your software. An example is what we call in high-energy physics conditions data: the state of the particle detector at the time of data taking. What you need for a particular data processing task is typically a few gigabytes, and the important point is that all the nodes in your data center typically need the same few gigabytes at any given point in time. That makes it different from the actual event data, the actual physics data, where every node needs a different set of data files; for the real data files we have high-throughput systems that are tuned for high volumes. But for the conditions data, CVMFS works quite well. Another example, from the bioinformatics community, is genome reference data sets; those also go well on CVMFS.

So let's go a little bit into the components of the system. For reading, we have a FUSE file system client; this is what you see on the left side: a single compute node, say with Linux, the FUSE kernel module and the CVMFS FUSE client. All data transfer in CVMFS is done over HTTP, so the content is available on an HTTP server or an S3 server; this is typically the stratum 1. Then there is a hierarchy of web caches, which can be designed according to scalability needs, and at some point files end up on the client. There is also a client cache, typically a few gigabytes, or let's say 10 to 20 gigabytes, in size.
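As a concrete illustration, a minimal client setup might look like the following; this is a sketch with a hypothetical proxy hostname, not the official installation instructions:

```
# Install the client (Enterprise Linux example) and write a minimal
# configuration; CVMFS_QUOTA_LIMIT is the client cache size in MB:
yum install -y cvmfs
cat > /etc/cvmfs/default.local <<'EOF'
CVMFS_REPOSITORIES=cms.cern.ch
CVMFS_HTTP_PROXY="http://mysquid.example.org:3128"
CVMFS_QUOTA_LIMIT=20000
EOF
cvmfs_config setup
cvmfs_config probe cms.cern.ch   # mounts the repository and checks access
```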
Right. And especially those HTTP caches, say the site caches or data center caches, are usually used very efficiently, because typically the nodes in a cluster require the same bits at any given point in time. Availability is essentially everywhere where there is FUSE. We have a few A platforms and B platforms; the A platforms, the best-tested platforms, are Enterprise Linux 7 and 8 at the moment and the Ubuntu long-term releases. But it's also available for macOS, SUSE Linux Enterprise, Fedora, Debian, and different architectures like ARM and 32-bit Intel, and there are even some experimental builds for POWER, Raspberry Pi and RISC-V.

So how does the publishing work? For publishing, we look at the dedicated publisher node, the node that maintains the stratum 0, the authoritative storage. What we do in publishing is use a union file system on that specific node: the CVMFS client is read-only, so we cannot write anything into it, but we can put a union file system on top of it, like a writable layer. All the changes that we then write into this virtual, writable mount point end up in a staging area, and this staging area is used to commit the new and modified files to the central storage, which can be either a local file system attached to the node or an S3 storage system. In practice, this is how it works on the machine that is a dedicated publisher: you open a transaction, at which point /cvmfs/container.cvmfs.io, if that happens to be your repository name, becomes writable; you write some content, for instance untar something; and you publish. When the publish is finished, this is the point where the atomic switch takes place.
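A minimal sketch of that cycle with the cvmfs_server tool; the repository name is taken from the slide, and the tarball and target directory are hypothetical:

```
cvmfs_server transaction container.cvmfs.io   # /cvmfs/container.cvmfs.io becomes writable
mkdir -p /cvmfs/container.cvmfs.io/myimage
tar -xf rootfs.tar -C /cvmfs/container.cvmfs.io/myimage   # stage new content
cvmfs_server publish container.cvmfs.io       # commit: clients atomically see the new revision
```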
What happens during publishing? During publishing we rewrite the content that you have written into an internal format, a content-addressable storage. The CVMFS publish routine walks through the staging area, picks up all the changes, compresses the files, hashes them, and puts them into an object store, which can be on a local file system or S3. All the metadata ends up in SQLite file catalogs, and SQLite databases are also plain files, so they can go into the object store as well. So the object store simply holds compressed, deduplicated files and chunks, and the SQLite file catalogs keep track of the file system structure: the directory structure, symbolic links, the content hashes. The catalogs are digitally signed and have a time-to-live; this is how clients get to know about updates: when the time-to-live expires, they check whether something new is there. And they use hash trees in order to secure the entire directory tree cryptographically, so that you know you don't get any garbage data, by mistake for instance. That also means we can use untrusted intermediate storage systems, and we can use HTTP; we don't have to use HTTPS.

The deduplication is actually quite useful; that is shown on the next slide. This is an example from one of our nightly build nodes. On the x-axis are publish IDs, so every point is a publish operation, and on the y-axis you see how many files were published in that operation. If you look at the upper part (a logarithmic scale), you see the number of processed files, so the number of files that were in the staging area, and in purple the number of actually new files; the rest happened to be already in the repository under a different path. In the lower part you see that only a few percent of the files are new. This is very typical for software: if you release version 53 and versions 1 to 52 are already in the repository, it turns out not that many files are actually new.

Right, so a little bit on the caching infrastructure, the transport. I said that the content is on the stratum 1 replica, on a web server, and clients read it through HTTP, typically going through caches. What we have drawn here is a site cache, a web cache; we use Squid, but you can use different caching servers as well. Typically you have something like 500 nodes per cache; that is a ratio that works fine. It is, for various reasons, a good idea to have multiple caches on site, because the system is built in a way that the entire transport infrastructure is stateless. That means you can take out any part, you can take out caches, you can take out replica servers, and the clients will simply try different ways to get to the same content. So in this case we have added two Squid servers, but this we have done for scalability: let's say we have 1000 nodes in the data center and one of the Squids fails; the clients realize this path is blocked, and try, for the time being, the other path. The same happens on the replica side, where clients pick a geographically close replica using GeoIP, but if that one fails they can fail over to a different one. One interesting point here: clients have a local cache, so they can run in completely disconnected mode if they happen to have no cache misses. They can work, for instance, on a laptop while travelling, in disconnected mode; and also on an HPC site, if the cache is pre-populated, there's not necessarily outgoing network connectivity needed.
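This failover behaviour is driven by the client configuration; a sketch with hypothetical hostnames (the semicolon separates failover groups, the pipe load-balances within a group, and @fqrn@ is replaced by the repository name):

```
# /etc/cvmfs/default.local (excerpt)
CVMFS_HTTP_PROXY="http://squid1.example.org:3128|http://squid2.example.org:3128;DIRECT"
CVMFS_SERVER_URL="http://stratum1-a.example.org/cvmfs/@fqrn@;http://stratum1-b.example.org/cvmfs/@fqrn@"
```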
So, a little bit about containers and CVMFS. For containers and CVMFS, we have to distinguish two different problems that we might want to solve. The problem on the left side is: let's say we have a container from somewhere, say an Ubuntu-based operating system, and we want to make a software tree in CVMFS available to that container. The easiest way is to bind-mount from the host into the container, but if this is not possible, for instance if CVMFS is not available on the container host, we have a few more advanced options to make CVMFS available nevertheless. You can mount inside the container, which works without any privileges if the kernel, container runtime and everything else are new enough; we also have an option to let Singularity take care of mounting CVMFS, which I will show later; or CVMFS itself can come as a client container and expose the /cvmfs area to other containers, which I will also speak about in a minute. So that's problem one: provide CVMFS in an existing container. Problem number two that we might want to solve is: let's say I have a set of containers with all my software and I want to deploy them at scale. What we have seen, at least with enough machines, is that this can put quite some stress on the registry and on the worker-node image caches. So it actually works quite well to distribute those containers in an unpacked form from CVMFS instead of from, say, Docker Hub. To do this, we need to take the containers and publish them in CVMFS; we have a tool to do this, and our runtimes can actually use those unpacked containers from CVMFS.

So let's look at the first case. Say we have a container, busybox in this case, and we want to make CVMFS available. This we can do with a standard bind mount, if CVMFS is available on the host; the important bit here is to use a shared mount, to make the entire subtree of /cvmfs available. The same works with Singularity. Now, if I do not have CVMFS on the host, there's a relatively new little utility called cvmfsexec that might help; it is a clever sandbox. What you see in this command is: you call cvmfsexec, you tell it which repositories you would like to have available in your sandbox, and then you run the command. In this case I wanted to have grid and atlas, and they are available together with a helper repository; if the kernel and some other bits are new enough, this works completely unprivileged.

Then, what can be done is to deploy the client itself as a container. What does that mean? We have the client in a container, mount CVMFS inside the container, and then we leak it out to the host. In practice this is how it works: we call docker run, we run it as a service container, we set a few options, we need a few privileges to do this, and we leak the /cvmfs volume out from the container to the host. On the host we can then look into /cvmfs, and there it is. That is an interesting way to just try it out, if you do not want to start immediately with a package installation, but it is actually also interesting for targeting systems that do not have traditional package managers, for instance Atomic Host, so container-only operating systems. And it is the way to target Kubernetes clusters: this container can be deployed as a Kubernetes DaemonSet and make CVMFS available to all the pods in your Kubernetes cluster.
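Minimal sketches of these three options; the image and repository names are illustrative, and the service container image name in particular is a stand-in (check the project documentation for the published image and its exact flags):

```
# 1) Bind mount from a host that already runs CVMFS; ':shared' propagates
#    the whole /cvmfs subtree, including repositories mounted later:
docker run -it --volume /cvmfs:/cvmfs:shared busybox ls /cvmfs

# 2) cvmfsexec: an unprivileged sandbox with the repositories you ask for:
./cvmfsexec atlas.cern.ch grid.cern.ch -- /bin/bash

# 3) The client itself as a service container, leaking /cvmfs to the host;
#    it needs FUSE access and extra privileges:
docker run -d --cap-add SYS_ADMIN --device /dev/fuse \
  --volume /cvmfs:/cvmfs:shared cvmfs/service
```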
And the last point might be relevant specifically for HPC systems that already have Singularity installed. In this case, Singularity can take care of the elevated privileges. What happens in this script down here is: Singularity starts a container, and the container has the CVMFS client installed; however, it might not have the privileges to actually use it, but Singularity does. And this fuse-mount option says: Singularity, please help me get the necessary privileges to mount a FUSE file system inside the container. That might make CVMFS a bit more available on some HPC installations.

Right, so that was the client part, how to make the CVMFS client available in various ways. But what if we have containers and we want to distribute them using CVMFS? Well, the first thing we have to do is convert them into a CVMFS repository, and we have done this. We have two big repositories in the high-energy physics community, unpacked.cern.ch and singularity.opensciencegrid.org, where we have each a couple of hundred images available. If you remember one of the earlier slides where I showed the experiment software stack: that was a very well-defined software environment, with bits and pieces carefully picked by the librarian. If we start making container deployment through CVMFS available, this broadens the range quite a bit: wherever you have a container to do something, it can end up in CVMFS. This can be base operating systems, some user code, explorative tools, machine learning, user analysis and so on, and even something like volunteer computing containers such as Folding@home. Now, once you have the container image available in, say, unpacked.cern.ch, you see at the bottom how you use it: you call singularity exec, you simply point it to the container name, and you run your program.

So how does the conversion work? For the conversion we have a service called DUCC, a daemon that converts images into CVMFS. At the moment this works in collaboration with a regular registry and with a wish list, which you see on the left side: you specify which images from your regular registry, say a GitLab registry, you would like to see in your CVMFS repository, and the daemon takes care of fetching the images, extracting them, and publishing them on CVMFS. If you look into the repository, into unpacked.cern.ch, you see what we have seen before: the name of the container image, so that you can for instance start it with Singularity, and this points to the fully flattened container image, like a chroot starting point. But we also keep all the layers in their exploded form, so we can use it with Singularity, but also with containerd, Kubernetes, Podman and so on.

So this is our first incarnation. As for the wish list, you have seen that it might be nicer to automatically convert images when users push, so that they don't even need to think about the wish list, and this is what we plan for this year: a webhook connection, so that when you have a Harbor registry, or some other registry that can send a webhook, pushing an image will automatically trigger the image conversion. These are the runtimes that are supported: Singularity natively; also Podman, for which we produce a little bit of extra metadata so that Podman can see the content in CVMFS and use the images from there; and for Docker and containerd there are runtime-specific plugins, one for Docker and in particular one for containerd, which is called a remote snapshotter, as these plugins are named in containerd.
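Two sketches for what was just described; the image path under unpacked.cern.ch follows the layout shown in the talk, and the cvmfs-client.sif image in the first command is a hypothetical container that ships the client:

```
# Let Singularity perform the privileged FUSE mount of a repository inside
# a container that has the client installed (the fuse-mount mechanism above):
singularity exec --fusemount "container:cvmfs2 cms.cern.ch /cvmfs/cms.cern.ch" \
  cvmfs-client.sif ls /cvmfs/cms.cern.ch

# Run an image that DUCC has unpacked into CVMFS, by pointing Singularity
# at the flattened directory tree:
singularity exec /cvmfs/unpacked.cern.ch/registry.hub.docker.com/library/ubuntu:latest /bin/bash
```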
And we actually hope that Docker at some point updates to a recent enough containerd foundation, so that we can use this plugin universally to target Docker, containerd and Kubernetes.

Right, and this is an example of what we have done with it: at some point during the year, people in the distributed computing groups of the LHC experiments thought it might be a good idea to run Folding@home. The container was put in CVMFS and then ran on the LHC computing grid, and it actually put us pretty much at the front of the teams, in the illustrious company of NVIDIA, Amazon and a few other big companies.

So what are we planning to work on this year? Well, I spoke about the publisher node and the fact that, in a simple configuration, the publisher node is a single machine. Now, this single dedicated node obviously comes with drawbacks: it's a single point of failure, and it's also a performance bottleneck. So one of the relatively new things we have is the system for distributed publishing. In this system you can have multiple publisher nodes, and you have to somehow arbitrate access to the central storage; that is done through a gateway service. The gateway service takes care of handing out leases, or locks, for certain parts of your directory tree, so that, for instance, one builder node can write into /arm for all the ARM releases, while a second builder node at the same time writes into /x86_64 for your Intel releases. This works for a few repositories at the moment, in production at CERN, and what we will do this year is work a bit on stabilization; we still have a set of known issues that need to be addressed, and we also need to match the feature set: the distributed publishing cannot yet do everything the local publisher can do; for instance, you cannot trigger garbage collection from the remote publishers. So that's one area of work.

The other one is getting some experience with a new command that we have recently introduced, called cvmfs_server enter. The idea of this command is that maybe you would like to get a writable /cvmfs area not only on the publisher node, but on any machine that can mount the client, and this is exactly what cvmfs_server enter can do. It requires, again, a relatively recent Linux kernel and fuse-overlayfs, but with these ingredients any machine that can run a client can enter this sub-shell. You see this at the bottom: you call it with, say, hsf.cvmfs.io as a repository name, and it opens a shell where, all of a sudden, you have write access to this area and can make modifications. Now why is this a good idea? Well, for instance, if you have a build node that produces software that ends up in CVMFS, you can now build directly into /cvmfs; you don't need to build into some off-site area, and you can look at the changes. Of course we would ideally like to directly publish from there; we know how that works, but we still have to implement it, to connect this to the gateway publisher. Then you would be able, given some access keys, to use any build node to push new releases into CVMFS.
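A minimal sketch of that enter workflow, with the repository name from the slide; the build step is hypothetical and the exact option set of this young command may still evolve:

```
# Open an ephemeral writable view of the repository on any client machine
# (needs a recent kernel and fuse-overlayfs):
cvmfs_server enter hsf.cvmfs.io
# ... inside the sub-shell, /cvmfs/hsf.cvmfs.io is writable:
make install PREFIX=/cvmfs/hsf.cvmfs.io/software/v1.2
exit
```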
Right, so to summarize: CVMFS is a special-purpose, versioned file system that provides a global shared software area for scientific collaborations. It uses content-addressable storage and asynchronous writing, and those are the key ingredients that keep it scalable for a very large number of readers, so scalable on the reading side. In the coming months we will work on container integration and on scaling up the number of writers. And here's how you can find us: the website, and you can find us on GitHub. Thank you very much for your attention.

Thank you for the talk. I know Kenneth's got a question, so we'll go to him first.

Yes, so the question I had is if you can elaborate a bit on the problems you've seen with getting HPC sites to make CVMFS repositories available on their systems, what kind of concerns are typically raised, and whether they are valid.

Right, so I'm probably the wrong person to say whether they are valid or not, but the typical problems are these. The first is outgoing network connectivity: the system works best if you use this setup of tiered caches, and that is, let's say, the zero-configuration mode of running it. If outgoing network connectivity is missing, you need to pre-stage files into the HPC site, to preload the cache; we have tools for this, but it is some extra work. The second issue is deploying a FUSE client on the compute nodes. That has been a problem on some sites, not all; there are also many sites that do install the FUSE client, for instance CSCS, and many others, but it can be a problem. And here there are several levels of how deeply you want to install the client on the compute nodes. Of course, the easiest, the fastest, is to simply do yum install, but maybe that is not wanted for one reason or another. You can, for instance, bring in the client on demand with the Slurm scheduler: if a job requires CVMFS, you could mount it on the node only then, and unmount it when the job finishes. So there are some deployment options that we have developed over the years to make it a bit easier for HPC sites. Maybe one more point I should mention is the caching. An important piece of the caching infrastructure is the local compute-node cache, and some HPC sites do not have that: there are simply no local disks on the compute nodes. That also requires special configuration, where, for instance, the bulk of the cache is on the cluster file system and an upper cache layer sits in RAM; we have something which is called a tiered cache configuration that makes this possible. Or you have cache files per compute node that you loopback-mount on the compute nodes (a configuration sketch follows after this exchange). All of this is, I think, no showstopper, but it requires special care and special configuration.

Okay, yeah, that's very clear, thank you. If there are any more questions for Jakob, then raise your hand and we can allow you to ask them.

Hey Jakob, maybe I have a question. When the cache was being populated, in this HPC user mode, I saw quite a bit of load from CVMFS at that time. It disappears once the cache is populated, but it made me wonder about the jitter: does CVMFS introduce jitter into the jobs? Would you be sacrificing scalability a little bit?

We don't do this ourselves, so we would rely on the job scheduler for this. But the jitter is actually interesting, I think, only when you start using the software, when you read it: when you populate the cache, it does take longer the first time, for the initial population, but subsequent accesses are served from the cache. I hope this answers your question.

Yeah, I was just curious, looking at the cache loading up; I could see the load, it's not small.
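Picking up the diskless-node point from that answer: a sketch of the documented alien-cache settings that put the bulk cache on a cluster file system; the Lustre path is hypothetical:

```
# /etc/cvmfs/default.local on diskless compute nodes
CVMFS_ALIEN_CACHE=/lustre/shared/cvmfs-cache   # bulk cache on the cluster FS
CVMFS_SHARED_CACHE=no                          # required with an alien cache
CVMFS_QUOTA_LIMIT=-1                           # client does not manage the alien cache size
```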
I have a question considering the number of Linux distributions that you have to maintain: did you guys already consider doing something like Compute Canada does, or like the EESSI project does, which is to have a compatibility layer, so that you have one single set of files that can run everywhere and you just keep adding newer versions of distributions?

Yeah, that is a good question. The service container, the container that wraps around the client, is meant as our fallback for any Linux that is not supported; that is our compatibility layer. Otherwise, I think, being a file system, we are one of those software pieces that need distribution-specific packages, and I think we will have to live with that and support what makes sense, what is actually being used by our users.

My question would be: over the last year we set up a CERN Tier-2 site, so we are running CVMFS on the compute nodes; we run it on all the compute nodes, but we actually use it just on one partition. My question is: can I mix different upstream stratums? Because I would still want to go upstream to the CERN CVMFS stratum 0 and 1, but what is also interesting from your presentation are the kind of small-ish static data sets, because we also have biologists, we are actually mostly biologists, running on our machine, so this kind of reference genome data might be something interesting. So I'm wondering, can you point it to different sources in that sense?

Right, yes. Technically, all those mount points, all those repositories, are completely independent, so your repository atlas.cern.ch could point to a completely different stratum 1 than cms.cern.ch. In the CVMFS configuration this is typically grouped by domain, so you would have a domain configuration for .cern.ch, but if you use, say, a bioinformatics.eu repository, you might go to a different set of servers.

I see; that would then basically also let me differentiate, for the CERN part, to go to the Frontier squid, and to whatever other proxy cache I have for the other domain?

Yes, absolutely, and you can decide whether you want to share those caches; you can actually safely share them, or not.

Okay, all right, thank you.

Hi, so my question is: I'm used to using AFS, and I think I see a little bit of influence here. Our scenario is that a number of our file servers experienced exactly what you described at the beginning: people started 150 MATLAB jobs on 150 separate nodes, and they all went to the same NFS file server, which then buckled under the load. So our use case is really looking at things like Python distributions, R distributions, MATLAB distributions, the software that we know has hundreds of thousands of files. And I just want to make absolutely clear: when a compute node boots up and is getting ready to be put into the scheduler, ready for jobs, can we ask it to pre-cache those applications, so that they are already on the local disk at the point at which the compute node becomes available for other people's jobs?

Right, there is a tool called cvmfs_preload, which can be used to preload part of a directory tree to the compute node. Now, typically I would say this is not even necessary; you might have a different use case, but from what we have seen in high-energy physics, I think none of the production sites do it. It is true that the first time you run a job it takes a bit longer to load all those Python libraries, but the local cache on the compute node is persistent, and when you run this software version, or a similar one, again, it is as good as being preloaded.
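A sketch of the preload tool just mentioned, with a hypothetical stratum 1 URL and cache path:

```
# Preload a repository into a local cache directory before the node joins
# the scheduler; a dirtab file can restrict this to a sub-tree:
cvmfs_preload -u http://stratum1.example.org/cvmfs/sft.cern.ch \
              -r /var/lib/cvmfs-preload-cache
```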
We have some, I consider them to be a bit overeager, system administrators who like to reinstall compute nodes on a regular basis, so it would be nice to be able to preload immediately after reinstallation, before we give the node to the users. This was also kind of directed at the previous question about whether it introduces jitter on the nodes: if you can identify the software that is going to be the biggest drag when populating the cache in the first place, and you pre-populate that, then you reduce the jitter. So, thank you.

It looks like we have finished with the questions; I can't see any more hands up. So again, thanks to Jakob. We will have a few minutes' break, and then, in just over 10 minutes' time, we will have our final talk of the day. Thank you.