All right, it's 10 o'clock, so good morning everyone, and welcome to the third day of the CVMFS workshop. In the last two days we showed, and you did the exercises for, setting up the actual infrastructure for CVMFS. Today we will start using that infrastructure to publish files to your repositories, and I'll show you some details about the transaction process and some other techniques. But first I want to do a short recap of things that we saw yesterday: some questions and issues that we ran into, and some other remarks. I collected a few of the things that I have seen in the Slack channel and questions that we got after the talk. One thing that is important, because it can cause very annoying issues, is that on your stratum 1 you define a directory where you store all your public keys. For instance, you make a directory named after the domain, leaving out the repository name, and you put all your public keys for that particular domain into it. Even if there is one key that exactly matches your repository name, say repo.yourdomain.tld, cvmfs_server will still check all the other public keys in that directory as well. Yesterday we saw a very weird issue: if one of the public keys in that directory is incorrect for some reason, cvmfs_server will complain, even though the one you actually want to use, the one that completely matches your domain name and repository name, is correct. It complains because there's another, incorrect key in that directory. That also means the public key you want to use doesn't have to match your repository name exactly; you can rename it to whatever you want and cvmfs_server will still find it. So again, make sure the keys are all correct, otherwise you can get weird errors about the whitelist being invalid, or similar kinds of errors.
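As a sketch, the keys directory on the stratum 1 looks something like this (domain and file names are hypothetical, the path is the usual default):

```
/etc/cvmfs/keys/yourdomain.tld/
├── repo.yourdomain.tld.pub     # the key you actually want to use
└── other.yourdomain.tld.pub    # also checked on every verification, so it
                                # must be valid too, or you get whitelist errors
```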
Another question that we got: if you start your stratum 1 without using the geo API, by using the workaround that we provide on the tutorial web page, can you still enable the geo API later on, or do you have to redo all the steps and make the replica from scratch? That's not necessary. If you want to use it later on, you can just request the license key and insert it into your server.local configuration file, and when you do the next snapshot, CVMFS will detect that you have set the license key and will start updating the geo database. What it basically does on every snapshot is contact the MaxMind website, which provides that database, check for a new version, and pull in the new version of the database that describes which IP addresses are located where. Then we got a request to add some kind of TL;DR section to each page. There is quite a lot of text on each page now, and every now and then, somewhere in between, you will find commands that you have to run, or an instruction to edit a certain file. We didn't have time to do that yet, but we will try, maybe this week, to add some kind of quick summary or overview to each section or page, which will just list the commands you have to run to set up a particular server, for instance the stratum 0: which commands to run, which files to edit, and which files will be created by the packages or the tools themselves. That gives you a quick overview in case you want to redo something without having to read the entire page again. "Bob, can you make the slides a bit bigger?" Ah, yes. "Just make it full screen for now or something?" Yeah, that's probably better; I'm using the terminal anyway.
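Retrofitting the geo API then amounts to something like the following on the stratum 1 (variable names as used in recent CVMFS releases; the key and account id come from your MaxMind account, so check your version's documentation):

```
# /etc/cvmfs/server.local
CVMFS_GEO_LICENSE_KEY=<your MaxMind license key>
CVMFS_GEO_ACCOUNT_ID=<your MaxMind account id>
```

On the next `cvmfs_server snapshot`, the GeoIP database should then be fetched and updated automatically.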
Then I was asked about the fact that most of these commands, especially for the server but also some for the clients, have to be run as root: can you run CVMFS in user space? Well, for most of the server parts that's not really possible. The stratum 0 and stratum 1 do require some root permissions, because you have to change things in directories that you don't have access to as a regular user, and you also have to edit some configuration files. The server also uses tricks with overlays to make the CVMFS repository writable, as a writable overlay on top of the read-only part, and that cannot be done with Singularity, for instance. It is possible to run the server in a Docker container, but then you have to give it full privileges to do whatever it wants in that container. So that's possible, but it probably doesn't really help: if you want to run this somewhere where you're not root, you probably can't use Docker either. For the clients, however, there are solutions that work without root permissions. There is, for instance, Singularity, which can at least mount your repository inside a container using a FUSE mount, which I think Jacob also demonstrated in his talk on Tuesday or Monday. In the EESSI project, for instance in the talk that I gave about EESSI, I also gave a quick demo of how that works: you pull in a container, which does need to have some packages installed, but given that they are installed, you can let Singularity mount CVMFS for you in that container, and then you don't need any kind of root permissions on that system.
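A minimal sketch of that client-side trick, assuming a recent Singularity/Apptainer with `--fusemount` support and a container image that already has the CVMFS client packages installed (the image name and repository are hypothetical):

```shell
# mount the repository inside the container via a FUSE mount, as a regular user
singularity exec \
  --fusemount "container:cvmfs2 repo.yourdomain.tld /cvmfs/repo.yourdomain.tld" \
  client-image.sif \
  ls /cvmfs/repo.yourdomain.tld
```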
There's an alternative, a kind of wrapper called cvmfsexec, which can mount CVMFS in several ways. It does several checks on the system you're running on, for instance whether user namespaces are enabled, which depends on your kernel, and it has several fallback mechanisms to do the mount for you. Depending on which features are enabled, it can do certain things or not; it even has a full fallback to Singularity, so if all those namespace features are disabled, it can still use Singularity and basically do what I just explained, use Singularity with the FUSE mount option. But if user namespaces are enabled and the fusermount tool is available, it can mount CVMFS for you even at /cvmfs, meaning that you can mount CVMFS inside a cluster job, as a regular user, for your job only. There are still some things you have to take into account there, for instance with respect to caches, but it is possible. The final thing, which I think was mentioned on Slack: at some point in the instructions we say, at the end, run `cvmfs_config stat -v` to make sure that your client uses the proxy and connects to the right stratum 1. In order to run that, you have to be sure that your repository is mounted; if you run it without, you might get an error saying that the repository is not mounted. Then you just have to do an ls on your repository, or a `cvmfs_config probe`, or any other command that accesses the repository, because whenever you access it in some way, it will automatically be mounted, and then you should be able to run this command.
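A rough sketch of using cvmfsexec (the repository name is just an example; see the cvmfsexec README for the details and the available distributions):

```shell
# fetch the wrapper and a client distribution, all as an unprivileged user
git clone https://github.com/cvmfs/cvmfsexec.git
cd cvmfsexec
./makedist default

# mount the repository and run a command with /cvmfs available to it
./cvmfsexec repo.yourdomain.tld -- ls /cvmfs/repo.yourdomain.tld
```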
I think that covers most of the things that we saw yesterday, so unless there are still other questions, I will continue with today's section. If you have a question now, just raise your hand. I don't see any questions, so I'll continue. There was one other question, which I will cover later on: I saw some discussions about how long it takes after you ingest new files before they show up on the clients. I will cover that in the next section, because that's about publishing anyway. So today I will show you how to do actual transactions. You've already done that, basically, in the first session, because you already added a file to your repository, so I will explain that a little bit more. I will show you how to easily ingest large tarballs: for instance, if you have a build machine where you build lots of software and you want to ingest it all in one go, there's an easy way where you just tar up everything on the build machine, copy the tarball over to your stratum 0, and then, without extracting the entire tarball and adding it to the repository yourself, there's a command that you can just give a tarball and it will automatically ingest it for you, which is also a bit more efficient than untarring it yourself. Yesterday there was already a question about tags, snapshots, and revisions, so I will say a little bit more about those, and about how to do actual rollbacks, so that you can go back to previous versions of the repository. Finally, a very important aspect is the catalogs that CVMFS uses to store all the metadata. By default, at the beginning, you just have one big catalog that stores everything about your repository, which can become very inefficient if you store too many things in it; that's where you can make nested catalogs, so I will explain that as well, and then you can again work on an exercise. So let me go to the tutorial web page. I'm not sure how many commands I will run, but for now I will do it with a split screen, so
first, the transaction. Well, there's not a lot more to say about it than what you already did during the exercises. Basically, if you want to change anything in CVMFS, which is read-only by default, so add a file, remove a file, modify a file, create directories, and so on, you first have to open a transaction, which you can always do with this command. If you made the user who runs this command the owner of the repository when creating it, you don't need root permissions for this, so you can just run it as a regular user. The transaction basically makes sure that you get a writable overlay on top of the read-only CVMFS folder, and it then captures all the changes that you make inside that folder. So once you've done this, you can go into the repository folder and just start modifying, adding, and removing files, but they will not actually show up yet; it's still a kind of scratch area where you're working. Only when you run the publish command will it actually figure out which changes you have made and add them to the repository: it will do the compression, the deduplication, and update the catalogs to adjust the metadata for everything that you've just changed. As long as you haven't run the publish command, nothing is published yet, and you can always abort by running the abort command, which will just wipe the entire writable overlay, undo all the changes, and close the transaction, so that nothing has happened. That can also be useful if you just want to test whether something really works: you can play around in your repository, see what it actually does, and if you're not happy with it, just abort, and nothing will be changed. Well, that's basically everything about the transaction. So then, the tarballs. As already mentioned, sometimes you want to ingest a large bunch of files or directories, and then it's sometimes easier to just
take a tarball and ingest that into the repository. You can of course open a transaction and unpack the tarball yourself, but again, it should be a bit more efficient if you use the ingest functionality. So how does that work? It's the same kind of command: again `cvmfs_server`, which basically does everything on the server, then the ingest subcommand, and then you have to provide it with a few arguments. First the name of the tarball, of course, that's the `-t` here; this should not be a compressed tarball, I will get back to that in a bit, but here I will assume that it's not compressed. You also have to provide the name of the repository over here; the order is a bit odd, you would probably expect these two the other way around, but this is the name of the repository. And the `-b` option basically says where you want to ingest this to: it should be a relative path in your repository. So in this case it will extract this tarball into /cvmfs/<your repository name>/some/path; that's where everything will end up. An important remark here: don't prefix this path with a slash, otherwise you will get an error, as I've seen before. So don't include a slash over here, just a relative path without one. In older versions of CVMFS you couldn't ingest into the root folder, so if you wanted to ingest something there, that wasn't possible, but if you have a recent version, the latest one, 2.8.0, you can also pass a slash to the `-b` option, and then it will extract to the root folder. Well, often your tarball is probably compressed, so how do you do this for a compressed tarball? You first have to use the right tool to decompress it into a regular tar stream. For instance, if it's a gzipped tarball, you can use gunzip with `-c` to decompress to standard output, and then just pipe that into the `cvmfs_server ingest` command, and then, instead of
passing it the name of the tarball, you just add a dash here, which means that it will read from standard input. Other than that, it works in the same way. So then, the tags. Basically, for every publish operation that you do, you can assign a tag: a name, and optionally a description, that you attach to that publication. By default, unless you change the default settings, CVMFS will add an automatically generated tag to each publish operation. I can show that here, I already demonstrated this a bit yesterday: if you use `tag -l` and provide your repository name, it will list the current tags. You see the names of the tags over here: this is the automatically generated one, which is called generic- followed by a timestamp. You see the revision number, which is increased by one every time you change something, and you see the timestamp. There are two special tags, trunk and trunk-previous, which are just the current version and the previous version. You can add a custom tag if you want, using the options that are explained over here: for instance, here you provide an example tag, and you can also provide a description, which will then show up in the last column. Let's make this full screen, that makes it a bit easier to read. If you provide a description, it's more or less like a git commit message: you can just attach a message to this tag. If you then want to roll back at some point, say you did something, you find out that you made a mistake, and you want to go back to an earlier version, there's the rollback command, which you provide with the name of a tag, and it switches the latest version back to whatever tag you give it. If you want to know more about all these commands, just run `cvmfs_server`, and you will see all the different commands that you can run; for instance, the rollback is explained
over here, so you can find out which options you can pass in. The tag command has quite a lot of different options: you can create tags, remove tags, list all the current tags, and so on. Yes, I see there's a question. "If you do a rollback, can you go forward again?" That's actually a good question; I'm not sure if it removes the current tag. I see Jacob is here as well, maybe... "You can't, there's no redo; the rollback is very permanent. What it does technically is actually republish, because technically the repository can only ever move forward: it takes the old tag that you want to roll back to and republishes it as the latest state. Also, technically, all the hashes are still somehow in there, they are not deleted, but it's very difficult to get them back; that's only possible with manual intervention." Yeah, that's what a rollback sounds like, but I just wanted to make sure. Okay, thanks Jacob, and thanks for the question. I also see a question on Slack, maybe I can try to cover those now as well. Kasper is asking: before you do an actual publish, is there a way to check which changes are staged, so an equivalent of git status? I think there's a diff which you can do on a working tree; correct me if I'm wrong, Jacob, but I think that also works if you're just working inside the transaction and haven't published yet. "Yes, that's right, there's `diff --worktree` with the 2.8 release, to be released on Monday." And the other question: `cvmfs_server ingest` immediately also publishes, right? Yes, I will demonstrate that in a bit, but ingest indeed does the publish right away, so you don't have to do a manual publish afterwards. Maybe I can show you that right now... no, let's first do the catalogs, then I will show you how the ingest works, so I can do both at the same time. So that's, I think, everything about tags.
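The commands from this section can be summarized in one sketch (repository name and paths are hypothetical; the runnable part at the bottom only demonstrates the decompress-and-pipe pattern, with `tar tf -` standing in for the ingest command):

```shell
# --- transaction / publish cycle, on the stratum 0 ---
# cvmfs_server transaction repo.yourdomain.tld
# echo "hello" > /cvmfs/repo.yourdomain.tld/hello.txt
# cvmfs_server publish repo.yourdomain.tld
# (or throw everything away instead: cvmfs_server abort -f repo.yourdomain.tld)

# --- tags and rollback ---
# cvmfs_server tag -l repo.yourdomain.tld                # list existing tags
# cvmfs_server publish -a mytag -m "useful description" repo.yourdomain.tld
# cvmfs_server rollback -t mytag repo.yourdomain.tld     # republish old state

# --- ingesting a tarball; note the relative -b path, no leading slash ---
# cvmfs_server ingest -t software.tar -b some/path repo.yourdomain.tld
# gunzip -c software.tar.gz | cvmfs_server ingest -t - -b some/path repo.yourdomain.tld

# the decompress-and-pipe pattern itself, runnable without a CVMFS server:
mkdir -p build/software/dummy/1.0
echo "1.0" > build/software/dummy/1.0/VERSION
tar czf software.tar.gz -C build software
gunzip -c software.tar.gz | tar tf -
```

The last line prints the paths contained in the tar stream, just to show that the pipe delivers a plain, uncompressed tarball to whatever reads standard input.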
Then the catalogs, which are a very important thing, as already mentioned. A catalog is basically an SQLite database which stores all the metadata about the actual files. The client needs this catalog, for instance, to know the structure of the entire repository and which part you're accessing; the database then knows which files, that is, which compressed versions of those files, have to be pulled in. And of course that catalog will grow: the more files you add to your repository, the larger the catalog becomes, and the more metadata has to be pulled in by the client. At some point, if you add a lot of files, the catalog gets unmanageably large, and all that overhead takes a lot of time. That's where nested catalogs come in: you can make catalogs per subtree of your repository. It's important to lay them out such that files you often access together stay together: say an entire software installation for one particular application, where you probably need most of the files in that directory at the same time, so you make catalogs per software installation, for instance, or per any other directory structure that you often access together. There are some recommendations about catalogs, some principles you should stick to: each catalog should have more than 1,000 files, otherwise they get too small and you get a different kind of overhead, but they should contain fewer than 200,000 entries, so files and directories. There are several ways to do this, which I will come to now. First, what does it look like if you ignore this? At some point, when you do a publication, so when you run `cvmfs_server publish`, you will see this kind of error. Let me make this full screen and zoom in a little. The publish operation gives you a warning saying that your root catalog,
the one stored at the root folder, contains more than 200,000 entries, and it also shows the actual number here. So it really warns you that you should do something about it, otherwise you will notice it in performance. If you do use nested catalogs already and some of them exceed this limit, I think the warning only shows up at 500,000 entries, so a bit later than the 200,000 for the root catalog. So how do you make these nested catalogs? There are a few different ways. The first one, which I don't think we discuss here, is to let CVMFS do it automatically for you; it will then make them based on some assumptions. But there are two ways to do it more manually, because you will often know which files belong together; again, it makes sense to make one catalog per software installation, for instance. You can do that by making these files yourself: you touch a `.cvmfscatalog` file inside the directory where you want the nested catalog. It can just be an empty file, but it is an indication for the cvmfs_server command that you want a nested catalog for that directory and everything beneath it, so the entire subtree gets one catalog. You can nest these in whatever way you want: you can put one in the root folder, one in a directory below that, and even more in the subdirectories below that. In whatever directory you put such a file, everything below it, until it reaches another level where it finds such a file, goes into that nested catalog. So you can just open a transaction yourself, touch the file in the locations where you want these catalogs, and do the publish; when you run the publish command, CVMFS will search for these files and generate the catalogs for you.
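The marker files themselves are trivial; a small stand-in sketch (the paths are made up, and in a real repository this happens inside a transaction):

```shell
# in a real repository you would do this between
# 'cvmfs_server transaction' and 'cvmfs_server publish',
# under /cvmfs/repo.yourdomain.tld/; here we just mark a local stand-in tree:
mkdir -p tree/software/dummy/1.0 tree/software/dummy/2.0
touch tree/software/dummy/1.0/.cvmfscatalog   # one nested catalog per version
touch tree/software/dummy/2.0/.cvmfscatalog
find tree -name .cvmfscatalog                 # shows where catalogs would be created
```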
Then, from the client: if you access the root folder, it will only retrieve the root catalog, but once you go into the directory structure, at every level where a nested catalog is used, it will pull in that nested catalog. That means the root catalog can already be much smaller, and depending on which files you access, it retrieves the additional ones for you. This might sound easy, but at some point you have so many files that you might lose the overview of where you have to create these files yourself, unless you come up with some smart scripts, or a hook, to create them for you. An alternative is to make a `.cvmfsdirtab` file, which is basically a text file that specifies in which directories you want to create nested catalogs, all as relative paths inside your repository. This example shows the structure of a repository, probably a quite typical software tree on an HPC cluster: you have the software and modules folders, with applications over here; inside the application folders you have the version folders, and inside those you have the actual software installations for that particular version. In the same way you have the module folders, and then, one level less deep, the actual module files. If you want to create nested catalogs for this, as I mentioned before, you probably want one catalog for this version, one for that version, and so on. You could also make one for the entire software folder: in this case there are only two applications, so a nested catalog here would contain only two entries, which doesn't really make sense on its own, but assuming you're going to add lots more applications in here, you could make a nested catalog for the software folder as well. And you could make one in the modules folder, which
will then contain all the module files and module directories in here; depending, again, on how many apps and versions you have, that's probably good enough. So how can you do this? You don't have to specify all the directories individually: you can use wildcards, which make things very easy. For instance, here you say: I want one catalog for each subdirectory of every directory in my software directory, so an entry like `/software/*/*`, two levels below the software directory, which is what the two wildcards point to; now you get one catalog for every software installation. We also make one for the entire software directory itself, assuming that there will be more than two applications in there. In the same way, we want one nested catalog for all the modules, covering all the folders and files in that directory. You would have to make quite a lot of software installations before you actually exceed 200,000 entries here, but then you can even start splitting this one up if you want. One remark here: with the `cvmfs_server ingest` command you did not have to use a leading slash, but for the dirtab I think you do have to use it, so that's a little bit different, something to remember. And one other thing, which I think I didn't mention here: you can also use exclamation marks to exclude something, so if you don't want a nested catalog for something that would otherwise match, you can exclude it. For most cases this will work fine, but at some point you may have to add customized entries to this file as well. Say you have a software installation which is extremely large, with lots of files, because it has some subdirectory with lots of examples or libraries or whatever; then you can add individual lines here. Let's say I want one more for my software, say for every MATLAB installation, for instance, if
that would have lots of files. If MATLAB had a subdirectory with lots of files for which you want an additional nested catalog, you could just add that line here as well. You can even combine this method with the previous one, so you can still add manually created `.cvmfscatalog` files next to the dirtab. Something else that can be useful is this command; let me show you how that works. I'll run this on my server: it shows you all the catalogs that you have. At this point it doesn't show much yet, because I only have one catalog, but I can show you what happens when I add more. First the ingest command, which I ran this morning, because it takes quite a while to ingest this tarball: I prepared a dummy tarball, which I'll show you, and copied it to my stratum 0, with lots of dummy software installations and dummy directories and files in it. They are basically empty text files, with just a number in them, but enough to make sure that I'm going to exceed the limit of my catalog, so there are lots of files in those directories. I'm not going to go through it all now, but I ingested it this morning, which took about 10 minutes or so. I used this command, as demonstrated on that page, and ingested it into a software folder in my repository. So it takes the tarball, and you get all these dots as a kind of progress marker while it's chewing on the tarball; at some point it says: waiting for upload of the files, committing the new catalogs. And then you see this warning over here, the one I already showed you on the page: because my tarball added about half a million files, it warns me that this catalog now has too many entries and that I should split it up. Nevertheless, it just publishes the tarball. Then you have to sync the stratum 1; assuming you have a cron job set up, you don't have to do anything, just wait for a
bit. But you can of course also do a manual snapshot on your stratum 1, which is what I did here. It's going to replicate, and you will see that it has to pull in quite a lot of files; this, too, took something like five to ten minutes, because there are of course a lot of changes in the repository. After that's done, it shows up on the client. So now, on the client, I can browse to my repository and go into that software directory: there's dummy 1.0, and then in share, I think, I added all those directories. I can go into one of them, and in each directory there are, I think, 500 files or so, which each contain, if I'm correct, just a number: the name of the directory and the file. That means they are not all the same; otherwise CVMFS would have been smart and deduplicated the files, if they were empty or had the same contents, for instance, and then it would probably have been quicker. That also brings me to another question that was raised yesterday on Slack: how long does it actually take before files show up on the client after you ingest them? That can vary, of course, but there are several layers in between that have to be synchronized. First you have to ingest it on the actual stratum 0, which can take a while. After that's done, the stratum 1 has to pull in all those changes, which depends on your cron job, or on whenever you run the snapshot command manually. So first you have to synchronize the stratum 1, which can take a few minutes before it starts, and the actual synchronization itself can also take a while, depending on how many files you have added; for me this took five or ten minutes before the stratum 1 had pulled in all the changes. And even when the stratum 1 is synchronized, it may still take a bit of time before the changes show up on the client, because the client has this catalog, and the catalog has a
default time-to-live value of four minutes, which you can change if you want, but that's the default. That means that, in principle, you have to wait at most four minutes before the client checks for a new version of the catalog. I think you can force it to reload the catalog, for instance by remounting the repository, but if you don't do that, you might have to wait a few minutes, and then the next operation you do on the repository will pull in the new catalog, and you should see the new files. Let's go back here. Assume that I now want to fix my catalogs. First I have to open a transaction, and then I can go into my repository and either start creating the dirtab file, or go into the software directory over here and manually add all the `.cvmfscatalog` files. If I just want to make a dirtab, I add the file here, and I have to tell it that I want catalogs for the software directory and the right subdirectories. I'm not completely sure what the layout is again, so let's check: there's software, then dummy, and then the versions; so I want dummy and the versions. Now I basically get one catalog per installation, which might be enough; I'm not sure anymore how many files I added inside those directories, but let's first try this and see what happens. So I commit this file and run a publish on my repo; it should now read my dirtab file and start creating those nested catalogs, which might take a bit of time, because it has to process everything in the repository. Let's see in the meantime if there's anything else I haven't explained yet. No, I think I already reached the end, so in the meantime I can have a look at the questions. There's quite a lot going on in the chat, so I haven't followed it all, but there was a very good question, let me scroll back. Oh yes: can you purge old tags, or
prune the data if you remove a large part of the repository? Yeah, that's a good question. You can indeed manually remove tags if you want to; let me make this larger again. Where are the tag commands... over here you can see how to remove tags, which basically marks them for removal. We will discuss this tomorrow as well, there's a topic about it in the advanced section. Removing tags will not immediately remove all the files from the repository: you still have to do garbage collection, which is another command, also listed somewhere here, the gc command. I will discuss how that works tomorrow, but basically you mark tags for removal, and the garbage collection then scans which files are no longer referenced by the still-active tags in your repository; every unreferenced file is then removed from the repository. You do have to enable garbage collection and do some other things first, but I will show that tomorrow. Now you see here that it's going to defragment the catalog, because it can clean up lots of things from the root catalog, which are now moved to the nested catalogs. To verify that these catalogs have been created, you can of course go into the repository yourself and check: for instance, in my dummy 1.0 folder there is now a `.cvmfscatalog`, which indeed shows me that there's a catalog here now. In the same way you can check the other application: also for my app you see that this file, which is the indication for the server commands that a nested catalog should be created here, is now present. There's also a nice command for this, which I already showed you: list-catalogs (sorry, typo). This shows, in slightly more graphical output, where these catalogs are located: it lists all the paths that have a catalog. And there's a nice option for it, again for list-catalogs, with which you can get the actual number of entries in the catalog
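On the stratum 0 that looks roughly like this (repository name is hypothetical):

```shell
# list all nested catalogs in the repository, one path per line
cvmfs_server list-catalogs repo.yourdomain.tld

# -e additionally prints the number of entries per catalog, which is handy
# for checking the 1,000 to 200,000 entries recommendation
cvmfs_server list-catalogs -e repo.yourdomain.tld
```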
So that's the -e option over here; if you run that, you will even see how many entries are in each of the nested catalogs you've added. Now you see that even the ones I created here are still quite large, so you could split them up even further. For this one it might be a bit tricky, because, if I'm correct, we have 500 folders with 500 files each, so you have to come up with something smart to make the catalogs neither too small nor too large. I will not do that now, but that's part of the exercise as well; there the structure makes a bit more sense, so it's easier to solve and to make sure you get the right sizes for your catalogs. On the client basically nothing changes, or at least you will not see it: the client still works in exactly the same way from a user's perspective, but under the hood it will now pull in the right catalogs based on what you access. So if I access this dummy 1.0 directory over here, doing an ls on it, it will only pull in the catalog for that particular folder, and not for dummy 2.0 or the my app folders; it does that on demand.

I think that's everything I wanted to cover for today, so let me explain a little bit about the exercise. We prepared a similar kind of tarball for the exercises, which is in some sense a bit simpler in structure, but on the other hand a bit more complex: it has a structure with several architectures, and inside those folders the applications. If you've already played around with the EESSI repository, that's a similar structure: you get software installations for, for instance, Intel Haswell and other architectures, and for each architecture a similar tree with some dummy applications in it, where some of the applications have lots of files; so we added a folder with lots of files.

What you have to do in the exercises: first grab the tarball from the specified URL, and don't extract it manually; that can take a very long time because there are lots of files in there, so just leave it as a tarball. This shows the overview of what's actually in the tarball: as you can see, there's an AMD Rome part with modules and software installations, so we have a software installation for R and for Flensource stream, whatever that may be, and it also tells you a bit about the number of files and directories you will find in there. Those are very small, but here is where you have to be careful: the examples directory of OpenFOAM has three subdirectories with lots of files in them, so this is where you have to watch out with the catalogs. The same structure is then repeated for two other processors, an ARM processor and an Intel Haswell processor.

So what you're going to do in the exercises is first ingest the tarball, using the ingest subcommand, into a directory called easybuild. Again, this more or less mimics what you might do if you set this up yourself: have one or several build machines, build the software there, tar it up as one big tarball, and ingest that in one go on your server. Then you should see the error I showed you about the catalog being way too big, so you have to think a little bit about where to create those nested catalogs. Don't do it manually yet; just think about where you would place those marker files if you had to do it by hand, and then, instead of doing it manually, make a .cvmfsdirtab file with the right locations for those nested catalogs. Then you do another publish, because you have to add that file to the repository, and you have to make sure that at least the warnings go away; instead you will probably see those defragmentation
messages, because it's going to move lots of entries to the nested catalogs. Then you have to verify that the catalogs are indeed in order, either by doing something like this, a find over your repository that prints all the locations of the .cvmfscatalog files, or with that list-catalogs command; if you use -e you can also see the entry counts of all the catalog databases. There's a solution for the .cvmfsdirtab here, but first try to come up with it yourself, and then maybe use this one to verify that you've done it correctly.

Okay, that was everything I wanted to tell today. I see at least one raised hand here in Zoom, so Marcus, go ahead. Sorry, I was muted. Regarding the replica on the stratum 1: if I have set up a cron job for doing a snapshot every five minutes, and I'm ingesting, for example, a million files, the snapshot will take its time; what happens if the cron job starts two snapshot operations at the same time? I think it will detect that a snapshot is already running and then just skip it and do nothing, so it will not cause any issues. Even a snapshot that takes several hours is fine. In this case it should go quite fast, because both VMs are running very close to each other, but in practice you might have one stratum 1 in Europe and one in the United States, for instance, and then a very large transaction can take hours, or maybe even days if it's an initial snapshot consisting of several terabytes of software. That just works; it will simply take a very long time. Okay, perfect, thanks.

Victor, go ahead. Yeah, a question more about the EasyBuild integration with CVMFS itself: does it make sense to let EasyBuild support creating the catalogs automatically, so that it creates these .cvmfscatalog files on the fly? So that it passes something like a --cvmfs-catalog option, and even creating the catalogs themselves could perhaps make sense. We have indeed also thought, for EESSI, about maybe using an EasyBuild hook to just add a .cvmfscatalog file to each software installation, because in practice you probably want one per installation anyway. So yeah, that could make sense; but maybe, as an EasyBuild developer, you can say something about it? No, yeah, I think that makes sense, though of course it only makes sense if you're using CVMFS. But is having these .cvmfscatalog files in the tarball you ingest enough to make it a single ingest, or do you still have to do a separate transaction afterwards?

That's a good question; I don't even know if we added a warning about this here, but there's one small issue with ingesting tarballs and nested catalogs: if you do an ingest of a tarball, the ingest command will basically ignore the .cvmfsdirtab file, so it will not create the nested catalogs for the contents of the tarball; it only does that on the next publish. So what I often have to do is ingest the tarball and then do a manual transaction plus publish with no changes, because then at least it will read the .cvmfsdirtab file and generate the nested catalogs. I think they are working on a solution for this, so I hope that in the near future this is no longer necessary, and then I would indeed expect that if you put those .cvmfscatalog files in the tarball, it will automatically use those as well. But yes, I think it makes perfect sense to let EasyBuild create those files for you, or to do it in some other manual way, or with a hook.

And to be clear about these .cvmfscatalog files: they're just markers for where the catalog should be created, right? They remain empty; it's not like CVMFS puts the catalog in there, that's all internal; it's not an actual catalog itself? Indeed. I can also show you those files on the client side; let me pick one here, although it's
probably not visible on the client yet, but let's do it here then: software/dummy-1.0. You will see the file, but it's still just an empty file; it's not that this is the catalog itself, it's just a marker which says: for this subtree I want a nested catalog. And if you ever want to get rid of it, you can simply delete this file again and it will basically be merged back into a higher-level catalog. So the reason you have to do the two-step thing with nested catalogs is that on ingest it's creating the markers, but not actually creating the catalogs; that's why you need a second transaction? Yeah. Okay, so it sounds like something that could be fixed in CVMFS itself.

There was another question in the chat: is there a way to clone a repository at a given tag, to archive it forever, similar to what git archive can do; so, make a full copy of everything at a specific tag, I guess for backup reasons. I'm not aware that something like that exists; let's take a look here, but I don't think there's any kind of export command. There's the tag command, but that cannot do this, that's just tagging. No, as far as I'm aware, that's not possible. And I think the question is actually not to create a separate tarball, but to make sure that a tag remains there forever, so that whatever the state of the repository was at that point can never be lost; some kind of archive that sticks to the repository itself. That may be a good feature request to take up with Jakob; he's no longer in the call now, but maybe that's something to raise via their JIRA or their Mattermost, and I noticed they also have a new forum set up; what's it called... yes, a Discord, so you can jump into their Discord and ask there whether this is possible, or whether it's a good idea to add something like it. There is actually a checkout command; I'm not completely sure what that does compared to a rollback, so maybe that's something you could play with: check out a specific tag and then make a tarball or whatever. That's still not quite the archive that you mean, but yeah, we can probably ask Jakob.

Okra has a question as well. Yeah: does the ingest command honor the auto-catalogs setting, so that it creates the catalogs on the fly, or do you still need a dummy publish? Sorry, I didn't get the first part of the sentence. Does the ingest command acknowledge the auto-catalogs setting when it's set to true, so that it actually creates the catalogs on the fly, or do you still need an empty publish transaction? If I'm correct there is something in the documentation about the ingest command where you can indeed create catalogs, or was that this one... I'm not completely sure, but at least with the .cvmfsdirtab file it will not do that. Maybe if you use the auto-catalogs setting it might be able to; I haven't really played with that myself, because I think for most cases it's better to do this manually, since you often know better which files belong together, or at least better than the cvmfs_server command, because you probably know best what kind of structure you're ingesting into the repository. Yeah, we're using both. Okay, then just make sure that they don't get too large anywhere.

There's another good question in the Slack as well, from Apil: is there a way to check if a transaction is already open on the repo, short of getting an error? I think the info command will do that; or was it list, or info... no, here it doesn't say it; it's list, so cvmfs_server list shows "in transaction". And correct me if I'm wrong, but there's also a way to start a transaction on a part of the repository: you can give it a path and say, I'm only going to make changes to this specific part of the repository, but
maybe that's the gateway stuff, or can you do that directly on the stratum 0? That's definitely possible with the gateway, indeed; whether you can do it with a regular transaction, probably not, I don't see that here. But indeed with the gateway, and that's also a topic we're going to cover tomorrow in the advanced topics: if you use a gateway, which allows you to have different publisher machines, different machines that are allowed to make changes to the repository, then you can add a setting that says this build machine only has access to this part of the repository, and then you can even have multiple transactions running at the same time, as long as they don't overlap in whatever they have access to. But yes, the list command does show whether something is in transaction or not, and I think that was the main part of the question; if you don't check it and accidentally try to open a second one, you will get an error, or warning, or at least a message that says this doesn't work because there's already a transaction open.

Okay, a more practical question, maybe, by York: how long will we have access to the VMs? We will keep the VMs running until tomorrow evening; we plan to wrap everything up before the weekend, so around 8 p.m. European time tomorrow evening we will kill everything that's still running and close the accounts. Until then you can play around, and if you're done with playing, so if you've done all the exercises and maybe some stuff after that, which you're certainly welcome to do, then as soon as you're done and don't plan to use it anymore, please terminate the cluster yourself.

Okay, I don't see any more questions. So I have one, if you allow me. Sure, Victor. It's about the balance between the number of catalogs and the number of files per catalog; it sounds like a trade-off between the size of one message versus multiple messages, right? So, is it because you can parallelize the importing of the MySQL, sorry,
the SQLite database, or is it because it takes forever to read the files? And this limit of 200,000, where does it come from; is it empirical? I think it's related to the following: say you have one extremely large database for basically your entire repository. On the client side you have a kind of virtual file system, so the files are not actually there, but when you access a file under /cvmfs/..., it basically translates that path to the compressed file that it needs, and for that it has to run queries on that database: I'm accessing this file, what is the actual compressed file name? All the compressed objects have a file name consisting of the hash of the file, and the client has to pull them in on the fly from your proxy or stratum 1. I assume that if you make that database too large, it becomes too slow to run those queries on such a large SQLite database; plus, of course, you're pulling in a very large file, a very large SQLite database, of which you might not even use 90% of the information, because you just want to access one specific subdirectory. In this case, if you make nested catalogs and you access just one application folder in your repository, all it does is pull in, I guess, still the root catalog, but then also the nested catalog for that particular folder, which is small, so it's quick to retrieve, it won't take long to download, and it's much quicker to run a query on that smaller database. On the other end, if you make them too small, you're going to have to pull in quite a lot of those small databases, and you have to open them all and run queries on all of them, so making them too small is not the best idea either. The exact numbers, I don't know where they come from, but we noticed that CVMFS gives you a warning if you go over 200,000 files in the root catalog, while for nested ones you can actually go up to half a million before it starts spitting out warnings. So there are some heuristics there; I don't know how they were determined exactly. That's a very good question for Jakob, who is not in the call anymore; I'm sure he would know. One of the reasons is that the root catalog is needed by all clients, always. Yeah, so keeping that one smaller makes sense, but how exactly they got to the 200k and 500k, I'm not sure. Okay, thanks. Also, it's just a target: if you prefer keeping smaller catalogs, so if you keep the limit at 100k yourself, that's fine.

That raises a question, because, going back to my previous question about EasyBuild supporting the creation of catalogs: you could do something like --cvmfs-catalog equals some number of files, and then always split at that number of files. For EasyBuild, going back to creating the .cvmfscatalog files automatically: we do have installations that are quite small, like one binary, maybe two libraries, and a few files here and there, so maybe a dozen files, while the recommendation is to have at least a thousand files per catalog; with that in mind, creating a .cvmfscatalog for every installation is probably not a good idea. On the other hand, we do have installations like R where there are certainly several tens of thousands of files, and there you definitely want at least one catalog, maybe even multiple ones. So it would be great if you automatically skip the ones that are very small, but give a warning, say: not creating a catalog because the installation is too small, and you can force it with some --force-catalog-creation option or something like that. Sure; and then it could just create a catalog higher up, one catalog for all the software versions, if it knows the installations are small. Yes, there could definitely be some logic there, and I think that's also what we'll do in the EESSI project: we will probably combine nested catalogs
via the .cvmfsdirtab together with .cvmfscatalog files that were either created manually or by EasyBuild; so we'll do a combination of both for the exceptional cases. Thanks.

I think we should wrap up here, because the ReFrame tutorial is starting in 15 minutes and we have to make sure we don't overlap with that in terms of streaming. So if there are more questions, please use the Slack channel, and we will make sure there's some time for questions tomorrow as well, in the last session. I think, Bob, tomorrow is mostly going to be explaining and showing things; we don't have an exercise anymore? Yeah, no hands-on; a bit more advanced topics tomorrow, so indeed without an exercise. Okay, all right, thank you very much everybody, see you tomorrow. Thank you.