We'll continue with the second part of the CernVM-FS (CVMFS) tutorial. To start, I'll skip the slides we covered yesterday. This is where we ended: we have set up a Stratum 0 and a client, and we're using the client to connect directly to the Stratum 0. That's not recommended in a production setup, but it's a good starting point.

A couple of things to recap after the questions and comments we got yesterday. First of all, multiple people asked the same question: can you use multiple CVMFS repositories on a client? Yes, you can, even if they are served by different Stratum 0 servers or through different Stratum 1 servers; that's absolutely not a problem. They can be from different domains as well. And if you have set up the client, you are already doing that, although you may not be aware of it. Here I have my client set up with the small test repository I created yesterday. If you check the CVMFS configuration under /etc/cvmfs, you will see there is a default configuration there as well, and related to that, configurations for several domains: cern.ch, egi.eu, and opensciencegrid.org. These are included in the CVMFS installation itself. This means that if you look under /cvmfs you won't see much at first, due to autofs, but you can actually access some of the CERN repositories as well. One of them is cms.cern.ch: if you ls that path, it should pop up after a couple of seconds while the metadata is being pulled in, and there's a whole bunch of stuff already available. Likewise, there's also unpacked.cern.ch, which I think contains unpacked Docker containers in a CVMFS repository. So you already have access to multiple repositories on top of the one you created yourself.
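As a sketch of what this looks like on a freshly installed client (paths as on the tutorial VMs; the exact file names may differ slightly between CVMFS versions):

```shell
# Configuration shipped with the CVMFS client itself:
ls /etc/cvmfs/default.d/   # default settings, e.g. the CERN config repository
ls /etc/cvmfs/domain.d/    # e.g. cern.ch.conf, egi.eu.conf, opensciencegrid.org.conf

# Thanks to those defaults, CERN repositories can be mounted on demand via autofs:
ls /cvmfs/cms.cern.ch      # triggers the mount; takes a few seconds the first time
```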
And of course, you can install the needed configuration for additional repositories as well. I'll talk a little more about how this works, because there's some magic going on in terms of how things are configured, but I'll cover the other recap items first.

There were some questions about the repository name as well. You can basically name it anything, but help yourself by using a suitable name, one that's specific to you in the case of this tutorial. The domain name part doesn't actually have to exist. It usually does: if you enter the domain in a browser, you typically hit a website related to it. But it doesn't have to be a real domain that resolves through DNS; it can be anything. As a consequence, you're fully free to pick the name of the repository and the domain. You could even name your repository test.cern.ch, and CVMFS will not complain, even though you don't own the cern.ch domain. Of course, that's not recommended.

There was one small mistake in the tutorial: it told you to run the chksetup subcommand before running the setup command. It's actually the other way around, since chksetup only works if the client is already set up. So run cvmfs_config setup first, before you run cvmfs_config chksetup. We fixed this in the tutorial, so if you refresh, it's now correct.

There was also an issue with debugging. People were using the debugging section in the advanced topics part of the tutorial, which tells them to point the CVMFS debug log setting to a log file where CVMFS will log things. A small but important detail is that this file has to be writable by the CVMFS user, the user that is used to run CVMFS. People were pointing it to a log file in their home directory, which was not writable by CVMFS.
Then not only do you not get logs, but you actually break CVMFS, which is a bit annoying.

And of course, the public key has to be in the right location: under /etc/cvmfs/keys, and then a subdirectory for the domain. On the Stratum 0, everything is directly in the keys directory; on the client and on the Stratum 1, we recommend putting the keys in a subdirectory per domain. The main reason is that a Stratum 1 typically serves repositories for a single domain, potentially multiple, but usually not a whole lot, while a client could potentially connect to hundreds of CVMFS repositories. So on the client you want things organized per domain; that makes a lot of sense.

This morning we added an additional section to the advanced topics page. Most of those topics will be covered on Friday, but there's one we added this morning that is a good follow-up to installing the client yesterday. As you noticed, and certainly some people ran into small issues here, when you manually configure a client you have to do a couple of things and do them right: use the right names, put things in the right location, don't forget the .conf extension, make sure the domain doesn't have any typos, and so on. So there's a manual configuration you have to get right. In addition, if you're using a proper repository with multiple Stratum 1 servers, which is what you'd have in production, anything beyond playing around, you also have to maintain that configuration. If Stratum 1 servers are added or removed, you should update your configuration to add or remove the corresponding IPs or host names. And with other changes, you may also have to be aware and update your configuration accordingly. That's a bit annoying, and first of all you have to know that the changes were made.
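As a sketch, assuming the domain and repository names from yesterday's exercise (replace them with your own), the manual client configuration looks roughly like this:

```shell
# Public key of the Stratum 0, in a per-domain subdirectory on the client:
#   /etc/cvmfs/keys/organization.tld/organization.tld.pub

# Per-domain client configuration (don't forget the .conf extension):
#   /etc/cvmfs/domain.d/organization.tld.conf
CVMFS_SERVER_URL="http://<stratum0-ip>/cvmfs/@fqrn@"
CVMFS_KEYS_DIR="/etc/cvmfs/keys/organization.tld"
```

The @fqrn@ placeholder is expanded by CVMFS to the fully qualified repository name, so one domain file covers all repositories under that domain.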
So you have to subscribe to some mailing list where notifications are sent, or something like that, and then also remember to actually make the changes every now and then.

CVMFS has a way of preventing these issues with what's called a configuration repository. This is really just another CVMFS repository, mounted under /cvmfs, but it only contains configuration files and public keys for a set of repositories. This is explained here, and it's actually also what's going on with CMS, for example. If you check in config.d, there's a cvmfs-config.cern.ch entry: that's the configuration for the cern.ch configuration repository, which is mounted by default when you install CVMFS. In there you have etc/cvmfs, and then a config.d and a keys directory, with the public keys for each of these domains and configuration files for different repositories, in this case under the cern.ch domain.

The nice thing is that when you have this configuration repository mounted, you don't have to update or maintain the configuration yourself. Just like with a regular software repository, changes made to the repository are picked up automatically by CVMFS. That's very useful: if the configuration changes, you don't have to do anything; CVMFS makes sure it always has the latest version of the configuration for these repositories. So it's a handy way of avoiding maintaining the configuration yourself.

There's one limitation: you can only have one configuration repository at a time. That's a pretty strict limitation in CVMFS. From a technical point of view it could be lifted, but it's not easy to do, and it's currently not something the developers are planning. So you're stuck with a single configuration repository.
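You can see which configuration repository a client is using by dumping the effective configuration; a sketch (repository names as on the tutorial VMs):

```shell
# Dump the configuration CVMFS is actually using for a repository:
cvmfs_config showconfig cms.cern.ch | grep CONFIG_REPOSITORY
# CVMFS_CONFIG_REPOSITORY can point to only one configuration repository,
# e.g. cvmfs-config.cern.ch

# The configuration repository itself is just another mounted repository:
ls /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/
```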
If you want to mount additional software repositories that are not covered by whatever configuration repository you are using, you will have to maintain those configurations yourself, either manually or by installing packages, RPMs or Debian packages, that provide those configurations. To remedy that a little, there is, under the cvmfs-contrib organization on GitHub, a config-repo repository where configurations for multiple CVMFS repositories are collected. I think there are three or four configuration repositories in there, each of which provides configurations for multiple repositories across multiple domains. So by mounting just one of these configuration repositories, you get a whole bunch of software repositories along with it. That's a bit of a middle ground: you can only have one configuration repository, but it can include the configurations of a lot of software repositories. So that's one way of getting access to more stuff.

Another way is what we do in the EESSI project. For those who saw Bob's presentation on EESSI yesterday: there we take a similar approach, or at least we provide a package you can install, as RPM or Debian packages, and even for macOS, that installs the configuration files for our configuration repository. If you do this, and I'll just copy-paste this and run it on the client, the package creates the necessary configuration for the EESSI configuration repository. You get this config file, and whatever else is needed to get access to the EESSI repositories is put in place. Just installing the package is enough; CVMFS will make sure things work properly. All the configuration for the EESSI CVMFS repositories is there, and thanks to that you also get access to the EESSI pilot repository, pilot.eessi-hpc.org.
And here you can see the structure Bob talked about yesterday: the compatibility layer, the software layer, for instance the x86_64 Intel Haswell software. There's a whole bunch of software installed with EasyBuild, in a nicely organized way. So once you have a package or a configuration repository, it's very easy to get access to additional software repositories.

Okay, then I'll give the high-level explanation here, and then hand it over to Bob, who will cover this in detail and also demo it. The next step in setting up CVMFS is installing a Stratum 1 server, a mirror of the Stratum 0, so we no longer have to connect clients directly to the Stratum 0, which is not recommended. We want to offload the Stratum 0 as much as possible, so that it only serves the Stratum 1 servers and not clients directly, and so that we can properly load balance things. Also, since there's only one Stratum 0, it sits in one specific place, for example in the Netherlands; if you're connecting from the other end of the world, there can be a long delay in terms of network latency. With Stratum 1 servers, you can have one close to you, which helps a lot with latency and bandwidth. So there are multiple reasons to set up Stratum 1 servers.

In this tutorial we will set up a single Stratum 1 server, but the procedure is the same if you want to add more. It will create a mirror, a full copy of our test repository, and it will automatically synchronize when we make changes. In addition to the Stratum 1 server, we will also set up a proxy server, a Squid, which acts as a front end to the Stratum 1 and does some caching. We will install a proxy both on the Stratum 1 itself and on a separate server, and Bob will explain why it's a good idea to do both.
Then we'll reconfigure our client to make sure it uses the proxy and the Stratum 1, rather than connecting directly to the Stratum 0. Bob will explain and demo all of that, and then there's an exercise where you basically do the same thing yourself for your test repository. So with that, I'll hand it over to Bob. I'll stop sharing.

Thanks, Kenneth. I'll start sharing my screen. There was actually a question about what Kenneth just did by installing the EESSI configuration package, so just to clear that up: the package we provide for EESSI is indeed a CVMFS configuration package that makes the client use a CVMFS configuration repository. And as I think Victor pointed out in the chat, Kenneth just mentioned that you cannot use two configuration repositories at the same time. So that will definitely cause some kind of issues, and it's a bit arbitrary which one will then be used. You can let CVMFS dump the configuration it is using at that moment, and you will see there is a CVMFS_CONFIG_REPOSITORY variable that can only point to one specific repository. So it will use one of the two, and I guess since the EESSI one worked, it was using the EESSI one at that point. Yeah, probably because I installed the EESSI package last, that one wins, and the other repositories are probably not accessible anymore. That's a good point: it's something you should not do in practice, or at least not in production.

I will now walk you through the page we have set up for installing a Stratum 1 and a proxy. It's a bit more complex than what we did yesterday, which was already quite prone to errors because of all the manual configuration; today there will be even more of that, because we are going to add two new servers in between the client and the Stratum 0.
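As a preview of that client reconfiguration, a sketch of the settings involved (file paths follow the client setup from yesterday; the placeholder addresses and port are assumptions to be replaced with your own):

```shell
# /etc/cvmfs/default.local on the client: route all requests through the proxy
CVMFS_HTTP_PROXY="http://<proxy-ip>:3128"

# /etc/cvmfs/domain.d/organization.tld.conf: point at the Stratum 1
# instead of the Stratum 0
CVMFS_SERVER_URL="http://<stratum1-ip>/cvmfs/@fqrn@"
```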
I'll do basically the same thing as yesterday: I just fired up two additional virtual machines for my test cluster. If you only have two running at this moment, then today you should start the Stratum 1 VM and the proxy VM that you have in the list, so that basically everything there is green. On the right side of my screen, I'm now logged into the Stratum 0 in the first tab, then the Stratum 1, then the proxy, and then the client. They have somewhat annoying names, so it's hard to distinguish them, but I've put them in the logical order: Stratum 0, Stratum 1, proxy, client.

So let's go to the page, Stratum 1 and proxies, on the tutorial website. I'm not going to explain again why we're adding these; Kenneth just did that. There are some requirements here in terms of ports: we need one more open port on the Stratum 1 compared to the Stratum 0. That's something you don't have to take care of on the virtual machines you're using for the exercises. If you do this yourself on your own machines, take it into account, so that the firewall actually allows access to those ports from the machines you want to give access to. It's up to you whether you want a public Stratum 1, for instance, or a more private one that can only be used by one specific cluster or a set of clients.

Something else, where the CVMFS documentation is a bit unclear about how strict the requirement is: in principle you do need a license key for the Geo API. That's an API used by CVMFS to determine which Stratum 1 server is geographically closest to the client.
Suppose you're running a large project, like EESSI, and you have multiple Stratum 1 servers distributed throughout Europe or the entire world. Say we have one in the United States and one in the Netherlands, and a client connects from France. The client will connect to a Stratum 1, find out which other Stratum 1s are available, and using this Geo API it will be redirected to the geographically closest one, so that latency is lowest and bandwidth is highest, assuming everything is fine with that Stratum 1 and the connection, of course.

To make this work, all the Stratum 1s need this Geo API license key, which you can obtain for free. You do have to register on the website linked here: sign up for a free account, and you get a license key that you pass to your Stratum 1 configuration. The CVMFS documentation seems to suggest that it's not strictly required, but if you don't use it, you will get a nasty error when you deploy the Stratum 1. We do mention the workaround, a little further down the page: bypassing the Geo API license key. You can set a variable in your server configuration file that basically says "I don't have a Geo database file", and in that case all the commands listed on this page for deploying your Stratum 1 will just work out of the box without complaining about the Geo API. If you don't want to set up the license key for the tutorial, that's fine: just set that variable. But in production it's really recommended, also by the CVMFS developers, to use the Geo API, because they make lots of assumptions that it is being used; they strongly recommend enabling this feature.

Okay, so how does the actual installation work? Well, first we need a bunch of packages again.
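The two options look roughly like this in the server configuration; the variable names are as I recall them from the CVMFS documentation, so double-check them against the tutorial page:

```shell
# /etc/cvmfs/server.local on the Stratum 1: either set the license key...
CVMFS_GEO_LICENSE_KEY=<your-license-key>

# ...or bypass the Geo database entirely (fine for the tutorial, not for production):
CVMFS_GEO_DB_FILE=NONE
```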
So I will just start installing those on the Stratum 1. I'm going to reuse my Stratum 0 and client from yesterday, but this is a fresh VM for the Stratum 1. I'm just going to run this; I hope I don't hit the same issues as yesterday. As mentioned here, you don't need this on CentOS 8; on CentOS 7 there's still this one dependency that's not available in the CentOS 7 repository but is available in CentOS 8. Other than that, you will see that we again need the CVMFS server package. We don't need the regular CVMFS client package on the Stratum 1, but instead we need the squid package, because the Stratum 1 basically runs Apache with Squid as a front end. And there's one more package, a kind of Python interface for Apache, so that Apache can talk to the Geo API and send out the queries that figure out which Stratum 1 is closest to the actual client. That's all explained on the page as well: why you need these packages and what they do.

So this is almost done. Now we have to do some trickery with the different ports and configure them the right way. Apache doesn't have to be reachable from outside this machine, because Squid will be the front end to Apache: we connect to Squid, not to Apache. So Apache only needs to listen on an internal port on the internal address. That's what we're going to do now: edit the standard Apache configuration and look up the line you see here. By default it listens on port 80; we're going to comment out that line and add this one instead. Then save and exit; that's everything we need to do for Apache. And then for Squid, we basically have to set it up as a reverse proxy, a cache in front of Apache.
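The Apache change is a one-line edit in the main configuration file (path as on CentOS; the internal port matches the Squid configuration that follows):

```shell
# /etc/httpd/conf/httpd.conf
#Listen 80                # commented out: Squid takes over the public port
Listen 127.0.0.1:8080     # Apache only listens internally, behind Squid
```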
And this is the default configuration you can use. You can tweak it a bit, for instance the amount of memory Squid is allowed to use, but other than that, it basically defines that Apache is running behind the Squid proxy, on port 8080 on localhost. So I'm going to go to the Squid directory. There's a default Squid configuration there; for now I will just move it out of the way, so we can start with a fresh configuration file, and then copy-paste what we use here.

The main thing Squid acts as a cache for is the API calls. That's explained on this page as well: this line checks the URL being accessed, and if it's an API call, it caches it, because those calls are the most load-intensive on this machine. So that's the Squid configuration, and then we're basically done with Apache and Squid. We can start the services and enable them, to make sure they automatically start after a reboot. That doesn't work in one go, apparently, so we'll do it one by one. That's all done. If you want to be sure they're running, you can take a look: this looks fine, and Squid should also say that it's running. If you ever have issues with this, we also have a section on the advanced topics page that explains where you can find the log files, for instance for Squid, and how to find out what kind of issues you may be having.

Then there's another thing we will not set up here, but the CVMFS documentation also recommends running some kind of DNS caching server on your Stratum 1. Again because of the Geo API: it does a lot of DNS lookups, because it has to reach out to other Stratum 1s, for instance. To reduce that load, it's recommended to have a DNS cache running on this machine.
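The Stratum 1 Squid configuration described above looks roughly like this; this is a sketch reconstructed from the CVMFS documentation, so compare it with the tutorial page before using it:

```shell
# /etc/squid/squid.conf on the Stratum 1 (sketch)
http_port 80 accel        # Squid is the public front end
http_port 8000 accel
cache_peer 127.0.0.1 parent 8080 0 no-query originserver   # Apache behind it

# Only cache the Geo API calls, which cause the most load:
acl CVMFSAPI urlpath_regex ^/cvmfs/[^/]*/api/
cache deny !CVMFSAPI

cache_mem 128 MB          # memory for the cache, tweak to taste
```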
There are tutorials for that available online. For now it's not that important, and I'm not even sure it's critical in production setups, but there are several services you can use as a DNS cache; you can look that up yourself or, for instance, follow this tutorial about systemd-resolved. For now I'm going to skip that and continue with setting up the actual Stratum 1.

Again, you need this license key, which I'm going to skip for now; you would add the key to this file, but instead I'm just going to use the workaround. The server.local file is where you define local settings for either a Stratum 0 or, in this case, a Stratum 1. I'll scroll down to the workaround again and put this line into my server.local file. You either have to do this, or fill in the real license key using the other line in server.local, otherwise you will get an error.

The next thing we need is, again, the public key of the Stratum 0. So I'll switch to the first tab, where I'm logged into the Stratum 0, go to the keys directory, and I need the contents of this file. On the Stratum 1, I go to the same location; it doesn't have a keys directory yet, so I have to create one. And again, notice the different structure we're using: we add another layer in this directory with the domain name. It's not strictly required, you can basically dump everything into the same directory, but it's useful to organize things a bit better and store all the keys for one particular domain in the same directory. So now I'm going to open the public key file and copy-paste its contents from the Stratum 0 into here; I hope that works without messing up the line breaks. Okay, that should be fine. If I'm correct, we now have everything in place, so I can leave this directory.
Then basically the only remaining step is to set up the actual replica: to register this system as a replica of your Stratum 0, which is done by running this command. It's a bit of a long command, so let's just press this button, which copies the entire command. There are a few things in here that you need to replace. First, you have to insert the host name or IP address of your Stratum 0. In this case we'll use IP addresses, so I have to look up the IP address of my Stratum 0; it's not listed here, I think, so let's copy it from CycleCloud: Stratum 0, copy the public IP address. In principle you could also use the internal IP address, but since in production these machines often run in different places, in practice you will often use the public IP address.

You will also see that I'm telling the command that my local user, cvmfs999 in my case, should become the owner of the repository, so that you don't need root permissions later on to work with the repository. And I point it to the keys directory. Again, you could store those keys in different directories, or dump them all directly into /etc/cvmfs/keys without the subdirectory; that's up to you, but I think it's good practice to store all the public keys of one domain, one organization, or one project in one directory.

So then I'll just run this. You will see some information about the Geo database: it says something about updating it. It has to pull in a new version of the database every now and then, which you can do manually, but if you use the snapshot command, which is basically the synchronization command we will see later on this page, CVMFS will automatically do that for you. So you don't have to maintain and update the database yourself.
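The command in question, as a sketch with placeholders (replace the Stratum 0 address, repository name, owner, and keys directory with your own):

```shell
# Register this machine as a replica of the Stratum 0 repository;
# -o makes the local user the owner, so later commands don't need root:
sudo cvmfs_server add-replica -o $USER \
    http://<stratum0-ip>/cvmfs/repo.organization.tld \
    /etc/cvmfs/keys/organization.tld
```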
In principle you can just skip what it says there; it's just showing what it's doing. No errors, that all looks fine, and then it's done. It tells you to run cvmfs_server snapshot to do the actual replication. It hasn't really done anything yet so far: it just registered this machine as a replica, but it hasn't actually pulled in the files from the repository. That's what the snapshot command is for.

Let's quickly go back to the page to check I'm not skipping anything. So I've done this. If you make a mistake at some point, or want to remove a repository you no longer use, you not only have to remove it on the Stratum 0, but also remove the replica on the Stratum 1. On both machines you can do that with the same command, which removes the file system for the repository and just wipes out all its files.

So now we can indeed run the snapshot command. I'm not actually sure we really need sudo here, because we made our local user the owner of the repository, so I'll try it without sudo. The snapshot makes the initial replication, which may take a while, especially if this is a new Stratum 1 for an existing project that has been around for a long time and has lots of stuff in it, because then it needs to pull in all those files. They are compressed and deduplicated, but for a large repository this can definitely take a while, depending, of course, on the bandwidth between the Stratum 0 and the Stratum 1. My repository only has one or two files in it, so this goes very quickly. So let's run it. Indeed, I don't need root permissions, it just runs as the regular... no, it does: it's complaining about the Geo database, which apparently does need root permissions. So let's try again with sudo. I think it did synchronize the repository itself, though; it only seems to need root permissions for updating the Geo database.
What you will see is that it found a few snapshots. That's something we'll cover later, but basically each revision of your repository automatically gets some kind of tag, so that you can go back to earlier versions. Here it says it found five different versions of the repository, and it's going to pull in the files, with 16 workers; you'll see something about the number of chunks it's pulling in. And then I just ran it again: it's replicating again, but there was no new content, of course, so there were zero new chunks. It also tells you which revision of the repository it is serving at this moment: revision three, since I made three revisions. The number of snapshots is a bit higher, because, if I'm correct, the latest revision and the previous one get additional tags that you can easily use to go back to the version you had before.

So that's basically it: I now have a Stratum 1 with all the replicated files of my repository. But of course, when I add new things on my Stratum 0, the Stratum 1 doesn't automatically pull in those added or modified files. You have to synchronize the Stratum 1 every now and then, which you can of course do manually by running this command whenever you make changes, but in production that isn't really workable. The best solution is to set up a simple cron job that runs this, for instance, every 10 minutes: it will check if there are new files and pull them in. You can do that by creating a cron job with, for instance, this line, so that it runs every five minutes in this case, executing the snapshot command. I'm using two options here, which are all explained if you run the command with the help option.
Basically, -a means: do this for all the replicas registered on this machine. Let me make this a bit bigger. So -a is explained here: it means all active replicas on this machine. At the moment I only have one, so it doesn't really matter, but if you add more repositories later on, this is really useful: it will just synchronize all of them. And -i, it's up to you whether you want to use it, but it means: skip repositories that have not had an initial snapshot run. That assumes you have done the snapshot at least once manually, and then it will automatically update all those that have. You can leave it out if you want, but then again, be warned that the first snapshot may take a very long time if it's done for lots of repositories at the same time.

One other remark if you want to use this: the -a flag only works if you have set up log rotation for the CVMFS log file. That's easy to set up: you just need the logrotate service, which I think comes by default on CentOS, and you have to add this configuration in a file in this directory; for instance, you can create this one for CVMFS. You can just copy-paste this, and if you want, modify how often the logs are rotated; let's just use this for now. Okay, let's do that. And then you can add the snapshot command as a cron job, for instance in this file, so that it runs every five minutes. I'll skip that for now; it's just copy-pasting into that file, and it's not important right now.

So that's the Stratum 1. Once you have this in place, you could already reconfigure your clients to no longer connect to the Stratum 0 directly, but to the Stratum 1 instead. I'm not going to do that right now, though: first I will also set up a proxy, so that we can configure our client to use both the Stratum 1 and the proxy.
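Both pieces together, as a sketch; the file names and logrotate options follow what I recall from the CVMFS documentation, so verify them against the tutorial page:

```shell
# /etc/logrotate.d/cvmfs -- log rotation, required for 'cvmfs_server snapshot -a'
/var/log/cvmfs/*.log {
    weekly
    missingok
    notifempty
}

# /etc/cron.d/cvmfs_stratum1_snapshot -- replicate all active replicas every 5 minutes
*/5 * * * * root output=$(/usr/bin/cvmfs_server snapshot -a -i 2>&1) || echo "$output"
```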
Kenneth already explained this, but the proxy basically adds more scalability to your system. In principle, you should have only a handful of Stratum 1 servers, because each one also adds some overhead; you should not start setting up tens of Stratum 1s, a few is often enough. Local sites can then just add local proxies. It's recommended to have at least two, so that if one goes down you always have another proxy, and you can add more depending on the number of clients you have. But already with two proxies, with just a few cores, a bit of memory, and some disk space for the cache, you can probably serve hundreds of clients. For now I'm going to set up just one; again, at least two is recommended, but if you know how to set up one, it's easy to add a second.

So I'm going to switch to the third tab, where I'm logged into my proxy server. All I need to install in this case is squid, which is in the default CentOS repository. But again, I think there's a warning at the top of the page about the Squid version in the default CentOS 7 repository: as you can see here, it's Squid 3.5.20, which is quite old, and I think even end of life. I recently got an email from one of our security experts at the university that one of our Squid servers was running this version, which was end of life and considered a security issue. Unfortunately there's no updated package in either the CentOS or the EPEL repositories, but if you want a newer version, there's a link on this page to a Squid page listing repositories you can use to obtain Squid version 4. For now, I'm just going to use this version; Squid 4 should work fine with the same configuration. The same applies to the Stratum 1, of course, because that also has a Squid running; in practice you should probably use a newer version there too.
If you use CentOS 8, then it's fine: you'll automatically get squid 4 by default. So again, we need a squid configuration on this machine. Let's move the original one out of the way again, the same thing as I did before. And this one is really more specific to your setup, so it's not something that you can just copy-paste out of the box. We can start with this template, but you will have to change some things here, because this squid proxy is acting as a proxy between your clients and the stratum ones. There are a few different things that you may want to configure here. First, there's the port. This is the standard squid port, but you can change it to a different port if you want. And then you have to provide some ACLs, because if you make this a publicly available squid proxy, then probably within a couple of days it will be abused by people to do weird stuff with it, and it will start proxying or acting as a cache for weird websites, which you probably don't want. You can limit this in two ways. First, by limiting the destination domains: basically the domains for which this squid proxy is acting as a cache. And since it's intended to be used as a cache for your stratum ones, you probably want to limit this to the list of stratum ones that are available for this project. If you have your own stratum one nearby, you don't want to limit this only to your own stratum one; assuming your project has three or five stratum ones, you want to allow all of them, because then if your stratum one goes down, this proxy can still use a different stratum one. How you can do this depends a little bit on the setup that you will have in practice. You can, for instance, use the destination domain: if all your stratum ones live under the same domain name, then you really need that domain name to exist.
So for instance, for our EESSI project, we have an .eessi-hpc.org domain name, which really exists, and we use DNS entries there to basically point to our different stratum ones. And then you can easily say here: I want to use this for everything that lives under this eessi-hpc.org domain, and then it can use all those stratum ones that are registered with our DNS entries. You can also use regular expressions by using this one. But for now, we just have one stratum one anyway. I think that's also explained here: you can also use just this line, where you provide one IP address, and then it will only serve as a cache for the IP address that I'm going to give here. So then I need my stratum one IP address, which is this one. This tells squid to only act as a cache for my stratum one. Then the same is true for the clients. Where is that? I skipped over that; it's defined on this line. So basically the source, that means the clients that are going to connect to this squid proxy. That's also something that you really want to limit in practice. You can do that with an ACL, but of course also with a firewall if you want, or both, to be sure. By using this ACL, you can define the range of IP addresses that are allowed to access your proxy server. A proxy should always live close to the actual systems that are going to use it. So for instance, if you have an HPC cluster and you want to set up a proxy for that cluster, preferably put it in the same network, so that the bandwidth is really high, and then just allow all the client IPs of your cluster to access this proxy. For now, I'm just going to put in the IP address of my client, so only this machine is allowed to use this proxy. But this page also explains, not in detail, but it does mention, that you can use CIDR notation to, for instance, allow a range of IP addresses to access the proxy. I think that's all I have to define.
And then there are some special lines here that basically say: deny everything that's not a stratum one. So deny access to all destinations that are not listed in my stratum ones ACL, only allow access from the local nodes (my local clients, basically, defined with the other ACL over here), and also allow localhost, because if you're trying something on the local machine itself, of course, that should be allowed. Everything else should be denied. Then there's a little bit about the actual cache sizes, both for the memory and for the disk. The more disk space you add to your cache, the more it can store locally, and the better the performance will be, or the lower the latency to start running stuff from your repository. So for large repositories, really increase this to something much larger than only five gigabytes. For now, that should be enough, since we're not going to add lots of stuff during this tutorial, and the disk space of the VM is probably also not extremely large. So this should be okay now. And since this configuration file for this particular squid proxy is a little bit more involved, you can ask squid to verify that the configuration doesn't have weird errors. Using the parse option, you can check if it's okay: it's just going to read the configuration file and check whether it can parse and process all the lines that you've added. I don't see any errors, so that should be fine. And then I can start and enable the service. So that's already the part for setting up the proxy. Now I have a stratum one and a proxy, and the last step is to reconfigure our client. The client is still directly connected to the stratum zero, which is really a bad habit if you do that in practice, so we now have to reconfigure it.
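Putting the pieces just described together, a minimal squid configuration along these lines could look as follows. All IP addresses and the client range are placeholders that you would replace with your own:

```
# /etc/squid/squid.conf -- minimal CVMFS proxy sketch (placeholder addresses)
http_port 3128

# clients that may use this proxy (your cluster's network range)
acl local_nodes src 192.168.1.0/24

# the only destinations we are willing to cache: the stratum 1 servers
acl stratum_ones dst 203.0.113.10
# alternative, if all stratum 1s share a real DNS domain:
# acl stratum_ones dstdomain .eessi-hpc.org

# deny anything that is not a stratum 1, allow our clients and localhost
http_access deny !stratum_ones
http_access allow local_nodes
http_access allow localhost
http_access deny all

# cache sizes: in-memory cache and a 5 GB disk cache
cache_mem 128 MB
cache_dir ufs /var/spool/squid 5000 16 256
```

You can then validate the file with `squid -k parse` and, if it comes back clean, start and enable the service with `systemctl start squid` and `systemctl enable squid`.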
So the first thing you need to do is tell it to use the stratum ones instead of the stratum zero, which you can do by changing the file that you already created in the previous section of this tutorial. So I'm going to edit my configuration file, and instead of telling it to directly connect to the stratum zero, I'm going to put the IP address of the stratum one over here. So, this one, pasted here. Other than that, it should be fine; the keys are still the same. Let's save this. And of course, when in production you're going to use more than one stratum one, you can add multiple ones over here: separated by semicolons, you can just add a bunch of stratum one servers, so you hopefully always have one available when another one goes down. That's also explained here: make it a semicolon-separated list of servers. You will also want to enable the Geo API. For now, that's not important, because I'm not using a license key, so the Geo API is not enabled on the stratum one; plus, I only have one stratum one, so there's only one that can be closest to me. But if you are going to add more and use the Geo API, you also want to add this to your configuration, so that the client is instructed to actually use the Geo API, and the server too, of course. The other thing is: now I'm directly connecting to the stratum one, but I want to use my own squid proxy in between. And that's something that you have to change in the machine-specific default.local file that you should also already have created yesterday. In here, besides the limit for the quota on the client, there's also... oh, I didn't want to do that, I should open it with sudo. So, yesterday we added DIRECT here, meaning: directly connect to the stratum one, or the stratum zero in yesterday's case. Now we can add our local proxy, so that it will connect to the stratum one, but will go through our proxy first.
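The repository-specific client configuration being edited here could look roughly like this. The file name, repository name, key path, and IP address are examples, not verbatim from the demo; `@fqrn@` is CVMFS's placeholder for the fully qualified repository name:

```
# /etc/cvmfs/config.d/repo.organization.tld.conf
# semicolon-separated list of stratum 1 servers (just one in this demo)
CVMFS_SERVER_URL="http://203.0.113.10/cvmfs/@fqrn@"
# with several stratum 1s and the Geo API enabled, it would rather look like:
# CVMFS_SERVER_URL="http://s1-a.example.org/cvmfs/@fqrn@;http://s1-b.example.org/cvmfs/@fqrn@"
# CVMFS_USE_GEOAPI=yes

# public key of the repository, installed yesterday
CVMFS_PUBLIC_KEY=/etc/cvmfs/keys/repo.organization.tld.pub
```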
So it will basically ask the proxy whether it has the file that we want to access in its cache. If it does, it will just pull it in from the proxy itself; if it doesn't, the proxy will fetch that file for you from the stratum one and cache it. So we have to change it to something like this. Be sure to include the right port number here; we left it at the default, 3128. And of course, you will have to provide the IP address of your proxy server. Let me click here, because the Zoom toolbar is in the way... that works now. So I'm going to copy this IP address as well. This will now point to my proxy server, and I can close the file. Basically, that's all you have to provide. Now it knows that it has to go through the proxy to the stratum one, so it won't use the stratum zero anymore. That should be fine. Now it depends a little bit on whether or not my repository is mounted, because I have to reload the configuration. I think it's mounted at this moment. If you don't see this directory here, then it has been unmounted by the AutoFS service. If it's unmounted, you can just start accessing your repository again, and it will automatically pick up your new configuration. But since this is still mounted, I should either force it to reload the configuration, or first unmount the repository. Both can be done using this command: either with this subcommand, which will unmount the repository, or with the reload function for an already mounted repository. In this case, I'm going to use reload with my repository name. Again, forgot the sudo. It is failing... then apparently it wasn't mounted for some reason. Well, let's just force it to unmount. I guess it was already unmounted then. And now I can try to either access it again, or use a command like probe, which will also just connect to my repository... and that also fails. So I probably made a mistake somewhere, and I guess this doesn't work anymore now.
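For reference, the change to the machine-specific configuration and the reload step could be sketched like this. The IP address and quota value are placeholders; in CVMFS_HTTP_PROXY, proxies within one load-balancing group are separated by `|`, and groups by `;`:

```
# /etc/cvmfs/default.local
CVMFS_QUOTA_LIMIT=5000
# was: CVMFS_HTTP_PROXY=DIRECT
CVMFS_HTTP_PROXY="http://198.51.100.20:3128"
# with two proxies in one load-balancing group, falling back to direct:
# CVMFS_HTTP_PROXY="http://proxy1:3128|http://proxy2:3128;DIRECT"
```

Then, to make a mounted repository pick the new configuration up:

```
sudo cvmfs_config reload repo.organization.tld   # reload an already mounted repo
# or unmount everything and let AutoFS remount on the next access:
sudo cvmfs_config umount
```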
No, it's unmounted. So let me check my configuration here again. Where did I make a mistake? The proxy is pointing... sorry, this is the stratum one; it should be okay. My key is still here, I assume. Yes. The proxy should be in default.local; that also seems to be fine. Is the commenting out of the last line working? It should be okay. I can even try uncommenting it and see if that gives some information about what is wrong. So again, I'm going to do a probe, which will try to connect to the repository and hopefully give me some information in my debug.log. It's written as another user, so I need sudo for that as well. So this also shows you how you can debug issues... which is probably bad. Someone is saying in the chat that in the squid config I put the wrong IP address. Ah, okay, that would make sense. The squid of the proxy, I assume. Let's check. So the stratum ones ACL should be that one. Yeah... ah, yes. Thanks. I don't know which one this was; the stratum zero, probably. Yes. And my client just verified that as well. Okay. All right, let's go back here. Just to show you the debug.log: maybe that would also have told me the issue, although there's often lots of information in there. But it indeed gives some errors here when it's trying to access these URLs: forbidden. So it looks like it was indeed an issue with that. Let's try again. "Do you have to restart the squid?" Ah, yes, sorry, of course. Thanks. One more time. Still no luck: it returned an HTTP error, still forbidden. Let's check my stratum one. Something else that is a useful debugging command is also listed in the debugging section. Maybe it's the squid on the stratum one? No, that doesn't really have a lot of specific configuration; it should just talk to localhost. This is a good example for debugging, though. I can try this one, which is often a useful command: you can use curl to access the .cvmfspublished file, which exists in every CVMFS repository, and point that at my stratum one first.
With the IP address... so that's okay. So I can access my stratum one directly. And then I can also use --proxy and put my proxy in between, which is basically what CVMFS does as well. Ah, the port number, of course. That also looks okay. Now, does the probe work? Okay... I didn't change anything, so I'm not sure why it didn't work when I just tried that. Maybe I was too quick after the restart, or... "Maybe it was still restarting when you tried again." That could be, yeah. But now I can access my repository again. Still, you're not completely sure, of course, what it is using now, especially if you have a bunch of proxies and a bunch of stratum ones. So there are useful commands here to verify what it's actually using: another cvmfs_config command. Let's look it up at the bottom... that's this one. That will print out some useful information about the actual connection to your repository. Something that you should verify once you add a stratum one and a proxy is this line: make sure that it indeed connects to the stratum one that you're expecting to see here. And if you're using a proxy, it should say: I'm going to that stratum one through the following proxy. Also make sure that it says "online", because if the proxy is unavailable for some reason, it might skip that proxy and go directly to the stratum one, and then you're still not effectively using the proxy. So make sure that this IP address is correct, that IP address is correct, and the proxy is online. If you see that, then everything should be fine and the connection should be okay. But as you noticed, this is quite error-prone: you have to do quite a lot of manual configuration. That will be covered later on in the advanced topics, where we will talk a little bit about automation, which you probably want if you are going to add lots of stratum ones and lots of proxies, because then you will definitely make the same mistakes that I made today.
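The debugging steps used here could be sketched as follows. The IP addresses, port, and repository name are placeholders, so this is a sketch rather than something runnable as-is:

```
# fetch the repository manifest (.cvmfspublished) straight from the stratum 1
curl --head http://203.0.113.10/cvmfs/repo.organization.tld/.cvmfspublished

# the same request, but routed through the squid proxy, like CVMFS would do it
curl --head --proxy http://198.51.100.20:3128 \
     http://203.0.113.10/cvmfs/repo.organization.tld/.cvmfspublished

# for a mounted repository, show which server and proxy it is actually using
sudo cvmfs_config stat -v repo.organization.tld
```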
Copying wrong IP addresses, or making small errors in the ACLs or firewall settings. But for now, it's a good exercise to just try it yourself, so that you know what kind of issues you can expect in production: what kind of things have to be configured, which ports have to be open, et cetera. So for today, you can try doing the same thing that I just showed you. Set up your own stratum one server. If you want, you can request a Geo API license key and use that for your stratum one; if you don't want to, just use the workaround that I also used. Try to set up the cron job that will automatically synchronize your repository every five or ten minutes, whatever you like. And, of course, configure the Apache and Squid services, because they are required on the stratum one. Then you can already check whether that works for your client. But if you first want to do the proxy, that's also okay: then you do the same steps that I just did on my squid proxy server. Set up the configuration with the right ACLs, and then finally reconfigure your client to make use of both the proxy and the stratum one. And then make sure that it indeed connects to the right services, so that it no longer connects to your stratum zero directly, which again is the recommended way to use this. You can then basically limit access to the stratum zero to, for instance, certain users or certain IP addresses, and have a very strict firewall, to make things more secure. Okay, I think that's it for today. If there are any questions right now, feel free to use the microphone or the chat. Let's see if anything was in the chat... There were several questions in the chat, but I think we answered most of them, except for maybe Caspar's remark right now: is there any way to generically disable clients directly connecting, on the stratum one side? Actually, yes, I think there is. You can lock things down, right? Yes, sure.
In terms of the firewall, you can already do things as well: only allow clients from a certain IP range, all of these things. In production, that might be a bit tricky, because especially if it's some kind of publicly available repository, you're not really sure who is going to connect, and it would be a maintenance nightmare to register all the IP addresses of all the different proxies, for instance, that need to access your stratum one, which can in theory be tens, hundreds, or even more, depending on the size of the project. If every HPC cluster, for instance, had four proxies, then you would have to allow all those proxies to access your stratum ones. So I think in practice most stratum ones are quite open. But what you can do is, for instance, have a few private ones. If there's a site that really wants to have a private stratum one, with a full replica of the entire repository, that's of course possible. So you can add a stratum one, for instance, close to your HPC cluster, make sure that it fetches the entire repository, and only allow your cluster nodes to access that stratum one server. But even then, it's still recommended to have at least a few open, publicly available stratum ones, so that other people can connect to your repository easily, without having to fiddle around with IP addresses. And we do know that Compute Canada, for example, has a separate repository with licensed software like MATLAB and things like that, which they lock down to only their Compute Canada partners, while their other repositories are just open publicly to the internet. Yeah. I see another question: do you really need the Geo API, even if you only want to use it for the local center?
So if the entire CVMFS setup is your own local setup, including a stratum zero, and there's only one stratum one for your local center, then no, it doesn't really make sense, of course, to use the Geo API: then it's obvious which stratum one should be used. If it's part of a larger infrastructure, so let's say you're part of a large project and you add your own local stratum one, it could still be useful to use the Geo API, because if at some point your stratum one goes down, then you want your proxy, of course, to talk to the geographically closest one. Still, it's not required; otherwise it will probably just pick a random one. But it might be useful if there's some other stratum one close to your location, so that that one will be used when your own one is not available. So that's up to you. But I think in practice the developers really recommend to always turn on the Geo API; I think there are some weird issues when you disable it. There's also a difference between a public and a private stratum one. You could have a stratum one sitting in your network that's not actually publicly accessible to the world, and then I think it's less of an issue to not have the Geo API enabled, as long as you're not exposing it publicly so that others can connect to it. At least on your side, there's probably no issue in not having the Geo API stuff, but hopefully in the CVMFS network that doesn't cause any problems. So again, the CVMFS developers strongly encourage you to always have the Geo API enabled. "Is there a way of preventing stratum one servers from injecting malware into their snapshots?" That's a good question, whether there's a way to prevent that. Yeah. So I think when you mount a CVMFS repository, you're trusting that whoever maintains it makes sure that there's nothing malicious going in. Now, the stratum one server itself is not actually capable of doing that: it would have to be injected at your stratum zero.
So a stratum one just copies what the stratum zero provides, and CVMFS has security checks, fingerprint checks, some internal verification of the data it provides. So unless it's coming from the stratum zero, I don't think there's a way to inject malware, like a man-in-the-middle attack. No, the only way to really mess around with the repository is if you had the master key of that repository. That's the one you saw here on the stratum zero when I was listing the directory: if you create a repository, it will create this master key, and that's really the one that you would need to mess around with the repository. But it's only available here; it will not be distributed to any other servers, and not to stratum ones, for instance. So if a stratum one would tamper with the repository... I haven't tried that myself, but I expect that you will see a weird error on the clients, saying that the integrity of the repository is not okay anymore. So, as Kenneth mentioned, there are features in CVMFS to take care of the integrity of the repository. And also for storing the master key, there are some recommendations in the documentation: you can, for instance, use these YubiKeys, USB devices that you attach to your machine, so that you can store the keys safely on the device and they cannot be stolen from it anymore. You can only upload keys to such a device; you cannot retrieve them. But you can resign the repository using that USB key, so you can make it even more secure. I also see a question popping up on Slack now, which I had missed: what is the recommended VM capacity setup, in terms of size, CPUs, and memory, for stratum zero and stratum one servers to serve a full medium HPC cluster (400 nodes), run as KVM guests? And is it recommended to use VMs or bare metal? I don't think it really matters whether you use bare metal or VMs; both should be fine.
In terms of resources, I think we do mention a little bit about this on the tutorial website as well. But assuming you set it up properly, so with one stratum zero, maybe one or a few stratum ones, and a bunch of proxies, then the stratum zero doesn't need a lot of resources. It's just there to store the repository; there will not be a lot of load. The only exception is when you're going to ingest new stuff, because then it has to calculate checksums, compress data, and things like that. That will take a bit of resources, and a very large ingestion at some point might take a while, but usually that's okay anyway. Other than that, just a few cores and a few gigabytes of memory should already be more than enough for the stratum zero. And of course lots of disk space, depending on how much you want to store in the repository. The same goes for the stratum one: you don't need a very heavy stratum one, especially because you can easily scale out. If you ever find out that one is not enough, you just add another one instead of increasing the capacity of that single one. And even then, you can reduce the load by adding more proxies. So for one HPC cluster, assuming you're going to host everything yourself, I think you can get away with just one stratum zero, maybe even one stratum one, and then, for instance, two proxies, none of which need heavy specs: just a few cores and a little bit of memory. Probably the most important thing is the disk space, for both the caches and the repositories. I think that should easily be able to handle a few hundred nodes. I think Jacob also said something about that in his talk. I see Teria has already answered this. So yeah, even in terms of disk space, as Teria and Kenneth are saying on Slack, you don't need that much, because everything gets compressed and de-duplicated, so you will save a lot of disk space thanks to those features.
And that's true for both the stratum zero and the stratum one, and of course also for the proxies themselves, which store the compressed versions of those files locally as well. Let's see if there's something else... I think we covered all questions up until now. Ah yeah, there was another question from Sam: what's the difference between a revision and a snapshot? I wasn't quite sure on that. Maybe you are? Yeah, the terminology is a bit mixed up, I think, in both our documentation and the CVMFS documentation. So you indeed have tags, snapshots, and revisions that more or less all point to the same thing. On the stratum zero, you have this tag command, which will show you the different tags that you have. Maybe I can show that; it's also part of the publishing section that we will do tomorrow. But there's this tag command, which you can see here; they use both "tag" and "snapshot" already, so tags and snapshots, that's basically the same thing. So I can do tag -l for my repository, it's called repo.organization.tld... which I probably need sudo for. No? Okay. So here you will see those named tags. Let's make this a little bit larger. There are multiple revisions, basically, and that's maybe slightly different from a tag, in that a tag is basically a name that you assign to a revision. So one revision can have multiple tags, as you will see here. This is the default tag that you get with the default configuration of your server. Then, basically, every time you publish something new into your repository, you automatically get a tag named something like "generic" plus the date and time of the publication. The revision gets increased by one, and as I mentioned before, you automatically always get these ones as well: "trunk", which is basically the latest version. There's also a description that you can add yourself as well.
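The listing being shown could be reproduced on the stratum 0 with something like the following; the repository name, tag name, and message are examples:

```
# list all tags and the revisions they point to
sudo cvmfs_server tag -l repo.organization.tld

# assign your own named tag (with a description) to the current revision
sudo cvmfs_server tag -a release-1.0 -m "first stable release" repo.organization.tld
```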
So that one points to the current head, the latest version, and there's also "trunk-previous", so that you can easily switch back, for instance to trunk-previous, and then you don't have to use one of those long generic tag names; you can just say: go back to this one. So as you can see, we have five tags here for three revisions. A revision is really a change in your repository, and a tag is just a name given to a revision. I hope that clears it up a bit. A snapshot, as far as I know, is just the same thing as a tag, except that there's also the snapshot command, which is maybe a bit confusing: if you run snapshot as a command on your stratum one, then it will do the synchronization between the stratum one and the stratum zero. Okay, I don't see any other questions coming in, so maybe we can wrap it up here. If anyone has any questions, or any problems with trying the exercise of setting up the stratum one and proxy yourself, don't hesitate to jump into the Slack channel; that's the best way of getting help. And people watching the recording afterwards, who are maybe in a different time zone: feel free to jump in there as well, and we'll try to be around to help out if needed. Let's wrap it up here. Thank you very much, everybody. The next session of this tutorial is tomorrow morning at the same time, 9am UTC.