So, data storage. Where should you store data, Richard? Hmm, it depends on what kind of data — that's always the first thing I would ask. The main trade-offs are: how large is it? If it's really large, it belongs on the scratch system. Is it backed up anywhere else? Scratch systems are usually not backed up, so you want an original copy somewhere else. And how do you need to access it? Is it something you read at high performance during the job — say, a deep learning program streaming in data as fast as possible to train a model — or something the job reads once and then does a bunch of calculations with? Yeah, while we're talking about the scratch system, maybe we should explain what scratch is. We've mentioned home; some machines don't have a home directory. We have one on Triton, though we might not in the future, let's see. So what's the difference between home and scratch? Home is a single server that provides storage mainly for configuration files, so that if scratch were down you could still log in and do debugging. Nowadays that's not so important, so we will probably do away with it in the future. But the main file system — every cluster has some file system that is meant for high-performance computing. That usually means a fast file system connected to the fast network inside the cluster, and it's present on all of the compute nodes. This fast file system is the main workhorse of the whole thing: everything is stored there while the programs are running. When we say fast, how fast is fast on Triton? Is it tens of gigabytes per second? I think it's about — well, usually it's not under that heavy load.
I can't remember the exact benchmark numbers, but something like 20 gigabytes per second, as I recall. Yeah. The point is that it's hopefully fast for every user, even when there are hundreds and hundreds of jobs running at the same time — fast in the sense that it can serve all of those requests simultaneously. It's not fast the way your laptop is: your laptop might have an M.2 NVMe SSD that is faster than the Triton file system, but if you tried to use that drive from 200 computers at once, it would become quite slow. The cluster file system is a big machine with lots of disks inside it. The main point is that it's big, and because it's big and fast, it's usually not backed up. So it's not meant for storing critically important data. Say you do a patient survey and actually have to roll in the patient for an MRI scan or something: if you lose that data, you'd need to get the patient back and maybe ask them to put the cancer back in their head — not something you can do. Data that you need to be able to recover has to be stored in some other file system. The scratch file system is meant for the data you are actively processing and working on. It's a very reliable system — the biggest problem is more likely that you mistakenly delete something yourself — but be mindful that it's not backed up. It's meant for all of the actual working data. Yeah. So should we go down to the summary table? We've talked about these different considerations, and for most clusters you'll find a table that looks like this, listing the places where you can store data.
You see that home is at this path, has this much space, is backed up, and is available everywhere; work and scratch are here, with the default quota and no backup; and so on. At least on Triton, work is the same system as scratch under a different name: work is the personal storage space, and scratch is the shared storage space for multiple people. And that's what we really recommend. There are also things like local temporary spaces that exist only on a certain node. This is usually specific to the site you're using, but the point is: there is a correct place for the data, you just need to know where to put it. You should separate data that you don't want to lose from data that you can recreate, or data that is too big to take a backup of. Your code, though, you want to store with version control, because the code is basically your ideas made manifest in words. So it's very important to store your code somewhere under version control, which tracks changes line by line — it checks what the differences in the code are — and you can host it on GitHub, your own version control server, or wherever. Those tools make sure you don't lose the most important thing: your thoughts, your ideas. That's something we always recommend: have a version control system. Then you store a copy of the really important data in the backed-up system, and you use the heavy-lifting system — the scratch system, in our case — to actually run the simulations.
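As a minimal sketch of the version-control workflow described above (the directory name, file name, and identity here are all hypothetical):

```shell
# Create a repository for your code (directory name is hypothetical).
git init my-analysis
cd my-analysis

# Identify yourself to git (normally done once per machine; example identity).
git config user.name "Example User"
git config user.email "user@example.com"

# Record the first version of a script.
echo 'print("hello cluster")' > analysis.py
git add analysis.py
git commit -m "First version of the analysis script"

# Later: inspect the line-by-line differences, then record the new version.
echo 'print("hello again")' >> analysis.py
git diff
git add analysis.py
git commit -m "Extend the script"

cd ..
```

From here you could push the repository to GitHub or another server, so a copy of your ideas lives outside the cluster.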
One more thing we should mention: because the system is shared by all users, be mindful of what your code is doing — how many files it writes, how much I/O it does. For example, the notes mention that at CSC they recommend using containers for Python environments, because a Python environment can create thousands of files, and that can be a problem. On Triton we have a technical solution that mitigates this, so we don't necessarily need to do that, but there are various things you need to be mindful of. It's usually a good idea to check the documentation, follow the best practices, and separate your data into a pile of important stuff and a pile of stuff you're working on that you could recreate just by running your code again. Yeah. Okay, let's see. All of these data topics could fill a whole day of a course. So my personal recommendation is: if you're at Aalto, do what you can and drop by the garage, and we'll help you design a storage strategy just for you and get it set up. That's probably more efficient for everyone than trying to make a course so long that no one attends. Was there anything else under data storage? There's usually a quota for each place, which limits the amount you can store, both in the total size of the files and in the number of files. If you need more quota, you can usually ask and you'll get more. Yeah, maybe we should talk about remote access now — was it on the agenda? Yeah, let's. Sorry, my desk is occupied. In general, remote access to the data is about how you want to work with your data. Basically, you usually want access to the cluster's main file system — the scratch file system we were talking about.
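As a sketch of how you might keep an eye on your own usage (some clusters have a site-specific `quota` command — check your site's documentation; the directory here is created just for the example):

```shell
# Create an example project directory with a few files (hypothetical data).
mkdir -p my-project
touch my-project/a.txt my-project/b.txt my-project/c.txt

# Total size of the directory, human-readable.
du -sh my-project

# Number of files — quotas often limit this as well as total size.
find my-project -type f | wc -l
```

Counting files like this is useful before installing, say, a large Python environment into a quota-limited space.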
You usually need access to that file system from your laptop, and there are various ways of doing this. The easiest way at Aalto is to connect to the Aalto VPN and use the SMB mounts; then you can edit your code with your own editor and save the results into the remote mount. When we say mount, it means we take, basically, a virtual cable and plug it into the laptop. Can we look at these pictures here? Data copying, the picture on the right, is what people are used to: you had one copy and now you make two copies. It works without internet access, but if you modify one of them, the other one is still the old version, so you have to copy back, and that can get annoying. Yeah. Remote mounting, the picture on the left, is like a view: on your computer it looks like you have the file, but whenever you access it, the request goes straight over the network and the data is read from the other side; if you save it on your computer, the save is sent right back to the other side. That's really convenient for doing a lot of things. Yeah, and this is what happens inside the cluster as well: the files aren't on the compute nodes, and the file isn't on the login node — the file is stored in the storage system, and the nodes just get this view, this idea that "I have access to this file". Remote access just makes the data available to your own computer in the same way. Yeah. And this approach is very good if your files are easy to transfer over the network. But if your file is something like a 500-gigabyte deep learning data set and you want to view it on your machine —
remember that to actually view the data on your machine, you need to transfer it over the network, and that's no fun. Yeah. So usually what you want to transfer are code files, and maybe plots, and that sort of thing — not hundreds-of-gigabytes files opened across the network, where the data gets transferred again every time you do it. Okay. If you read down below, you see different options for doing this. First, at Aalto University the data on the cluster is automatically available on many of the Aalto Linux workstations and on things like the virtual desktop interface (VDI). If you're using one of these systems, you don't even have to transfer anything: you just sit down at your desk or log into VDI, go to the right place, and your data is right there. Going further down, there's remote mounting. This SMB mounting works on Windows, Mac, and Linux, but you need to be on the Aalto network — as far as I know; Simo said that might be different now, I guess you can check. There's also SSHFS, which works right over SSH and works well. Yeah, this should work anywhere you have an SSH connection: basically, it uses the SSH connection to make this "view" happen. Yeah. Okay, and then transferring data — you can read the different options down there. For things like code files, version control is really useful. For very big files, when you want to make certain the files are definitely copied, rsync is a very good tool, because it also checks file sizes, timestamps, and that sort of thing, so it verifies that the files really have been copied.
There are also fancier tools for this, like git-annex and DataLad and many others, but they are more complicated, so we're not going into them. Actually, that reminds me: tomorrow morning I can tell the story of how I used the cluster and a bunch of the tools we're talking about — and some we aren't talking about — to process all the videos in one night and get them published. It might be excessively complicated, but it could be interesting. Yeah, and for the legacy users out there, SFTP is an option, but I wouldn't recommend it for normal use. If you do use SFTP, I'd recommend one of the graphical client programs so you can transfer files back and forth; personally, I would use the mount instead. So, should we go to a general Q&A? We're getting toward the bottom of the material now. Okay, I'm switching to the notes.