So welcome back. We are here to talk about data storage now. This is an interesting topic because it may not seem very exciting, but I can guarantee you that we have more problems with data than with computing. Maybe that's not quite true, but it's pretty close. As much of computing is getting the data to the processor as actually using the processor itself, and as much of it is making sure your data is available, backed up, and not lost as it is just having the data on the cluster.

This tutorial has basically three parts: what the different storage locations are, how they appear on other kinds of systems, and how you access them remotely. The specifics are going to be different on every site, but the general principles apply everywhere: data is important, there are different file systems with different properties for storing it, and you need to think about this a little before you start. The actual locations and how you access them will differ.

So, basics. Any cluster has different ways of storing data, each with a different purpose: some are really big and fast but not backed up, some are small and slow but backed up, and you need to combine them. The general idea is that you have a home directory, which is just for you and usually used for configuration files and maybe installing some small software. On Triton, home directories are currently 10 gigabytes. Then there's usually a large scratch file system of some sort; Triton uses Lustre for this, as do many other clusters. On Triton it's divided into two parts: scratch, which is shared within groups, and work, which is per user. By default, every user has 200 gigabytes in work.
This is where the primary calculation data is stored during typical jobs. There are other custom places too: for example, at Aalto there are separate home directories, and there's also Project and Archive and so on, which we talk about more on Friday, so let's not go into those too much here.

It is important to think about file system speed before you start. Quite often we see that people's jobs are slow not because the code is slow, but because the data can't get to the processors fast enough, or the other way around: people start a massive array job that runs their code a thousand times in parallel, and every one of those jobs loads an Anaconda environment stored on scratch. That produces a huge amount of load on the file system, which makes the cluster slow for everyone.

So what are the factors to consider? I was just going to say that the file system is basically the blood circulation of the system: it sends everything everywhere, like nutrients to the brain, which here is the CPU. If the circulation gets congested somewhere, things get stuck all along the way. That's why we want to keep it as fast as possible, and it matters to you too: you want to get your work done as fast as possible, so you want the file system to not be contested.

We made a strategic choice not to go into too much depth on these things right now, but if you have questions, please come to the garage and we can look at them individually. As a quick rule of thumb for when you should think about this: when you start running hundreds of jobs, or reading hundreds of gigabytes of files, or doing advanced machine learning with big data sets, come discuss it with us, because otherwise you will not get the performance your code needs.
And in the worst case, you will get mail from us saying that something strange is going on in the file system and asking you to come discuss it in the garage. In general, you shouldn't be too worried about it; just be mindful that data is also a resource on Triton, so keeping it tidy and keeping it fast is important if you want to get things done quickly.

Okay, so another summary of the options. We have the home directory: 10 gigabytes, backed up, available everywhere. Work is available on Triton at /scratch/work/ followed by your username, and the environment variable $WRKDIR points to it for you. Scratch goes by group, as we said, so it's available at /scratch/, then the department, then the project; for example, /scratch/nbe/ and then some project name, say braindata or something like that. For scratch and work, we can give you basically as much quota as you need.

For local files, there's /tmp on every node; this is the same on pretty much all Unix computers, by the way. It's good to use for temporary calculation data during a single job: you copy things from scratch to /tmp, work on them there, and then copy the results back to scratch. Also note that many of our nodes don't have disks in them at all; they are just computers with CPUs and RAM, no physical disks, so /tmp may actually be a RAM disk, which means the data is kept in memory and is very fast to access. But of course, that also means a huge data set won't necessarily fit, because it goes into memory. It's a good idea to use these temporary directories, especially on the GPU nodes, where the temporary directory is on fast SSDs; they work well as a caching layer.
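The stage-in / compute / stage-out pattern described above can be sketched as a small shell script. On Triton you would use $WRKDIR (shared scratch) and /tmp (node-local disk or RAM disk); in this sketch both are simulated with temporary directories so it runs anywhere, and the "compute" step is just a placeholder transformation.

```shell
# Sketch of the stage-in / compute / stage-out pattern.
scratch=$(mktemp -d)      # stands in for $WRKDIR on the cluster
localtmp=$(mktemp -d)     # stands in for /tmp on the compute node

printf 'input data\n' > "$scratch/input.dat"

# 1) Stage in: copy input from shared scratch to fast local storage.
cp "$scratch/input.dat" "$localtmp/"

# 2) Compute: read and write only on local storage.
tr 'a-z' 'A-Z' < "$localtmp/input.dat" > "$localtmp/output.dat"

# 3) Stage out: copy results back to scratch before the job ends,
#    since /tmp is cleaned up and may be RAM-backed.
cp "$localtmp/output.dat" "$scratch/"

result=$(cat "$scratch/output.dat")
echo "$result"
rm -rf "$scratch" "$localtmp"
```

The point of the pattern is that the shared file system is touched only twice (once in, once out), while all the heavy I/O happens on the node-local disk.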
On some of the group servers, though not on the general nodes, there's a /l directory, which is usually a few SSDs available for use there. And if you need files that are super fast, there is ramfs, which looks like it's storing files on disk but actually keeps them in memory. If you scroll on down, you can read a little bit more about that; we don't need to go into it more right now.

Then we have quotas. We can't allow people to use as much space as they would like, because things would fill up. At Aalto, quotas are not so much to prevent people from using space as to make people think before they use it, and to ask us, so that we can keep track of things. If you need more space on scratch, just ask us and we can usually increase it. The whole scratch file system is currently 1.6 petabytes, and we are purchasing a bigger one this spring. We are currently using 900 terabytes, so about 700 terabytes are free. So it's not that space is scarce or that we have to conserve it; it's more that we want people to store data they actually use, not data kept just for the kicks. You will basically always get the space you need if you have a valid use case, but you need to tell us about your use case: what you're doing and what you need the space for.

We also look at the number of files you store. The scratch and work file systems can store a large amount of data really easily, but there is some overhead for each individual file you access. So if you make, say, 10 million files, we'll start wondering what you're doing and help you arrange your data a little bit better.
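One common way to cut down the file count, which we might suggest in a case like the above, is to bundle many small files into a single archive. This is a minimal sketch, using temporary directories so it runs anywhere; on the cluster you would do the same inside your scratch or work directory.

```shell
# Bundle many small files into one archive: on Lustre-style file systems
# each file carries metadata overhead, so one archive is far cheaper to
# store and traverse than thousands of tiny files.
workdir=$(mktemp -d)
mkdir "$workdir/results"
for i in $(seq 1 1000); do
    printf 'sample %s\n' "$i" > "$workdir/results/sample-$i.txt"
done

# One tar.gz instead of 1000 directory entries:
tar -C "$workdir" -czf "$workdir/results.tar.gz" results

# Every file is still there and listable without unpacking:
count=$(tar -tzf "$workdir/results.tar.gz" | grep -c 'sample-.*\.txt$')
echo "$count"
rm -rf "$workdir"
```

You can also extract single files from the archive on demand with `tar -xzf archive.tar.gz path/inside/archive`, so bundling doesn't mean losing access to individual items.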
Yes, we've had cases in the past where users had so many files in a folder that trying to open it with a normal file browser window would crash it. We don't want that kind of situation, because it means the users themselves can no longer access their data with, say, a file browser. We want to help you manage your data in a format where you can actually access it; we will help you deal with it, but it's something we can discuss together.

The data on Triton is available in other places within Aalto, for example through the virtual desktop interface (VDI), through the shell servers, and on various departmental workstations. So it's easy to seamlessly analyze data on Triton and then view it on a workstation of yours, if you need to.

Next we have a section on remote access. This is something we used to spend a lot of time on, but now we'll just give a quick summary; if you need more, you can try to follow the instructions, come to the garage, and we can debug it together. It's very important to be able to access your data from, say, your own laptop, wherever you are, and it's best to do that without having to copy it back and forth, because that just takes a lot of time. Through the SMB protocol, you can make the data on Triton available on your own computer; there are URLs on this page that you can enter. I'm actually not connected to the Aalto VPN, so I cannot demonstrate that right now. Simo, are you? No, same here. Maybe we can come back and demonstrate it in a bit. There are instructions for this on the different operating systems, and VDI works from anywhere in the world. There's also something called SSHFS, which operates through SSH and makes the data available on your own computer; it's actually what I use more than anything else. Maybe I can quickly demonstrate that.
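Written out as commands, an SSHFS mount looks roughly like this. This is a sketch based on the demo that follows: the username `darstr1` and host `triton.aalto.fi` are the presenter's, so substitute your own, and note that you need working SSH keys and the sshfs/FUSE tools installed on your machine.

```shell
# Mount your Triton work directory on your own machine over SSH.
# Username, hostname, and remote path are examples from the demo;
# replace them with your own.
mkdir triton-work
sshfs darstr1@triton.aalto.fi:/scratch/work/darstr1 triton-work

ls triton-work        # your Triton files, now browsable locally

# Unmount cleanly when done (Linux; on macOS use: umount triton-work)
fusermount -u triton-work
```

As with most Unix tools, no output from `sshfs` means the mount succeeded.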
Let's see, assuming my Triton access is now unblocked. I'll make a new directory, triton-work, and when I list it, I see there's nothing there. Then I run sshfs triton:/scratch/work/darstr1 triton-work. Here I can write just triton: because I have an alias for Triton set up, but in general this would be darstr1@triton.aalto.fi. I run this and nothing happens; well, as we know from before, no message usually means it just worked. Now when I list triton-work, I have all of my files available on my local computer, and I can view them, look at them, edit them, whatever. This is really handy.

This example has some caveats, like Richard said: you need to have your SSH keys set up, and it doesn't necessarily work in every situation; you might want to use the Samba mounts on Windows, for example. But this is just to demonstrate that once you set it up, it's very easy to access the folders remotely.

So now let's get to the good stuff, right after a break. You see some exercises here, which I think we really shouldn't go into in detail, so I propose we take a ten-minute break now, and then we will get to actually running things: using Slurm and harnessing the power of CPUs and GPUs and so on. Let's have a break until ten minutes past the hour; actually, that's thirteen minutes. You can ask some questions here, but please don't spend all your time asking questions; actually take the break. Remember, there is always the HackMD for questions: scroll right to the bottom and ask whatever you like. If there are no objections, we will go there. See you in ten minutes.