We come to data storage. This is a pretty big topic. I think I've made jokes that the real problem with computing is data, not computing, which may not be quite true for every field, but quite often it is. So would you summarize? There was a good question in the HackMD earlier: they have lots of data that they want to process with Matlab; where should they store it?

Well, I would first start by asking what kind of data. There are several categories. One is original raw data that's irreplaceable. Usually you would not have your only copy of that on the cluster; it will be in some secure storage system. And by secure, we mean not on your own laptop's hard drive. For example, there are these project file systems that are network drives, and they're backed up and replicated to another data center in another city, all that kind of stuff. If you put it there, it's pretty secure. And then you have a copy of the original data on Triton.

Once you go to Triton, you have the scratch file system, and this, like it says, is for scratch data. Unlike on some clusters, our scratch data is not automatically deleted, but it's not backed up, not snapshotted, not replicated or anything.

I'll quickly add what that means. It's not replicated: it doesn't have a backup to protect against human error on your side. If you remove files there, the files are gone, because it's such a big system that it would not be economically feasible to keep it under constant backup. But it's very reliable. We have paid a lot of money for the system so that it should not break: it's not just one hard drive, it's a full shelf of hard drives with lots of redundant servers and redundant systems that make certain the system is very reliable.
But what Richard is saying is that you should not put all of your thesis data there: if it's four years of work, the only place it lives should not be the scratch file system. We could say it's safer than your laptop's hard drive, but less safe than, say, the project storage area. It's safer because scratch is not going to get lost: it has hard-disk-level redundancy, it's RAID 6. But neither of them saves you from deleting the data yourself.

Okay. And then there might be data that's just yours, or data that's shared. You have things like your home directory, which is small and meant for configuration files. Maybe we can go to... well, here's some... yeah.

I would quickly summarize: if you go onto Triton, or any cluster at all, there's a fast work drive that is shared among the compute nodes. In our case it's this scratch work drive. There you have a personal work folder where you can put your big stuff, talking about a gigabyte or more, up to gigabytes of stuff depending on your quota; there's usually a quota there. So you can put whatever stuff you have in your work directory. And then there are usually project-level directories, at least in our system, where you have a separate quota. Basically, those are for data shared within a research group. They're better suited for data that belongs to multiple people, or projects that need to be worked on by multiple people, so you don't have to replicate the data into everybody's personal work folders. So you have a personal folder, but then you usually also have a group folder. And everything like this lives on the fast and big network drive that is shared among the compute nodes.

Yeah. There's also the section about thinking about I/O before you start. Oftentimes when programs run slow, it's not because of the computer's processor, but because of the speed of getting data into the processor.
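The layering just described (backed-up project storage for irreplaceable originals, fast but unprotected scratch for working copies) can be sketched with a few shell commands. This is a minimal sketch: throwaway directories stand in for the real mount points, since the actual paths on any given cluster are an assumption.

```shell
#!/bin/sh
# Sketch of the storage layers discussed above. Throwaway directories
# stand in for the real (cluster-specific) mount points.
set -eu

project=$(mktemp -d)   # stands in for backed-up project storage
scratch=$(mktemp -d)   # stands in for fast, non-backed-up scratch

# Original, irreplaceable data lives on the backed-up storage...
echo "raw measurement" > "$project/raw.dat"

# ...and you compute against a *copy* on scratch, so that losing
# scratch (or an accidental rm) never loses the original.
cp -p "$project/raw.dat" "$scratch/raw.dat"

cat "$scratch/raw.dat"
```

The point of the copy step is exactly the safety argument above: scratch can be fast and huge precisely because it is allowed to be expendable.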
And that could be because of where the data is stored, or how it's formatted, and so on. It could be because of limited capacity between the storage system and the processor. But you can read about that. So now we finally get to the summary of data storage options. Let's see. We've been hinting at some of these already, but maybe we can go over them. This is the example from Triton.

Yeah. Usually, in a cluster environment, you have a home folder, which is the place where you store, let's say, the SSH keys that allow you fast access to the cluster, and stuff like that: things that need to be secure. Keep it hidden, keep it safe. Configuration and keys go in the home directory. Let's say you want to download a Docker image and convert it into a Singularity image: you need an access key, and you store that in the home folder. All kinds of API keys and such, if you need to access web services, are usually stored in the home folder.

Then you have the work folder, which is where the meat of the data is, where the actual data is stored. You have a personal work folder, and then there are group work folders. On CSC's machines they have these different project accounts, and each project account has its own work folders. In our system, a user has their own work folder and different projects have their own work folders.

Yeah. And if data is stored in your personal work folder, no one else can access it, which is not great, because you're probably going to leave your current place at some point, and will your group be able to continue with what you're doing? So it's better to go straight to the next step, which is the scratch or shared group folder. Yeah.
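Since home is meant for keys and configuration, keeping it locked down matters. A minimal sketch of the point, using a throwaway directory as a stand-in for your real `$HOME` (the key file name is just a placeholder):

```shell
#!/bin/sh
# Keys and config belong in the small, backed-up home directory,
# with permissions that keep them private. A throwaway directory
# stands in for $HOME here; the key file name is a placeholder.
set -eu

home=$(mktemp -d)
mkdir -p "$home/.ssh"
: > "$home/.ssh/id_ed25519"        # placeholder private key file

chmod 700 "$home/.ssh"             # only you may enter the directory
chmod 600 "$home/.ssh/id_ed25519"  # only you may read the key

stat -c '%a' "$home/.ssh/id_ed25519"
```

SSH itself will refuse keys that are group- or world-readable, which is one concrete reason these files live in home rather than on a shared work area.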
So there's some historical naming that makes this maybe less clear than it's supposed to be. But basically, your work folder and your home folder are something we will not give anyone other than you access to. That's your stuff, and if you leave the project or leave the university, that data will be gone after you've left. Your collaborators, or anybody else, cannot access it. That's why, if you're working in a project with other researchers, ask us to create a project for your work, and we will create a project with a quota that your data fits in.

Yeah. Then come the local temporary disks on every computer. How do those work?

Yeah, like Richard mentioned, sometimes data transfer becomes the bottleneck. If your code creates a lot of intermediate files, it needs temporary disk. Many of our nodes have either a temporary disk in memory or a separate physical temporary disk, especially the GPU nodes, because in the GPU world you need a lot of data and you usually want it as close to the GPU as possible. So you want it on the local drive. We have fast SSDs in the nodes themselves so that you can use them for this fast, temporary data, just for the job. That is quite common on all clusters: the nodes themselves have some temp folder that you can use for data you're willing to lose after the job has finished. But by then you have copied the data, meaning the actual results, back to your work folder.

Yeah. Then there are a few special cases here. For example, persistent data on individual nodes, which is basically only on certain groups' servers, or RAM FS, which looks like a storage place but is actually in memory.
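The pattern just described (stage input onto the node-local disk, compute against the local copy, copy only the results back before the job ends) might look like the following inside a job script. Using `${TMPDIR:-/tmp}` as the local disk is an assumption; the actual path varies by cluster, and on some systems the scheduler sets a per-job directory for you.

```shell
#!/bin/sh
# Hedged sketch of the node-local temporary disk pattern.
# A throwaway directory stands in for your scratch work folder.
set -eu

workdir=$(mktemp -d)                 # stands in for /scratch work folder
echo "input data" > "$workdir/input.dat"

# 1. Stage input onto the fast node-local disk for the job's duration.
local=$(mktemp -d -p "${TMPDIR:-/tmp}")
cp "$workdir/input.dat" "$local/"

# 2. Do the I/O-heavy work against the local copy (uppercasing
#    stands in for the real computation).
tr 'a-z' 'A-Z' < "$local/input.dat" > "$local/output.dat"

# 3. Copy only the results back before the job ends; the node-local
#    disk is wiped once the job finishes.
cp "$local/output.dat" "$workdir/"
cat "$workdir/output.dat"
```

The design choice is that steps 1 and 3 each touch the shared network filesystem once, in bulk, while all the small, repeated I/O in step 2 hits only the fast local SSD.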
Yeah, these are very specific, and only relevant if you have, let's say, a local database that you want to access fast, or something like that. And if we go down, we see a more detailed description of all of these. There are quotas, which are limits on the amount of space and the number of files; the quota command will show what your current status is. And then there are some exercises if you're interested.

I think maybe the most important part of this whole lesson is that there's this memory hierarchy. It would be incredibly expensive to have something as large as the scratch system also backed up like the home directories. So once you get to this large, professional scale, you basically have to have these different storage tiers, and you have to be able to move your data across the different tiers as necessary. And it all continues up the memory hierarchy: at the top of the tiers are the main memory of the computer, the CPU cache, the CPU registers, and so on.

Okay. Yeah, I would quickly add to that. In simple terms, I would say: try to differentiate what is important and what is not. We talked about this a bit yesterday, but try to find the important original data, put that into a system that is backed up, then put your working data into the folder where you're working, and you're good to go. And make certain that you know which data is which: put labels on it. Yeah. We could talk about data management forever, but let's go on and finish.
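One footnote on the quota point above: quotas limit two separate things, total space and total number of files. The cluster's `quota` command summarizes both per filesystem (its exact output format is cluster-specific), but plain `du` and `find` report the same two numbers for any directory you point them at:

```shell
#!/bin/sh
# Quotas limit both space and file count. Without the cluster's
# `quota` command, find and wc give the same two numbers for any
# directory; a throwaway directory with two small files is used here.
set -eu

dir=$(mktemp -d)
printf 'x'  > "$dir/a"   # 1 byte
printf 'yz' > "$dir/b"   # 2 bytes

files=$(find "$dir" -type f | wc -l)   # number of files
bytes=$(cat "$dir"/* | wc -c)          # total data size in bytes

echo "files=$files bytes=$bytes"
```

The file-count limit surprises people more often than the space limit: millions of tiny files can blow a quota while using almost no space.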