Now we've got three more little things, and they're really more like discussions. Basically, all of the hands-on is done for the day. We have this extra discussion, and then tomorrow we arrive and get more hands-on with the scaling topics: array jobs, parallel jobs, and GPUs. But for the meantime, we have these three things: software modules, data storage, and remote access to data. It used to be that we would talk about these first thing in the day, so we'd end up spending basically half the time just talking about software and data, which are really important, but are also really unique to each person. So now we instead give a quick summary, and you can read the details yourself, and we think that gives a better outcome overall. So, should we begin? Software modules. Simo, I see your screen is shared, so let's go there. You saw me try to use the module command before, when I did `module load lammps`. So Simo, what is module, and why?

Yeah. So, as I tried to explain yesterday, with maybe limited success, there is so much software out there that everybody basically has their own set of programs they want to run. So we have these software modules that we install software into, and you can load those modules. The software isn't installed on the machines themselves; it's installed somewhere else, but you can load it. For example, here's a type-along you can try yourself: where is MATLAB? If we look for MATLAB, there's no MATLAB, because there's no single version of MATLAB. There are multiple versions, and not everybody wants to have MATLAB available constantly.
So if we run `module spider matlab` (the `module spider` command is what Richard was running earlier), you can see that there are multiple versions of MATLAB available, something like 2012 through 2019, because some software doesn't work with newer versions and some needs the newest version. So we have options for people. Then you can load MATLAB, and after you `module load` something, you'll find that there is suddenly a `matlab` command available; the `which` command tells you where that `matlab` command lives. There's lots of software, and it's provided by this module system. Basically, if you run `module spider` with the name of the software, you might find it, or you might not, in which case you might want to ask us to install it for you. There's a huge amount of software, and we cannot go through all of it, because there's so much. The applications section of the documentation lists many of the programs, and your site might have its own applications installed. But basically, there's lots of software already present in the system, installed for everybody, because everybody uses it. So yeah, like Rich said, there's no point going through all of this, because it's so unique: it depends completely on your workflow. Some people compile their own code, some people use MATLAB, some people use Python, and you will find your own modules and your own things to figure out. The main idea is to know that such a thing exists: software is available via modules, and you can get it loaded into your environment. Yeah, I think that's basically the main point here. You can find a quick reference of the different commands down below. If you want to know how the sausage is made, that's also documented there, but you don't necessarily need to look through it. Maybe we can go on.
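To make the type-along concrete, the module commands discussed above look roughly like this. This is a sketch: the exact module names and versions (`matlab/r2019b` here) are hypothetical and depend on your cluster.

```shell
# Before loading: matlab is not on the PATH.
which matlab              # typically prints nothing, or "no matlab in ..."

# Search every module tree for anything called "matlab".
module spider matlab

# Load one specific version (version string is made up for illustration).
module load matlab/r2019b

# Now the command resolves to the module's install location.
which matlab

# See what is currently loaded, and clean up when done.
module list
module purge
```

On Lmod-based systems, `module avail` shows what is loadable right now, while `module spider` searches everything, including modules hidden behind other modules.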
So I would say let's go on, and we'll answer all the Q&A about all of these at the end. The next one is data storage. Yes, this is a big one.

Yeah. So each site has its own data storage, but if you remember from the talk, a cluster usually has some data storage attached to it. Here we have a Lustre parallel file system, and Helsinki University also has a Lustre parallel file system. I'm not completely familiar with the other sites, but nevertheless, there is a data storage to use while your jobs are running. You can access it through the work directory environment variable, which should be set, and go into your work folder. I'll use `cd -` to go back. So basically, there is a place for your data, but it's not the only place for data.

Yeah. Can you tell me why we don't have one place for all data? Why do we need different types of places for different types of data?

Yeah. So the idea is that you cannot have a single file system that is backed up, large, and fast all at the same time without it costing a lot of money. Usually you have to make compromises, and these cluster file systems are usually large and fast, but they're not backed up. So if you have finished products, a finished article, or important experimental data that you don't want to lose, those should live in a different file system. You can always keep a copy of the data on the cluster file system for doing analysis, but the cluster file system usually isn't meant for long-term storage. It's a work directory. It's not a gallery where you show your paintings; it's the painter's studio, where everything can be messy and in progress, not yet behind the plexiglass in the gallery.
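In shell terms, visiting the work directory as described above is just this (assuming your cluster sets a `$WRKDIR`-style environment variable, as Triton does; the example path is made up):

```shell
# The cluster sets $WRKDIR to your personal work directory
# on the parallel (Lustre) file system.
echo "$WRKDIR"        # e.g. /scratch/work/yourusername  (site-specific)

# Go there, stage input data, run jobs...
cd "$WRKDIR"

# ..."cd -" jumps back to wherever you were before.
cd -
```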
So it's a place where you work. At different universities there are different places, a whole bunch of places where you can store your data: some have different sizes, some have different speeds, some have different backup schedules. You should determine how important your data is and decide where you want to store it.

Yeah. So each site should have some sort of table that describes all of this, which I think is down below here. Yeah, here we go: available data storage places. This shows the name, the path, the size, whether it's backed up, where it's available, and so on. There's also a hierarchy to it. Well, actually, the simple hierarchy isn't quite correct, because they all have different speeds too. For example, the largest ones are available everywhere and are really fast, but not backed up, while things like your home directory are really small, but backed up.

Yeah. So it's usually a good idea to separate your data conceptually; there's documentation here in the wiki, a whole chapter on data management policies. You might have a situation where you have some original data, say important data that actual experiments had to be run to obtain. That might be stored on a backed-up system, even though it might be tens or hundreds of gigabytes. Then you might have a copy of that data in the work directory, plus additional modified versions of the data as you do experiments and so on, maybe ending up with something like a megabyte of derived results in the scratch work directory.
At the same time, you might have the code that is used to analyze the data, and the size of that code might be a few megabytes. It's very important to keep it in the correct system, because that code is basically your thoughts made text. It should be in some version control system; that is the most important thing, because that's what you spend your time on. The data just sits there until the code comes along and does something with it. So the code should be in a version control system. It's a good idea to think about each part of your data in terms of how much time you would need to recreate it from scratch. Some experiments you cannot replicate without actually redoing the experiments themselves. But some data analyses you can redo in, say, a week. If you lose a week's worth of work, that's bad, but it's small peanuts compared to rewriting a piece of analysis code, which might take you months or even years. So it's a good idea to separate these things: store your code in a version control system, your important archival and original data in some sort of system with a backup, and everything else in a work file system where everything can be messy.

Okay, there's a good question from HackMD: when do you use your home directory and when do you use your work directory in practice? Practically speaking, is it based on size, or the type of data, or what? I personally use the home directory mainly for configuration files, because many programs point to the home directory by default for storing configuration files.
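Coming back to the version-control point from a moment ago: putting your analysis code under git can be as small as this. The project and file names are made up for illustration, and the identity flags are inlined only so the sketch works on a fresh account.

```shell
# Start tracking the analysis code in git.
mkdir -p my-analysis && cd my-analysis
git init

# The code is the expensive part: commit it early and often.
echo 'print("analysis goes here")' > analyze.py
git add analyze.py
git -c user.name="Example User" -c user.email="user@example.com" \
    commit -m "Add first version of analysis script"

# Push to a remote (GitHub, GitLab, a university-hosted instance, ...)
# so the history survives even if the cluster copy is lost:
#   git remote add origin <your-remote-url>
#   git push -u origin main
```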
So the home directory is usually best for those, because if there's some problem with the file system and the home folder is missing, you usually cannot log in anywhere and nothing works. That's why it's a separate system: in the case where the big file system is down, which really does happen, you still have access to the rest of the system. So yeah, the home folder is mainly for configuration files, and that's why it's so small.

Okay, what else is there about data? I guess the main summary here is: really think about your data storage, and read what's on this page. Yeah, and especially if you're going to be doing data-intensive work like deep learning, you really should take the data into account, and where the data comes from, because data loading can become a real bottleneck if the data isn't loaded fast enough. We'll talk about this a bit more when we get to GPU jobs. But yeah, data should be where you're doing the computation. That's why we have the big disks in the system: the data should be accessible to the machine itself that does the computation, and that's why we have this kind of work directory.

Okay, so next up is remote access to data. This is important because not only are you doing stuff on the cluster, you also usually need to move data back and forth. It might be moving a big batch of data to Triton once, when you first start doing your work, and if you're doing it just once, it doesn't matter how hard it is. But there might also be things like visualizing the data, where many times a day you're looking at figures, or running another script that visualizes things live, and so on.
And for this, we have two main options. One is transferring data: running a program that moves it and makes a copy on the cluster or on your own computer. The other option is remote mounting; "mount" is the term for making some data or storage device available. So you can take your own laptop (on Aalto desktop computers this is done automatically), run some commands, and then the data on Triton is available right there on your laptop, no copies needed. You open a file, your computer goes and gets it from Triton and gives it to your program; if you modify it, the modification happens on Triton, and so on. This is really convenient for small, quick things like looking at figures, but it's not great for efficiency.

Yeah. So again, consider the kind of workflow where you put code into the system, maybe some big data as well, it goes through the computation, and then you want to see the output. Is the output the exact same data you put in? Most likely not. When you're doing some analysis, the end result can usually be described compactly: did the physics code reach a minimum energy state, did I get a small number here, what was the final number? Or you get a graph, or a text file that describes how the system behaved: some curve is going down, some curve is going up. Usually it can be described by a few numbers, because that's what you'll probably put into the article anyway. You cannot put a 10-gigabyte data file into an article that needs to fit into a few pages of PDF, so you need some sort of visualization for it anyway.
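To make those two options concrete, here is what each might look like from your own laptop. This is a sketch: the hostname `triton.example.org` and the paths are placeholders, so check your cluster's documentation for the real ones.

```shell
# Option 1: transfer a copy with rsync over SSH.
# Repeated runs only re-send files that changed.
rsync -avz triton.example.org:/scratch/work/user/results/ ./results/

# Option 2: remote-mount a directory with sshfs.
# Files then appear locally, but every access goes over the network,
# so keep this for small, quick things like looking at figures.
mkdir -p ~/triton-mount
sshfs triton.example.org:/scratch/work/user ~/triton-mount
ls ~/triton-mount

# Unmount when done (fusermount -u on Linux, umount on macOS).
fusermount -u ~/triton-mount
```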
So in many cases it's useful to run some visualization, maybe on the cluster, or some sort of data reduction: you calculate a mean, or a standard deviation, or you do some statistical analysis to reduce the complexity of the data so that you can visualize it easily. Then, when you do the work from home or from afar, you don't have to transfer that much data; you only need to transfer the small amount that describes what the analysis found.

Yeah. So the downside with mounting is that it's not very efficient. Basically, if you're accessing a big file, then every time you access it, it's being transferred over the network. So: small things like plots, sure; big things like gigabyte-scale data files, move them to the cluster and access them there. It's like the cluster is a restaurant where things have been cooked and things are happening, and somebody's eating their spaghetti at a table. If you ask them through a window or a doorway whether the spaghetti is good, and they answer that yeah, it's pretty good, then you know things are going fine. But if you ask them to bring you the spaghetti out through the window, it gets complicated really fast. If you need to inspect everything yourself, instead of just looking at the end products, transferring the whole data set through the mount gets painful; it's usually better to just look at the end products.

So with that being said, have we answered the three things here? At least I think we've provided enough for you to read yourselves and ask more questions if needed. Yeah, I'll quickly mention that there are other tools we are working on, like Jupyter.
We have a Jupyter system at Aalto, and we're also working on enabling the Open OnDemand system, so that you can have a somewhat more usable interface; there are many workflows you can use to get work done. But the main thing is that usually you write some code, you submit it into the queue, and the code does what you have told it to do, running there in the background. You go for a coffee break, you come back, the code is done, you transfer maybe one CSV file or something that describes your output, and then you visualize it on your laptop, happy that you managed to do all this without your laptop burning up. That's how people usually do it, but we cannot tell you the one workflow that works for everybody, because there are so many different kinds of problems that people are solving.
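That usual workflow can be sketched end to end. Everything here is hypothetical: `run_analysis.sh`, `results.csv`, and the hostname are illustrative names, and your site's Slurm setup may differ.

```shell
# On the cluster: submit the batch job and let it run in the background.
sbatch run_analysis.sh        # a Slurm script that eventually writes results.csv
squeue -u "$USER"             # check on it after the coffee break

# On your laptop: fetch only the small summary output...
rsync -avz triton.example.org:results.csv .

# ...then plot results.csv locally with whatever tool you like,
# without your laptop doing any of the heavy computation.
```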