So Richard, are we going to talk about the data storage now? Now you're muted. Yes, data storage. Or should we jump straight into interactive jobs? Maybe we should cover data storage, it's quite important. Should we take a look? Yes.

So, data storage is one of the basic things you need when you come to a system. First of course you need the system itself, the machines, and the applications you want to run there, but you will also need your own stuff there: your code, your results, your initial data, things like that. You need to store them somewhere. Like it says here, this isn't very glamorous, but it's an integral part of the whole system that your data is there and that you access it in an optimized way, or the best way possible.

These things are site specific, so different sites have different guides, and I recommend checking your own site's guide, but the basic idea is the same on all of the systems. So let's quickly go through the different types of data, or different places for data, in these kinds of clusters.

First of all, when you connect to a cluster you usually land in your home folder. Let me quickly rearrange this so you can see: the tilde in the prompt tells that, or this pwd command will tell me, that I'm in my home folder when I connect. The home folder is a kind of personal folder where you can put your SSH keys and things like that, which we mentioned before. It's always there, and it's meant mainly for configuration files and similar things. It's basically like the operating system partition of your own computer, where a separate data partition is where you actually keep the data; the home folder is more like the place for your documents, downloads and maybe some small pieces of code.

Then we have the major part of the system, and that is the scratch disk, or the work directory. It depends on the site, but in our case at Aalto we have this fast Lustre system: a parallel file system meant for high throughput and high performance, and that is meant for your calculation data.

So how high a throughput and how high a performance is that compared to, let's say, a modern consumer-grade SSD?

Yes, so in this case we're talking about multiple gigabytes per second, something like 20 gigabytes per second of reading and writing, constantly, across the whole file system. You might think that doesn't sound like much compared to an SSD, but the problem is that when you get into very high capacities, you need all kinds of reliability measures. You need to make certain that if any of the actual physical disks in the system fails, the cluster doesn't go down. So there are all kinds of failover mechanisms: if something breaks, the system shouldn't go down.
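(For reference, a minimal sketch of what those locations typically look like from the login node. The exact paths and variable names are site specific — $WRKDIR and /scratch follow the Triton-style conventions, so check your own cluster's guide.)

    # After logging in, you normally land in your home directory:
    pwd                      # e.g. /home/<username>  -- small, for configs and SSH keys

    # The Lustre scratch / work area is a separate tree; on Triton-style setups
    # your personal work directory is often pointed to by $WRKDIR:
    echo $WRKDIR             # e.g. /scratch/work/<username>  -- big, fast, no backup
    cd "$WRKDIR"

    # Project (group) directories usually live under the same scratch tree:
    ls /scratch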
Similar to that Fastly incident, where apparently they had a single point of failure somewhere in their system that took a lot of the internet down, we want to make certain that if one disk fails, the whole file system doesn't go down. So this file system is designed for big data that is read in big chunks; that is the typical, or optimal, use case for it. The size at Aalto is two petabytes, and we are getting a new file system that is five petabytes and will have even more throughput. But basically it's a very big system that is shared among all of the compute nodes, so among all of the calculations that are done on the system.

So basically you have the home folder, which is mostly just there to make certain that you can log in even if the work file system were down or something like that. That rarely happens, but just in case, you have the home folder for your own personal scripts, your SSH keys and so on. Then you have the scratch disk, and this is usually separated into your own user space and the project directories. The project directories are your group's directories: if you're collaborating on a paper or something like that, you might have a project directory for that collaboration, a place for shared files, while your own folder is for you alone.

Then there are the nodes themselves. Most of our nodes nowadays are diskless, so they don't have any disks; the whole operating system lives in the memory of the node. But some of the systems also have local disks that you can use for fast I/O operations. We'll talk about that especially on the GPU computing side.

At Aalto we also have other places: on your workstation you might have your Aalto home directory, the Teamwork project folders and the archive folders. These are separate from Triton, they are on the Aalto side, but on the login node you can access them. They are not meant for doing calculations on; the shared file system is meant for the calculations, and those are meant for longer-term storage.

In general, you don't need to think that much about the file system itself. You just store data there and use it. But in some cases you might cause problems, because the file system, similarly to the login node, is used by everybody: it's used by the whole cluster, and the cluster is hundreds of nodes. Sometimes certain programs have bad I/O patterns. What I mean is that they might read hundreds of thousands of files, or write hundreds of thousands of files, or something like that, and that can cause problems for the whole file system. You usually want to avoid this, often by asking us what you could do to improve it, but there are also many tips and tricks, many helpful hints, in our documentation about what you could do. You can check how much I/O your jobs are creating and so forth.

So how much I/O is so small that it doesn't matter? At what level would someone have to be before you'd say they should talk to us, or think about it deeply themselves?

That's a very good question.
So I think the main point is: if it's human-readable, it's usually not a problem. If you can read it, the computer can read it as well. The problem comes when you scale it up. A simple example: you have a Python code that reads a thousand files. That by itself is no big deal. But let's say you want to run the same program a hundred times on the cluster; in that case you read the thousand files a hundred times, so suddenly a hundred thousand files get read. So you need to think about it in the context of what you're doing, basically multiplying by the amount of times you're doing the thing. The other problem is when you have individual files that are over a hundred gigabytes or something like that.

Yeah, I sometimes say a thousand files is nothing for the computer, but if you have a thousand files and you read the same thing a thousand times — which is trivial to do with shell scripting and Slurm, as we'll learn — then you start to get into an issue.

Yeah, it's like the story of the chessboard with rice on it: you put one grain of rice on the first square of the chessboard, two grains on the second square, and you keep going and get exponential behavior. These kinds of things can happen very fast in clusters. If you do something multiple times, you get into this kind of exponential, or at least quadratic, curve, where something that wasn't a problem suddenly becomes one. So usually you want to think about whether the thing has a high multiplier (you're going to be doing it a lot of times), or whether a single write or read is really big. If your program generates, say, a terabyte of data every time it runs, you might want to think about whether you actually need all of that data. There isn't one good solution to these problems; there are many small solutions that solve some of them. But usually it's a good idea to ask: am I creating too many small files? That is the biggest issue usually — how can I combine the stuff I have so that I don't have too many small files? And the other thing: how can I make certain that I don't generate too much data that I never actually use? Those are usually the two biggest problems; pretty much everything in between is okay.

But okay, let's actually look at what folders you have available. This is, again, site specific. On Triton we have the home folder; like I said, you have the home folder, and the quota — we have increased it, this slide is a bit out of date — is nowadays 20 gigabytes. The home folder is meant for storing some basic stuff, maybe some programs, and there's a nightly backup of it. Then you have your own work directory: you go to your work directory and you end up here, and it has a 200 gigabyte quota, or one million files. And then there are the project directories for your projects.
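(Since the work quota also counts files — one million of them — it can be worth checking how many files a directory tree really holds, and packing many small files into one archive before heavy use. A rough sketch with standard tools; the directory and file names are just placeholders.)

    # How many files are under this directory tree?
    find "$WRKDIR/my_dataset" -type f | wc -l

    # If it's hundreds of thousands of tiny files, bundle them into one archive
    # so the file system sees one big file instead of many small ones:
    tar -czf "$WRKDIR/my_dataset.tar.gz" -C "$WRKDIR" my_dataset

    # Later you can extract just the piece you need:
    tar -xzf "$WRKDIR/my_dataset.tar.gz" -C /tmp my_dataset/some_file.dat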
About the project directories: you might want to ask the people in your group whether you have a project folder, how stuff is organized in it, and, if you're working with collaborators, whether you should get a project folder together.

That's also a good question. We can create these for you, and the quota is negotiated with us: you basically ask how much you need, we ask why you need it, and if it looks good we set the quota to whatever you need. Yeah — I mean, we basically never say no to quotas, but we make sure you realize what you're doing.

And the important thing about these high-performance systems is that there's no backup there. That doesn't mean the system is fragile. Like I mentioned, there are lots of failovers: we can lose an entire server and the system shouldn't even go down; we can lose a few disks and you won't see the change anywhere. But there's no backup protecting against user failure, and that is the most important thing. If you run rm -rf and it removes all of your files, they're gone and we can't return them. The reason behind this is that taking backups of the amount of data we have would require a lot more storage, and that would create a situation where there's less space for the data itself. It's a bit like rockets: if you need to send a rocket to orbit, that's one thing, but if you need to send it to the moon, you first have to get it to orbit and then you need more fuel to get it to the moon — and the further you want to go, the more fuel you need to add just to carry the fuel to where you want it to be. It's a similar kind of situation: if we had backups, we would need more space for the backups and then we would have less space for the data. You end up in an either-or situation, and on these high-performance systems there's usually no backup.

So it's usually a good idea to use the Aalto-provided Teamwork drives and project drives to store things like initial data sets and important results, do the work on Triton, and then move the important stuff — which is usually much smaller than the temporary stuff — away from Triton.

And I guess you could also say: you can't have one thing that is both large and backed up, but you can have something large and something backed up. Just like with memory inside a computer, once you get to the scale of doing big things you have to manage this hierarchy of different storage locations and realize: this is the important stuff, I'm going to back it up; this is the scratch stuff, it goes on Triton; and some is in both. And then you have to keep it managed, and that's the difficulty of scientific computing — not necessarily executing the program.

Yeah, it's kind of a trifecta: is it fast, is it big, is it secure? You can get all of these if you have huge amounts of money, but usually you don't have that kind of money, so you pick a couple of these and then you have separate systems for the different needs.
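(One habit that follows from this: every now and then copy the results you cannot afford to lose from the unbacked-up scratch to a backed-up location, and leave the reproducible bulk on scratch. A sketch with rsync — the destination below is a placeholder for whatever backed-up project or Teamwork mount your site gives you.)

    # Copy the final, irreplaceable results off scratch to backed-up storage.
    # -a preserves permissions and timestamps, -v prints what is copied.
    rsync -av "$WRKDIR/myproject/final_results/" /path/to/backed-up/project/final_results/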
So, coming back to that: the Aalto project folders are secure and reasonably big, while on the Triton side we don't have the backups. It's still secure in the sense that it's not just going to disappear, but it's more in the category of being big and being fast. So basically we have to make compromises in order to be able to provide the resources we have, and this is very common in any system in the world.

So you need to think your workflow through: what is something you cannot replicate, and what is something you can replicate, it just takes a bit more compute resources to do? If you have initial data, and the coding of your program is labor intensive — it takes lots of time to replicate — you want that to be in git somewhere secure; keep it hidden, keep it safe. Then there's your compute stuff: once you have the code and the initial data, you can recreate your temporary results for your project. Then you analyze those results, and usually that analysis is also labor intensive, so you want to keep the final results secure and safe so that you can give them to, say, publications if they're needed.

Yeah, so what about quotas? How do you figure out how much space you have available?

That's a good question. If you run this quota command on the login node, you will see this kind of output, depending on how many groups you belong to. I belong to quite a many groups, so... is there a problem here? Oh, there we go; I think the problem was that it was trying to find all of the groups. Basically you can see that for me there's the home quota, which is 20 gigabytes, then there's my user quota in scratch — for me it's a bit larger, because I need to install all kinds of temporary stuff for solving problems — and then you might have these group quotas. In the future this will change a bit, because we are getting a new system and there we will have so-called project quotas, but we'll inform you about that. Basically you should know that there's a difference between stuff that you own and stuff that the project owns.

There's also the thing that if you leave the university, we can't look at your personal data — and we won't look at it — so the home folder and your user folder will at some point be removed, and if you want them removed, we will remove them; that's your data. But the data in the project folders is owned by the group, or the principal investigator who owns the group, so your professor or someone like that. So if you're, say, a summer student and you want your stuff to be available to the other group members once you leave, you want to store it in the project folder or a place like that, because otherwise it's gone — they get nothing. So it's a good idea to separate what you're working on into the appropriate folders.

Yeah. So if we scroll down on the HackMD, there's a question about data availability: if you want to access the same data in other places, how does it work? I know it used to be that, traditionally, if you needed to access data somewhere else you had to make copies of it — you'd copy it to the cluster, work, copy it back, copy it to some other storage place. Are things any better now?
And do we still need to do things like that?

Yeah, so nowadays — think about what git provides for you: it keeps track of changes in the data, and it deals with things basically row by row in the files. But when you're dealing with big data, big numerical data and such, you can easily get into a situation where you have .bak files lying around every which way, because you're copying data around, and it can get really complicated. So usually it's a good idea to use some of the already existing mounts to access the original data, so that you don't need multiple copies of the data in different places confusing you and making life harder for you.

On many systems — for example in the VDI — the work directories are mounted under a directory; on various department shell servers, basically these proxy hosts, the directories are mounted in different places. If you're working over VPN or from your workstation, you can use the Samba mounts: you can mount these directories directly and work there, so you basically access them with a file browser. There are various options here. If you're using Linux and you're not using Samba or the VPN, you can use sshfs to basically get file-system access to the folder — it's a bit more laborious usually. And of course you can use tools like sftp to copy stuff from one folder to another.

There was a question earlier: on Aalto-managed computers there are automatically synced Documents and Desktop directories. Do we have anything like that, and would you recommend these kinds of automatic syncing things for this kind of work?

So, do you mean automatically syncing all of your data to your laptop or something like that? Yeah. Well, usually it gets into this kind of situation: if you're doing something serious, you don't want to waste your time on that kind of thing, because — this is my personal experience at least — it takes a lot of bandwidth to transfer that data. Usually the information you actually need is much smaller than the raw data. I would recommend working towards a situation where you do as much of the data processing as possible on Triton, or the cluster itself, because then you only have to transfer, let's say, averages over the whole system, or only some time series or something like that — we're talking on the order of megabytes, maybe — and that you can transfer to your own computer and then visualize and work on. I would highly recommend not creating some sort of elaborate rsync mechanism where you sync stuff back and forth, because at some point you end up constantly copying stuff around that you don't even need. That's not usually a good idea.
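(A sketch of that recommended pattern — do the heavy reduction on the cluster and copy back only the small end product. 'triton' here stands for an SSH host alias, and the script and file names are made up for illustration.)

    # On the cluster: reduce the big raw data down to a small summary file.
    cd "$WRKDIR/myproject"
    python compute_averages.py > averages.csv    # a few megabytes, not gigabytes

    # On your own machine: fetch only the summary, never the raw data.
    scp triton:/scratch/work/<username>/myproject/averages.csv .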
There are some tools, like git Large File Storage, that support this kind of thing — they make certain that when you're syncing stuff the copies stay consistent — but usually the easiest solution is just to ask: can I adapt my workflow so that I can run my stuff on, let's say, Triton, so that I don't have to do the copying at all? For example, in one project I was doing I had to do plotting — matplotlib plotting — and I just coded on my computer, made a git repository for my code, committed it to a centralized repository, pulled the repository on Triton, ran the plotting command there, and it did the plotting there. Then I just copied the JPEGs back. It's maybe three commands more, but you don't have to think about things like: did I remember to copy that 100 gigabytes of data to my own computer, is it up to date? It's usually much easier to adapt your workflow to the system than to fight the system.

Yeah, I tend to recommend against doing too much automatic data syncing, because if you have small things, a few gigabytes, sure, you sync it everywhere; but as your work grows and grows, things get really complex and it eventually comes back to bite you.

Let's see, so what's next? We have this remote access section — did we talk about that?

That's a good question, but I just have to answer this first, because this is pretty funny: "What is the best way to copy files to a cluster directory? For example this copy_work script." So, this copy_work script isn't actually how you should copy stuff to Triton — that is me testing out how we are going to copy all of our currently existing work directories to the new file system. So that's copying around one petabyte of data, the whole work directory, not just my work. I don't have a script like that that I use for syncing my own stuff. Basically, we are going to be doing this massive copying operation in the future, a thousand times bigger than what most people will ever do. It will take multiple days, about a week, to do the copying even with the full file system — we're going to be running it on multiple nodes.

But yeah — personally, I could throw my laptop into a lake and I would still have all my data somewhere, because I don't want it to exist only on my laptop: I will lose it, or the disk might fail, and I don't want to think about that. So the important stuff is in repositories and the other stuff is in file systems that I know about. I don't want to wonder whether I stored my important data on a drive that's in a cupboard somewhere — that's way too much work. It's much easier to keep the data in secure places where you know where it's going to be, and then just access the individual pieces of that data you need.

So, with the remaining time we have, I propose we talk about these remote mounts, either SMB or SSHFS, explain what they do in general, and maybe we can give people a little bit of time to try this themselves.
Yeah — try doing some sort of copying operation to Triton, just to get the hang of one way of getting data there. Personally, I basically always use simple copy-paste if I just need to copy a few lines of a script; otherwise I use git, or I use scp. But I rarely copy anything out of Triton, because it's the best place for the work data. Of course, your mileage may vary — try different tools and see what seems most useful for you.

And if I do need to access something from Triton from outside, instead of copying it out I'll usually do this remote mount as a network drive. That way it stays on Triton, but I can access it from outside. So did we already talk about this? I don't think we did — about the remote mounting, either SMB or SSHFS.

I mentioned it, but I'm not really going through all of the steps, because all of these are complementary. You don't need to know every one of these; you just need to figure out one way of getting your data around. Personally, I nowadays don't use a file browser anymore — I'm just too much into the command line, so I have forgotten a lot of it — but I know that many people like to use IDEs, open a file there, go to a folder and so on, and that's fine. It's a way of working with the system, and you should work with whatever way feels best for you. So if you like working with file windows and that sort of thing, you might want to look into the Samba mounting, because that's a similar kind of user experience. If you're working on the command line, then rsync and scp are probably best. If you're somewhere in between, SSHFS might be suitable. And if you feel like you don't want to do the copying at all, you want to work closer to the file system, then I recommend checking out the VDI and Jupyter and using those to do the plotting and the running, because there you already have the mounts. Basically: find one way that works for your workflow. You don't need to use all of these tools; they are not required.

Someone's written a question: can I mount a network drive on the cluster itself?
So, mounting network drives on the cluster itself is sometimes possible on some clusters, but perhaps not recommended, because you don't have administrator access there. SSHFS might work as a user, and you can mount something from outside, but you might not be able to unmount it, or there might be other problems. So usually it goes the other way: you mount data from the cluster onto your local computer for working. Of course, when you need to connect to remote computers, that becomes a little bit more difficult.

I'll make a proposal here: do an exercise, maybe together or as a type-along, then go to a break, and then we do the Slurm intro — or maybe the Slurm intro before the break. We've found that data transferring tends to be a very difficult thing, or at least more complex than you might expect, so I would propose that we give maybe 15 or 20 minutes for self-work exercises.

If we take, say, 15 minutes, then we could return to the stream and discuss what's happening in the HackMD, if there are interesting questions there.

Can we give one demonstration of either using rsync or scp? Maybe we demonstrate rsync or scp once, and then demonstrate SSHFS to show the concept of remote mounting, and then we're good and can do a break. I think the demos would be important.

Yeah. In my case, I'm not on the Aalto VPN right now, so let's say I want to copy — I'll create this example.txt that says something inside it — and I want to copy it to Triton. Because I'm not on the VPN, I now have to go through a jump host. I can either copy the file first to the jump host and then copy it to Triton — so, like I showed previously, I can make the connection with ssh kosh and then ssh triton — or I could use the department shell server. I'm in the CS department, so this is CS-department specific; other departments have their own. I might go to this server and see whether the work directory is mounted here... okay, it's broken currently, so this is not a good example. Okay, so I'll use scp to copy example.txt to kosh. And when doing tab completion in these directories, please be careful that we don't tab-complete into other people's user names. So I just copied it into my home folder: now if I look at kosh, I have the example there, and I can scp this example.txt onwards to triton.aalto.fi.

You can already see that this is pretty laborious — this might be the worst way; I wouldn't use this to copy stuff. You can make the process a bit easier by writing an SSH config that automatically does the jumping for you.

Yeah, I would highly recommend that kind of thing.

That is what I do. Let's say I modify the file — I have specified this automatic jumping through my SSH config. There are instructions on how to do this on the scicomp.aalto.fi pages, by the way, on the SSH configuration page. So now I just scp the example to Triton directly, it goes through the same route, and if I go there, I can see that the file has changed.
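(Roughly what that configuration looks like. Host names follow the demo — kosh as the department jump host, Triton as the target — but check the exact settings against the SSH instructions on scicomp.aalto.fi before copying this.)

    # ~/.ssh/config on your own machine -- a sketch, adjust names for your setup
    Host kosh
        HostName kosh.aalto.fi
        User <aalto-username>

    Host triton
        HostName triton.aalto.fi
        User <aalto-username>
        ProxyJump kosh          # hop through the shell server automatically

    # After this, one command copies straight through the jump host:
    #   scp example.txt triton:/scratch/work/<aalto-username>/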
So this is a simple example of a copying operation — something that I use all the time. But like I said, you don't need to copy what I do; you need to find the way that you want to work. It depends on your software, the way you work, things like that. We'll try to figure this out, especially now if we have the exercise session of trying to copy stuff.

Let's do that right now. And then, can you demonstrate SSHFS, or do you need Triton for that?

Yeah, let's try. I can also demonstrate it, if you'd like. — Yeah, if you want to take the share, because I don't use it that often, so I would probably mess up the commands.

Let's see, so here we are. This is my computer — this is Linux — and I have some directories here; I've made one called triton. And I have my SSH set up so that I can just do ssh triton and get there without any password, and it has the jump host configured. Let's see, does this work? Okay, this does not work, because only "triton" is the alias. So now, if we use sshfs, the path I want to mount is on triton — let's just do the whole thing; you can give a path either relative to your home directory or an absolute path — and I want to mount it in this triton directory. If I push enter, it goes. And if I list the triton mount point, I see what is in my work directory on Triton. And then if I want to unmount it, this should do it. And there we go.

So this works in Linux, and there are similar things on Mac — I think I've seen it on Windows before as well — but on other operating systems you might use the SMB mounting instead, which is built in and is similar, but you need to be on the VPN to make it work.

Okay, so what should we do now? You're muted. Should we have 10 minutes of exercises, people trying to solve this and figuring out their own way? Yeah. And then should we resume on the hour? Yeah — so, 10 minutes of playing with the remote mounting, and then 10 minutes of break. And if this ends up taking too long, please let us know and we'll extend the break. Yeah, at ten to three we'll discuss what's in the HackMD and what's happening.

Okay, so let's see, where's the HackMD — let's write our notes: exercises. What exactly should we ask people to accomplish? Try number two? Yeah — well, if you can do number two, that's good. You can try looking at number one, but there are a lot of advanced things there which are basically not relevant for this course; they might be useful later, but just not worth our time right now. Does that sound good? Yeah, sounds good. Okay, so see you in 19 minutes.
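(For reference, the mount demo above boils down to roughly these commands on Linux, assuming sshfs is installed and "triton" is the SSH alias from the config sketch above; the paths are placeholders.)

    mkdir -p ~/triton                                   # empty local mount point
    sshfs triton:/scratch/work/<username> ~/triton      # mount the remote directory locally
    ls ~/triton                                         # now shows the remote work directory
    fusermount -u ~/triton                              # unmount when done (Linux; use umount on macOS)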
Oh, welcome back. Let's see what we've got now. Do we have any further follow-up comments on the data storage?

Maybe the most important thing I would say is: yes, this really is complicated. Is there any reason that data storage is so hard to do, with so many different things? Well, maybe the reason is that it's just intrinsically complex — there are so many different options here. So we hope that you were able to do something, but we realize this is only a starting point.

Yeah, I'll quickly mention that if you think about any system nowadays — for example this setup we currently have running, where we have Zoom and we have Twitch and we have OBS and so on — there are many, many different places where the data needs to move in order to get things working. There need to be good pipes, basically, where the data flows from one place to another, and like we saw during the break, when stuff breaks on the internet, when the pipes break, everybody notices it. Setting up these kinds of setups is a lot of work, and very important work, because stuff breaks.

So you should really think, even at a small scale when you're working by yourself, about your workflow: how do I get the stuff where it needs to be, and how do I get it done efficiently? It's usually a good idea to check what your workflow is, what you want, how you want to work, because unfortunately there's no one good solution for everybody. Different people have different workflows and want different things, and many of these are very different, but it's usually a good idea to think about what kind of data transfer you want to do and how you want to work with the data.

One good analogy I can give: if you go on a holiday and you want to take a book with you, you don't bring the whole bookshelf; you bring the book that you want to read, or a few books if you don't know what to read. Similarly, when you're doing data analysis, you don't usually need the whole data set you have generated — say 100 gigabytes — in order to do the analysis. You can take a sample of it, or a subset of a few time steps or something like that, to give you something to work with on your laptop, which might not have the resources that the cluster has. You can fiddle around with that — it's a representation of the overall data — and then, when you want to do the serious data analysis, you can do it on the cluster side. You don't need to work with the whole thing at one time.

So it's usually a good idea to think about your workflow and what pieces you really need. Do you really need the whole bookshelf on your laptop? Do you need everything to be everywhere all the time? That is a good thing to think about when you're working with these kinds of systems that are spread out, and if you think about what you actually need, it usually becomes much easier to manage, and you don't have to think about it anymore — it's much clearer in your head where your data is. Yeah.
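(A small illustration of the "one book, not the bookshelf" idea: pull only a slice of the data to your laptop for prototyping. File names are placeholders.)

    # On the cluster: take a small sample of a huge result file (header + 100k rows).
    head -n 100001 "$WRKDIR/myproject/huge_results.csv" > ~/sample.csv

    # Then, from your own machine, copy over only the small sample:
    #   scp triton:sample.csv ./local_analysis/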