So we decided to rearrange things, and we're going straight to this talk on asking for help with supercomputers.

Yeah, basically we noticed that we have had this discussion of how you ask admins for applications and such, and how you discuss with the admins the problems you're having, and this flows really nicely into Radovan's topic, which is how you ask for help with supercomputers, because having this kind of discussion is a way of getting the issue solved. It's very important.

Yeah, so Radovan, are you here?

Yes.

Yeah, so Radovan works at the University of Tromsø in Norway, a lot like us at Aalto. His title is research engineer or senior engineer or something like that. But anyway, I know him from CodeRefinery and a lot of other similar joint projects we've been doing among the Nordics.

Let's see. I'm getting reports that the Twitch stream is not working now, but it is working for me. Before we go on, how about we have a quick poll here: can you answer whether Twitch is working for you or not? Well, that's a lot more yeses than nos, so I guess let's go on.

Yeah, I can recommend that you try refreshing the stream. Something might have happened, and refreshing usually helps. And also, if you have some ad blocker or something, that might interfere sometimes. You're not going to be getting any money from ads if they're showing you them, but yeah, those might interfere with the stream. Okay, so let's continue.
So yeah, when we were talking about this workshop, we made a call asking whether anyone else wanted to present something. Radovan had this talk on how to ask for help with supercomputers, which actually provides a lot of insight into the thought process behind us, so he's here to present it to us. Your slides are now visible, so yes, please.

Okay, hi everybody on the stream and on the recording, and hi Richard and Simo. Thanks so much for having me here. It's really nice to be part of this HPC summer kickstart. I will share some slides. My understanding is that we have about 20 minutes, but let's also have a discussion; questions are welcome. You will find a link to the slide deck in the HackMD. As Richard said, I'm not working in Finland, I'm working in Norway, but we have many things in common. We have many projects in common and we collaborate on CodeRefinery, so maybe on to the next slide, which is just a tiny slide about me.

So we work together on the CodeRefinery project. In the past I worked on the research side; I come from theoretical chemistry. But for the last five or six years I've been mostly working in high-performance computing, doing support in Sweden and Norway. I'm working on research software engineering, teaching programming, and really trying to help researchers with computing and programming. I think I do really similar work to what you've seen today and yesterday, but in Norway. In Norway we have the Sigma2 Metacenter, which is an organization composed of five organizations across all of Norway, where we support research and do high-performance computing.

This presentation is a talk that I gave a couple of months ago, I think, in Norway, but I think it's also really relevant here. It's about discussing how we can write support requests and questions so that you get quick and useful answers, and also how we should answer them.
So this is not only for the users, this is also for the staff, because it's a dialogue. Hopefully what we discuss will be a useful guide. It's not meant as, and hopefully doesn't come across as, "do this, don't do that, this is how you should do things." It's a process, we are all learning, and hopefully it's useful. I'm looking forward to questions. I will try to watch the HackMD, but Richard can also relay questions and comments.

Just as a starting point: this is about improving your experience with the system. So Triton, the systems at Aalto, the systems in Finland, but also maybe Nordic systems later, when you scale up. As I said, it's not only for users. In fact, I would also like to create a slide deck and training material for staff, because I think we should also have training for staff on how to support users. At least at the institution where I work, there's definitely room for improvement. So let's get started.

First, credit. I would like to thank my colleague Bjørn-Helge from Oslo, but also Richard and the CodeRefinery team. This is really based on a couple of discussions. I want to say that I've been working on both sides, both the research side and the support side. So although I'm definitely in my bubble, I think I can still relate to both sides of the dialogue.

When we think about the two sides, asking questions and answering questions, I think it's really important to remember that on the other side there is a human being, both when we ask and when we answer. What I mean by that is this: when I'm support staff and I'm getting a question through the Triton issue tracker, or through email, or through chat, I should remember that it's maybe not so easy to ask for help.
I should remember that the person asking me for help has perhaps 20 years less experience with the command line, with Linux, with supercomputing. Also, the person on the other side has perhaps spent weeks wrestling with this problem, and maybe has waited days or weeks for my answer. So it's really important that we always remember to be respectful.

So that's for the support staff. Now, it's also good for the users to know, and this is the situation in Norway, that we actually rotate the duty of who is answering these questions, so it may be different people in different weeks. I don't know whether this is the same at Aalto. But it can happen that one person answers your question and another person picks it up a week or two later. The people answering questions don't know everything. They may also not be spending all their work time on the supercomputer, so sometimes I'm helping researchers, users, who spend more time on the supercomputer than I do. So I don't know everything, but I'm trying to help. And as Simo explained nicely about half an hour ago, I am then the middleman, and I'm also going out to Stack Overflow and looking for answers. I may also not know you: when you ask me a question, we may not have the context, and it's good to create that context so that I can give a helpful answer.

Okay, moving on. When you ask questions at supercomputing centers, and this is really common across basically all of them, there is some form of ticketing system. At Aalto this is the GitLab issue tracker, as far as I understand; please correct me if this is wrong.

Yeah, that's correct.
Yeah, so there is a ticket, it gets a number, and then there is a discussion thread. In Norway we get these requests via email, but they also open up a ticket on our side, so on our side it looks very similar to how it looks at Aalto. Each of these tickets, issues, problems, questions gets a number, and then we try to have the conversation in that thread. Typically each ticket will have an owner: somebody who is watching that it doesn't get forgotten and who makes sure that it is followed up. This will typically be the person answering you on the other side, but the owner can change.

Independently of whether this comes in as an email or as an issue on GitLab, it gets a subject or a title. This is the first thing that we see, and it's really useful if it is descriptive; already that can help a bit. Here are examples of a subject that is not very useful: "Problem" or "Help", because then we have to go in and read the whole thread to find out what it is about. It's good if, just by looking at the subject, like when we read emails, I can get an idea of what it is about, because that makes it easier for me to locate the colleague who can help me answer it. "Why is my job crashing?" is better.

If it is a new problem, then create a new issue on GitLab or a new email, a new thread. Maybe you have been working with a colleague at Aalto Scientific Computing for the last two weeks, exchanging many emails and working on a problem, but now you see a new problem.
It's really good to open a new issue and not reply to the unrelated conversation with this new question, because that makes it easier for the staff: maybe somebody else can pick up the problem, or maybe somebody else knows more about the solution. And if it is the same problem, then keep it in the thread. Reply to that email, reply to that issue thread, so that the discussion stays connected. That makes it easier if one colleague goes on vacation and somebody else picks it up and wants to help: they have the whole conversation in one place, and it's not distributed across many different emails, threads, and issues.

It's also helpful to give us context. For example, what is your username? We can find it out through your email, but it takes a couple of minutes. Also, if you discuss examples, like "My job is crashing, can you please have a look? I have an example in such-and-such folder", it's very helpful if you use explicit paths, so the full path to your example, because there I can see who you are and I know how to find it. If you tell me that I can find it in your home folder, then I first have to look up where your home folder is. All that can be done, but explicit is better than implicit.

If there is more than one machine, and in Norway we have, I don't know, four different clusters, then it's good to say which one it is. Also tell us about your environment, and what I mean by the environment is: what modules have you loaded? If this is something like a conda environment, what are the dependencies? If you set some environment variables or load modules in something like .bashrc (I will come back to that later), please mention it.
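To make the context items above concrete, here is a minimal Python sketch that gathers the username, the machine name, the full path to an example, and the loaded modules into text you could paste into a support request. It assumes the cluster uses Environment Modules or Lmod, which export the LOADEDMODULES variable, and that conda sets CONDA_DEFAULT_ENV. The function names collect_context and format_context are made up for illustration and are not part of any cluster's tooling.

```python
import getpass
import os
import socket


def collect_context(example_dir="."):
    """Gather the context a support request usually needs."""
    # LOADEDMODULES is a colon-separated list exported by Environment
    # Modules and Lmod; it is absent when no module system is in use.
    modules = [m for m in os.environ.get("LOADEDMODULES", "").split(":") if m]
    try:
        username = getpass.getuser()
    except Exception:
        # Some sandboxes have no user database entry for the current user.
        username = "(unknown)"
    return {
        "username": username,
        "machine": socket.gethostname(),
        # Explicit is better than implicit: always report the full path.
        "example_path": os.path.abspath(example_dir),
        "loaded_modules": modules,
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "(none)"),
    }


def format_context(context):
    """Turn the collected context into plain text for an email or issue."""
    lines = []
    for key, value in context.items():
        if isinstance(value, list):
            value = ", ".join(value) if value else "(none)"
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_context(collect_context()))
```

Running this from the folder that contains the failing example gives the support staff the username, the cluster, and an explicit path in one go; it does not replace a description of what you are trying to achieve.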
Your computing environment may be different from mine, and if I try to reproduce your example, it's good if we agree on the same environment. Sometimes text is better than a screenshot, but a screenshot can also be fine. Sometimes an attachment is better than a screenshot. So this context can be really helpful.

Okay, looking at the time, we have maybe 10 or 15 more minutes to go. So now let's talk about formulating the question. Here I have four really good questions to ask yourself; I took them from the help pages of Aalto Scientific Computing, and I will say more about them in a bit. One question: has it ever worked? Is this the first time that you try this, or has it stopped working? And if it has stopped working, what has changed between, say, last week and today? Also, tell us what you're really trying to accomplish: the goal, not only the current technical obstacle. Where do you want to get to? I will come back to this. What did you do with this problem before sending this email or opening this issue? Try to be reproducible. I will also come back to this. And what do you need from support? Do you need a complete solution, or do you want some hints to get started? And if the solution takes a really long time, do you want us to recommend that you think about other solutions first? So these can be really useful. Let's drill a little bit into these questions.

About telling us what you have tried: it's really useful to know whether it has stopped working or whether it has ever worked. Sometimes I get a new request like "I tried to run this on 30 cores and it doesn't work", but just from the question it's not clear to me: did it work on 10 cores or 8 cores, or did it work on one core, or has it never ever worked on this machine? So this context can be useful. If something is failing, is it always failing or only sometimes?
If it's always failing, it may be easier to debug, but sometimes it may only fail randomly, and that is good to know. Have you tried to isolate and simplify the problem? And if yes, how, and what have you tried? More about this in a slide or two. Also check the documentation and the web, but don't hesitate to ask, because sometimes the web is wrong, especially when it's about installations. If you go on Stack Overflow and check how to install something, Stack Overflow says, well, you do sudo apt-get install, and that doesn't work on a cluster. So don't hesitate to ask either.

This connects to the request: please tell us what you really want to do in the end, not just the obstacle. This is the so-called XY problem. So what is the XY problem? This is something that happens quite often. I, as a user, want to achieve X, whatever that is. Now I think: what do I need to do so that I can do X? I'm thinking, maybe searching the web, and now I realize that I need to install software Y, for instance. So I think I need Y so that I can reach this goal. I have this goal in my mind, but I don't tell anybody, so it's in my mind but the support staff doesn't know. So now I'm trying to do this: I'm now trying to solve this problem, Y.
I hit a problem with Y, and now I ask for help with Y. Now we get a support issue and we have a conversation: lots of back and forth, many emails, many questions. After one week or three weeks we finally resolve Y. But only then, after much interaction, do we realize that what the user really wanted is X. We didn't know, because it was never mentioned. And maybe all the effort that we invested into Y was not even the right solution. So my recommendation is to also communicate what you really want: what is your goal? Ask early, when you only know X and maybe you haven't tried Y yet. Let us know about the context. This can really help, because maybe the staff can guide you towards a completely different solution that you may not be aware of.

But this also goes both ways. Is there a comment? No. So this goes both ways, also for the staff. Richard coined this the reverse XY problem. For the staff, it's important that when we answer questions, we don't only answer what the user asked for, but that we also try to read a little bit between the lines and think deeper about what the user really needs or may need. The user may not be aware of what they really need, because they may not be aware of what exists. So it is also a task for the support staff to read a little bit between the lines and maybe suggest a different route, or at least have this conversation. Okay, just pausing here to see whether any questions or comments are coming up.

So I will make one comment. Right bang in the middle of your talk, and we didn't want to interrupt because the talk was really good, it seems that Fastly, the edge content delivery network, went down, so a big part of the internet is down in the whole world.
So Twitch is down, where the stream is supposed to be, but fortunately somebody in Zoom apparently managed to still stream this talk, and hopefully Richard at least recorded it for the future. But yeah, there is a problem with the content delivery network. So basically Twitch went down, Twitter is down, major news organizations are down, Reddit is down, PayPal is down. It's a bit of a mess on the internet, apparently. So yeah, this is a thing to remember.

Yeah, this is the kind of situation where you're giving a talk while the bombs are falling.

Okay, so that explains why my slides were not reloading at the top, I think.

Yeah, GitHub is down as well. If you go to GitHub it looks pretty funny; it's basically plain HTML, at least for me. So yeah, this is pretty interesting. So what should we do? Should we pause and restart somewhere?

I think maybe we can keep going. I think people in Zoom can still watch, and people that already have it open can watch, and anyway it's recorded, so we'll make use of it that way.

Yeah, okay. Keep on going through this crisis.

Yeah, it's a nice example of "has it worked before": how to solve an error while the error is happening, while you're talking about how to solve an error. This is the kind of situation where you need to start debugging immediately when it happens.

Good. Okay, we'll try to carry on with a few more slides. This is a recommendation on what to do when you are new on a machine, either new to supercomputers in general, or you have been working on supercomputer one and now you are moving to supercomputer two. So, first time on a new machine. The recommendation that I would like to discuss, and this is something that I see very often, is to not start immediately with a gigantic job, and "gigantic" can depend on the context.
Of course this depends on the cluster, and I'm just inventing some numbers here, but don't immediately go for 16 nodes, 20 nodes, however many nodes, and not immediately for the 40-hour calculation. Instead, grow the calculation. I will discuss here a little bit why I think this is a useful approach and how it connects to asking questions and reporting issues. This is something I wish somebody had told me when I started on supercomputers, which was maybe in 2005: to first calibrate my calculation and not immediately go for the big thing. So start with something small, a short five-minute run on one core maybe. Once this is working, go for more cores. Once this is working, go beyond one node. And once this is working, increase the system size and make the calculation longer. This is not what I used to do, because when you start with these tiny calculations, they can be unphysical, or the system size is ridiculously small and has nothing to do with my research. But that doesn't matter. It's first about getting all these different parameters and toggles right, and then I can go for the real system.

Here is a question to the audience, but I'm not sure how much audience is left at the moment, so maybe I can discuss it here. The question would be: what are the advantages of this approach? I think this is a bit related to experimental research. If I got an expensive instrument into my lab, and the supercomputer is a big, complicated, expensive instrument, I would also not immediately go big and try something very novel on it. I would first calibrate it. So this is about calibrating your job scripts and the parameters. The advantage of starting simple and growing is, first of all, that if there is a problem, it will take less time until the problem appears. A five-minute calculation will not queue forever.
It will queue for a couple of seconds, then it will start, and if I have a typo in my script, I will see it immediately. Otherwise I might have to wait for three days until I see that I have some really basic mistake. Also, if I have slightly less trivial mistakes in my input files, my script, or my configuration, growing the calculation will help me identify them, just by having a simple example. This can also be useful when you hit a problem. When I hit a problem, it can be really useful to simplify the example, so then you go the opposite way: you make the problem smaller and smaller, because that can again help identify where the problem is. It's also helpful for the staff, because when they debug your problems, they also have to queue and wait.

So when you experience a problem, try to create a small reproducible example. It's not always easy, and it may not always be possible, but it's a really good process, which takes some time. Create the example, or make it fail, as early as possible; this improves debugging. If you have an idea of how long it takes until it fails, tell us that. If you know that it crashes after half an hour, that's really good to know. Again, is this reproducible? Does it always happen, or only sometimes? If it fails after two seconds, then don't ask for 48 hours in the job script, because the queuing system doesn't know that it will fail immediately, and you may wait way too long in the queue for the job to start and crash two seconds later.

Load all dependencies in your script, because then they are close to the calculation. I apologize, I'm not sure whether this has been discussed; it may be discussed later today or tomorrow. But I recommend not loading dependencies in your .bashrc.
Really, the right place is in the job script. Attach all necessary files. If this is an interactive job, then it's useful if you can provide all the commands from login to the problem, so that I can reproduce them. So from the first login to the machine, you can give me a recipe: do this, go to this folder, then do module load this, module load that, now run this script, and now you will see this happening. This recipe can help the support staff a lot. Otherwise we will have to do a lot of questions and answers to figure all this out.

Okay, at most two more slides. Why this small example? Because just making the example smaller often simplifies the problem. We remove variables, and that can help us identify the reason. So just this process, which can take some time, can actually help us find what the problem is. We reduce the number of possible reasons. Also, the shorter example is easier to debug and doesn't queue forever.

And I recommend creating such a small example for yourself, independently of whether you have a problem right now or not. Create a small example, something that runs in five minutes, something where you know the result and also the timing. Then, next time something is happening with the machine and you are unsure, "is it me or is it them? Did I mess up something, or did something happen with the cluster?", you can run the example that you know very well. If it crashes immediately, you know something happened on the machine. And if it now takes five times longer, then you know that maybe something happened with the file system. Then you can attach this example to your issue and say: look, this is my example, I know it very well. It always works in five minutes, and now
it doesn't. Please have a look, something is wrong with your cluster. That is also very, very helpful for the staff.

So let me summarize, and hopefully we can then have a discussion. When facing a problem, spend a few minutes on it before writing the email or opening the issue, because it can improve the information and speed up the rest of the process. But also don't hesitate too long to ask. I think it's wonderful that Aalto Scientific Computing provides these garage sessions every day; it's really cool that this is available, so I would encourage everybody to pop in and ask questions there. Many people don't know that you can actually install most software yourself, and staff will be happy to show you how, because we really want to empower users to take control over their software installations. You don't need to be an administrator to install most software. Let's remember the XY problem from both sides, both as users and as staff: let's communicate what we really want, not only the immediate technical obstacle. I highly recommend growing calculations, both when looking for problems and to find out how they scale with system size. So don't go immediately for 250 cores, but grow the calculation and find out how it scales when you add more and more processes to it. Create a small example for yourself; one day it will be very handy, so that you can find out whether a problem is on your side or on the cluster side. Try to make problem examples short and reproducible. Thanks so much for listening. Do we have time for questions? Are there any questions or comments?
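As a rough illustration of growing a calculation, the following Python sketch renders a series of Slurm job scripts, starting with one task for five minutes and growing from there, with the modules loaded inside the job script rather than in .bashrc. It assumes a Slurm cluster with a module system. The module name, the program name, the input file, and the sizes in LADDER are hypothetical placeholders; only the #SBATCH --ntasks and --time options and the module purge and module load commands are standard.

```python
def render_job_script(ntasks, minutes, modules, command):
    """Return the text of a Slurm batch script for one calibration step."""
    hours, mins = divmod(minutes, 60)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --ntasks={ntasks}",
        # Ask only for the time this step needs, so it starts quickly.
        f"#SBATCH --time={hours:02d}:{mins:02d}:00",
        "",
        "# Load dependencies here, next to the calculation, not in ~/.bashrc",
        "module purge",
    ]
    lines += [f"module load {name}" for name in modules]
    lines += ["", command]
    return "\n".join(lines) + "\n"


# Hypothetical calibration ladder: (number of tasks, time limit in minutes).
LADDER = [(1, 5), (4, 5), (16, 10), (64, 30)]

if __name__ == "__main__":
    for ntasks, minutes in LADDER:
        script = render_job_script(
            ntasks,
            minutes,
            modules=["gcc/11.2.0"],                   # hypothetical module
            command="srun ./my_program input.small",  # hypothetical program
        )
        print(f"# --- step with {ntasks} task(s), {minutes} min ---")
        print(script)
```

The idea is to submit the next step only once the previous one has run cleanly. The smallest step also doubles as the five-minute reference example with a known result and timing.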
Yeah, let's see. There are some questions on HackMD that were mainly about scaling jobs themselves, but we'll cover all of those things tomorrow, so that's not the main point here. I guess one question I have: it seems a lot of this is about complex computational jobs, where you have the simulation all ready to go and you're running it. So how do you recommend people ask for help at an earlier phase? Before you're starting, should you ask for help?

I recommend it; there are no stupid questions. And again, it's good for the staff to remember that many people are new to the terminal, new to the command line, new to Linux, new to all of this. It's good to ask early, even before you have any scripts or anything there, because that can also help you ask for the right resources. Starting on a supercomputer is often connected to actually applying for compute time or storage, and you may not really know what you need. You may not know: should I go for the CPU or the GPU? Should I go for this machine or that machine? So I would recommend asking: "Hey, I have this problem", so that we can help direct you towards the right service and towards the right resources to read.

Yeah, one of the things I would like to have more of is low-threshold questions, like: I'm about to do such and such. I don't need you to tell me a lot, but just point me in the right direction.
So I start reading the right things myself, and I can come back in a few weeks if I have questions, rather than spend a few weeks possibly doing the wrong thing and then coming and asking for help.

But at the same time, all these other things, like the garages, the chats, and so on, are good for these small questions, but they're not good for big questions. If you go on the chat and say "My code doesn't work" and we try to answer it, the problem is that we can either answer the question right away or it's probably going to be forgotten and scrolled away, and who knows if we'll get back to it. So we'll also have a low threshold for directing you to the issue tracker or the other systems in order to keep track of the longer-term things. So don't think that if we direct you to the tracker, that means the question was bad. It just means that we need to track it a different way. It's also okay to come to the chat and say, "I have a question about such and such. Is this a good question to ask?", and if we say yes, then you send the email about it.

Yeah, and also: is this a good place to ask that question? Because there can be many different places, so we can help direct you to where might be a good place.

Yeah. Okay, so I don't see many other questions about this on the HackMD. One interesting thing mentioned here was basically the XY problem, which is like