Hi, and welcome to the second video introduction for Hands-on Scientific Computing. My name is Essi and I work at Aalto Science IT in teaching and communications, which means I connect new researchers and new users to existing resources. I'm here with Richard, who can introduce himself.

Yeah, I'm Richard Darst. I also work at Aalto Science IT, and like Essi, my interests these days are usability, teaching, and helping everyone use the resources that we provide.

In this video, we will be referring to a survey conducted on scientific computing, where one question was: what do you know now about scientific computing that you wish someone had told you when you began as a researcher or a student?

Yeah, and this question really lets you see what kinds of things people learn implicitly, without being taught them in courses, even though they are really important for your career. So hopefully this will give you a lot of good things to think about.

There were 66 answers to that question, and I have picked out some of them to talk about, in the hope that they will be helpful for someone who is just starting out. The first answer that caught my eye was: "Scientific computing should be introduced earlier in the studies. It is basically mandatory, so it should be brought up. The leap from tiny Python scripts to scientific computing with large data sets is huge and quite intimidating, especially for an undergraduate student." So what would you suggest for a person who is now moving from courses into research? What should they first pay attention to in order to keep up?

Wow, that's a big question to start with. Maybe in part all of the other questions that we answer relate to this, but I think the biggest differences are the scale and the independence of what you're doing.
You're not being told "here, do this" and handed the tools to use before you report back; you're given a general problem, and you need to find the right tools to solve it. A lot of this involves things like managing your data and code, and spending time learning the tools you need while you're actually working on the problem.

Then there were quite a few repeating answers: "How to use the computing resources, how to access resources in practice and use them for computing." "I still have a feeling that I do not know enough about practicalities, how to use all possible opportunities, and how to Google properly." So how could these people be advised to find resources that suit them? And how does one learn to find information efficiently?

Well, okay, that's an interesting question. I interpret this as follows: often you do your courses on your laptop, and then you start doing research. Do you keep using your laptop? I know there have been people who've done a lot of work on their laptop and then realized, "oh, there's a cluster I could have used, much faster, and I didn't know about it." For that, we have various courses which can introduce the resources to you, and also how to use them. But a lot of it is also this: before you immediately start working on a problem, do some searches. What is available? What do we provide for free, and what do we support? On the Aalto Scientific Computing pages we have lists of many different things, and you can find a good page at Aalto called IT Services for Research. And then also go and ask your friends and other people doing similar things: "What are you using to do your computing?" Maybe that's even the best way to learn, from each other.
Maybe we could talk about what the different general categories are.

Sure. The first category is your own laptop or desktop, whether it's your own or provided by the university. This is good because it's easy and you have full control, but it doesn't have as much power as the other options, and it doesn't really scale up. Even if you can make a program run on your own machine in your own editor, there's still a big gap between that and the cluster. Then there are various remote resources, whether remote desktops or remote servers you can connect to and run on. And finally there are the computer clusters, where you connect, you run there, and you have to script things; but once you script them, you can run very many things at once. So instead of starting five copies of your program by hand, you write a batch script and start 500 copies of your program.

All right. And then we had the following comment: "You generally need less computational power than you expect."

That's true.

So why do people usually get the impression that scientific computing requires huge computational resources?

Yeah, that's an interesting question, especially given that the previous answer was that you might need to use a cluster so you can do more. I think there are two aspects here. First, one processor on the cluster is not necessarily much more powerful than one processor on your own computer. The power of the cluster is that you can do things many times: you can write a script to run your code over and over again with many different parameters, or on many different data sets. So the difficulty is not that you need one processor that is much more powerful; you need to be able to use many processors well. You need to write your code so that it uses less memory or less time and runs faster. So basically, it's more about your own skills than the computing power you need.
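As a rough sketch of the "run your code over and over with many different parameters" idea above, here is a minimal Python script (the file name and the `analyze` function are invented placeholders) that takes its parameter on the command line, so a batch script can launch hundreds of copies with different arguments:

```python
# sweep.py -- hypothetical sketch: one copy of this script handles one parameter.
# A batch script would launch many copies, e.g.:
#   python sweep.py 0.1
#   python sweep.py 0.2
#   ...
import sys

def analyze(param):
    """Placeholder for the real computation."""
    return param * param

if __name__ == "__main__":
    # fall back to a default so the sketch also runs with no argument
    param = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
    print(f"param={param} result={analyze(param)}")
```

Each copy is independent, which is exactly why this pattern scales so easily on a cluster.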
And then, in those cases where you do need more computing power, you also need the skills to use it. Once you're running things 500 times, you really have to make sure that your own code is reasonably fast and not doing things that are slow for no reason. That doesn't mean you have to spend forever optimizing, because at some point it's better to just let the computer run; the computer's time is cheaper than your own. But there's a balance there.

Yeah. And just to remind our viewers: you don't have to spend time optimizing your programs to perfection. They might never be perfect, but they need to run, be correct, and be changeable when needed.

Yeah, that's a good point about being correct, because there are far worse problems when things are wrong and you have to do them all again. So what's next?

Next we have a fancy term: parallel coding. Somebody wished they had known that parallelizing your script can be as simple as running the same script multiple times with different input data sets. Reading this, I wondered: how does one know whether learning parallel computing is worth the effort, and how big should the jobs be?

Yeah, that's a really good question: when is it worth the effort? To be honest, I have not done that much parallel programming myself. I've done some where it was needed, but things are very different these days than 10 or 20 years ago. First off, most people need to realize that most parallel coding is not that fancy. It's called "embarrassingly parallel," which means that instead of making one program that can use 500 processors, you make one program that runs 500 times, each on one or a few processors. And this goes right back to scripting and automating things. Second, these days there are often libraries that do the parallelness for you.
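When a library parallelizes for you, the main skill is controlling how many cores it actually uses. One common mechanism (a sketch; whether it applies depends on the library, but many OpenMP-backed numerical libraries honor it) is the `OMP_NUM_THREADS` environment variable:

```python
import os

# Many OpenMP-backed numeric libraries read this when they are first imported,
# so it must be set before importing them -- otherwise the library may grab
# every core on the machine, or only one, regardless of what you requested.
os.environ["OMP_NUM_THREADS"] = "4"
```

On a cluster, matching this number to the number of processors you reserved is what avoids the "asked for 10, used 1 (or 40)" problem.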
For example, if you're doing a simulation, maybe the simulation package runs in parallel for you; or if you're doing numerical work, Python's NumPy library does it in parallel for you. In that case, it's not so much about knowing how to do it yourself, but about knowing what the library is doing, so that you can control it properly and have it work well. One common problem is that people run code and say, "okay, I need 10 processors for this," and then it uses one; or it tries to use 40 and is inefficient, because they don't know how to run it on the cluster the right way.

Once you do start doing things in parallel yourself, there are two main methods. One is shared memory, the OpenMP model, which means running on one computer and using multiple processors. The second is message passing, which lets a program run on multiple computers and send data back and forth between them. Both have structured ways of doing things, so you're not making everything from scratch. And of these, shared memory with OpenMP isn't that difficult overall, and it is enough for many problems these days, because one computer has so much memory and so many processors that it can really solve most problems. So I might recommend looking at your own programming language and seeing a little bit about how things are done in parallel there, both the basics and some of the more advanced methods, but don't worry too much about it right now. When the time comes, you can use it, and you'll also know how to use the libraries that run things in parallel and how to control them properly.

And next we have environments. Some people left these comments: "environment preparation" and "basic stuff such as installing software using conda environments." Environments play a part in the reproducibility of research, but in a nutshell, what kinds of issues can easily be avoided by getting used to environments?

That's a really good question.
I'd say that, without a doubt, most of our support issues these days come from people needing to install their own software. So let this be a lesson to everyone: once you start making software that others use, spend some time making it easily installable, or else no one will use it and you've wasted your time.

So, environments. First off, learn how your own software's packaging works. For example, in Python, which many people use these days, there's pip, which installs packages, and it's important to know how that works. There's also the concept of environments, or virtual environments as they're called in Python, which means that instead of installing software for your whole user account, where it's shared by everything, you have one environment per project and install software inside of that. So, for example, project A has its own virtual environment, and so on. This is really important because what you do for one project won't interfere with your other projects. It makes these environments throwaway, or reproducible: if you ever get into trouble, you can delete the whole environment and install things again. This has two good effects. First, when something goes wrong, you can start over without breaking every single project you're doing. And second, once you can reproduce an environment, you can test how it works and make it so other people can recreate it too; and when problems come up, you find them early, rather than in five years when you're trying to run things again.

There's also the concept of choosing the versions of the things you install, so that you can go back in time. You don't want a new version of, say, TensorFlow to break all of your own code when it gets released because it does something new.
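One simple way to record those exact versions so the environment can be recreated later is to dump every installed package with its version, roughly what `pip freeze` does. A minimal stdlib sketch (the output file name is arbitrary):

```python
from importlib import metadata

# Record every installed package with its exact version, one per line,
# e.g. "numpy==1.26.4", so the environment can be rebuilt later.
with open("requirements.txt", "w") as f:
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip the rare distribution with broken metadata
            f.write(f"{name}=={dist.version}\n")
```

Later, `pip install -r requirements.txt` in a fresh virtual environment reproduces the same versions.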
So you can install the exact version you need, and then keep it the same while you're doing your project, until the point when you're ready to upgrade. Most modern programming languages have some way of doing this: both managing the dependencies that you need, and pinning the versions and installing them in isolated environments. So whatever your language is, take some time and learn about it.

Yeah, and I would also like to mention that maybe some people are already at a more advanced level. So how about containerized workflows for environments?

Yeah, so the concept of a container is that it's practically the whole operating system put into a single file, which can be run. This is great because you can set up your environment once and then copy it from computer to computer without reinstalling things. If you know how to do that, it might be useful in some cases. But in many cases, being able to make the conda environment, or the virtual environment, or something like that reproducible is more useful and more important than being able to build the whole container as an operating system, at least for science, where our tasks are not as demanding as, say, running Amazon.

The next one is very interesting: "When to do new things by myself versus when to reuse what others have done, because I think schools teach us something very different from what it is actually like in real life."

Yeah. When you take a class, if you go and find "oh, someone else has solved my problem for me, I'm going to use this to do my assignment," then you're cheating or something like that.

Exactly. And that's bad.

But when you're doing research, if someone else has already solved it, then it's not research; it's just redoing something that someone else has already done. So you need to be able to use the good libraries people have made.
Whether it's something really big and important like, say, NumPy or TensorFlow or pandas: there are many of these very common packages which make the basic data structures, data handling, simulations, and computations easy. But also small things: if you're doing something that someone else has done in another paper and they've released their code, you need to be able to take their code and reuse it, both to check what they've done and to extend it to do more things yourself. And these are two very different tasks. The first, using the big packages, is about knowing the best practices and reading documentation. The second is being able to look at someone else's code and determine: is this even correct? Is it something I should be reusing, or am I going to try to use it, realize it's full of bugs, and then have to find something else? Really, this depends on what language you're using and how you do things yourself, but it's something you should think about. So I would say: before you start doing something, see what you can reuse; and before you start reusing it, take a little look at it and decide, is this stable enough and reusable enough for me to use? You can often figure that out by looking at the documentation, install instructions, example use cases, and things like that.

Yeah, and also keep in mind that it's a perfect learning opportunity, because you actually have to think about how to connect somebody else's code to your own. So you really have to know your own code throughout.

Yeah. And it's also a learning opportunity because you see what you need to do in your own code to make it usable by other people, or even by yourself five years from now. Like, if something's missing from someone else's documentation, make sure you write it down for yourself.

Yeah, and so on.

Exactly.
All right, and then: setting up workflows and pipelines.

So, let's see. The idea of a workflow or pipeline is basically that instead of running things by hand, you somehow make them automatic, so that you can run one command and all your work is done. That's a magical dream ideal which doesn't happen for many people. But the opposite case is where you have a bunch of scripts that have to be run in the right order to get the results, and then in six months, when you have to do things again because you need to update for the next version of your paper, you don't remember what you did. That's a really bad situation to be in, and I've seen many people get there. So I guess the main lesson is: at least try to script things a little bit. Write down what you've done, or script things so that you can run one script that runs multiple commands. And if things get advanced enough, there are workflow automation tools which can basically understand what all the steps are and rerun everything in the right order when the code or data has changed.

Well, yeah, I think many people don't get there, but it's important to be aware that it exists.

Yeah. And then we have: "how to manage my data and code." I think this is a topic that is not at all paid attention to in courses, whether in your undergraduate or even graduate studies.

Yeah, good question. I guess in courses, after you do an assignment, you don't need to do it again, so there's no concept of keeping track of stuff for the long term. When you're a researcher, you do things for three to five years, or even longer, and if you aren't able to find the stuff you've done before, you have a really major problem. Well, everyone will have their own style, so talk to your friends and colleagues, see what they do, and use the best practices.
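The "one command reruns everything in the right order" idea can be sketched in a few lines of standard-library Python; real workflow tools like make or Snakemake do this far more thoroughly, and the file names and the cleaning step here are invented for illustration:

```python
import os

def needs_update(src, dst):
    """A step must run if its output is missing or older than its input."""
    return not os.path.exists(dst) or os.path.getmtime(dst) < os.path.getmtime(src)

def run_step(src, dst, func):
    """Run one pipeline step only when its output is out of date."""
    if needs_update(src, dst):
        func(src, dst)

def clean_data(src, dst):
    # stand-in for a real processing step
    with open(src) as f, open(dst, "w") as g:
        g.write(f.read().upper())

# The whole "pipeline" is one entry point; each step reruns only when needed.
with open("raw.txt", "w") as f:
    f.write("measurements")
run_step("raw.txt", "clean.txt", clean_data)
```

Even this tiny version captures the key property: six months later, one command rebuilds the results in the right order.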
I guess version control is one of the most important things for managing code, and possibly small data. At the very least, you want to minimize copying and pasting. There's a concept called "single source of truth": instead of making copies and modifying each of them, so that they all do slightly different things while being somehow related, you know the master place where everything you do lives, and you try to have as few master places as possible. For example, on all of my computers I have one directory called git, and that's where all of my projects go these days. If I need to find something, I don't need to look in my data directory and my projects directory and this and that; there's just the one git directory. And if I need to synchronize it, or find something on another computer, I know there's one place to look, either GitHub or GitLab, depending on what I'm doing. But then you'll find your own style: there may be big data that can't be in git, so you need to keep that organized a different way. When you're working with other people as a team, it also becomes quite difficult. So I honestly can't tell you exactly what you should do, but I can tell you this: don't do things without thinking about it a little at the start and discussing it with the people on your team.

Okay, and last but not least, we have "how to write in a reusable manner." I think first we should define the issue here.

I think what this means, at least from my experience, is this: I start working on a single project and do a bunch of work for it, so I have some code and data and things like that. Then I start another project. Do I start project number two from scratch, or can I easily use everything I did in project one within project two? Making these kinds of things that you can reuse in each project is really a key to productivity and efficiency, and also to doing things correctly and enjoying your work.
Here's what I've found works well for me. I start one project and I don't really know what I'm doing, so I just do things. But over time I see that some common task is going to be reusable, so I take that and split it off into another directory and make it a module, which can be imported into my other code. In Python this is easy; there are modules for exactly that. These days I would also make it an installable Python package, which can be installed into my other environments and modified. So keep in mind that there's a difference between code for one particular project and code which is reused among multiple projects, and on the multi-project code you spend more effort documenting it, standardizing it, and making it actually work. You just have to accept that for one-project things, you do a bunch but don't spend as much time on them. And within the code, instead of writing, say, one giant script, you split things into functions or modules and so on, which can then be pulled out into these reusable pieces. Once you start doing that, it also becomes really important how you design an interface that is stable and reusable. These are things that are not really taught in many courses; maybe they are if you take software engineering courses, but probably most people watching this video are not software engineers, and that's not your goal. Your goal is to do some other science or research, where you're basically learning a lot of this yourself.

Yeah. Well, I guess the video description will have some links to some other resources here. Maybe the most important thing to learn is that you're not alone: talk to your friends and colleagues, talk to the staff, to us at Science IT, and we can give you some advice and make sure things are going in a good direction.

Um, yeah. And always keep learning.
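The split between "one-project script" and "multi-project module" can start as simply as putting the shared logic into functions and guarding the script-only part, so the same file can be imported from other projects. A minimal sketch (the `normalize` function is just an invented example of a reusable task):

```python
def normalize(values):
    """Scale numbers so they sum to 1 -- generic logic, reusable across projects."""
    total = sum(values)
    return [v / total for v in values]

if __name__ == "__main__":
    # Project-specific part: runs when the file is used as a script,
    # but not when another project does `import` on this module.
    print(normalize([1, 2, 5]))  # -> [0.125, 0.25, 0.625]
```

From here, moving the function into its own directory and packaging it for installation is a small, natural next step.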
So, please keep going through the material on hands-on-scicomp.readthedocs.io, and some additional info about Science IT and upcoming training can be found on scicomp.aalto.fi.

All right, thanks for watching.

Thanks for watching.