Hello, I'm John Dey. I work at Fred Hutch and support scientific computing there. It's still fairly new in America, the land acknowledgement: Fred Hutch acknowledges that the land the center sits on belongs to the Salish peoples, and that the waters touching it belong to the Duwamish, Suquamish, Tulalip and Muckleshoot nations. We've been using EasyBuild since about 2015, and it has really influenced how we design our cluster now: we decided several years ago that everything would be deployed with Lmod and built with EasyBuild, with only a few exceptions.

We're a cancer research center, and in the last year we merged with a clinic, so we've dropped the word "research" from our name, and we now have patients on site. We do sequencing. We have 300 principal investigators; "PI" is a term used in Canada and the US for somebody who runs a grant. Some of these PIs are just individuals; some have staffs of 50 to 100 people. We also have a lot of graduate students and postdocs on campus. Fred Hutch is home to three Nobel laureates, Robert Gentleman was at the Hutch for eight years, and Bioconductor started there, so there's a big legacy of using R at the Hutch.

I'm just throwing this slide up as a demonstration of some of the scientific work that's done. This is Trevor Bedford's complete phylogenetic tree of COVID-19, and the data is geolocated, so you can zoom in around the world to see where each sample came from. It's an interactive graphic on our website, and you can zoom in to incredible detail, down to the individual samples; it's just a fun slide, I like to put it up. And again, we're actually developing cancer cures at Fred Hutch; I just love this graphic too, a T cell attacking a cancer cell.

After listening to all these amazing supercomputers at the conference, I'm never going to use the word HPC again; I'll just call it scientific computing, because bioinformatics has very little to do with HPC. When we deployed our new cluster, EasyBuild and the software stack became so important that we now deploy our nodes with the absolute bare minimum. It's truly Ubuntu straight out of the box with hardly anything added: there are no dev packages, and if you type "make" it says "command not found". It's just a core for running jobs, so everyone needs to load their software at the start of their job. There are no compilers and no development tools, and we rarely get complaints about this; people don't develop software with GCC the way you would see at other supercomputing sites, and on the rare occasion it comes up we just say "module load foss", and then they have all the tools they need.

One of the trends we're seeing is increased GPU usage. Three years ago we had no GPUs; now we have one GPU per node, and that creates some interesting problems: if you start a hundred or so jobs that require GPUs, your jobs land on every single node, since there's only one GPU per node. The other problem is that we bought consumer-grade GPUs just to try them out, put them out there, and see whether people would start to use them. They have, but with 8 GB of memory those cards aren't sufficient for building models, so some PIs have gone out and bought A100s, fully maxed out on memory.
What we're learning is that models require lots of memory. We don't see much general computational use of the GPUs beyond that; PyTorch is probably one of our biggest applications, and AlphaFold is of course one of the new applications people are very interested in. And the requests for modules just continue to grow: when I started at the Hutch we might have gotten a request once or twice a month, now there are several a week, and it's really daunting. And the quality of bioinformatics software, I hate to say this on a recording, but it's really terrible.

There's something I like to do. I started using Elasticsearch and Splunk many years ago, and I'm sure everyone does this in their SitePackage.lua: I log key-value records for the things we monitor, and I try to obey JSON rules completely. The other thing I like to do, and I got this off the web somewhere, is pick up the job ID. If you're running a cluster job, it's in your environment, so you can pick it up in Lua with Lmod and inject it straight into Elasticsearch. So if someone has a ticket saying "my job doesn't run", rather than having to ask them what they're doing, we can just go into Splunk, type a query, and see everything they've loaded. We can associate all of those packages with the job, and that helps us a lot in debugging and working with users.

And since I have all this streamed into Elasticsearch, I can get amazing metrics on who's using what, down to individuals. You can see we're very heavy on fhR, our Fred Hutch R, which I'll talk about in a minute, and RStudio is certainly growing. For things like BCFtools, I think that number is so high because it's being run out of our laboratory: there are processes people run from instruments that stream into our cluster, so every time a laboratory process runs, these things get run. The R usage is really high as well, because people will use just a single function; they'll use R just to grab a column of data out of something in a job, loading this huge software stack to do something most of us would find a very simple way to do.

I just want to share this: a couple of years ago in scientific computing, and there are only about seven or eight of us on staff, we decided to volunteer one hour a week to writing documentation. It seems like a small commitment, but after two or three years we have a very valuable resource. Now we're partnering with the data science group, and also with a group called Hutch Data Core, the people who run all the instruments, the sequencers and things like that, the source of the data. This is public, you can find it online, and it's shared everywhere. Part of how this came about is help-desk tickets: when you get the same one three or four times, you just think we need to write better documentation. There's constant feedback from our users, as people don't understand how to do something or things break, and we continue to document that. It's also a great resource for postdocs and people who walk onto the campus for the first time and try to use our resources, so it's a good entry point for users. We've been really successful with this.

So, as was said in a talk the other day, we really need to make our R and Python bundles smaller.
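Going back to the module-load logging described a moment ago, here is a minimal sketch of that kind of SitePackage.lua load hook: it emits one JSON record per module load and picks up the Slurm job ID from the environment. The field names and the syslog hand-off are assumptions; the talk doesn't specify the actual transport into Elasticsearch used at Fred Hutch.

```lua
-- Minimal SitePackage.lua sketch (assumed field names and transport):
-- log every module load as a JSON record that includes the Slurm job ID,
-- so a log shipper can forward it into Elasticsearch/Splunk.
local hook = require("Hook")

local function load_hook(t)
   if mode() ~= "load" then return end
   local record = string.format(
      '{"user": "%s", "module": "%s", "jobid": "%s"}',
      os.getenv("USER") or "unknown",
      t.modFullName,                        -- e.g. "fhR/4.2.0-foss-2021b"
      os.getenv("SLURM_JOB_ID") or "none")  -- set inside cluster jobs
   -- hand the record to syslog; a shipper (e.g. Filebeat) can index it
   lmod_system_execute("logger -t lmod -p local0.info '" .. record .. "'")
end

hook.register("load", load_hook)
```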
These are just glossy overviews, but we have fhPython, Fred Hutch Python, and fhR, Fred Hutch R. These have grown from user requests, every time we get a request for a package: we take the base R and extend it into fhR, which is very bioinformatics-focused, a giant collection of the requests people have given us over the years, and we continue to maintain it. In the last two years the number of requests got too high, so we've started to push back, and we've documented that in the site wiki: please install your own packages, use a virtual environment or something like that, unless it's a case where there's compiled C code or libraries or something difficult, and then we'll help you out and add it into the base. And fhPython is way too large; it just has to get smaller. I would volunteer to help with both the R and the Python bundle: the packages have docstrings in them, so I'll tweak easy_update to print out those descriptions, and if a package doesn't have the word "science" in it, it ends up on the cutting-room floor; I'll just try to edit some of that stuff out.

So, again, this is what we're doing at the Hutch to support this. It certainly does help us to just use the community R and the community Python, and I strongly encourage that: build your own site versions on top of those rather than extending the community version itself.

R is wonderful; CRAN is run by CI, and it's not uncommon for a new version of R to come out and to actually see packages drop off because they didn't build. Sometimes I have to wait a week or so for people to patch their code, because our scientists want those packages but they're not available; CRAN literally deletes them. If your package doesn't pass, well, the stuff has to run. And sometimes I've patched packages for other people because they're so important: I'll do the patch for them, get it submitted, get it back into CRAN, and then build the newest version of R. The easy_update code for R is super stable; I haven't touched it in years and it just seems to work.

PyPI, well, there's only so much I can say on a recording, but if anyone can help me or has any ideas about how to make it better, I'm at my wits' end. It seems like the only method that would really be effective is to spin up a little container and try to install the package, but the time and energy to do that for every dependency would be enormous. Right now my easy_update code is really only good for going to PyPI, looking at your old version and writing the new version in there; dependency resolution is really difficult because the packages aren't well enough annotated. And I think it's time for the scientific community to just say that this has got to get better.

A quick remark on this: there is actually a way to figure out the dependencies of Python packages before installing them. It's not through PyPI directly, but there's a website called libraries.io, which tracks PyPI, which tracks CPAN, a whole bunch of sites like this, and they have a way of listing all the dependencies on a package's page. It has an API you can talk to and pull all that information down. So not by talking to PyPI directly, but by talking to another website that somehow harvests that information. I'm not sure how they do it or how reliable it is, but it does list the dependencies there.
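As a concrete illustration of that suggestion, here is a minimal Python sketch of querying the libraries.io API for a PyPI package's dependencies. The endpoint shape, the api_key parameter, and the response fields follow libraries.io's public API documentation, but treat them as assumptions to verify.

```python
# Sketch: list a PyPI package's dependencies via the libraries.io API.
# Endpoint and response fields are assumptions based on the public API docs.
import json
import urllib.request

API_KEY = "your-libraries.io-api-key"  # free key from a libraries.io account

def pypi_dependencies(name, version):
    url = (f"https://libraries.io/api/pypi/{name}/{version}/dependencies"
           f"?api_key={API_KEY}")
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # each dependency entry typically carries a name, a version requirement,
    # and a kind (runtime vs. development)
    return [(dep["name"], dep.get("requirements"))
            for dep in data.get("dependencies", [])
            if dep.get("kind") == "runtime"]

print(pypi_dependencies("requests", "2.28.1"))
```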
Excellent, that's easy to parse; I'll start on that Monday.

I threw this together as a quick example for people who are trying to build new easyconfigs that have anything to do with exts_lists. If you get a request from a user for a package name, just put it in the exts_list and run easy_update over it. It parses the file, goes out and finds the new version, sees that it doesn't match, and writes the new one in there, so save yourself the time of doing it manually. In the last year I've had to change easy_update so that there are now flags to say "update R" or "update Python". Early on I used to try to open the easyconfig and figure out which language it was; that was easy to do at first, when it had the word Python or R in it, but the way easyconfigs are written now that's so unreliable and so difficult that I've decided you just explicitly say what you're trying to do, and the code will go to the right section.

This is a slide I recycled from a few years ago, and I think I have an open issue about it, and I think the person who opened it is in this room. When you start to update an exts_list, the first thing I have to do is read your easyconfig and look at its dependencies, because there will be exts_lists in those dependencies too. I need to discover all of that, so I recursively open every file: Python has pybind11, for example, and I open every single dependency file, extract every exts_list, and build a combined list first, so that I'm not repeating something that's already in another file. I actually have to go looking for each easyconfig: it needs to be either in the directory I'm running in, or I need to discover it through where your EasyBuild is installed, so I have to resolve where your libraries are. You might have local easyconfigs that aren't in the main repo, so I look both in situ and in the installed base. It's good to have the latest version of EasyBuild installed, and the latest versions of all the easyconfigs, to resolve all these dependencies. Then I go out to PyPI or CRAN and start looking at all those little version numbers and updating them. And you can see, this one's highlighted: configparser is a dependency pulled in by entrypoints, so I try to annotate where these things are coming from.

Something I could easily add to the code is the idea of building a dependency tree, as a DOT file, the way Graphviz does it. That would be very telling to look at, and if I had that information it would also help with parallel builds: you could find the right place in the tree and parallelize the build from there.

Just a couple of observations about easyconfigs. Some of the bioinformatics code, I mean, how could I even submit this: recursive checkouts that pull stuff from all over the place, or a hard-coded directory somewhere. It's just so ugly and difficult that sometimes I have to cheat and do a few things by hand, or I make my own tarball by hand and put it in our sources directory, and from there the easyconfig will work.
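Back to the dependency-tree idea from a moment ago: the DOT output itself is simple. Here is a hypothetical sketch where the name-to-dependencies mapping stands in for what easy_update would collect while walking the exts_lists.

```python
# Sketch: emit a Graphviz DOT file from a package -> dependencies mapping.
# The mapping here is hypothetical stand-in data.
deps = {
    "entrypoints": ["configparser"],
    "fhPython": ["entrypoints", "pybind11"],
}

def write_dot(deps, path="deps.dot"):
    with open(path, "w") as fh:
        fh.write("digraph deps {\n")
        for pkg, reqs in deps.items():
            for req in reqs:
                fh.write(f'    "{pkg}" -> "{req}";\n')
        fh.write("}\n")

write_dot(deps)  # render with: dot -Tpng deps.dot -o deps.png
```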
EasyBuild is filled with patches, but where most people are fixing Makefiles, sometimes I'm actually fixing the end user's code. So I'll take that and try to push it upstream; I'll immediately start using it on our site, but also wait for it to come back and then update the easyconfig. Sometimes that process can take months and months, and then I lose track of it and forget. And so much of the bioinformatics code, again, is just a Python script, so it seems odd to add it to the community repo.

About conda: often I go to these bioinformatics pages and they'll say the only way to install the tool is conda; they don't offer an alternative. So I'd encourage people, when you see that, especially once there's an easyconfig for the tool, to go back and open a PR against the original software to update their documentation and say you can also install this with EasyBuild.

Containers have also been a sticky point. All of our users in bioinformatics want them, and I think that day has kind of come and gone for EasyBuild. All my users now just go to BioContainers and grab these things. I spent a lot of time trying to use EasyBuild to put software in a container, and I've just not been very good at it; it's kind of a lost cause. No one asks me for containers anymore; we were so unsuccessful at it that people just go out and find them. They Google, they find a container, and they start using it. They don't care how optimized it is or how well it's compiled; if it works, they use it.

And I left out some slides I should have talked about: EBCB. We build all of our software in a container, and the method I use is a multi-stage build with Docker. You need build-essential to build EasyBuild, and you need it to build foss, and it drags in all the stuff you don't want, so I build EasyBuild and the toolchain in one single line, so it comes out as one layer in the container. Then I build the second container and just grab that one volume and put it in there. So my finished build container has no build tools, no dev tools, no extra libraries: if you type "make" it says "command not found", if you type "gcc" it says "command not found". It literally just has EasyBuild and one single foss toolchain. And from there you can build software in a clean room, as has been demonstrated in some other talks here.
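For reference, a heavily simplified sketch of the multi-stage build just described; the base image, version numbers, paths, and the Lmod wiring are illustrative, not the actual EBCB recipe.

```dockerfile
# Sketch of the multi-stage build described above (illustrative, not EBCB).

# Stage 1: throwaway bootstrap stage holding all the dev tools.
FROM ubuntu:22.04 AS bootstrap
RUN apt-get update && apt-get install -y build-essential python3-pip lmod
# assumed location of the apt-packaged Lmod on Ubuntu
ENV LMOD_CMD=/usr/share/lmod/lmod/libexec/lmod
# one RUN line, so EasyBuild plus the foss toolchain come out as one layer
RUN pip3 install easybuild && \
    eb foss-2022a.eb --installpath=/app --robot

# Stage 2: the finished build container. No make, no gcc, no dev packages;
# just EasyBuild, Lmod, and the single foss toolchain copied over.
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3-pip lmod && \
    pip3 install easybuild && rm -rf /var/lib/apt/lists/*
ENV LMOD_CMD=/usr/share/lmod/lmod/libexec/lmod
COPY --from=bootstrap /app /app
```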
I think this is worth documenting for the community, because reading the mailing list you see a lot of issues that would be solved if people were always using a consistent environment, always using something like a container or a clean room to build their software. So as a community, standardizing on things like that would help us build better software. And going back to how we built our cluster, it's all circular: our build container looks a lot like our cluster nodes, with everything stripped out, and software really needs to build there with EasyBuild first, before we put it out on the cluster. So that's it, thank you.

We have time for questions; can you check if you can hear the small mic as well?

Yes, one follow-up question on the container you're using. You're saying you're kicking everything out, but EasyBuild does assume it can find things like make and patch in the OS or in the build environment. Are you injecting those back in somehow? Because currently we never list make as a build dependency for anything; we just assume it's there.

I'll have to look at my recipe. I don't add any build-essential packages back in; I know that with CentOS and Red Hat, make is there. I think I do something to inject it, or it might end up there as a side effect of foss.

There was actually an issue opened on this just this morning. A question for you, Kenneth: is there a reason why make is special? All the other build tools, CMake, Meson, Ninja, are build dependencies.

It's just because make is a slow-moving target: it's always there and it never breaks, so I've never run into a build problem that was caused by using the wrong version of make. Well, maybe there is actually one example, where we have to build make because some software requires a particular version of it; don't ask me which one, I have no idea. I think there are one or two cases where we include a newer version of make as a build dependency, but usually we don't. For CMake it's very different: if you rely on the system version, you're taking a big risk, and that's why we really do include it everywhere as a build dependency. You don't assume it's there, and if it is there, you don't control the version. Now that said, we should probably still add make as a build dependency too, with our own controlled version, and maybe that would limit some of the problems we see. Same thing for patch, and if you want to go very far, same thing for tar, unzip, gzip, bzip2: all these things we're just silently assuming are there in the OS. And there's actually one change, which Simon mentioned, coming in EasyBuild 5: right now we unpack the source tarball before we load the build dependencies, and we need to flip that around to actually be able to unpack the sources with tools provided as build dependencies. That's one change we're going to make in EasyBuild 5; the current order was, let's say, a historical mistake.

And I really like the remark about documenting what you do with containers, because it's something we don't do, and I think it's actually the better way to do it. It also makes me think, and this is a more general remark, that we should do this in our EasyBuild testing environment as well; I don't think we do right now. At least one of the test setups should test in a very minimal environment. Even if the people writing easyconfigs don't build them in a minimal container themselves, and it will be difficult to get that habit out to everybody, when they make a contribution we'll at least see it in the test environment and see how it works.

You can find EBCB, which is kind of a pun, EasyBuild Container Build, out there on the Fred Hutch GitHub site. I've extracted all of our user IDs, so there's an eb_user, and at any individual site you would just map that to whatever UID you have locally that owns your software. As long as you do that through the Docker environment variables, it should work for anyone. Of course, mine is Ubuntu, and everyone will want to change that.

We actually made a step in that direction: we had a similar discussion at the last virtual EasyBuild user meeting.
A couple of weeks after that, we started putting together container recipes, building them with Singularity or Apptainer, and uploading them to the GitHub Container Registry. So now you can just do "singularity shell" or "singularity run" and it drops you into one of these containers. We've done some work on that already: we have a bunch of containers for different versions of CentOS, different versions of Fedora, different versions of Ubuntu, like a dozen or so in all. Right now it's only one per OS version, but we could have a minimal, clean-room kind of thing and then a fat one that even has stuff like Boost installed, because sometimes stuff breaks precisely because you have Boost in the OS, right? So it's not only about testing in a clean room; it's also about testing in a fat environment, and that may raise different problems that you want to get rid of.

And that's saved me too. Sometimes you need to build a package that's four years old, and the fact that I have a collection of these containers going back to probably foss/2018 or foss/2019 means I can just pull the right container out and build, even though it's really out of date, and create something that works without having to find an old OS or hunt down those tools again. When you do this for years, you end up with a nice list of containers you can go back to if you have to.

One of the things I've noticed is that containers keep popping up as an installation topic on a regular basis, just in the last three days even; last night I had a chat about how to do it. So maybe it would be quite a good idea to have a container working group; I'm pretty sure somebody cleverer than me can come up with a much better name, I'm not good at PR, I only open pull requests. We'd basically come together and share best practice, so that we have, I don't want to say a standardized way forward, because that will not work, but a most common approach to installing software on an HPC cluster: what is actually working, and documenting it.

Exactly, exactly, that makes a lot of sense. And the idea of those containers we have in the central repository, I think that could be a starting point: see what is there, how it can be leveraged in tests, and what can be changed or improved on top of that. That makes sense. And it's not about the logo; it's okay, no logo.

I don't know if you're aware, but there's actually a script in the EasyBuild repo that is able to fetch all the Python packages you will need. It uses another approach: you have to first load all the dependencies, and then it starts a virtual environment and actually installs all your packages with pip in that virtual environment. Then it knows exactly which dependencies are actually needed, and it gives you a list at the end. So maybe you should check it out.

I'll have to follow up on that. Is it in the framework, or is it an extra? It's an isolated script, from Flamefire: Alexander. The downside of that is it can be very slow, right? Because you actually have to download everything and install it, and at the end you just throw it all away, because you only wanted those version numbers. Right.
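That slow-but-accurate approach is easy to sketch. This is not the actual script from the EasyBuild repo, just an illustration of the technique: install the package into a throwaway virtual environment and read back what pip actually resolved.

```python
# Sketch: discover a package's resolved dependencies by installing it into
# a throwaway virtual environment (illustration only, not the EasyBuild
# script mentioned above).
import os
import subprocess
import tempfile
import venv

def resolved_deps(package):
    with tempfile.TemporaryDirectory() as tmp:
        venv.create(tmp, with_pip=True)
        pip = os.path.join(tmp, "bin", "pip")
        subprocess.run([pip, "install", "--quiet", package], check=True)
        frozen = subprocess.run([pip, "freeze"], capture_output=True,
                                text=True, check=True).stdout
        # each line is "name==version"; slow, since everything really gets
        # downloaded and installed, but it reflects what pip resolves
        return dict(line.split("==", 1) for line in frozen.splitlines()
                    if "==" in line)

print(resolved_deps("simplejson"))
```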
And like you said, there's no way of getting that straight out of pip. There are a hundred thousand different ways of specifying dependencies in Python: requirements files, pyproject.toml, and probably a whole bunch of others, and there's just no way you can figure it all out. I've been looking for a tool where, if you have a package tarball or a source checkout, it can tell me directly: these are the dependencies you will need. I haven't found anything, because it's just such a mess to look at all those files. So I haven't checked what libraries.io does, but that information seems to be there, and with a snap of the fingers you can get it. We should check how reliable it is, though, and how they actually figure it out; maybe they do do installations and then just...

Yeah, if they do, we can just harvest the data: somebody's already done the hard work of opening it up, installing it, and documenting the result.

And then you're still relying on how accurate the metadata is. Because there are, let's say, scientists out there, I don't want to call them bioinformaticians, but they usually are, who write software and don't even bother to specify the dependencies in any meaningful way: if the imports break, you install the missing packages. Because it installs perfectly on their laptop. Yeah, for them: they have their virtual environment where everything is, and it just works fine. It wouldn't occur to them. We would still run into trouble left and right, but if those are the exceptions, we can just deal with them manually. Okay, thank you.