I'm from the University of Birmingham, where I'm involved in providing the HPC facilities. At Birmingham we like a good acronym, so the HPC facilities and related services are called the Birmingham Environment for Academic Research: BEAR. That lets us use the word "bear" a lot, and you'll see pictures of bears throughout our publicity and other material. In that photo are a couple of the bears that one of my colleagues has sitting around in their office. On the other side is our water-cooled data centre, which opened nearly three years ago now. That photo is from quite early on, before everything was in place: you can see some of the compute nodes, some of the storage, and a small part of the IBM POWER9 AI cluster we've got.

Our group is also responsible for what we call the Research Data Store, which provides storage across campus, both for people using the HPC facilities and for the many researchers who don't use HPC at all. We also have a high-speed networking setup, for example for people in medical research doing gene sequencing, and for physics-type equipment that produces lots of data and pushes it back into the store. We also run an internal GitLab for researchers who want to keep their Git repositories local.

Our HPC machine, continuing the theme, is called BlueBEAR. It's available to any researcher at the university: lots of bioinformaticians, a good chunk of engineers, computational chemists, and physicists. A lot of what runs on our system is high-throughput computing rather than high-performance computing; we do have a chunk of genuine HPC-style use, but we see lots and lots of jobs, and lots and lots of package requests, from the bioinformaticians. Part of the system is available for free to all researchers, but research groups can also come to us and buy additional resources: groups that use a lot of computational resource might want nodes dedicated to them, or they might have extra requirements, such as dedicated access to GPUs or large-memory systems, where the shared area just isn't sufficient for their needs.

This is the third generation of BlueBEAR, which started in 2012 as a follow-on from version two. We do rolling updates on both the hardware and the software side, so we say it's the same BlueBEAR even though no original components from the 2012 iteration remain; at each stage, some of the components we had overlapped with some of the components from before. We've now been through two different storage systems that reached end of life and were decommissioned.

What do we have available? Because of that approach we don't have separate clusters: we have one login system, and from it you can run jobs on a range of different CPU types, some with GPUs attached. At the moment we're running mainly CentOS 7, with Red Hat Enterprise Linux on the IBM POWER9 nodes. The Sandy Bridge nodes, crossed out on the slide, were the ones we decommissioned last year; they'd reached end of life, and we decommissioned the last of them around Easter.
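To make that concrete, here is a minimal sketch of how a user might target one particular CPU generation from the single login system, assuming a Slurm-style scheduler where node features are named after the CPU types; the feature name, module name, and command are illustrative, not BlueBEAR's actual configuration.

```bash
#!/bin/bash
# Hypothetical job script: pin a job to one CPU generation on a
# heterogeneous cluster using a Slurm feature constraint.
#SBATCH --job-name=md-demo
#SBATCH --ntasks=8
#SBATCH --time=00:30:00
#SBATCH --constraint=cascadelake   # illustrative feature name

module load GROMACS                # illustrative module name
srun gmx_mpi mdrun -s input.tpr
```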
At the moment our Intel-attached GPUs all sit on either Haswell or Broadwell CPUs. We have two generations of Cascade Lake, where the difference is in the interconnect: ConnectX-5 versus ConnectX-6, I think; networking is one of those areas where somebody else takes care of it and tells me what we've got. We also provide a cloud infrastructure, which effectively gives researchers the equivalent of a compute node as a virtual machine instead; these run a combination of CentOS 7 and Ubuntu 16, and tend to have local storage and no InfiniBand connection.

What does our software stack look like, going by what we installed last year? We did over a thousand installs last year, and over a thousand the year before. The image was generated by one of my colleagues, so I'm quickly reading his notes: the word cloud shows all the applications we built, with font size proportional to the number of modules with that application name. The OSU Micro-Benchmarks and HPL are quite big because I spent a decent chunk of last year debugging problems with both MPI and our compute nodes, and installed them on every single toolchain we had. You can also see some of our bigger use cases there, things like R and TensorFlow.

I've also split out where our installations sit in terms of the EasyBuild toolchains. You can see we're mainly using the foss and fosscuda families; when I say foss and fosscuda, that includes the sub-toolchains, so foss includes the GCC and gompi toolchains, and similarly for the others. I've also broken it down by toolchain generation. Well over half of our installations were against 2019b, and when I say 2019b I also mean the underlying GCC version, which I think is 8.3. We ended up doing more installations against 2020a than I was expecting: because of CUDA and some other issues we were planning to skip 2020a and jump straight to 2020b, but around September somebody asked us for R version 4, which in EasyBuild sat in the 2020a toolchain, and I think that accounts for most of those 254 installations by itself, particularly as the other thing asked for alongside it was a TensorFlow version, and there's a TensorFlow version in that toolchain as well.

As for EasyBuild at the University of Birmingham: we've been using it since 2016, about a year and a half before I joined the team; in fact, none of the people currently building software was there when we started with EasyBuild, so we've inherited it and changed it since then. All of our installations are by request: we don't pre-emptively install anything any more, we just wait for our users to ask, and we get quite enough installation requests that way. In conjunction with the user we decide which toolchain to install against, which lets us manage where we are and aren't installing, and for our own sanity we keep only two toolchain generations active for installation at any one time; currently we'd only be installing into the 2020a and 2019b toolchains.
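As a rough sketch of what one of those request-driven installs can look like with EasyBuild, assuming the request is agreed against the 2019b generation; the easyconfig filename below is illustrative rather than a record of an actual Birmingham install:

```bash
# Find which easyconfigs exist for the requested application,
# filtered to one of the two active toolchain generations.
eb --search TensorFlow | grep 2019b

# Build the agreed version, letting EasyBuild resolve and install
# any missing dependencies along the way.
eb TensorFlow-2.1.0-foss-2019b-Python-3.7.4.eb --robot
```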
I mentioned that we have different CPU types. If you're on an Intel node and you `module load` something, it loads a version that's been optimised for that CPU type. That's transparent to the users: they don't know or see what's going on, it just happens, provided we've set the module path up correctly. We have a few users who build their own software outside of EasyBuild, using modules from EasyBuild, and where they're looking at running on the different CPU types we talk them through how to set that up correctly, so that when they build their own software across the different CPU types they end up with an optimised version for each CPU. We find that quite successful, and those users understand it.

Let me talk about a few of the things we looked at doing last year. One of the major ones was an OS upgrade, taking CentOS 7 to 8, or Red Hat 7 to 8. We did all, or nearly all, the installation work for it: we decided we'd use toolchain 2018b for the x86 systems and 2019a and newer for the POWER9 systems. We got all this work done, and then about a week before we were going to do it, CentOS made that announcement I think you'll all be aware of. We haven't quite decided what we're going to do yet. We did go ahead with the upgrade of the GPU nodes and the POWER9 systems, because most of our GPUs are on the POWER9 machines, which run Red Hat, so we have support and there didn't seem any reason to avoid that upgrade. Our CPU-only nodes are currently still on CentOS 7; I've put some of the options we could go with on the slide, and I think there are plenty of others. We'll make a decision at some point; that's one of our jobs for this year.

Next: MPI. This is one of those areas that caused me a bit of annoyance last year, and I spent a while trying to debug Intel MPI issues; I've referenced the bug report numbers from the easyconfigs repository on the slide. I think we definitely experienced the first one, and we experienced the second, or at least something very similar, part way towards the end of the year. At the beginning of this year I decided: only 2% of our installs last year used Intel MPI, I've had enough of this, we're swapping Intel MPI out for Open MPI in those toolchains. Basically, we already know how Open MPI works on our cluster, so we'll just use it everywhere, and that seems to be going quite well. One thing to note is that most of the software we install with the Intel compilers and MKL consists of standalone pieces of software without a lot of dependencies, where the users just use that one application, things like Quantum ESPRESSO or NWChem. They benefit from the Intel compilers and MKL, from what we've seen, but moving out to Open MPI is fine.
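One way such a swap can be expressed in EasyBuild is via the iomkl toolchain, which pairs the Intel compilers and MKL with Open MPI. This is a hedged sketch, assuming an iomkl toolchain definition exists for the generation in question; the easyconfig filename and toolchain version are illustrative, not necessarily what we did.

```bash
# Rebuild an easyconfig written for the intel toolchain (Intel
# compilers + Intel MPI + MKL) against iomkl instead, keeping the
# Intel compilers and MKL but replacing Intel MPI with Open MPI.
eb QuantumESPRESSO-6.5-intel-2019b.eb --try-toolchain=iomkl,2019b --robot
```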
Last year, with a lot of people switching to working from home for obvious reasons, we deployed an Open OnDemand instance, an HPC portal onto the infrastructure we've got, giving access to the compute nodes. That comes from the Ohio Supercomputer Center. Through it we provide a bunch of GUI apps, ANSYS, Abaqus, MATLAB, ParaView, and I think some more I've forgotten, as well as things like JupyterLab, RStudio Server, and VS Code. It also provides shell access, so you can get a bash shell on a compute node through a web browser, plus a file browser. We're finding it very useful; there are alternatives, and I know some of them have been mentioned already this week, but we're very much enjoying it.

The last chunk: I mentioned we're a tier 3 site. Well, our big news of last year was the announcement of Baskerville: we're becoming a national tier 2 accelerated-compute facility, in conjunction with Diamond Light Source, the Rosalind Franklin Institute, The Alan Turing Institute, and EPSRC. We're providing a tier 2 system with GPU compute; I'll explain what we're putting in in a minute. Our technology partners are NVIDIA and Lenovo, with OCF as our deployment partner, and it's again going into our data centre. We're putting in 46 Lenovo systems: one half of each system is two Intel Xeon processors, 512 GB of RAM, and SSD; the other half is four NVIDIA A100 GPUs. From memory, each node is rated at about 53.4 teraflops, that's the GPU part, and the CPUs add a bit more on top, but I forgot to copy the CPU figure onto the slide. There are also 5 petabytes of spinning-disk storage going in, plus about half a petabyte of flash. One of the things we enabled last year on BlueBEAR itself was a Globus endpoint, which we're finding very good for data transfers. That was a precursor for the tier 2 system: test it on our own system first, to see whether it would help us provide for our partners. And at that point, yes: thank you for listening, and are there any questions?

[Host] Okay, thank you very much, Simon. Are there any questions for Simon, either in Slack or in Zoom? Yes, we have Jörg. Go ahead, Jörg.

[Jörg] Yes, me again, throwing in that question. We all know about the CentOS problem, as you put it. You mentioned SUSE and a few other distros; what I haven't seen is something like Ubuntu or Debian. Do people actually consider that?

[Simon] We would probably go down the, well, some sort of Red Hat-based distribution, or SUSE. There are questions about whether some of the applications we have to support are actually supported on Ubuntu; I'm thinking of Abaqus and ANSYS, which are a decent chunk of our users, and I think at least one of those either isn't supported or is only supported for limited versions, I can't remember exactly. I know we've had some problems setting it up on there. We might, but I'd think our sysadmins are more knowledgeable about Red Hat and prefer it, so we're likely to go with something in that line. I don't know, though; we're still discussing. We need to test it, I think, is partly the answer.

[Jörg] Yeah, I think that's really what we need to do: test it on a Debian-based system like Ubuntu or Debian to really see whether we can switch over, or whether we're really stuck in the Red Hat, i.e. RPM-based, world. I don't have an answer to that either.
[EasyBuild developer] There definitely are EasyBuild sites who are using Ubuntu; I think Åke can maybe say something about that.

[Åke] ... pain in the ass.

[EasyBuild developer] We only heard "pain in the ass"; we didn't hear the part before it.

[Åke] Abaqus. Apart from that, I haven't seen any problems with almost anything.

[Simon] Yes, unfortunately that's a big pain for us. It's difficult, is the answer. One of the things we might discuss is whether it's more work, or harder work, to split things out in some way, whether we might have bits on different OSes, because we're already to some extent supporting different things; or deploying through some sort of container or something sitting on top. There are options, but we haven't explored them yet.

[EasyBuild developer] Just a small comment from the EasyBuild side: if you're looking at SUSE as well, that's definitely a far less popular OS in the EasyBuild community than anything Red Hat-based or anything Debian-based. We are aware of some issues there, which I think are small and, let's say, rather obscure problems that are probably easy to fix, but you may hit more of them, especially if you're installing lots of applications. If you're one of the only sites really using SLES or SUSE to a big extent, then you're already sort of in a corner. And that maybe brings me to the POWER aspect: you didn't say a lot about how much of a pain it is to support the POWER system, and you've definitely contributed back a lot of that stuff to EasyBuild.

[Simon] We've changed focus a little bit with our POWER system. Having tried lots of applications, we now have a very good understanding of what is likely to work, which users we're targeting to use it, and what software we're going to run on it. We've tended to focus on the traditional AI stack, so TensorFlow and PyTorch; GROMACS is heavily used, and a couple of others. I personally wouldn't head towards IBM if I wanted to run everything; I'd want to focus on a certain subset of applications. I talked about us having over a thousand installs last year; probably only a couple of hundred of those were on POWER, and a lot of those were supporting libraries. One of the things about going to Red Hat 8 there: for CUDA versions 10 and below there's a glibc issue. There's a bug report on EasyBuild about it, which includes some of the links to the patches that have then been applied to some pieces of software. But seemingly, any time we change something we hit another problem.

[EasyBuild developer] So on the POWER9 cluster, you say you run some of the bigger applications, and you try to offload those onto the POWER system and free up the other system for the bioinformatics and god knows what other stuff?

[Simon] Yes. That's where most of our GPUs are, so if somebody comes to us saying they want to run TensorFlow, PyTorch, or GROMACS, even if they're asking a completely different question, we'll say: do you know about our POWER9 nodes? Go and use those instead, or as well, because that's what we're finding works very well on them.

[Host] Okay, sounds good. I don't see any other questions and we're getting close to the next talk, so let's wrap up here. Thank you very much, Simon.