Cool, yes, so I was having a think about what to do this time because we obviously didn't have an external speaker. We've got a few things on the topic backlog, which we're going to try and get booked in. But I was thinking it might be a good opportunity, just for the few of us here anyway, to get a bit more up to date with what we're all doing. Hopefully you saw the three bullet points I put in the external Slack channel just to spark the discussion: think a little bit about what we're up to, what new technologies we've discovered in the last few months, and whether there are any gaps in the ecosystem. I've got a few things, and I'd be quite interested to hear what you all have to say; I can make some notes and paste them into the document later. It might help us think about what we could do in the future in this group and what else we need from the community.

So I will start. What I'm working on at the moment: we've done a bit of reorganising inside my team, actually, so I've now got a couple of managers working for me, which is great. That's not specific to this group really, but in terms of my workload it's helpful. Internally at GR, we obviously talk quite a bit about our Armada system, which is our system for high-throughput computing on Kubernetes. We've scaled that pretty big now, so we've got some thousands of nodes running in one of our data centres with production workloads flowing through it. The big thing we're doing at the moment is basically keeping that happy and well, and then looking at what new features we need to add to make our researchers more productive. The core platform is working well, but what we don't really have a good story for yet is sensible observability for users. All of the observability tends to be done through Grafana and metrics, which is quite good for administrators but not so good for users who just want to understand what's going on with their jobs. We've got some basic CLIs, but I think we really need to invest in the user interface, an actual UI, so that people can click around and understand what's going on in this great big machine we've built for them.

In terms of new technologies we've been looking at in the last few months, something we've started using to quite good effect is Envoy. It's a CNCF project, something Istio uses internally, but we just use it as a performant, very configurable HTTP proxy. Historically for that sort of thing we've used physical appliances, which are quite hard to operate and difficult to configure. Now we can do all of this in Kubernetes using Envoy, which is really powerful and easy to integration test and to deploy changes to. So that's fantastic.

And in terms of the third bullet point, the gaps I'm seeing at the moment, within our business anyway: I still feel we're really missing a good-quality cross-cluster software-defined network, or software-defined firewall. We want the ability to say workload A can talk to workload B, or this type of thing can talk to that kind of thing but not that other kind of thing, and to do that with strongly typed metadata across clusters. That would be really powerful, and I don't feel like there's a good solution for that out there at the moment.
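To make the Envoy point a bit more concrete, here is a minimal static-configuration sketch of the kind of plain HTTP reverse proxy described above. The listener port, upstream service name, and namespace are made-up placeholders, not details from the discussion.

```yaml
# Minimal Envoy static config: listen on 8080 and proxy all HTTP traffic
# to a hypothetical in-cluster service. Sketch only; not production-tuned.
static_resources:
  listeners:
    - name: ingress
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  virtual_hosts:
                    - name: all
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: upstream }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: upstream
      type: STRICT_DNS
      load_assignment:
        cluster_name: upstream
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: my-service.my-namespace.svc.cluster.local, port_value: 80 }
```

On the cross-cluster firewall gap: within a single cluster, the "workload A can talk to workload B" rule is roughly what a stock Kubernetes NetworkPolicy expresses, as in the hypothetical sketch below (labels and namespace are invented for illustration). The missing piece described above is doing this with strongly typed metadata across clusters, which this object cannot do on its own.

```yaml
# Single-cluster sketch of "only workload-a may talk to workload-b".
# Stock NetworkPolicy has no notion of other clusters, which is the gap.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-a-to-b
  namespace: team-namespace
spec:
  podSelector:
    matchLabels:
      app: workload-b          # the workload being protected
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: workload-a   # the only workload allowed in
      ports:
        - protocol: TCP
          port: 8080
```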
There are a couple of products we've looked at, which we've then stopped using, so I'd be really interested to hear if anyone has good options for that. But that's me. I guess we'll go around, I'll make some notes, and then we'll just chat about it. So, I don't know, Jeffrey, you want to go?

Sure. And I apologize again for the construction in the background. Right now at the Oak Ridge Leadership Computing Facility, the focus of all efforts is on exascale and Frontier, and getting that into a state where we can get the early science users on there, so they can actually start using that resource to get results for their research. For the Slate service running alongside our supercomputers, that work is probably going to be pretty similar to what we already have in place, since the users are familiar with it. But until the early science users actually start getting on it, I can't really start testing things, so we'll see when that happens and how that testing goes. In the interim, being an OpenShift shop, what I've been working on lately is the Advanced Cluster Management pieces; the CNCF upstream is Open Cluster Management. The idea is to get the systems under a GitOps-type workflow that works with the native OpenShift tooling, so we can manage multiple clusters across the board. So that's a great shift. Say again? It's also very important. So Advanced Cluster Management is the OpenShift piece, but the Open Cluster Management piece, the open-source one, can actually manage just about any cluster; it's not tied to OKD, the upstream open-source distribution, at all. And in fact with OpenShift they say, and I haven't tried this yet, that you can also manage other Kubernetes distributions using the same software, since it's basically just Open Cluster Management with the Red Hat pieces rolled in, it looks like. So that's a large focus of what I'm working on at the moment.

And the reason why I'm working on that: my gap is people. The Great Resignation is real, and it's hit us. People are leaving and we're adjusting around that, and that's our biggest gap at the moment. Where are they going? What's going on? We've had one move out to the West Coast, actually, for a large corporation out there. We had one move to Red Hat. Then we had one move to a database engineering group, for Kubernetes. And the last one, where'd he go, Mirantis? Where are you, physically? Tennessee. We're stuck; we're the national lab that's in the middle of the green mountains. I made some notes; I'll type it up in the doc later and you can correct all my mistakes. That's interesting, so thank you.

Nate, should we go to you next? Does your mic work? I remember you said your mic doesn't work. Let's find out. Yeah. Oops. Okay. I'm just working on getting better container integration with Slurm. Okay. So in the last release we added the ability to call OCI-compliant runtimes, and now I'm working on making it a lot better. Nice. What's the setup there, remind me, who have you got working on this? You, obviously. Me. Just you? Yeah. You're the hero. I'm the volunteer. I'm getting paid for this; I'm not a hero. Nice.
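As a rough illustration of what that Slurm OCI integration looks like from the admin and user side: the sketch below follows the general shape of the examples in SchedMD's container documentation, with runc as the runtime. The exact oci.conf parameters and placeholder tokens vary between Slurm releases, so treat this as an assumption-laden sketch rather than a copy-paste config.

```
# oci.conf sketch (per compute node), telling slurmd how to drive runc directly.
# Placeholders: %b = bundle path, %n/%u/%j/%s/%t = node, user, job, step, task.
RunTimeQuery="runc --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="runc --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="runc --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="runc --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b"

# From the user's point of view, a job step then just points at an OCI bundle:
#   srun --container=/path/to/oci/bundle hostname
```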
How's that going? Really good, actually. I've got it working; I'm just working on getting it to the point where I'm willing to share the source code with a reviewer. Who would that be? It's just you. No, no, no: all code that goes into Slurm is reviewed by somebody who didn't write it, so this will probably end up going to the CTO because it's a big change. Hopefully it'll make life a lot easier and let people who run Podman and Docker use Slurm much more easily, or at all. Has there been any container integration before, or other ways of doing that? In the last release I added the modifications to the daemon that runs on the nodes to call the OCI runtimes directly and handle a little bit of the magic to make containers a first-class citizen, so Slurm is actually aware that you want a container and so on. Admittedly that was incomplete because I ran out of time, and this is more of the work to actually make it friendly. But Slurm has actually had containers forever; it's just that everyone used them outside of Slurm, or they used a plug-in to make it work. Okay, so it's really just native support in Slurm. Yeah, first-class citizen. Yeah.

Sorry, go on. I don't know if this will be applicable for you: any new technologies or things you've been using in the last six months? Maybe this containerization, I suppose? A lot of the OCI specs; I have been diving into those very deep, and figuring out why Docker is insane and doesn't actually follow their own specs. I feel that a lot. Or, I guess, more that they just keep rewriting them whenever they feel like it, and the specs are a moving target. Definitely. Especially since they don't actually use the OCI runtimes in a compliant way, because they use the detach mode of crun, which isn't even in the spec at all. But I got it; it just takes a little while to sit there with a debugger going, why are you doing this, Docker? Why?

Are people generally still wanting to use Docker, do you think? We're obviously seeing a lot of people moving away from it, with Kubernetes not supporting it in the future. What I've seen, although Jeff could probably give a better word on this, is that Podman is getting pretty popular with our DOE crowd. Docker has always been the one that users have asked for, as far as I'm aware. They just want the thing to work; that's one of the reasons Podman works, you can just alias it to Docker and it works. Yeah. I was under the impression that Podman was the sort of thing for humans to use, but you wouldn't use it on servers, like through Slurm, is that right? I'm fixing that. As far as I'm concerned, the user should never have to worry about, understand, or care about how it works; it should just work, and then the sysadmin can do the fun work of setting up storage. If they aren't building their images in OpenShift, then we've been directing users to use Podman as well, and they're already able to run their images; but running through Slurm, I believe they run using Singularity. Singularity has been around for quite a while now, along with its new fork, Apptainer. So I heard an interesting question about Singularity, and I originally thought it might be possible, but I don't think it is: could you use Singularity as a container runtime for Kubernetes? I'm thinking that's impossible. Actually, Sylabs is working on that as we speak.
They have partial OCI support in the current release, but in the next major release they're supposed to be fully OCI-runtime compliant. Well, that's interesting. At first I thought it was just a question from a naive person, and then I thought, well, maybe you've confirmed that it might be possible then. Yeah, Sylabs posts all their plans online; I'd have to find a link, but it's part of their plan of record. And I want to note that I have nothing to do with Sylabs; completely independent.

You just reminded me as well, completely unrelated but the same name: there's a white paper or something that's come out of Microsoft about something they've created called Singularity, which is not a container runtime but some kind of large-scale resource scheduler. Has anyone heard of that? Microsoft has been throwing large amounts of cash at their Azure HPC stuff, hiring up tons of people; for all I know, the guy who left ORNL went there. I don't know anything else. Yeah, it looked quite new. It was just a paper, it had about 20 authors on it, it was massive, quite thick, and it seemed to be an all-singing, all-dancing resource scheduler, equivalent to something like Condor or Slurm I suppose, but with the hardware as well. And it sounded like it did all sorts of magic things. Oh, you mean the one that Greg's working on? So it's not that Singularity; it's their container orchestrator. It sounded... well, I haven't got it to hand, actually, but I'll look it up. It looks like a competitor to Kubernetes. Yeah, possibly; it might be something else. I'll find it.

Last question then, Nate: any gaps in the ecosystem, or areas you're working in that you need plugged, other than the Docker stuff? No lack of them. The usual process: whatever's broken now, fixing it. For the most part the OCI standards are really helpful; most people follow them, at least somewhat. The most amusing part about the OCI standards is that they don't actually standardize what any of the runtime arguments are; they just have a general suggestion of what each thing should do. For the most part, people seem to go with the runc path, but I've seen a few who don't. All right, I'll take some notes. Cheers. Cool, thanks. Tim? Same questions to you.

Yeah, I just need to find my mute button amongst all this mess. So I've been working through the ecosystem of Kubernetes capabilities from our researchers' and research computing and data professionals' perspective. I've done FABRIC, where I had to build a Kubernetes cluster using kubeadm and a mixture of Python and cloud-init. Jetstream2 was similar. Anvil has a Kubernetes cluster, a small one, using Rancher, and that's another U.S. national system, about 1,000 nodes. Prior to that I played with the Pacific Research Platform, also known as Nautilus, soon to be known as the ERN, which used to be called the Eastern Regional Network. That's a distributed Kubernetes cluster for researchers, with GPUs. I've just been looking at it from the angle of how easy it would be for a researcher, or somebody supporting researchers, to leverage these technologies. So that's what I've been up to, and just evangelizing the use of Kubernetes as a way to abstract away from the public cloud, given the complexity and cost of the public cloud and the fact that more and more national systems are using it, from a researcher's perspective.
When you say distributed, what do you mean? How distributed? For PRP and Nautilus, they're across the country; they're doing Ceph across hundreds of miles. They have 500 GPUs, and I'm guessing no more than a few tens at each location. I should look at what it is, but they may have 20 or 30 different locations, with storage at maybe five or six locations. So it's rather interesting. Same with FABRIC. FABRIC is an experimental network, so you can get nodes across the country that are connected via 100-gig-plus networks, and you can say, I want a card all to myself. So I built a Kubernetes cluster on top of that.

In terms of new technology, I've been playing with kubeadm and cloud-init systems recently, and in my beer time at home I've built a Pi Kubernetes cluster from scratch. My constraint, just because it's fun, was to build it starting only from a Docker container, and it's containerd and IPv6 only. So I can build it and tear it down any time I want, and it only starts from a Docker container. Right now I've got it up and running, and I'm trying to make it self-hosting, so it will provision itself. You said Pi at the beginning of that, as in Raspberry Pi? Nice. So one, or multiple of those plugged in? I want to get more, but I have a controller node and two worker nodes, and then a provisioner node as well. The provisioner runs Docker, and runs the PXE boot and all that on top of it, and the plan is that once the cluster gets bootstrapped, it takes over that capability. Very good, that's pretty cool. So do you have people staring at you wondering what you're doing with these? No, it's my beer time, so I grab a beer, sit in front of the TV, listen to music, and pound away. It's the opportunity to do something slowly and do it right. I haven't seriously played with Linux for a number of years, so I'm relearning things like systemd and networkd and all those kinds of neat things from a deep perspective. Very cool. I have a Raspberry Pi; I need to dig it out. I even bought a cool little case for it, but I haven't used it for anything recently. Go on. It's pretty capable, and I think it would be a great tool for admins trying to learn how to deploy a Kubernetes cluster, because they're capable enough and fast enough to do that. Yeah. The last thing I did get it to do: I had an external button and a speaker, and I could press the button and it would tell me my train times at my local station before I left work, so I knew whether they were running late or not. That was quite good fun. That's cool. My next project is to watch the rpilocator site to see when they come up for sale, and set off an alarm and flash a light so I know when to go and buy one.

So yeah, and then the final question, around the gaps in the ecosystem or anything you're looking for. In the research computing and data area, not a lot of the applications are cloud native, so they won't run well in containers. One project I've been doing lately is trying to get Open OnDemand to run inside Kubernetes as a container; I'm having a little bit of success with that, talking to some of the folks who are working closely on that project.
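For reference on the kubeadm-plus-containerd, IPv6-only home-lab cluster described above, a minimal kubeadm configuration for that kind of setup might look roughly like the sketch below. The pod and service CIDRs, the containerd socket path, the controller address, and the Kubernetes version are placeholder assumptions, not details taken from the discussion.

```yaml
# kubeadm-config.yaml: sketch of a single-stack IPv6 cluster on containerd.
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: unix:///run/containerd/containerd.sock
localAPIEndpoint:
  advertiseAddress: "fd00:10::1"     # controller node's IPv6 address (example)
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: "v1.24.0"         # illustrative version
networking:
  podSubnet: "fd00:10:244::/56"      # IPv6-only pod network
  serviceSubnet: "fd00:10:96::/112"  # IPv6-only service network
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
```

This would be fed to something like `kubeadm init --config kubeadm-config.yaml` on the controller, with the worker Pis joining via the `kubeadm join` command it prints.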
The other one is that a lot of these systems, national systems, don't have the ability to kind of click-and-deploy as a user to create a Kubernetes cluster; you'd have to build it on your own, and I think researchers would do well to be able to do that kind of thing. And then, just from my experience of seeing how things are done, I think a simple provisioning system would be nice. You don't need some of these really heavy provisioners to build a cluster; it'd be nice to be able to simply PXE boot and bring up a node for Kubernetes. Yeah, I think that's always a challenge. We've made it as easy as possible in our environment to build clusters, but it's still not a one-click thing. I don't know about you, but because we don't give cluster-as-a-service to people, and instead you have namespaces on shared clusters, we haven't really had to completely streamline that building-a-cluster process. If Ricardo were here he might have a view, because I know they do that model of spitting out whole clusters for people, so maybe they have got an interface, a GKE-like thing, I don't know. Part of my early exploration for beer time made me realize that they're all pretty heavyweight; deploying the OpenStack provisioner seemed rather overkill for doing a Kubernetes cluster. Yeah, is that Magnum? Yeah, I don't remember what it is. If that were my day job I would have reservations about it, and for beer time it definitely wasn't worth going down that route, because I've seen what it took my team to build and maintain an OpenStack cluster. Yeah, indeed. Cool, thanks.

Alex, let's just shout out, hey. Hey, how are you? Good, good. You're out west at the moment, aren't you? Say again? You're on the West Coast at the moment? Yeah. Yeah, early. I thought I had a meeting for the first half-hour of this hour, but whoever it was totally stood me up, so I'm joining this in protest. Well, thanks for coming. I don't know if you saw my notes on the Slack channel around what we're doing. No? That's fine, I can ambush you then. We haven't got an external speaker this time, so we're just doing an open discussion around a few things: it'll be interesting for everyone to update each other on what we're working on, any new technologies you've been playing with in the last six months that other people might find interesting, and whether there are any gaps in the ecosystem at the moment that you particularly want to have filled. So, go. What are you doing? No gaps, no, it's all fine.

What I've been up to: a large part of my last week has been organizing the batch working group, the CNCF batch working group stuff, and doing a bunch of outreach to try and pull as many people into that conversation as possible. So Nathan, you were in there already, I think, so hi. Jeff and Tim: there's a conversation going on that came out of a discussion with the TAG Runtime group, whatever they're called, where we were asked to spin up a working-group conversation around batch at the CNCF level.
There's already a conversation going on at the Kubernetes level for batch, but there was a sense that we wanted a conversation at a higher level, to discuss in particular how all of the projects like Armada and Volcano and YuniKorn and MCAD (and there's a long list) and Slurm and Condor interact with Kubernetes. We felt there was a discussion to be had there, so I'm helping run that working group, trying to get hold of it and gather interested parties towards it. So that's one thing I've been doing that might be interesting to this crew.

I haven't had much luck getting hold of the folks over in Asia; a lot of the people on Volcano, and Klaus, and others are in China. Are you finding it all right contacting them? I mean, I've been okay chatting asynchronously with Klaus; I'll ask him for access to the working group Google Group and eight hours later he'll give it to me. But we haven't had many people actually join, and we had started by holding the meeting at something like seven o'clock our time so that it was ten o'clock their time, or thereabouts, but it turned out they would never actually join anyway. So I think we're probably going to have to do something like what Cloudera does with Ozone, where there's a Western-countries meeting on one hand and an APAC meeting at some other point. But what I need to do first is get a critical mass of people in one of them, and then I'll try to populate the other. There is a big crew of people from Asia we want in that conversation, but getting hold of them is hard. Yeah, from a researcher's perspective, that would be a huge capability for the adoption of Kubernetes in a wide variety of areas.

Yeah, and the batch conversation at the Kubernetes level is all about this Kueue project and what changes need to occur in the Jobs API and in the scheduler, and all the places where things need to be fixed to really be able to run batch stuff. So that might be an interesting conversation for you as well if you're interested in low-level Kubernetes stuff. But assuming things do get fixed at that level, it probably implies changes for everybody who has built things at a higher level. What are those implications? How does it change our worlds? How does it enable everyone else to actually do batch on Kubernetes? That's what the CNCF conversation wants to look at. Yeah, although some of the people involved have gone so far down the roads they're on that it's difficult for them to wind back to using something fundamental, even if it ever does exist. Yeah, we may be in that situation: there's a real sense that after your team, Jamie, gets years' worth of experience running a system, you have that as a real skill, and you won't want to change to something else because that would be an upheaval, and we're running lots of stuff on it already. So who knows what will happen when we get to that point. Yeah, everyone's got a certain amount of commitment in whatever they're doing, haven't they? They don't want somebody to just say, right, throw that away and use this other thing.
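For anyone not following that Kubernetes-level batch work: the Jobs API change most often pointed to in that conversation is the suspend field, which lets an external queueing controller (Kueue, or a meta-scheduler sitting above Kubernetes) decide when a Job is actually released to the scheduler. A minimal sketch, with an illustrative queue-name label of the kind a Kueue-style controller watches for; names and values are placeholders:

```yaml
# A batch Job created in a suspended state; a queueing controller flips
# spec.suspend to false once resources are available. Illustrative only.
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-batch-job
  labels:
    kueue.x-k8s.io/queue-name: team-queue   # hypothetical queue assignment
spec:
  suspend: true          # admitted to the API server but not yet run
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo hello from a queued batch job"]
```

Once primitives like this are solid in core Kubernetes, a higher-level system can create Jobs up front and leave the decision of when they run to whatever queueing layer sits above.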
I mean, ideally there's some sort of progression where new features become available in lower-level Kubernetes, and Armada can look at them and go, oh, actually we can rip out a chunk of our system because now we can rely on the Jobs API directly, and maybe you just end up hollowing your thing out until there's nothing left. Yeah, exactly; for us maybe it just becomes a meta-scheduler on top of the Kubernetes Jobs API, and it's as simple as that. Because until KubeFed is a thing and you can federate multiple Kubernetes clusters natively in Kubernetes, you'd still need something like what Armada provides, even if below that the scheduling is actually done by Kubernetes, as it should be. Yeah, indeed. So anyway, that's exactly the kind of question we hope to entertain in the CNCF batch working group.

So that's one large thing I've been working on. The teams I have in the OSPO are working on all sorts of projects. Armada is an ever-increasingly big project of ours, but we also do a bunch of work directly on ML and data science tools: Spark, Horovod, Ray, Arrow, all of that kind of universe, LightGBM, you can go down the list. Are there any particular open source tools that all of your research teams use currently? I'm just curious whether any of them are ones we're contributing to. Well, you know, we are. Jeff had to drop, so I guess the question might just be for Tim. Yeah, and I more take the perspective of research computing and data professionals, so they support all the things, you know. Yeah, well, presumably everything your teams are using is something we're either contributing to or might contribute to, because it's all open source and it's all stuff that we use too.

So there's that kind of stuff going on. I suppose we've also been doing one project in Kubernetes security, looking at user namespaces and giving researchers root access in a secure way, and the person I have working on that has some pretty promising results; he can confirm that he can do all sorts of things safely, with some caveats. Also part of that project is looking at various eBPF things, around Cilium and maybe Tetragon. So that's also work we're doing.

What else have I played with recently? One of the things I've been wrestling with is data science and machine learning: the ecosystem is really confusing. Either you have to cobble it all together yourself and really understand how each part works and how the parts interact with each other, or you go with these one-stop-shop, all-in-one solutions where you just use their platform and you get everything from notebooks through to production model serving, and there doesn't seem to be much in between. So I've been trying to go through the millions of data science platform offerings currently out there, trying to figure out which ones offer which parts and which things we actually want at G-Research, and just trying to get viewpoints on that. I was looking at Predibase, for example, which has some nice pieces to it.
That's from the guy who did Horovod and Ludwig AI, and it builds on top of Ludwig, and there are some cool things about it. It might be good for our NLP people, because it packages up Hugging Face models, which currently our researchers have to email to themselves to get inside G-Research; and then the email fails because the models are too big, and then they have to go through a whole process. It's a terrible, terrible thing. So I was looking for a solution for them there, but it's a platform that includes a whole bunch of other things we already do well inside G-Research, so we wouldn't need those parts. For us, we want a sort of composable data science ecosystem, which doesn't really exist unless you sit down and wire everything together yourself, which is what G-Research has done. I've just been casting about the ecosystem trying to figure out if there's a better way than the way we've done it, if any of that makes sense. So, is everyone else cobbling things together? That's the question, I suppose.

Yeah, probably. From speaking to others on these calls previously, it sounds like the answer is yes. I don't think anybody has a choice. No, that's what you have to do: if you want it to work the way you want, you have to cobble it together. Yeah, unless you're a small data science shop, where maybe you could choose one of these all-in-one platforms, which will become outdated pretty quickly, and then you're stuck as a small data science company thinking, I don't know whether I want to do this because it might be ugly for me later, so I'm going to cobble it together myself anyway. You can only really use a one-stop shop if you're starting from scratch and build around it; if you have any kind of pre-existing infrastructure or software, it's difficult to just go, oh, I'll use this one-stop shop, because it won't fit. So nobody's really solved this ecosystem. It feels like this particular solar system of products is still at the stage of large bodies of gas forming and bits of rock colliding into each other, and eventually there'll be planets you can visit, but nothing's really come of it yet. Yeah, we are in the sort of primordial soup. Yeah, that's a good one, primordial soup data science. Just swirling around waiting to evolve. So anyway, I suppose that answers questions two and three: what I've been working on is looking around at the hole in my heart of data science. It's almost missing its maturity. Yeah, exactly, and I just want a single answer, Jamie. I just want somebody to tell me how it works and then I'll do that. Yeah, I think you might be waiting a while, but it's a virtue to ask for it, I think.

Yeah, Dave, you're on there too. Did I miss your update? Do we have a Dave? Oh, we do have Dave. Yeah, I can tell you what Dave's working on: he's got the Armada team valiantly working away on Armada. Oh yeah, we've also got the Ozone team doing snapshotting for Ozone, and some other pieces in Ozone land too. And then a bunch of things in F# and C# developer-productivity land, trying to improve build times and NuGet restores and all sorts of evil, horrible things in the Microsoft ecosystem. So yeah, good stuff over there. Sounds like fun.
Yeah, cool. All right, well, thank you. I think we've been through everyone. I've been typing up some notes very amateurishly in Notepad, so I'll try to take all the typos out and pop it into the Google Doc, and then if you want to go back and fix anything I got wrong, that'd be cool. Other than that, looking at the calendar, the next session will be in two weeks, so the third of August. I'll chat with Ricardo; I think he's back by then. I'll be around as well, so we'll see what we can get. I'd really like to have that Cilium and eBPF session; we do have someone lined up to talk, so we'll try to get that sorted, and thank you for that. That's great; if that happens I'll invite David Ledbetter, if he can join at this time. It might be super late in Australia, so it will be pretty antisocial. Yeah, it might be one in the morning or something. I think it'll be worse than that. Where is he, Melbourne? Actually, no, you're right, one in the morning. He has a little kid, so maybe he'll be up, who knows. Anyway, he's the person who's working on the user namespacing and eBPF stuff, so that'd be really cool. Okay, I'll chat with Ricardo and I'll let you know. Brilliant. All right, unless there's anything from Nate or Tim? Nope. Have a good day, everyone. Cool. Bye.