Then press record... it's recording. All right, so any of us can do it, right? Okay, yeah, any of us can do it. All right, so we can start, I guess.

So yeah, welcome everyone. Today's session is the first one with this new platform, so we've got a few people here; maybe we lost some on the way, but hopefully not many. The topic for today is this HPC and HTC end user landscape review. We sent a form two weeks ago with a couple of questions, and this should hopefully trigger some of the discussion.

I see some people don't have microphones available, so... hi Alex, can you hear me? I'm a little quiet, can you hear me? Well, Jamie, I can hear you fine. Yep, yes, I'm here. I think we're fine; I don't know if other people are just muted. Okay, it's just me too. Okay, can you guys hear me? Everyone? Yeah, okay. Sorry, we have a few teething problems on the new platform; probably none of us have used it in anger before. It keeps telling me my internet connection is unstable. Just keep going, it does that to me as well; I don't think it's serious.

Right, so the topic today was this HPC and HTC end user landscape, with Jamie. We've got some questions; it's more to trigger discussion today. We can go through the replies, stop on every topic and discuss a bit, and hopefully one thing that would be nice is to come up with some next steps: what is needed in this area for the cloud native tooling, and what would be really useful for people. So if there's nothing else to start with, we can just begin by going through the survey.

Yeah, I guess make sure people put their names on the agenda as well, as normal. Actually, we don't need that anymore, because we'll get the reports automatically from the platform. Really? I thought we said we were still going to do the Google Doc. That's fine. Okay, so no names. I think we can keep the agenda for the notes, but I don't think we need to collect the attendees anymore. Fine, so we just need to ask if there are any new joiners. I mean, there aren't this time. That's true, yeah, I don't think there's anyone joining for the first time. No, I reckon not. Ricardo, will there be a way for the rest of us to see who did attend each of these meetings, if we want to follow up with anyone? That's a good point, I don't know — is the attendees list in the sessions public to everyone? No, it's not public. Okay, so do we want to keep the agenda then, with the names? We can also do that; I think it might still be helpful. All right, so that would be here. So maybe we just go through it — yeah, everyone, if you can add your names there, and we can start going through the questionnaire. I think we can stop at each one, and if anyone has anything to highlight we can discuss it in detail.

So the first question was what kind of solutions people are using for high performance computing or high throughput computing and other batch-like workloads. In total we actually got eight responses, which is not too bad I would say, because it was kind of a long questionnaire. The top two were Slurm and pure Kubernetes, which — I was a bit surprised, actually. Then we had two for HTCondor, one for Armada, none for Volcano — I had added it just because it's a native Kubernetes scheduler, or cloud native scheduler. There was one for Kubeflow, which is interesting, and then one other system. So I don't know if anyone has any particular comments on this.

Well, one thing that strikes me is that people are clearly using more than one thing, so it's not really, you know, five distinct people per answer; the responses also aren't unique. So that's kind of interesting. Maybe a question here, for anyone who answered with more than one: is that a transition to something new, or a plan to maintain both in parallel? We're an example of doing both for a transition, and that would be Condor to Armada.

Yeah, and also, actually — I was wondering too: I would be surprised if there's really nobody out there using Volcano and Armada and other things like that, right? So maybe it's also kind of a call that we need to put out feelers into those user groups and try to get them more involved with this SIG, because people who are actually using those would certainly have overlap with what we're doing. I wonder if there are ways we could reach out to those groups, or even Kubeflow too. I don't know too much about Armada and Volcano, but I feel like those are non-zero user communities right now. Torque — I'm not really interested in reaching out to that community. I'm just kidding.

But yeah, I think that's a good point actually. And for Volcano, they have quite a good structure of weekly meetings. I think they're mostly targeting, or at least mostly have, end users in Asia for now, and those are weekly meetings, and then every two weeks they have a meeting that is Europe and North America friendly. One other note on the Volcano thing: depending on the group, they will not be able to access Google Docs or submit to Google Forms, so that might segment off another community there. But they can use this platform, right? Because I've had calls with them and they can only use Zoom, but I think they will also be able to use this, maybe. Yes — and if you're creating a form for a global audience, SurveyMonkey will work in China. Okay, that's good. But in terms of reaching out, maybe that's a good point we can take as an action: just to advertise this group in these communities and see if they are interested in joining. It's not like we cover this kind of topic every time either; there are a lot of things that will probably not be so interesting to them. We need to find more people doing ML and data science and stuff like that, I think.

All right. The other thing: Slurm has a pretty strong presence here, so maybe it would be nice to know more about how people are using Slurm — whether they're deploying it on Kubernetes or managing it with Kubernetes, and what their plans are there — if anyone wants to expand on that. Yeah, do we have anyone on the call who is currently using Slurm? I can speak for my old job. Yeah, they were using both, a mix of Slurm and Kubernetes. There wasn't any real transition plan to go from one to the other, but that was also the University of Michigan, and a good chunk of their users were very familiar with Slurm; they didn't really want to interrupt their workflow. The biggest thing was all the newer researchers and people coming on board who were interested in using things like Kubeflow and Jupyter notebooks, so that was kind of the separation there.

Yeah, I mean, obviously we're heavily using Slurm and LSF — I put them all kind of together, right: Slurm, LSF; I guess Torque is kind of on the outs these days, but PBS, all those. A lot of times, certainly for us, the vendor that we buy the supercomputer from, at the scale that they're building — it's going to come with LSF; we bought an IBM machine, it came with LSF. So those sorts of things I don't think are going away for that traditional HPC community, kind of like what Bob's talking about. And so for us the name of the game is how do we bridge that gap as much as possible: how can people use the Slurm command, sbatch, from inside of a container, that sort of thing. That's the direction we've gone. But I don't see those getting supplanted; they've got like 30 years of industry research poured into those batch schedulers, they're not going to go away overnight.

Yeah, that makes sense. So, anything else? For what it's worth, I do see more people looking to transition, or at least support running both, largely because it's honestly a lot easier, especially these days, to get going in Kubernetes and potentially burst out to someplace. And at my old job a lot of the researchers were more interested in using things like Kubeflow, and it just made it a lot easier to get going there. Yeah, I agree, I think it's a both-and; I completely agree.

One question there would be: is there anything prohibiting people from moving towards vanilla Kubernetes as the scheduler of choice, or is it mostly just familiarity with the old stuff — let's continue using what is not broken, I guess? Yeah, I think the answer is that there are things missing, at least for us. The things that are missing are priority queueing — the notion of a queue on top of just the workloads on Kubernetes — and then the notion of fair share to optimize cluster usage; that's another one that is stopping us. Priorities and preemptions already exist at the level of pods, so that's something. There was a very nice talk at the last KubeCon from — I forget his name — someone from Apple. Yeah, you shared that, I think. Exactly — they are trying to implement those concepts in the scheduler, and this is also similar to what Volcano is doing.

So then a follow-up would be: Kubernetes allows for custom schedulers — in fact I think Volcano is an example of that — so are people building their own custom schedulers? I haven't done it myself, but they're supposedly fairly straightforward to build, so is it something people consider, where they say we'll just build our own scheduler? Is that an option? I've seen people building their own schedulers. It was a problem for a bit, because there weren't a lot of hooks into the various points where scheduling gets considered; however, that has largely changed, I think as of the 1.21 release. So as of this past year there are significantly more hooks added to make writing or extending schedulers easier. I see, okay. It might be more than just the scheduler as well: if you want queues, you actually need to handle the persistency of those queues, and if you want to have multiple queues and handle priority, it's quite a bit of logic — something all these systems are very good at, because they've been developed over a long time. So it's not an obvious transition.
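To make the pod-level point concrete, here is a minimal sketch using the official Kubernetes Python client — the class name `batch-high` and its value are made up for the example. This gives you preemption, but not the queueing or fair share discussed above:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside a pod

# A PriorityClass is cluster-scoped; pods referencing it can preempt
# lower-priority pods when the cluster is full. Name and value are illustrative.
pc = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="batch-high"),
    value=1000,
    global_default=False,
    description="Higher-priority batch work",
)
client.SchedulingV1Api().create_priority_class(pc)

# Any pod (or the pod template inside a Job) can then opt in:
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="example-batch-pod"),
    spec=client.V1PodSpec(
        priority_class_name="batch-high",
        restart_policy="Never",
        containers=[client.V1Container(
            name="work",
            image="busybox",
            command=["sh", "-c", "echo running && sleep 30"],
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Queueing and fair share still have to be layered on top of this, which is essentially what Volcano and the Armada-style projects mentioned here are doing.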
I see — okay. That's it, thank you. We looked at that briefly, and the issue for us was that we wanted to be able to schedule across multiple clusters; then you look at KubeFed, and it didn't really work for multiple clusters, which is why we ended up writing Armada. It was easy enough to do the custom scheduler part of it, but not the multi-cluster part. I think that's a good point, actually. The experiments we've been doing with managing things like HTCondor with Kubernetes — even if we are still submitting with Condor, we could have multiple clusters managing the Condor daemons and then have central schedulers somewhere else, so we benefit from the Kubernetes operational simplification but still use Condor as a scheduler. I always thought your reason for not moving away from Condor was less about the lack of features in Kubernetes and more just the inertia of changing user behaviour — the fact that they all know how to use Condor, and there are thousands of them. Yeah, but fair share — fair share is something that has to be there, basically. Yeah, but even if you did, let's say you used Armada, or just Kubernetes — even so, you'd still have to convince a large user base to start doing something differently. Yeah, I think that's true, for a lot of reasons. My gut feeling is that's actually one of the main reasons lots of these traditional places are using the traditional software: it's how things have always been done, and it's kind of hard to force people to change.

Go ahead. I was just going to say, on that point, I've been joining a bunch of HPC meetups and it's amazing how focused that group of people is on hardware — they're so into the latest hardware and how much we're going to be able to pump over this PCIe pipe, and the DRAM, and this and that — and they're just fascinated with throwing more hardware at the problem, as opposed to what we're talking about, which is how to use that hardware more efficiently. I've gone to a couple of meetings now and the entire conversation is absolutely at that lower level. So it's just interesting where people's heads are at.

One last question, sorry. Do you guys anticipate that the existing Kubernetes ecosystem, or the default scheduler that comes with Kubernetes, will grow options for a bunch of these things going forward, or is it always going to be: the default scheduler can only do this, and if you want something more specialized you either build your own scheduler or use another open source scheduler? Where do you see that going? Me personally, I would expect this goes not necessarily into upstream Kubernetes, but into some sort of CRDs, well supported, to make these almost first-class resources implemented as CRDs. Got it. And it's not only HPC and HTC, it's also other kinds of workloads. Yeah.

All right, I think we move to the next one then. This one should be easier, I think. So this was pretty overwhelmingly on-premises; from all the responses, I think we got one that mentioned hybrid. So the main question is: is this staying like this, or are people looking at hybrid deployments as well, and what are the blockers there?

I mean, for us it's staying like this. We're certainly evaluating and exploring hybrid; for us the bigger issues were things like the United States government data protection rules around FedRAMP authorizations and things like that — being a government entity, that's the biggest barrier. But we are starting to explore the hybrid thing, though not really for HPC; it would be more for other workloads. We certainly don't have any clear workloads where it's like, oh, this would be perfect; it's really just exploring what's even possible, so it's very early stages. I would imagine a lot of this group have a relatively established infrastructure, already have on-prem, so we start there; probably all with various degrees of security concerns as well, and you sort of know what you have and how to trust it. And probably large data sets too, which is a factor that might keep you on-prem, because transferring large amounts of data around the cloud could be prohibitively expensive, and you also need the equivalent amount of compute to make good use of it. One of the reasons the University of Michigan was looking at it was because a lot of the grants were coming with cloud credits, so it would be a lot easier to give people one interface they're familiar with and just abstract it all away — the cloud credits could go to GCP, to Amazon, to wherever, but people still get an interface they're familiar with and know how to work with. It'll be interesting to see what a new org that would fit into this group would do: if there was a new company or institution invented tomorrow, where would they go? You imagine they would start in the cloud, because it's easy to do, but I don't know. I suppose the counter-argument, Jamie, to the big data sets we have on-prem is that for data sets that are cloud-based to begin with — data providers who publish into the cloud — if you're in the cloud you don't have to move them as far as you would to your on-prem location. So we might even be in that state for some things, if we wanted to; it depends where the consumers or producers of the data are. Yeah.

All right, I think the answer for hybrid is ours: we are already deploying some workloads in this hybrid mode, and the ones we do are the embarrassingly parallel type of workload. But we also have a couple of cases where we actually established network links between our on-premises data centre and some regions in different clouds. That allows us to kind of expand the data centre, and there we can do more — any kind of co-located workload can go anywhere and still have the benefit of the on-premises services. It's much easier to do what Bob was describing, where you depend only on the Kubernetes API and just use it for workloads that can be loosely coupled and don't have interdependencies that would require low latency or some special network connectivity. And the motivation is really bursting, especially for accelerators, which we don't have many of on-premises right now.

Do you mind sharing how much you go into public cloud — when this happens, how many nodes do you spin up in public cloud for these kinds of workloads? Well, it depends on the use case. For the batch systems we can really tune the amount of resources that are there; for things like the ML workloads, using things like Kubeflow for example, we actually autoscale the clusters, so they will only scale up when workloads go there, and we try to define policies on what can go there.

And do you have one Kubernetes cluster spanning — kind of a hybrid cluster spanning your on-prem and the cloud? No, these are separate clusters. Okay, because I was thinking that would be a networking miracle. Well, it is possible, because if you choose a region within a cloud you can set up these extensions of the network, and we do that: we basically extend our on-premises data centre to that specific region. But this is not super flexible, so ideally we would just depend on the Kubernetes API, and set up that sort of VPN only if needed. Got it. There's a very cool project called tensile-kube — it's a very hacky thing, but it's specifically an implementation of the Virtual Kubelet where the node is backed by another Kubernetes API. Oh yeah — in fact, I've heard of one more thing called Nodeless, where essentially it's Virtual Kubelet again, but behind the scenes it goes to wherever you wanted to run and runs whatever you want there. Cool, very cool.

All right, any other point here? Okay, so then we move to a question which was: if not already, do you plan to bring these workloads to Kubernetes? A couple of the answers were just no, but the ones with more detail said things like: we have workloads in Kubernetes that support HPC; some use Kubernetes to launch jobs in support of users, which I guess makes sense; for some workloads, with scalability being the reason and trying to burst — this is in line with what Bob was describing earlier, I guess; and mostly already on Kubernetes. So, planning, interested, or not — I guess the next question is what's stopping us. We already covered it a bit; I don't know if anyone wants to add something — for those interested, what is the stopper right now? Anyone want to dig in? Otherwise we move to the next one. Honestly, a lot of it is just what people are used to, sort of going back to what we were talking about earlier. I guess we already covered most of this.

All right — Jamie, do you want to pick up this one? Yeah, sure. So this one is around asking whether people access Kubernetes directly or via an indirection layer. It's actually quite interesting, I think: no responses for just directly, and the majority said both. None for not using Kubernetes at all, presumably. I suppose that's probably what I expected to see, in a way. We can't really tell, within the "both", how much is one or the other. In our case, anyway, we've got different groups of users: some people are a bit more power users and do access Kubernetes directly — and obviously the administrators do — but most of our researchers go through tools which we built for them to help them do what they need to do, rather than using Kubernetes directly. I don't know what anyone else's thoughts are on this.

One question here: when someone says they use it indirectly, does that also mean CRDs and stuff — is that also kind of indirect use of Kubernetes, or no? I guess I was thinking more about whether you're using kubectl or not: a kubectl user is direct, anything outside of kubectl is indirect, essentially. Right — yeah, so some kind of Python wrapper framework, for example. Got it.
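As a rough sketch of what that kind of indirection layer often looks like in practice — the namespace, image and registry below are placeholders — a thin Python helper that submits a Kubernetes Job on the user's behalf:

```python
from kubernetes import client, config

def submit_job(name, image, args, namespace="research"):
    """Submit a batch Job for a user who never touches kubectl directly."""
    config.load_kube_config()  # or load_incluster_config() if run from a portal pod
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(name="main", image=image, args=args)],
                ),
            ),
        ),
    )
    return client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)

# e.g. submit_job("sim-0042", "registry.example.org/sim:latest", ["--steps", "1000"])
```

The RBAC point that comes up next is part of why this pattern works well: the tool's service account gets namespace-scoped permissions, and users never hold broad cluster credentials themselves.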
Yeah, some of our users will be doing stuff with kubectl; at the same time, even though they're sort of using kubectl, in a way they're also using some tools to build an image, push it, and then run some things — and that's why we're considering it. It's important also for all the role-based access control that we've discussed in the past, and the credential management around this. Yeah, at the University of Michigan we had people at every tier: people that didn't care, that never wanted to see Kubernetes, they just cared that they had a container and it ran someplace; people that wanted access for troubleshooting purposes or just to diagnose problems; and then people who wanted direct access to the API. We got really good at having RBAC profiles to allow that sort of thing and make sure people couldn't get out of their namespace, basically.

Yeah, I'm also learning a little bit about some of these modern, newer projects — or at least modern and newer for me — Ray, I think, and one more... Dask, yeah. For some of those, I believe the way they expect you to run with Kubernetes is that you have your local kubeconfig, and the Dask scheduler will run pods inside Kubernetes, so the user — I mean, they know that there is Kubernetes, they have to set up some things, but the user is not the one who runs the kubectl create and whatnot. Again, I don't know how many people are using these "modern" services yet — sorry, I keep saying modern as if it's modern for everyone rather than just for me — but it's possible that some people, since they themselves don't use Kubernetes directly, are behind some abstraction layer like that; those might be the cases. Yeah, I think for the particular case of Dask it also depends how you use it, because there are these two modes where you can use it directly or go through the Dask Gateway. Yes, that's right — the Dask Gateway one, where each user has their own cluster, basically. It's quite interesting; we had a presentation a while ago about this actually, somewhere in the archive.

Let's move on. So, scale: we just asked about compute resources in terms of order of magnitude of CPU cores, from under 100 up to over 10,000. The biggest response — well, not a majority, but the biggest — is large, which maybe isn't too surprising because that's the kind of thing we're all doing. I don't know who's got less than 100 CPU cores; it's interesting what they're up to — probably they've got one cluster, I suppose, and they're playing with it — although that was two out of the eight responses, actually. And half of the replies are at 10,000 or more, so these are really relevant sizes. How big is the biggest, if people want to say? Can you share, Jamie? I don't think it's bigger than that, actually. Okay. How about you guys, you're quite large — can you share it? We can share it: our data centre is around 300,000 cores, and 80 percent of that is for this kind of thing — CPU cores, around 300,000 cores, yeah; I think it's actually more now. What's the refresh cycle for you on hardware? Five years. Does that just go along with the experiment? That's different — no, it doesn't go with it, it's just a five-year warranty. All right, should we move on? Should I take this one? Yeah, yeah.

So, the question was about GPUs. The interest here was to see how much people are already integrating accelerators into these kinds of systems. The replies show people pretty much are integrating them, although in one case only one — but still quite relevant, with a thousand or more elsewhere, so there's quite a bit there. One question I had — I don't know if people want to say other things about this — was what types of GPUs these are: is it all NVIDIA? And also, is there any sort of virtualization, or is it all PCI passthrough, with dedicated cards for the jobs? Anyone want to pick this one up? I'm wondering if the people not speaking are just not able to — shout for help in the chat if you can't communicate.

I just want to say that we added some GPUs — the hardware took like three months to come in, and using the GPU operator we got the nodes up and running, allocatable in the cluster, in like two days. The GPU operator was awesome, and I really can't say enough about it. I think it's really cool how NVIDIA is able to do that and just kind of throw it over the fence — I don't even know how much they support it, I mean they do some, but it's just pretty solid. I don't know, it was neat, I was excited. I second that.

Are you doing any sort of virtualization of the GPUs, or is it just... So actually we just got some Ampere GPUs to start doing some of that stuff with, but we haven't played with those yet — they're sitting on the floor, getting installed hopefully in the next week. So no, we haven't; it was Voltas, I believe, that we have today. And then, you know, we get a Jupyter notebook that allocates a full Volta and uses it maybe less than 10 percent of the time, so we're like, well, this isn't great — that was kind of what prompted the Ampere, having a little bit finer-grained control over scheduling. Yeah — we also offer the possibility to do the virtual GPU that NVIDIA already supported with the T4s and V100s, but it was kind of time-sharing. Oh, okay. We realized that, in addition to being very unstable in terms of performance, there were limitations: some bits of functionality were not available with that sort of driver, and it also needs an additional license, but that part we managed. I know that the new versions, the 13.x drivers, already support all the functionality that we required, so we are giving it another go, but we are also expecting the A100s for MIG. Cool, okay, that's good to know.

Is anyone doing anything other than NVIDIA? I don't know — I don't think we're doing anything specific, and I'm also not sure what we're allowed to talk about, Jamie, in terms of what we're doing other than GPUs, what do you think? You mean other vendors than NVIDIA for GPUs? That was the question, I think — or other types of accelerators, but are we talking about that at this point? I don't think we've got onto that yet. Okay, I thought that was it. No — I think we're just NVIDIA at the moment. Ricardo, what about you guys? The same for now, but we would like to get something in addition. Because we collaborate with a bunch of sites around the world, there are sites that have AMD cards as well, so we started looking at integrating them so that they can run the code properly. But yeah, for now it's all NVIDIA.
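To tie the GPU operator and MIG points together: once the device plugin has made the cards allocatable, a job just requests them as extended resources. A minimal sketch with the Python client — the image tag is only an example, and the MIG resource name is an assumption, since it depends on how the A100s end up being partitioned:

```python
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-example"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="cuda",
            image="nvidia/cuda:11.4.2-base-ubuntu20.04",  # example image tag
            command=["nvidia-smi"],
            resources=client.V1ResourceRequirements(
                # A whole card via the NVIDIA device plugin:
                limits={"nvidia.com/gpu": "1"},
                # With MIG enabled on an A100 you would instead request a slice,
                # e.g. {"nvidia.com/mig-1g.5gb": "1"} (the profile name depends on setup).
            ),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

With MIG the notebook-holding-a-whole-Volta problem described above largely goes away, because the schedulable unit becomes a slice of the card rather than the full device.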
Yeah, I think they've got pretty huge market share, I would guess. But we also have issues with delivery times — we've been waiting for them for months. I think that's the case across the board at the moment, for any kind of card really.

All right, I'll move to the next one because we're actually running short on time. The next one is other types of accelerators — I put FPGAs here, but actually another reason we burst into the cloud is to use things like TPUs as well. I don't know if someone wants to expand here, especially on the FPGAs, and maybe give some details of how they are integrated. We're starting to look at FPGAs. They're not integrated into any of our cloud native Kubernetes type stuff yet, though — those are quite separate beasts currently. Not to say they won't be, ever. And following that, there are a bunch of IPUs and TPUs and insert-whatever-letter-you-want-PUs that people are dreaming up these days that we're looking at in lots and lots of ways, but there's nothing actually doing anything at the moment. We're looking at all the Graphcores and SambaNovas and — what are some of the other ones — Tachyum, or those other ones, Ascend or something like that. There's a bunch of those being tested and played around with, but nothing that's gone near production or Kubernetes status.

But do you know any specifics about running FPGAs in Kubernetes? Honestly, I experimented with it back when I was at the university, but outside of mounting the device into the container — beyond that, not really. It never got beyond me essentially messing with it, and I haven't really looked at it since then. Okay — maybe we take it as an action; I'll also investigate a bit where we are on this. I'm actually noting actions: I have at least two, which would be to engage with the other communities that we mentioned above, and maybe to investigate a bit more on FPGAs. This would be an interesting one for those that replied: is this just seen as an extra PCI device that is given to the job, or something more? Oh, sorry — I know there is a way of mounting the device directly in there; Intel actually has an operator that does it too, if I recall. Okay, so it's all through device plugins. All right.

Okay, I think we can move to the next one then: authentication. We covered this a couple of times in the past, so I don't know if anyone wants to add anything to what is already here. I think we see, yeah, X.509, Kerberos and so on, and the main thing would be how these credentials are being maintained and refreshed for long-lived jobs and things like this — I guess everyone has this sorted out, or are there problems there? I just think it's interesting that almost everyone who responded has multiple things, which I think is quite telling; no one seems to have got their story completely straight on this. I guess when you're dealing with legacy things and new things, which we all are, there's going to be a combination floating around, and that's always a challenge. We haven't got away from Kerberos — it's still alive and well; if anything we're doing even more of it by the day. Yeah, it's like the zombie's hand coming out of the crypt grabbing your ankle. I'm just thinking — yeah, totally, there you go.
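On the earlier question of keeping credentials refreshed for long-lived jobs, one pattern that is independent of the auth backend is simply not to cache the token at startup: with projected service account tokens the kubelet rotates the file on disk, so a job that re-reads it before each call keeps working. A small sketch, using plain HTTP for clarity (the `requests` dependency and the job's own permissions are assumptions of the example):

```python
import requests

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
API = "https://kubernetes.default.svc"

def api_get(path):
    # Re-read the projected token on every request instead of caching it once
    # at startup; the kubelet rotates the file, so long-lived jobs keep working.
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    return requests.get(
        f"{API}{path}",
        headers={"Authorization": f"Bearer {token}"},
        verify=CA_PATH,
    )

# e.g. api_get("/api/v1/namespaces/default/pods")
```

The same idea carries over to Kerberos tickets or X.509 proxies: renew them alongside the job rather than baking a fixed credential in at submission time.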
Okay, so maybe we jump to storage then. Jamie, do you want to take this one? Yeah, sure. So, a question around how we handle data in our clusters — what kind of file systems people use, or other things. It's quite split: lots of Ceph — CephFS, that is — with people choosing multiple options as well, but also Lustre, GPFS and HDFS. Lots of different responses. I don't know — is anyone interested in knowing if the HDFS people are on the call? We actually haven't talked much about that previously in our group. Yeah, it would be interesting to know whether any HDFS users are looking at Ozone — Apache Ozone is a replacement for HDFS — so if any HDFS users are on the call I'd be interested, but maybe they're not; it doesn't seem like anyone is. Ah, I'm misreading the colours actually, I just realized — that would be why. The pinkish thing is CephFS; Lustre deprecating, then GPFS as the future.

All right, I also just saw here in the chat — Nathan, I don't know if you can turn on your microphone, because I just saw a couple of comments from you that would be quite interesting — which would be: how many of these sites are using containers in Slurm? Yeah, I mean, Slurm has forever had integration with the standalone container systems, and Singularity has been quite popular, and there's the new container support coming in; it would just be nice to understand how sites are doing that. In theory you could take a container, run it here, run it there. Yeah, I mean, that's the goal, right — sort of, I don't know.

So did you see that Apptainer is the new Singularity? They just announced that the other day. I feel like with HPC containers, people want more than what they think they want, kind of thing. The name of the game — we were working for a while on trying to replicate Singularity-style HPC containers with Podman, and really, the number of holes that you poke in the container means it turns into more of a sieve than a container, right? Because you really want to bind mount all of your BLAS libraries, the GPU, you know — you want to pull all that stuff in off the host — and that kind of necessarily breaks the isolation. I still think there's really good stuff about it, and even NERSC showed — it's not Singularity they've got there, they've got another one based on Docker — that Python applications actually perform faster across a cluster in a container than outside of one. It has to do with how Python looks up paths when linking dynamic libraries and such: there aren't as many paths to look up in a container, because of the way you link in a container compared to a normal HPC host. So it's kind of funny. But I don't know — we get tons of requests from people to support HPC containers, and people do use them, but I feel like we always have to have this hard conversation of: okay, well, you're not going to get repeatability completely, you're not going to get isolation completely — all these caveats you keep adding on. Well, there's been a lot of research, at a good number of sites, on how to get the performance out of it. The common trick now is to bind mount the MPI layer in, so that you use the host one, especially on the Crays — and that of course breaks a whole bunch of other stuff.
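To make the "poking holes" point concrete, this is roughly the kind of wrapper it tends to end up as — a sketch only, with the host paths as site-specific placeholders; `--nv` and `--bind` are the standard Apptainer/Singularity flags:

```python
import subprocess

def run_in_container(image, cmd):
    """Run a command in an Apptainer image, bind-mounting the host bits
    (MPI, interconnect configuration, GPU stack) that performance depends on."""
    host_mounts = [
        "/opt/cray/mpich:/opt/cray/mpich",      # placeholder: host MPI install
        "/etc/libibverbs.d:/etc/libibverbs.d",  # placeholder: interconnect config
    ]
    args = ["apptainer", "exec", "--nv"]        # --nv pulls in the NVIDIA stack
    for mount in host_mounts:
        args += ["--bind", mount]
    args += [image, *cmd]
    return subprocess.run(args, check=True)

# e.g. run_in_container("app.sif", ["mpirun", "-n", "4", "./app"])
```

Every `--bind` is one more hole: it's how you get the interconnect and GPU performance back, and also exactly why the isolation and repeatability caveats above apply.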
None of these limits are new, actually — let me go look up the paper, or the presentation, I have on it; these issues have been around a long time. Yep. There was a good quote: when you want to go fast, you lose compatibility. Yeah, exactly — necessarily. And with a lot of the stuff I have seen, 100 percent utilization isn't exactly at the top of their list, or they want some kind of network isolation and they're willing to pay the price of running some overlay or something along those lines — on an HPC system that's just not acceptable. I mean, you pay an obscene amount of money for your InfiniBand or whatever low-latency interconnect and you want to use it, because otherwise you could just be using the one-gigabit Ethernet and it wouldn't matter.

Yep — on that link, there's a good quote from another guy at a different lab who said that HPC containers are teaching a whole new generation about linking — library linking errors, right. And it's so true, because that's exactly what you're doing: you're mounting it in off the host if you want the performance. So yeah, I put the link in there — we did this back in 2017, and these things aren't solvable by containers or anything else. A whole world of issues comes in when you want to swap architectures, or compile against SSE4 versus SSE3 or whatever. Then there are a lot of sites with a hard requirement of reproducibility — bit-for-bit reproducibility — at which point you basically have to run an emulator on the new piece of hardware to get that. And it really does matter, because you run CESM or WRF and your hurricane hits Louisiana versus Florida with the same input and the same program. There are a lot of problems with that, especially with the move to single-precision floats on the GPUs — if you get the fancier NVIDIA ones with double precision it's not as much of an issue, but it still matters — and then the IEEE floating-point standard not being consistently and completely implemented makes it entertaining.

Yeah, I posted the link with a lot of the limits that have been around for a while. I don't think anybody's going to really solve any of these anytime soon — once they solve the halting problem they can. But my thought is more just to make sure that the containers can work, so the user can develop on their laptop, throw it on the HPC system, throw it on their Kubernetes system, or have them burst into each other, whatever. It would be really nice to know how sites are doing that. Right now there's a lot of glue work that goes into getting Jupyter notebooks to work on HPC, or some run them on Kubernetes and then burst out to HPC, and stuff like that. It would be really nice to know what the sites really need and what they're doing. I understand the use case: you want to use Kubeflow, you use Argo or something like that, or Helm — you don't care how it runs, you just want it to run. But you hit a lot of complications; in a lot of cases you're going to have to recompile absolutely everything to get the full performance when you're jumping from your laptop, which may be an ARM Chromebook, to a Xeon box or something like that, or even a POWER8 or POWER9 — what are we on, POWER10 now?
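That recompile-per-architecture point is also behind the multi-architecture CI builds that come up below. As a sketch of one way to publish a single image tag that resolves per architecture — the registry name is a placeholder, and this uses docker buildx rather than the per-architecture-runner setup described later, which reaches the same end result with native builds:

```python
import subprocess

IMAGE = "registry.example.org/group/app:latest"       # placeholder registry/tag
PLATFORMS = "linux/amd64,linux/arm64,linux/ppc64le"    # the architectures discussed above

# Build for each platform and push a single manifest list, so whichever node
# pulls the tag gets the image built for its own architecture.
subprocess.run(
    ["docker", "buildx", "build",
     "--platform", PLATFORMS,
     "--tag", IMAGE,
     "--push",
     "."],
    check=True,
)
```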
We should probably stop going through the rest one by one — we've only got a few minutes. Yep, let's browse the rest and then we can come back. I think we can go over by three or four minutes, though; we started late as well. Yeah, thanks a lot. Should we browse through them quickly then?

Monitoring: we see Prometheus, ELK, alerting — nothing very outstanding there, it's pretty standard I guess. Yeah, but it's important.

All right, so this one comes back a bit to what Nathan was just referring to, which is how our container images are built. I think this is one of the replies I had for him: in most cases we don't have people building locally; they just push somewhere and there's some sort of CI/CD that will build for multiple architectures. So for the systems here we get GitLab, Jenkins, then Tekton, and then manually as well; then GitLab, Tekton, manual again — okay, there's quite a lot of manual — GitLab, GitLab CI, Tekton, Jenkins. It would be nice to know how they're actually using the GitLab runners, or Jenkins, to do it. Yeah, I can tell you how we do it: we have multiple runners on each of the platforms, and when you push, your image will build in parallel and then push to the same registry, and then whatever runtime is pulling the image will pull based on the architecture it is deployed on. Does that answer your question? Sort of — I meant more: do they use SSH, or go to a REST API — what is the runner calling, or is it going straight through the Kube API? The runner for building the image, you mean? Yeah — well, like here, the top one, GitLab runners; there are a few things you could do there, so I'm just wondering how they do it. So for the GitLab runners, you push to a branch, and then the runner gets a webhook and will just clone the code and build locally on whatever hardware the runner is running on, and we basically replicate the runners on all the architectures. Does anyone want to add something, maybe?

All right, then we cross to registries — so we have a whole mix of answers there. Anyone particularly happy or unhappy with their current choice, any hard issue to raise? We use Artifactory; we've run into some scaling problems with it, but we've recently started looking at Dragonfly — it's a sort of caching, peer-to-peer thing — and it's very early days, but it looks pretty good actually. We originally started looking at something called Kraken, which I think was out of Uber, but it seems to have died in a ditch, so then we moved sideways onto Dragonfly, and it looks pretty good; it's taking some of the pain away from Artifactory. Anyone else?

All right, let's go through — I think we have two more. So, languages: pretty much half is Python, and then the other half is split — some Fortran and so on. Anything to highlight, anyone? All right, so quickly the last one, additional tools — I guess it's more regarding deployment here: there's Terraform, Argo CD, Puppet twice, and Helm. Sounds pretty reasonable, I think. I think that's it. I don't know, do we want to highlight anything in particular? We're already three minutes over. Pretty good — and I just wanted to say thank you to the people that responded. Yeah, thank you. I think I took a couple of action items, and also, from the discussion we just had here, some topics on how people are actually using containers in these environments, what the motivation is to do that, and the kind of limitations they're running into on their setups — so maybe we take those as topics for the next session.

Yeah — otherwise, thank you very much everyone, and we meet in two weeks for Jamie's jam session. I think we are going to try to reach out to one of the other working groups, right? Let's see — and if it doesn't work, I hope you have your guitar. Sounds like it's on me to sort that out. Right, cool. Okay, thank you very much, thanks everyone. Thank you, see you later.