Hey Kevin, thanks so much for joining us in the car for our little tour of Amsterdam. We can talk a little bit about Kubernetes and software while we go. Could you introduce yourself to the listeners?

Yeah, thank you for having me here. I'm Kevin. I actually started contributing to upstream Kubernetes back in 2015, and I used to do a lot of work around the scheduling part. Since 2018 I started working a little more on bringing Kubernetes to industry. One of the projects I initiated was KubeEdge, which basically lets Kubernetes and cloud-native applications run in edge computing scenarios. Later on, another path opened up: there are a lot of AI, machine learning, and big data workloads trying to run on top of Kubernetes, and we have seen that Kubernetes was really born to support microservices well. So we built the Volcano project to port things like job management, queues, and gang scheduling, to better support batch and machine learning workloads running on Kubernetes. I also have a bit of background in the multi-cluster, multi-cloud part. I was a participant in the Multicluster SIG in the early days, when it was called Federation, and after the two versions, Federation v1 and KubeFed, we actually started the v3 project. It's called Karmada, and it's a Sandbox project in the CNCF.

Yeah, gotcha. So when I was doing some of the background research, my immediate thought was: okay, KubeEdge and Volcano, are they related somehow? In my mind they seem not really connected. They're both interesting, but do you see them as somehow paired together, and is that why you're working on both?
I think, especially from the user perspective, maybe not quite connected. The typical user of KubeEdge is more from industry: manufacturing, public transport, or people building smart campuses, for example. The users of Volcano are more likely academic research organizations, or people who just want to run machine learning and big data workloads on top of Kubernetes. The little bit of connection, we think, is that a lot of AI workloads are exploring the path of running on the edge, and people are talking about how, in the future, we may have some kind of redundant resources that can be used at night for training. For example, a lot of EVs are becoming more and more smart, but at night those resources are just sitting there.

Yeah, yeah.

So that's kind of the future, but for now they're still two projects doing their own stuff.

That's interesting. So, kind of old-school grid computing, where you're like, hey, maybe we could use some of these resources when they're not being used for other things. That's interesting. So with KubeEdge, and this is kind of the nature of this show, right: what's the thing that you think looks really interesting about it in, say, six months or a year? What's the next thing that's going to drop that you think looks really, really cool?

Yeah, actually, I think the exciting thing in KubeEdge, especially from the community perspective, is that there are a lot of new usages and new ideas being adopted, especially on the edge, and that keeps moving. In the early days when we started this project, we were more focused on fixed-location edge, like some of the devices, like a thermostat. Yeah.
Yeah, and also some manufacturers just deploy their applications with KubeEdge in their factories, in the workshop. But during the last two years, one academic group has been trying to manage workloads on low-orbit satellites.

Okay, yeah. Those are definitely moving.

Yeah, and it's low orbit, so it's actually moving very fast. You get a six-to-ten-minute time window to stay alive, and each day you get that window maybe eight to ten times. And the resources on the satellite are quite limited, especially because they're using solar-charged batteries, so we need to save the battery so the satellite can serve longer.

Right, right. Yeah, and keep being a satellite.

And sending data to the ground costs a lot of battery. So that academic group is trying to run some small models on the satellite to do some simple analysis, to filter the input data, so they can send the most valuable data back rather than all of it.

Yeah, that's interesting. I bet that's a hard decision to get to as well, right? Because one of the fascinating things about data in recent history is that people are collecting everything, because they don't know what might be useful in the future. So that's an especially hard problem, because the data just disappears, right?

Yeah, yeah. So another very interesting usage is that one of the automotive manufacturers is using KubeEdge in their EVs. They're starting with one of their MPVs, because it has more resources available. They're exploring the path of enabling SOTA, basically software-level over-the-air upgrading.
Oh, yeah. Yeah.

Because before that, they had their own kind of FOTA, firmware over-the-air upgrading. But when you're upgrading the firmware, the car needs to stop there, wait for the whole upgrade to finish, and then you can go.

Right, right.

It takes a long time, and a lot of actually unrelated things need to be stopped. But the software level gives you smaller components to update.

Yeah, you do smaller chunks at a time. And so they're looking at KubeEdge to deliver those updates, in a sense?

Yeah, yeah, and that's still a very early usage. We also know that in China there are companies exploring charging, especially swapping the battery. They have their own stations, and they're exploring a scenario where, when your car gets near the station, they want to use the local network to guide your car to the right place. And during the battery replacement, you can also trigger some of the upgrades.

Right, right. You can use it as an opportunity for basically fast data transfer. No, that's interesting. And so is that what you've been focusing on from your work perspective in KubeEdge itself? Or is that just something that's going on that you think sounds really interesting?
So currently we are collaborating with these users to implement this. For example, the satellite group already has some satellites running in orbit, but they want to improve the whole platform to make it much smoother. And for the vehicle, the in-car deployment is already there, but the charging station is kind of a new idea on the way.

And from my perspective, we're trying to decouple different things, because at the framework layer we want to keep KubeEdge more general and open to more scenarios, while at the same time we want to provide more functionality to help people simplify their adoption. So there will be some toolkits fit to each scenario, to enable the consumption of these new features.

Yeah, that's interesting. A lot of the problem with incredibly sophisticated software like KubeEdge is just getting people to wrap their heads around what it's going to do for them, or how it does it. That's often half the battle, right? So I was also curious about Volcano, which I thought was an interesting project. Can you give us a little more detail on what Volcano does and what it's for?

Yeah. At the very starting point, we were building gang scheduling; that was the first requirement of all these AI and machine learning users. In Kubernetes, scheduling happens pod by pod, right? But, for example, TensorFlow users want a whole set of pods, the PS (parameter server) and the workers, to start, then load the data and start training. If only a few instances are there, it's kind of a waste of GPU, especially GPU time. So we introduced the PodGroup concept, and also the Volcano Job concept, to support multiple AI framework definitions.
You can define a TF kind of cluster, or Torch, and it will basically adapt the scheduling accordingly.

Yeah, yeah, for that set of things, or whatever.

Yeah. And also, the traditional batch users are very familiar with queue functionality.

Yeah, like batching up your work.

Yeah. So we also imported the queue concept and built things like the hierarchical queue, and implemented some algorithms that are quite familiar to batch users, like fair-share scheduling or bin packing, to balance resource usage. And recently, what we are doing is implementing better resource sharing, taking both online services and batch workloads into consideration, enabling them to be co-located on a node, and providing oversubscription of resources. We're also collaborating with an operating system community to achieve node-level resource preemption.

Oh, interesting. Okay. So are you using things like eBPF to do that? How are you communicating with the OS level from inside Kubernetes? What technique is it using?

Yeah, so actually the whole system mainly has two parts. At the scheduling level, we need to schedule according to what people requested, right?

Right. Yeah, yeah.

And people are able to declare something like: I have up to two gigabytes that I'm not exactly using all the time, so it can be reused if I'm not in a heavy-load situation. Then, when we're creating the container on the node, we need to add some annotations through the CRI, to basically tell the OS: when you're creating and allocating these resources, you can put some part of them into the shared pool.
So we let users define the different priorities of their batch or other workloads, and also define how much of their resources they can share, and then we will just put more low-priority workloads there. The role of the OS is to make sure that part of the resources can be shared with the lower-priority workloads, but when pressure comes from the higher priority, the OS needs to preempt and take back the resources right away.

Right, right.

And you know, the interesting thing is that sharing CPU time is much easier, but memory is more interesting, because memory is not easy to compress, right?

Yeah. Well, there's also a much larger security component, right? Because you don't really want this process to be able to read that process's memory. So that can also be a bit of a challenge.

Yeah. So what the OS team is exploring is that they will check and basically determine which kind of application is running, and they can use paging to save that memory: compress it and store it to disk, and later resume it to memory.

Yeah, yeah. So you were saying before that you can implement a different scheduler for different types of AI approaches, or whatever. Can you do it per workload, or do you do it for the cluster? At what level does the scheduler go in?

Yeah, yeah. At the very beginning we just replaced the scheduler, but at the same time we rely on the multi-scheduler mechanism from upstream; we also contributed to that. So basically, for each workload — I remember there is actually a schedulerName field.
So the users can just determine which scheduler this pod gets scheduled by.

Right. Okay. Yeah.

And also, for Volcano, actually, we don't think multiple schedulers running inside a cluster is a good idea, because you've got two brains making decisions, right?

Right, yeah, they're going to fight, right? That was actually kind of why I asked the question, because I was curious how you make them play nicely together.

So currently we actually import the upstream scheduler algorithms. So basically the Volcano scheduler is kind of a superset of all the functionality, so you can also use Volcano to schedule microservices.

Oh, okay. So you can do other types of batch processing besides the AI work, right? Yeah, that's cool. It's funny, because I'm very familiar with batch processing for research, because I'm a professor at Boston University, and we have this whole research computing cluster with all the batch processing and stuff. I didn't know about Volcano before, and now I kind of want to go back — I actually communicate with those folks a fair amount — and be like, hey, did you know about this project over here? You might find it really useful. I'm sure they're maintaining their own scheduler right now. The batch scheduler, not the Kubernetes scheduler. So yeah, that's super interesting; I'm glad to hear about it. So what do you think the next cool thing is going to be in Volcano? What's the next big feature that you really want to see land?
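To make the gang-scheduling and schedulerName ideas from this discussion concrete, here is a minimal sketch of a Volcano Job. The field names follow Volcano's `batch.volcano.sh/v1alpha1` API as documented upstream, but the job name, image, and replica counts are illustrative assumptions; check the manifest against your installed Volcano version.

```yaml
# Hypothetical example: a two-role training job, gang-scheduled by Volcano.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training            # illustrative name
spec:
  schedulerName: volcano       # the schedulerName field mentioned above
  minAvailable: 3              # gang scheduling: start only when all 3 pods fit
  queue: default               # the queue concept imported for batch users
  tasks:
    - name: ps                 # parameter server
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.13.0   # illustrative image
    - name: worker
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.13.0
              resources:
                limits:
                  nvidia.com/gpu: 1                 # assumes a GPU device plugin
```

Because `schedulerName` is a standard pod-spec field, an ordinary Deployment can opt into the Volcano scheduler the same way, which is how a single scheduler can cover both batch jobs and microservices.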
Actually, recently we've been taking a look at the multi-cluster part, especially with the Karmada project. We already did some trials and some basic implementation of scheduling workloads among clusters. I think that's a very new area, because today, the way people use multiple clusters, they're just manually picking the cluster to run on.

Yeah, especially for the batch users: they just want resources, they don't care where it comes from. Just let me know when it's done, right?

Yeah, yeah. So actually, we already have a very basic implementation in the Karmada project. It helps you automatically divide the replicas among multiple clusters according to resource availability. And the most interesting thing is that people today want to take things like quota into consideration, and the price of the underlying resources: different clouds have different prices for CPU, memory, GPU. So once they bring that perspective into consideration, they can get things cheaper.

So part of the scheduling is not just finding the most optimal place to run it; part of optimal includes price, right? Which is, I think, a new thing in all of our computing: we need to work in the price point, because we weren't used to it before all this public cloud, where you pay per minute or whatever. We probably should have been building that into all of our data centers all these years, but now we really have to, because we really care about the pricing. It's interesting how much it's become part of the schedulers and that kind of stuff.
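The automatic replica division described here corresponds to Karmada's PropagationPolicy API. A minimal sketch, assuming two member clusters named `member1` and `member2` and a Deployment called `nginx` (all illustrative); the field names follow the `policy.karmada.io/v1alpha1` API, but verify them against your Karmada release.

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation          # illustrative name
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx                  # the workload to propagate
  placement:
    clusterAffinity:
      clusterNames:                # hypothetical member clusters
        - member1
        - member2
    replicaScheduling:
      replicaSchedulingType: Divided       # split replicas across clusters
      replicaDivisionPreference: Weighted
      weightPreference:
        dynamicWeight: AvailableReplicas   # weight by each cluster's spare capacity
```

With `dynamicWeight: AvailableReplicas`, a cluster that can currently fit more replicas receives a proportionally larger share, which is the "divide according to resource availability" behavior discussed above.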
Yeah, there was actually a guy working on his PhD at Boston University who was working on tooling, trying to figure out how he could essentially do a scheduler for serverless functions, so that the serverless function would kind of move around onto different clouds based on price point. Can you optimize with that without some sort of massive performance impact? I just think it's all really interesting to look at from that perspective. I've always thought schedulers were really interesting, especially when they're really complex.

So, what I wanted to ask you about the multi-cluster: in the multi-cluster scenario, you already said that it's not great to run multiple scheduler types in one cluster. So now, when you want to identify where to put a batch load, can a cluster then decide, you know what, we're going to go all-TensorFlow, because we have a lot of pent-up workload, instead of whatever other scheduler we're on, so that it can pick up those workloads? Are you going to be able to modify, or think about modifying, the cluster itself as part of the activity of balancing the workloads across the clusters?

You mean modifying the clusters?

Yeah. So like, okay, multi-cluster scenario, right? Each of the clusters is probably running some scheduler; we don't know which in advance. But our batch tool or engine or whatever is saying, hey, you know what, I've got a lot of TensorFlow stuff coming along. You, cluster over there, you're not really doing a ton of stuff, so can you modify your scheduler to be the TF one, so that I can push a bunch of that workload there? Because I think that's really interesting.

Yeah, that's why, actually, we're trying to do a more powerful version in the Volcano project.
Yeah, because we're expecting people to use Volcano in the single cluster as the scheduler, because it's able to schedule the batch workloads as well as the classic Kubernetes applications, right?

Right.

And also, we already see a lot of challenges doing the two-level scheduling: schedule to a cluster, and the cluster does its own job, right? You cannot always make sure the federation-level scheduling is correct, or is the best, because there's always some race condition. The federation-layer scheduler sees there are some resources available — we can also take resource fragmentation into consideration, right — but the underlying in-cluster schedulers are also doing their own work, so things can be changing. So what we do today is that we have a kind of descheduler at the federation level to get things fixed.

Yeah, like, leave me alone so that I can do proper scheduling.

Yeah. But we think it's kind of expensive, because it takes a bit of time and computing steps to find out which cluster, or which set of clusters, to run these workloads on, and some of the replicas may fail and then come back; it's a long cycle. So that's why we think that maybe, if we provide the multi-cluster scheduling functionality from Volcano, and make the federation-layer scheduler and the single-cluster scheduler collaborate with each other more closely, we'll resolve this situation better. Especially, we want to improve the accuracy of the first-time scheduling.

Yeah, yeah. I mean, schedulers are interesting, right?
Because one of the things that's a challenge is that you can't spend so much processing power doing the perfect schedule that it uses up whatever optimization you were going to get from your perfect scheduling. Which I always think is weird, when you have to write software where the optimization sometimes comes at the expense of developing the optimization. But yeah, that's neat; I think a lot of that stuff is really interesting. And then you mentioned there's a third project you've been working on, Karmada.

Oh, right, right.

So: multi-cluster management, but you're mostly thinking about it in terms of scheduling — you personally, from the scheduling scenarios — rather than, you know, how do you manage a large number of clusters?

Yeah. Actually, for Karmada, we're thinking that people are at different stages of adopting the multi-cluster or multi-cloud architecture. The very beginners just want to reduce some of the repetitive work, like spinning up clusters. For the cluster lifecycle management we have Cluster API already, right? But there's a lot of work once the cluster is there: for example, configuring the namespaces, configuring the RBAC, and setting up the quota for different teams, right?

Right.

So the idea of Karmada is that, at this level, we want to reduce that work from the admin perspective.
So Karmada provides a mechanism called the PropagationPolicy. It's basically able to propagate any type of Kubernetes resource, including custom resources. So people are able to propagate RBAC configurations, ConfigMaps, namespaces, or resource quotas, as well as deployments. And for deployments, we're able to compute the resource consumption from the pod's perspective, to make sure you get the appropriate replicas of the application running in the different clusters, right?

Right.

And also, today, when people are accessing multiple clusters, the management of the kubeconfig is actually a big challenge, both from the user perspective and from the admin perspective. So Karmada also provides a kind of unified entry point to multiple clusters. From the implementation perspective, we use the impersonation mechanism from the HTTP protocol. So from the user's side, it's like: I use one kubeconfig token to access all the clusters through a single entry point. I just need to select which cluster I want to go to, but I use the same kubeconfig. And Karmada uses its own token under the hood to connect with the member cluster, but impersonates the user. So it means that any request going into the member cluster is authorized as that user.

Sorry, I just thought that was a stop light and the guy was right behind me. But okay, yeah, continue. So basically, if you use the impersonation, you can audit and have an idea of who's doing the work, who's making the changes, right?
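The impersonation flow described here builds on standard Kubernetes impersonation headers. A rough sketch of the request the control plane might proxy to a member cluster's API server; the host, path, user, and token placeholder are all illustrative assumptions:

```http
GET /api/v1/namespaces/demo/pods HTTP/1.1
Host: member1.example.com
Authorization: Bearer <karmada-member-token>
Impersonate-User: alice@example.com
Impersonate-Group: dev-team
```

The `Authorization` header carries the control plane's own credential for the member cluster, while `Impersonate-User` (and optionally `Impersonate-Group`) names the original caller, so the member cluster's RBAC checks and audit logs treat the request as coming from that user rather than from the control plane.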
Yeah, yeah. So from the admin perspective, it's super useful, because you've got a single point of entry for all these users or application operators, so you can audit just there, and you've got unified authentication and authorization. It's much easier to manage.

Yeah, right. No, that's cool. So let me ask a slightly broader question: what got you into open source and Kubernetes? How did you get started working in this space?

Yeah, that's a very interesting topic. In the early days, we were trying to build a platform-as-a-service product internally, but later on our team decided to do the cloud business, so we started building services for our customers. Actually, before Kubernetes, we also tried some other technologies, but Kubernetes was the first one where we saw a very open community in the early days, back in 2015. The scalability was like 500 nodes, right? There were a lot of scheduling features still to be added. And in those days I, very fortunately, joined the community to discuss our ideas with people, and within a very short time the ideas were accepted and we started contributing, right? That experience actually affected us a lot; we benefited a lot from collaborating with the upstream community. And that's why, when we were building the new services, we also open sourced them and donated them to the CNCF, so others can benefit from that.

So that was your personal first foray into open source in general?

Yes, yes.

Oh, cool. That's a great opening story, right? I mean, a lot of times your first entrance into open source is because you have some sort of problem, right?
Yeah, but to me it sounds more like: hey, we think you've got the right solution, and if we can make this work for us, it'd be really cool. That's awesome. So your organization has done heavy adoption of Kubernetes internally as well?

Yes, yes. Well, actually, we also have internal users, internal customers, and they have a very large-scale deployment of Kubernetes. From our perspective, we're trying to provide the best solution, so the internal users are actually using the same version we offer to the customers.

Oh, cool. Yeah. When I was at Red Hat we used to refer to it as eating your own dog food, which I would say is probably a relatively common phrase. But I think it's always such a good idea: if you're the first customer of your own software, it makes it a lot less painful for your customers.

Yeah. And especially, I think one of the interesting things is that in recent years, more and more customers also collaborate with us on the open source. So it's not just using the commercial part; they also have ideas, they also have requirements, right? Actually, they're not 100% certain about their requirements, especially from the solution perspective, right? People are always trying to avoid falling into some kind of XY-problem situation. In the community, we can discuss with more people and come up with a more general solution. So I think that's also a very exciting thing, right? Today we have a lot of features where the idea came from one of the end users, or one of the service providers, in the community, and we discussed it together and implemented it together.

Yeah, that's awesome. Well, why don't we end the interview there?
Thank you so much for taking a little tour of Amsterdam with me; I hope you enjoyed it. It was a pleasure to talk to you. Like I said, I've always been fascinated by schedulers, so I really like talking about them, even though I've never actually worked on any. I get really interested in technologies I don't actually work on sometimes. So thanks again, I really appreciate it.

Thank you.