Ok, I think we can begin. Let me introduce myself. My name is Michał. I'm a software engineer working at Google on the GKE Batch team. Today I'm co-presenting with Vanessa from Lawrence Livermore National Laboratory, who unfortunately cannot be here with me in person, so I will play her recording, and she will be available later for your questions on Slack. Let's start with some history. As all of you know, Kubernetes was originally built with a focus on long-running and stateless applications. Naturally, there were a lot of feature gaps in the early Job API for running batch workloads. But over the years, users who wanted to run batch were still very determined to run it on Kubernetes, even at the cost of workarounds or re-implementing features which should have been provided by core Kubernetes. This led to a lot of fragmentation in the batch ecosystem. We would like to improve this situation and reduce the fragmentation, and that's why we started the Batch Working Group initiative; we work under its umbrella to bring all the necessary features and primitives into core Kubernetes. We believe this will improve workload portability between different frameworks and between different cloud providers, and it will also allow framework developers to focus on added value rather than re-implementing core functionality. Ok, so let's take a look at what has been done on that front. Here you can see, on the left, the list of problems, and on the right, the list of new features that address these problems in recent releases of Kubernetes. The list encompasses many areas: you can see, for example, improved handling of periodic jobs, or the job suspend field, which is a stepping stone for job queuing, something that Aldo talked about at length. But in this talk I would like to focus on two features: Indexed Job and pod failure policy. Let me first introduce Indexed Job with the use case of processing a large dataset. If we have a large dataset, we naturally want to split it into smaller chunks and process it in parallel with multiple workers, represented in the world of Kubernetes by pods. The problem was that with the early API, the recommended approach was to create and maintain your own queue of tasks. However, if the dataset isn't changing, we can have a simpler solution and just assign a specific chunk of the dataset based on the worker index. And this is essentially what Indexed Job gives us: we simply set the completion mode, as shown in the YAML on the right, to Indexed, and then an environment variable with the index is injected into the worker process, which the worker can use to load its specific chunk of the dataset. So that makes it simple. And if we add another requirement, that the pods need to communicate while processing the dataset, then Indexed Job also makes that simpler: if you create the Indexed Job as shown on the left and match it with the headless service shown in the middle of the YAML, then all the pods will have stable DNS names, predictable up front. The DNS names include the job name, the worker index, and the service name, which makes it convenient to set up a distributed network of pods, as in the use cases we are about to see. Ok, so an important part of the talk is that I would like to present a selection of use cases where the Indexed Job feature finds its use at DeepMind. Here I want to say a big thanks to George and Lena for sharing the insights.
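Before we get to those, here is a minimal sketch of the pattern itself, roughly what the slides showed: an Indexed Job matched with a headless Service. The names here are mine, for illustration, not taken from the slides:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: process-dataset
spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed          # each pod gets a stable index 0..4
  template:
    spec:
      subdomain: workers           # must match the headless Service name
      restartPolicy: Never
      containers:
        - name: worker
          image: example.com/worker:latest   # hypothetical image
          # The job controller injects JOB_COMPLETION_INDEX into each pod;
          # the worker uses it to pick its chunk of the dataset.
---
apiVersion: v1
kind: Service
metadata:
  name: workers
spec:
  clusterIP: None                  # headless: one DNS record per pod
  selector:
    job-name: process-dataset     # label set automatically on the job's pods
```

With this in place, the pod with index 3 is reachable as process-dataset-3.workers, i.e. job name, worker index, and service name.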
And George should also be available on Slack at the end of the session if you have questions. The first use case I want to present: we want to train a machine learning model, and we want to train it on a large dataset. In order to make the training fast, we need to shard the dataset across many devices, while the devices are split between multiple nodes, and each node can also have multiple devices inside. The devices are GPUs or TPUs. So let's take a look at how Indexed Job can be helpful here. In the first step we create one pod per node, and the pod with index zero is a distinguished pod called the coordinator. Because, as I said before, it has a stable DNS name, predictable up front while the other pods are being created, it is easy for the other pods to register their presence at the start of the system, and then the coordinator can set up the distributed environment. In particular, it will await all the pods becoming ready and all the communication channels being established between the pods. Once we have that, each pod can load its shard of the big dataset and split it into smaller mini-batches that are loaded onto the particular devices. Along with the data, we also load the model onto the devices; here we have the simplifying assumption that the model fits into the memory of a single device. In practice this might be more complex. Once we have that, the devices train on the loaded mini-batches of the data, and the communication channels are used to exchange partial results of the training on the smaller chunks of the dataset, so that once the results are exchanged, we obtain a single model that is trained on the entire dataset. If you are interested, I would also like to refer you to some code samples inspired by this use case that I prepared for both the PyTorch and JAX ML libraries. Ok, in the next use case we want to simulate an agent in a virtual environment for the purpose of reinforcement learning. In this setup the agent performs different actions on the environment, it can observe the environment to update its knowledge of the world, and for some achievements in the environment it collects rewards. The assumption is that by correlating the actions with the rewards, we can produce a new generation of the agent that is more likely to repeat the actions that led to rewards. However, in practice we don't want to run just a single simulation; we want to run many. The reasons include, for example, that we want to collect more data so that some of the less likely trajectories also get explored, we want to account for variation in the initial conditions, or we want to test agents with different tendencies for exploration versus exploitation. So again, let's take a look at how we can set this up with Indexed Job. The assumption is that we have the agents and environments containerized. Here we have two Indexed Jobs: one represents the multiple agents, and the other the multiple environments. Then, by using again the feature of stable and predictable DNS names, each agent can easily connect to the environment with the same index and create the communication channel that will be used for the simulation. Once we have that, the simulation can start, and actions, observations, and rewards are passed over the course of the simulation; this is nicely abstracted by an open-source library by DeepMind.
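A minimal sketch of how this index-based pairing might look, under the following assumptions (all names are illustrative, not from the talk): the environments run in a second Indexed Job named environments behind a headless Service named envs, listening on port 8080:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: agents
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: agent
          image: example.com/agent:latest    # hypothetical image
          command: ["sh", "-c"]
          args:
            # JOB_COMPLETION_INDEX is injected by the job controller; each
            # agent dials the environment pod that shares its index, via the
            # predictable DNS name <job-name>-<index>.<service-name>.
            - ./agent --env-address="environments-${JOB_COMPLETION_INDEX}.envs:8080"
```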
And if you are interested, I would also like to refer you to the code samples that I prepared for a simple catch game; what you can see in the GIF on the right is an example trajectory in this game, played for a single index. Ok, I would like to conclude this part by saying that while you can think of many workarounds, and they will work to some extent, that is also what happened here: prior to Indexed Job, DeepMind actually used StatefulSets, but there were some problems with them. The root cause was the lack of the notion of completion that is characteristic of all batch workloads. Because of that, for example, in this simulation use case some simulations could end earlier, but the pods would still continue running, consuming resources, and you would need custom code to detect when the simulation is over. There was also no native mechanism to limit the number of failures in case of, let's say, a software bug. So Indexed Job fits this use case nicely. Ok, and now is the second part of the talk, where I will play Vanessa's recording, in which she talks about the Flux Operator. I will also make it clear that she will be, or already is, on the CNCF Slack channels, so if you think of questions that you want to direct to Vanessa or George, you should be able to find them on Slack after the session. Hi fellow Kubernauts, I'm Vanessa Sochat, and I'm going to be talking about an example use case for Indexed Jobs: a project called the Flux Operator that we've been working on at Lawrence Livermore National Lab. Let's get started. Once upon a time there was a resource manager named Flux Framework, and Flux lived in HPC land along with the other resource managers, a few container technologies, and of course a sysadmin or two. Flux was really great at a lot of the things you see in this table, but especially Flux was great at fully hierarchical and graph-based resource management. Oh, eager little friend, you have a question: what does graph-based resource management mean? That is a good question. Let's say that we have a resource allocation with four nodes; it doesn't matter if this is on HPC or on a Kubernetes cluster. We could theoretically install Flux and start what's called a Flux instance.
Now the Flux instance can actually see the resources that are available to it, and then, if we were to create and launch a job, the really cool part is that Flux is going to create instances of itself to run on the sub-resources. And if you're looking at this and thinking, hmm, this looks a little bit graphy, you're totally right: we're looking at different depths of a graph, where each depth knows about, and manages, its own resources. This means that Flux is really good at portability: you can run it on a cluster, you can run it alongside Lustre, you can run it on a share, you can really run it anywhere, you can run it in a container; using Flux is a total no-brainer. Flux is also really good at co-scheduling, because it knows the node topology. So let's say that you have a workflow that requires GPUs to communicate: Flux can schedule them to be physically close together. Flux is also really good at job coordination. Here we have the MuMMI workflow, and MuMMI was incredibly heterogeneous in terms of the different needs of the workflow components; Flux was able to intelligently schedule them so that those components best matched the resources they needed, across a very large set of resources. So Flux is hanging out over here in HPC land, and as you know, over there is this cloud land where we have technologies like Kubernetes. If you've ever played a strategy game, you know the fog of war: the idea is that there's something that needs to be uncovered, and we need to go on a journey. And so last year at the lab this is exactly what we decided to do: we said, let's explore the space between HPC and cloud. In this space, one of the early projects to emerge is the Flux Operator, and this is what I'm going to be talking about today. Ok, let's get started on today's journey, starting with a stop at Definition Island. Probably most of you know what an operator is: it is a controller for a Kubernetes cluster that manages objects. The Flux Operator is a controller that allows us to set up that Flux instance to run across pods, and we have a special term for it: a MiniCluster. No, I don't mean a cluster for ants; I actually mean a set of duplicate pods, created by an indexed job (here is where the indexed job comes in) and configured to run a Flux instance. It's really cool conceptually, because it's like you have an entire cluster in the cloud, an HPC cluster, for you to control. So let's say that we start with a Kubernetes cluster of size nine: we could theoretically create a mini cluster, also of size nine, to maximally utilize our resources. Index zero of that batch job is called the broker, orchestrating the job; the way the pods communicate is via a tree-based overlay network, and it has all the niceties that you'd expect: batch jobs, queuing, etc. Ok, basic question: if you come visit us in HPC land, we're going to give you a command-line thing; if you go off to cloud land, someone's going to hand you a YAML file. So to start off, we figured, ok, we'll just define the needs of a job in a minicluster.yaml file. This is our custom resource definition, or CRD: basically you define your job in this file, you give it to the Flux Operator, it's going to create you a mini cluster, your job is going to run and complete, and everything cleans up (I'll show a small sketch of such a file in a moment). Make a mini cluster! The next stop in our journey is going to be Experiment Empire, where we ask empirical questions like: how well does this actually work?
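As promised, a minimal sketch of such a minicluster.yaml, modeled on the Flux Operator's published examples; the exact fields may differ between operator versions, so treat this as illustrative rather than authoritative:

```yaml
apiVersion: flux-framework.org/v1alpha1
kind: MiniCluster
metadata:
  name: lammps
  namespace: flux-operator
spec:
  size: 4                          # pods in the mini cluster (the indexed job)
  containers:
    # Example application container; the broker (index 0) bootstraps the
    # Flux instance across the pods and runs the command under it.
    - image: ghcr.io/rse-ops/lammps:flux-sched-focal
      command: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns
```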
So we decided we wanted to compare it to the MPI Operator, which is another operator in the space that is very similar in nature. It started as part of the Kubeflow project and defines an MPIJob as its custom resource. It has a slightly different design: it uses a launcher node to coordinate workers via SSH, and, like the Flux Operator, it uses a dedicated hostname and a service for the workers. We had to use a modified version to scale to over 100 MPI ranks; check out the paper right there if you want to learn more about that. But how do they compare? We decided to run an experiment that looked at LAMMPS, a molecular simulation, in unoptimized containers. This is what that experiment looked like: we needed to use a 65-node cluster to account for that extra launcher node, but then we wanted to test on sizes 64 down to 8 of a mini cluster, or, for the MPI Operator, just sort of an indexed job, and you can also see the corresponding number of ranks, which are the MPI processes. Ok, so then, for each of the operators, we launch a job or create the mini cluster across each of those different sizes, we record timings, and we save the outputs. Apparently our sun god at Experiment Empire is angry right away. Sun god, behold the results! Ok, the sun god has questions. If the Flux Operator mini cluster is created via an indexed job, how well does that scale? Well, here you're looking at mini cluster creation and deletion times, so this includes the entire bringing up and then bringing down of the pods, but does not include LAMMPS. As we move across the x-axis we move from size 8 to size 64, so the cluster gets bigger, and the really cool part is that this scales really nicely; the indexed job is doing a great job. Ok, so, next question from the sun god: if the Flux and MPI Operator have different designs, how efficient is each operator's setup? Because we are comparing apples and oranges here, we need to look at them separately, starting with the MPI Operator. Here is the end-to-end time, so this is from the creation of the job through the timestamp when it's completed. This, I must note, is when the pods are ready to go; there's absolutely no waiting for pods here, and it does not include the LAMMPS run. And what you see is that there's a two-fold increase in time from size 8 to size 64. Now, the analogous thing we can compare to on the Flux side is flux start. This is from when the broker comes to life through when it shuts down, and I need to point out that this includes the broker waiting for all the other pods; we don't know when the broker is going to come up relative to the other pods. It also does not include the LAMMPS run, and it looks pretty ok to me. Ok, so when you remove the setup, apples to apples, how do the run times compare? We're going to look at flux submit vs mpirun. This is as if you logged into an HPC center and wanted to run this with Flux directly or with mpirun; that's the command you would type, and this does include the LAMMPS run. This is the direct wrapper around LAMMPS, as close as we can get without being inside LAMMPS. And I want to point out that these are just these experiments, so we can't really generalize to everything, but for these experiments we did note that the Flux Operator is consistently faster. We think it might be related to the bootstrap or other MPI variables, but the difference is really small. Ok, so when we peel back another layer of the onion and we look just at the LAMMPS time reported by LAMMPS, so no wrappers, we again see that the differences are even smaller, but if you
kind of visually look at the medians, they're about 10% lower for Flux, and we think that for larger workflows this could potentially translate to cost savings. What did we learn? Well, we learned that the indexed job allowed the mini cluster pods to scale really nicely; very happy about that. We think that Flux's ZeroMQ bootstrap might be related to why it's a little bit faster, because the MPI Operator uses an SSH-based bootstrap. More work, of course, is needed to investigate performance beyond LAMMPS. Now, an important point of the entire talk, so listen up: the architecture of the Flux Operator allows multiple jobs to be run on the mini cluster, so we avoid the infamous etcd and API-server bottlenecks, and it enables high throughput. And finally, we want to point out that the MPI Operator does require that extra launcher node, and it could also benefit from using an indexed job. They seem pretty great to me already. We've learned so much at Experiment Empire: the Flux Operator has promise. I have some questions; I hope you do too. We need to take a quick stop at Reality Republic. So, this question: how do I submit a job? Did you really think that to run these experiments we applied a YAML file like a thousand times? Do you think that's how I want to spend my work day? Absolutely not. We actually ran these experiments using a tool called Flux Cloud. In Flux Cloud you define your experiments in a YAML file (yeah, I know, we can't escape the YAML, it's everywhere, so we still use it here; I'll show a sketch of such a file in a moment), and then there are just three commands, up, apply, and down, to bring everything up, run your experiments, and then bring everything down. So you can kind of work on other things, watch containers and logs, have a sandwich, have an avocado; it's super easy. And when you're done, all of your config files, data, and output are saved for reproducibility. Take a breath. Our vision for converged computing is not applying a bazillion YAML files; it's a comfortable, intuitive user interface. Ok, so one really awesome thing about being in Reality Republic is that we know reality is informed by vision, so while we're here, let's take a ride on the visionary vehicle. We're going to jump on and ask this question: how could we submit jobs? So I played a fiendish trick on you: I didn't tell you that if you don't give a command to the Flux Operator, the Flux Operator will actually bring up an interactive interface for you to submit jobs, to monitor your jobs in a table, or to check output logs, and it also serves a RESTful API that can be interacted with via an SDK. That is closer to our vision for this future of converged computing, and we are also thinking about some of these other things, coming soon to a theater (or a Kubernetes cluster) near you, so keep a watch out for us. Ok, on to the questions. You can use the headless service, and the hostnames that come with it, as a good way to get fully qualified domain names. If it's only a one-time thing, you can create an isolated pod to finish the setup, or you can reuse one that is already up; that is exactly what we do for the broker. And one thing is that maybe the Job API could eventually give us groups of pods that come up together. Ok: I need pods to come up in a particular order. Ok, so I don't have a solution for that; what we do is watch everything, and then proceed once everything has come up.
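As promised, here is a hypothetical sketch of what a Flux Cloud experiments file might look like. The field names below are illustrative guesses rather than the documented schema, so check the Flux Cloud repository for the real format:

```yaml
# Hypothetical experiments.yaml for Flux Cloud (illustrative fields only).
matrix:
  size: [8, 16, 32, 64]            # mini cluster sizes to sweep over
minicluster:
  name: lammps-experiment
  namespace: flux-operator
jobs:
  lammps:
    command: lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns
    repeats: 5                     # hypothetical: repeat each run five times
```

The workflow is then the three commands from the talk: flux-cloud up, flux-cloud apply, and flux-cloud down.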
So this is just the beginning of this journey of convergence, and what is so exciting is that there are more projects to be done and worked on. We need your help: if you want to get involved, please check us out on GitHub under Flux Framework; the operator project is there too. Flux Cloud is under converged-computing, and to learn more about Flux, check out flux-framework.org. And this is how to reach me, or any of my clones here apparently: by email, and I'm vsoch on all the social media. Thanks to these contributors for supporting us on our journey. Ok, KubeCon, I'm signing off. Michał, back to you. Thanks, Vanessa. Ok, so I also want to present pod failure policy, which is a recent effort of the Batch Working Group, currently in Beta, so if you have some ideas, your input is also welcome. Ok, so say you want to run a large workload, comprising hundreds or thousands of pods. When you run such a workload, many things can go wrong; pods are, after all, ephemeral, so a pod failure does not necessarily mean there is anything wrong with the workload itself. To some extent this was already addressed in the early API with the backoff limit, where you can specify the number of retries. However, in many scenarios, like this one, that is really problematic. If you set the backoff limit too low, then pod failures caused by disruptions, such as evictions, can exhaust the retries and terminate the whole workload even though there was no software bug. If, on the other hand, you set the backoff limit too high, then in the case of an actual software bug you get many unnecessary retries, wasting compute before the job finally fails. Ok, so this is problematic. What we would ideally like is this: retry in the case of infrastructure problems, so that all pod failures caused by disruptions are retried without counting them against the backoff limit, and fail fast as soon as there is evidence of a software bug. This means that in the case of a software bug your job terminates early, but it also means that your job is not terminated just because some of its pods were disrupted or evicted. So the solution that we propose is based on pod conditions, in combination with exit codes, actually. Pod conditions live in the pod status, where you have a list of conditions; here you can see an example pod condition. An important part of this effort is that we indicate which pod failures are caused by disruptions: we modified the Kubernetes control plane components to add a pod condition, called DisruptionTarget, which shows that the pod failed because of a disruption. So let's see: first of all, such a condition is added by control plane components, so for example by the scheduler in the case of preemption by a higher-priority pod, and a few more scenarios; we also modified the kubelet to add it in certain scenarios, for example eviction of the pod to reclaim node resources. But we are aware that this does not cover all the cases; there may be third-party controllers which have their own logic for terminating pods, etc. So the condition can also be added by third-party controllers, so that users can rely on pod failure policies to classify those failures as well.
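To make this concrete, here is a minimal sketch of a Job that combines both mechanisms; the container name, image, and the exit code 42 are illustrative choices, not from the slides:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: large-workload
spec:
  completions: 1000
  parallelism: 100
  backoffLimit: 3                  # fallback for failures no rule matches
  podFailurePolicy:
    rules:
      # Rule 1: a configuration bug is not worth retrying - fail fast.
      - action: FailJob
        onExitCodes:
          containerName: main      # illustrative container name
          operator: In
          values: [42]             # custom exit code meaning "bad config"
      # Rule 2: disruptions (eviction, preemption, ...) are retried without
      # counting against backoffLimit.
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never         # required when using podFailurePolicy
      containers:
        - name: main
          image: example.com/worker:latest   # hypothetical image
```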
This also opens the door to more fine-grained failure handling in the future, but that is a separate topic. Let's now walk through the example pod failure policy sketched above. A pod failure policy is a list of rules, and each rule specifies an action and the requirement a failed pod must match to trigger it; rules are evaluated in order and the first matching rule applies. In the first rule, the user wants to fail the entire job early in the case of a configuration problem. Configuration problems are indicated by a custom exit code, returned by the container when it detects an invalid configuration. In the second rule, the user simply wants to ignore and restart the pods that failed due to a disruption, and for that the rule matching the DisruptionTarget condition is used. Ok, here I also want to say a thank you to the people who were kind enough to share with us the results of their tests of pod failure policy. In their setup they also encountered disruptions, and they ran into very similar problems with pod failures being counted against the backoff limit; similarly, they used the rule matching the DisruptionTarget condition to retry such failures without failing the workload. To sum up, we looked at the Indexed Job feature and its use cases: for machine learning, for reinforcement learning, or for setting up an HPC environment, which Vanessa talked about with the Flux Operator. There are code examples inspired by the use cases, which I would like you to take a look at, and you can also take a look at the Flux Operator code. And finally, we looked at the new feature of interest, that is pod failure policy. But the list of new features does not end here; there are many more things and new features that we are working on now, to mention, for example, elastic Indexed Jobs. So if you are interested and want to participate in the discussions, the Batch Working Group is a good place to come. I would also like to encourage you to watch the session where Swati and Aldo talk about building a batch platform; and, in the same spirit, if you have problems with these features or want to present something you are thinking about or working on, come and join. And with that, I am happy to take your questions, and Vanessa and George will also be happy to answer them. Thanks.