Alright, hi everyone. I'm Flaviu, and together with my friend JP we're very excited to talk about some of the infrastructure that goes into fine-tuning large language models, such as Meta's Llama 2, on Kubernetes with Argo Workflows and Hera.

We'll do some quick intros. Again, my name is Flaviu. I'm a staff engineer at Dyno Therapeutics, a biotech company in Watertown, Massachusetts, where we focus on designing delivery mechanisms for gene therapies.

And my name is JP Zivillich. I am the CTO and co-founder of Pipekit, where we offer expert services around Argo and, additionally, provide a control plane on top of Argo Workflows.

Alright, a quick outline. First we'll talk about the motivation for this talk, who it's for, and what you should expect to get out of it. Then Flaviu will walk through what foundation models are, the process of fine-tuning, and why you'd be interested in fine-tuning. I'll handle the infrastructure piece, and lastly I'll hand it back to Flaviu, who will do a full walkthrough of the Hera code needed to actually do the fine-tuning.

Cool, so the motivation. We wanted to show in this talk how to do scalable, distributed fine-tuning for LLMs. The target audience is individuals, teams, and companies who want to use LLMs but need some additional customization. Say, for example, you wanted to customize an LLM to reproduce a certain tone of voice, maybe Alan Arkin from the 2012 movie named Argo, or Christopher Walken; that would be a good use case. Also, if you're just interested in distributed model training generally, you're going to get a lot out of this talk.

Alright, we'll get started with foundation models, which is a relatively recent term introduced to describe generally available, open-source models that have been trained on vast amounts of data for tasks such as image generation or text generation.
These are generally prohibitively expensive to train: they require a lot of in-house scientific and engineering expertise, and they take a lot of data and a lot of time. That's essentially the motivation to take these models off the shelf and fine-tune them on your own data, which is a process I'll explain in a minute. It boils down to taking these models and feeding them your own data so that they improve at a particular task your business might be interested in, for the applications you build for your customers. This can be applied to a variety of domains: if you have medical notes, support tickets, access logs, or anything like that you want these models to perform Q&A on, they're very good for that.

Fine-tuning, generally speaking, is a transfer learning technique. You're taking the knowledge embedded in a model that has already been trained on a particular task and transferring it to your specific domain; by fine-tuning the model, by giving it more of your own data, you're making it better at the task you're interested in. For example, Llama 2 hasn't been trained on your proprietary data, but you can take this open-source model and feed it your own examples so it does better at your particular application. Multiple stages go into fine-tuning, such as setting up your infrastructure, getting access to these specific models, and then feeding them your own data. In this talk, as JP mentioned, we're going to focus on the infrastructure piece, which JP will start describing now.

Thank you, Flav. Alright, given this is ArgoCon and KubeCon, we will need a Kubernetes cluster; I think that one is a given. We'll need some GPUs, on which we're going to do the actual model training. Within that Kubernetes cluster, we need principally three things: a custom storage class (we'll delve into the reasons for that in a minute), the aforementioned GPUs, and lastly Argo Workflows installed, which is the workflow orchestration system of choice that will let us string all of these pieces together. Outside of the Kubernetes cluster, we're going to need a Hugging Face account. I saw the Hugging Face folks earlier this morning; they have very distinctive shirts, love the logo. And lastly, we need approval from Meta to use Llama. I think that's a pretty simple process, but it is a checkbox you need; if you're using another LLM, you might not necessarily need that.

Cool, so we'll jump into a quick architecture diagram of how this is going to look. The data scientist or data engineer, in this case the user with the exclamation point on this slide, submits the workflow that Flaviu will demonstrate to the Argo Workflows server. The Argo Workflows server is a wrapper for the Argo Workflows controller, which manages workflow state across the Kubernetes cluster and is going to be creating the workflow.
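For reference, here's roughly what that submission step looks like in code: a minimal sketch assuming Hera's v5 API, with a placeholder server address and a trivial stand-in template rather than anything from the talk.

```python
# Minimal Hera submission sketch (assumed v5 API; host is a placeholder).
from hera.workflows import Container, Workflow, WorkflowsService

with Workflow(
    generate_name="llama2-fine-tune-",
    entrypoint="fine-tune",
    namespace="fine-tuning",
    workflows_service=WorkflowsService(host="https://my-argo-server:2746"),
) as w:
    # Stand-in template; the real fine-tuning templates come later in the talk.
    Container(name="fine-tune", image="alpine:3.18", command=["echo", "hello"])

w.create()  # POSTs the workflow to the Argo server; the controller takes it from there
```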
This workflow is going to consist of three parts. The first is creating a distributed key-value store that handles the metadata for the model training. In this instance we've chosen to use etcd; to clarify, we are not using the etcd that comes built in with Kubernetes, but standing up a separate etcd instance. etcd is really just a distributed key-value store, so why not use it for things other than storing Kubernetes state? Second, we're going to be creating four nodes, each with four GPUs. To be specific, we're talking about server nodes, not nodes within an Argo workflow, so these are four GPU instances. It could be more, but we're doing four for this example. And lastly, at the end, we're going to delete all of the resources, treating each of these workflow runs as ephemeral.

Next, a little more on the distributed key-value store. The problem is that we want to track which shards of the model have been trained on which sections of the dataset, and make sure that each set of GPUs can share metadata with the others. So we want a distributed key-value store, and it could be any key-value store. We chose etcd in this instance because we're familiar with it, there is existing PyTorch documentation on how to use etcd, and we found that spinning it up on Kubernetes was relatively easy. Again, this is separate from the etcd instance Kubernetes itself uses, so we'll refer to it as the training etcd instance for clarity's sake. Alright, next I'll hand it to Flav to talk about the workflow steps.

Alright, so now that we have that etcd deployment available in the cluster, which we'll cover how to do in a few minutes, I wanted to illustrate what happens with these containers once they spin up. Each container here mounts four GPUs, so we have 16 GPUs in total. Each container talks to etcd the moment it spins up, to announce its existence and readiness, and it waits until all of its peer containers have spun up as well. Once that's done, all of them start downloading the data and the model. The data will be sharded across the independent GPUs, and the model will also be sharded across the different GPUs, because a model such as Llama 2, with 7 billion parameters, does not fit on its own on a single GPU. There are different approaches you can use to take chunks of the model, put one chunk at a time on a GPU, train it, and pass data through just that specific shard.

I wanted to show an example of how this actually works in PyTorch, and they have an amazing diagram that illustrates exactly this, so we're going to walk through it. This is taken directly from the PyTorch documentation. The process is called fully sharded data parallel, or FSDP, and in this example we have two parallel processes processing data and different chunks of the model we're training. At the top we have a single GPU, and at the bottom we have another GPU. Both pull from the same dataset, one example at a time, and a chunk, or model shard, is loaded onto each GPU. There's some synchronization that happens across the GPUs, which is where etcd is useful, because it stores the metadata for communication and synchronization across the different ranks. Data is then passed through that specific chunk. The details of the forward and backward passes and the gradient computation don't really matter for this diagram; the take-home message is that at different stages in training there are different synchronization steps, such as the all-gather of weights, which ensures the parameters are consistent across the GPUs. Because the GPUs process different slices of the data, they end up with different weights, which have to be gathered and synchronized across the ranks. And once we actually modify the network in the learning process, we have to sync the so-called gradients as well. Once a specific chunk is done, it gets offloaded, optionally to CPU memory, and the next chunk can be processed.
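To make those mechanics concrete, here is a minimal FSDP sketch. It assumes torchrun has already launched the process and set the usual environment variables; the model and data are stand-ins, not the talk's actual Llama 2 training loop from llama-recipes.

```python
# Minimal FSDP sketch; run under torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set
# and the rendezvous (here, via etcd) has already been brokered.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")  # reads the env vars torchrun sets
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in model; llama-recipes wraps Llama 2 in the same general way.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()
model = FSDP(model)  # parameters are sharded across all ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for _ in range(10):
    batch = torch.randn(8, 4096, device="cuda")  # each rank sees its own data shard
    loss = model(batch).sum()  # forward: each FSDP unit all-gathers its weights
    loss.backward()            # backward: gradients are reduce-scattered across ranks
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```

Under torchrun, each of the 16 processes runs this same script; FSDP handles the all-gathers and reduce-scatters described in the diagram.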
Now, of course, we can easily spin up all of the resources and infrastructure necessary for running this on Kubernetes, but we also have to take it all down, because it's very expensive. I'll pass it to JP to show how we do that.

Aren't you glad we scheduled a very light and fluffy talk to be the first one right after lunch? Alright, I'm going to go over the teardown of the workflow you just saw. We treat each of these workflow instances as ephemeral, so walking through the steps in the teardown: first, the training etcd instance is torn down at the end of the workflow using an exit handler in Hera, or Argo Workflows. For those not familiar with Hera or Argo Workflows, an exit handler is a step that runs at the end of a single workflow run. It can be a cleanup step, a teardown, or you can tell it to do other things like post to Slack, and so on, but whether the workflow succeeds or fails, it is going to execute. Second, the cluster autoscaler tears down the GPU nodes, since they're no longer needed. This is a pretty normal Kubernetes concept: we have several nodes that will no longer have pods scheduled on them, we don't need them anymore, and we don't want to get billed for them. Next, we're guaranteed that the teardown happens regardless of the success or failure of the workflow run itself, again because we're using that exit-handler concept. And as a general rule, we want to treat all of these workflow runs as being as ephemeral as possible. There are some situations in which you might not want to do that, which we'll cover a little later, but for now we'll keep them ephemeral.
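As a rough illustration of the exit-handler concept, here's a minimal sketch assuming Hera's v5 API; the images and the label selector are placeholders, not the talk's actual cleanup code.

```python
# Exit-handler sketch (assumed Hera v5 API; images and labels are placeholders).
from hera.workflows import Container, Workflow

with Workflow(
    generate_name="ephemeral-training-",
    entrypoint="train",
    on_exit="delete-etcd-resources",  # runs whether "train" succeeds or fails
) as w:
    Container(name="train", image="alpine:3.18", command=["echo", "training..."])
    Container(
        name="delete-etcd-resources",
        image="bitnami/kubectl",  # placeholder cleanup image
        command=["kubectl", "delete", "statefulset,service", "-l", "app=training-etcd"],
    )
```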
Alright, next we'll hand it back to Flav, who's going to walk you through the Hera code to accomplish all this.

Alright, so now we step down to the deepest level and go through the code used to schedule all of these resources, spin them up on Kubernetes via Hera and Argo Workflows, and talk about some of the requirements. Of course, you can access all of this; there's a public repository with all of the code. And for those of you who are still stuck in YAML land: welcome to Python. It's much nicer here.

Very briefly, Hera is a Python SDK for Argo Workflows that lets you set up basically your own mini platform, by using things like global configs and hooks to set up everything your internal users might need, such as authentication, the host, the tokens, and so on, so they don't have to worry about it. I wrote such a wrapper just for the purposes of this talk, and we're going to go through it now.

There are requirements, such as setting the host of your Argo server and your token, in case one is necessary; of course, you can connect through localhost as well. You need a Kubernetes namespace where all of these resources will be provisioned, and in this case I created a single Dockerfile that I'm using for all of my resources, which is set globally here. Then I have some hooks that intercept any containers that are created. A container here maps one-to-one to the concept in Argo Workflows: it is a Docker container created by Argo Workflows. I'm always setting an image pull policy of Always, and I'm also adding the necessary tolerations, because if you're using specialized infrastructure such as GPUs, Kubernetes will often have taints on those nodes, and tolerations must be set on your pods for them to be schedulable there. The demo was actually run on GKE, so I'm adding a specific node selector here; if you have a different cloud provider this might change for you, but it is easily adjustable for your own infrastructure. And I'm also adding a very important bit: an emptyDir volume mounted at /dev/shm. This is the shared memory space of the node, which is required for inter-GPU communication on the same Kubernetes node. If you don't set this, you'll get things like Python bus errors, which are incredibly hard to debug.
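A sketch of what such a wrapper can look like, assuming Hera's v5 global_config and its pre-build hook mechanism (and assuming the hook can be applied to Container templates); the host, image, and accelerator values are placeholders.

```python
# Wrapper sketch (assumed Hera v5 API; all values are placeholders).
from hera.shared import global_config, register_pre_build_hook
from hera.workflows import Container, EmptyDirVolume
from hera.workflows.models import Toleration

global_config.host = "https://my-argo-server:2746"
global_config.token = "..."                           # if your server requires one
global_config.namespace = "fine-tuning"
global_config.image = "my-registry/fine-tune:latest"  # the single shared Dockerfile image

@register_pre_build_hook
def prepare_gpu_container(container: Container) -> Container:
    container.image_pull_policy = "Always"
    # Let the pod schedule onto tainted GPU nodes.
    container.tolerations = [
        Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
    ]
    # GKE-specific node selector; adjust for your cloud provider.
    container.node_selector = {"cloud.google.com/gke-accelerator": "nvidia-tesla-a100"}
    # Shared memory for same-node inter-GPU communication; without this,
    # training dies with hard-to-debug bus errors.
    container.volumes = (container.volumes or []) + [
        EmptyDirVolume(name="dshm", medium="Memory", mount_path="/dev/shm")
    ]
    return container
```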
The next thing I'll talk about is how we spin up the etcd resources. One of the great things about Argo Workflows is that when it doesn't have a primitive for something, such as a Service that you want to spin up dynamically, you still have the liberty to use a resource template: you give it a specific YAML definition and it will just create it for you. So that's what I'm doing here. I'm creating a load balancer for the etcd instance that JP described, and then I'm defining an etcd StatefulSet that creates the replicas, the independent etcd workers. These mount an SSD, which is, well, not strictly required, but recommended by etcd, because etcd writes its data to disk, so the performance of etcd is tightly coupled with the performance of the disk itself. And then, finally, there's a step for deleting those resources.

I also have this magic container that runs a very ugly bash command. The reason I have it is that during training we need the IP of the load balancer, and getting that IP actually provisioned takes a bit of time, so I added a small container that waits for that IP to become available. All of these are independent components we're covering right now; we'll get to how we put them together into a DAG on Argo Workflows in a second.

I'll also show you really quickly how the SSD is defined. We're using an SSD on GKE with a volume binding mode of WaitForFirstConsumer, so that the disk is only provisioned when a specific pod actually wants to mount it. Otherwise you can get pods and disks in different zones, in which case the disks aren't mountable.
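For illustration, here's what that can look like through a resource template in Hera, using the SSD StorageClass just described as the manifest; a sketch assuming Hera's v5 Resource API and GKE's CSI provisioner (the etcd Service and StatefulSet are created the same way, with their own manifests).

```python
# Resource-template sketch (assumed Hera v5 API; manifest targets GKE).
from hera.workflows import Resource

ssd_storage_class = Resource(
    name="create-ssd-storage-class",
    action="create",
    manifest="""
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd
provisioner: pd.csi.storage.gke.io   # GKE's CSI driver
parameters:
  type: pd-ssd
volumeBindingMode: WaitForFirstConsumer  # bind only when a pod mounts it,
                                         # so disk and pod land in the same zone
""",
)
```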
Alright, so finally, the actual training workflow. Before we define the workflow and all of its dependencies, I'll show you what we create for the independent containers we talked about previously, the ones with four GPUs attached. We have a fine-tuning container with some environment variables, such as the Hugging Face token JP mentioned. It has an image, and we're using torchrun as the core command. torchrun is part of PyTorch; it's what actually runs the script and does the message passing and synchronization between the GPUs. It has specific flags that tell PyTorch how many GPUs and how many ranks there are, that is, how many containers or nodes, which is specified through --nnodes. In our case we have four, and of course we have four processes per node, one per GPU. We specify that the rendezvous backend is etcd; the rendezvous backend is used by PyTorch for synchronization between the different ranks. We have the etcd IP, which is passed in as a parameter through Argo Workflows, and we have some extra parameters representing the ID of the training session, which is a unique identifier in etcd for all of the metadata of the training run. And finally we have the actual fine-tuning script, which we borrowed from llama-recipes; it contains the actual training loop for the model and the download of the dataset. We're focused on the infrastructure aspect here; if you're curious about the details of the training loop, llama-recipes is a great repository for those.

Okay, back to training. We have some inputs defined, which are the input parameters to this specific job, and of course we have the actual resources we need: four GPUs, quite a bit of memory, 120 gigabytes, and several CPUs. We also mount some volumes that are dynamically provisioned. This container actually turns into an Argo Workflows template inside the workflow itself, so if you just used a single statically provisioned volume, all of the container copies would mount the same volume. Instead, you need volume mounts that reference volumes provisioned dynamically through the workflow.

So here's where we define the workflow through Hera. We have a volume claim template defined for every container we create. Here's where we say: give me a persistent volume claim called rank-i, for i from zero through three, each with 20 gigabytes. This is useful if you want to save the model to disk after you train it, to put it in your own buckets or registries or something like that. This tells Kubernetes that four of these volumes should be provisioned, and the volume mount will mount the independent volumes to the right container based on the name.
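Putting those pieces together, the fine-tuning template might look roughly like this; a sketch assuming Hera's v5 API, with an illustrative script name, parameter names, and volume naming, and torchrun flags taken from PyTorch's documented elastic-launch options.

```python
# Fine-tune template sketch (assumed Hera v5 API; names are illustrative).
from hera.workflows import Container, Parameter, Resources
from hera.workflows.models import VolumeMount

fine_tune = Container(
    name="fine-tune",
    inputs=[
        Parameter(name="rdzv-id"),
        Parameter(name="node-rank"),
        Parameter(name="etcd-ip"),
    ],
    command=["torchrun"],
    args=[
        "--nnodes", "4",             # four GPU server nodes
        "--nproc_per_node", "4",     # four GPUs on each node
        "--rdzv_backend", "etcd",    # rendezvous via the training etcd
        "--rdzv_endpoint", "{{inputs.parameters.etcd-ip}}:2379",
        "--rdzv_id", "{{inputs.parameters.rdzv-id}}",
        "finetuning.py",             # training loop borrowed from llama-recipes
    ],
    resources=Resources(gpus=4, memory_request="120Gi", cpu_request="16"),
    volume_mounts=[
        # One dynamically provisioned PVC per rank, matched by name.
        VolumeMount(name="rank-{{inputs.parameters.node-rank}}", mount_path="/workspace")
    ],
)
```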
So, finally, we get to the part where we define our dependencies. We have a workflow, and we define a rendezvous ID, which is the training session ID that goes into etcd. We define an Argo Workflows DAG called fine-tune. The first step is to create the SSD storage class that's required for etcd. In parallel, in this list here, we create the etcd StatefulSet and the load balancer. Then we have the independent container I showed you, which waits for that load balancer IP to become available; it has an argument called etcd-service-name, so that job takes as input the name of the service it should watch for the IP to become available. And finally, in parallel, in a loop, we call fine-tune four times, or rather up to the number of nodes, and we pass in the rendezvous ID, the node rank, the node volume that should be mounted, and the etcd IP that each container should connect to. And, as JP mentioned, we have to actually delete all of the resources spun up dynamically for the training session, so we call delete-etcd-resources at the end as an exit DAG in Argo Workflows, which executes irrespective of the result of the workflow.
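A sketch of that DAG wiring, assuming Hera's v5 DAG API; the template and parameter names are illustrative, and the referenced templates are assumed to be defined on the workflow as shown earlier.

```python
# DAG fan-out sketch (assumed Hera v5 API; template names are illustrative).
import uuid
from hera.workflows import DAG, Task, Workflow

with Workflow(
    generate_name="llama2-fine-tune-",
    entrypoint="fine-tune-dag",
    on_exit="delete-etcd-resources",  # cleanup runs regardless of the result
) as w:
    rdzv_id = str(uuid.uuid4())  # unique training-session ID stored in etcd
    with DAG(name="fine-tune-dag"):
        ssd = Task(name="ssd", template="create-ssd-storage-class")
        etcd_ss = Task(name="etcd-statefulset", template="create-etcd-statefulset")
        etcd_lb = Task(name="etcd-load-balancer", template="create-etcd-load-balancer")
        wait_ip = Task(
            name="wait-for-ip",
            template="wait-for-etcd-ip",
            arguments={"etcd-service-name": "training-etcd"},
        )
        ranks = [
            Task(
                name=f"fine-tune-rank-{i}",
                template="fine-tune",
                arguments={"rdzv-id": rdzv_id, "node-rank": str(i), "etcd-ip": wait_ip.result},
            )
            for i in range(4)  # one task per GPU node
        ]
        ssd >> [etcd_ss, etcd_lb]
        [etcd_ss, etcd_lb] >> wait_ip
        wait_ip >> ranks

w.create()
```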
And then finally we create the training workflow. I'm not going to run it right now, because, first of all, these are expensive resources; second, they take a lot of time to actually spin up, and it would be dreadful to wait for them to come up. So we'll go back to the talk.

None of this would have been possible without a lot of community involvement, so we're very thankful for the contributions to Hera, Argo Workflows, PyTorch, and the open-source release of Llama 2, which are all amazing resources that you should definitely check out. We'd love to hear your feedback, and here's a QR code for the repository as well. Thank you to the crew for welcoming us and our talk, and thank you everyone for your attention. I think we have about three and a half minutes if anyone wants to ask questions. Thanks, everybody. I see you want to ask a question, yep; I'm going to hand you my mic.

Question: How many different instances have you tried running with this? Because I imagine it's now easy to spin up many different models.

When you say instances, do you mean different LLMs, like Llama versus other models? Yeah, we've only focused on training Llama 2. The training loop you'll find in llama-recipes provides a lot of examples of how you can wrap other LLMs in a similar infrastructure, with PyTorch and all of this, so you should be able to train from scratch, if your business is interested in that, or fine-tune on your own data using this setup as well, regardless of the model or any specific restrictions on which models can be used. That said, this type of infrastructure is only useful for large models. If you don't have a model that benefits from sharding independent chunks onto separate GPUs, if the model is small, don't use this; it's going to be very costly relative to the actual requirements of the model.

Looks like we've got another one. Yep.

Question: Imagine your GPUs are mostly reliable, and you now have a very large, expensive job, and GPU number three just failed. You don't want to lose the work you had on the other GPUs. So what's the story here? How do I rerun exactly that shard, and that shard only?

There's actually a retry flag on torchrun itself, and I believe it keeps track of the failures within a specific rank, one of these specific containers, and it will retry that specific job, continuing fine-tuning from that point. There's also checkpointing you can use, and I believe it takes advantage of that as well: it saves the model to disk and then reloads it upon retrying.

Alright, 50 seconds, any other questions? Going once, twice, thrice. Again, thank you guys so much for coming out and checking out the talk. We really appreciate y'all.