So we have our next session here. We have Vaibhav Srivastav, and I'm sorry if I pronounce it wrong. Vaibhav is a data scientist and a master's candidate at the University of Stuttgart, and his topic is building petabyte-scale ML models with Python. So great, take it away. Perfect. Just a quick check, you're able to see my screen, right? Oh, yeah, everything works fine. Awesome. Perfect. All right, well, good morning to everyone. Today I'm going to be talking about building petabyte-scale machine learning models with Python. Before we get started, let's set some context. When it comes to building any machine learning or deep learning model, you want two things: you want to be able to train your model faster, and you want to reach a higher accuracy, or whatever metric you're tracking, as soon as possible. Because if an experiment is going to fail, you want it to fail fast; you don't want to wait, say, two or three days and only then find out that the experiment failed. These two charts here set the motivation quite well. On the left, we ran an experiment across a varying number of workers and benchmarked how fast the model was able to train. You can see quite clearly that with eight workers we got about a 5x speedup, which means we were able to train the model five times faster than we could have on just one device. What you see on the right is that as we increased the number of workers, we reached 80% accuracy on our benchmark faster: with one worker this took about 50 units of time, whereas with eight it was roughly 15 or 16, and so on. So this effectively tells us that we do not really want to run our experiments on just one machine, and if we have the capability to run them on multiple machines, then we should. Here's a quick walkthrough of what I'll be talking about. We'll first talk about what exactly distributed machine learning is. Then we'll talk about out-of-core machine learning. Then we'll look at a couple of ways to build scalable workflows, briefly touch on which one you should use and when, and then head over to Q&A. Feel free to put your questions in the chat and I can answer them later as well. So before we get into how to actually build a distributed machine learning pipeline, let's talk about what happens in a typical machine learning exercise. You have some sort of data, and you have a task. This task could be, for lack of a better example, stock price prediction: you have some data regarding the stock prices of, say, the S&P 500, and your task is to predict the price tomorrow, or the day after, whatever it may be. So that's your task, which is regression. Then you have some sort of a model, which could be a neural network, a linear model, and so on. And then you have a notion of quality, which is effectively the loss: how well is the model learning whatever it is that you're throwing at it? That's your loss function.
And then you have the optimizer, which optimizes this loss and essentially nudges the model in a different direction when updating the weights. In a typical neural network setup, this optimization is done using stochastic gradient descent, though you can use different optimizers and so on. But that's the gist of what a machine learning exercise looks like. Typically, this would all work synchronously: you first have data, you process it, then you pass your data to the model, you calculate the loss, and you repeat this over multiple epochs until your model is effectively trained. One step happens after the other. And while this is all well and good, why do we still need distributed machine learning? You could still just train on one device, right? Well, look at some of the most recent models. In the realm of natural language processing, we have GPT-J-level models, and billion-parameter models are kind of the norm now. You need massively large compute just to be able to load such a model, let alone fine-tune it or train it from scratch. So first of all, simply loading the model requires a lot of compute. And second, you might want to use a lot of data to train your model. In a typical speech recognition exercise, you might throw roughly 5,000 to 10,000 hours of audio data at your model for it to learn the patterns within the speech, and you won't really be able to do that with just one device. So you want to be able to scale your training across multiple devices, and that's what we're going to talk about in a bit. Beyond that, we want efficient computation for our algorithms: if something can be parallelized, we want it to be parallelized. That's the key over here. And we don't want to run into out-of-memory errors, or errors that keep us from getting to results faster. This is where out-of-core machine learning comes in. Out-of-core ML is basically a way to exploit external storage when you have large data and you cannot load all of it into your GPU or your RAM. You take batches of data and feed them to the model one at a time: the model learns on one batch, then you throw in another batch, and another, so the model effectively learns on small snippets, or partitions, of the data; you accumulate the gradients, or whatever the loss is, and optimize on that. This is something we use a lot when building large-scale models, or just to make sure we don't hit some sort of out-of-memory error while training the model.
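To make that concrete, here is a minimal out-of-core sketch using scikit-learn's partial_fit, which we'll come back to shortly. It assumes the data sits in a CSV file that is too large for RAM; the file name, the label column, and the chunk size are all hypothetical, so treat this as a sketch rather than a recipe.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()   # a simple linear classifier trained with SGD
classes = [0, 1]          # partial_fit needs the full label set up front

# pandas reads the CSV lazily, one chunk at a time, so memory stays bounded
for chunk in pd.read_csv("large_dataset.csv", chunksize=10_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    model.partial_fit(X, y, classes=classes)  # update weights on this chunk only
```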
So how does this look if we frame it the same way we did at the start of the presentation? Again, we have some sort of a dataset, but now we can partition it. Let's take two partitions: if I have 100 rows, then dataset partition one would have rows 0 up to 49, whereas dataset partition two would have rows 50 through 99. We still have the same task, and we still have the same model. The only thing that changes is the loss function: the loss is now computed on the first partition of the data in one case and on the second partition in the other, and the optimizer effectively just adds both losses together and optimizes on that. And we just looked at the case of two partitions, but we can have multiple partitions: four, six, ten, and so on. Each of these partitions can be worked on in parallel, so I can have one partition being trained on one GPU and another partition being trained on another GPU at the same time, and at optimization time I just add everything up. In this way we cut out the synchrony of waiting, and we parallelize the entire operation across the GPUs to train the model faster. In this two-partition case we get roughly a 1.65x to 1.75x speedup, because the only blocking step is the optimization step; everything else can be easily parallelized. This is effectively something that all deep learning and machine learning libraries now use. It's called a mirrored strategy: you mirror your model across devices and just pass each copy a different data partition. So how does this happen in real life? One way is by using partial_fit, which we'll cover in a bit. The second is to use a deep learning framework like TensorFlow, PyTorch, MXNet, or JAX (new frameworks keep coming) and use their distributed training APIs, which will effectively do this for you. You create a model, and then you tell it the device IDs, the GPU IDs, that you want your model to be pushed to and trained on. As long as each GPU can load your model and process the batch size you've given it, the model training will happen in a distributed fashion. partial_fit, on the other hand, is typically useful for low-resource setups: when you don't have five or six GPUs at hand, you're training a statistical ML model with, say, scikit-learn, and you have seven or eight gigabytes of data that your RAM just cannot fit in one go. In that case, you use partial_fit, where you feed in batches of data one by one, synchronously, and the model trains on those smaller batches and keeps updating the weights. Both of these methods are equally used in practice. If you're more deep-learning focused, you'd want to use, say, PyTorch or TensorFlow or JAX nowadays. But if you're building statistical models, you'd want to use something like Dask along with scikit-learn to do partial fits for large-data use cases, as in the sketch below.
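As a rough illustration of that Dask-plus-scikit-learn route, here is a minimal sketch using dask-ml's Incremental wrapper, which feeds each chunk of a Dask array to the wrapped estimator's partial_fit in turn. It assumes dask and dask-ml are installed, and the array shapes and chunk sizes are made up for the example.

```python
import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

# A lazy 1,000,000 x 20 array in 100,000-row chunks; nothing is in RAM yet
X = da.random.random((1_000_000, 20), chunks=(100_000, 20))
y = da.random.randint(0, 2, size=(1_000_000,), chunks=(100_000,))

# Incremental calls partial_fit on each chunk, one after the other
model = Incremental(SGDClassifier())
model.fit(X, y, classes=[0, 1])
```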
So now we get to the point of how we actually build these scalable machine learning workflows. There's a first scenario where you have some sort of legacy code base with a machine learning workload in scikit-learn, statsmodels, or something like that. And there's a second one where you're building a new experiment from scratch. In the first case, with scikit-learn, the most accepted approach right now is to use something like Dask to create data batches, and then use the incremental pipelines provided by Dask to train your model on those smaller batches. For more deep-learning-based models, or for new models, you use something like tf.data, or its equivalent in whichever framework you're using, to create data batches, and you use something like torch.nn.parallel.DistributedDataParallel or tf.distribute.Strategy to load your model onto separate GPUs and train it there. Before I get to these links: I put some Colab links in the slides which you can quite easily run to see how distributed training works with Dask and with TensorFlow. I won't go through those right now in the interest of time, but I do want to quickly show you how easy it is to do distributed model training with PyTorch. This is the DDP page, DistributedDataParallel, which is a module within torch.nn, and it's relatively simple. You first just set the device. That could be one GPU or multiple GPUs; it could be three, four, however many GPUs you have in your cluster. You pass all the IDs for those, and you can set them up in the CUDA_VISIBLE_DEVICES environment variable and so on. Then you create the distributed group, you wrap your model as a DistributedDataParallel model, you provide all the device IDs that you want the model to be trained on, and you provide an output device, which can be any of the GPUs, or the CPU as well. And then you train the model. So it's literally three lines of code that you have to add: if you have legacy PyTorch code, you can use DistributedDataParallel to parallelize, or rather distribute, your model training across multiple GPUs, as in the first sketch below. Similarly, if you have TensorFlow code, you can use something like tf.distribute.Strategy. There are multiple strategies you can use with TensorFlow, depending on how you wrote your code. If you wrote your code using Keras, you can use something like MirroredStrategy, and there are different strategies for different use cases. MirroredStrategy is the one we spoke about earlier: you load the model along with a data partition on each of the GPUs, and once each replica has gone through its particular batch and has a loss value, the strategy aggregates the losses across all the GPUs and then updates the model weights. If you're working with TPUs, there's a separate TPUStrategy, and so on. This is as easy as literally just using tf.distribute, calling the mirrored strategy, and passing in your GPU IDs, as you see over here and in the second sketch below; that's pretty much it. Once you do that, it automatically takes the model from the code you've written and mirrors it across the devices you've passed to it.
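Here is a minimal sketch of those PyTorch steps. It assumes a single node where the script is launched once per GPU with torchrun, which sets the rank environment variables for us; the toy model and the nccl backend choice are placeholders.

```python
# Launch once per GPU, e.g.: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")     # create the distributed group
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun per process
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(20, 2).cuda(local_rank)  # stand-in for your model
    ddp_model = DDP(model, device_ids=[local_rank], output_device=local_rank)

    # ...run your usual training loop over ddp_model here; gradients are
    # all-reduced across processes automatically on backward()...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```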
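And here is the equivalent minimal sketch on the TensorFlow side; the device list and the toy Keras model are illustrative.

```python
import tensorflow as tf

# Everything created under strategy.scope() is mirrored across the listed GPUs
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(...) now splits each batch across the GPUs and aggregates gradients
```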
And you can see that quite clearly in the Colabs I've put into these slides. The slides are already up on the schedule, so you can just go to the talk, click to download the slides, and you should be able to access the Colabs as well. So now comes the million-dollar question. We spoke about Dask, we spoke about PyTorch, we spoke about TensorFlow, so which one should you use, and when? To this, I put forth the oft-cited quote: there are no solutions, there are only trade-offs. Typically, if you're doing something from scratch and it's deep learning related, it's a good call to use something like PyTorch or TensorFlow for your experiments, because they provide very nice APIs for distributing your model across devices. If you're building a statistical model, or just classical machine learning models with scikit-learn, then it's a very good idea to parallelize it with Dask and avoid out-of-memory errors and so on. That would be my recommendation, but then again, it depends a lot on what your exact use case is: whether you have access to GPUs or only CPUs, whether you're constrained by RAM, and so on. But typically these two solutions, or three if you count TensorFlow and PyTorch separately, are the ones that will help you out. So do look at all three options depending on what your use case is. With that, thank you. Thank you so much.