Welcome to my talk. My name is Daniel Valdivia, I'm an engineer here at MinIO, and I'm going to talk about writing machine learning pipelines against object storage.

Let me start with the premise of the problem I'm trying to address. Machine learning has become a very relevant area, and it keeps growing, building on top of the history that big data brought to the table. What happens today is that machine learning engineers work on their pipelines, traditionally, on their own personal computers, which may have some CPUs and GPUs. There's a dataset lying around on that machine, and they're so used to having it in local storage that they bring their tools and build their machine learning algorithms right there. They perfect it and say: OK, I have a model, this will train, this will work, that's it.

Now, the problem is that after they've built this pipeline, they grab all of it and throw it over the wall. Throw it over the wall to whom? To the DevOps engineer. That person is in charge of picking up these files. Sometimes they just get a bunch of files, some source code, and they need to figure out how to containerize it or get it running on top of the infrastructure. That has traditionally been the model: data scientists and machine learning engineers build the pipeline, but they don't take into account how the algorithm is actually going to be deployed at the end, or how it's going to be scaled. And that is part of the problem.

When we get to more modern machine learning pipelines, the problems get a little heavier. What do I mean by that? (My slides are completely out of sync.) Say you're building a larger model; the two trendiest kinds right now are diffusion models and large language models. For example, suppose you were trying to train your own version of LLaMA. Meta published the paper, the dataset is public, and that dataset is 4.71 terabytes in size. That's not a dataset you can even fit on a laptop. Even if the engineer tried to work with this dataset locally, they'd have a hard time; they'd need a bigger computer just to hold it. Moreover, to train on top of it you need expensive kinds of nodes with lots of GPUs, TPUs, or HPUs.

So as these models scale up, it's not only that the dataset can no longer fit on the engineer's machine; the dataset may also be too big to load onto every single node you'll be using for training. That is the problem I'm trying to trace with this talk: if you design the pipeline based entirely on object storage, the problem starts solving itself, from inception on the engineer's side all the way to the production side.

So what's the main difference I want to establish at the outset? Training infrastructure, whether it's a bunch of GPUs, Tensor Processing Units, or Habana processing units, is very expensive. Storage infrastructure, however, usually tends toward the cheaper side.
Companies build their storage infrastructure to match their use case. You may be building for very large datasets because you're doing analytics on big data, so you may go for hard disk drives, which are cheaper and deeper. Perhaps you're training on the cutting edge and you need NVMe because you're training at very high speed; you can build your storage infrastructure for that too. Either way, you can completely separate it from compute. At MinIO, we've been making this case since the Hadoop days, since the big data days: decoupling compute and storage lets you take each in a different direction depending on how you want to treat your algorithms. You no longer need to build a very expensive infrastructure, and you don't need to make sure every single node can hold these very large datasets. If you were doing self-driving-car training, a multi-petabyte dataset won't fit on every single node; it's no longer workable for the scientist to assume the dataset will be present locally.

So, moving on to the solution, or rather, comparing the two: compute infrastructure tends to be pricier than storage infrastructure, so let's leverage the storage infrastructure to reduce the cost of training.

If we look at the shape of a basic machine learning pipeline, almost every one is built from the same stages. Someone extracts the data; this may come out of production logs, millions and millions of them. After you extract the data you care about, you preprocess it, because not everything in a log is relevant to the algorithm. You put it into a format that makes sense to the model you're training and run it through a training loop. The training loop crunches through the data and goes into evaluation: if evaluation says the model isn't good enough yet, you go back and keep training. At some point, when evaluation says this is a great model, you deploy it to production, and then you go acquire more data. This is what's known as machine learning operations, MLOps. If you standardize on this model, you can make every stage an independent component that can be optimized and versioned on its own.

Why is it important to follow this pattern when you're developing a machine learning algorithm? Say you have a version-one model, a state-of-the-art model that recognizes zeros and ones. Great: your engineers build it, deploy it, and it works amazingly. However, every now and then someone invents a new digit. The number two gets invented, and your model doesn't recognize it; to the model it's just an ugly zero. Now, if you keep acquiring data, you'll notice there's a new kind of data the algorithm needs to detect. If the pipeline has been established, just rerunning the new data through it, whether it's labeled or unlabeled, will get the model to start recognizing it properly. That's why a proper machine-learning-pipeline philosophy is so important.
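To make the independent-stage idea concrete, here is a minimal sketch of what one stage's contract could look like, with the MinIO Python client standing in for the object storage layer. The endpoint, credentials, environment variables, and bucket and object names are all illustrative assumptions, not anything prescribed in the talk:

```python
# A minimal sketch of one pipeline stage that reads its input from object
# storage and writes its output back, so every stage stays independent.
# All names (env vars, bucket, object keys) are hypothetical examples.
import os

import pandas as pd
from minio import Minio

client = Minio(
    os.environ.get("S3_ENDPOINT", "localhost:9000"),
    access_key=os.environ.get("S3_ACCESS_KEY", "minioadmin"),
    secret_key=os.environ.get("S3_SECRET_KEY", "minioadmin"),
    secure=False,  # local development; use TLS against real infrastructure
)

def preprocess_stage(bucket: str = "mlops") -> None:
    # Pick up exactly where the extract stage left off...
    client.fget_object(bucket, "extract/raw.csv", "/tmp/raw.csv")

    # ...keep only what the model cares about (placeholder logic)...
    pd.read_csv("/tmp/raw.csv").dropna().to_csv("/tmp/clean.csv", index=False)

    # ...and leave the result where the training stage will look for it.
    client.fput_object(bucket, "preprocess/clean.csv", "/tmp/clean.csv")

if __name__ == "__main__":
    preprocess_stage()
```

The point of the contract is that the extract stage, the training stage, and this one never share a filesystem or a long-running process; they only agree on bucket paths.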
And how does that carry into the open source paradigm? For the data scientist, it means they can start building modules: OK, this is a module that only extracts data. I build it once, I keep multiple versions of my extraction step, and I improve it over time.

So, going into the stack, here's how you can actually build this. The data scientist can work on all of it on their laptop; data scientists and machine learning engineers like working on everything locally. In this example I'm using PyTorch to build the algorithm. They like to say: OK, I'll extract my data through pandas, sampling from some source, then I'll preprocess with PyTorch, train on top of PyTorch, and so on. Where the data is acquired from is not relevant to me as a data scientist; I only care about how I'm designing my algorithm. Notice the mix here: I'm using one core technology, Jupyter, but I'm using each notebook to model a single stage of the pipeline. The way I glue all of these together is with another open source project called Elyra. Elyra is great for taking each of these individual notebooks and gluing them into a single pipeline.

The advantage of doing it this way is that the DevOps engineer can now grab this pipeline and just hit run on top of a distributed infrastructure such as Airflow or Kubeflow, transforming all of these individual notebooks into separate containers that each do their job. As long as the data scientist or machine learning engineer designs each stage of the pipeline as a separate component that can be taken out and put back in, it becomes easy to run.

And this can keep evolving. I just said we can run it on Airflow; of course, I can also take it and run it on Kubeflow. Then I can start replacing the components that matter with components that go to the scale I need. If I'm working from production logs, maybe I need to query a multi-petabyte dataset with Trino, extract it, and put it in some other location; then bring in a heavier processing framework like Spark to crunch the data into the format I care about. I keep the machine learning algorithm running on PyTorch, the original one, and do the evaluation on PyTorch. When it's ready, I put the model somewhere and have KServe serve it, and I build the pipeline so that Kafka keeps capturing more data, so the algorithm is always ready and up to speed.

But if you look at all these arrows, we're missing the glue. The glue we're proposing for this solution is the object storage, the MinIO layer. Wherever you keep the data, you can just keep it on object storage. As you extract from multiple petabytes of logs, pulling out the data you're looking for through your ETL, you put it back in object storage. You don't need to keep it on a long-running process or a long-running node; the next stage can just pick up from where the previous stage left off.
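As a sketch of what that looks like from the data scientist's laptop: pandas can talk to an S3-compatible endpoint directly through s3fs, so the same notebook code runs against a local MinIO or the company's storage. The endpoint, bucket, and column names below are made up for illustration:

```python
# Sketch: a notebook stage that reads straight from object storage instead
# of a local file, so the exact same code runs on a laptop or in a cluster.
# Assumes s3fs is installed; all names here are illustrative.
import pandas as pd

storage = {
    "key": "minioadmin",
    "secret": "minioadmin",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},  # local MinIO
}

# Extract: pull logs from the data lake...
df = pd.read_csv("s3://mlops/extract/production-logs.csv", storage_options=storage)

# ...keep only what the model cares about, and hand off to the next stage.
df[["feature_a", "feature_b", "label"]].to_csv(
    "s3://mlops/preprocess/training-set.csv", index=False, storage_options=storage
)
```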
The next stage just takes the data sitting there, and the size doesn't matter anymore, because object storage is meant for scale. We take the data, we go to the next stage, and we continue. And this is the main point of building the machine learning pipeline with object storage in mind: if the engineer builds everything as a single notebook that does it all, and someone else has to figure out how to productize it, we're just wasting time. But if the machine learning engineer says from the start, OK, I'll take an object storage solution that I can run locally and work against it, it's even better, because they get used to the fact that sometimes the dataset won't fit on the laptop, and they'll be working against a dataset that actually lives on company servers or on some cloud provider. It makes everyone so much more productive.

If I were to summarize what I just told you, but vertically: say you were working with LLaMA again, a very trendy language model right now, extremely trendy, with a 4.71-terabyte dataset of 800 billion documents. If you want to train on top of this, you put the dataset on object storage and work with it from there: pull the files as you need them, convert them into arrays, and once you've preprocessed all those documents, these are web scrapes, you put them back on object storage. Whichever technology you use for that stage doesn't need to care when the next stage will come online; they only need to agree on the format. You could use a binary format, like TensorFlow's TFRecord format for binary storage, or you might as well use plain NumPy records or anything that fits the algorithm. The next stage can come along, start grabbing those, and train.

And that leads to something very interesting: a trendy large language model is also something you cannot train on a single node. Famously, when Meta published this model, they said: we trained this on 2,000 GPUs and it took us a week. OK, how many GPUs can you decently fit in a single box, eight? That means they had plenty of nodes. This opens the gate for distributed training. Distributed training becomes easy if the whole pipeline was built on top of object storage, because now I can just keep adding nodes, and as I distribute the training load, each node takes a different part of the dataset, a different batch, and they train and train, returning the results to object storage, averaging out, and continuing to the next epoch. It makes it trivial to train large-scale machine learning models.
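Here is a hedged sketch of that sharding idea: every training node lists the same bucket and takes every Nth shard, so no node ever needs the whole dataset on local disk. The RANK/WORLD_SIZE environment variables follow the convention used by PyTorch's distributed launcher, and the bucket layout is hypothetical:

```python
# Sketch: rank-based sharding of a dataset that lives in object storage.
# Each node downloads only its slice; bucket name and "shards/" prefix
# are made-up examples.
import os

from minio import Minio

client = Minio(
    os.environ.get("S3_ENDPOINT", "localhost:9000"),
    access_key=os.environ.get("S3_ACCESS_KEY", "minioadmin"),
    secret_key=os.environ.get("S3_SECRET_KEY", "minioadmin"),
    secure=False,
)

rank = int(os.environ.get("RANK", "0"))              # this node's index
world_size = int(os.environ.get("WORLD_SIZE", "1"))  # total training nodes

all_shards = sorted(
    obj.object_name
    for obj in client.list_objects("training-data", prefix="shards/", recursive=True)
)
my_shards = all_shards[rank::world_size]  # deterministic, non-overlapping split

for key in my_shards:
    local_path = os.path.join("/tmp", os.path.basename(key))
    client.fget_object("training-data", key, local_path)
    # ...train on this shard, then push checkpoints/metrics back to the bucket.
```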
So now I'll do a small demonstration, because I want this to be something you can actually perform right now, not just a nice proposal. I like to start with JupyterLab, a nice tool for running and organizing a variety of notebooks; every data scientist is pretty used to this technology. And what I'm bringing into it is Elyra. Elyra is a nice way to graph and design pipelines.

You can see here I have a pipeline with a stage that extracts, then ones that preprocess, train, evaluate, and deploy. Let's look at the most interesting one, training. If I open any one of these stages, I can immediately see what's going on. I design my pipeline to assume that things will be given to me through the environment, though of course I default to some values, and as a data scientist I'm just making a mess in here, testing things: they work, yes, this works, then I'll do this, then that. My neural architecture is actually in a separate file: just a very simple neural network with four layers (I'll sketch an example of such a file in a moment). That's what I'm using. These are all things that, as a data scientist, I find natural to do. I could even keep the neural architecture in the notebook; nothing forces me to put it in a Python file.

What's interesting is this: at the end of the day, when this stage produces output, does it stay on my local machine, or does it get placed onto object storage? And the next stage's starting point is the assumption that the pipeline will tell it where to grab its model or its input and continue from there. Elyra, as an open source product, makes this trivial, and it makes it very nice. Now you can go to your data scientists and make a strong case: I want you to build your machine learning algorithms following this design, this pattern, so that once we productize it, it's easier and simpler.

Something really nice about Elyra on its own is that if you want to connect it to a distributed training platform such as Airflow or Kubeflow, two of the most popular runtimes for this, it's as trivial as saying: this is a pipeline of type Kubeflow, hit run, select a runtime. Right here I have a Kubeflow runtime, and that's it. Even though I'm working on localhost, I may have Kubeflow running locally, who knows, or the company's Kubeflow somewhere online, and I just go into that environment and watch how the pipeline is actually performing. So I've turned this into a tool I can use to work locally while getting the machine learning engineer involved in building and productizing their own pipeline. They can see: I'm building against a smaller object store, whether on my laptop or on the company infrastructure, and learn which things that work for me here would not actually work in the final product.

I can go and explore the extraction process, the training, the logs; these are some of the nice things Kubeflow brings to the table. But of course, this is where the DevOps engineer comes into play. They care about: how do I run this? How do I run it continuously? And since the pipeline is already built this way, they can also run experiments on top of it. What they don't need to worry about is: this person said they need the whole dataset for extraction; do they need the same dataset on the next stage, and the next, and the next? That's a problem the DevOps engineer doesn't need to care about. They only need to care that they're giving the pipeline the proper access for it to run, and once that part is ready, it just keeps going.
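The talk doesn't show the architecture file itself, but a "very simple neural network with four layers" kept in its own Python module, with hyperparameters injected through the environment while defaulting to sane values for local hacking, could look something like this (layer sizes and variable names are my own illustration):

```python
# Sketch of a standalone architecture file: a small four-layer network
# whose hyperparameters come from the environment (so the pipeline can
# inject them) with defaults for local experimentation. All values and
# names here are hypothetical.
import os

import torch.nn as nn

HIDDEN = int(os.environ.get("HIDDEN_UNITS", "128"))

class SimpleNet(nn.Module):
    def __init__(self, in_features: int = 784, classes: int = 10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, classes),  # four Linear layers in total
        )

    def forward(self, x):
        return self.layers(x)
```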
And for the last stage, there's something very interesting here: Kubeflow itself uses object storage to store the intermediate artifacts passed between the stages of each pipeline, and you can go into the object store and look at all the individual runs. Actually, this is the wrong location; I'm using a bucket called mlops, for ML operations, and I have all my runs here. Every time I run this pipeline, everything is kept here as evidence of what was used to run the pipeline at that point in time, along with the output of every stage of the machine learning pipeline. So it's an experiment that can be reproduced. It's not that the DevOps person ran something, the result happened to be pretty good, and now no one knows what hyperparameters were used in that run. That, too, is taken care of by structuring your pipelines in a way that's repeatable.

I have an interesting story about this. At my previous startup we once trained a very cool model, the state-of-the-art model for the product, and I never backed up the hyperparameters. After a month I said, I'm going to retrain this; I fired up a VM with some GPUs, hit train, and, oh, it looks great, deploy to production. It went to production and had the worst performance ever. The worst. Everyone was panicking: let's go back to the previous model. And I said, I don't know where the previous model is. I never put it anywhere; as the machine learning engineer I had just taken it off the training machine, put it in a container from my laptop, and shipped it over. So we had lost it, and we had to go into the production containers and sneak through the Docker socket to extract the model and recover it. I knew the dataset I had used to train, but not the hyperparameters from the previous iteration of that model; if I had kept them, I could have replicated that training, and I didn't. So having a well-organized machine learning workflow is quite important, and so is making sure it's auditable, so people can go and see what happened when you trained something (I'll show a small sketch of this at the end of this section).

So here I have a training run. Let's look at it, yes, training. I'll just download this file and explore what happened. And this is very nice: oh, there was an error on this run. You can even see the notebook, with its outputs, exactly as it ran at that stage. So it's very easy to go to the data scientist and say: here's the notebook, this is your code the way you wrote it, and it failed this way; go fix it, let me know when it's ready, and we'll commit a new version. Building machine learning pipelines this way makes them very transparent and very scalable. And that is the problem I'm trying to address: build your machine learning pipelines in a very structured way through machine learning operations, and glue them together with object storage. So that's the gist of it.
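To close the loop on the hyperparameter story, here is a minimal sketch of the habit it argues for: the checkpoint and the exact hyperparameters that produced it go into the bucket together, under one run ID, so any run can be audited and reproduced later. Endpoint, bucket, and field names are all illustrative:

```python
# Sketch: persist the model and its hyperparameters side by side in
# object storage under a single run ID. All names are hypothetical.
import io
import json
import time

import torch
from minio import Minio

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

run_id = time.strftime("run-%Y%m%d-%H%M%S")
hparams = {"learning_rate": 1e-3, "hidden_units": 128, "epochs": 20}

model = torch.nn.Linear(4, 2)  # stand-in for the real trained model
torch.save(model.state_dict(), "/tmp/model.pt")
client.fput_object("mlops", f"{run_id}/model.pt", "/tmp/model.pt")

# The exact hyperparameters land right next to the checkpoint.
blob = json.dumps(hparams).encode()
client.put_object("mlops", f"{run_id}/hparams.json", io.BytesIO(blob), len(blob))
```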
So, to recap: MLOps is a development philosophy. It's what you bring to your company when you say: stop building everything into a single notebook; start structuring it into separate stages. We want to see separate stages, and even if people develop them as separate Python files, putting them in containers is quite simple, and you can bring in tools like Elyra to make it simple.

You should always keep in mind that machine learning models can go stale. We saw that with the initial model that didn't recognize the newly invented digit two; it needed to be modernized. If everything is structured as a pipeline, you just hit rerun: it will grab the latest dataset available, train, and deploy the new model if it passes evaluation.

Storage infrastructure is much cheaper than compute infrastructure; we've seen this since the big data days, and segregating compute from storage is key. Compute is burstable: you can go to any cloud provider and spin up burstable compute. But data cannot go away; data is not elastic; data just likes to grow. And the beauty of data is that it grows at a predictable rate, so you can invest in building a decent storage infrastructure, for example a very modern data lake based on object storage. Then, if next month there's a shinier, better way of training, a new GPU, a new tensor processing unit, you can just go leverage it. You're not bound by the infrastructure you decided to build last year.

Building all the machine learning components this way has the usual advantages of composability. People can come along and say: this is too slow because you're downloading everything into a pandas DataFrame, so I'll replace this with Presto. Or: you're crunching these numbers in Python, so I'll replace that with Spark. Every component, every stage of the pipeline, can be optimized and iterated on separately from the rest; no one needs to go and learn and understand the whole pipeline. And finally, the lifecycle of machine learning models is quite important: they always need to keep iterating, or they will go stale. That's somewhat redundant at this point, but worth repeating.

So that's my talk. Thank you for coming along, and if there are any questions, I'd like to take them now. Yes, go ahead. Could you ask the question on the microphone? Yeah, for the recording.

[Audience] Do you have any data on how the performance changes with the size of the model or the data?

It all depends on the model you're training. If you're doing, say, a classic model on top of Spark, well, Spark already trains in a distributed fashion, so the only advantage you'll see is whether the storage is feeding data as fast as the compute needs it. But if you're training, let's say, a deep learning algorithm, one of the biggest bottlenecks is the PCI bus: the GPUs are trying to pull data from the NVMe, from local storage, into the GPUs so it can be computed, and that's usually a bottleneck. The other part is that maybe you're bringing the data in over the network, through NFS or NAS storage, and that's also a bottleneck. So if the storage layer is not performing well enough, it quickly becomes the bottleneck.
However, if you have a very powerful accelerator, let's say a tensor processing unit, a TPU, and you make sure your data is preprocessed and batched just right so that batches can be loaded very quickly from object storage, that's where building on top of something cloud native matters. MinIO is one of the great choices for that, but every cloud provider will also offer object storage colocated with your compute. If your algorithm is structured well, compute is something you can always increase. So I don't have precise numbers, but I can tell you this unblocks one of the biggest contention points in the computing infrastructure.

Any other questions? That's all the questions, then. All right, thank you, and have a good Open Source Summit.

[Laughter]