Hi, my name is Daniel Valdivia, and I'm an engineer here at MinIO. Today I'm going to talk to you about writing machine learning pipelines against object storage, and the reason why you should be thinking about doing it that way. I'm going to introduce you first to the problem: how organizations traditionally write machine learning models and use them in production, and then we'll see the cons of that model. So let's get started.

This is the problem seen around the industry. Traditionally you have a machine learning engineer, and these folks have some really fancy machines: nice CPUs, local storage, and sometimes even GPUs to speed up training, depending on whether they're doing traditional machine learning or deep learning. Sometimes they cannot fit the whole dataset on the machine, so they take a subset of the data to train on, because of course they're experimenting, they're testing things, and they want to make sure they got it right. Then they pick the framework of their choice: sometimes TensorFlow, MXNet, or PyTorch, the most popular ones. They're working perhaps in a Jupyter notebook or a Zeppelin notebook, or perhaps they're just building the final code on top of it. So far this is great, because it actually encourages iterative building of the application, and this is where it still resembles traditional application development.

Now the problem is that usually, when these engineers are done building the application, the machine learning model being treated as an application, they throw it over the wall. And who catches it? DevOps. And then DevOps has the mission to take this application and go figure it out: okay, what data do I need to bring in to train this?
What's the pipeline? How am I going to grow it? Am I going to be doing this on top of virtual machines or on top of containers? What about the infrastructure? Am I going to be picking GPUs or CPUs? So these two are entirely separate worlds, completely isolated from each other, and this introduces a variety of problems.

For example, let's look at it this way. Let's say you're training on a very large problem like ImageNet. ImageNet is an image recognition project, a very, very famous one. If you want to train on it yourself or build something on top of it, the dataset is 1.31 terabytes in size. That's fairly large even for the average engineering machine, and it has 14 million images. So for a machine learning engineer who wants to work with this dataset, the data probably won't fit on their computer.

And then when it comes to production, let's say the engineer manages to fit this dataset on a machine and wants to train like this in production, to serve the model. These nodes do have fancy GPUs, and sometimes they have tensor processing units, also known as TPUs. But this takes too much time to download: imagine you're loading the 1.31 terabytes of data into the algorithm, and then the first part of the pipeline breaks, and you have to do it all over again. So this is one of the problems that traditional machine learning development and deployment strategies are going to face, especially when working with large datasets. And it doesn't have to be this way: if from day one you have a healthy development cycle, you can get around many of these problems.

So let's zoom in and look at the stages of building a machine learning pipeline, and why this is quite important.
I was telling you that if you're training your machine learning algorithms, it doesn't matter whether you're doing deep learning; in my examples I'm putting up some TensorFlow logos, but you can imagine this with any other framework. Usually compute tends to be quite pricey. In my case, if I'm getting VMs with GPUs or CPUs from a cloud provider, or procuring that hardware for my data center, these machines are definitely going to be way more expensive. So having them blocked on basic operations like loading data in and out is just too costly. And this gets even worse if you're using an even nicer machine on the cloud, like the ones that use TPUs; those machines are going to be way pricier. Compared to that kind of infrastructure, storage infrastructure can be more plentiful, and it's actually cheaper to operate and host.

This is where the advantage of designing your machine learning pipelines against object storage comes into play. You can design everything around a very large data lake of datasets. Say you keep your data on premise and then train on the cloud only when you need to, because then you get access to burstable compute. Or maybe there's some other situation in your organization: maybe compute is constrained and you have to time-share it. Then you want to make sure that when it's your turn to crunch, you don't spend time loading data; you just want to go ahead and crunch the data. Perhaps you're sharing this infrastructure for some other purposes.
So it's very important that you have a very clean separation of the infrastructure. And the storage infrastructure is great if it's running object storage, because it can be built at very large scale at a very low cost.

Zooming into machine learning pipelines, usually how they're built is: an engineer extracts some data out of somewhere, then cleans it up, then preprocesses the data into whatever format the algorithm he plans to use will take in. Perhaps he's training some image recognition, so he'll grab the data and translate it into zeros and ones, and then he'll feed it to some algorithm. Next comes training; that's the part that generates the model. Usually preprocessing and training need access to all the data to do their parts. Then, when you're evaluating the model, perhaps you only need a subset of data that you didn't train on, so you can test the validity of the model. And finally, let's say you are serving this model: you want to deploy it.
And what you're deploying is just the resulting model, which might be way smaller. Usually people take this model, bundle it in a container or virtual machine, deploy it to production, and use it. But if you already have this very nice pipeline, it's very important that you keep it continuously running, so that you can continuously train new models in a very easy way. If you have more data, you just want to crank a lever and get a new model trained.

More importantly, once the model starts serving, you want to keep acquiring data and then be able to say: okay, I've acquired more data, perhaps I should re-evaluate my model and see if it's still working. And if it's not working on the newer data, let me trigger a retrain.

Here is one of the scenarios where this could happen. Imagine you train a model to do character recognition, and let's say at the beginning of time people only wrote zeros and ones. So you train the best model out there: people draw a circle and it gives them a zero, they draw a line and it gives them a one. But then, in the future, the users decide to start using a fashionable new digit they call "two", and your model doesn't recognize it, because it was not originally trained on it. So it will do some faulty recognition and classify it as a zero. This is why it's so important that you build your machine learning pipelines in a way that makes it easy to just press a button and rerun the whole pipeline: train a new model against the newer labeled data, or the unsupervised data you're using for your algorithms, then evaluate it and deploy it, all in a simple fashion.
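The re-evaluate-then-retrain decision just described can be sketched as a simple threshold check. This is a minimal illustration, not code from the talk; the function name, numbers, and tolerance are all illustrative.

```python
def should_retrain(accuracy_on_new_data: float, baseline_accuracy: float,
                   tolerance: float = 0.05) -> bool:
    """Decide whether the serving model has gone stale.

    We re-evaluate the current model against freshly labeled data and
    trigger a retrain when accuracy drops more than `tolerance` below
    the accuracy measured at deployment time.
    """
    return accuracy_on_new_data < baseline_accuracy - tolerance


# Example: the model scored 0.85 at deploy time; on new data that
# includes the unfamiliar digit "two" it only scores 0.70, so retrain.
print(should_retrain(0.70, 0.85))  # True
print(should_retrain(0.84, 0.85))  # False, still within tolerance
```

In a real pipeline this check would run as its own stage, reading the fresh evaluation set from the object store just like every other stage.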
Now, if you look at all of these boxes, to some extent they're becoming components, and it's very important to look at each stage of a pipeline as a reusable component. Maybe extracting the data will work the same for you every single time, but preprocessing the data can be improved over time: let's say we now have better techniques to preprocess the data, or we made some new decisions on how to do it. Or let's say we had a first version of the training algorithm, and then we build a second version, a third version. Looking at the pipeline as reusable components is quite important so that you can see its evolution, and it means the whole pipeline doesn't necessarily need to change every time something about it needs to change. You can simply change, or improve, the component that needs updating. Moreover, if your pipeline is already looking like this, you can imagine:
Yeah, perhaps I did this initially with TensorFlow. Let's say I used Python with TensorFlow to extract, preprocess, train, evaluate, and then serve my model; then I containerized it and deployed it on Kubernetes. That could be version one of my pipeline, and it could be a very decent one; this is actually the approach we're going to be exploring in the demo. But you don't necessarily need to stick to this. You can start here, and structure your pipelines so that they're designed to run continuously, and so that they're not relying on the data being present locally at all times, because that's really the moral of the story. You don't want the extract stage to be downloading all the data from the source, putting it in a file, and then having that file copied from one hard disk to the next container's hard disk to preprocess it, producing another file, and copying that again. You want something that glues these stages together.

Moreover, perhaps these aren't the right tools forever. Over time you may want to move to more modern solutions. Perhaps you say: well, my extraction I'll now do with Trino, my preprocessing with Spark, training with TensorFlow, evaluation with TensorFlow, and then I'm going to use KFServing to serve this. When you start looking at your machine learning pipelines as modular components that you can improve over time, it unlocks a world of functionality. And the other point I was trying to make is: how do you glue this together?
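The gluing idea can be sketched as stages that only ever communicate through a shared store. This is a conceptual illustration, not code from the talk: a plain dict stands in for the object store, and the stage bodies are toy placeholders; in a real pipeline each get/put would be an object-storage read/write (for example via the MinIO SDK).

```python
# Each stage reads its input from the store and writes its output back,
# so any stage can be swapped out independently and no stage ever hands
# a local file directly to the next one.

def extract(store):
    # Pull raw data out of the data lake and write the extraction result.
    store["raw/reviews.txt"] = "great movie\nterrible movie"

def preprocess(store):
    # Read the previous stage's output, transform it, write it back.
    raw = store["raw/reviews.txt"]
    store["clean/reviews.txt"] = raw.upper()

def train(store):
    # The training stage only ever reads from the store, never from a
    # local file copied over by another container.
    data = store["clean/reviews.txt"]
    store["models/v1"] = f"model trained on {len(data.splitlines())} lines"

store = {}
for stage in (extract, preprocess, train):
    stage(store)
print(store["models/v1"])
```

Because the stages share nothing but object keys, replacing the preprocess step with a Spark job, or the extract step with Trino, changes one function without touching the rest.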
The right way is where object storage can be thrown into the mix, so that data extraction reads out of your data lake, which is your object storage, and puts the extraction results back into object storage. Then the preprocessing stage of the pipeline comes in, reads that file straight from object storage, crunches it, and puts the result back into object storage. Then TensorFlow comes along and trains, reading straight from object storage. It doesn't need to load the data entirely: in TensorFlow, for example, we have the concept of the estimator, so as training happens over batches, TensorFlow will load the data it needs straight from the object store. The same goes for evaluation, and then the resulting model can also be pushed to object storage. We can then have a single serving container that just reads the new model, and that's how we get a quick deployment.

And for ingesting more data, you can bring in any application of your choice and have it throw more data into the object store. Say you have an unsupervised anomaly detection algorithm, and you're capturing, I don't know, a lot of HTTP access logs through Kafka. At some point you say: every day, or every X hours, I want my algorithm to retrain automatically. Then, once Kafka has pushed enough data into the object store, the object store can either notify you via a lambda-style bucket notification, or you can just pull the lever yourself and kick the whole thing off.
This is the importance of designing your machine learning data pipelines against your object store: you can build these highly reusable pipelines that evolve over time, become more sophisticated, deliver higher value, and make everything more useful.

Now, most of the concern here is: okay, how is this going to look? How am I going to develop these sorts of pipelines? For example, if I were developing my own TensorFlow pipeline, say training on top of ImageNet, I would structure the pipeline so that every stage reads straight from the object store and then passes its results on; the next stage also reads straight from the object store, and so on and so forth. That's pretty much how you want to go about designing it. In the pipeline we're going to walk through in the demo, we're going to train something smaller: sentiment analysis, the "hello world" of machine learning, I would say. We're going to train that, and what you want is to be able to do one thing at a time: one stage pulls the data, preprocesses it, puts it back in the object store, then the next stage, and the next, and the next.

And this is something most modern machine learning frameworks can do: all the popular machine learning frameworks support object storage, so you can do this in a very easy way that's not disruptive to your workflow. Even better, maybe your engineer is concerned: okay, I don't want to be uploading and downloading all the data every time I run a test; they'll probably push back and say they want the data locally. But a really nice feature you can use, if you choose MinIO for your object store, is that you can run MinIO locally.
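For development, standing up a local MinIO server is a one-liner. The command below uses the Docker image and the default quickstart credentials; ports and the data path are just the conventional choices, so adapt them to your setup.

```shell
# Start a local MinIO server for development. The S3-compatible API
# listens on port 9000 and the web console on 9001; objects are stored
# under /data inside the container.
docker run -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  quay.io/minio/minio server /data --console-address ":9001"
```

Point your pipeline's S3 endpoint at `localhost:9000` during development, and only the endpoint changes when you move to the production cluster.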
So you can also use it for development, and you can trust that the same API will be available when you run MinIO in production with your pipeline. That way your machine learning engineers or data scientists can iterate and build their algorithms entirely around the object store. They don't have to throw the scaling problem over the wall for someone else to solve; they're building this nice culture around how machine learning should be built. Because at some point, perhaps, there will already be stages for preprocessing data that a given engineer doesn't need to care about: I'll just start after those stages have preprocessed the data, and then I'll do something different, perhaps a new model, some new prediction. So it's a philosophy. This is what is also known as machine learning operations: the philosophy that you design everything around reusable components, so that people can mix and match, and improve independent parts of the pipeline as they go.

With that said, let's jump into a demonstration. Let's see how you could achieve this using TensorFlow. You don't have to do everything around TensorFlow like I'm suggesting; if you were following along and doing this in a Spark pipeline, say, you could achieve pretty much the same result. The point to take home is the simplicity. And the next thing we're going to do, after reviewing the simple design, is: okay, how do I make this into a pipeline and then run it in some orchestrator?
Traditionally, and I showed you a version where you could just run this on top of Kubernetes, perhaps you want to pick an orchestrator like Airflow or Kubeflow: something that makes it trivial to upload your pipelines and have them rerun and rerun and re-evaluate. It's also easier for debugging, because when you're looking at your pipelines, they may be failing at a particular point, and that's something we'll see. Okay, let's jump into it.

For this part I'm going to jump into my IDE and my notebook. Here I have a really nice IDE, and you don't have to read through all of this notebook, but this is usually how I go about designing my machine learning algorithms. By the way, all the source code I'm showing in this video is available on GitHub, and you can also find it in the blog post; we have a MinIO blog post specifically for TensorFlow. Just search for "MinIO TensorFlow" and you'll find my "Hyper-Scale Machine Learning with MinIO and TensorFlow" blog post, so you can follow along.

Now, as I'm designing my machine learning algorithm, I want to be able to do what I need stage by stage. Perhaps I want to download my data, put it into the object store, and take it from there. So in this notebook I pretty much go along with the algorithm: I start by importing my dependencies and setting some base parameters. In this case I'm going to be pushing everything to my object store, so I'll be preprocessing some data into a bucket, using the MinIO SDK to download the data that I'll be extracting and preprocessing. That's the first part of my pipeline: let's preprocess some data.
In this case I got the ACL IMDB dataset. That's a dataset of pre-labeled movie reviews; some of the reviews are positive and some are negative. I'm going to download that data and put it into the object store. We're going to skim through this part, because it's pretty much a standard preprocessing pipeline. This is somewhere you could be naive and build it all in plain Python; in this case I am being naive and just downloading the data, crunching it, and returning it. In real life you'd probably want to build this on top of something like Spark, so that if your dataset is actually insanely large, you can distribute the preprocessing across Spark workers. For now I'm just going to preprocess my data, shuffle it, and upload it.

For the next part, I'm going to take these datasets and encode them using an existing preprocessing layer: I'm going to leverage Google's Universal Sentence Encoder model. This is because I'm building a deep learning model, and something really nice about deep learning models is that you can stack layers and leverage an existing model. After I've downloaded this from TensorFlow Hub, I want to put it into the object store too. Why would I do that? Why not have the pipeline download this model every single time? As I keep saying, you should be noticing where you're running these pipelines and retraining: you don't want to keep re-downloading things from the internet that you can cache in your object storage layer, and then have the pipelines pull them at very high speed.
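The caching pattern described here, fetch a remote artifact like the Universal Sentence Encoder once, keep it in the object store, and serve every later run from there, can be sketched like this. This is a conceptual illustration with a dict standing in for the bucket and a counter standing in for the slow internet download; all names are illustrative.

```python
# Cache-in-object-store pattern: only hit the internet on a cache miss;
# every later pipeline run pulls the artifact from local infrastructure.

def cached_fetch(key, fetch_fn, bucket):
    if key not in bucket:          # miss: do the slow remote download once
        bucket[key] = fetch_fn()
    return bucket[key]             # hit: fast read from the object store

downloads = 0

def slow_download():
    global downloads
    downloads += 1                 # count how often we really go online
    return b"universal-sentence-encoder weights"

bucket = {}
cached_fetch("models/use/v4", slow_download, bucket)
cached_fetch("models/use/v4", slow_download, bucket)  # served from cache
print(downloads)  # the remote download happened exactly once
```

The same pattern applies to datasets, pretrained layers, and any other artifact the pipeline would otherwise re-download on every retrain.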
Because that's running on top of your own infrastructure, you don't need to pull things over the internet. And even if you're developing things as an individual engineer, say collaborating in a local lab, if everything is on the local object store, things are just going to go faster.

So after downloading the data and putting it into the object store, which is what I'm doing in the first pipeline part, I'm going to load the model. So far, everything is still highly compatible with working locally. We'll start seeing the magic after we're done post-processing the data. Here I have some functions to serialize my samples, extract them, and store them. Now this is going to take some time; this is where the real preprocessing starts, and of course it's something that takes a while, especially if you're working with the whole dataset. So what we're going to do here, through the magic of editing, is jump to whenever that's done. Man, I've always wanted to do this.

All right, that took some time. After we're done preprocessing our data, you want to get into the business of training. This is where things really start making sense, about not having to load the whole dataset. For example, here in TensorFlow I can start loading my data by listing straight from object storage, getting a list of files. To configure TensorFlow properly to read the data, I just need to set some environment variables, which we set at the top, and then I'll prepare all the files. And this is where the interesting part comes.
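The environment variables mentioned here are the ones TensorFlow's S3-compatible filesystem reads, which is what lets `s3://` paths point at a MinIO endpoint. The sketch below shows the shape of that configuration; the endpoint, credentials, bucket, and key layout are illustrative, and the TensorFlow calls are shown only as comments since they need a running object store.

```python
import os

# Point TensorFlow's S3 filesystem at a local MinIO instance. These env
# var names are what the S3 filesystem reads; values are illustrative.
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
os.environ["S3_ENDPOINT"] = "localhost:9000"
os.environ["S3_USE_HTTPS"] = "0"   # local MinIO over plain HTTP

# Each preprocessed file holds one batch of serialized records, stored
# under its own object key, so training can stream one object at a time.
bucket = "datasets"
train_files = [f"s3://{bucket}/imdb/train/part-{i:05d}.tfrecord"
               for i in range(3)]
checkpoint_dir = f"s3://{bucket}/checkpoints/sentiment"

# With that in place, TensorFlow can stream batches and write its
# checkpoints directly against the object store, e.g.:
#   ds = tf.data.TFRecordDataset(train_files)
#   model.fit(ds, callbacks=[tf.keras.callbacks.ModelCheckpoint(checkpoint_dir)])
print(train_files[0])
print(checkpoint_dir)
```

Nothing downstream changes between laptop and production: only `S3_ENDPOINT` and the credentials do.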
As I give TensorFlow the data to read, I tell it to read straight from object storage; I'm not telling it to read from a temp file or anything. So as my machine learning algorithm trains, in the particular case of TensorFlow, we'll see the estimator loading the data as it's needed. This is also a very common access pattern elsewhere: say you were using Spark, you wouldn't be downloading the whole dataset onto every worker; as you're crunching data, Spark only pulls the data it needs.

So now I run these stages. The nice thing about preparing your data this way is, well, here's the model I'll be training: a pretty simple model, a couple of dense layers, and then an activation function applied on top. How many weights? I haven't printed them, but it's going to be around 135,000 weights to train, so it's not a very deep model, let's put it that way.

Now all my data is already in a TensorFlow dataset, and that dataset reads straight from the object store. I even preprocessed my data into a format that's friendly to TensorFlow, so that as data needs to be processed, it can be read straight from the object store, and each file holds a whole batch of records; that's how I decided to lay out my preprocessed data. I'll just pass this to TensorFlow. I'm also going to tell TensorFlow to put my checkpoints in the object store, and this is very important, because as this trains, and let's say training is going to take a few hours or a few days, you want to be able to check the progress. Perhaps you want to start TensorBoard on top of those checkpoints; TensorBoard is a great tool for TensorFlow users to see how training is going even before it's done. You can just point it at the checkpoints and see what's going on over there. And then I'm also going to set the log directory
for that. And here's the second part: model fit. I'll just say let's train this for 10 cycles, 10 epochs. So we start training, and again this is going to take some time, so through the magic of editing, we'll skip to the end. Ready? All right, training is done. We ran our data through the training epochs.

Now that this part is complete, you already know the rest of the drill. We'll save this model straight to the object store. I'm going to check how the model did during training, so we can see that part, and then I'm going to run some evaluation of how well this model performs, against some data that I set apart at the beginning. This data is also stored in the object store, so these mapped datasets, which read straight from the object store, are fed right into TensorFlow, and TensorFlow can tell me: okay, you have an accuracy of 85%. Not state of the art, but I tried to keep these trainings small; you get the idea. I'm not trying to go too much into detail here, because I mostly want to take you through the normal cycle of what a machine learning engineer will do.

The last part, of course, is loading the model to check the performance of my algorithm. I could do some predictions: here I have some sentences, some of them are mean movie reviews, and of course I want it to be able to say that "this was extremely good, I loved it" is positive, and to detect the negative ones. You can see some false positives over here. So you get the idea.
So the model is trained; it works 85% of the time. Now we want to serve it, and serving can also go straight against the object store. In this case I'm going to inspect my model using a CLI tool; actually, this step wasn't needed for this video. But then I can start, say, a TensorFlow model server. There's TensorFlow Serving, a technology for serving TensorFlow models, and there's KFServing and other technologies. Once I start it, in a Docker container, I load some Python requests and just make requests to it: okay, show me some predictions. "I loved this movie", and there we go. I sent all my same samples through a JSON interface and got what I was expecting. So you can see how I'm gluing everything to the object store, so that when this actually reaches production, everything is ready.

Now, traditionally, as a data scientist, I'd say: done, let's take this to production. I could throw this whole Jupyter notebook into a large Python file, and of course everything would be "ready for prime time": just put it in a container, tell my DevOps to run it, and things will work. But we've seen the problem with this approach: it's not highly scalable. The main takeaway is that we want to improve on top of this.

Now, the best part is that I designed my whole pipeline around very well-defined stages. Very much on purpose, I put everything into a single large file; I could have organized this nicely into separate files, you can do that, and I actually encourage you to do it. For the sake of this video I made one very large file, so that you could see, for example, that with my orchestrator of choice, which is Kubeflow for this video, transitioning this into a pipeline is going to be dead simple.
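The JSON interface used a moment ago is TensorFlow Serving's REST API: a POST to `/v1/models/<model_name>:predict` with a body of the form `{"instances": [...]}`. The sketch below builds such a request; the model name, port, and sentences are illustrative, and the actual HTTP call is shown only as a comment since it needs a running model server.

```python
import json

# Build a request body in the shape TensorFlow Serving's REST API
# expects. Model name and example sentences are illustrative.
model_name = "sentiment"
url = f"http://localhost:8501/v1/models/{model_name}:predict"
sentences = ["this was extremely good, I loved it",
             "a complete waste of two hours"]
body = json.dumps({"instances": sentences})

# A real call would look like:
#   import requests
#   response = requests.post(url, data=body)
#   print(response.json()["predictions"])
print(url)
print(body)
```

Because the server loads its model straight from the object store, pushing a newly trained model and restarting the serving container is the whole deployment.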
I can use, for example, Kubeflow Pipelines components, the Python function-based components in Kubeflow, and just take my chunks of Python code and wrap each one in a stage. In this case my preprocess-data function is going to be one of my stages. Another stage is going to be my training stage: all the code you saw me run for training. Then my evaluation code, my deployment code; all of those become stages of my pipeline. And Kubeflow makes this very simple through the DSL; the pipelines SDK has a DSL for designing pipelines this way. That's what I'm doing, just taking my code, and I don't want to bore you with me copying code from left to right, which is why I prepared this pipeline in advance. The main takeaway is that the code is already ready to read data out of a different location, in this case the object store, so there's no need for me to set up anything to copy data into the container. I don't need to set up anything.

Even better, I can just hit run on this pipeline on the right side, and that will beam it up to my Kubeflow instance. Let's do that right now. Actually, I think I configured this to compile; I want to run. There we go, let's run this straight away, let's not waste any time.

Right, so let's go to my Kubeflow instance, which I'm running here. I'm going to log into my Kubeflow instance, and nothing's going to load... there we go. That's the danger of live demos; even with editing, I wouldn't tempt it. So let's go to the runs, and we'll see my sentiment analysis pipeline actually running. You can see all the things we were doing and running on my computer; this Kubeflow is already running on a machine with GPUs, so it's definitely going way faster than my laptop did. The preprocessing, the training, the evaluation; let's pick up the evaluation. This is going to give me a file to open, so let me bring up an editor for that. After opening this file I can see the accuracy of training.
Because of random initialization, well, I was fixing a particular random seed, I thought I got better accuracy on this machine; actually, no, this is the same accuracy I got in my notebook. But you get the idea. So, let's see if this one opens; yeah, this one opens in the browser. Everything I was running before on my notebook, I specifically designed it to read the data straight from the object store. I was designing and developing everything running locally: I had MinIO running locally, copied the data in, and everything read straight from that place. Preprocessing, putting it back; training straight from it; saving checkpoints; doing evaluation against it; and finally serving the model straight from the object store. It was that simple, so that at the end, in this case, Kubeflow made it run as a machine learning pipeline. Now, every time I need to retrain my pipeline, I can rest assured that I can just pull the lever and, presto, a new machine learning model will be born.

My last stage actually leverages some of the nice features of Kubeflow, for example serving models, running on HTTPS. Where's my model server?
There we go. Yes, nothing's going right in this demo. Here, right, so now I can do TensorFlow Serving natively from Kubeflow, so my pipeline can take that upon itself: when it's done, if my evaluation goes right and passes a threshold, perhaps that's when I move to deploy the model. And that's about it.

So, machine learning operations is a development philosophy; that's what I've been trying to convey. If there's only one thing you take from this talk, it's that machine learning operations should be a development philosophy. You want people to say: okay, I'm going to build a better preprocessing stage, or I'm going to build a better extractor stage, but build it in a way that can be reused by someone else, so that others can also build their own machine learning algorithms on top of these reusable components. That's where you want your organization to move: machine learning pipelines that are highly reusable, and they can be highly auditable as well. So that's one thing.

The other one is that machine learning models may go stale. That's why it's so important that you keep retraining with newer data, whether it's newer labeled data or newer unsupervised data, because business cases will change; your business will evolve over time, and your models need to evolve with it. We saw that in the MNIST-style optical character recognition example I gave you, where people may come up with new digits, and then your model needs to retrain quickly. As soon as you gather enough data to start identifying those new use cases, you want to retrain.

Then, storage infrastructure is way cheaper than compute infrastructure. If you're building everything around storage infrastructure, compute you can always burst: you can go to the cloud and say, I'm going to do this training on demand, on whatever GPU or TPU machine I need, and when I'm done I want to turn it off. That could even be part of your pipeline: starting a fancy VM somewhere, having it run and pull the data from your
data lake, and then shutting it off afterwards. And storage infrastructure tends to be way cheaper to operate than compute infrastructure, so that's another very important highlight of why you should be designing your pipelines this way.

We already touched on this: building all your components to be reusable has huge advantages. You can have your machine learning engineers collaborating, or your DevOps improving parts. Say the machine learning engineer doesn't know how to use Trino; you can get someone to do the Trino part for them. Or someone says: well, we can crank this faster on a Spark cluster, and we already have one, so we can do that for preprocessing the data, and so on. You get the core idea.

And lastly, yeah, machine learning models need to be continuously retrained or they're going to go stale, and you don't want that.

So that's been all for today. Thank you for attending this talk. My name is Daniel Valdivia; again, please follow us on Twitter, hit us up on GitHub, or join our public Slack if you want to hang out with the community. If you have any questions about how to run your own object store, you can ask us there, or you can just visit our website. Thanks for your time, and I hope you have a great time. Remember, the source code is going to be available on GitHub, and it should be in the description of this video. All right, see you guys.