Thank you. So today I'm going to present something called a machine learning data pipeline — a sample of how the data preparation step happens for machine learning. In our company we have a very similar structure: we have a lot of data sources that we need for applying machine learning or doing some kind of analytics. What usually happens is that the data doesn't lie in a single database for you; it lies in different database servers. It might be something like a Hadoop or HBase cluster, which you can imagine to be database server one here, and a relational database like MySQL, which you can imagine as database server two. And as you saw in the last presentation, there was a lot of talk about the open APIs the Singapore government exposes, like the ones for all the MRT stations. In our company, and in many other companies, we want that API data to enrich the data we already have, so we have to consume those APIs as well. So before going into machine learning training, pre-processing, or any kind of analytics, what we need is a data integration layer. So what usually happens in data preparation? Imagine a hypothetical situation where we have three data sources and three components. The first component is data ingestion, which is basically consuming the data. Once you have the data, you need to aggregate it: you find a common field — it might be an ID, a name, or something similar — and join on it to come up with a single view of the data across all three sources. That is the data aggregation task. After that comes data pre-processing, or cleaning. This step is quite important in machine learning, because this is where you do a lot of the work: you normalize, you handle missing values, and many other things, so that the data you get after this stage is enriched and ready to go into machine learning training. We'll come to the training part later. So how do we code this? A few years back in Python, we used to write this naively: we used the urllib library to consume the API, and for the database servers we queried using something like SQLAlchemy to consume the data from database servers one and two, and then we did a join. So this is how the naive Python code for this diagram looks. You can see that you have to get the data from the API — you say this is the URL — and after you get the data, you just write it out to some kind of CSV. Similarly, you read the data from database one and say, okay, let me store this as a CSV, and the same for database two. I have used an example with the pandas library, which is a very common library among the machine learning and data science people out there.
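Roughly, that naive ingestion code looks something like this — the URL, connection strings, and file paths here are made up for illustration, not the exact ones on the slide:

```python
import urllib.request

import pandas as pd
from sqlalchemy import create_engine

API_URL = "https://example.com/open-data/api"  # made-up endpoint, just for illustration


def ingest_data():
    # Consume the open API and dump the raw response to a state file.
    with urllib.request.urlopen(API_URL) as response, open("/tmp/data_api.csv", "wb") as f:
        f.write(response.read())

    # Read from database server one (say, MySQL via SQLAlchemy) and persist it as CSV.
    engine_one = create_engine("mysql+pymysql://user:pass@db-server-1/warehouse")
    pd.read_sql("SELECT * FROM transactions", engine_one).to_csv("/tmp/data_db1.csv", index=False)

    # Same again for database server two.
    engine_two = create_engine("mysql+pymysql://user:pass@db-server-2/warehouse")
    pd.read_sql("SELECT * FROM stores", engine_two).to_csv("/tmp/data_db2.csv", index=False)
```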
For aggregating the data, we read back the three CSVs we wrote out, use pandas.merge on a field called ID, and we have the aggregated data. Again we say, okay, let's store this state of the data too, and we store it as data_aggregated.csv. The next step is the pre-processing. For example, you have a function that handles only the missing data: you read back the aggregated data and say, okay, take this data frame and handle the missing data — it accepts a pandas data frame and gives you back a clean data frame. Now, all three of these can be thought of as tasks, and you can imagine execute_data_preparation_pipeline as a wrapper function which executes them: it calls the ingestion, once that is over it calls the aggregation, and after that it calls the pre-processing. As of now, everything works.
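Continuing that sketch, the aggregation, pre-processing, and wrapper functions look roughly like this (same made-up paths as before, and fillna standing in for the real missing-value handling):

```python
import pandas as pd


def aggregate_data():
    # Read back the three ingested CSVs and merge them on the common ID field.
    api_df = pd.read_csv("/tmp/data_api.csv")
    db1_df = pd.read_csv("/tmp/data_db1.csv")
    db2_df = pd.read_csv("/tmp/data_db2.csv")
    aggregated = api_df.merge(db1_df, on="ID").merge(db2_df, on="ID")
    aggregated.to_csv("/tmp/data_aggregated.csv", index=False)  # persist this state as well


def handle_missing_data(df: pd.DataFrame) -> pd.DataFrame:
    # Accepts a pandas data frame and gives back a cleaned one (fillna is just a stand-in).
    return df.fillna(0)


def preprocess_data():
    aggregated = pd.read_csv("/tmp/data_aggregated.csv")
    handle_missing_data(aggregated).to_csv("/tmp/data_training.csv", index=False)


def execute_data_preparation_pipeline():
    # Wrapper that simply runs the tasks one after the other.
    ingest_data()
    aggregate_data()
    preprocess_data()
```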
Now, as we know, things are never perfect in this computing world — something or the other goes wrong. So what happens? Say the first task, getting the data from the API, works absolutely fine — you get the data. You then read the data from database one, you have it, and you store it as data_db1.csv. Now, in the second call, the read from database two suddenly fails due to some DB issue. It throws some kind of error, the whole program stops, and you see errors in the logs — maybe the database server is down, or something or the other. So what happens when this stage fails? The obvious thing that comes to mind is that I need to rerun it. Fair enough — but this is where people get really annoyed: I have already processed the data from the API and from database one, and now I have to do it all over again. So how do we handle this? The question we ask is: can we resume? That is, can we start from exactly where it stopped? Can the system figure out where exactly it needs to run again, and know which intermediate state it should continue from? This calls for persistence of the task state. We need to persist some of the states — we are already storing the steps as CSV files, so we can treat those as the state of each and every task. Another thing we should keep in mind here is the property of atomicity. In this failure, you read from database two and start writing to data_db2.csv, and some time after you start writing, it fails. What should happen? data_db2.csv is now a dirty file and you can't use it. So we have to think about atomic file operations: either the file is written in full, or it isn't written at all. That means when this stage fails, we have to remove data_db2.csv. So let's start coding this again. We'll check for the states, so that we can avoid the unnecessary repetition of all the steps that have already succeeded. We say, okay, let's check whether that path exists, since that is our state. If it doesn't, then we try to get that data, and if we get an exception anywhere, then to maintain atomicity we remove that file again. So with this code, you now know where to continue from — you know the intermediate step — and the second thing you have done is maintain atomic file operations. Good enough. But now imagine this code base when you're accessing three or four data sources and writing this kind of code for each. You can easily imagine the code getting messier and messier, and once you start taking parameters and parameterised methods, it becomes even messier to handle — "check whether it exists, if it exists skip it" repeated everywhere. So there are two ways to handle it. One, you start crying. I'm not the one who will cry — I'll be the second one: I'll go on a holiday, leave the whole code base to my team and say, okay, you fix it, I'm going on vacation. That's the simple thing I'll do. But okay, instead of writing "check whether it exists, remove it if the operation fails" every single time, can we have some kind of structure, some kind of boilerplate, which helps us achieve the same thing?
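Just to make the pain concrete, this is roughly what that hand-rolled check-the-state-and-clean-up boilerplate looks like for a single source — the pattern you end up repeating for every data source (again with the made-up path and connection string from before):

```python
import os

import pandas as pd
from sqlalchemy import create_engine


def ingest_db2_with_state():
    state_path = "/tmp/data_db2.csv"
    # Resume support: skip the step if its state file already exists from a previous run.
    if os.path.exists(state_path):
        return
    try:
        engine = create_engine("mysql+pymysql://user:pass@db-server-2/warehouse")
        pd.read_sql("SELECT * FROM stores", engine).to_csv(state_path, index=False)
    except Exception:
        # Atomicity: never leave a half-written state file behind.
        if os.path.exists(state_path):
            os.remove(state_path)
        raise
```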
So let's think about a DAG. DAG stands for Directed Acyclic Graph — those of you using Apache Spark will know and use this concept. A directed acyclic graph is a graph which shows the interdependencies among all the nodes, and very much like Spark, it does lazy evaluation. What do I mean by lazy evaluation? What you see here is another representation of the same pipeline. There are three tasks — reading the API, reading DB one, and reading DB two — and their outputs are stored as data_api.csv, data_db1.csv, and data_db2.csv, just like earlier. The data aggregation task now has three inputs, and it stores its output as data_aggregated.csv. After that comes data pre-processing, where data_aggregated becomes the input, and finally you have the clean data which is ready to go into machine learning training, which is data_training.csv. Now suppose we draw the graph first and say, okay, I know these are the tasks I need to execute one by one — this depends on this, and this depends on that. After the graph is drawn, execution only starts when we ask for the last step. In our earlier code, if we call ingest data, it starts ingesting data immediately; but when we think of it as a DAG, we get a diagrammatic, graphical representation where you can see that this task depends on this one and that one. We'll see the graphs — I have snapshots, and I have it in the demo. Okay, so what if there is a framework which provides you with exactly this kind of thing? So yeah — if anybody still remembers, there used to be a show called the Super Mario Brothers Super Show, with two fellows, Mario and Luigi. They were both plumbers, and in their city or town — I don't remember which — they were the best plumbers around. So do we need a Mario or a Luigi: someone who can fix this, plumb all these tasks together and stitch them up, so that if anything fails it will resume automatically, take care of atomicity, and so on? Once I started in the machine learning world, I felt, okay, this is something very important that I need. So the answer is: we need a plumber, and here is Mario's brother, called Luigi. Luigi is a framework, a Python library, that Spotify has developed. It has a lot of properties. The first is dependency resolution, which you have already seen — dependencies like this data-reading task, the data aggregation task, and so on, all depending on each other. That dependency resolution is handled by Luigi itself, and it takes care of the whole workflow. It also has visualization. If you remember the earlier, naive Python way of writing this code, you cannot see the progress of your tasks in the data pipeline; you need to go and tail the logs and see what state it is processing. You don't have any graphical representation telling you these steps are over and these are pending. Luigi provides you that. It also handles failure: in the naive Python way, if something fails, it just fails. But here you can say, if it fails, retry two or three times — maybe your DB server is down for a minute or so. The naive code throws an error at that moment, but Luigi will say, okay, it failed, I'll retry it in five seconds, and it keeps going. It has a lot of configuration to play around with. So this is how a basic Luigi task looks. The first thing you see is a parameter, declared as luigi.Parameter. Everything in Luigi is a task: there is a MyTask class here which inherits from luigi.Task, and it has this parameter. A parameter works like this: if I want to train a model daily, my parameter should be the date, so I pass today's date as a parameter and it creates the models and stores them for that date. requires is where you tell this task what it depends on — this is where you declare the dependency, so MyTask depends on some other task, and this is how Luigi maintains the dependencies. run is where the main business logic lies — you write the whole business execution there, whatever you need. And output is where the task writes or stores its state: for example, you read data from somewhere, do some processing, and say, okay, this is where I want to write it to. So every task now declares what it depends on, and after running its logic, it writes its output somewhere. This becomes a very easy, very good boilerplate to start with.
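The skeleton on the slide is roughly this shape — the task names and paths here are only illustrative, but requires, run, output, and the parameters are the real Luigi hooks:

```python
import luigi


class SomeOtherTask(luigi.Task):
    """A hypothetical upstream task, just so the dependency below has something to point at."""

    def output(self):
        return luigi.LocalTarget("/tmp/upstream_state.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("id,value\n1,42\n")


class MyTask(luigi.Task):
    # A parameter makes the task re-runnable per value, e.g. one model per date.
    date = luigi.DateParameter()

    def requires(self):
        # Declare the dependency: Luigi will make sure SomeOtherTask runs first.
        return SomeOtherTask()

    def output(self):
        # The state of this task; Luigi treats the task as done once this target exists.
        return luigi.LocalTarget(f"/tmp/my_task_{self.date}.csv")

    def run(self):
        # Main business logic: read the upstream state, transform it, write the result.
        with self.input().open() as infile, self.output().open("w") as outfile:
            outfile.write(infile.read())
```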
So you can now write all your tasks this way — we'll see in the next slides how we rewrite this code the Luigi way. Okay, this is where it comes: now we Luigi-fy our data preparation tasks. Earlier we had the API data ingestion. What do we do? We make it a class, because the API data ingestion is itself a task. If you remember the graph, the API data ingestion doesn't depend on any task, and neither does reading DB one or DB two — these are three independent tasks which can happen in parallel. So in the API data ingestion task you don't see any requires, because it doesn't depend on anything. It does the operation of reading from the API, it has an output — you can see the output is defined — and after running, it writes to that output. So you maintain the state for the API data ingestion layer. After that, the database one ingestion is very similar: you read it and you write it. So you have now turned these three tasks — the API read and the two database reads — into three Luigi tasks. What we are left with is the data aggregation part. We have the data from the three sources written into CSVs; now we need to aggregate it. In the data aggregation task, you can see it depends on three tasks now — on these three states: the API CSV, DB one, and DB two. So in requires we yield these three tasks (which makes requires a generator), and these three tasks have to finish before this task can be executed. So the data aggregation, before even starting to run itself, will run all three of them. After that, in the logic, I again use a simple pandas.merge to create a single view of the data, and I say the output of this fellow is data_aggregated.csv. Next comes the data pre-processing task, which is the main work we need to do. Data pre-processing depends on only one task, the data_aggregated.csv state — only once the data aggregation is over can the pre-processing happen. So in the data pre-processing task, in requires we put the data aggregation task, and in run we do the pre-processing: handling missing values, outliers, and many other things, and we store the result back as the training data. Now we get to the step which actually trains a machine learning model. You have the data with you now — the clean training data. How do we train? This Train class, which is again a Luigi task, requires the data pre-processing task, because it needs the pre-processed data. In the run part, we say, okay, let's fit a random forest or some other regression model, we pass it the data frame, and we save the model, which at this point is a pickle object. We store that pickle object at the target you can see in the output method.
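Put together, a condensed sketch of the Luigi-fied pipeline looks like this. The class names, file paths, stubbed ingestion data, and model details are placeholders for illustration — the real code is in the repo:

```python
import pickle

import luigi
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


class ApiDataIngestion(luigi.Task):
    def output(self):
        return luigi.LocalTarget("/tmp/data_api.csv")

    def run(self):
        # Stand-in for the real API call: persist the raw result as this task's state.
        pd.DataFrame({"ID": [1, 2], "api_feature": [0.1, 0.2]}).to_csv(self.output().path, index=False)


class Db1Ingestion(luigi.Task):
    def output(self):
        return luigi.LocalTarget("/tmp/data_db1.csv")

    def run(self):
        # Stand-in for reading the transactional data from database server one.
        pd.DataFrame({"ID": [1, 2], "sales": [100, 200]}).to_csv(self.output().path, index=False)


class Db2Ingestion(luigi.Task):
    def output(self):
        return luigi.LocalTarget("/tmp/data_db2.csv")

    def run(self):
        # Stand-in for reading the store data from database server two.
        pd.DataFrame({"ID": [1, 2], "store_type": [1, 2]}).to_csv(self.output().path, index=False)


class DataAggregation(luigi.Task):
    def requires(self):
        # All three ingestion tasks must finish before this task runs.
        yield ApiDataIngestion()
        yield Db1Ingestion()
        yield Db2Ingestion()

    def output(self):
        return luigi.LocalTarget("/tmp/data_aggregated.csv")

    def run(self):
        api, db1, db2 = (pd.read_csv(target.path) for target in self.input())
        api.merge(db1, on="ID").merge(db2, on="ID").to_csv(self.output().path, index=False)


class DataPreprocessing(luigi.Task):
    def requires(self):
        return DataAggregation()

    def output(self):
        return luigi.LocalTarget("/tmp/data_training.csv")

    def run(self):
        # Stand-in for the real cleaning: handle missing values, outliers, etc.
        pd.read_csv(self.input().path).fillna(0).to_csv(self.output().path, index=False)


class Train(luigi.Task):
    def requires(self):
        return DataPreprocessing()

    def output(self):
        # Nop format so the pickled model can be written as bytes.
        return luigi.LocalTarget("/tmp/model.pkl", format=luigi.format.Nop)

    def run(self):
        df = pd.read_csv(self.input().path)
        model = RandomForestRegressor().fit(df[["api_feature", "store_type"]], df["sales"])
        with self.output().open("w") as f:
            pickle.dump(model, f)
```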
So once we run Train, what happens? Luigi basically starts to build the graph: Train depends on data pre-processing, data pre-processing depends on data aggregation, and that depends on the three data ingestions from DB1, DB2, and the API. This is the graph that Luigi will draw out before it starts executing the tasks. This is how it looks — we'll see it in the demo as well. Here is a sample of what you see when you start executing. You can read the debug log here: it says it is checking if Train is complete, then it checks whether data pre-processing is complete, then whether the aggregate-train-data task is complete, then the train data ingestion and store data ingestion — the two data sets I'm reading — and it marks them all as status pending, pending, pending. Once those checks are over, you can allocate a number of workers to them. Say I want to run with five workers: in our example, three workers will take on the three ingestion tasks — reading the API data, DB1, and DB2 — so three workers start working, and the other two remain idle, because until those three tasks finish, the other workers can't do anything. Once all three are finished, the data aggregation starts automatically — only one worker is active at that point and the remaining four are idle. This is how the execution proceeds, and you can see that Luigi creates a new process ID for each job it triggers. And this is how the Luigi scheduler looks — you can actually see a graph here. We start Luigi as a central scheduler using something called luigid, we run the Luigi pipeline, and you can see the graph: the Train task is still pending, the data pre-processing part is running right now, and the three greens mean those are done — the aggregation is done, and the reads from the two data sources are done as well. We'll see all of this in the demo, so let's go into the demo. Okay, there are two things. First we'll run it with the local scheduler. The local scheduler is used when you want to do debugging — in the development phase, when you're writing code and want to check that everything is right. So let's just run the code as it is and see. You can see here the same thing: checking if aggregate train data is complete, status pending, and it continues — the workers keep asking for more tasks until everything is complete. Just to tell you, this example is actually from Kaggle — you can refer to the Kaggle website here and get the data there, and you'll also find the data in my GitHub repo, along with instructions on how to install everything. I have a build.sh; on a Linux or Mac environment you can just run it and things will start working automatically. So you can download the data from there.
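For reference, kicking off the pipeline programmatically the way the demo does — five workers against the Train task from the sketch above — can be done like this (on the command line the equivalent flags are --workers and --local-scheduler):

```python
import luigi

if __name__ == "__main__":
    # Build the dependency graph rooted at Train and run it with five workers,
    # so the three independent ingestion tasks execute in parallel.
    luigi.build([Train()], workers=5, local_scheduler=True)
```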
The data set comprises two parts. One is the transactional data — this is how the training and test data looks: it has an ID, store, day of week, date, open, and many other fields, and what we try to predict is the sales. There is also a file called store.csv, which is the information about a store: what the store type is, what the competition distance is, and so on. We can imagine the store data lying on one database server and the transactional data lying on another database server — so we're basically mocking that kind of environment with two different CSVs. So let's see what happened to our run. Okay, it says "this progress looks good because there were no failed tasks or missing external dependencies". If we go to the tmp folder where we created everything, we see that we now have a model pickle object, the clean training data, the training stores data, the training data, and also the aggregated data. So all the tasks have run successfully and created all their outputs as state files in tmp. Okay. Now we'll see it running with the central scheduler. To run with the central scheduler, you just need to remove the flag called local-scheduler and run it again. The other thing you need to do — you can follow here how to start the luigid central scheduler — is to run luigid with an ampersand, which runs the central scheduler as a background Python process. It has an internal HTTP server which you can access on localhost at port 8082. We'll see that now. So we start the luigid process, we try to access it, and we see there are no tasks running yet — nothing is running at all. We go back here; we have changed the invocation to run against the central scheduler — that is, we have removed the local-scheduler flag — and we just run it. Okay, this run now says there are no tasks to run. Why is it saying that? Because all the dependencies, the state files, already exist. So we need to remove all of those state files first. Once we run it again, you can see here which tasks it is running. If you go to a single task's graph, you can see how the graph looks: you have the Train task, which is still pending, and the data pre-processing and aggregate-train-data tasks, which are running. You reload, and okay — the aggregated train data is completed, then the data pre-processing and Train, and so on. Apart from that, you can see the pending tasks, and if there is an error, it will usually show it in red — you can click on it and see the exact Python error. So this visualization is really useful: you can see the progress of your tasks and many other things. Let's go back here — okay, the tasks are finished. One other thing we could look at — it's not available right now, but you can see all the historical tasks here if you configure a SQLite DB for the task history.
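For that task-history view, the central scheduler has to be pointed at a database through Luigi's config file; it is along these lines, though the paths are placeholders and you should check the Luigi configuration docs for the exact keys in your version:

```ini
[scheduler]
record_task_history = True
state_path = /tmp/luigi-state.pickle

[task_history]
db_connection = sqlite:///luigi-task-history.db
```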
Apart from that configuration, there is the Luigi documentation, where you can find a lot of useful things — for example, how to set the error email and all these parameters. There are many parameters you can pass — like I mentioned, a datetime as a parameter and so on — you can set the error email type, and you can even use Luigi to run HDFS tasks from the framework. It is also quite popular among the Spark people who want to do a spark-submit from a Python framework. Apart from that, another thing that was part of the presentation which I haven't shown yet is: how do you predict? You have created the model now — if you do an ls, you can see the pickle object. So how do you deploy this in production? That was a small part which I thought, why not show it. If you go into my code base, there is a Flask app server. I use Flask, and it has an endpoint for predicting sales: you pass the data to it, it uses the model that you have created, and it returns a prediction. So it becomes more of a sales-prediction API for you, for your company if you want one. Let's see that part as well — it's quite basic. We run the Flask app server with Python and it starts up. If you want a prediction — I'm just using Advanced REST Client to show you — this is the data: the store, the day of week, date, open, promo, state holiday, school holiday; these are the parameters for which I need the prediction. Let's send a request to that fellow, and it comes back and says, okay, this is your predicted sales. This part of the code is also in the GitHub repo — those who want to play around, please feel free to. There is also one part of the code which is quite useful in some cases: when you want to retrain a model. My Flask API server has a load-model part. This reloads the model behind the sales-prediction endpoint — it loads the pickle object into memory again. This is quite useful for people who want to keep retraining the model, say hourly, but don't want to restart their Flask app server again and again. You just hit an endpoint that reloads that particular module; when it reloads, the prediction code picks up the model path again, so the pickle object is read afresh. That is what happens in the ML pipeline too — if you uncomment this part of the code, then after the training finishes and the model is written to the state you asked for, it calls, okay, let's load the models again. So behind the scenes your model is already reloaded; you need not care at all.
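A minimal sketch of that kind of Flask prediction server — the endpoint names, model path, feature order, and the reload mechanism here are my illustration, not the exact code from the repo:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
MODEL_PATH = "/tmp/model.pkl"  # hypothetical path: wherever the Train task wrote the pickle
FEATURES = ["store", "day_of_week", "open", "promo"]  # example feature order, not the repo's exact one


def load_model():
    with open(MODEL_PATH, "rb") as f:
        return pickle.load(f)


model = load_model()


@app.route("/predict_sales", methods=["POST"])
def predict_sales():
    # Expect a JSON payload with one value per feature and return the predicted sales.
    payload = request.get_json()
    row = [[payload[name] for name in FEATURES]]
    return jsonify({"predicted_sales": float(model.predict(row)[0])})


@app.route("/reload_model", methods=["POST"])
def reload_model():
    # Hit this after retraining so the new pickle is picked up without restarting Flask.
    global model
    model = load_model()
    return jsonify({"status": "model reloaded"})


if __name__ == "__main__":
    app.run(port=5000)
```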
I think, yeah, let's go back to the slides. Okay, some of the limitations. Luigi doesn't have a scheduler of its own, so we rely on cron — in my company we have a fraud detection system which runs using Luigi with a lot of data sources here and there, and we still rely on cron for that. Also, in some scenarios, people have looked to Luigi for distributed execution — you can imagine many tasks running in parallel in a cluster mode, with some tasks on one node and some on another. That kind of execution is still lacking. As far as I remember, I have seen a lot of people comment about it in the issue tracker — there are many open tickets — and I'm pretty sure Spotify will come up with something in the next versions; it is something they're looking at. And here are some useful links — please go ahead and play around, there is a lot there. You can also see my GitHub repo here: just click and you have everything, how to install and all these things. Feel free to ping me there. Okay, thank you, and please feel free to reach out with any queries. Before the questions, one more thing for those of you who will be using this for machine learning: there is a known issue that I wanted to point out, which I and many other people have run into, and which I've written up here. If you're using scikit-learn and you do cross-validation or a grid search with n_jobs — the parameter that controls how many jobs run in parallel — set to greater than one, there are scenarios where it will throw a weird error. This is something people are aware of and are working on. When I debugged it, I found the cause to be that Luigi assigns its own process IDs to the worker processes; once that assignment is done and you then ask scikit-learn to spawn multiple jobs on top of multiple workers, that is where it gets into a weird state and throws an error. So just be careful when you use GridSearchCV, or anything with n_jobs, greater than one.
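To make the workaround concrete, here is a hedged sketch — reusing the hypothetical DataPreprocessing task from the earlier sketch — of keeping n_jobs at one inside a Luigi task and letting Luigi's own workers provide the parallelism instead:

```python
import luigi
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


class GridSearchTrain(luigi.Task):
    def requires(self):
        return DataPreprocessing()  # the preprocessing task from the earlier sketch

    def output(self):
        return luigi.LocalTarget("/tmp/grid_search_results.csv")

    def run(self):
        df = pd.read_csv(self.input().path)
        X, y = df.drop(columns=["sales"]), df["sales"]
        # Keep n_jobs=1 here: letting joblib spawn its own process pool inside a
        # Luigi worker is what triggered the weird errors, so leave the
        # parallelism to Luigi's workers instead.
        search = GridSearchCV(
            RandomForestRegressor(),
            param_grid={"n_estimators": [50, 100]},
            n_jobs=1,
            cv=3,
        )
        search.fit(X, y)
        pd.DataFrame(search.cv_results_).to_csv(self.output().path, index=False)
```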
Yeah, any questions? [Audience] I've played around with Luigi a little bit, but my experience is rather shallow. One issue that I haven't got right is that I find it a bit troublesome to build small components, test around them, and play around with them before putting them into the pipeline. The debugging process makes it a little awkward — I feel Luigi has a pretty heavy tool chain and boilerplate wrapped around each task, and I haven't found a way to make small debugging steps convenient. [Speaker] Okay. Spotify's recommendation for using Luigi is to keep your tasks small. Make sure each task does one thing: if you want to read from a server, get the data, process it, and make sure you store the state. And not every piece of software needs Luigi — it's when you know there is a lot of dependency between tasks that Luigi comes into the picture. We actually have two or three systems in production right now at my company, Pocket Math, which work fine with it, and the dashboard with the visualizer helps: if I ask my DevOps people whether something went wrong, they can go straight to that port and say, no, all the tasks ran successfully yesterday. So it becomes easy to manage as well. Whereas imagine you're writing naive Python code and something goes wrong: you need to log into the server, find the logs, and tail them. As for the debugging itself, I'm not entirely sure what exactly the issue is, but we can take it offline — tell me exactly what your use case was and we can have a discussion. Yeah. Any other questions? Okay, thanks.