Hi, hello. Namaste. I'm originally from Nepal, and I'm currently living and working in Germany. Today I'm going to talk about building data workflows with Luigi and Kubernetes.

Before we start, a few things about me. My name is Nar Kumar. I currently work at Breuninger, a traditional fashion house, mostly popular in South and West Germany, which is now expanding quite rapidly into the online e-commerce space. I work on the data lake team there, and we use Python and Luigi with Kubernetes, running on Google Cloud. I was a web dev in a past life, and then companies came to me wanting me to do data, so I moved into a more data-engineering role. You can find me on the internet under my last name.

Before I start, can I ask one question? Do you work in data engineering? It's kind of a new term. How many of you are doing data engineering? It's slowly growing, actually quite a lot more than I expected. How many of you are using Luigi? How many of you are using Airflow? Okay, pretty decent.

I'll introduce Luigi first, because Luigi is pretty small and really lightweight; I think if you just read Luigi's README, you get the idea. Then I'll talk about how you can run Luigi on Kubernetes, and I think that part is pretty interesting for this crowd.

A few things about Luigi. It's a workflow and pipeline tool; if you are using Airflow, it's pretty similar. It was open-sourced by Spotify, and it's already pretty mature, actually pretty old. You can write your data flows, basically your pipelines, as normal Python code, which is always a plus point. It's really lightweight, and it comes with a basic web UI to see the jobs and so on. And it has tons of contrib packages: you can use it as an orchestration tool to run Hadoop jobs, you can run BigQuery, you can work with AWS if you need to fetch a file or run a query, things like that. So you can do almost everything with really few lines of code. And it doesn't have a scheduler, by the way; I'll come back to this point later in my talk.

To demonstrate, let's assume a use case. Say you work at a company that runs an ice cream franchise. They have a lot of ice cream shops, and the managers, your colleagues, the data analysts want to see yesterday's daily sales. Every morning they come to work and want to see what happened yesterday, how much the company sold. We need to do a few things to achieve this. You have a prod database where all the transactions happen. You want to dump the prod database somewhere else, because you don't want to run aggregations on the prod database; otherwise you can kill it. Then you want to ingest that into an analytics database. It could be Redshift, it could be BigQuery, it could be Postgres, whatever. Then you run the aggregation on the analytics database, update the dashboard, and send it out to everyone.

Since we are at a Python conference, and I know all of you are really good Python developers: you write an awesome Python script and schedule it with cron. Dump the sales data, load it into the analytics database, aggregate the data, and then, cool, profit, right? It looks pretty great. It runs, it works. Maybe not: we have a couple of issues with this implementation. What happens when the first job fails?
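The crontab itself isn't in the transcript, so here is a hedged sketch of what that naive schedule might look like; the script names and times are my assumptions, not the exact ones from the talk:

```
# Hypothetical crontab: three scripts spaced an hour apart, each one
# simply hoping the previous step finished by the time it starts.
0 1 * * * python dump_sales_data.py
0 2 * * * python load_to_analytics_db.py
0 3 * * * python aggregate_and_report.py
```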
If you look here, we scheduled these hourly, because we think dumping the database takes a bit of time and the ingestion takes a bit of time, so we arranged the cron entries to start one after another, assuming the first job finishes within one hour, the second one finishes within one hour, and so on. One hour, because everybody is doing big data, right? I mean, we have big data. So we have a couple of problems here. What happens when the first job fails? What if it takes longer than one hour? What if you have to run the same report for the last five days, or the last month, or the last year, which can happen? How do you check whether all of these jobs ran successfully? And if you mistakenly run it multiple times, what happens to your dashboard? Is it broken, or what happens?

Now that we know the use case and have seen the plain Python implementation, I want to show you the Luigi implementation. Let me open it. So this is my Luigi implementation. I put everything in one file, and yes, I already hear some laughs. I put everything in one file, and you can see plain SQL queries in there, so don't worry about that. What I want to point out is how you implement a task, a job, in Luigi. I can increase the font size; is this good enough?

The first thing I wrote is a dump-database task. Then I have a load-to-analytics-DB task, and then an aggregate task. The one important thing you see here is that the load-to-analytics-DB task depends on the previous task, the dump-database task. The aggregate task depends on the previous one, and the last one depends on the one before it. This is how you chain a series of batch jobs together in Luigi.

Running this is pretty easy. I'm currently using pipenv, so I trigger it like this: go to the directory and just run it. The output contains a lot of stuff, but you can see we ran four tasks here, and then we have the report. Looks pretty okay. And in the UI it looks like this: first dump the database, then load it into our analytics database, then the aggregation, and the send-report task at the end.

So we solved some of the problems I mentioned earlier. What happens when the first job fails? The tasks are chained together, so the next one just stays in a waiting state. We also solved the second one: if a job takes longer than an hour, the next one just keeps waiting. I'll come to the other points later.

So we saw that you can chain multiple batch scripts or batch jobs together and run them with a simple command-line invocation. And since this reporting has to run in production, you need some way to trigger the task on a schedule. As I said earlier, Luigi doesn't have a scheduler, so usually cron is used. If I had to run this with cron, it would look like this. I was pretty surprised when I saw that Luigi doesn't have its own scheduler, because it's pretty common for this kind of pipeline tool to have one. But it turns out that's actually not bad; it's a design decision made by the Luigi team. Not having a scheduler means you are really flexible to do whatever you like.
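The code from the screen isn't in this transcript, so here is a minimal sketch of a task chain like the one described, assuming local files as targets and purely illustrative names; the real pipeline talked to an analytics database instead:

```python
import luigi


class DumpDatabase(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # Luigi uses the existence of this target to decide whether the
        # task already ran, which is what makes accidental reruns safe.
        return luigi.LocalTarget(f"dumps/sales_{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("...dump of yesterday's transactions...")


class LoadToAnalyticsDB(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Declaring the dependency is all it takes to chain the jobs:
        # this task will not start until DumpDatabase has finished.
        return DumpDatabase(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"loaded/sales_{self.date}.done")

    def run(self):
        # ...load the dump into Redshift/BigQuery/Postgres here...
        with self.output().open("w") as f:
            f.write("done")


class Aggregate(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return LoadToAnalyticsDB(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"reports/daily_sales_{self.date}.csv")

    def run(self):
        # ...run the aggregation query and write out the report...
        with self.output().open("w") as f:
            f.write("shop,total\n")
```

You would run this locally with something like `python -m luigi --module tasks Aggregate --date 2019-07-10 --local-scheduler` (assuming the file is called tasks.py). Because a failed upstream task means the downstream targets never appear, the dependents simply wait, and rerunning the report for the last month is just a loop over date parameters.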
You are really flexible to run it from different places, and one of them is Kubernetes. That's what I'm going to talk about next.

So what is Kubernetes, and what is a cron job in Kubernetes? You can run a lot of stuff on Kubernetes. You can run normal services, like web apps, which run 24/7. You can run Jobs, which do one particular thing and then are done. And a CronJob is basically a Job scheduled with cron syntax. Jobs are also called "run to completion": Kubernetes runs them, and when there is an error or a failure it reschedules them, so it always tries to complete the job.

We saw earlier that it was pretty easy to run from my local machine, but what about Kubernetes? I have a simple Kubernetes setup on my machine: I installed Minikube, so I have a Minikube cluster running. If you look here, this was my Luigi code. What I did is build a Docker image out of it, which is pretty easy: nine lines of Dockerfile and you have an image. I uploaded this image to a Docker registry and deployed it to Kubernetes, which right now is Minikube on my laptop. The Kubernetes deployment looks like this, and running on Kubernetes you can see the command, running Python and Luigi and so on.

This is the setup I have here. You can see I have one CronJob, and I also deployed the Luigi demo itself on Kubernetes, so you can see the Luigi deployment and the Luigi UI; this is the UI of the Luigi instance I deployed. I did all this setup beforehand because we don't have much time, but if you are interested in the setup, I have a project on GitHub you can easily follow.

Let's see. I have one CronJob here, and I want to run it. It's scheduled from 7 to 16, and right now it's 16:19, so it won't run by itself. But I can also trigger it manually using the kubectl command-line tool. Let's see what happens. I should see a pod in the running state, and there it is. Now in our Luigi UI you can see a couple of tasks in the pending state, some already done, and one running: the aggregate task is in the running state. This is how you track the progress of the jobs. Some of them are done; if there were an error, you would see it here; the others are just waiting. It looks pretty cool.

So we are able to run this on Kubernetes. Kubernetes has this concept of a CronJob, which creates a Job, which creates pods. It sounds a bit like a Russian doll, but it's pretty easy once you get into it. That's what Kubernetes did here: I deployed the CronJob, it created a Job (exactly one, in this case), which created a pod, which is still running.

So we can run Luigi on Kubernetes the same way I ran it locally, but this has quite a few benefits, the main one being the scalability of your pipeline. Since it's on Kubernetes, you can scale elastically. In our case we use Google Kubernetes Engine. That means after midnight, around two, we have hundreds of jobs running, and the cluster automatically scales up because the load is really high during that time. After all the jobs are done, those Kubernetes nodes go down again. We only have two dedicated nodes; the rest is dynamic scaling. I personally find this really powerful.
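The manifest isn't in the transcript either; here is a hedged sketch of what such a CronJob could look like. The image name, module, task, and scheduler hostname are my assumptions, and batch/v1beta1 was the CronJob API version around the time of the talk:

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: daily-sales-report
spec:
  # Hourly from 07:00 to 16:00, matching the schedule shown in the demo.
  schedule: "0 7-16 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: luigi-pipeline
              # Hypothetical image built from the nine-line Dockerfile.
              image: registry.example.com/luigi-demo:latest
              command: ["python", "-m", "luigi",
                        "--module", "tasks", "Aggregate",
                        "--scheduler-host", "luigi-scheduler"]
          # "Run to completion": failed pods are rerun until the job succeeds.
          restartPolicy: OnFailure
```

Triggering it manually, as in the demo, can be done with something like `kubectl create job manual-run-1 --from=cronjob/daily-sales-report`.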
But apart from that, you get all the benefits of Kubernetes, from containerization to easier deployment to flexibility with the infrastructure and so on. So that is it. It was a bit rushed, but you can follow the example on my GitHub and, if you're interested, just try it. I personally find it very powerful that Luigi, being lightweight, is really easy to containerize and deploy on Kubernetes. As a result, you can build complex batch processes which are easy to scale and maintain, and you get the benefit of both: the pipeline side from Luigi, and the infrastructure side from Kubernetes, like horizontal scaling and the deployment story I mentioned earlier.

So, in short, that was it. I have one hidden slide: my team is hiring. If you like Python, if you like Kubernetes, that kind of stuff, please feel free to talk to me. We are based in Stuttgart, just two hours from here, and we are really looking for developers. With that, any questions?

Thanks, Nar. Any questions? Yes, please, sure.

Thanks for the talk, great, awesome. A question: if you have users or customers in multiple time zones, do you have experience with what the best substitute for a crontab is to manage tasks across multiple time zones?

I personally didn't really have to deal with that, because our developers and our users are all in the same time zone. So unfortunately I have no good answer for that.

Okay, the gentleman? I have a question over here. I'm here; you're looking in the wrong direction. Yeah.

So: does Luigi support resuming jobs? Meaning, as a workflow management system, if you're in the middle of your job and something fails, can you go and fix it and just restart from where it failed?

Yes, Luigi handles that. In my setup, the one with Kubernetes I showed you, Kubernetes actually handles it, so we are not using the Luigi feature: whenever something fails, Kubernetes runs it again. We use that part. But Luigi can also do it.

Hi, thank you for the talk. I was wondering, have you encountered situations where you need different Kubernetes pods to have different rights, and how do you manage that? For example, you want one job to run with production rights but another job to run with dev rights, not the same rights?

Yes. We usually use service accounts to run jobs: one job has one service account, another job has another service account, and the service accounts have limited permissions. If a job needs storage access, you give storage access only to that service account. That's how we do it.

Thank you.

Okay, thank you for your talk. We are currently at the stage where we are evaluating Luigi versus Airflow, so I'm looking for a recommendation. What we saw is that it was pretty hard to write code that expresses nested steps and a complex hierarchy of tasks. We found a way to do it with Airflow, and I guess maybe there is a way to do it with Luigi and we just missed it. Do you know how to do that?

With Luigi it's a bit complicated. The branching model is supported, but it's a bit hard; Airflow definitely has a better mechanism in this case. With Luigi, we actually don't do that kind of super complicated branching or nested jobs.

Okay. Really, really quick: did you work on a use case where you have an NFS mount in your Kubernetes image?

Sorry?

Did you mount a disk into your Docker image?

Mount a disk?
We have a problem where we need to access a large quantity of data, and we want to move to Docker instead of accessing the local disk. One barrier we have is mounting NFS data into the image and accessing it. My question is: did you have a similar use case?

Not really. With Luigi we usually only process smaller data sets, and Luigi is mostly the orchestration part. All the heavy lifting we actually do with other tools, like BigQuery and so on.

Okay, thank you.

Hi. I was one of the few hands raised when you asked who was using Airflow, so I was wondering what your reason was for using Luigi over Airflow, if there is a particular one.

Yeah. Luigi and Airflow are both Python things, but they really have different use cases. Airflow is really big and comes with everything you need. However, one main drawback was that the Kubernetes support is still not so good. They recently added Kubernetes support, where each task can run in a Kubernetes pod, but it's not stabilized yet, and that is one of the main reasons we prefer Luigi. Other than that, with Airflow you have to copy your pipelines into some directory; you cannot really build Docker images and deploy them like I did with Luigi. So packaging, deploying, and running Luigi is a lot easier and more flexible. And with Airflow you are always in the Airflow way of doing things. It's a really heavy thing: it gives you a lot, but it's also not that flexible.

Hi, thanks for the talk. A question: how do you handle the logs? When integrating with Kubernetes, do you get the logs from the pod in Luigi?

Since we use Google Kubernetes Engine, the logs are sent to Stackdriver by default. As you saw during the run, Luigi basically writes to stdout, and that is sent to Stackdriver.

Okay, and how do you handle secrets? If secrets are pushed to the logs, do you...

We use Kubernetes secrets, so they are not in the logs.

Okay. Is it possible to configure Luigi to run jobs not just based on a crontab, but also on, I don't know, a Redis message queue, or a file being dropped in a directory, something like that? And how does that work? Because you were starting a Docker container for every run, but for that, something would have to be running all the time.

That's actually a really interesting use case; we are also thinking about that. There is no built-in way to do it, so you have to build something yourself. If you upload a file to storage and then want to trigger Luigi, you have to build that part yourself. As I showed earlier, running Luigi is just a command-line invocation, so you would run that command when the file is uploaded. That part you have to implement yourself; there is no default way to do it.

Thank you for your talk. When you showed your example with the code and the dependencies between the tasks, it looked like the task definitions are coupled quite strongly to the code that actually runs, the Luigi task definition part and the stuff being executed. How flexible is this, and how strong is the vendor lock-in if someone decides to use Luigi and later wants to switch to something else? Is that complicated?

Airflow also has a similar kind of contrib system, and Luigi has really a lot of contribs. I mean, it's open source, so it's not really a vendor lock-in thing.

That's not really what I meant.
It's about how deep the integration is: is the management of the tasks and the definition of the dependencies tightly coupled to the code itself, or are the scripts you run separate things, where you just define the tasks separately and can, for example, add some command-line options for what you want to run?

I think I got your point. What we do is build one pipeline for one particular piece of work; building one report, for example, is one pipeline, and that is one Docker image for us. So we have hundreds of images to handle all the workloads for the company. Does that answer your question?

It's going in a slightly different direction, but maybe we can take this offline.

Yes, we can also talk later.

Okay, we could probably fit one more short and quick question. Yes, please, and probably the last one.

(Question inaudible; it was about how to test Luigi pipelines.)

Yes, actually testing Luigi is really, really easy, because it's really lightweight. We use pytest to test it. We have some extensions built to support Luigi, but generally it's really easy; it's like any other Python code.

So thanks for joining me. I really liked that there were a lot of questions. I was a bit afraid when I submitted the talk, thinking there might not be enough interest, but I'm really excited that there were so many questions. If you use Python in data engineering, I really want to talk to you; please come to me. Thank you.