So hello everyone, I can see who is asleep. Is everyone awake? Don't worry, I'm not going to take a lot of time; I know you folks are already tired and waiting for the next break. Before I start, just to give you a gist: this is not an advanced or intermediate-level talk. It's for people who have just started their career, or who are interested in data engineering and want to see which components and tools are tightly integrated with the Kubernetes ecosystem, how to use them, and how Kubernetes actually helps with building dynamic data pipelines. So if you have already worked with very data-intensive applications or systems on Kubernetes, feel free to take a break now; that shouldn't be a problem. This is mostly going to be a beginner-friendly talk where we'll look at some of the tools and how Kubernetes solves data engineering problems.

Just to set expectations, this talk is basically a bird's-eye view of how Kubernetes is used across the data engineering spectrum. A lot of people may already know this, because the problems Kubernetes solves are the same whether the workload is data-intensive or not: you'll find a lot of overlap on the compute layer and in orchestration. We'll see what the current data engineering landscape looks like, what a modern data stack is and what the hype is about, and we'll also look at a few tools. Of course there are many tools, but I've mentioned the ones I have actually used and experimented with. So if you have a favorite tool, feel free to interrupt in the middle and say, "This is also one of the tools with really great Kubernetes support, I've used it." This talk is not going to be a tutorial sort of thing; I just wanted to say that.

So, starting with the modern data stack: has anyone heard of this term, "modern data stack"? No one? Data engineering? OK. For the folks who don't know what data engineering is, the process is basically this: you collect raw data (you extract it), then you process it, do some transformation on top of it, and put it wherever the end application is going to use it, whether that's analytics or different customer data platforms. It's mostly around that.

And the major differentiator between the legacy data stack and the modern one: if you look at the diagram, on the left-hand side you see how on-prem systems work, where you build your system and deploy it on your own hardware. Even if you use a cloud provider, it's still the same traditional build-and-deploy pattern. In the modern data stack, the tools are built around the cloud native ecosystem. So this is what a typical data engineering and modern data stack process looks like: you have a data acquisition layer where all the data and events come from mobile, web servers, and the different SaaS tools your company uses. This is a very high-level diagram I'm talking about, of business-oriented applications, right?
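To make that extract-transform-load idea concrete, here is a minimal sketch in plain Python (not from the talk; the source URL and field names are made up):

```python
import json
import urllib.request


def extract(url: str) -> list[dict]:
    """Pull raw event records from a (hypothetical) source API."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


def transform(events: list[dict]) -> list[dict]:
    """Keep only purchase events and normalise the fields we care about."""
    return [
        {"user": e["user_id"], "amount_eur": e["amount_cents"] / 100}
        for e in events
        if e.get("type") == "purchase"
    ]


def load(rows: list[dict], path: str) -> None:
    """Write the cleaned rows out; a real pipeline would target a warehouse."""
    with open(path, "w") as f:
        json.dump(rows, f)


if __name__ == "__main__":
    load(transform(extract("https://example.com/events")), "purchases.json")
```

Every tool mentioned later in this talk is, at heart, a way of scheduling and scaling functions like these.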
And in the diagram you have an end-to-end process: take any example of a B2B or B2C app, right? You have different data integration tools where you want to procure or capture these events, and then perform analytics on top of them. So these are the four layers. And if you look at the last part of this particular diagram, which talks about data activation: a lot of new methodologies have come into the picture, like ELT or reverse ETL, and a lot of customer data platforms, where your entire operation is actually performed within the data warehouse tools themselves. You put your tools on top of the warehouse, extract directly from it, and push the data out to the different customer-facing or business-user-facing tools. So this is what a typical modern data stack looks like.

And when it comes to where we use Kubernetes here: precisely, if you look at the layer called data orchestration, that part, together with the compute part, is where Kubernetes is usually used whenever you are building a heavy data workload system. So I just wanted to give you an overview of what the modern data stack looks like in basic terms. We'll also see what Kubernetes helps with in data engineering, whether as a data engineer you need to know Kubernetes at all, or whether you can just focus on traditional data engineering tools if you only want to build data pipelines. We'll look into a few tools, see when to use them and, just as importantly, when not to use them, and then the conclusion.

So, as Pavitra introduced, I'm Abhishek. I mostly do backend engineering; I've been doing that for the last six years, and I'm an aspiring developer advocate as well. I have worked with a few companies doing developer advocacy, and I'm a big Python fanboy; I've been doing Python for a long time, and right now I'm doing a bit of Rust. I also run a few communities across India: Jutri, Chennai, PyCon India, and a few more chapters. If you want to connect with me over LinkedIn or any other social media platform, this is my username.

Now the question comes: why do we need Kubernetes for data engineering, or for whatever data pipelines you build? Can anyone answer that? It's a very obvious one: why is Kubernetes used? No one? Because it's trendy, or... sorry? Yeah, that's right, scaling. But what sort of scaling? So, to keep it short, of course people will have different answers, but under the hood these are the two main reasons why you should be using Kubernetes for your data pipelines or in your data engineering work, and it especially comes down to compute scaling. I have also attached a use case in the later slides showing how Kubernetes is used, but to give you an example, consider event-driven workflows, and this is a very extreme example: whenever an event triggers, you have to start, let's say, 100, 200, or 300 workflows, and it depends on multiple concurrent requests coming in. How do you manage that? Will you use a single tool where you have to keep replicating the same thing by hand? Basically, wherever you need to scale compute capacity, that is where Kubernetes will be used the majority of the time.
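To make that "one workflow per event" example concrete, here is a minimal sketch (again, not from the talk) using the official kubernetes Python client to launch one Job per incoming event; the container image and namespace are placeholders:

```python
from kubernetes import client, config


def launch_job_for_event(event_id: str) -> None:
    """Create one Kubernetes Job per incoming event; the cluster
    scheduler decides which node has capacity to run it."""
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"pipeline-{event_id}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="worker",
                            image="registry.example.com/pipeline-worker:latest",  # placeholder image
                            args=["--event-id", event_id],
                        )
                    ],
                )
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)


config.load_kube_config()  # use load_incluster_config() when running inside the cluster
for i in range(300):       # 300 concurrent events become 300 Jobs
    launch_job_for_event(str(i))
```

Kubernetes packs those Jobs onto whatever nodes have room, which is exactly the compute scaling you cannot get from a single replicated worker.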
So let's say you have a job that needs very heavy compute. For example, you have a Spark job running, and if you want to distribute the work you run it on multiple nodes; at that point you can use Kubernetes. And then again, think about the way we build data pipelines these days. Is there anyone who is not using Docker in this room, for any app? Looks like everyone is using it. So we know that whatever application we build, we dockerize it, and the same applies to data science workloads, data pipelines, and ML models as well: we write the code, bundle it, containerize it, and then deploy it. That's how it works. But when you have to scale dynamic workloads, and by dynamic I mean the same example I explained, you need to orchestrate the containers, or the entire data pipeline, and that's when you use Kubernetes. These are the two main reasons.

Just to dig into a few more advantages Kubernetes gives you on top of that: as I explained already, the reasons Kubernetes is used in traditional systems apply here almost unchanged. Everyone loves using Docker, whether it's the DevOps or SRE team, or an ML engineer who writes code, bundles it, and hands it over. And again, at the orchestration level: if you want to orchestrate those data nodes, how do you do that? That's where Kubernetes comes in, with its declarative definitions. Let's say you are running a data pipeline and you get an error saying the job ran out of memory and failed. How do you fix it? Because Kubernetes uses declarative definitions, you can just go and increase the memory in the spec, rerun the job or that part of the pipeline, and voilà, your data pipeline runs again and completes this time. This is just one example.

And whenever we talk about building data pipelines, it's not just one or two engineers, right? We have data engineers and ML engineers, we have data scientists who work on whatever comes out of the pipeline, and then we hand over whatever we have built to the SRE team. At the enterprise level, and I'm not talking about a one- or two-person team doing everything, that's how it's going to be. The way it works is you define everything once and then everyone uses it, so the handover happens via containers, which again is easy. And when your data grows and you want to increase the execution power of your platform, Kubernetes helps there too. Going back to the earlier example, if you want to run a job as multiple parallel tasks, you can put those tasks on, let's say, five to six nodes and it will just run. So as your data grows, you can scale with it. And when I say iterating faster: the first version of whatever pipeline you build is of course going to be a crappy version, and over iterations you build more things on top of it.
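As a rough illustration of the Spark-on-many-nodes example, this is what pointing a PySpark job at a Kubernetes cluster can look like; the API server address, image, and data path are all placeholders, and a real setup needs auth and registry configuration on top:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://my-cluster.example.com:6443")  # placeholder API server
    .appName("distributed-etl")
    .config("spark.kubernetes.container.image", "my-registry/spark:3.5")  # placeholder image
    .config("spark.executor.instances", "5")  # spread the job across 5 executor pods
    .config("spark.executor.memory", "4g")    # the declarative knob to turn after an OOM failure
    .getOrCreate()
)

# From here it is ordinary Spark; Kubernetes schedules the executor pods.
df = spark.read.parquet("s3a://my-bucket/events/")  # placeholder path
df.groupBy("event_type").count().show()
```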
So whether you want to enhance the ML model you have used, or you want to do A/B testing with different parameters depending on different users or a part of the cluster, it's easy to do with Kubernetes once the underlying low-level stuff is done. And because I'm not going to show a demo, here is one use case: a company called Zoom, a Berlin-based e-commerce app, has explained how they used Kubernetes with Spark to streamline their entire data engineering pipeline. You can go ahead and read that.

Coming back to tools: these are the ones I have used, and my favorites, because when I started using Kubernetes I was completely lost. There are so many components, and to be honest I had no clue which one to use when; I just wanted to get things done. So, how many of you have used Apache Airflow? Anyone? Okay, how do you like it? Good. Apache Airflow is like the grandmom of all the data engineering tools; it came at the very start, being used as a scheduler and what not. But I don't like it. Just to give a gist: Airflow lets you write your entire data pipeline in Python code, structured the way DAGs work, so you have to write your code in that shape. A lot of people go and start with Airflow for simplicity's sake, because it's a very popular data engineering tool, but then they can't scale much, because when Airflow first came out it did not have support for scaling compute. Just as Postgres works best on a single node, the same went for Airflow. Later on, though, you got a Kubernetes operator, and you can run your Airflow jobs on the cluster with it (there's a minimal sketch of this pattern after this part).

The second one is Argo. I'm sure people must have used it; typically I have seen that in the cloud native ecosystem, companies that use a lot of Kubernetes tooling usually use it for CI/CD pipelines, if I'm not wrong, and correct me if I am. But you can actually use Argo to build, and scale, your data pipelines on Kubernetes. If you are a fan of writing YAML, go ahead and use it; Argo is based on that.

And the third one is Prefect, one of my favorites. Airflow has a lot of drawbacks when it comes to creating dynamic DAGs. When I say dynamic DAGs, consider one DAG as one pipeline: if you want to pass dynamic parameters, you have to rewrite everything again, and you can't, say, create five pipelines or five DAGs in a for loop, at least not in Airflow version one. So you end up having to define everything up front; in fact, if the problem exists at the business level itself, you probably have to streamline the entire schema of how the data comes in, and that's the only way to do it.
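Here is the minimal sketch of the Airflow-on-Kubernetes pattern mentioned above, assuming a recent Airflow 2.x with the cncf.kubernetes provider installed; the DAG id, image, and script are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="etl_on_kubernetes",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Each task runs in its own pod, so heavy transforms no longer
    # compete for memory on the Airflow workers themselves.
    transform = KubernetesPodOperator(
        task_id="transform",
        name="transform-pod",
        image="registry.example.com/etl:latest",  # placeholder image
        cmds=["python", "transform.py"],
        get_logs=True,
    )
```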
Now, Prefect gives you a solution on top of this, with a really good abstraction level. You don't have to write your code as a DAG; instead you can make each function a task in a workflow, and you can pass dynamic parameters and all of that. It works the same way on Kubernetes as well: if you want to put the work on five to six nodes in a Kubernetes setup, you can do that too. Each task can be something like sending a Slack notification, or reading data from PostgreSQL, doing some operation on top of it, and putting the result back somewhere else or into some other tool. (There's a minimal sketch of this at the very end.) And then Kubeflow and Dagster work in almost the same way. Again, it's a matter of preference, of how you started, and of what's more of an industry standard where you work, but I love Prefect, I have used it a lot, and Dagster as well. Since I come from a Python background, I have a bit of a bias for Prefect and Dagster.

Again, when to use it: Kubernetes is not the solution for all your data engineering problems. If you need to scale your pipelines to, let's say, more than 1,000 runs, just to give a hypothetical example, it makes sense. But if your pipelines are running fine and you are able to scale on a single server, that's fine; you don't need Kubernetes to run your pipelines. If you want to automate ML model management, go ahead; if you need a demanding level of MLOps, go ahead and use it; and as I explained about experiments, if you want to do a lot of A/B testing at the cluster level, you can do that; and of course there's data lineage. Just add a "not" to all three bullet points: if you don't want to do those particular things, don't use Kubernetes. So yeah, these are the main reasons, I would say.

In conclusion: to be honest, it was initially very hard for me to get started with Kubernetes, especially for deploying data pipelines. It's not a one-stop solution, but the modern data stack tools we have seen give you a really good abstraction layer, where you don't have to put in a lot of effort; you can just use these tools and leverage the power of Kubernetes. So consider Kubernetes as a GPS if you're a data engineer: it helps you navigate when you want to do cloud native things and don't want to get lost in containers. So, thank you. If you want to access the slides, this is the URL. And, yeah, okay, do we have time? Yeah, okay, so if you have questions, we can take them off the stage, no problem.
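For reference, the Prefect pattern described above (plain functions as tasks, parameters passed per run) looks roughly like this with the Prefect 2.x Python API; the source names and numbers are made up:

```python
from prefect import flow, task


@task
def extract(source: str) -> list[int]:
    """Stand-in for reading rows from PostgreSQL or any other source."""
    return [1, 2, 3]


@task
def transform(rows: list[int], factor: int) -> list[int]:
    return [row * factor for row in rows]


@flow
def pipeline(source: str, factor: int = 2):
    """Plain Python functions become the pipeline; no DAG file needed."""
    return transform(extract(source), factor)


if __name__ == "__main__":
    # The dynamic part that was awkward in early Airflow:
    # a loop creating five parameterised pipeline runs.
    for source in ["orders", "users", "events", "payments", "sessions"]:
        pipeline(source, factor=3)
```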