Well, hello people, we continue with the afternoon talks, and now we are going to have Lim talking about reproducible and deployable data science with open source Python. How are you, Lim? I'm good, I'm good. Can you hear me okay? Yeah, all perfect. Where are you streaming from? I'm streaming from the office, actually. I went to the office so that I have some private space for this one. Nice. Is this your first EuroPython? Yes, this is my first EuroPython. Very excited about it as well. Yeah, it will go great. Well, if you are ready, we're going to share your screen and start. Good luck.

Can you see the slides? All right. Hi everyone, very excited to be here today to talk to you about reproducible and deployable data science with open source Python. Before I continue, I'm just going to check the chat to see if everything is okay. Yes, okay, everyone can see my slides and my screen. That's great. My name is Lim, and I'm a software engineer. My background is in full-stack software engineering, and I mostly worked for startups before; some example companies I have worked with include Deliveroo and Memrise. Right now I'm working in the data domain at QuantumBlack Labs. QuantumBlack is the advanced analytics consulting arm of McKinsey, and I'm building data products for data scientists, data engineers and machine learning engineers to accelerate their projects. I love open source Python, and currently I'm a core contributor on Kedro, one of the open source tools explored in this talk. I'm limdauto everywhere on the internet; it means Lim's big head in Vietnamese.

The agenda for today: I will set the scene, in which you need to deploy a realistic Jupyter notebook into production. I will explore some of the challenges that you might face on this journey and how you can move from a notebook environment into a standard Python project with Kedro. Then I will explain how we can use Kedro's extensibility to integrate with other tools in the MLOps ecosystem, cover a few deployment strategies we can use to deploy the project, and finally look at how we go from one project to hundreds of projects in the future.

So with that in mind, let's imagine the following scenario. A media organization wants to provide movie and video recommendations to its users. As we have seen in real life, if done correctly, this can have a material impact on the company's bottom line. And this is the organization's first attempt at adopting machine learning and data science in its workflow. One of their very talented data scientists has conducted extensive research into different algorithms and architectures for building recommendation systems, and she has organized her findings as a series of Jupyter notebooks. A disclaimer: all of the notebooks and data science code in this presentation are adapted from Microsoft's Recommenders repository, where they detail best practices for building recommendation systems. I cannot recommend it enough. That's a bad joke, sorry.

All right. When it comes to notebooks, there are a couple of different perspectives. On the one hand, the interactivity, fast feedback loop and convenience of a Jupyter notebook are almost unbeatable, especially for exploratory data analysis and rapid experimentation. On the other hand, notebooks pose some challenges for reproducibility, maintainability and operationalization. Specifically, notebooks make it hard for team members to collaborate with each other.
I know that there are some recent, really cool projects that address this issue, but out of the box, Jupyter notebook is essentially a single-player game. It's also quite hard to conduct code review in a notebook format, and some of the other code quality controls that we usually enjoy in software engineering are hard to implement, such as writing unit tests, documentation generation, linting, and so on. Notebooks can also give you a false sense of security because they cache their results: when you see the output cell of your notebook and it shows the correct results, it might make you think that your code runs without errors, even though the logic has changed. All of these problems combined cause what I think is the biggest issue of them all, which is consistency and reproducibility in your project. In a very interesting study in 2019, New York University executed 860,000 notebooks they found on GitHub, and they found that only 24% of them ran without error and only 4% of them actually produced the same results.

So with that in mind, I'm going to explore some of the strategies we can use to turn a notebook environment into a standard Python project using Kedro. Disclaimer: I'm a core contributor to Kedro, so this talk will be biased. So what is Kedro? It is a framework for you to build reproducible, maintainable and modular data science code by applying software engineering best practices, such as separation of concerns and versioning, to your code. It was created at QuantumBlack from our own battle scars of delivering data science projects to our clients. It's used at startups, major enterprises and in academia, and it's fully open source. But instead of selling you on this tool from the top down, I would like to explore some principles that motivated us to build Kedro in the first place and some of the problems that it is trying to solve.

The first problem is data management. Data management in a Jupyter notebook has a few challenges. Whenever I come to a new notebook, my first instinct is to ask a few questions: What are the datasets in this notebook? Where are they stored? How is the data loaded? What are the formats? Can I reuse my data loading procedure for similar datasets? How can I incorporate new datasets if necessary? In our example notebook, you can see that it's actually miles ahead of the curve, in that it factors out all of the data loading logic into a reusable library for the MovieLens dataset. But it's not quite ideal, because it's programmed directly against the specifics of this dataset and is hard to reuse later on. If you have more datasets later, you will have to build more libraries.

The way we solve this in Kedro is that we provide a declarative data management interface through a YAML API as configuration. It separates the what, the where, and the how of data loading. And when you stack all of these declarative dataset definitions together in a centralized data catalog, it gives you instant clarity on which important datasets are used and persisted in your project, even for non-technical team members. It supports a number of features, such as interpolation for dynamic values like environment variables. It also supports changing dataset definitions between different environments: local development, staging, production, and so on. A common use case is that you would use a smaller dataset in local development for rapid iteration and bigger ones in staging and production.
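To make the declarative style concrete, here is a minimal sketch of a two-entry catalog, assuming Kedro's DataCatalog.from_config API and the built-in pandas dataset types; the dataset names and file paths are hypothetical:

```python
# A minimal sketch of the declarative catalog style, assuming Kedro's
# DataCatalog.from_config API and the built-in pandas dataset types.
# Dataset names and file paths are hypothetical.
import yaml
from kedro.io import DataCatalog

catalog_config = yaml.safe_load("""
movies:                                   # the "what"
  type: pandas.CSVDataSet                 # the "how"
  filepath: data/01_raw/movies.csv        # the "where"

model_input_table:
  type: pandas.ParquetDataSet
  filepath: data/03_primary/model_input_table.parquet
""")

catalog = DataCatalog.from_config(catalog_config)

# Business logic only ever talks to the catalog, never to file paths
# or pandas IO functions, so the storage details can be swapped out
# (for example to a Spark dataset) without touching the code.
movies = catalog.load("movies")
```

In a real Kedro project this YAML would live in conf/base/catalog.yml rather than in code, which is what the demo below walks through.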
It also supports data versioning, partitioning and incremental loading through different dataset types and configuration. It promotes security best practices by letting you access data without leaking credentials, and it's extensible through custom datasets: if you have a particular use case that we don't support out of the box, it's very easy to write a custom data connector and use it in your project. I think the biggest benefit of declarative data management is that it provides a consistent interface between business logic and different I/O implementations. It abstracts away the differences between data sources, processing engines and data formats, and out of the box we support many of them. It also promotes the reusability of your data connectors, as they are separated from your business logic, so you can swap them in and out without changing the data science code.

In data science code, parameters and configuration are also important; by parameters I mean both project parameters and the hyperparameters of your model. In the screenshot here you can see that I have configuration for different tools that I integrate into my project, such as Great Expectations, which we will see in a few minutes, as well as MLflow and Spark, all in the same place, and my parameters are managed in the same way in a YAML file. The trade-off here is that a YAML domain-specific language works very well for small and medium-sized projects, but it becomes unruly for massive ones with hundreds of datasets, even with good IDE support. To mitigate this, Kedro supports splitting your data catalog into multiple YAML files. It also supports templating to avoid repetition, at the expense of some readability, and you can use other YAML-native features, such as anchors for reusable blocks, to improve your configuration.

I'm just going to show you a very quick demo of how this looks in a real project. This is our data catalog; it's located under conf/base/catalog.yml in my project. As I mentioned earlier, this is the base environment, but you can also create more environments, such as staging and production down here, to override your catalog definitions in different environments. And I really like a feature in VS Code, the Outline view, where you can see the outline of your catalog, so it's very clear which datasets are used in this project. To demonstrate the idea that we can swap datasets in and out, I'm just going to change this one into a Spark dataset, spark.SparkDataSet with file format CSV, and it will work the same way with a different processing engine for your code. So this is the data catalog, and I hope it gives you some idea of how it can help you with the data management side of a Jupyter notebook.

The next bit is about how you manage the code in your project. The challenges with managing code in a Jupyter notebook are that cells need to be run in a specific order, there are global-scope variables that may or may not have been initialized, it's hard to unit test specific cells in isolation, and it's still necessary to factor out common logic into Python utilities outside of the notebook to prevent the notebook from becoming polluted. So out of the box, Kedro gives you a few simple but powerful coding patterns and abstractions to help you manage the code better in a project.
The first thing is that business logic in Kedro is written as pure Python functions. There are a lot of benefits to this, but one of the biggest is that you can unit test them in isolation. You can also use other tools in the Python ecosystem that work with functions, such as decorators and function composition, to help you write more modular code. These pure Python functions can then be connected together and used in a bigger pipeline through a concept that we call a node. A node is just a thin wrapper around a function, with inputs and outputs which are dynamically injected at runtime from the declarative datasets and catalog that you saw earlier. The pipeline shape is always a DAG, a directed acyclic graph, by design, so there are no cycles in your data flow. And algebraically speaking, a pipeline is just a set of nodes, so pipelines can be concatenated together to form bigger pipelines.

In this example, you can see that I have three nodes in my data processing pipeline: one to clean my ratings data, one to clean my movies data, and one that uses the outputs of those two nodes as the inputs of my create-model-input-table node. This is my data processing pipeline, and I can create my data science pipeline in the same way. In the end, I just concatenate them together, because at the end of the day a pipeline is a set of nodes, so it has these algebraic properties. You can build pretty big pipelines this way, iteratively and modularly. This is a demo that we have online, but I would also like to show you an example article that one of our users wrote on Medium, and a screenshot of their pipeline, which I think is quite massive. This is a pipeline that the company runs in production; it's one of the biggest telecom companies in Indonesia, I think.

One thing worth pointing out about the coding patterns in Kedro is that the topology of your pipeline is dictated by the data flow. As we saw before, you connect the nodes using their inputs and outputs, so inherently a Kedro pipeline is data-centric: it's a DAG of data. If you're familiar with other workflow engines, such as Airflow or Prefect, the pipeline there is a DAG of tasks rather than data. So in a way, with Kedro you already get table-level lineage of your data for free, together with the transformation logic that produces it. It's not column-level lineage, but it's a good start. So those are the coding patterns that help you extract the code from a Jupyter notebook and put it into a Python project in a maintainable way.
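Here is a minimal sketch of the node and pipeline pattern just described, assuming the node and Pipeline APIs from kedro.pipeline; the function bodies and the movie_id join key are made up for illustration:

```python
# A minimal sketch of pure functions wrapped into nodes and combined
# into a pipeline. Function bodies and the movie_id join key are
# illustrative; the structure mirrors the three-node example above.
from kedro.pipeline import Pipeline, node


def clean_ratings(ratings):
    # Pure function: no IO, no globals, easy to unit test in isolation.
    return ratings.dropna()


def clean_movies(movies):
    return movies.drop_duplicates()


def create_model_input_table(ratings, movies):
    return ratings.merge(movies, on="movie_id")


data_processing = Pipeline([
    node(clean_ratings, inputs="ratings", outputs="clean_ratings"),
    node(clean_movies, inputs="movies", outputs="clean_movies"),
    node(
        create_model_input_table,
        inputs=["clean_ratings", "clean_movies"],
        outputs="model_input_table",
    ),
])

# Pipelines are just sets of nodes, so they concatenate:
#     full_pipeline = data_processing + data_science
# The input/output names refer to catalog entries (or in-memory
# datasets), which is how the DAG of data gets wired together.
```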
The next big thing that I want to talk about is the development experience in Kedro, because as we know, the development experience in a Jupyter notebook is amazing, especially when it comes to exploratory data analysis and rapid experimentation. It also has really great communicative utility: for example, I can come into this recommendation system algorithm notebook and understand very easily what's going on. I think this is a great strength of notebooks that we try to match with the tooling we provide in Kedro. The first thing is that we provide interoperability with Jupyter notebooks for exploratory data analysis by allowing you to use Kedro constructs inside the notebook. For example, you can use the data catalog to load and save data within the notebook itself before you move on to writing code in an IDE. We also allow you to embed a pipeline visualization within a notebook, so that you can visualize the shape of your data flow while you explore your data.

But beyond the notebook environment, we provide a very powerful CLI to help you run your project iteratively. There's a run command that supports running the whole pipeline, running the pipeline in different environments, running different sub-pipelines within the main pipeline, running a single node, and overriding parameters at runtime, so that, for example, if I want to try different hyperparameters for my model, I can run my pipeline many times with different parameters just to see how it behaves. There's a long list of run options, and most of them are powered by the fact that our pipeline is just a set of nodes, so you can filter it in any way you like. Beyond the commands we provide, we also allow people to add their own CLI commands, either through plugins, for example, the visualization tool that you saw earlier is built as a plugin and exposed as a Kedro command, and you will also see another plugin later, kedro-airflow, which creates an Airflow DAG from the Kedro pipeline, or within your own project, as you can see down here, where you create your own run command in your project's cli.py. So it's a very extensible way to add to your development experience in Kedro. Here are some example CLI extensions built by our community: there's a kedro-mlflow plugin that provides commands to interact with MLflow, and there's a kedro-diff plugin, which is really cool, where it shows you the diff of your pipeline between different git branches.

Another tool that comes out of the box with Kedro is a powerful pipeline visualization, the Kedro-Viz plugin. It helps you develop and communicate your pipeline with a fast feedback loop. In my example here, when I change my pipeline definition, the visualization tool listens for changes in the file and automatically refreshes itself, so you can actually see that your pipeline shape has changed. It's also being actively worked on to turn it into an interactive data science IDE, even though my product manager might kill me for saying this, but stay tuned.

The last thing is that Kedro allows you to scaffold new projects with a standardized project template. It's originally based on Cookiecutter Data Science, and it comes with a few tools out of the box, such as linting with flake8 and isort and code formatting with black, and so on. It supports more advanced setups, such as Spark and visualization, through a concept called starters, and it supports custom starters to tailor the template to your project's needs, such as a specific CI/CD configuration.
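Going back to the project-level CLI commands mentioned above, here is a rough sketch of what a custom command in a project's cli.py could look like, assuming the click-based CLI that Kedro project templates ship with; the profile command itself is hypothetical, and the exact KedroSession arguments vary between Kedro versions:

```python
# cli.py -- rough sketch of a project-specific CLI command, assuming
# the click-based CLI in Kedro's project template. The "profile"
# command is hypothetical; KedroSession.create arguments vary a bit
# between Kedro versions.
import click
from kedro.framework.session import KedroSession


@click.group(name="recommender")
def cli():
    """Custom command line tools for this project."""


@cli.command()
@click.option("--env", default="local", help="Configuration environment to use.")
def profile(env):
    """List the datasets registered in the catalog for a given environment."""
    with KedroSession.create(env=env) as session:
        context = session.load_context()
        for name in context.catalog.list():
            click.echo(name)
```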
So that's it about Kedro itself. Next, I would like to talk a little bit about how you can use Kedro to integrate with other tools in the MLOps ecosystem. If you think about MLOps, there are many different ways people describe it these days, but one of my favourites comes from NVIDIA, where they model MLOps as a lifecycle. As you can see, Kedro helps you with the collaborative development workflow in the middle, and it provides some tools out of the box to help with collection, ingestion and analysis. But what about all of the other responsibilities in the MLOps lifecycle? Instead of trying to become a one-size-fits-all tool, Kedro provides a very extensive integration mechanism for you to hook into these different needs across the pipeline lifecycle. For example, how do I run data quality checks after my raw input data is loaded? How do I emit runtime metrics after a node runs so that I can set up a monitoring system, and so on? Kedro provides this extensibility through a concept of hooks, lifecycle hooks that map exactly to the MLOps lifecycle you saw before. We have seen it used to integrate Kedro with different tools, such as Grafana and Prometheus for monitoring, MLflow for experiment tracking, and so on.

In our example today, we will look at how we can automate data quality checking with Great Expectations using Kedro hooks. Great Expectations is a Python-based open source library for validating, documenting and profiling your data. The way we do it is that we write a hook that runs before we save a dataset, to validate the data, so that if there are changes in the data with bad quality, we stop them from propagating down the pipeline. I'm going to show you a live demo of this very quickly. So this is the code editor, and this is my data catalog as we saw earlier. In Kedro, you can provide these hooks in a file in your project called hooks.py. So this is my hooks.py file, and this is my data validation hook using Great Expectations. When I initialize it, I initialize the data context in Great Expectations using the configuration located in conf/base. And these are the hook implementations. This uses the same hook mechanism as pytest; in fact, it's powered by pluggy, the library that the pytest people made for building plug-in architectures. This hook is called before_dataset_saved, and the idea is very simple: before you save a dataset, you try to get an expectation suite, which is a set of validations to run against this dataset. If there is a suite that matches the dataset name, we run it using Great Expectations. I have configured my project to have one expectation suite that matches my clean movies dataset. This is a JSON file, but Great Expectations also provides a Jupyter notebook interface for you to interact with it.

When I run my pipeline, let me just do this quickly, this hook will be called automatically, and you will see that after the validation runs, we can open what Great Expectations calls data docs to view the validation results. Actually, because I changed my data catalog earlier and it has a typo in it, I'm going to change it back to pandas and run this again. I think that should work. Yes, the data docs are located under uncommitted/data_docs and there is an index.html here, which I'm going to open in my browser. This is the Great Expectations data docs page, where you can see all of the previous runs of my pipeline and all of the validations from those runs. If you click on one of them, you will see which expectations were run, and if it failed, it will tell you why it failed. That is how you can use Kedro's extensibility to add automated validation checking to your pipeline quite easily, with just a few lines of code.
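For reference, here is a simplified sketch of such a validation hook, assuming Kedro's before_dataset_saved hook spec and Great Expectations' pandas API; the demo in the talk goes through a Great Expectations DataContext instead, and the dataset-to-suite mapping and suite path below are hypothetical:

```python
# hooks.py -- a simplified sketch of a data validation hook, assuming
# Kedro's before_dataset_saved hook spec and Great Expectations'
# pandas API. The demo in the talk uses a Great Expectations
# DataContext instead; the mapping and suite path here are hypothetical.
import great_expectations as ge
from kedro.framework.hooks import hook_impl


class DataValidationHooks:
    # Only datasets listed here are validated before being saved.
    DATASET_TO_SUITE = {
        "clean_movies": "conf/base/expectations/clean_movies.json",
    }

    @hook_impl
    def before_dataset_saved(self, dataset_name, data):
        suite_path = self.DATASET_TO_SUITE.get(dataset_name)
        if suite_path is None:
            return  # no expectation suite configured for this dataset

        result = ge.from_pandas(data).validate(expectation_suite=suite_path)
        if not result.success:
            # Stop bad data from propagating further down the pipeline.
            raise ValueError(
                f"Great Expectations validation failed for '{dataset_name}'"
            )
```

In recent Kedro versions, a hook like this is registered through the HOOKS setting in the project's settings.py.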
Another thing I want to talk about is that after all of this development effort, and after putting controls in place to ensure that your code quality and your data quality are pristine, we need to think about deployment. How do we deploy this pipeline into production? To this end, Kedro supports a few deployment strategies. If your pipeline can run on a single machine, we support a single-machine deployment mode, where you can containerize the project using Docker, or package it as a wheel file, install it in your Python environment in production, and run it like any Python package. But Kedro also supports a distributed deployment mode: if your pipeline cannot run on a single machine, you will need to split it up and run it on different nodes in a cluster. I will demonstrate how that looks today using Apache Airflow.

The idea is very simple: convert the whole pipeline into an Airflow DAG, where every node in your pipeline becomes an Airflow task. As you can see in the screenshot here, the task flow in Kedro looks exactly the same as the task flow in my Airflow DAG. If there is time, I will show you a live demo of this in a bit. But a very good question to ask is: why start with Kedro and then have to convert it into Airflow later on, if they look exactly the same? As we saw earlier, starting with Kedro gives you the benefit of rapid development, much closer to that of a Jupyter notebook. It focuses on data flow, not task flow, and it gives you the flexibility to stay simple: if single-machine deployment works for you, do that before you have to go distributed. It also gives you the flexibility to choose between different distributed orchestrators, so if you don't have Airflow, you can go with Argo, Kubeflow, Prefect, whatever. The principle is the same: you convert parts of your pipeline into the primitive constructs of the orchestrator environment, and off you go.

And there's a very powerful concept here that I would like to promote: your deployed pipeline doesn't need to have the same granularity as your development pipeline. In development, we want as much detail as we can get, for all sorts of purposes. But in production, we are constrained mostly by the computing resources and the production environment, so we might want a different way to slice the pipeline and deploy it based on those constraints. In this example here, I have split my pipeline into just two tasks in my Airflow DAG: one does data processing and one does model training. And, theoretically speaking, with Airflow you can run these two tasks on two different workers, so one might use Spark and the other might use a GPU if you use deep learning models.
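Here is a rough, hand-written sketch of that coarse-grained deployment, not the output of the kedro-airflow plugin; the project path, schedule and pipeline names are hypothetical, and KedroSession details vary between Kedro versions:

```python
# airflow_dag.py -- hand-written sketch of a two-task Airflow DAG that
# runs the Kedro pipeline at a coarser granularity than in development.
# Not the output of the kedro-airflow plugin; paths, schedule and
# pipeline names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = "/opt/recommender"


def run_kedro_pipeline(pipeline_name: str) -> None:
    # Each Airflow task runs one registered Kedro sub-pipeline.
    bootstrap_project(PROJECT_PATH)
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        session.run(pipeline_name=pipeline_name)


with DAG(
    dag_id="recommender",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    data_processing = PythonOperator(
        task_id="data_processing",
        python_callable=run_kedro_pipeline,
        op_kwargs={"pipeline_name": "data_processing"},
    )
    model_training = PythonOperator(
        task_id="model_training",
        python_callable=run_kedro_pipeline,
        op_kwargs={"pipeline_name": "data_science"},
    )

    # The two tasks could be scheduled on different workers, for
    # example one with Spark and one with a GPU.
    data_processing >> model_training
```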
The last thing I want to talk about is how we go beyond a single project. How do these tools help you scale from one project to hundreds of projects? Basically, they help you do that by promoting the concept of reusability. You can reuse your pipelines between projects using an abstraction that we call modular pipelines; I didn't have time to cover this today, but it's in our documentation. Kedro helps you build reusable data connectors and reusable extension hooks, so beyond just Great Expectations, you can build other hooks, as I mentioned before, for performance monitoring or experiment tracking. It helps you build reusable CLI commands and create scaffolding templates with starters. And you can also publish them as open source libraries for the community to use. And that's it for my presentation today. All of the code for this project is hosted in this repository. Thank you for listening. Thank you so much, that was really nice and well done.

So now we have two minutes left. If you want, we can go through a couple of the questions that we have. Are you ready? Yeah. So the first question is: does Kedro support pipeline versioning? If yes, how do you track dependencies? Okay. As mentioned before, a pipeline is constructed as pure code, so you can check it into version control; you would version your pipeline exactly as you version any Python code, with Git or another version control system like Mercurial. And dependencies, I suppose you mean project dependencies rather than dependencies between pipelines, are tracked as in any other Python project, with your standard tooling such as requirements files. We use requirements.txt and pip generally, but I think you can also use Poetry if you are more advanced or more trendy.

And the next one is: what do you think are the main pros and cons of Kedro with respect to DVC? Yeah, I think that's a great question. I'm not that familiar with DVC, so I can't say too much about this. But what I think Kedro helps with is the development workflow. It helps with collaboration, and it helps standardize your practice across your organization, so every team uses the same standards, which makes it easy to transfer between projects and easy to reuse code later on. Whereas DVC, in my limited understanding, is specifically concerned with version control of your data and your code. So I think these two tools are complementary; they are orthogonal, in my opinion. Perfect. Thank you so much, Lim. That was really good for a first time. Thank you. The time is limited, so if people have more questions, please go to the breakout room, and Lim, you will be there answering the questions they have. Thank you. Thank you so much.