Hi, thanks for joining our session. Today we are going to talk about lessons learned from using machine learning pipelines in production environments. First, let us introduce ourselves. I'm Kyosuke Hashimoto, a research engineer at Hitachi. I'm interested in developing AI solutions with open source software such as Kubeflow and Kedro, and also in public clouds such as AWS and Microsoft Azure.

Hi, I'm Masahiro Ito. I work at Hitachi as a software engineer, responsible for leveraging open source software related to big data and AI. I've been developing data lakes and data pipelines using Apache Hadoop, Spark, HBase, Kafka, and more. Now I'm developing MLOps solutions for customers who are going to build enterprise systems.

This is the outline of our talk. Our presentation is in four parts. First, we will introduce our business and explain the challenges of our machine learning projects. Next, we will talk about the development process of executable ML code for production with Kedro. In part three, we will discuss the lessons learned from the project. Finally, we will give a brief summary of what we have covered.

OK, let's move on to the introduction. To begin with, I'm going to talk about our motivation for focusing on machine learning. Machine learning technology is applied in various industries, such as IT, energy, manufacturing, and city development. Our company, Hitachi, has a lot of products and services in these industries. When we moved ML from experiment to production, we faced many gaps between data science and software engineering. So our challenge was how to move ML from experiment to production.

Next, I'll talk about job specialization in machine learning projects. Most machine learning projects consist of three steps. The first step is the proposal. In this step, salespeople or consultants suggest how to solve customers' problems with data analysis. The second step is the experiment.
In this step, data scientists build ML models and generate logic in Jupyter Notebooks in an experimental environment. In many cases they have good mathematical skills, but they are not good at production-level software engineering. The third step is production. In this step, software engineers implement the logic from scratch for the production environment, based on the results of the data scientists' experiments.

I'd like to talk about an issue, and our hypothesis, for applying the logic to the production environment. The issue is that it takes a lot of effort to implement the logic and test code from scratch in the production environment. So we thought we could reduce this effort by refactoring the Jupyter Notebook logic and converting it into production code. In this presentation, we will focus on how to transform the logic from such refactored notebooks.

Next, I'll explain the gaps between the experimental and production environments. There are three gaps that make re-implementing the logic difficult. The first gap is logic modularity. In the experimental environment, data scientists write logic such as data preprocessing, model training, and evaluation in the same notebook. But in the production environment, we need to execute model training, evaluation, and inference separately. The second gap is flexibility in data sources. In the experimental environment, data scientists load small datasets, such as CSV files, from local storage. But in the production environment, we need to load large datasets from object storage, data warehouses, or relational databases. The third gap is logic scalability. In the experimental environment, data scientists train ML models on local machines using small datasets. But in the production environment, we need to train many models with large datasets and execute inference in parallel.

Now let's talk about our challenges on the way to the production environment.
There are three challenges in transforming the logic in Jupyter Notebooks into executable code for the production environment. First, we need to pick out the parameters and outputs from the notebooks to achieve logic modularity. However, this is difficult because multiple outputs and parameters used by other cells are packed into a single cell. Second, we need to modify the dataset-loading logic, since the system specifications had not been determined in the experimental environment. Third, we need to make the logic scalable to handle large datasets.

There are several popular open source frameworks for decomposing complex logic into simple procedures. We used the framework called Kedro for transforming the logic for the production environment. Kedro is especially popular among data scientists, since it integrates with popular tools such as Jupyter Notebook. From our comparison, we considered Kedro suitable for overcoming the three challenges. So we tried Kedro in our machine learning project to modify the logic for the production environment.

Here is an overview of the validation scenario and its target system. The goal of this ML system is to classify the condition of machines for maintenance recommendations. In this scenario, engineers convert notebook code written by data scientists into production code. The input data for the ML model is sensing data from machines, and the output of the ML model is the machines' condition. The target production ML system consists of training, evaluation, and inference parts. First, the training system trains ML models from sensing data. Then the evaluation system evaluates the trained models. After that, the best model is deployed into the inference system, and the inference system predicts the machine condition from new sensing data. The predicted machine conditions are used for maintenance recommendations and other applications.

Next, we are going to explain how we developed executable machine learning code for production with Kedro.
By transforming the logic into system-executable code with Kedro for the production environment, we tried to solve the three challenges below. The first challenge is logic modularity. In order to pick out outputs and parameters from single cells in the Jupyter Notebook, we decomposed the complex logic into a set of functions and data processing flows with Kedro. We call this procedure pipelining. The second challenge is flexibility in data sources. In order to modify the logic to load datasets corresponding to system specification changes, we replaced the decomposed pieces of logic, which we call nodes here, and changed the dataset types to modify the data loading logic. The third challenge is logic scalability. In order to modify the logic to finish in a reasonable time for larger datasets, we execute independent nodes in parallel using Kedro's functionality.

This is the process overview of the conversion from experimental logic into executable code for the production environment. We transformed cells into nodes and connected the nodes to develop pipelines in the following steps. First, for logic modularity, we transformed the logic into pipelines in Kedro style. Second, for flexibility in data sources, we applied the data catalog for the data and model loading functions. And third, for logic scalability, we removed loops inside nodes. With these solutions, we could easily transform the logic for the production environment with Kedro.

The first solution, transforming the logic into pipelines in Kedro style, consists of the following four steps: first, generating a project template with Kedro; second, extracting the definitions of nodes from the Jupyter Notebook; third, adding nodes that are not in the Jupyter Notebook; and then connecting the nodes to develop the pipeline. Through these four steps, we transformed cells into nodes in Kedro style. From the next slide, we are going to explain these steps in detail.

The first step is to generate a project template when we start to use Kedro, by typing the command `kedro new`.
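To make the pipelining step concrete, here is a minimal sketch of the idea in plain Python. It is illustrative only, not the presenters' actual code: the function names and the trivial "model" are hypothetical, and the point is simply that a monolithic notebook cell becomes a set of small functions whose inputs and outputs are explicit.

```python
# Hypothetical illustration of "pipelining": a monolithic notebook cell
# is decomposed into small functions (nodes), one per processing step,
# so every intermediate output becomes an explicit function boundary.

# --- before: preprocessing, training, and evaluation packed into one cell ---
def notebook_cell(raw):
    cleaned = [x for x in raw if x is not None]   # preprocessing
    mean = sum(cleaned) / len(cleaned)            # "training" a trivial model
    score = sum(abs(x - mean) for x in cleaned)   # evaluation
    return mean, score

# --- after: one function per step, inputs and outputs explicit ---
def preprocess(raw):
    return [x for x in raw if x is not None]

def train(cleaned):
    return sum(cleaned) / len(cleaned)

def evaluate(cleaned, model):
    return sum(abs(x - model) for x in cleaned)

raw = [1.0, None, 2.0, 3.0]
cleaned = preprocess(raw)
model = train(cleaned)
score = evaluate(cleaned, model)
```

Each intermediate value (`cleaned`, `model`, `score`) can now be picked out individually, which is exactly what a single packed cell makes difficult.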
The following directories and files are generated automatically, and we use some of those files afterwards. We define nodes in nodes.py and describe the definitions of pipelines in pipeline.py, shown on the right side of this slide. We also use catalog.yml, on the left side of this slide, to solve challenge number two, flexibility in data sources.

Next, we extract the logic from the Jupyter Notebook and transform it into nodes of a pipeline. For example, we have training logic in the Jupyter Notebook here, and we divide it into a node, as part of pipeline.py and part of nodes.py.

Next, we add nodes that are not in the Jupyter Notebook. Before connecting the nodes, we found that a node for the inference pipeline, the post-process node, was missing in the production system. In the Jupyter Notebook, training, evaluation, and inference are mixed in a single notebook without any categorization. However, when we develop pipelines, training, evaluation, and inference are separated, and we could find the missing part easily. The post-process node is defined as follows in nodes.py and pipeline.py: it takes the inference result as a function argument and outputs the results so that we can feed them into other applications for maintenance recommendations.

Next, we connect the nodes to develop pipelines. Since the inputs and outputs of the logic in notebooks are not clear, we had to clarify them to develop the pipelines. Kedro offers a simple and easy-to-understand way to define pipelines and nodes as plain Python code. In notebooks, the inputs and outputs of each piece of logic are not clear, and therefore the dependencies among cells are not clear. The recovery policy is also not clear when something goes wrong while executing notebooks. On the other hand, by using pipelines in Kedro, since the flow is modularized into nodes, we could make the inputs, outputs, and dependencies clear. We could also design a recovery policy for every node. For example, we could connect nodes in pipeline.py, shown in the code on the left, and develop a training pipeline like the figure on the right.
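The node-connecting idea can be sketched with a tiny stdlib-only runner. This mimics how Kedro wires nodes together by named inputs and outputs (the real API lives in `kedro.pipeline`; this sketch is not it), and the node functions and dataset names below are hypothetical stand-ins for the talk's training pipeline.

```python
# Stdlib sketch of Kedro-style wiring: each node declares named inputs
# and outputs, and the runner resolves execution order from those names.

def node(func, inputs, outputs):
    # a node is just a function plus the names of its inputs and output
    return {"func": func, "inputs": inputs, "outputs": outputs}

def run_pipeline(nodes, catalog):
    data = dict(catalog)          # start from the initially available datasets
    pending = list(nodes)
    while pending:
        for n in pending:
            if all(i in data for i in n["inputs"]):   # all inputs ready?
                data[n["outputs"]] = n["func"](*(data[i] for i in n["inputs"]))
                pending.remove(n)
                break
        else:
            raise RuntimeError("pipeline has unsatisfiable inputs")
    return data

# hypothetical nodes mirroring the talk's training/evaluation structure
def train_model(sensing):
    return max(sensing)                    # placeholder "model"

def evaluate_model(model, sensing):
    return model - min(sensing)            # placeholder "score"

pipeline = [
    node(train_model, ["sensing_data"], "model"),
    node(evaluate_model, ["model", "sensing_data"], "score"),
]
result = run_pipeline(pipeline, {"sensing_data": [3, 1, 2]})
```

Because every node names what it reads and writes, the dependency of `evaluate_model` on `train_model` is visible in the pipeline definition itself, which is the clarity benefit described above.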
Next, we are going to explain how we solved challenge number two, flexibility in data sources. We applied the data catalog function for the data and model loading functions. Kedro supports declarative definitions that specify data sources and their readers and writers in a separate configuration file, which is called the data catalog. By using the data catalog, we can check the existence of data and apply built-in wrappers for reading and writing data, which enables us to switch data sources by replacing a single configuration file, catalog.yml, without editing the definitions of the nodes. Developers can decouple the location and type of data sources from the implementation of the logic itself.

Applying the data catalog to the data and model loading functions consists of two steps. The first step is analyzing the logic in the Jupyter Notebook, and the second is decoupling the data loading and data conversion code.

The first step is to analyze the data loading logic implemented in the Jupyter Notebook. This is an example of data loading logic. As you can see in this example, data loading logic (numbers one and two) and conversion logic (number three) are mixed in one loop. When the data source is changed in the production environment, analyzing and changing such logic is difficult. So we should decouple these two different types of logic into separate files. And this is how we decoupled the data loading and conversion code into different configurations and code. First, we applied the data catalog and described the data loading function in catalog.yml. Second, we re-implemented the loading and conversion code in nodes.py and pipeline.py.

For the third challenge, logic scalability, we removed loops inside the nodes extracted from the Jupyter Notebook. Some logic, such as training logic and evaluation logic, has loops inside for testing many algorithms or datasets. In the experimental environment, these loop structures may be useful, but in the production environment they cause delays when scaling out pipelines.
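For reference, a Kedro data catalog entry looks roughly like the following. The dataset names and file paths are illustrative, not the presenters' actual configuration, and the exact `type` strings depend on the Kedro version in use.

```yaml
# conf/base/catalog.yml -- illustrative entries; names and paths are made up
sensing_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/sensing_data.csv

trained_model:
  type: pickle.PickleDataSet
  filepath: data/06_models/trained_model.pickle
```

Switching the data source (for example, pointing `filepath` at object storage instead of local disk) then means editing only this file; the node functions that consume `sensing_data` stay untouched.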
In order to finish processing in a reasonable time, we should remove the loops inside nodes, express the loop structures explicitly in the pipeline flow, and execute the nodes in parallel with a single command option. We removed the loops from each node as follows. Please look at the figure on the left. As you can see, before refactoring, training and evaluation have a loop over the number of locations, which means that increasing the number of locations increases the whole processing time. On the other hand, by removing the loops from each node, as shown in the figure on the right, we could distribute the processing as the number of locations increases.

Now, let's discuss the effect of utilizing Kedro in our scenario. These are the good points of Kedro that we learned in our validation scenario. First, we could easily develop structured pipelines with simple definitions of functions and pipelines in Python, without learning any special framework, which overcomes challenge number one. Second, we could easily modify pipelines by separating code for different environments and scenarios and by modifying configuration files, which overcomes challenge number two. Third, we could scale out the logic and run it in parallel with a simple command option, which relates to challenge number three.

This is a point to consider that we learned in our scenario. We believe transforming the logic with Kedro takes much less effort than refactoring the logic from scratch, but transforming the logic still has a decent cost. As future work, we should evaluate these costs quantitatively.

This is our idea for reducing the cost of transforming logic from Jupyter Notebooks. We think that developing a tool for data scientists to tag cells and generate pipelines and functions could be a possible solution. For instance, Kale allows us to transform the logic in notebooks into a machine learning pipeline automatically by adding tags and cell dependencies.
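The loop-removal idea can be sketched like this. It is an illustration, not the presenters' code: the per-location training function and location names are hypothetical, and where this sketch uses a thread pool, Kedro runs independent nodes in parallel processes via its ParallelRunner.

```python
# Sketch of loop removal: instead of one node looping over locations,
# each location becomes an independent task that can run in parallel.
from concurrent.futures import ThreadPoolExecutor

def train_for_location(readings):
    # trivial per-location "model": the mean sensor reading
    return sum(readings) / len(readings)

locations = {
    "plant_a": [1.0, 2.0, 3.0],   # hypothetical sensing data per location
    "plant_b": [4.0, 6.0],
}

# before: a sequential loop inside a single node -- total time grows
# with the number of locations
models_loop = {name: train_for_location(r) for name, r in locations.items()}

# after: one independent task per location, executed in parallel
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(train_for_location, r)
               for name, r in locations.items()}
    models_parallel = {name: f.result() for name, f in futures.items()}
```

Both versions produce the same models; the difference is that the second form exposes the per-location independence to the scheduler, so adding locations adds parallel work instead of sequential delay.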
Kale stands for Kubeflow Automated pipeLines Engine. It has an annotation feature for tagging cells in a Jupyter Notebook and transforming the cells into nodes of Kubeflow Pipelines. We expect that this kind of integration could be applied to Kedro as well.

Finally, let's wrap up our presentation. This time, we tried to apply Kedro to accelerate our machine learning project from experiment to production. We found that Kedro enables more efficient implementation of production machine learning systems than re-implementing them from scratch. Kedro enables manual logic conversion from Jupyter Notebooks to production-level structured machine learning pipelines, which are easy to develop, modify, and scale out. Future work is as follows: first, we need to evaluate the cost savings of transforming logic from experiment to production; also, we should consider more automated solutions, such as Kale, for this transformation.

This is the end of our presentation. Thanks for joining us. We are happy if you are interested in our presentation. Thank you very much.