So today's talk is about automating a machine learning workflow with DVC. Let me introduce myself first. My name is Hongjoo, I live in Korea, and I work for SK Hynix as a data scientist. Some of you may not know my company, but SK Hynix is actually one of the largest memory chip makers, and you would easily find some of our chips if you opened up your laptop or desktop, especially if you are using a computer from Apple or Dell. My recent work interests are building a knowledge graph for supply chain management, doing that automatically, and some mining of software repositories. It's all machine learning work.

So today I'm going to talk about DVC, an open source tool for managing ML workflows efficiently. I will start from how software developers work well with various practices and tools, then talk about data scientists and machine learning developers, who face some challenges in adapting their work to software development practices. I think DVC can help them work more efficiently. And then lastly, I will show you how to use DVC, how DVC works, with an example project. Note that the title is automating the ML workflow, not automated machine learning itself. There is an active research area called AutoML, so you'd better know this session is not about that.

So let's start with Waterfall to Agile. I don't think people really worked the Waterfall way for developing software even in the old days; designing, building, and releasing software could never meet a set of requirements all at once. I've never experienced such a case after doing my homework in a CS 101 class. Still, it's useful to learn that we should work with an iterative process rather than Waterfall. Since the requirements are always changing, or are not concrete enough, we organize a small set of tasks, do what we can do earlier, and release features in a progressive way until all the requirements are satisfied.
As we are not working alone and the iterative process should run fast with extreme efficiency, we divide our work into a few stages and try to keep moving forward without a stop. And for each stage, we have been continuously thinking about how we can do the job better. Some people talk about methods, like TDD and continuous integration and continuous delivery or deployment. And some develop efficient tools, such as Git, Maven, JUnit, and Jenkins. Those tools help us do our job in an easier and more efficient way. We have so much help, even on deploying, operating, and monitoring our software, that maybe sooner or later software development could be the easiest job in this world.

Now how about machine learning? There is a typical workflow in machine learning as well: data acquisition, data preprocessing, building a model, evaluation and model selection, and lastly deployment. Although such workflows are part of the whole process of developing a machine learning application, they are relatively new and less developed. This is because data science or machine learning is different from software development, just as software development is different from developing hardware with its more waterfall-like process.

This is the typical workflow in one chart. It is an iterative process starting from data acquisition on the left side, but it is very different from the software development process because it has to deal with data and models along with code. Sometimes the data and the model take the more important part of the process, with just a few lines of code. Also, it is a team sport, and some parts need specialists: the data acquisition and processing stages are the data engineer's area, while preprocessing and model selection are for data scientists or machine learning engineers.
And even software engineers are needed for the last step, deployment, to build the application code. That's pretty complicated, isn't it? For this reason, a machine learning workflow cannot just follow software development processes. I think there are three main challenges of machine learning's own in ML projects: versioning data along with code, deploying a model rather than code, and lastly metric-driven development.

People used to have their own versioning scheme, as you see on the screen, and later we don't know which one is the proper working version. Data scientists should also share that data, but it's not easy, because the files usually take so much space in storage that they are hard to manage. A few gigabytes or even larger — how can we easily share them? Another problem is that sometimes a change in data triggers the pipeline even when not a single line of code has changed, but it's difficult to notice which part of the data has changed. So we should keep organizing the data with its related code so that we can reproduce the output at any time when the data changes.

I'm sorry, this line is supposed to be a separate section, a separate challenge; I made a mistake here. Different from software development, the most important and final artifact is a model, not code. So we have to version models and keep track of which data and code produced which model.

Lastly, machine learning is a metric-driven job. The software development process starts from requirements and ends with requirements, but in machine learning a metric is the most important milestone; it teaches us what we should do next for improvement. I'll show you some examples of what kind of decisions can be made by tracking the metrics at the last step. So metrics must be kept tracked along with code, data, and models. And now DVC comes in: DVC helps to handle these challenges.
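The data-versioning challenge above comes down to content addressing, which is the core trick DVC uses: large files are identified by a hash of their contents rather than by their names, so any version can be located and shared by its hash. Here is a minimal Python sketch of the idea — my own illustration, not DVC's actual code:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def add_to_cache(path: Path, cache: Path) -> str:
    """Copy a data file into a content-addressed cache keyed by its MD5."""
    md5 = hashlib.md5(path.read_bytes()).hexdigest()
    dest = cache / md5[:2] / md5[2:]        # e.g. cache/ab/cdef0123...
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        shutil.copy2(path, dest)            # identical data is stored only once
    return md5

# Tiny demo: adding the same file twice yields the same key and no extra copy.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = root / "cats.csv"
    data.write_text("cat,1\ncat,2\n")
    h1 = add_to_cache(data, root / "cache")
    h2 = add_to_cache(data, root / "cache")
    print(h1 == h2)  # True
```

With this scheme, the Git repository only needs to track the small hash, while the bulky file lives in a cache that can sit on any shared storage.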
There are other solutions such as Git LFS, MLflow, and Apache Airflow, but I recommend DVC because it's easy to use. If you are familiar with Git, then it's very intuitive to use DVC alongside Git. And it's language independent: even though DVC is written in Python, you can use DVC with C, Java, or any other tools, whatever you want. Lastly, it's useful from an individual up to a large team. Other tools like MLflow and Apache Airflow need a managed web server, but DVC is just a client-side command line tool. So you can adopt it for your project individually, or you can share it with other members in a large team. It's easy to start with.

Okay, it's time to see how DVC works, with the problem of cats and dogs classification. This example project trains a small VGG net to classify cat and dog images. If you go to the GitHub repository later, there is an instruction to build a Docker image which contains everything you need for following the walkthrough example. Then run a container from the image with a bash shell; the following commands should all be run inside the Docker container.

This is the typical directory structure I work with when I'm doing a machine learning project. I put data in a data directory — some raw and some processed — and when I'm ready to deploy the model, I put the retrained, finalized model in the finalized directory. There's also a notebooks directory. Actually, I use that directory only occasionally; mostly I just put the source code in the source directory at the bottom. So I make a cat dog module, and when I need to experiment with it, I open a notebook, import the cat dog module, and do some experiments. There also need to be some data downloading scripts in a scripts directory, and a script for deployment. So to start, we initialize the Git repository as you see on the screen.
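For reference, the layout described above looks roughly like this — the directory names are my own reconstruction from the talk, not copied from the repository:

```
cats-dogs/
├── data/
│   ├── raw/          # downloaded images
│   ├── processed/    # train/test split
│   └── finalized/    # retrained model, ready to deploy
├── notebooks/        # occasional experiments on the module
├── scripts/          # download and deployment scripts
└── src/
    └── catdog/       # the importable module with the actual code
```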
We add the source directory and make a commit. After that, we do the same thing with dvc init, which initializes the DVC repository inside the Git repository. You can see a .dvc directory and some files inside it organizing the whole repository. We also need to add those .dvc files to the Git repository, so that we can track the DVC version with Git as well. And lastly, we commit the DVC repository with a commit command.

There's a script, download.sh, which downloads 25,000 images in total, half cats and half dogs — that's pretty large. The script puts those files in a tmp directory, so there's a cat directory and a dog directory, with 12.5k images each.

The next step is to set the parameters. Those parameters are used for data preparation or preprocessing, or contain some hyperparameters for training a model. As you see in the prep stage, we use a split rate of 0.9, which splits the whole data into training data and test data for training a model and evaluating it. As for the class size: the data set is actually too large, so the training takes a long time. So I just limited each class to 2,000 images — 4,000 images in total stored in the training data set. If you have a GPU computer, the whole training step will be finished in a minute. Then we have the learning rate, batch size, number of epochs, and a validation rate of 0.2 for the validation step.

Now it's time to define the first stage of the pipeline, which is called prep. There's a preprocess.py file in the cat dog directory, which samples 4,000 images in total out of the 25k and divides them into training data and test data. The processed data is stored in data/processed by the Python command. You see the options: -n is the name of the stage, -p is the parameter, which you have seen in the previous slide, and the -d option is a dependency.
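The parameters described above would live in a params file along these lines. The split rate, class cap, and validation rate are the values quoted in the talk; the key names and the remaining values are placeholders of mine, not taken from the actual project:

```yaml
# params.yaml — sketch; only split_rate, class_size, and
# validation_rate are values stated in the talk
prepare:
  split_rate: 0.9       # 90% train / 10% test
  class_size: 2000      # cap each class at 2,000 images
train:
  learning_rate: 0.001  # placeholder
  batch_size: 32        # placeholder
  epochs: 10            # placeholder
  validation_rate: 0.2
```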
So the prep stage depends on preprocess.py, and the output is stored in the data/processed directory. After running the dvc run command with those options, we can check what kind of files or directories have changed — three files and directories have changed — and I add them to the Git repository and commit. So now we are tracking the preparation stage.

The next step is defining the train and evaluate stages. I named this version 0.1 because I just put in one convolutional layer and one fully connected layer, a very simple model. That code is written in train.py in the cat dog directory. As you see in the first command, I run dvc run again with another name, train; it accepts the train parameters with the -p option, depends on data/processed, which is the output of the previous stage, and depends on the train script itself. The output goes to data with the model.h5 file — the exported model — and it writes the plot data to plot.json. The task is run by the cat dog train module, and you will see some output showing the progress of training the model.

Then we define another stage named evaluate, which depends on model.h5, the output of the previous stage, and also depends on the evaluation script. And it tracks the metrics with the -M option, with score.json. The evaluation metric will be stored in score.json, and it will be kept tracked along with the model file. So I added some more files, made a commit, and tagged the version as 0.1.

Now we have defined the three stages, starting from prep and ending with evaluate. With the dvc dag command, we can see an ASCII art chart: the train stage depends on the prep stage, and evaluate depends on the train stage. So when you have a change in the prep stage, the whole DAG has to be reproduced.
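That reproduction rule can be sketched in a few lines: a stage has to be re-run when any of its dependencies changed, and re-running it makes its outputs "fresh", which in turn invalidates everything downstream. Here is a toy Python model of that decision — my own illustration of the idea behind dvc repro, not DVC's actual code:

```python
def stages_to_rerun(pipeline, changed):
    """pipeline: list of (stage, deps, outs) in topological order.
    changed: set of paths whose content hash differs from the last run."""
    dirty = set(changed)
    rerun = []
    for stage, deps, outs in pipeline:
        if any(d in dirty for d in deps):
            rerun.append(stage)
            dirty.update(outs)   # downstream stages now see fresh inputs
    return rerun

# The three stages of the walkthrough (paths are illustrative).
pipeline = [
    ("prep",     ["preprocess.py", "tmp/"],      ["data/processed"]),
    ("train",    ["data/processed", "train.py"], ["model.h5"]),
    ("evaluate", ["model.h5", "evaluate.py"],    ["score.json"]),
]

print(stages_to_rerun(pipeline, set()))             # [] — nothing to do
print(stages_to_rerun(pipeline, {"train.py"}))      # ['train', 'evaluate']
print(stages_to_rerun(pipeline, {"preprocess.py"})) # ['prep', 'train', 'evaluate']
```

A change in the prep stage therefore cascades through the whole DAG, while a change to the training script only triggers train and evaluate.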
Or if you have changes related only to the train stage, then only the evaluate stage has to be run again after it. When nothing has changed and we try to reproduce the experiment with the command dvc repro, you can see that nothing changed in the previous stages, so nothing has to be done. But if I update the model by adding another convolutional layer, then running dvc repro detects the change in the source code, so it starts to build the model again. After the model finished training, I put a tag 0.2 as another version. And I did the same thing, adding a third convolutional layer, tagged version 0.3, and committed.

Now it's time to compare the metrics for each version. Regarding the accuracy, as you see, the acc score is just around 0.67 to 0.71. So it says that just adding more convolutional layers is not helping the results. Then I checked the training process for each experiment to learn something more. As you see, the training accuracy goes high, but the validation accuracy sometimes drops and stops increasing at epoch 2 or 3 — a clear sign of overfitting. So I put in some regularization with dropout, ran dvc repro, and did the same training job again. And you see, on the left part of the chart, the validation accuracy still sometimes drops, but it continues to increase. I also tried data augmentation: rather than increasing the size of the data, I manipulated the existing 4,000 images. That helped as well. Later, combining the data augmentation technique and the regularization technique, I could get up to 0.78 accuracy. Maybe later you can try this at home with the walkthrough example and the slides.

So that's it. Thank you. If you have any questions, just shoot. Thank you very much. So we are now in the room, people are here, so it's time for questions. Stanislav is asking: how do I recreate the data on a different machine?
For code, you would just git clone — but what does one do for data? So you can check out the code with Git, but how do you recreate the data on a different machine? Oh, a good question. Actually, I haven't explained this great feature of DVC in the slides, but you can make a shared cache. DVC keeps a cache inside the repository, but you can think of moving such a cache onto shared storage. Then you can share the cache, so that if I have trained version 0.5 and a colleague tries to train the same model, it won't take even a minute, because the result will just come from the shared cache into their DVC repository. It's amazingly fast because it's sharing the cache. OK, thank you.

Any other questions? Oh, another one, yes: does DVC handle version control of the data, or must the input data always be the same while we are just versioning the recipes, so to speak — prep, train? Actually, DVC makes a hash of each file or directory and puts it inside the cache. DVC doesn't do anything to Git, but Git manages everything in the DVC repository: because the DVC repository keeps versioning after we define a pipeline or train a new model, every output, input, and dependent file will be hashed and stored in the cache, and that is managed with Git.

Would you consider DVC as an alternative to Airflow, or can those work together? Let me ask you to read the question, please, because we need it for the recording; if you read it all, it's better. OK: would you consider DVC as an alternative to Airflow, or can those work together? Actually, Airflow has an advantage in monitoring, which DVC doesn't have. So if we want to monitor our jobs, we have to use another tool. In that case we cannot use Airflow with DVC, but we can use something like Jenkins with DVC to monitor our DVC tasks. When it takes one hour, two hours, or a day to train a model, you can put those DVC tasks inside Jenkins so that you can monitor the job. OK, perfect. So thank you very much.
Thank you for presenting. We are just in time.