ML model and data set versioning, by Kurian Benoy. The focus is on best practices for organizing your ML projects, also providing alternatives to Git for your projects. So, over to you, Benoy.

Good afternoon everyone. Thank you for turning up for my talk. Are you able to hear? Yes. Today I will be talking about why machine learning projects are different, why our current practices of managing machine learning projects are not so efficient, and about one of the best practices: to version control your models and data sets.

So, a bit about myself. I am an open source contributor. I have been contributing to multiple organizations like FOSSASIA, CloudCV, and DVC. I have been a FOSSASIA Open Tech Nights winner, and I recently became a Kaggle Kernels Expert. And yeah, this is where I still am. Thank you.

So, before starting my talk, I just want to have a quick poll: who here has worked on at least some machine learning models, trained a model, and worked with data sets greater than 500 MB? So, that's a good response. Thank you for turning up for my talk today.

At first, I will be talking about some of the challenges which I faced in my startup. When I was interning in a startup called Noroplex, which was building products for IoT, we were working on very interesting problems in computer vision, mainly object detection. Yet, I felt frustrated doing machine learning there. I was someone who had contributed to open source, and for me, the amount of work I did per day was the number of Git commits I made. All of a sudden, in my internship, I was not able to use Git for about two to three weeks, because I was not doing anything that I usually did in software engineering. Also, we were working with large data sets and models, yet, unfortunately, we transferred them using pen drives. So, I was curious how these problems would scale when we worked on bigger problems, with bigger teams, and so on.
So, I will talk about some of the challenges which I faced when I was working in the startup. Apparently, machine learning is slow. In my startup, we were building mostly on top of open source projects, so I didn't need to write much code from scratch. But whenever I was working on a project, downloading the data and training my models took a long time. It took about two to three hours, or even a few days, to get a decent accuracy. During this time, I could chat with my colleagues, sleep, do anything, literally.

The next challenge which I faced was that in software engineering, most software products just take a few seconds to execute, and the normal workflow is: download or write the project, install the requirements, and run it. That's it. While in machine learning, it's totally different. Whenever you are working on a project, you first write the code, or if your colleague has written that code, you download it. Then you need to download the associated data set. And, as I said, by training with this code and data, you are able to generate ML models as the output. And if you are using some models for inference, you use pre-trained models.

The next challenge I faced was that machine learning is metric driven. Whenever we move from one experiment to another, we focus on improving some metric. For example, in a flower classification problem, it depends on how close your model's predictions are to the actual values. There are various other metrics as well.

Yeah, the next challenge was that when I'm working on big data science projects with huge data sets, I'm not able to use Git, and git clone becomes slow. So, you may be telling me that you can easily avoid all this by gitignoring your data sets and models. But yeah, this is my response for you. So, what is model versioning?
So, consider the case of a flower classification problem. At first, you try to work on the project with an architecture like VGG16 and you get some accuracy. Then you try another model, maybe something like ResNet, and you get another accuracy. One of the key features of machine learning is that we need to experiment rapidly, and it is very important to track all the experiments we do.

So, why is it important? Whenever we are working on any machine learning project, most of our experiments, almost 75%, usually fail, and only about 25% of our experiments succeed in some way or the other. So, it's always very important to keep track of our best experiments as well as our worst experiments. Why do we need to keep track of our worst experiments? Much of the research that has come out has been from experiments that, as we thought, didn't work out. All of a sudden, the researcher decides to ask: why did it fail? And out of this, we are able to generate new intuitions. One of the best examples of this idea: just Google how ResNet blocks were created.

So, what is data set versioning? This is a self-driving car by Google, Google's Waymo. Can anyone guess the amount of data generated by a self-driving car per day? It generates about 4 TB of data per day. So, it's very important, when you are working with such data, to version control it and learn from it. Also, data is a very expensive thing. For example, a Vice President at one of the sponsors of PyCon India recently told me that they spend about 30% of their expenses on buying and procuring data. So, managing this data is very important for the company. Also, sometimes your machine learning model's accuracy does not increase even when more data is added.
So, we need to find the optimal limit, and that is part of data set management. So, why data set management? As I said, data sets evolve, so for evolving and hierarchical data sets, versioning is a very important factor. The next thing is moving data sets around. Usually, it's very hard to move data sets around. In my startup, as I told you, we had the experience of sharing data sets on pen drives. But this is not an ideal situation, and this is not how we want things to happen. In industry, usually we download the data set into a local repository, and after some cleaning of the data set, you need to push it to a cloud service. So, it's very important for our data set management principles to support easy movement of data sets from local storage to the cloud, as well as sharing data with our colleagues' computers using SSH or something like that.

So, when I was interning in that startup, I was looking for tools for how I could improve this workflow in machine learning and address these challenges which I faced. What I did was participate in an open source competition called Google Season of Docs, and there was an organization called Data Version Control. I got interested in it because I was someone who was interested in data science and all of this, and after contributing for one to two weeks early on, I realized what DVC is, what the principles of DVC are, and how it solves this problem.

So, what is Data Version Control? DVC is a command-line tool which was the brainchild of an ex-Microsoft data scientist called Dmitry Petrov. He created this tool to incorporate all the machine learning best practices which he felt were lacking in the industry, like model and data set versioning; there are a bunch of other best practices according to him.
So, just Google about it, you can find it easily. The best part of this tool is that it works hand in hand with Git: Git is used for code versioning, while DVC is used for incorporating the best practices. It has features like model and data set versioning and a bunch of other features as well. So, yeah, these are the features of DVC. It's a good tool for experiment and data set tracking. It's an open source project with about 3,500 stars. As I said, it's built to adopt some of the best practices of ML, and it's language and framework agnostic: it works with Python, R, or any language you want.

So, I'm going to do a very interesting problem, and I'm trying to work on a real-life scenario. In this project, I will be trying to version control cats and dogs. So, let's start. Alice and Bob are data scientists, and they are trying to create a machine learning model for an optimal dogs-versus-cats classifier. So, this is Alice. Are you able to hear me? Okay. Right now, someone has already written the code for the associated models, and Alice is going to download the data. So, data.zip has the data folder in it, and I'm going to unzip it. As you can see, there are about 1,800 files in this data set: 800 images of cats and dogs in the validation set, and 1,000 images in the train set, 500 images of cats and 500 images of dogs.

Okay. So, we can easily connect the data and code together with a simple tool called DVC. And I just want to add a disclaimer that DVC is not the only tool for model and data set versioning. There are a bunch of other tools as well, so you can just Google about it and decide whether it works for you. So, I'm going to connect the data and the code together. Okay. So, now a bunch of metafiles are being created.
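The connect-the-data-and-code step just described can be sketched as the following command sequence, assuming DVC and Git are installed and using the `data.zip` / `data/` names from the demo:

```shell
# Unpack the data set and put the folder under DVC control.
unzip data.zip                 # creates the data/ folder
dvc add data                   # creates the metafile data.dvc and gitignores data/

# Git tracks only the small metafiles, never the large data itself.
git add data.dvc .gitignore
git commit -m "Track cats-vs-dogs data set with DVC"
```

The point of the design is that `data.dvc` is a few lines of text, so Git stays fast no matter how large `data/` grows.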
So, let me add the metafiles and commit them. You can see that a metafile called data.dvc has been created. This file is a metafile that links the data folder and the code together.

So, this is the DVC workflow. Whenever you track any data, it automatically creates a .dvc file for it, and you can use git push and pull alongside the DVC workflow. And if you want to use any remote storage, that is, if you want to push your data into cloud storage like AWS S3 or Google Cloud Storage, you can easily push it using the dvc push command. So, right now, I'm going to push my data set to a remote. Sorry, I didn't specify my remote; currently, I have just set a local folder as the DVC remote.

And yeah, so let me add more data to this. new_labels.zip has some more data in it. Okay. So, right now, there are about 2,800 images of cats and dogs, with 1,000 images of cats and 1,000 images of dogs in the train set. So, again, I'm going to connect the newly added data with the code, and my data.dvc file will automatically track it for me. As you can see, the data.dvc file has changed. So, I now have 2,000 training images of cats and dogs, and I'm going to add a git tag for it. Okay. That's it. And let me push to my remote server once more.

So, right now, we tracked 1,000 images of cats and dogs at first, then we added more labeled images of cats and dogs, and we are tracking all of this using DVC. So, this is Bob. Bob clones the repo from Alice. And I forgot to do one more thing: I wanted to train the actual model. I can easily train my model using python3 train.py. It will take some time.
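While the model trains, it is worth noting what pushing actually stores: both DVC's local cache and the remote keep each file under a path derived from its hash, which is the value recorded in the .dvc metafile. Here is a minimal sketch of that content-addressed idea using only coreutils; the `demo/` paths are made up for illustration, and this is not DVC's actual code:

```shell
# Store a file under a cache path derived from its MD5 hash,
# mimicking the content-addressed layout a DVC cache uses.
mkdir -p demo/.cache
printf 'cat picture bytes\n' > demo/photo.jpg

hash=$(md5sum demo/photo.jpg | cut -d' ' -f1)       # 32 hex chars
prefix=$(printf '%s' "$hash" | cut -c1-2)           # first 2 chars -> directory
suffix=$(printf '%s' "$hash" | cut -c3-)            # remaining 30 -> file name

mkdir -p "demo/.cache/$prefix"
cp demo/photo.jpg "demo/.cache/$prefix/$suffix"
```

Because the path is a pure function of the content, identical files are stored once, and a metafile holding just the hash is enough to retrieve the data later.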
So, I'm going to stop it right now; it would take about five to ten minutes to train on this data. So, as you can see, there is already, one second. Okay, I'll just remove the data folder that I had added. Sorry for that. So, I removed the data folder. So, this is Bob, and when she clones the repo from Alice, she gets all these metafiles. In order to get the data and the model which were already versioned there, Bob can easily use the command dvc pull -r with the remote name, to get all the data sets which were there initially, and you are able to easily get that data set back. It will take some time.

So, as you can see, it's apparent that Git and DVC are kind of analogous: there is a push command in Git, and there is a push command in DVC as well. I think you understand the analogy and how well it all fits together. So, right now the data folder is being regenerated, and there is also a model.h5, which was created by the training process.

Okay. So, what I'm going to do next is something which we usually need. Bob is a data scientist, and she already downloaded the 2,000 images of cats and dogs which Alice had added. But what if she wants to make some modification to the code, and she wants to test whether the 2,000-images model or the 1,000-images model is better? So, you can easily go back and forth between the various data set versions. I'm going to go back to my earlier version with 1,000 images of cats and dogs. Okay, it has now checked out; I am now at my old stage of versioning with 1,000 images. So, right now I had 2,000 images of cats and dogs, and I want to return to my old state of having 1,000 images of cats and dogs. I can easily go back with a simple command, dvc checkout.
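The version switch just demonstrated can be sketched as a two-step command pair; the tag name `v1.0` is a made-up example for the commit that tracked the 1,000-image data set:

```shell
# Step 1: Git restores the old, small data.dvc metafile.
git checkout v1.0        # or the commit hash of the old version

# Step 2: DVC syncs the actual data/ folder to match that metafile,
# pulling the old files back out of the cache.
dvc checkout
```

Git alone only moves the metafiles; it is the `dvc checkout` afterwards that swaps the large data and model files to the matching version.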
So, as you can see, there are now just 1,000 images of cats and dogs. So, that's DVC for you. You can easily switch between the various versions of data sets and models, and there are a bunch of other functions as well, like pipelines. I recommend you to search more about that.

So, as a conclusion, I just want to say that, as is quite apparent, machine learning is different from the software engineering which most of us are accustomed to. So, whenever you are working on your projects, think about your processes and how you can improve your machine learning projects. Right now, most of us don't follow the best practices of machine learning. Actually, there are no standard best practices of machine learning yet. A very well-known blogger, David Herron, recently said that when he was working in software engineering in 1995, they used to coordinate work with teammates using parallel directories. That's actually the state of machine learning right now, in 2019. We need to change. We need to think about our processes. We need to make machine learning more efficient and more practical for all of us. So, try to version control your projects and try this out on your machine learning projects. That's all. Thank you.

We are running out of time, so we'll take just three questions. Please raise your hand. Questions, anyone? We have one question.

Hey, hi. This is really good stuff. So, a question on data security: where is the data getting stored, and how secure is it on the code side and data side?

So, about where the data is getting stored: in my case, I am using a local remote. When you are working on actual machine learning projects, we have separate cloud storage; we usually use Amazon S3 or Google Cloud Storage. And if you want to send data, you can use other platforms as well. There is also the DVC remote which is being set right now; right now, I have stored my DVC remote in a specific folder of my project.
So, this is local data storage right now. When you are working on an actual project, you would store it in Amazon S3 or Google Cloud Storage. And about your question on security, I am not sure whether there are any security defaults for that or not.

Hi, that was a great session. You showed us how to version control data. Should models also be done the same way, or do we have a different approach for models?

Yeah, models should also be version controlled, because, as I told you, tracking experiments is important, and this tool is a good way to version control your models as well. And I showed you how; right now, sorry, in Bob's case. Okay, this is the old state, sorry. The model is versioned here, as model.h5.dvc. So, the model is not stored in Git itself. There is another thing in DVC called pipelines; just Google about it, and I can talk to you more about it. Okay, any more questions from the audience?

So, you were saying that your files get stored on your Amazon server or something. Do you have support for Git LFS, large file storage?

Okay. So, Git LFS is also a good tool for storage of large files. But Git LFS has some limitations. Git LFS needs a separate Git server, and most Git servers have a file size limit. In GitHub's case, it's about 2 GB, and GitLab's and Atlassian's offerings have some limitations too. So, you can't version control large amounts of data using Git LFS. That's a problem.

Okay. So, we've run out of time now. Any more questions? Okay. So, thank you.