Hello everyone. My name is Dmitry, and I'm going to talk about machine learning experiment management: how we run machine learning experiments, how to become more efficient in the experimentation phase, and what kinds of open source tools we can use to get there.

I'm happy to be here in San Diego for the first time. It is probably the warmest place I have ever visited. I was born and raised in Siberia, in Russia, and for my entire life I have been constantly moving south. First I moved to St. Petersburg, which is a much warmer place than Siberia, as you can imagine. Then I relocated to sunny Seattle, and a couple of years ago I moved to California, to San Francisco. When I worked at Microsoft as a data scientist, I saw how large companies organize the machine learning process: what best practices they use, how they make teams more efficient, and what kinds of tools they have built for that. Today we are working on open source machine learning tools, and we use some of the ideas and best practices that I have seen in large companies. I am also one of the creators of the DVC project, Data Version Control, which is like Git for data. Today we are going to cover several open source tools, including DVC, so my opinion might be a little biased, because I am a creator of one of them, but I will try to cover this problem from different angles.

But first, let's discuss why ML is special. Why should we think more carefully about machine learning? There is an opinion that we are not very efficient when we do machine learning: we waste a lot of time redoing the same things, we make the same mistakes over and over again, and we don't know how the process should be organized or what kinds of tools need to be built. To understand this problem better, let's look at the differences between software engineering and machine learning experimentation.

The first, most straightforward difference is hyperparameters. Some people think machine learning is about choosing the right set of hyperparameters and getting the best metrics. This is partially true. Machine learning actually includes a few stages: first, we process and prepare our data sets for training; second, we develop our algorithms, which is a fairly usual coding job; and then, at a certain point, we do hyperparameter tuning. And then the cycle repeats. What is special about hyperparameters? The dynamics of the project change a lot at this stage, because instead of running one or two experiments a day, we start running dozens of them, sometimes even hundreds or thousands.

The next question is: we have all these experiments with different sets of parameters, so how can we track and store this information? When I worked in academia, I usually used pen and paper for this, and that works fine for a single person or a very small team. If you work in a team, you need to communicate this information better, and one of the most common ways of doing that is an Excel spreadsheet. You store this information in Excel or online and share it with your team, so everyone can see which experiments were run and which parameters were used. The next question is: how can I get back to the best experiment? Machine learning is metrics driven, so we need to preserve and store all these metrics along with the parameters.
That is how you choose the best combination of metrics, the best experiment out of the hundreds you have run: the same table can be used, and many teams do exactly this, even in large technology companies. You have the whole history of your experiments, parameters, and metrics, and you can pick the best one.

The next question is: what about the models? Once you know which set of parameters produces the best metrics, you won't be happy to run the training again; you don't want to spend another hour, day, or week to get the same model. So people usually store models as separate files and keep them somewhere. You have to come up with creative names for these model files and put that information into the same table, the same Excel spreadsheet, so you won't have to run the training again.

The fourth difference is data sets. We are dealing with bigger data sets, gigabytes of data, sometimes even hundreds of gigabytes, and the data sets are evolving, so we effectively have versions of our data. This is especially true in a production environment, where you might get new data every day or every week and you keep getting more labels. When you do modeling, you might also fix the data yourself because you see some issue in the data set, you create new labels, you run augmentation algorithms to extend your data set, or you use active learning techniques to increase the amount of data you have. And versioning the data set becomes a big pain, especially because of the size: it takes real time just to copy the data set. The usual solution is the same as for models, and it is quite common: you create a copy of your data set, a copy of your directory, give it some name, and record that in the same place, next to the corresponding experiment.

The last difference is actually not a difference but a similarity: code. We still work with code, we still write a lot of code, especially in the modeling stage, and this code needs to be stored and versioned in your version control system. That is a basic assumption, and I believe everyone is doing this. What is special about ML is that code is not enough. We have this whole variety of artifacts around an experiment: code, data, hyperparameters, metrics; together they form the experiment, and everything needs to be stored together, with a clear connection between the pieces. For code, you can of course store the commit hashes the same way, in the same table next to the data.

So, as you see, the table with information about experiments gets bigger and bigger, and you probably have the same engineering intuition I do: a spreadsheet is not the best way to organize a workflow. If people use an Excel spreadsheet to manage a workflow, to manage collaboration, it means something is broken in the process. Why? First of all, because of the people involved. You need to ask people to put this information into the table, and not everyone wants to do that; not everyone is ready to spend his or her time on it. And it is too easy to make a mistake: you run one experiment, change one hyperparameter, update your target metric, and forget to update the rest of the metrics, and now the information is no longer correct. So when we think about automation, about becoming more productive in the experimentation phase, we need to think about replacing this table.
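Just to make this concrete, here is a hypothetical example of what such a spreadsheet tends to look like; every name and number below is made up for illustration.

```
experiment     params               metric (AUC)   model file           data version     commit
2019-06-12-a   lr=0.10, depth=6     0.842          model_v12.pkl        data_v3/         a1b2c3d
2019-06-13-b   lr=0.05, depth=8     0.851          model_v13_best.pkl   data_v3/         e4f5a6b
2019-06-19-c   lr=0.05, depth=8     0.848          model_final2.pkl     data_v4_fixed/   (forgot to record)
```

Every cell here is filled in by hand, which is exactly where things start to break.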
We need to think about automating these steps, and tools can help; we will cover some of them. So first, let's look at what an experiment is. An experiment contains data, code, a model, metrics, and hyperparameters, all together, and we need to preserve them together and keep track of all these artifacts. For automation, we will cover three tools: MLflow, Git LFS, and DVC, Data Version Control.

The first tool is MLflow. You can easily find and install it. We will focus on the tracking part of the tool, which is the important part for the experimentation phase. MLflow is focused on the hyperparameter tuning stage: it gives you the ability to run dozens or hundreds of experiments and store all the outputs of those experiments in an internal database. You can store hyperparameters, metrics, ML models, and even data sets. By default that database lives on your local machine, and you can set up a central one.

How does it work? It requires some code modification. MLflow is a library for Python or Scala, and in our world Scala means Spark. You import the library and inject a few statements with your hyperparameters, your metrics, and the output files your code produces; I'll show a small sketch of what that looks like in a moment. When you run your training, all this information is moved to and stored in the database, and through the UI you can see the results. Do you recognize that table? It is actually the same Excel spreadsheet we saw before. MLflow basically automates that spreadsheet.

What is important to know about MLflow? It copies all this information into an internal database. That works really well for your hyperparameters and metrics, because they are just numbers, right? It can work reasonably well for models, because models are relatively small files. And it probably does not work for data: when you work with gigabytes, or a hundred gigabytes, of data, you don't want to create a copy for every experiment, for many reasons. Another problem is that there is no clear connection between your code and the outputs, so your hyperparameters and metrics evolve separately from your source code. There are some tricks to connect them, for example logging a checksum or something similar to tie together the two worlds of metrics and source code, but that is on your shoulders.
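To make the tracking part concrete, here is the small sketch I mentioned: a minimal illustration of the kind of statements you inject with MLflow's Python tracking API. The hyperparameter names, the metric, and the file name are made up for this example.

```python
import mlflow

# Everything logged inside this block belongs to one tracked run.
# By default runs are stored locally under ./mlruns; you can also
# point MLflow at a central tracking server.
with mlflow.start_run():
    # hypothetical hyperparameters for this experiment
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("max_depth", 8)

    # ... your usual training code goes here ...

    # hypothetical metric produced by the run
    mlflow.log_metric("auc", 0.851)

    # store an output file the run produced (a model, a plot, ...)
    mlflow.log_artifact("model.pkl")
```

That is roughly all the code modification that is needed; storing the runs and showing the comparison table in the UI is handled by the tool.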
To connect your code and data inside an experiment, people can use Git LFS, which stands for Large File Storage. LFS helps you manage larger data files, as you might guess. It is just a Git extension, and it provides exactly the same API as Git. When you install Git LFS, you can add your data files to your repository. What has changed? First of all, you now have a clear connection between your code, your data, and your models. Git LFS is not machine learning specific: there is no concept of metrics and, of course, no concept of hyperparameters. But those three pieces are connected through Git, and that alone can bring you to a new level in the experimentation phase. It works really well until you start working with actually large data files, and by large I mean gigabyte scale. When you jump to three, four, ten gigabytes, Git LFS is not a perfect tool, because it was initially designed for front-end engineers: it is about storing your image files, your Photoshop files, a couple of hundred megabytes, in your Git repository, and it is not perfect for gigabyte scale. GitHub, for example, has a two-gigabyte limitation for LFS. However, the good thing about LFS is that it gives you the usual Git API, which means you can easily connect this tool, and these experiments, with your DevOps infrastructure and your data infrastructure. So: no real data set versioning at scale with Git LFS, but you do get model versioning and a clear connection to your source code.

Now let's talk about Data Version Control. When I worked at a large company, at Microsoft, we had a system that tracked all our experiments. I could jump into any experiment from yesterday, last week, last month, three months ago, and I had basically everything: all the metrics, the versions of the files I had used, and the results that were produced. When I joined a small startup, I was surprised that this is not a thing in other companies. You cannot do this. If you'd like to keep all the results of your experimentation, all the versions of your data set, you have to do it yourself. And while it's okay to keep track of metrics and hyperparameters by hand (they are just numbers, it's easy), with files it is a tough problem, because there are a lot of heavy-lifting operations: copying files, renaming files, moving files from my machine to a server because I don't have space for all these versions. And of course we had a lot of problems with this: many versions of files were lost, and we were doing the same job over and over again. I kept thinking, guys, how can you survive in this world? And everyone kept saying, you know, it's fine, this is how machine learning is organized today, this is the state of the art. And I was saying, no, it's not the state of the art. I have seen a better world, where this is not a problem, where all the heavy lifting is handled by your tools. It is just a matter of having a tool set that manages and preserves this data versioning information.

So we started the DVC project, which stands for Data Version Control. It gives you a Git-like experience for large data files, and it is machine learning specific; I'll show you what that means on the next slides. You initialize DVC in your repository and you add a data remote, in addition to your code remote; all the data goes to the data remote, and the data itself is never stored in Git. I'll show a short sketch of this setup in a moment. Because of the command-line experience and the Git-like philosophy, DVC is programming language agnostic: it works with your R code, your Python code, or whatever custom code and tools you have. So this is how the data remote and the code remote are organized: on one side you have the Git repository with the code and all the meta information, and on the other side you have the data remote, which can be an S3 bucket, Azure storage, or just an SSH server where you keep the data set, and DVC orchestrates these pieces.

The basic principles behind DVC: first of all, it was built for large data files. Large means any data file that needs to be processed on a single machine. That is a very natural requirement for machine learning, because at the end of the day you do your training on a single box; there are some exceptions, but in most cases this is true. So we are talking about a scale of gigabytes to hundreds of gigabytes. The second important thing about DVC is that it represents each experiment as a Git commit or a branch.
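Here is a minimal sketch of that setup, assuming you already have a Git repository; the remote name and the bucket are made up, and the remote could just as well be Azure storage, an SSH server, or a local directory.

```bash
# inside an existing Git repository
dvc init                     # creates small DVC meta files, which are tracked by Git
git commit -m "Initialize DVC"

# add a data remote next to your code remote
# (hypothetical S3 bucket; could be Azure, SSH, or a local path)
dvc remote add -d storage s3://my-bucket/dvc-storage
git add .dvc/config
git commit -m "Configure data remote"
```

From this point on the code and the meta information live in Git, the data itself travels to and from the data remote, and every experiment is just a Git commit.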
Why is that important? First of all, it supports best engineering practices, which matters especially when you are in the algorithm development phase. It does not necessarily work as well for the hyperparameter tuning phase, but for algorithm development it is the best practice. And this Git philosophy gives you the ability to integrate your experiments, your machine learning activities, with your IT infrastructure, with your DevOps infrastructure. It is a bridge between two different worlds: machine learning and research on one side, and DevOps and engineering on the other. That is exactly the gap the tool set needs to cover. And DVC is ML specific: it has a concept of metrics and of machine learning pipelines, and it gives you high-level functions to navigate through your Git repository and find the best commit for a given set of metrics.

Let's take a look at how it works. First of all, you can add any data file to your repository. dvc is a separate command; it is not Git, but the semantics are very much the same. You run dvc add on a file, and the file goes under DVC control; when you do dvc push, the data itself goes to your data remote, the S3 bucket or the SSH server. As a result of the add command, some meta information is generated: one additional file, data.xml.dvc, which contains a description of the data set (you can think of it as a pointer to your data), and .gitignore is modified, because data.xml itself should not be stored in Git anymore. This meta information is what needs to be committed and pushed to your Git repository; I'll put a short sketch of these commands below.

You can do the same thing with images. Initially DVC was built for the LFS-style scenario of large single files: one file, 30 gigabytes. But we found that many people use DVC for computer vision problems, which means a lot of images, hundreds of thousands or even millions, and we optimized DVC for that scenario. Today you can add a directory with millions of images, push all of them to your data storage, and start versioning your image set.

Once you have created one experiment with a data set, your teammates can easily get it. They just clone your repository with the code, meta information, and metrics, and with dvc pull they bring everything about that particular experiment to their machine. If they need another experiment, they check out the commit of that experiment, run dvc pull again, and get everything for that one. Sometimes you don't want to get everything. For example, your experiment trains one single model, say a computer vision model of about 300 megabytes, while your initial data set was, let's imagine, 50 gigabytes. In many cases you don't want to pull that whole 50-gigabyte data set; you just need the model, for example in a deployment scenario. DVC lets you manage data artifacts granularly: you can say, I need only this data file, pull me this file, and it will be there. We have even implemented a simpler, one-line version of this: you say, here is the repository and here is the data file I need from it, and in one command you get the file or directory.
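Here is a rough sketch of that day-to-day flow; the file names, the model, and the repository URL are placeholders for illustration.

```bash
# put a data file (or a whole directory of images) under DVC control
dvc add data.xml                  # writes data.xml.dvc and updates .gitignore
git add data.xml.dvc .gitignore
git commit -m "Add raw dataset"
dvc push                          # uploads the data itself to the data remote

# a teammate reproduces this exact experiment
git clone <repo-url> my-project && cd my-project
dvc pull                          # fetches the data referenced by this commit

# pull a single artifact, for example a trained model, instead of everything
dvc pull model.pkl.dvc

# or grab one file from a DVC repository in a single command, without cloning
dvc get https://github.com/<org>/<repo> data.xml
```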
There is also a concept of pipelines. It is basically about connecting your data set to your model through a command, and you can build a whole pipeline, even a DAG, a sequence of commands. But let's not focus on that part here.

So what happened with DVC? We basically replaced the Excel spreadsheet, all of that information, with the Git history. And the Git history is the best source of information engineers have come up with: Git logs and Git history are the information engineers really trust and really read. This is a huge step toward best engineering practices. However, we should understand that this Git-based, commit-based approach works great at the model creation step, the coding step, but it might not be the best fit for the hyperparameter tuning stage. When you need to run a hundred experiments with the same code and the same data set, with only a small set of hyperparameters changing, you probably don't want a hundred nearly identical commits in your Git history, right? So the hyperparameter tuning phase can be done separately, outside of DVC, outside of your Git repository.

Let's summarize our thoughts about experiments and tools. This is the whole picture, and you can see there is some overlap in the tool set: ML models, for example, can be versioned with any of the tools we just discussed. What is also important is that the tools we covered are not mutually exclusive. You can use different tools in different phases of your machine learning experiments. You can use DVC during the development stage, when you code your algorithms and check your high-level hypotheses, and you can switch to MLflow when you do hyperparameter tuning, when you make incremental changes to your models. This is quite a common use case these days; you can find a lot of blog posts about this combination, and people like to use it. Git LFS is a very simple way to improve your productivity and connect your data with your models, but it is not an ML-specific tool, and you will probably need some additional tooling to be more efficient.

So today we are seeing this shift from software engineering to data and machine learning. A similar shift happened about 30 years ago, when we moved from hardware design to software engineering. The new shift might also take another 30 years. Does that mean we won't be efficient for another 20 or 30 years? How do we make it faster? How can we become more efficient in a year or two, not in 30? I believe we need to do four things. First, we need to understand better what a machine learning experiment really means, and where we make most of our mistakes and lose most of our time in the process. Second, we need to use automation, we need to use tools, and if you don't see a tool for your specific scenario, you should probably invent one; the ML tool set is still at an early stage. What is even more important, this information needs to be shared and communicated among your teammates: you should go on stage and talk about your successes and failures in the modeling process and in the tool set. This is very important for our community and for the industry. And the last thing is open source. Open source is the best way to share this knowledge, to share tools and the experience of using them.

So thank you. I am open to your questions, and today I will be at our booth, so please come by and ask questions there as well.

Questions? Yeah, sure. So, all the techniques we discussed are compatible with notebooks. Of course you can use notebooks.
When you use Git-based tools like LFS and DVC, you probably need to learn how to store notebooks in Git properly, right? There are a few techniques and tools for how to do that well. But in general, yes, everything is compatible with notebooks.

So today, all those connections need to be made in a custom way: you need to write a script that clones one repository and another repository and then uses them together. In DVC we implemented a special feature for this called dvc import. You can say: dvc import, bring me this data set from that repository. It will bring you all the data, and it will keep the connection, so if the upstream repository changes, for example its master branch moves forward, DVC will recognize that and you will be able to get the latest version. So there is some support for this kind of feature, but the default today is, yes, a custom script.

Yes. This is actually a very good question, because when we started working on DVC, all of these questions had to be answered: how to deal with data, how to keep data immutable. Our solution today is that we use file links, and we separate your workspace from your cache, and we try very hard to protect the cache from corruption. Everything in your workspace can be modified, but we still keep the versioned copy in the cache, so when you do a checkout you get the correct version of the file, not the modified one. There are a lot of tricks we use to do this, and one of them is reflinks. Everyone knows symlinks and hard links; not everyone knows that there are newer kinds of links, like reflinks, with copy-on-write semantics. This is how DVC works today: we use reflinks specifically to separate the workspace from the cache and not corrupt your actual data set.

Yes, you need to install one, a centralized one, and put all this information into that database. And this is actually the limitation for large data sets: it is fine for models (copying 100 megabytes, even 300 megabytes), but not for gigabytes, not for dozens of gigabytes. But in general, as a way of preserving that table, it is just great.

Let's see, here first, yeah. Sorry. I'm not sure I understood the question, but let me try to repeat it: you are asking about dividing data, right? Do you mean dividing one data set, one file, into two files and then processing one of them, or both? Yeah. It looks like this is the last question I can answer. So, there is no limitation in the tool itself: whatever you can store on your local machine, whatever you are comfortable pushing to S3, is fine. Splitting data is not a problem; you can split it into as many pieces as you like, millions of images, no problem. If you'd like to split one file into, say, three files or a hundred, again, not a problem. You might need to connect all these pieces together, say data set to train and test sets and then to a model, and this is why we need pipelines; I mentioned briefly that you can create a pipeline and connect all the pieces together. But I don't see any issue with the number of files or the number of steps: some people have pipelines with 40 or 50 steps, I have seen that, and with many files, as you may guess.

So it looks like we are out of time. I can answer the rest of the questions outside, or at the DVC booth on this floor. Thank you.