Hello folks, welcome to the second day of PyCon India. I hope you all enjoyed yesterday, had a good night's rest, and enjoyed the panel discussion. Now we'll begin with our first talk of the day. With us is Dr. Elle O'Brien. She is a data scientist at Iterative, and she has also worked as a researcher in computational neuroscience and speech perception. She'll be talking about how to use continuous integration with machine learning. So let's get started.

Hi everyone, I'm really happy to be here. It's actually past midnight for me, almost one in the morning, and I'm thrilled to be with you all, so I'm just going to get started. My talk is about how to make continuous integration work with machine learning. Thanks to that intro, you know by now a bit about me: I used to be an academic researcher, and now I work on data science at DVC. We're an organization that makes tools for machine learning and data science based on engineering principles.

First, a little bit about the machine learning workflow. I want to start with a diagram, a really powerful figure about the machine learning workflow. It comes from a survey of teams all across Microsoft, which found that they all tended to follow a pretty similar workflow once machine learning was involved. There are nine steps, going from gathering your data and engineering it to monitoring your model in production, and this is really consistent across teams.

There are a lot of folks working on this process. We've got data engineers, people whose job is data management. We've got data scientists, people who are modeling data and creating great models from it. And we've got software engineers, who help us deploy models, monitor them, and keep all the operations running. This process has a lot of feedback: data scientists might be asked by software engineers, hey, can we check how this model is doing?

Oh, I'm sorry, I have to pause for one moment; it turns out my laptop is unplugged. I'll be right back. So sorry. Okay, we're back, and we can go on with the rest of it. Thanks for your patience.

So, data scientists and software engineers are going to have quite a bit of back and forth, because finding the best model is an iterative process, and it depends quite a lot on the environment we can actually deploy in. There has to be plenty of feedback to make sure the data scientists' ideas will actually work in production. And everybody needs to know things like: what dataset was our current model trained on, and how might the data the system is receiving, now that it's in production, be drifting away from the training data? All of this back and forth can get confusing pretty quickly.

But we have some inspiration for a way out of this confusion, and that's DevOps. DevOps, if you're unfamiliar, is the combination of the words "development" and "operations".
It refers to the coming together of two potentially opposing forces: development people, who want to create new features and experiment a lot, and operations people, who want to keep systems stable and always running, and the shared practices that let them work together. Machine learning, and the software development around machine learning, doesn't yet have an analogous set of practices and philosophies. But we might be able to learn something from the way DevOps has transformed software development over the last decade or so.

So now I want to talk about one of the most important ideas in DevOps, and that's continuous integration and delivery, which I'll abbreviate as CI/CD. The big idea of continuous integration is that whenever you change your code, you build your software, run some tests, and get feedback: did my tests pass or fail? If they failed, you keep iterating on this cycle. If they passed, you have a candidate for release.

Now let's think about what's different when we're doing this with machine learning instead of traditional software development. It's not enough to monitor just whether my code changed; we also have to check whether my dataset changed, because if the dataset changed, that will also affect our model, potentially in a very profound way, and potentially in ways that are difficult to measure or predict from the outset. So we really need to monitor whether datasets are changing.

Another difference is that instead of building software, we're training models, and training a model might be something quite simple that you can do on your laptop, or it might be something that needs hours and hours of GPU time.

Another difference is that instead of just running pass/fail tests, we're evaluating a model, and model evaluation is quite interesting; the practices for how to evaluate models are still emerging. But it's pretty clear that a bunch of pass/fail checkmarks are not going to be enough for us. We need a report, because it's possible for the overall accuracy of a model to improve over some baseline and yet do worse on certain categories or subsets of your data, and that can have dramatic real-world consequences. So we need detailed reporting, and it needs to be something a domain expert like a data scientist can come in and examine quickly.

So why don't we have this yet, if it's such a great idea? First of all, version control can't always track changes in datasets. One of the big problems is that datasets are sometimes simply too large to be efficiently versioned, and everything I showed you in traditional CI systems is usually mediated by Git version control, which is very difficult with certain datasets. Another issue is that training models isn't like building software: a lot of the hardware we use for building software in typical CI systems is not going to cut it for machine learning models; they're simply too big and too resource intensive. And finally, evaluating metrics is more complicated. We need something more than pass/fail checks, and that's not built into a lot of typical CI systems.
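To ground that, here is a minimal sketch of the traditional CI loop in GitHub Actions form (a system I'll come back to shortly). The file path and the test command are placeholders; the point is that the only feedback this loop produces is a pass/fail checkmark, which is exactly what the three problems above outgrow.

```yaml
# .github/workflows/ci.yaml -- a conventional CI workflow: on every push,
# build the project and run pass/fail tests. Names are placeholders.
name: ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests       # the only feedback: did the tests pass or fail?
        run: pytest
```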
So what we think is needed here, to improve the rigor and long-term sustainability of machine learning projects, is to build tools that extend CI/CD practices from software engineering to ML. And at DVC, or Iterative, that's what we do. You might know our flagship project, Data Version Control, which is open source. I saw a lot of people in the keynote asking what kind of open source projects they could get involved in, and we have a lot of first-time contributors, so it's worth checking out if you're interested in trying out a contribution. We make tools that extend software engineering practices to machine learning, and we're also, for the most part, a totally Python project, which is really cool.

Okay, so our thoughts about how to reimagine CI/CD. This is our game plan going into the problem. We want to be able to version data like code, so that changes to datasets can trigger a CI feedback loop. We want to let our CI system allocate whatever cloud resources are needed to train models. And we want to be able to provide feedback about our tests in the form of metric reports, which could include data viz.

Here's how we're approaching this, and here's the team I'm working with: Dmitry Petrov, the DVC creator, is also working with us on our new project; David Ortega is a developer; and me, I'm a data scientist.

And here's our big inspiration. Lately there have been some really neat developments in the Git ecosystem. If you're not familiar with GitHub Actions or GitLab CI/CD, I totally recommend checking them out; they're a lot of fun. These are CI systems where, basically, whenever a change is detected in your GitHub project, GitHub Actions will spin up a computer managed by GitHub to run a workflow that you've defined. So they're actually providing you with computing resources, and the same goes for GitLab CI. In this talk I'm going to stick to GitHub Actions, but everything here is equally applicable to GitLab CI; the two systems are very comparable.

Our project is a new open source project called Continuous Machine Learning. The basic idea is to adapt these two very popular Git-centric CI systems, GitHub Actions and GitLab CI/CD, to work better with machine learning and data-heavy projects. There are two ways to keep track of the project: we've got a website, cml.dev, and we have a GitHub repo where you're always welcome to open an issue or submit a pull request. It's a cool place to visit, I think.

So here's where those systems fit. GitHub Actions or GitLab CI fits into this part of the CI feedback loop: using GitHub Actions, you can automatically train a model, run some tests on it, and then get an answer about how the model did on those tests. So we can automate the training part, the testing part, and the reporting part, and we're going to be doing that on computers run by GitHub.

Now I'm going to step through the three priorities we laid out and how we're addressing them, using the GitHub Actions and GitLab CI systems as the basis and building some tools on top of them that meet these priorities. The first thing we wanted was to be able to provide feedback in the form of metric reports.
The basic unit of a GitHub Action is a YAML file that you create in a special directory, .github/workflows. It's a pretty ordinary format; it's YAML. In it, you describe the workflow you want, and every time GitHub detects a change to your project repository, it runs the workflow and reports the answer back to you, usually just in console logs.

Here's our setup. You put this special YAML file in the special directory and set a trigger: say, whenever a push is detected, run this workflow. You can pick a Docker container and the operating system of your runner; we're just using an Ubuntu machine and a Docker container we've created, which has a library of some functions we wrote. Then we've got our workflow. In this workflow I'm saying: run python train.py, so train a model, and then use some special functions we wrote, these CML functions for Continuous Machine Learning, to write a markdown report. It's an ordinary markdown file, but it's going to have some text and an image in it, and then we're going to send that to GitHub.

And here's what happens: when you make a pull request, computers managed by GitHub will run this train.py workflow, and then you get this data viz and report on your pull request. So you get a new level of visibility into what's going on under the hood, what your model is doing, and now it's in GitHub. I think that is cool. The value of having this kind of reporting in the pull request is that now you and your team can have a discussion and decide together when to merge something into the main branch of your project, maybe the production branch, and whether it passes muster, with more transparency about how the model performs than you would get if you had done this testing locally.

Let's take a closer look at the workflow I showed. This is the code I had, and this is the figure it generated: an ordinary confusion_matrix.png file generated by scikit-learn. You can use whatever visualization tools you like; just save to a standard image format.

You can also use DVC with this. If you're familiar with it, we have plots, and the plots are really good for comparing metrics about your models or datasets across Git branches. When you use them in a GitHub project, that means your pull request will display how the model on this branch differs from the model on the main branch of your project, so that can be useful for making comparisons. And we added a TensorBoard wrapper, so that you can get a link to a TensorBoard in your GitHub pull request. This will help you follow how your model is training in real time, even though it's training on a runner managed by GitHub.

Reports are basically just customizable markdown documents; you can put whatever you want in them. A couple of ways to use these reports in particular: you can report on how model training went on remote machines, TensorBoard being an example of that, and you can get transparent and totally reproducible, detailed test results. You can put whatever test results you want in there, whatever is going to be useful for your project.
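Here's a sketch of the kind of workflow file just described. The CML helper commands (cml-publish, cml-send-comment) come from the Docker image we provide as of the time of this talk and may evolve; the files train.py writes here (metrics.txt, confusion_matrix.png) are placeholders for whatever your own training script produces.

```yaml
# .github/workflows/cml.yaml -- train on push, post a markdown report to the PR.
name: train-and-report
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    container: dvcorg/cml-py3:latest     # our image with the CML helpers installed
    steps:
      - uses: actions/checkout@v2
      - name: Train model and send report
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}  # lets CML comment on the pull request
        run: |
          pip install -r requirements.txt
          python train.py        # assumed to write metrics.txt and confusion_matrix.png

          # Assemble a markdown report: metrics as text, the figure as an embedded image
          cat metrics.txt >> report.md
          cml-publish confusion_matrix.png --md >> report.md

          # Post the report back to GitHub, where it appears as a PR comment
          cml-send-comment report.md
```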
Another value of this is being able to review your data and your models like code. We have a lot of great practices for code review, but maybe not so much for questions like: I've changed the dataset, or my model has really changed; how does that compare to what's on the main branch of the project? Is this something we want to merge? So, using the code review standards created by the Git flow, maybe we can extend them to these aspects of machine learning project development.

We've been creating a YouTube video series that goes into each of these use cases and ways of using it, and we're always expanding it. It's been pretty successful so far: we've had it up for a little over a month and we've gotten about 17,000 views, so people seem to be interested. I feel like we're hitting something that people are responding to or think is cool. But we're still learning all the use cases, learning what people want to use this for and where it provides them the most value, so any feedback about that is very, very welcome at this point.

The second priority we had was about data management: we wanted to be able to do CI while working with larger datasets. This goes back to what Data Version Control, or DVC, does. DVC is a tool for extending Git version control to datasets and models. DVC is not a replacement for Git; you use DVC with Git, and it helps you link whatever's in your Git repository to datasets and models that might be in cloud storage. So you can version them like you version code, even though you don't have to keep them with your code.

You use a Git-like syntax to track datasets and models: you can do things like dvc add and dvc push to move datasets from your local machine to cloud storage. When you do that, DVC creates a metafile, a little lightweight text file that you can version like code, and that text file points to where your data is in cloud storage.

Something we did in CML was create a way to give GitHub Actions access to your DVC remote, which is your cloud storage, where DVC-tracked files reside. You pass some credentials to your GitHub Action, and then the machine running these tests can automatically pull your dataset from cloud storage, do some transformations on it, some modeling, whatever workflow you want, and then, if you created a model at the end of that, push the model back into your cloud storage. So you can really start pushing and pulling lots of things to and from cloud storage using this high-level syntax. That avoids having to cram things into your GitHub repository or use Git LFS. You can use whatever storage you like, including Google Drive, S3, Google Cloud Storage buckets, and Azure blobs; it's pretty flexible. When you pull data, you're grabbing data from the cloud, and when you push, you're pushing things from the runner to the cloud.
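Going back to the basic DVC flow for a second, here is a minimal sketch of tracking and pushing a dataset from your local machine; the dataset path and bucket name are placeholders.

```sh
# Track a dataset with DVC and store it in the cloud; Git only sees the metafile.
dvc init                                  # set up DVC inside an existing Git repo
dvc remote add -d storage s3://my-bucket/dvcstore

dvc add data/images                       # start tracking; writes the lightweight data/images.dvc
git add data/images.dvc data/.gitignore   # version the metafile, not the data itself
git commit -m "Track image dataset with DVC"

dvc push                                  # upload the actual data to the remote
```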
So, a few consequences for continuous integration. First, changes in your dataset can now trigger your GitHub Action, even though the dataset isn't part of your GitHub project repository: the data is in DVC storage, or cloud storage, but it can still trigger a round of the CI feedback loop. Second, your GitHub Actions can now push and pull big artifacts to and from cloud storage. And finally, you have dataset and model versions linked to every model training run. So you know precisely what version of your data was used at the time a certain model was trained, and you'll also know what the code, the environment, and the infrastructure were, because you did it all in this GitHub Actions system.

The third priority is: how can we let our CI system allocate cloud resources to train models? This is the coolest one to me, and I'm actually going to talk the least about it, because we're currently revamping it. Our first approach has been to use Docker Machine, which is actually a deprecated project, but it's really cool. It gives you a way to turn on an instance on, say, EC2, and there are also drivers for Google Compute Engine and Azure. You type a command from your terminal, and it launches the machine for you, with whatever Docker container you chose and a command to run. We paired that up with GitHub Actions, and essentially what that means is you can push to your GitHub repository, and that will turn on your EC2 instance, start training on it, and, when the job is done, turn it off for you, which to me is the coolest thing. We're actually revamping this right now; we've decided to use something called Terraform. So I won't go too far into the implementation, because it's probably going to change within the month, although the usage won't.

For example, I had a style transfer neural network, and any time I changed the code, that would trigger an EC2 instance with a GPU turning on, running the training, and then turning itself off at the end, and I'd get the result of the style transfer reported back to me in my GitHub pull request. Here I'm showing a comparison of the model on the main branch of the project with the model on my experimental branch.

Something cool about this is that because you're automatically working with these EC2 instances, or whatever instance type you choose, you don't have to remember to turn your resources on and off. You don't have to engage with them interactively, logging on to your instance and remembering to turn it off; GitHub Actions automates that. So we've got this really cool resource orchestration, and it feels tremendously powerful to be able to do that. I'm very excited about this aspect.

So, to summarize what we have so far. We wanted to reimagine how continuous integration could work for machine learning: it's really well entrenched, with a lot of excellent practices, in traditional software development, so how can we bring those closer to machine learning? We wanted to be able to version data like code, and for this we had already established that Data Version Control was a good tool, a tool we've worked really hard on developing with a really incredible community, for extending Git version control to datasets, so that data can now be part of your CI system.
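To make that concrete, here is a sketch that combines the two pieces: a workflow where the runner is given credentials to the DVC remote, pulls the dataset, trains, and pushes the resulting model back to cloud storage. The secret names assume an S3 remote, and the file names are placeholders.

```yaml
# .github/workflows/train-with-data.yaml -- the runner pulls DVC-tracked data,
# trains, and pushes the new model to the remote instead of into Git.
name: train-with-dvc-data
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    container: dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: Pull data, train, push model
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}        # credentials for the DVC remote
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull                   # fetch the dataset version this commit points to
          python train.py            # assumed to produce model.pkl
          dvc add model.pkl          # track the new model with DVC...
          dvc push                   # ...and upload it to cloud storage
```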
We also needed our CI system to be able to allocate cloud resources to train models, and we've crafted one approach and are working on another, so that you can deploy your cloud resources from your GitHub Action. And just to note, this also works on non-cloud resources. You can use a GPU wherever you have one, it could be in your house, and your GitHub Action can trigger it. It's not too hard to set up, actually, because the GitHub docs are great; it's pretty easy and quite fun.

And finally, we wanted to be able to provide feedback in the form of metric reports, so we created this library of helper functions for writing reports in your PR. You can write basically a markdown document with whatever kind of reporting you want about your model, and it will appear in the pull request just like a regular comment.

So, I think there's a lot of value in doing CI/CD with machine learning. Some of it is just that it's cool. It's cool to automate things. It's cool to move off of using your own computer and your own resources; you can almost be serverless, in a way. GitHub Actions is not truly serverless, but you don't have to be the administrator of the computing resources you're using for testing and training, and that's pretty cool. All these aspects of automation give you a lot of power and flexibility, and also transparency and reproducibility: you can always go back to the commit that generated a certain model performance, and you will have everything in that CI system to completely recreate that experiment. Those are a couple of reasons why I think this is a lot of fun.

Another value is that a CI system is a good way to take all the parts of your project, your dataset, your code, your models, all your results, and all the computing resources used, from the actual infrastructure that ran your tests to the Docker containers and environments, and put all of it into one system. That means everyone on your team has access to the entire history of your project and all the history of your tests, in a way that's easy to revisit, recreate, and search. I think that can be pretty powerful.

So, thank you for listening. Our project is ongoing and still kind of a baby; I think it's maybe two months since we launched Continuous Machine Learning. We're very much still developing it and looking forward to any feedback about what might be missing, what doesn't work for you, and where you see this fitting into your actual day-to-day machine learning workflows. Any of that feedback is very useful. There are plenty of project updates ongoing; these are our social channels, and we also have a video channel with tutorials and use cases, a lot of which are inspired by people asking us questions and letting us know what kinds of use cases they'd like to see. You can also check us out on the DVC YouTube channel. All right, any questions? I think that's it.

Thank you for the great talk, a very exciting talk. Let's take some of the questions.
So, the first question: an important feature of a version control system is that you can check out the project at any given point in time. How does that translate to DVC?

Yeah, so you can do checkouts, and it will depend. There's a cache in DVC, so basically, if a version is in your cache, you should be able to do a checkout without accessing the remote. If it has never been in your local workspace, then it won't be in your cache and you'd have to go to the remote. So if the whole project is in your local workspace, you should be able to check out very easily, no remote needed. But if you've copied someone else's project to your machine, you'd need to do a dvc pull from their remote. It's pretty straightforward, though; as long as you have access to the remote, it's one command, dvc pull. I hope that answers the question. With regard to how checkout works, it's very similar to Git; it's modeled on Git, so it should be as close to identical as possible.

The next question is: what size of data can be handled? There can be a lot of data. There's no technical limit; you can go as big as you want. I would say that by the time you're hitting a terabyte, you might want to explore some other tools, like Dask maybe. DVC is really optimal for data in the range of about 10 megabytes to a few hundred gigabytes. You can do whatever you want, though; it's just limited by what data transfer speeds you can live with.

Another question: is this only for machine learning, or can it also be used for other data science projects? Yeah, so you can do this in any kind of CI/CD pipeline. If you use the CML library we're showing here, we created basically some authentication methods so that you can hook up your DVC remote storage to your GitHub Actions or GitLab CI workflow. Any time you need to access datasets or models that are in cloud storage, this is one way to do it that offers the high-level syntax and some versioning. So it's pretty flexible, and even though we're focused on machine learning here, I do think some of these tools are quite flexible for non-machine-learning projects.

There was another question: suppose you have a large amount of data and you make some small change in that data; will the whole data be uploaded again? Yeah. Right now there are no snapshots or deltas that would let you recreate a version as an old version of the data plus a small change. Every change is stored as an entirely new version, because of the way we do content-addressable versioning. That's a feature we'd like to build in, possibly in the future, but yes, right now it's not super storage efficient. If you have a huge dataset, any time you change it, it will store another copy. Most people don't run into problems with that, but I could imagine that with very huge data it could become an issue.

Thank you, that's all we have time for. Thank you for today; that was a very insightful talk. I think there are some more questions, but we'll pass them on via the live stream chat; you can answer there, and the audience can follow along there too. Okay. Thanks, everyone.