Okay. Well, everyone, thank you once again for joining. This is Indra, and I have with me Venkata from Scribble Data. As before, we are curating a series of conversations with data science folks, broadly called Making Data Science Work. This is about putting data science and its various avatars into production, where there are serious people asking serious questions about when they will get a return on all the money they've spent on this initiative or that. And today we're very glad to have two friends of Scribble Data, Ivan Shcheklein and Dmitry Petrov, who are both founders of DVC.org and now iterative.ai. DVC, I think many of you are already familiar with. It's an open source version control system for data, for machine learning projects. And in fact, I expect that some of you joining here today are actually DVC enthusiasts, joining us from different parts of not just India but potentially other parts of the world as well. You guys are much looked up to. So Ivan, DVC has crossed 5,000 stars on GitHub. That's incredible. Thank you. Yeah, super excited to have all the DVC enthusiasts here from India, and happy to answer any questions, whether here or later; please contact us. We have time, right? Yeah. We answer questions about DVC through probably five different channels, and today we got one more, right? Yes. I believe we started with Discord initially, but now we have a bunch of forums, GitHub; a lot of different channels. Good stuff. I wanted to kick today's conversation off with just a little bit about the basics. While some of the people listening might know what you've done with DVC and what you're doing with iterative, I would love to ask first maybe Ivan and then Dmitry a little bit about your own personal journeys.
Why are we speaking to you today? What got you into DVC, and what got you into iterative? A little bit about your interest in data and your own personal journeys. We'll keep it short, but we'd be very curious to hear. Short, like five minutes short or ten minutes short? Like five minutes short. Okay, so in five minutes I will cover my story and Dmitry's story, and we'll figure out the rest. Go ahead. Thank you, first of all, for inviting us. I'm really excited to have this deep conversation with data science folks, people who build ML tools. My personal story: I'm not even a data scientist, I would say. What happened is that I built tools for data scientists, and I've learned a lot about data science recently, of course. But by trade, I'm a software engineer. So my journey: I started doing systems programming, building databases, back in Moscow, Russia, about 15 years ago. And I have done some startups not related to data science: web services, backends, distributed systems, a lot of different stuff. And the good question is how I got involved in data science at all, right? How did it happen that I connected with Dmitry, and how did it happen that we have been building data science tools for the last two years? At some point, I found myself rebuilding a job title classifier. It's a very simple problem to have: you have a job title and you need to classify it by seniority level or by type of job, is it marketing or is it software development, or something like that. And I was very curious: why don't we have libraries for this? Why don't we have ML models prebuilt for that? Why don't we have a GitHub for data science, I would say? I later realized that building a GitHub for data scientists is the holy grail of data science tools, right? But I found myself wondering why we don't have all of that.
When I first started doing data science, being a software engineer, and I believe a lot of software engineers do data science these days, I was wondering: why don't we share all these results? Why don't we have shared data sets? Why don't we have shared ML models? And even before I met Dmitry, and I don't remember if he had started building DVC at that point or not, I remember we had that conversation. So I believe I initially came to this area from a slightly different perspective. And then when I met Dmitry, of course, it clicked, because, as he will tell in his own story, he comes to this problem from a software engineering perspective as well, in a way that resonates with me and makes sense to me. So that's how we met and started building DVC and other tools. Dmitry, if you want to add a little to this: your own journey, and then where you met Ivan? Sure. So first of all, I spent some time in academia, and we did machine learning even before the term was popular, maybe about ten years ago. And later I was a data scientist at Microsoft. And there's a huge difference between academia and an industry company, right? In industry, you have all these tools for machine learning, the beautiful AI platforms. "Beautiful" is probably not the right term, but at least you have a platform which solves the problems, right? You don't think about how to get a particular type of machine to run my code, where to store data, how to trigger data pipelines, and how to move all this toward more production-ready stuff. Dmitry, didn't you use Excel spreadsheets to manage stuff? Yes, I did. It's a separate story; let's not talk about it. So when I looked at these kinds of AI platforms, and then looked outside of large companies, outside of Microsoft, I was thinking: oh my God, how do people live outside?
For example, when I started playing on Kaggle, outside of my job, I realized that there are no tools. I started investigating how startups solve this problem: there are no tools, and they keep reinventing all these wheels again and again. And I started thinking about how those AI platforms should look in five and ten years. I don't believe that you can take one of those huge monolithic platforms from inside Microsoft, Uber, or Netflix, put it outside, and have it work. I don't believe this will happen. Instead, I was thinking about a kind of AI platform on top of an open source stack. So it should be open source tools, and I believe we shouldn't have one single platform; we should have a bunch of tools which you can chain together so that they support a particular workflow, because the workflow is different in different teams. So then I started playing with this; I started building things: how to get resources, how to visualize my plots, how to get these visuals to my machine, how to synchronize the stuff. And I found that I kept making the same kinds of mistakes. I would run a job, and half an hour later realize: oh my God, I just wasted half an hour and I need to do the job again. And pretty soon I realized I could not do resource orchestration or visualization of my ML experiments properly, because I didn't have a good foundation: a good versioning system, a data management system. And I got this idea: okay, let's build this data versioning system somehow, very quickly, and then I will build the other blocks of this kind of AI platform. So this is how we ended up with DVC. It was supposed to be a couple-months project, and today it is three years old; we celebrated three years a month ago. This is the short story of DVC. Thank you.
And actually this leads me into the broad theme of what we want to talk about today, especially with Ivan telling me about his primary trade being that of a software engineer. So today's theme, as the people who have logged in probably know, is to discuss data science as a software engineering discipline: what can we take from software engineering principles and apply to data science, to machine learning specifically? And I like that, while you're not doing data science yourself, you are thinking both from your own training about what you were doing, as well as about what data scientists today, who are putting machine learning into production, actually need, and bridging these two. So that being the general theme of what we want to talk about, I'd love to ask. Look, I don't want to ask "what are software engineering principles?" I think everybody has a slightly different take on it, and I think for the most part our audience has already absorbed some of that; that tends to be the level of our audience. We're happy to take questions from anybody in the audience. By the way, there's the question and answer tab on Zoom, or you can add comments to the YouTube video, and we'll get them to Ivan and Dmitry at some point during this conversation, if you have any questions. But maybe one of the first things I'd like to ask you is: do software engineering principles apply to machine learning development and deployment? Anything you want to talk about on this topic, maybe some learnings and approaches from software engineering. Sorry, go on. Yes: what did we learn building these very complex systems over the past 30 years that is actually applicable to the new domain that is emerging? Dmitry? So before we jump to the principles, I think the more important question is: why do we need those principles in the first place, and why now?
I believe we started thinking about these principles, about an engineering discipline around data science, because many companies today have reached the point where they finally have their first ML models, their first data science results. And their challenge has moved from how to get a model and how to get data science folks to the production side: today they have this model, and this model needs to work in a self-driving car, it needs to work in some software, it needs to predict things, right? And this is the time when you start thinking about principles, because you cannot just take this model and put it into an actual live system. You need a discipline around those systems. And this is when you think: okay, engineers have ideas about how to manage this; engineers know how to do this stuff; they have dozens of years of experience. Why don't we learn from that and apply the same principles to our model management, to the whole AI product life cycle? So this is where I believe it came from, and today many companies are struggling with exactly this problem, not with how to get a model. Yeah. I would add a maybe slightly controversial opinion: I don't believe data science is very unique, or very different from software engineering, or very complex compared to it. It's just that these are the early days of data science. In software engineering, we have solved all these complexities, quite successfully I think, and it still evolves: we have gone from monolithic version control systems and waterfall approaches to agile, to distributed version control systems like Git, to modularity in systems, to open protocols, to everything. And we have improved all our processes significantly. I believe something like this will happen in data science as well.
So we will take the best from software engineering and apply it to the data science process, and pretty soon we will have an ecosystem of tools solving all the complexities of these processes. I can add to this. Basically, we need to learn from software engineering experience and transfer it to data science, and as Ivan said, a lot of these principles apply. However, we should know that there is a difference, and it would be a big mistake to ignore the difference. This is actually one of the reasons why many data science tools are not very successful: when you just take one principle from software engineering that makes total sense and apply it to data science, the folks might be reluctant to use it. So we need to understand this difference; we need to think deeper about data scientist behavior: what they need, what is important, what is not, and what applies. So, Dmitry, I have a question about this, and I think we're getting into the difference between traditional software engineering and data science, especially machine learning. Beyond the fact that the word "learning" is built into ML, there is also this notion of growing complexity of systems. When you're building a machine learning model, you're already thinking about future versions of it; you're already thinking about the ways, different from traditional software, that this can break, and that you have to be responsible from a retraining perspective, from a debugging perspective. Can you help me tie this into why you were thinking about DVC, and maybe a little bit about how the DVC solution plugs into this? Yeah, this is a good question, and we built DVC to reduce part of this complexity.
I believe one of the bigger sources of the complexity is that every team has two different stacks today. On one side, you have a set of tools for software engineering, with all the traditional pieces: you have a version control system, Git with Bitbucket or GitHub, you have your CI systems, dozens of those, code quality control, and you have an agile process and all that stuff. And on the other side, you have a new set of tools for machine learning, and you need to connect them. And that creates complexity: instead of five different components connected to each other, you get five more, all interacting with each other, and the workflow is not well defined. People just get lost in this complexity. And today we have so many systems, because some folks need machine learning tools for modeling, and other people need data processing tools, clusters, HDFS, and other stuff. A lot of the individual problems can be solved relatively easily, but it is not easy to put all the systems together and come up with a meaningful workflow around them. So yes, I think all the complexity today has moved from how to build a model, how to solve this problem, to how to organize the workflow and how to make all the systems work together. I think this is one shift which is happening today. Yeah, I would add to this, coming back to your question about how we see DVC fitting in. Maybe we're jumping ahead a little. From my perspective, for data science there are two major differences: data and science. Data means that you now have some artifacts to take care of:
ML models and data sets. And science means the process is different: it's an R&D-like, highly iterative process where the result is not well defined; it's metric-driven, and we can go on and on, right? And that's what DVC is about. It's about collaboration that fits these two differences. It's a collaboration tool: how do we share our data sets, and where do we put them? These are fundamental problems every team faces the first time they start doing any ML or data project. How can I send you this ML model? How can I communicate with my team? How can I communicate with production systems? That's what DVC is about. It's a fundamental layer, a collaboration layer, for data science projects. Tell me this: would you say that DVC, whether in the way you're building it or in some other form, was inevitable in the evolution of products in the overall data tool chain? Well, good question. Yes, yes. I believe it's inevitable because, as I told you, every team that starts doing a data or ML project has to come up with some conventions. Where do they put their models? How do they name them? How do they get them to some machine to train? They come up with prefixes, suffixes, special locations, or something like that. And it reminds me a lot of software engineering 20 years ago, when people were sending archives of software code to each other and then merging them together in some peculiar way. Not personally. Let me step back. Even the way you framed it, that data is the new dimension that has been added and that science is a second dimension: it's fundamentally a search process, right, where you're building these systems and these systems are continuously evolving, getting closer and closer to the objective that we need to meet.
Now, that seems like a fairly complex process, because it involves a lot of understanding of what problem we are solving first, and how we are solving it, and also actually implementing and learning from it, and so on. If you look at the software engineering journey, we had almost 30, 35 years: the first versioning paper was written in, what, 1970? And Git came about in the 2000s, and GitHub around the same time. We had a long duration to learn, understand, debate, and explore various options. How are you seeing the repetition of this whole process in the data science discipline as it is happening now, and how do you see it going forward? How do you see the evolution of this space? Yeah, it was quite a journey for software engineering. It took 30, maybe 40 years, maybe even more, for this change: from "hey, this is a version control system" to today, when every software company uses one. I believe in data science all of this evolution will happen way faster compared to software engineering. I mean five years, not 50 years. And I believe there are a few drivers for this. First, the Internet: in the 70s and 80s, it was of course not easy to innovate when you sent actual mail, not email. Second, more folks are involved in this process, across software engineering and data science. And, what is even more important, we can learn from software engineering; we can learn from other disciplines. Today there are not only software engineers; there are designers, there are other professionals, and we can learn how they do their work and take those ideas. Even more than that, we can take their tools and build our tools on top of them. And this is the next driver. And this is actually our fundamental belief behind the company and behind the tools: we don't try to invent new systems.
We try to build systems on top of the engineering ones, on top of existing ones. And this is one of the major drivers which can get us through this journey in five years, not 50. Yeah, that's right. We can definitely benefit from the knowledge the software engineering discipline has acquired, and we can build on top of it. And don't forget that software engineering is way more efficient right now. It's not even only about the Internet; it's about all the tools software engineers have. And building tools for ML, for MLOps, ML engineering, is an engineering discipline. In our company, I sometimes wish we had more data science, I would say. Dmitry and I discuss and debate a lot about our team: we feel some disconnect, because we're building tools for data scientists, and it's a software engineering discipline, but we sometimes wish the team were more involved in data science, to understand the space. Makes sense. I have a question from Neeraj on the Q&A panel. He asks: how is DVC designed for experiments in a development environment, where we work on sample data which is relatively small, versus DVC in staging or production, where the data is huge? And he further asks: how can we reproduce the pipeline in staging? Okay, so we go into the details of DVC. This is where the audience wants you to go. We will answer, but before jumping into the details of DVC, let me give a bit of an overview, because some folks who are not familiar might not understand DVC even at a high level. Let's start from the high level. We say DVC is a version control system for data. But if you look closer at DVC, it's not, and this is a kind of surprise: it's called Data Version Control, but actually DVC just codifies your data.
It codifies your data: it replaces data with meta files and puts your actual data in your cloud storage, like S3 or Azure, or maybe you have a separate SSH server, for example; DVC supports all the major storages. And you do the actual versioning with Git: the meta file versioning. So in this way, we are just codifying data, and we are not reinventing the version control system. This is exactly what I mean when I say we like to build systems on top of the engineering workflow. We don't want to reinvent version control; we'd like to augment version control with data files, with pipelines, with model versioning, and all the other stuff. And metrics; yeah, metrics are one of the crucial parts of the workflow. If you're familiar with the DevOps paradigm, you can think about DVC as kind of like Terraform: Terraform codifies your infrastructure; DVC codifies your data. This is an analogy, not a perfect analogy. Yeah, in the infrastructure world there is even a fancy term now, GitOps, and I would say DVC is GitOps for data. Yeah, DataOps, one more term. We can go on and on with terms. Yeah, go ahead. So, I'm actually done with the introduction of DVC. So the question Neeraj had goes a little deeper into DVC, and I know we'd like to have kept it for later, but what Neeraj asked was: how is DVC designed for experiments in development, where the sample data is relatively small, versus in staging or production, where the data is huge? Yeah. First of all, let's discuss experiments in DVC. DVC is built on top of Git, right? We operate with commits. For DVC, an experiment is a commit. And if you work with this concept, experiment equals commit, you are in good shape to share your experiments with the team, right?
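The "codify your data" idea Dmitry describes can be sketched in a few lines of Python. This is a toy illustration, not DVC's actual implementation: the `.meta` file name, the JSON format, and the local cache directory standing in for S3/Azure/SSH storage are all simplifications for this example.

```python
import hashlib
import json
import shutil
from pathlib import Path

def codify(data_path: Path, cache_dir: Path) -> Path:
    """Toy version of the idea behind `dvc add`: hash the data, park the real
    bytes in a content-addressed cache (standing in for remote storage), and
    leave a small Git-friendly meta file behind."""
    md5 = hashlib.md5(data_path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(data_path, cache_dir / md5)  # the heavy file lives here, not in Git
    meta_path = Path(str(data_path) + ".meta")
    meta_path.write_text(json.dumps({"path": data_path.name, "md5": md5}))
    return meta_path  # a few bytes of JSON: this is what Git versions
```

The point of the sketch is the division of labor: Git only ever sees the tiny pointer file, while the bulky artifact is addressed by its content hash, which is what makes any historical version retrievable later.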
And if an experiment was built on your local machine with a small data set, or maybe with a synthetic data set, and you need to replace it with the production one, what is needed? You just need to replace the data sources. In DVC, we have the concept of pipelines, and in the pipeline you replace the data source with the production one and run the pipeline: you do `dvc repro` on top of the other data set. This is what people are doing; sometimes they do it manually, and in some scenarios they do it automatically. For example, if you have a CI/CD system and you commit your experiments, which go through the CI system, then the CI system can replace your data set with the production data set and run training on the production data set. And the beauty of this scenario is that the user can even be separated from the actual data: as a user, you might have access only to your development data set, not to the production data set. One thing I'd like to add here is that DVC now has a substantial community, and every day they're putting out more tutorials and documentation on how to address all of these scenarios; they're very active. So Neeraj, I would recommend that you look into the various DVC forums; I think there will be ready tutorials on this. Yeah, my perspective, to add to this: for those who are familiar with Git, a close analogy is to think in the same terms. How would you put Git into production? You use Git in production as a protocol to deliver stuff: you do git clone or git pull or git push to communicate with production systems. Maybe in some automation scenarios you do git commits as well, to capture some results, for example.
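The swap-the-source-and-reproduce idea can be sketched as a small stage runner: a stage reruns only when the recorded hash of its input no longer matches, so pointing the same pipeline at the production data set naturally triggers a rerun. This is a hypothetical illustration of the concept, not how DVC's `repro` is actually implemented; the function names and file names are invented here.

```python
import hashlib
from pathlib import Path

def maybe_rerun(data_path: Path, state: dict, train) -> str:
    """Rerun `train` only if the data file's content hash differs from the
    recorded one; otherwise skip, like a pipeline stage whose deps are clean."""
    h = hashlib.md5(data_path.read_bytes()).hexdigest()
    if state.get(str(data_path)) == h:
        return "cached"      # inputs unchanged: nothing to do
    train(data_path)         # inputs changed: re-execute the stage
    state[str(data_path)] = h
    return "ran"
```

In CI, this is the step where the pipeline's dev data source gets replaced by the production one before the rerun, so the user who committed the experiment never needs direct access to the production data.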
It's the same with DVC: it plays very well as a kind of ledger of what's happening, as a protocol and project description. It plays very well in the development phase, when you collaborate with your team members and with production systems, but it doesn't try to replace Airflow when we talk about pipelines, or sophisticated production systems that run pipelines 24/7. The goal is maybe to deliver data sets there, or maybe to capture results there, so they are visible, so they can be reproduced, so we can share them. That's the goal of DVC at a foundational level: it's a protocol, a way to organize your project, a protocol between systems, a way to capture your data artifacts. Wonderful. In addition to what DVC does, I liked how you talked about why you approached DVC as a problem in the first place. Both Dmitry and Ivan were talking about this: your individual approaches that eventually seemed to converge. The fact that you were thinking about the data tool chain itself was interesting. So I have a question that moves a little away from DVC; maybe we'll come back to DVC in a bit. In your opinion, what are the next logical steps in the development of the data science engineering tool chains? Do you have any hypotheses? Or maybe you've seen some interesting projects happening which make you think: yeah, that makes sense, this is what needs to happen next. Yeah. Let me think. Scribble, right? I saw one interesting question. Thank you. To be serious, a lot of interesting stuff is happening. And what we would like to see eventually, as Dmitry mentioned, is an ecosystem of different tools playing well together. What we see now: CI/CD systems appearing for ML, and Dmitry can talk about that more. And we see feature stores; now at least people realize there is a term for this kind of system.
And I just want to say to people: that's what we see across the industry; it's not about Scribble or DVC per se, it's not only about us. And we see deployment systems, and, what I like most, we see open source deployment systems as well, which address all these complexities. We see tight integration of software engineering tools with data science tools; Kubeflow, for example, runs stuff on Kubernetes. So, yeah, I hope step by step we move in the direction Dmitry mentioned. Yeah, it's amazing. The amount of progress in the last two, three years; the tool ecosystem itself is growing exponentially. Go ahead, Dmitry. Yeah, I just want to add to this. I believe the next big thing will be related to integration between those systems, because today many companies and many systems are built in a "we are end to end" way. But I believe in the future we will see more integration between those systems: for example, people will work with data through DVC or some feature store tools, and they will be able to deploy the model and then get the result back, to close the loop, to see the difference between the metrics you got at the modeling stage versus what you got at the production stage. What is the equivalent of this in traditional software engineering? Are we talking about network protocols, REST APIs, and standards? What is the closest analogy that comes to your mind when it comes to integration? Yeah, integration. I believe a big portion of that integration is done by the version control system at the development stage, and part of it comes from APIs, if you are talking about deployment systems and production systems. Yeah, and maybe it's not even only about version control; version control integrates all these pieces together, right?
But the next level is that we have all these tools we integrate: GitHub with CI/CD tools, then with some production systems. Yeah, Git ties all of these together, but we have layers of tools. Oh, yeah, yeah. Git alone is not enough, of course. Yeah, Dmitry, maybe you remember when we initially met: the way, as far as I remember, Dmitry pitched me, DVC was not about DVC itself; the end goal was solving some problems. I remember we had discussions about CI/CD, and that was three years ago: CI/CD for ML. Maybe you can actually share that part of the story. Yeah, actually, CI/CD is one of the pieces which connects things together, right? It does in software engineering, and the same might happen in data science. You have your code, you have your development-phase artifacts, and CI/CD connects those artifacts with computational resources: it builds some stuff and it reports the result, which is important. And in data science, this result is even more important, because in software engineering, what do you need to say? Yes or no: was the build successful or not? Tests pass, tests fail. In data science: accuracy went down 0.5% and precision went up 0.75%. Is that good or not? I don't know. This is the big difference between data science and software engineering. So that's what I'm wondering. When we think about CI/CD systems, the one known unknown is the data: the data keeps changing. And in the case of traditional CI/CD systems, we had a specification, and we could write tests for that specification, and then our CI/CD systems could check whether we meet the spec every single day. Now here, the spec is unclear, and it is changing every single day. So there should be a missing piece somewhere: the concept of CI/CD has to be reinterpreted in the data science space. Yes and no.
I mean, before we start to reinterpret it, I would say let's first do the first step and interpret it in a similar way as software engineering has. Let's first learn how to run tests for data, for example, and run tests for ML models, at least with some golden data set we have. That would be a good start. And what you mentioned about data changing: yes, but that reminds me more of monitoring systems, I would say. It should probably be part of some monitoring system, closer to the deployment phase, for when you want to detect drifts in how the model behaves. To my mind it's a slightly different problem compared to CI/CD, but who knows; maybe it will become part of CI/CD as well. Yeah, I think the important part here is the workflow, how the workflow is organized. In software engineering, you have these solid artifacts, your code or maybe your data; you expect them to pass all the tests, you expect them to be correct, 100% if possible, and then you build something out of them. In data science, unfortunately, or maybe fortunately, I don't know, you cannot be sure. You work in a slightly different paradigm. It's okay if your data changes; you should expect this. It's okay if your model makes a different prediction. You still need to be consistent in your process, but all this complexity moves from the artifacts up to the workflow level. You need to be sure that your model is not drifting too much, whatever that means for you, and you should be okay if it drifts a little. You should be okay if, for example, precision decreased while accuracy increased, something like that.
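The "is a 0.5% drop acceptable?" judgment Dmitry describes can only be automated by encoding your own tolerances into the CI gate. Here is a minimal sketch of such a gate; the metric names and the tolerance value are made-up examples for illustration, not something any tool prescribes.

```python
def metrics_gate(old: dict, new: dict, max_drop: float = 0.01) -> bool:
    """Pass the CI gate only if no tracked metric fell by more than `max_drop`.
    Unlike a pass/fail test suite, this encodes a team's judgment as numbers:
    small regressions on one metric are tolerated, large ones are not."""
    return all(new[metric] >= old[metric] - max_drop for metric in old)
```

A gate like this is what turns "accuracy went down, precision went up, is that good?" into a yes/no answer a CI system can act on, while still allowing the controlled uncertainty the conversation describes.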
So you don't need to be 100% accurate the way you wish to be in software engineering. And this is a huge cultural difference between software engineering and data science, because for a software engineer it might not come naturally — their motivation pushes them to reduce all this uncertainty. In data science, you are okay with uncertainty; you just need to build the right workflow around it — how to mitigate these issues, how to deal with this uncertainty. So it's the worst nightmare for a software engineer. We usually can't push precision and recall to 100% simultaneously, right? Yes.

Another question from Nidaj, related to the same thing, where Dmitry talked about changing the data source from sample data to production data. Nidaj says his production data is a collection of so many .dvc metadata files — how does one specify so many files, which keep increasing every day? Is this one best relegated to the forums, or can you answer it quickly?

Yeah, I can answer at a high level; the low-level details we can discuss on the forum or through other channels. At a high level: in DVC, we create metafiles for all the sources, so the number of files grows, and sometimes that bothers people, of course. But we know about this problem and I believe we have solved it — we will be releasing DVC 1.0 next week. In DVC 1.0, we moved away from separate metafiles to a single one. We also separated the part the user works with from what the system generates — the hash sums and all that. So you have one human-readable specification, and a separate file with the technical details: the checksums, all those internals. Yeah, and I would add that DVC itself handles these files.
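For context on the single-file direction described here: in DVC 1.0 the human-editable pipeline specification lives in one `dvc.yaml` file, while the generated checksums go into a separate lock file. A rough illustration — the stage name, script, and paths below are invented for the example; only the split between the two files is the point:

```yaml
# dvc.yaml -- the human-readable specification (you edit this);
# checksums and hashes go into an auto-generated dvc.lock instead.
stages:
  train:
    cmd: python train.py
    deps:
      - data/train.csv
      - train.py
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
```

Because the lock file is machine-generated, the growing pile of per-file checksums no longer clutters the part of the project a human reads and edits.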
DVC is built to be a little more explicit than Git, because we deal with data. We don't want, for example, to clone all the data artifacts you ever had, plus the whole history of your experiments, at once. So we made it more explicit intentionally — you see these metafiles in your project. But DVC handles them: you have a set of commands, and usually you don't even need to edit these files manually. The only files Dmitry mentioned are the makefile-like files we have for ML — and again, DVC can help you edit them, and in DVC 1.0 it will be a single file.

Wonderful. So, coming back to this question of discipline — the end-to-end scientific process. The way to understand what is happening today is that in data science, in the model development process, we are starting to impose certain checks and balances across the entire process. If it was going to be a science to begin with, reproducibility and all of these should not be strange words anyway. But the question I have in mind is that end-to-end discipline applies across all layers of the stack, through the entire journey of the data — whether starting from the data collection itself, or at the process level. For example, when you are iterating on the models, that also needs discipline. I'm curious whether you see any significant gaps. There are no standards for data collection today, for example. A client recently asked us for a recommendation on standardizing that process, and when we looked, we couldn't find one. So I'm curious what other big pieces need to become much more controlled to achieve end-to-end discipline, or confidence in the outcomes.

Yeah, so it's a never-ending story.
I don't believe we can achieve 100% confidence or 100% reproducibility in this sense, because data pipelines are hard. We have so many moving pieces: data ingestion, some big data systems, some databases on top of that, and then we fetch files from these systems to do the actual ML modeling. The question is whether we actually need to capture and version the whole pipeline — in some industries, I believe we do. But in the majority of cases, I would say the really good first step is to capture starting at some point: at least get a snapshot of the intermediate artifacts and data you deal with today, that your model is trained with. So that we at least have some responsibility and accountability — if something happens going forward, we can see what data was used.

I mean, you have to break up the problem at some level, and each subsystem can have its own definition of discipline, and its own tools to encourage discipline, and so on. But given the guarantees you have to give for the model — correctness, explainability, and so on — somehow there needs to be coordination across all the regimes of discipline at every step in the life cycle.

Right. I don't believe there will be one single tool that covers all the accountability problems across the whole data stack. The dev worlds are just too different — Git and ML modeling versus big data clusters and data ingestion like Kafka. Try to version Kafka with DVC: it probably won't work, and it doesn't make sense. If it's needed, there will be other tools that take care of that.

I have a question that is a little tangential to this, just to add a different flavor. You guys are building a product, you're building a tool, and in that sense you're entrepreneurs, right? I don't want to lose sight of the fact that this is a business — there's marketing, there's scaling.
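To make the "at least get a snapshot" point above concrete: capturing a snapshot of an intermediate artifact boils down to recording a content hash you can compare later — the same idea DVC's metafiles implement with MD5 checksums. A toy sketch, with the function name, file contents, and record layout all invented for illustration:

```python
# Record a content hash for a data artifact so later runs can tell
# whether the exact same bytes were used for training.
import hashlib
import json
import os
import tempfile

def snapshot(path, chunk_size=1 << 20):
    """Return a small metafile-style record: file name plus MD5 of its bytes."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return {"path": os.path.basename(path), "md5": h.hexdigest()}

# Demo on a temporary file standing in for an intermediate dataset
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as f:
    f.write(b"id,label\n1,cat\n")
    tmp_path = f.name
print(json.dumps(snapshot(tmp_path)))
```

Even without versioning the whole pipeline, storing such a record alongside a trained model gives the accountability mentioned above: you can later verify exactly which bytes the model saw.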
There are choices you're making all the time. And for the entrepreneurs in our audience, not just the data folks — could you tell us a little about a time you went down a wrong road, or had a significant disagreement about your roadmap? I'd love to hear what your mechanism was, either for identifying that you were going down the wrong road, or for coming to an agreement when the two of you were pulling in different directions. So either of those, if you can talk a little about them — and the journey of DVC itself and how the community is going. That might be a little broad, but if we can just focus on that. Go ahead.

Yeah, so regarding the roadmap: first of all, we have a big roadmap — what kind of product we need to build, what kind of problem we need to solve — and that's about big steps. And what I see is that we haven't changed this roadmap much. As Ivan said, three years ago we were talking about CI/CD for machine learning, and over the last year we've seen a clear signal from the market that people need this — people have started talking about CI/CD for machine learning. We knew the versioning part, the data transfer part, was crucial; we see that today and we are building DVC. There are some new directions — how to version not datasets, not files, but objects; how to build feature stores and such. We didn't have a big plan to move in that direction, but we were thinking about it before, and today we see more and more movement happening. So I believe we don't have big disagreements at that level. However, when you start building the product, when you start prioritizing tasks and deciding what to focus on next — not this year but this month.
That's when we have all the disagreements and all the discussions about what we should do. Yeah, and it's fine to have disagreements — you just need to disagree and commit and move forward. Yeah, you have to have disagreements. And I would also say it's important to test hypotheses fast. It's fine to fail — you test a hypothesis, you move on, and you do something else after that. The question is how you test as many of those as fast as possible. And I would also mention that with DVC, our community helps a lot. Don't forget we have had users from day zero. I remember initially, when one person came and asked a question, we were happy — we tried to solve the problem and learn as much as possible. Now we sometimes have three, five, ten issues a day, plus discussions on Discord. It helps drive the roadmap a lot. At least the short-term roadmap for DVC is very clear in that sense, and it helps to prioritize tasks, as Dmitry mentioned: should it be CI, should it be deployment, or should it be something else? The community helps a lot, so thank you to everyone who participates there.

Maybe you're also actively going beyond the folks who approach you — you have a significant outreach program that you started recently. Can you tell us a little about that? I think there are a few folks in the DVC community in India who would be interested.

Yeah, so our community, as Ivan mentioned, grew from one request per week two years ago, when DVC was just launched, to three, five, ten requests per day. So we have this active community of users, and the natural next step is to build a more systematic way of working with the community. So we launched the ambassador program — the DVC ambassador program.
What we try to do is establish connections with active community members all around the globe, and we help them work with DVC — to write about DVC, to talk about DVC, to organize meetups about DVC. We introduce them to each other; they have a lot of common interests and common projects, so they help each other. Today the community has reached the next level, the ambassador level — from random folks to something more systematic. We launched this program less than a month ago, I believe, when we published the first blog post, and we already have five ambassadors. Some of them already had good experience with DVC — they had created DVC meetups or written blog posts; the others haven't yet, but they want to. We are very excited about this move — a very new thing for us, kind of an additional team, if you wish.

Yeah — is there a good channel for people who are interested to reach out and start becoming a part of this? Yeah, of course. First of all, on the DVC website you can find the community page, and there is information about the ambassador program there. And recently one of the ambassadors published a blog post on our website — you can read more there about how the ambassador program works, if you're interested. You're welcome to join.

Wonderful. So I'm thinking that's a great note to wrap this up on. But Ivan and Dmitry, I know we usually get questions once people see the video — posted under the video on Hasgeek's own page, or comments that come later on YouTube after people have seen it. We'd love to circulate those back to you and get your answers. And you both have fairly public profiles, and DVC.org is fairly public as well.
So for anybody that has questions, you have all of these avenues: the various fora that DVC participates in; the Hasgeek page where the talk is — where your registration for the talk was and where the video will show up — as well as the YouTube link where this video is currently being streamed live and where it will be available afterwards. So with that — thank you so much for making yourselves available. I say this with a lot of warmth because, like I said, it's 6am and I haven't seen 6am in a long time. And the second thing is the energy you brought to this, which again was amazing. So thank you so much for being here, and we will continue this conversation in some format or other. As I promised, if there are folks who would like to ask some more questions in your time zone, we'd be very happy to do this again for you. Thank you. Any closing thoughts?

Yes. First of all, thank you Ivan and Dmitry. We've been having conversations, but it is great for you to talk to our community as well. Scribble — just as iterative is a friend of Scribble, we are good friends of DVC as well. We will be more than happy, both as users and facilitators, to connect people back to the DVC folks. We are strong believers in the project and users of the project ourselves. So, looking forward to rapid growth in the Indian DVC community. Thank you Ivan, thank you Dmitry. This was a great discussion. See you around. Talk to you soon. Take care, everyone. Bye bye.