All right, let's start. Hello, everyone. I'm Veena, and I'm here today to talk about how to get machine learning right in a complex data world. First up, a quick overview of who I am and why I'm here. I started out as a software engineer, then moved into the data and ML engineering space, and currently I work as a developer advocate for an open-source project called lakeFS. lakeFS is an open-source data versioning engine that offers Git-like APIs and interfaces for your data; essentially, lakeFS gives you Git for data. More on that later, but let's dive right in: MLOps in a complex data world.

I'm sure this picture is not new; we've all seen it. As ML engineers and ML practitioners, we understand that the actual training code is a very small part of the bigger MLOps ecosystem. Most of our tasks and pipelines consist of processing data, creating new features, and then, post-training, deploying the models and putting together a model registry. Almost everything around the ML code revolves around the data used to create those models.

Another interesting point: if you compare software application development with ML model development, development in the ML space is iterative. You keep going back and forth until you get the model that gives you the metric or accuracy you're looking for. It's not linear like software development, where you develop, test, deploy, and that's it.

Because of this iterative nature, it also becomes a lot more complex to get ML models right. You start with multiple data sources, some real-time, some in the batch-processing world. Then you have raw data that you clean and pre-process, and as ML engineers we experiment with multiple pre-processing methods: sometimes you standardize, sometimes you don't, and so on. At the end of the day, you want to curate this training data into specific features that can be reused by different ML teams or different ML models, so you end up building a feature store from which you retrieve the features you want to train on and experiment with across models. Then there's the standard ML training and validation, trying to identify the right model, and at the end of it you have a model registry where you register the models, or the model artifacts, so they can be reused later.

Throughout this whole process, we tend to use an ML experimentation platform. For a lot of us, it's just a Jupyter notebook. Some of us have a dedicated experimentation platform, a scaled-up infrastructure that lets you experiment at a higher velocity.

Because of this iterative nature, there are a bunch of challenges. The first, which we just touched on, is having experimentation infrastructure that lets us train models at high velocity. The next is explainability, because a lot of ML models, thanks to neural networks, have become black boxes. And to make sure models are explainable, you also need to be able to reproduce them, and this reproducibility crisis is a lot more pronounced in certain industries.
Think about pharma, for example, or the finance sector, where regulations are even more constraining and you have to make sure the reproducibility of your models is in place.

The next challenge is data and feature ownership, because multiple people and teams are involved. As an ML engineer, you depend on data engineers upstream and, to some extent, on MLOps engineers downstream to deploy the models you train. So you need data and feature ownership to define who owns what, which also helps keep the quality of the data in the ML pipeline in place.

Then there's collaboration. Even within an ML engineering team, different engineers work on different projects or modules and might be reusing some of the features from the feature store. When you're all working on the same data, you need an effective way to collaborate, which is exactly what Git gives us in the software application development world, and which we don't have for MLOps yet.

Finally, version controlling all the ML assets atomically. We do have a bunch of tools in the MLOps space today, which I'll also cover a bit later. They version models, code, configs, metrics, and so on, but not necessarily the training data. Even the feature stores in use today version the features, but not necessarily the raw training data those features were derived from. So if at any point you want to go back and create a new set of features, you won't be able to identify the specific training data that produced the earlier ones.

So today we have plenty of tools in the MLOps space, and a lot of them claim to be end-to-end, too. But I don't think the end-to-end MLOps problem has been solved yet, because, as I just described, bits and pieces of it are solved while glaring gaps remain. That's why I want to go over these challenges and see what we can do, or what lakeFS can do, for each of them.

First, experimentation. I've already touched on the iterative nature of development: you start with the raw training data, train your model, experiment with it, and if you get the metrics right, you're good to go. If you don't, the loop continues until you arrive at the right model. Every time you train a new model, you want to make sure the model and the data are reproducible, because at the end of the day, if you want to go back to a specific iteration, you need to be able to identify the training data and the configs that gave you the model with, say, the 99% F1 score. That consistency is not in place today: there is no way to track or version all of these components, data, code, artifacts, configs, and metrics, together in one place. And without that, you won't be able to reproduce any of these experiments at a later point in time.
This is exactly where lakeFS comes in. It helps you build an experiment management platform and gives you versioning of all these ML components together; I'll show you a demo later that will give you a better picture of how. As I covered a bit before, lakeFS gives you Git for data, a Git-like interface for your data. Whenever you want to run a new experiment, assume your training data sits in a lakeFS repository. You can create as many branches as you want out of that repository and run your experiments in those branches. At any point in time, if you want to go back and reproduce a specific experiment, all you need to do is check out that commit, and you'll have all of it there: the data, the models, the artifacts, everything. You version all these ML assets together with lakeFS, and from one commit you can reproduce all of the components, and your experiment as well.

To give you a bit more detail on what exactly lakeFS is and where it sits in your data or ML stack today: lakeFS sits as a data versioning layer on top of your object store. If your training data sits in AWS S3, GCS, Azure Blob, or even a MinIO on-prem store, lakeFS sits on top of that and gives you a Git-like interface, so branch creation, commit, merge, revert, any Git-like operation can be done on the data sitting in the object store. Your existing applications, Spark, TensorFlow, Keras, MLflow, and so on, can keep using the data in S3 or access it through lakeFS with versioning enabled, with very minimal intervention in your existing code base. As you can see on the right side of the slide, probably the only change you have to make to bring lakeFS into your ML stack is updating the data path with an extra prefix, the name of the branch, and that's about it.

How does lakeFS work inside? When I create a new branch, is it going to copy all of the data into a whole new branch so that two people can work on two different branches and run their experiments? Not necessarily. In lakeFS, a commit is nothing but a set of pointers, object references to the objects underneath. When I create a new branch, I'm only copying those pointers, so it's a metadata-only operation. Even if you have multiple ML engineers working on the same training data in their own isolated branches, they're not copying the data into each branch; they still point to the same objects underneath.

So what happens when somebody deletes the data? lakeFS gives you the option of soft delete or hard delete, and, just like with Git commit history, you can decide how far back the history should go. And the moment there is a write or a new addition to the underlying data, lakeFS does a copy-on-write: the new objects are added, and object references to them are created from that commit.
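To make that path-prefix point concrete, here's a minimal sketch of reading the same file from two different branches through the lakeFS S3 gateway with pandas and s3fs. The endpoint, credentials, repository, branch, and file names here are all placeholders, not details from the talk.

```python
import pandas as pd  # pip install pandas s3fs

# lakeFS exposes an S3-compatible gateway, so s3fs just needs to be pointed
# at the lakeFS server instead of AWS. All values below are placeholders.
storage_options = {
    "key": "LAKEFS_ACCESS_KEY_ID",
    "secret": "LAKEFS_SECRET_ACCESS_KEY",
    "client_kwargs": {"endpoint_url": "http://localhost:8000"},  # lakeFS endpoint
}

# Path layout: s3://<repository>/<branch>/<object path>.
# Switching branches is just a change of prefix; no data is copied.
prod_df = pd.read_csv("s3://wine-quality/main/raw/winequality.csv",
                      storage_options=storage_options)
exp_df = pd.read_csv("s3://wine-quality/experiment-1/raw/winequality.csv",
                     storage_options=storage_options)
```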
Now, enough about lakeFS itself; let's come back to the ML experimentation space and see how lakeFS can help you get there. As shown here, we have a lakeFS repository, and by default it has a main branch; assume main is the production branch, used for production purposes only. Any time you want to train a new model, or if you have ML pipelines running weekly or daily, those runs don't need to happen in production. You branch out of production, do your weekly or daily training in an experimentation branch, apply your own set of quality controls, and only if the model meets the threshold metrics you've defined for your business do you merge it into production. That way, you're only pushing qualified models into deployment.

Not only that: every time you commit, or merge a model into production, you have a timestamp and a commit message where you can attach as much metadata as you want. Not just the hyperparameters and infrastructure details; you can record who trained the model and the lineage of the data that went into that specific run. And similar to Git, lakeFS also lets you create tags to mark each of these merges or releases. So at a later point in time, if you want to go back to a specific training iteration and use that specific model, again, you check out that specific commit and you're good to go.

Unlocking collaboration is very similar: lakeFS branches give your teams isolated environments to work in, on the same common training data that is available to all of them. This not only lets everyone work with the same data without duplicating it in multiple places, it also gives you Git-like branch protection rules. You can put branch protection on the main branch saying only team XYZ is allowed to merge into main or work directly with the production data or models. So basically, RBAC on your ML pipeline: you're not giving everybody on the team write access to production.

And now the most important challenge: version controlling all the assets atomically. What I'm trying to say is, when you have all the data, features, models, artifacts, configs, and metrics, you want all of them captured together; only then does a single checkout give you everything you need to reproduce a specific iteration of an experiment. Because of the Git-like interface, all you need to do is commit after every step: you take your training data and pre-process it, commit; you create new features, commit; you train your model and have the model artifacts, commit. And once the whole pipeline is done and you merge back to production, all of it is merged atomically. It's never the case that only the code is updated while the model is stale, or only the features are updated while the training data is not. All these changes land atomically, making it easy for your team to reproduce an experiment.
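As a rough illustration of that commit-per-step pattern, here's a minimal sketch using the lakefs_client Python package. Exact signatures can vary across client versions, and the repository, branch name, and metadata values are invented for the example.

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Placeholder connection details for a local lakeFS server.
conf = lakefs_client.Configuration(host="http://localhost:8000/api/v1")
conf.username = "LAKEFS_ACCESS_KEY_ID"
conf.password = "LAKEFS_SECRET_ACCESS_KEY"
client = LakeFSClient(conf)

REPO, BRANCH = "wine-quality", "experiment-1"

def checkpoint(message: str, metadata: dict) -> None:
    """Commit whatever is currently uncommitted on the experiment branch."""
    client.commits.commit(
        repository=REPO,
        branch=BRANCH,
        commit_creation=models.CommitCreation(message=message, metadata=metadata),
    )

# One commit per pipeline step, so any intermediate state can be checked out later.
checkpoint("Pre-process raw data", {"step": "preprocess", "scaler": "standard"})
checkpoint("Create PCA features", {"step": "features", "n_components": "6"})
checkpoint("Train random forest", {"step": "train", "f1": "0.64", "trained_by": "veena"})

# Merging lands data, features, model artifacts, configs, and metrics in main atomically.
client.refs.merge_into_branch(repository=REPO, source_ref=BRANCH, destination_branch="main")
```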
Those were a couple of points specific to the MLOps side of things, but lakeFS, as a generic data versioning engine, can be used for other data engineering and general-purpose use cases too. For example, ETL or data pipeline testing: again, branch out and test, do everything in that specific branch, so you're not working with the production data directly. And like I said, you can create tags, just like in Git, and go back to a specific tag to recreate a run; again, that helps with reproducing a specific one. Reprocessing, too: most of the time, when we reprocess data, we run parallel pipelines, and it's the same idea here, except you do it in two different branches, making sure all your quality tests have run before you push the data into production. And despite all of this, things still do go wrong. If something happens in production and you want to debug and troubleshoot it, the first thing you do, just like with Git and code, is revert back to a previous commit where things were good. That way your downstream consumers aren't affected while something is broken in production, and you can then debug and troubleshoot what went wrong. So that's a quick overview of the other ways people have been using lakeFS.

Now let's dive back into the ML experimentation side; I have a quick demo to show you how you can use lakeFS. A funny thing happened this morning: I recorded everything, but not the audio. So I'm going to do a fun voice-over of the video recording I have for the demo. I hope it's fun.

Okay, so this is what the lakeFS UI looks like. You have repositories; I already have a bunch of them here, example, Netflix movies data, and so on. As you can see, each one has a default branch and also a storage namespace, which is the S3 bucket on top of which I created the repository. Let's go ahead and create a new repository. To create a new repo, I also need a storage namespace, meaning an S3 bucket that already holds the data, because I can only create a repository on top of existing storage so that I can version that data. I don't have one yet, and because I want to keep everything local, I'm using MinIO. MinIO provides an S3-compatible interface, and lakeFS works with any object store that understands the S3 API, so I'm good to go with MinIO. Let's create a new bucket. I'm using the wine quality dataset, the cliched ML experimentation example. All right, I have my bucket, so let's go ahead and create the lakeFS repository, and I want the default branch to be main, of course. Currently it has only the main branch. Awesome.

Another thing I want to mention is that lakeFS gives you this UI, lakectl, the command-line utility, and API clients in multiple languages. I'm a Python developer, so I'm using the Python API client for this walkthrough, in a Jupyter notebook, with a bunch of installs and imports before we get started.
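For reference, here's a minimal sketch of what those setup cells might look like with the lakefs_client package: configuring the client and creating the repository on top of the MinIO bucket. The endpoint, credentials, bucket, and repository names are placeholders, and signatures can differ between client versions.

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Placeholder credentials/endpoint for a local lakeFS server backed by MinIO.
conf = lakefs_client.Configuration(host="http://localhost:8000/api/v1")
conf.username = "LAKEFS_ACCESS_KEY_ID"
conf.password = "LAKEFS_SECRET_ACCESS_KEY"
client = LakeFSClient(conf)

# A repository is created over an existing bucket (the storage namespace);
# lakeFS versions the data in place rather than copying it anywhere.
client.repositories.create_repository(
    models.RepositoryCreation(
        name="wine-quality",
        storage_namespace="s3://wine-quality-bucket",  # pre-existing MinIO bucket
        default_branch="main",
    )
)
```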
Okay, the first step is to create a new branch, because I don't want to work directly against production. I already have main as the prod branch, so I'm going to create a new branch called the ingest branch, which is where my training data is going to come in. And if you notice here, the ingest branch and the main branch have the same hashes. What that means is that underneath, they're pointing to the same objects, so their commit hashes are still the same: the objects aren't being deep-copied, it's just object references to the same set of objects underneath. Now I'm uploading the wine quality dataset to my ingest branch. Currently there is nothing in main, but in the ingest branch we just uploaded the wine quality data, which you can see here, along with the uncommitted changes: what changed, how many files changed, and all of that. Let's go ahead and commit this. I'll show a sketch of these branch, upload, and commit calls in a moment.

Now that we have the training data, the raw data, let's do a bit of data exploration before I dive into the ML experimentation, just looking at what the data looks like. Another interesting thing: lakeFS also has DuckDB embedded in its UI, thanks to WebAssembly, which means you can look at all your files in the UI itself and query them, even Parquet files, without writing a whole new Spark job or spinning up an EMR cluster just for that. And as you can see, this is what our data looks like.

I'm doing a pair plot here, because the dataset has a couple of columns, one is acidity and another is pH, and I'm expecting some correlation between the features, so I'm plotting them to see if there are any actual correlations. It doesn't seem to be that strong a correlation. But as you can see here, it's a clearly imbalanced dataset, and we only have good enough samples for a couple of classes, not all of the target classes; just a few caveats about the dataset.

For experiment one, I have a bunch of configs already defined: I'm going to use F1 as the metric I evaluate these models on, I'm going to scale the input, and I'm going to do a PCA, and see if that helps me come up with a good model. So let's go and create this new branch, experiment one. Again, I created experiment one from the ingest branch, so they both point to the same data underneath, which is why you see the commit hashes are the same here as well; it just has the raw data at the moment. Let's go ahead and dump these parameters, or configs, so I have them when I try to read them later; you can see those configs here too.

Next, I want to create a bunch of features. I'm starting with PCA and plotting the feature variance, and I see that only six of the principal components give me 90% of the variance, so I'm going to roll with just those six. So I have these features, and I'm going to commit them back to my lakeFS repository too. Then I split into train and test; it's the standard ML pipeline, nothing fancy here, I'm just making sure I show you that all the ML assets are being versioned. Then a simple random forest classifier. So I have the train and test data captured as well, and I'm committing after each of these steps, so any point I want to go back to, I can just check out that commit. If you don't commit at these incremental steps, of course, you won't be able to go back to the specific step you want. And we have the model as well.
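Here's that sketch: roughly what the branch, upload, and commit steps look like with lakefs_client. Again, the connection details, names, and file paths are placeholders, and signatures can vary by client version.

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

conf = lakefs_client.Configuration(host="http://localhost:8000/api/v1")  # placeholder
conf.username = "LAKEFS_ACCESS_KEY_ID"
conf.password = "LAKEFS_SECRET_ACCESS_KEY"
client = LakeFSClient(conf)

# Branching is a metadata-only operation: the new branch points at the same
# objects as main until something is written to it.
client.branches.create_branch(
    repository="wine-quality",
    branch_creation=models.BranchCreation(name="ingest-data", source="main"),
)

# Upload the raw dataset to the ingest branch, then commit it.
with open("winequality.csv", "rb") as f:
    client.objects.upload_object(
        repository="wine-quality", branch="ingest-data",
        path="raw/winequality.csv", content=f,
    )

client.commits.commit(
    repository="wine-quality",
    branch="ingest-data",
    commit_creation=models.CommitCreation(message="Ingest raw wine quality data"),
)
```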
Let's go ahead and save these metrics as well. I'm not sure if you noticed: I only got about a 64% F1 score, not so great, so let's roll with the next iteration of the experiment.

One more thing before we do. If you look at the commit history, you can see all the commits you made, and this commit history is also available to you through the API. You can use it as a way of capturing your data lineage, or the ML pipeline lineage: who created these features, what code created these features. And you can walk all the way back, going from each commit to its parent commit, back to the raw data, and see the raw data that fed each of these steps.

Now, experiment two, which, as you can see, is just a bunch of different parameters on a similar pipeline, so I'll go over it quickly. Great, we ran experiment two, and again we have the raw data, the pre-processing, the metrics, and so on. You can also compare the metrics from two different branches here. I'm comparing the experiment one and experiment two branches to see what changed, which is the scores: one branch gave me about 64%, the other about 88%. So let's call experiment two the winning model, the winning branch, and I only want to merge the experiment two branch back to main, which means deploying the model with the highest F1 score to main. Okay, so I merge, and by default I want to make sure the experiment two branch wins, which is "source wins"; just like in Git, you can also select which branch you want to override from if there's a conflict. And if you look at the commit history of the main branch, you can see the commits from experiment two carried over to main, of course, because we merged it.

So: we had training data, we ran two different experiments, and we merged one experiment into production. Now, if I want to go back to a specific iteration, how am I going to do that? The whole premise of this presentation, the challenges, was about reproducing a specific experiment, and this is where lakeFS tags come in, which are similar to Git tags. So we'll just go ahead and create a tag; as you can see here, I've created a new tag, which is a timestamp. Then, everywhere I was reading from a specific branch, I read from that tag instead. Essentially, you're reading from a specific commit ID instead of from a branch name, that is, the branch head. When I run the entire pipeline now, I'm not training a whole new model, because I already have everything; I'm only reading the existing model and checking whether it still performs the same way I wanted. The model gave us 64.8 before; I read back the same model and the same data, tried to reproduce the same iteration, and I'm able to do exactly that.
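A rough sketch of that tag-and-reproduce step, again with the lakefs_client package, placeholder names and paths, and the caveat that signatures vary by client version:

```python
import lakefs_client
import pandas as pd  # pip install pandas s3fs
from lakefs_client import models
from lakefs_client.client import LakeFSClient

conf = lakefs_client.Configuration(host="http://localhost:8000/api/v1")  # placeholder
conf.username = "LAKEFS_ACCESS_KEY_ID"
conf.password = "LAKEFS_SECRET_ACCESS_KEY"
client = LakeFSClient(conf)

# Tag the merged state of main so this exact experiment can be re-read later.
client.tags.create_tag(
    repository="wine-quality",
    tag_creation=models.TagCreation(id="2023-06-01T10-30-00", ref="main"),
)

# Read with the tag in place of a branch name: the ref segment of the path is
# now immutable, so every re-run sees exactly the same bytes.
df = pd.read_csv(
    "s3://wine-quality/2023-06-01T10-30-00/features/train.csv",  # placeholder path
    storage_options={
        "key": "LAKEFS_ACCESS_KEY_ID",
        "secret": "LAKEFS_SECRET_ACCESS_KEY",
        "client_kwargs": {"endpoint_url": "http://localhost:8000"},
    },
)
```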
And that's all I had for you today. One more thing: like I mentioned before, lakeFS is an open-source project, so consider contributing, or even taking it for a spin, because you never know how useful it can be for the challenges your team is working on. These are some of the companies that are using lakeFS and actively contributing to it. And if you have any other questions, you can join our Slack channel, where we have active discussions going on about MLOps, data architecture, and everything else. Thank you, that's all I had. Any questions?

Okay, by a quick show of hands, how many of us here are in the data engineering space? And ML? Okay, so the rest of us are software engineers? Okay, so there's some diversity out here. Great.

[Audience] You're leveraging object storage, S3, and the different options; MinIO, you mentioned, is one of them. What does it look like, and how do you apply those principles against data warehouse solutions? Your data may be in Snowflake, Redshift, et cetera. How are you attacking that difference in how the storage is handled?

Okay, so currently lakeFS does not support data warehouses. It only works with object stores, meaning it only works with data lakes.

[Audience] Right, and I understand that there's a difference in the underlying storage, the way it works. But how have you thought about expanding into that? Personally, we store most of our data in a warehouse for analytics and similar purposes. So even though S3 is great for object storage, other customers or potential users might not have an object store, right?

Gotcha. Yeah, so one thing is that lakeFS is extending its support to different open table formats. For example, it already has a full integration with Delta Lake, and we're extending support to the Iceberg format as well. And speaking of Snowflake, they now also support the Iceberg table format for the data underneath. I think the Iceberg support is planned for Q3 of this year, and once that's out, we should be better able to support at least the Snowflake community. We do have further plans, but those are on the roadmap for sure.

[Audience] Awesome. Thank you so much.