Welcome to today's session on building machine learning systems. My name is Savin. I'm the co-founder and CTO of Outerbounds, where we build machine learning platforms so that all organizations can build ML systems in a more straightforward manner. Before starting Outerbounds, I was at Netflix for close to six years, helping build out their machine learning platform. So for those of you in the crowd who like Netflix's recommendations, much of my work went into that. And if you do not like Netflix's recommendations, well, I left Netflix a couple of years ago, so I don't have much to do with them anymore.

Jokes aside, today's agenda, especially in light of all the excitement around LLMs and generative AI, is this: how do you go about building principled ML systems, especially as an organization that might be new to machine learning? What is the right mindset, and what are the right tooling investments? Of course, at Netflix and many other places I have worked, there was a point in time when we were also new to ML adoption.

One way of thinking about investing in machine learning is to treat it no differently than traditional software engineering. You plan ahead, you think through every single step in great detail, you have your project plans and your quarterly and long-term commitments to leadership, and then you get to work and build in cathedral style. If you were building a recommendation system in the early 2010s, it's very likely you would think about it that way: where am I getting all the data I need to train my recommendation model? How do I think about that specific model? And you would build a platform very specific to that single use case. That works beautifully well, as long as you are willing to make that commitment, which can often be non-trivial. But it comes with specific downsides. Cathedrals are expensive to build, which limits the number of areas where you can invest from an ML point of view. They are also static and rigid: if your requirements change, or if there are new innovations in machine learning, as so often happens, you might be left behind, or at least there is a real risk of that.

On the other end of the spectrum, you can set up a bazaar. This is basically what we see on Twitter these days: you encourage people in your organization to go out, prototype solutions, and take advantage of the latest and greatest advancements. That's great, and it makes your organization quite a bit more nimble, but it comes with a risk: how do you then align your entire organization and actually move all those prototypes into production?

Ultimately, the question is whether there is a middle ground. Can you get to a stage where you encourage the data scientists, ML engineers, and software engineers in your organization to experiment freely, but at the same time provide them structure and appropriate guardrails, so they can move those prototypes and experiments into production in a straightforward manner, and then apply the learnings across multiple parts of your business?
It shouldn't be the case that if you, as an organization, decide to invest in a recommendation system, that becomes a multi-year proposition, and then when you decide you also need to worry about payment fraud or marketing optimization, each of those requires another multi-year investment from scratch. Establishing an ML platform might be a great idea in this respect. If you can establish a culture of experimentation where it's easy for people to get started, the platform and tooling choices you enforce then allow them to move things into whatever you define as production. That starts a flywheel effect where people can iterate on models much more quickly than they could before. And what we have seen at Netflix and many other places is that the single most important factor in improving the performance of your models often isn't data, and often isn't access to compute; it is the sheer amount of rapid iteration you can do.

So let me walk you through what the life of a data scientist looks like. If you are responsible for building models and creating these ML systems, this might resonate with you. Imagine a data scientist in your organization. Every single day, they need to access data. They have their favorite IDE, oftentimes a Jupyter notebook, and they want to access that data in a safe and secure manner, play with it, and really understand the story it is trying to tell. That data usually comes from a data lake or a data warehouse, and it's likely you already have many of those investments inside your organization. One of the first stumbling blocks is: how do you even understand what data exists? How do you access it, and how do you access it in a manner conducive to machine learning applications? If you're training a model, what you really care about is high-throughput data access. You want to get all of that data onto your compute instance so you can slice and dice it effectively and do your exploratory data analysis.

But as you do this analysis and train these models, you may not have enough compute resources on your laptop. I use a MacBook Air that's already four years old, and I'm pretty sure that's the case for many other people as well: after a couple of years, the cloud and other advancements have left personal laptops quite far behind. So this is one of the first places where data scientists start to interface with the cloud. There is an expectation that you can lift and shift your compute to the cloud, and by cloud, that could mean your on-prem instances or GPU boxes hiding in a closet. But that comes with a good amount of engineering overhead and complexity. All of a sudden you have to figure out: how do I take my code and containerize it? How do I take all my software dependencies and make sure they are available in my container?
How do I make sure my container can access data just like I could inside my Jupyter notebook? And how do I make sure all of this runs reliably and stays conducive to experimentation? If I want to change something in my code, I should be able to make that change easily and re-execute, without going through a process that takes multiple days to push out a new version onto my cloud instances.

And it's likely you don't have just one workload; you have many. You may have a workload doing data processing, generating features, generating vectors, maybe interfacing with a vector database, all the rage these days. Maybe you want to train one model, maybe multiple models; then you want to do something with those models, maybe push the embeddings you've generated into downstream systems. That's where workflow orchestrators usually come in. Again, it's very likely that your organization has already invested in a data engineering orchestrator. You may have Airflow, or if you're Kubernetes-native, you may have already invested in Argo Workflows. There's an expectation that as the data scientist productionizes their machine learning training workflows, they can migrate all of that work onto this orchestrator. And that introduces yet another source of engineering overhead: how do you map all your constructs from the machine learning universe onto a Kubernetes-native orchestrator?

Let's go one step further. Even within a single project, you won't have just one model; you'll have multiple versions, because the work is never done. There are always more optimizations to be made, better models to be discovered, more data to train on. So you'll have multiple versions, and somehow you have to figure out: what are these versions? Which is the right version for production? Which versions should run in, say, A/B tests? All of that needs to be cataloged somewhere. There's also an open question around reproducibility. What happens if your training pipeline fails in the middle of the night for whatever reason? Can you reliably reproduce that error and reduce your mean time to recovery? What happens if a colleague wants to reproduce your work and build on top of it? Do you have a guarantee that anybody else can run your workloads and get the exact same result?

Then we start getting into the realm of software engineering as well. These models, all by themselves, have zero value; they need to go out and live in the real world. That could mean microservices hosting these models. If you have a recommendation system, you host the model inside a service, and any time a customer shows up, you ping that microservice to get recommendations. Or you might ship these models inside a cell phone, which is a use case for many companies as well.
Many companies also use these models for internal business-process optimization. Maybe you use the model outputs to write a memo or create a spreadsheet that gets shared with your leadership. There are many ways to consume a model, even within a single organization.

And then we get to the actual machine learning. How are you training this model? How are you thinking about it? Up until now we have only spoken about engineering concerns: how do I think about data, compute, reproducibility? At some point a data scientist really wants to think about how to build the model itself. What training framework do I use: TensorFlow, PyTorch, Ray Train? How do I think about the data and the feature engineering? That's where they bring their expertise to the table. The goal for us is to make sure that, across this entire stack of concerns we just went through, the data scientist has complete freedom to operate at the highest levels of the stack, bringing their own expertise and their own choices, and that from an infrastructure point of view, this freedom of choice is preserved for them. For the platform builders in the room, we all know that managing data and compute at scale, workflow orchestration, and reproducibility are really hard engineering challenges. So the goal becomes: what is the ML platform that can be provisioned for data scientists so they can play freely, without any of these engineering constructs handicapping them at any point?

That was the situation we were in a few years back at Netflix, and we decided to do something about it. We created a project called Metaflow. It's an open-source project now; you can check it out on GitHub. It was geared toward solving many of these problems. For the rest of the talk, I'll walk you through some of Metaflow's salient features: what it does and how. And if you have questions, we also have a booth at the conference, so please stop by.

The primary construct within Metaflow is a workflow. Workflows are a very natural abstraction for machine learning. It's easy for a data scientist to draw on a whiteboard how they think about training a model: in the first step I access some data; in the next step I do some processing on top of that data, maybe generate some features or embeddings; then I train a model; then I do something with that model. The shape of the workflow may differ, but by and large there is a structure you can expect. So the first question becomes: how can a data scientist declare this workflow in the simplest possible manner? If you look at this diagram, we have a nice little decorator called the step decorator. You can attach it to any Python function, and that becomes a node in your graph. And a self.next call becomes an edge in your graph.
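As a rough illustration, here is what a complete flow looks like. This is a minimal sketch: FlowSpec, @step, and self.next are the actual Metaflow constructs, while the flow name and step bodies are illustrative placeholders.

```python
# Minimal sketch of a Metaflow flow; step bodies are placeholders.
from metaflow import FlowSpec, step

class TrainingFlow(FlowSpec):

    @step
    def start(self):
        # Access some data.
        self.data = [1.0, 2.0, 3.0]
        self.next(self.featurize)

    @step
    def featurize(self):
        # Generate features or embeddings from the raw data.
        self.features = [x * 2 for x in self.data]
        self.next(self.train)

    @step
    def train(self):
        # Train a model; a trivial stand-in here.
        self.model = sum(self.features) / len(self.features)
        self.next(self.end)

    @step
    def end(self):
        print("trained model:", self.model)

if __name__ == "__main__":
    TrainingFlow()
```

Running `python training_flow.py run` executes the graph from start to end.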
A workflow defined in this manner is all Metaflow needs. Now, of course, this requires a bit of work: people have to copy their work over from their notebooks. So we want to make sure that as soon as you do this, there's an immediate productivity boost. Let's see what those boosts look like.

Metaflow comes bundled by default with a fast data access client. Many times you are accessing data from your blob store and you're bottlenecked by the network throughput between that store and, say, your EC2 instance if you're on AWS; the same principles apply across all clouds. As I mentioned, most data access patterns for machine learning are high-throughput patterns, where you pull all partitions, or a select set of partitions, from your data warehouse or data lake onto your compute instance, and then slice and dice to understand what's happening. In that scenario, if you're wasting 10 or 15 minutes just waiting for data to show up, that can be frustrating. With Metaflow, you can get throughput on the order of 35 Gbps, which is plenty fast for most use cases.

And as soon as you create this workflow, everything starts getting versioned. Every execution, whether on your laptop or anywhere in the cloud, gets a run ID. All the code is snapshotted, so you can travel back in time and understand which code was executed for which run. All the intermediate data is snapshotted and captured too. So you can ask: my colleague John was training a model six months ago; I want access to his code, access to his models, and I want to replay his compute. We can provide all of that without people having to think about versioning tools or GitOps; those concerns are hidden behind the scenes. We also create isolated namespaces so people can work in tandem. If you and your colleagues are working on the same model and want to make sure you're not stepping on each other's toes, everything is isolated by default. And all modeling libraries are supported: whether you're a PyTorch fan, a TensorFlow fan, or rolling your own, you can use any framework in the Python ecosystem.

Now, this is all good and fine as long as you are on your laptop. But as I mentioned, one of the biggest stumbling blocks is when you have to interface with the cloud for compute. We try to make that super simple and straightforward, again relying on the magic of decorators. You can attach a resources decorator and specify however much compute you need: how many GPUs, how much RAM and CPU. Behind the scenes, we'll package up your code and libraries, make sure you have access to all the data, and run only that specific step in the cloud.
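Here is a hedged sketch of how those pieces combine: the fast S3 client for pulling data, and the resources decorator together with a foreach fan-out for pushing a step to the cloud. S3, @resources, and foreach are real Metaflow constructs; the bucket path, the 100 configurations, and the step bodies are illustrative placeholders.

```python
# Hedged sketch: fast data access plus a cloud fan-out.
# Run with e.g. `python fanout_flow.py run --with kubernetes`
# to execute the steps on a cluster instead of a laptop.
from metaflow import FlowSpec, S3, resources, step

class FanOutFlow(FlowSpec):

    @step
    def start(self):
        # Pull all objects under a prefix in parallel at high throughput.
        with S3(s3root="s3://my-bucket/training-data/") as s3:
            self.num_bytes = sum(len(obj.blob) for obj in s3.get_all())
        # One entry per model configuration we want to try.
        self.configs = list(range(100))
        self.next(self.train, foreach="configs")

    # Each of the 100 parallel tasks requests 128 GB of RAM
    # (Metaflow expresses memory in megabytes).
    @resources(memory=128000)
    @step
    def train(self):
        self.config = self.input         # this branch's configuration
        self.score = float(self.config)  # placeholder "training"
        self.next(self.join)

    @step
    def join(self, inputs):
        self.best_score = max(inp.score for inp in inputs)
        self.next(self.end)

    @step
    def end(self):
        print("best score:", self.best_score)

if __name__ == "__main__":
    FanOutFlow()
```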
So in this specific scenario, you have the start step, maybe doing some data processing, and then the user wants to run 100 instances of the train step, each with 128 GB of RAM, in the cloud. We make it trivially easy for the data scientist to access all the compute their organization has given them: no worrying about Docker images, no Kubernetes concepts to learn, and all the logs are streamed back to you. If there are any failures, you can easily step through your code and understand what went wrong and where. That ends up being quite a big productivity boost. It also lets you slice and dice your compute. There may be scenarios where you need a GPU only for training the model, so you can run your data processing and other steps on cheaper instances and reserve the expensive GPU instances for the steps that need them. That reduces your cost significantly.

Now, one of the main reasons for us to be at this conference, and the agenda for this talk, is orchestration. With Metaflow you can create this workflow in a very simple manner, but there's an obvious question: aren't there already so many workflow orchestrators? What we just described is a workflow; Airflow also runs workflows; somebody was talking about Kubeflow; Argo Workflows is a very prominent orchestrator too. If you talk to different parts of your company, everybody has a different perspective on orchestration. Software engineers think of orchestration as, say, CI/CD solutions. Data engineers have data engineering orchestrators. Machine learning engineers and data scientists have ML orchestrators. That can become a complicated story for an organization: are you now on the hook for managing all these separate deployments? Do you need to spin up dedicated platform and engineering teams just to take care of the maintenance overhead?

One interesting facet of machine learning workflows is that they are not an island. Very likely they are connected to other workflows in your organization. There could be data engineering workflows, running on your data engineering orchestrator, that prepare the data your ML workflows consume. The output of your ML workflows might be consumed by security workflows, or the output of your CI/CD workflows might feed into your ML workflows. There's a lot of interplay between these universes. What that implies is: wouldn't it be nice to have, in many ways, a centralized workflow orchestrator for your entire company? It could be an orchestrator you already use; it doesn't need to be yet another one. And you can use different tools to author your workflows, which then run on that central, common orchestrator. That's basically what we do with Metaflow.
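To make that concrete before going deeper, here is a hedged sketch of both ideas described in the next sections: a flow wired to an upstream event with the @trigger decorator, and a one-command deployment onto Argo Workflows. @trigger and the argo-workflows create sub-command are real Metaflow features; the flow name and event name are illustrative.

```python
# Hedged sketch of an event-triggered flow; the event name is made up.
from metaflow import FlowSpec, step, trigger

@trigger(event="snowflake_table_refreshed")
class RetrainFlow(FlowSpec):

    @step
    def start(self):
        # Train on the freshly written data here.
        self.next(self.end)

    @step
    def end(self):
        print("retraining complete")

if __name__ == "__main__":
    RetrainFlow()

# Deploying this flow as a native Argo Workflows template, where the
# trigger is honored, is a single command:
#
#   python retrain_flow.py argo-workflows create
```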
So you can use Argo Workflows, for example, as your common workflow orchestrator. You can still use Argo CD and all the concepts in Argo Workflows to declare and define native workflow templates, and that will work really well. At the same time, your data scientists and ML engineers, who don't need to concern themselves with the intricate details of Kubernetes and its ecosystem, can use Metaflow to author their flows, and Metaflow will compile each flow into something Argo understands. In many ways you get the best of both worlds. With a single command, people can submit their workloads onto Argo, and the result is both a native Argo workflow template and a native Metaflow flow. Machine learning engineers and data scientists get all the benefits of quickly iterating on their ML pipelines, and as platform engineers, we have the safety and security of knowing that all of these workloads run on a single, unified workflow orchestrator we already know well.

Then comes the next bit: as I mentioned, these workflows are not insulated from each other; information needs to flow from one workflow to another. You may have a data engineering workload writing data to your data warehouse, say Snowflake, and as soon as that data is written, you want to trigger a machine learning training workload that trains on the new data. Once the model has been generated, it may need to kick off other downstream workflows. So we let you define these triggers, placing dependencies between workflows and passing context through. That can be quite powerful in organizations where different teams contribute to the same machine learning system.

And now I'll hand it over to Yuan from Argo.

The Argo project consists of a set of Kubernetes-native tools for deploying and running applications and workloads on Kubernetes. It uses GitOps paradigms such as continuous delivery and progressive delivery, and enables MLOps on Kubernetes. There are four core, independent, Kubernetes-native projects, and many teams use different combinations of them to address different use cases and challenges. The first is Argo Workflows, the Kubernetes-native workflow engine. Argo Events focuses on event-based dependency management for Kubernetes. Argo CD provides declarative continuous delivery with a fully loaded UI, so developers can look at the UI to see what's happening in their clusters and watch application deployments while following GitOps principles. Last but not least, Argo Rollouts provides advanced progressive deployment strategies for Kubernetes. Argo Workflows and Argo Events provide the building blocks for cloud-native machine learning workflows, and we will discuss them in more detail. Besides these four core projects, there are many other projects in the ecosystem that are based on Argo, extend Argo, or work well with Argo. The Argo project is used and trusted by more than 200 end-user companies and has more than 14,000 Slack members, more than 25K GitHub stars, and 6K forks.
These are the latest diagrams from CNCF and the Linux Foundation that rank projects by developer velocity, based on project activity such as GitHub activity, pull requests and issues, number of commits, and so on. Argo is one of the fastest growing among CNCF and Linux Foundation projects. Argo currently has contributions from over 800 contributors. We provide mentoring for new contributors, as well as regular contributor meetings that give the community an opportunity to participate in design discussions and decisions. There are more than 40 maintainers across the Argo projects, from over 10 organizations.

Argo Workflows is the container-native workflow engine for Kubernetes. The main use cases are machine learning pipelines, data processing, ETL, infrastructure automation, and continuous delivery and integration. Here's a screenshot of what the Argo Workflows UI looks like. Argo Workflows consists of a set of CRDs and controllers that are Kubernetes-native. CRDs are Kubernetes custom resources, and they integrate natively with other Kubernetes resources such as volumes and secrets. Argo Workflows also provides interfaces: a command-line interface, a server, a UI, and SDKs for different languages. The command-line interface is useful for managing workflows and performing operations such as submitting, suspending, and deleting workflows from the command line. The server is used for integrating with other services; there are both REST and gRPC service interfaces, as well as SDKs for languages such as Python, Go, and Java. The UI is useful for managing and visualizing workflows, any artifacts and logs created by the workflows, and other useful information such as resource usage analytics.

Argo Events is the event-driven workflow automation framework for Kubernetes. It currently supports more than 20 event sources, such as webhooks, S3, GCP Pub/Sub, Git, Slack, et cetera, and more than 10 triggers. Argo Events watches for new events from the different event sources and then triggers actions, such as submitting Kubernetes custom resource objects like Argo Workflows, making AWS Lambda function calls, sending Kafka or Slack messages, and so on. You can use Argo Events to manage everything from simple, linear, real-time events to complex multi-source events. Argo Events is also CloudEvents-compliant, so you can use that standard, common event specification to integrate with other systems very easily.

So now, as a closing, let me walk you through the entire stack. We spoke about how this entire ML stack is incredibly important for a data scientist to understand and work through, and Metaflow has a variety of integrations across it. On the data side, it supports all popular data warehouses and data lakes. It also manages a model registry and an artifact registry, so it can track all the work your data science organization is doing. For storage in the cloud, we have integrations with S3, Azure Blob Storage, and Google Cloud Storage, as well as any on-prem S3-compatible storage option you might be working with. For compute, we currently offer two options: on AWS, a native integration with AWS Batch and its HPC offerings; otherwise, if you have a CNCF-compliant distribution of Kubernetes, you can run Metaflow on top of that as well. Orchestration and versioning, Metaflow takes care of behind the scenes, so you don't have to worry about them.
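As a hedged illustration of what that built-in versioning gives you, Metaflow's Client API lets anyone inspect past executions and fetch their artifacts. The API calls below are real; the flow name refers back to the earlier sketch.

```python
# Hedged sketch: retrieving versioned results with Metaflow's Client API.
from metaflow import Flow

run = Flow("TrainingFlow").latest_successful_run
print("run id:", run.id)                  # every execution gets a run ID
print("model artifact:", run.data.model)  # artifacts are snapshotted per run
```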
Any work you do automatically gets versioned and cataloged in this way, and it's straightforward to take the results Metaflow generates and connect them to any other systems in your organization. For deploying these workflows, we offer primarily three choices today. If you're looking for an AWS-managed offering, we can compile your work down to something AWS Step Functions understands and runs reliably. If you're already using Airflow, you can use Airflow alongside Metaflow in exactly the same way. And, as Yuan just described, we can compile down to Argo Workflows as well. So you get the benefits these orchestrators provide in addition to the benefits of Metaflow. And of course, as a data scientist, you're free to use any Python framework you prefer.

In closing: at the end of the day, data scientists and software engineers need to collaborate to build these valuable ML systems, and data introduces a lot of entropy. That implies that, whether to improve your models or just to keep up with the fresh data coming into your organization, you need to keep building better models and keep refreshing your ML system. To take on that complexity, you can rely on the strong foundations that projects like Kubernetes, Argo Workflows, and Metaflow provide.

If you are interested in using Metaflow, please come talk to us; we have a booth here at O20. And if you decide to move forward with Metaflow, you'll be in the great company of organizations like Netflix, Autodesk, Zipline, Amazon Prime Video, and many others. If you're curious to get started, you can go to metaflow.org/sandbox, which provides a hosted VS Code experience with a Kubernetes cluster behind the scenes, so you can try out all of these examples and many more from our public documentation. I know we have five minutes remaining; I'm happy to take any questions, and we can continue the conversation later in our Slack channel as well. Thank you.

Nice talk. As you mentioned, there are already a number of tools and platforms for machine learning workloads: workflow management, task orchestration, tracking. Can you comment on or compare Metaflow plus Argo CD with something like Kubeflow, in particular for running machine learning on Kubernetes?

Of course, there are many solutions out there today. Argo CD, for instance, is a solution for rolling out software artifacts, rather than something built for data science use cases. That harkens back to my earlier slide about the many different kinds of workflows an organization might run, and how you may need different kinds of solutions for them. Kubeflow is much more comparable to Metaflow. With Metaflow, we have made a number of different design decisions about how to make it more natural and idiomatic for data scientists to express their compute, without having to worry about the Kubernetes aspects.
Okay, a related question: do you think these platforms are complementary to, for example, Ray or other distributed machine learning runtimes? Can they work with each other, or are they completely different fields?

Excellent question. You definitely can combine them. Just last week we announced our official integrations with tools like Ray, DeepSpeed, PyTorch distributed, and TensorFlow distributed. You can take your workload and run it through Metaflow, and Metaflow will create a Ray cluster or a distributed PyTorch cluster for you behind the scenes and run your workload on it. So as a data scientist, you don't have to worry about how that gang cluster gets constructed, or how to set up the right environment variables and context so the gang can communicate with its members.

Okay, thank you very much. All right, if there are no other questions, I'll be hanging out in the hallway. Thank you.