Okay, I guess we can get started. Hi all, thanks for joining us from all over the world; we have quite a few people in attendance today. My name is Savin, and today, with my colleagues Jason and Brian, I'm going to talk about a project that we very recently open sourced called Metaflow. In today's workshop, we will walk you through how Metaflow might be useful for your data science projects, and we hope this entire session is super fun. We would definitely like to thank our collaborators. This workshop was originally supposed to be part of the useR! 2020 conference in St. Louis, but unfortunately, because of the pandemic, we had to move everything to a virtual venue, and we were very kindly supported by the useR communities in Africa, so many thanks to them. Before we begin: because there is a hands-on component to this workshop, it would be nice if you can install the R package Metaflow on your laptop. The instructions can be obtained by going to this URL, bit.ly slash Metaflow R. If for any reason you're unable to install Metaflow, that's totally fine; you can just watch what the instructors are doing and follow along. Throughout this workshop, we have a number of teaching assistants who are going to handle your questions. You can post your questions in the Zoom Q&A, and it would be nice if you can make your questions visible to all the attendees. You would also have received an email confirmation about the workshop that has a link to Gitter, so if that's something you prefer, please post your questions over there as well; we are monitoring both channels. We also have live captions: on your screen within the Zoom webinar you should see a tab called closed captioning, and enabling that will allow you to follow along with the workshop. There's a separate URL for captioning as well, if that's of interest. So let's begin.
First, the very exciting news: Metaflow for R was open sourced just this past Tuesday, so you are the first people to get a look at what Metaflow is all about in the R universe. At a high level, Metaflow is a platform for data science projects. It was built at Netflix, and we have been using it for all sorts of machine learning and data science projects for almost three years now. Almost all of machine learning at Netflix, except for recommendations, uses Metaflow. The high-level agenda for today's workshop is that I'll cover the motivation for us to build Metaflow, and when you should consider using it. Then my colleagues Jason and Brian will take you through a hands-on session: they'll take a business problem as a case study, identify some of the issues that you might run into in a day-to-day fashion, and show how Metaflow can help you get over them. Now, before we begin talking about Metaflow, it's really helpful to understand why we even built it. What was the motivation? When people think about machine learning and Netflix, the very first thing that jumps out is our recommendation system. For those of you who subscribe to our service, when you log into Netflix you see a whole bunch of TV shows that are all customized to your tastes and preferences, and we have been investing very heavily in that area for well over a decade now.
But as Netflix became a global company, became a studio in itself, and started producing more and more content, there were a lot of different areas where we had to invest when it came to data science. Areas like: how do we properly value any piece of IP that we look at? Say we need to produce a TV show or a movie; somebody needs to figure out that this TV show is worth this many dollars, and we use a lot of data science to guide our decision-making around that. To make sure that your viewing experience is optimal, we do a lot of research around video and audio algorithms, figuring out how to cache our content as geographically close to you as possible, and we use a lot of machine learning to enable that as well, for example predicting network quality. We spend heavily on marketing our titles to you. There's a big experimentation framework as well, which is primarily driven by R, so we are big on causal inference too. So you can see that we have a diverse set of use cases that people inside Netflix are working on, which means that our data scientists come from a diverse set of backgrounds as well: they have skill sets in Python as well as in R. When my team, the machine learning infrastructure team, went about building a platform for them, we had to make sure that our platform could cater well to users in the Python universe as well as users in the R universe. Metaflow is essentially that platform, and here we are today talking to you about it. Now, let me walk you through what the life cycle of a data science project in R looks like from a very high level. This will also highlight some of the problems that we run into day to day, and that hopefully you have experienced at some point as well.
Most data science projects start out in some sort of IDE. In the R universe, RStudio is really popular; a lot of people use Jupyter notebooks with R kernels, or maybe you just use Vim, Emacs, or some other editor. When you first start solving a business problem, the work is very exploratory, very experimental in nature. You're constantly looking at newer data sets, constantly tweaking your algorithms and your approach, and there is a lot of back and forth movement. There might be certain things that seem promising, and you'll follow up on them. Then once you've generated your results, it might turn out that you need to take a step back and try out an idea again, maybe with a different set of parameters. And as it happens, there is a very strong need to maintain some sort of history so that you can very easily move between versions of your own work. Data science is rather interesting because, when it comes to this notion of versioned history, it's not just the code that we're talking about: we're talking about your data, and we're talking about your results as well. That's really important at the end of the day, because even when it comes to collaborating with other people, if you are able to keep a versioned history of your own work, then other people at least have a way of figuring out what sort of progress you have made, and whether they can actually repeat the kind of experiments that you have run. In this space there are many excellent tools; MLflow happens to be one that allows very explicit metrics tracking, code tracking, and data tracking. But at the end of the day, as an end user doing your work, what's really important is that you're just focusing on the task at hand, and you don't have to think about versioning your code every single time you execute.
Say you're familiar with Git: you still wouldn't want to do a Git commit every single time you execute your code, because if you miss that Git commit even once, you've lost that history. So it was really essential for us to build a framework that treated versioning as a first-class concern, and versioned not just the code but the data and the results as well. Now, once you have some idea about how you're going to version your experiments and your work, very often people run into the issue of needing to scale out their compute. Scaling can mean a lot of different things. Maybe you're using a data frame and all of a sudden that data frame doesn't fit on your laptop. There are multiple ways you can tackle that. Maybe you use a more efficient approach to processing those data frames on your laptop, say using data.table or dplyr. At times you just need a much bigger machine to process that data set, because even dplyr and data.table might have resource requirements that far exceed what you have on your laptop. At times you might want to, say, train a machine learning model using GPUs, and you might not have access to GPUs on your laptop. So how would you go about doing that? Unfortunately, the R community today doesn't have very many good resources to interface effectively with the cloud. If you had a way of very easily taking your compute and putting it onto a much bigger instance on, say, Amazon's cloud or Google's cloud, it would be much easier and more productive for you to do your work. In the Python universe there are a whole bunch of tools available for this, but in the R universe the tooling is still catching up.
As I said before, you can use, say, data.table or dplyr to be a bit more efficient in terms of your data processing needs, or maybe you can translate your code into sparklyr so that you can distribute your compute. But all of that, at the end of the day, requires changes to your code, and that's just not productive. If there was a very simple and easy way for you to take the code that you have and transparently execute it on a much bigger instance in the cloud, you would end up being far more productive, and that's something we made sure Metaflow is capable of doing. Now, one point here: reticulate is an amazing project that has made the Python universe very easily accessible from R, so you now have all sorts of cloud APIs that you can access through reticulate. But still, getting things set up on the cloud, making sure that you can efficiently interface with the cloud data store, launch multiple jobs, and figure out how to move results from the cloud to your laptop, all of that still requires a lot of heavy lifting. Later in this workshop, as we go through the case study, we'll walk you through how Metaflow takes care of that. Yet another concern that a lot of folks have is around getting their results reliably and on time. A lot of our code is experimental in nature, but at times it's generating really valuable results and insights for your business stakeholders, and there is often a need to make sure that your workload executes reliably, on time, on a given schedule.
Say every time your data set updates, you want to re-trigger your analysis. Or you might want to redo your analysis every Sunday night, so that when the business stakeholders walk into the office on Monday morning, they have a fresh set of results. Up until now there hasn't been an easy way for our users to schedule their code. You can definitely use cron to schedule the code on your laptop, but there could be scenarios where you want better monitoring or alerting functionality, so that when things fail you are meaningfully alerted and can take some sort of corrective action. There are a bunch of workflow orchestrators in the open source world, like Airflow, et cetera. The problem with them is that they don't provide a native R API, so you have to shoehorn your code in, and you also have to understand their Python API or their custom DSL to get your work done, which again hampers productivity. All of these issues were issues that we saw internally at Netflix as well. So it was very important for us to build something so that our end users can focus on what they really love, which is doing data science, and not have to focus so much on these infrastructure and engineering concerns. Metaflow was the result of that exercise: we take strong opinions on the lower levels of the data science stack. For example: how do we arrange data in the data warehouse? How do we schedule your compute on the cloud? How is your code actually versioned? These are the things that our data scientists would like the platform to do on their behalf.
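As an aside, the weekly schedule just described is the kind of thing people typically hack together with cron today. A minimal sketch of a crontab entry (the script path and log location are made-up placeholders):

```shell
# hypothetical crontab entry: re-run the analysis every Sunday at 23:00,
# so stakeholders have fresh results on Monday morning.
# fields: minute hour day-of-month month day-of-week (0 = Sunday)
0 23 * * 0 Rscript /home/me/analysis/weekly_report.R >> /home/me/logs/weekly_report.log 2>&1
```

Note that this gives you scheduling but none of the monitoring or alerting discussed above: if the laptop is off or the script fails at 23:00, nobody is notified.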
But when it comes to exercising their opinions on what sort of libraries they want to use, or how they want to do feature engineering, for example, that's where they want to exercise their own opinions. So we took a very human-centric approach and created this R package, which allows data scientists to just focus on their work. Now, when is Metaflow a good match? When should you be looking at Metaflow as part of your tool set? As I mentioned before, here are three questions, and if your answer is yes to any of them, Metaflow can definitely be helpful. If you have multiple collaborators on one single project, or your project is very complex with multiple moving pieces and you want to keep track of its different states, then definitely take a look at Metaflow. If you want to very easily offload your compute to the cloud, using, say, Amazon S3 as the cloud data store (essentially, any time you start pushing the infrastructure limits of your own laptop), then Metaflow can be a good match. Or if you have constraints around making sure that your results are produced in a timely manner, and you want to offload the execution of your code to a workflow orchestrator, in all of those cases Metaflow can be a good match. But of course, there are plenty of instances where you need none of these; maybe you just want rapid prototyping on one single laptop, and there are excellent projects out there that you can look at for that. So in today's session, my colleagues Jason and Brian will walk you through a hypothetical business problem: they'll take some publicly available data sets and try to predict housing prices in various neighborhoods. Through that case study, they will walk you through, live:
What would it look like to solve this problem outside the scope of Metaflow? What are some of the problems that people might run into from an architectural standpoint when they have to actually solve a business problem and make sure it runs reliably end to end? Then we'll talk about how we can introduce Metaflow into solving this business problem, and how it can be helpful at the end of the day. Now, today's session is a long one, close to two and a half hours. The key takeaway from our viewpoint is that today's session is not meant to be an exhaustive overview of Metaflow: we're not going to cover every single feature Metaflow has. The session is geared more towards getting you introduced to Metaflow, so that it's easier for you to dive into the documentation and reason about our programming model. And I want to make it clear here: Metaflow is not supposed to replace any of your existing tooling that you might already be familiar with. It's a complementary tool in your stack, meant to help you be more productive with the tools that you already love and use. And it goes without saying: please have fun. If at any point you run into any issues, we have ample support available in the Q&A as well as on our Gitter channel. Now, before I leave you, there's one thing I want to highlight. In today's session, we are going to talk a lot about interfacing with the cloud and offloading your compute onto AWS instances. We understand that, at times, it can be really difficult to get meaningful access to an AWS account. So we have created sandboxes that you can go ahead and request at metaflow.org slash sandbox by logging in with your GitHub ID, and we will provision you an isolated AWS environment with all the resources set up. This is all complimentary, paid for by Netflix.
You can test your own code and your own data sets on these sandboxes and experience all the features that Metaflow provides. If you run into any issues at any time after this workshop, we have ample documentation available at docs.metaflow.org, and the entire development team is available round the clock at chat.metaflow.org for live support. And with this, I'll pass it on to Jason for a deep dive on Metaflow.

Okay, thanks. Thanks so much to Savin for the great introduction, and great to meet you all over this virtual conference. Let me share my screen right now. Okay, deep dive. By the way, I cannot really see the Zoom chat, but I can see the Gitter channel, so if you post your questions on Gitter, I should be able to see them. So, this is meant to be a deeper dive: Brian and I are going to talk through a case study to introduce Metaflow in greater detail. Today's agenda for this deep-dive section: first of all, we're going to talk about why we really care about R at Netflix. Secondly, we want to go through a case study to introduce the motivation for Metaflow at Netflix. Then we're going to talk about how Metaflow can help with the case study, and if we have more time, we can chat more about Metaflow's additional fault-tolerance and production-ready features. So without further ado, let's start with the first section: why R. As Savin introduced earlier in the presentation, we open sourced the Python package late last year, in December 2019, and just open sourced the R package one week ago. It's on GitHub, in the Netflix Metaflow repo, so this is pretty exciting for us. Here is why we're really excited about R at Netflix. First of all, R has this very nice tidyverse package ecosystem.
You have packages for data IO, data cleaning, data wrangling, visualization, modeling, and the final presentation and communication, with R Markdown and Shiny apps. We also have a very nice data-science-oriented IDE for R, RStudio. It's really great because you have the editor, the R console, the variable explorer, visualization, help docs, and the file directory all together in the same panel. It's also very powerful to have interactive visualization with Shiny apps. And, very importantly, R has very rich libraries for statistical computing. Those libraries are cutting edge, because most of the academic statistics community likes to publish their research as R packages, on CRAN actually. So if you need something for variable selection, statistical inference, causal inference, survival modeling, nonparametric regression, or some other more advanced topics, R would really be the go-to place, because it already has very advanced, nice libraries available to you off the shelf on CRAN. Inside Netflix, we have this big experimentation platform where we run A/B tests all the time. In our XP platform, we have causal models and visualization: we use R for causal inference at Netflix, and we use Shiny apps for visualization. There's a blog post about this on our technology blog on Medium; you can check it out if you're interested. Okay, now let's go through a case study together to talk about why Metaflow. Before we start with the case study, let's download the tutorial contents. We have the tutorial wrapped up in a repo; it's actually an R package that you can download from GitHub using devtools. Let's do it together. This is my RStudio; you will just do devtools::install_github. I think it would also be great if some of the other panelists can copy-paste the command into the Zoom chat or on Gitter.
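For reference, the install command being pasted looks roughly like this (the repository path below is a placeholder, not the real repo name; use the exact one shown on the workshop slides or in the chat):

```r
# hypothetical sketch: install the workshop tutorial package from GitHub.
# "<org>/<metaflow-tutorials-repo>" is a placeholder repository path.
install.packages("devtools")  # only if devtools is not installed yet
devtools::install_github("<org>/<metaflow-tutorials-repo>", dependencies = TRUE)
```

Installing with `dependencies = TRUE` pulls in the packages the tutorials need, which is why the install takes a moment.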
That way it's easier for everyone to just copy-paste. This is our useR! 2020 Metaflow tutorial, with dependencies equals true, so we're going to install it. Just one moment. It's telling me my data.table dependency is outdated; I'm not updating it for now, just to save some time. So you see we have this new package, and you can call pull tutorials. This command will pull the tutorial content into the current working directory. Okay, so you see I'm getting this new folder. We have five episodes, from zero to four, in our tutorial. Episode zero is a baseline R project using vanilla R, and we'll first go through episode zero in the case study. Let's go back to the slides. So, once we have pulled the tutorials into our current working directory, we can check out the structure of the tutorial folder. Basically, as I said, episode zero is the case study with vanilla R, and inside the case study we also do a preview of Metaflow: previews of episodes one, two, three, and four. Later on in this workshop, we'll go into episodes one, two, three, and four, and with each of these episodes we're going to introduce some of the key features of Metaflow. We'll do this in a hands-on fashion: we'll basically try to modify the script together, from episode zero to one to two to three to four, introducing the features as we go. But don't worry if you can't really follow along. If you can follow, that's great, and if you have questions, feel free to type them in the chat or on Gitter. If you can't follow, that's fine too: you can just go to each of the episodes and run the scripts there, because these episodes are like milestones or checkpoints of our entire editing process. So you can go directly to each of the episodes and try it out. And now, let's check that our tutorials work fine.
Let's do this in RStudio. I'll go into this folder, episode zero, and now let's just run the R script. Let's take a brief look at this script: it sources the modeling script and the data wrangling script inside this folder, and it runs the computation step by step. I'll come back to this later; we're running it just as a sanity check that we have installed the tutorial properly. So, today we're going to talk about a case study on housing price prediction. The data set looks like this: we have a CSV file of historical house transactions in the Seattle area from 2014 and 2015, and we want to predict the price. The raw features we're going to use are some of the attributes of the house: for example, number of bedrooms, number of bathrooms, square feet of living space, square feet of lot, number of floors, waterfront, and some other attributes. The raw data is in the data folder, raw house data.csv. This is the raw data, and it looks like this: we're going to use these columns to predict this column. Our case study has four steps. In the first step, we're going to build a first model with baseline vanilla R. Then we'll iterate on the features and models. Then we'll talk about how to scale out the parameter search, and finally we'll talk about the process for sharing results. I want to mention that all four steps are in vanilla R; we don't actually use Metaflow yet. This is episode zero. The goal of going through these four steps is to introduce some of the problems that we want to solve with Metaflow: for example, experiment tracking and version control; very importantly, scaling out to AWS, our cloud integration; and data management and reproducibility. And then, after each step, I will do a quick preview of the Metaflow solution to each of the problems.
After the case study, we'll come back to an introduction of Metaflow, and talk about how each of the Metaflow features tackles some of these pain points. This is how Metaflow thinks about the problems mentioned here. You can see we have this stack: on the upper levels, we have model development and feature engineering, and on the lower levels we have the data warehouse, compute resources, orchestration, job scheduling, architecture, and versioning. We have observed that data scientists care more about the upper levels of the stack and less about the lower ones, but there's a lot of infrastructure needed for those lower levels, especially if you want a robust cloud-native solution. If you want to orchestrate and run things reliably on AWS, there's a lot of infrastructure work needed at the lower levels. So Metaflow focuses on the lower levels, and by doing that, we allow data scientists to iterate very quickly on the top levels, mostly the top two. And this is how we think about these problems for a typical data science life cycle. Okay, let's start with step one. In step one, we're building the first model with vanilla R; this is just baseline R. R is really great at data wrangling and experimentation. Basically, we have three scripts here. For the house price data, you call this function to create house price.csv: this script does a little bit of data cleaning; it reads from the raw data, cleans it, and saves the data here. Then we compute features, and based on the features, we build models and save the model locally. So this is how we build the model. Let's do it in RStudio. You see we are in this folder, in the right place, episode zero. Let's go through episode zero and make sure that everything looks good. Let's source all the scripts. We need compute features.
We first need another one, which is pull data, I think. Okay, so let's first call the pull house data function, and it's going to pull this data file. This is what the data looks like after data cleaning. Let's briefly check out these scripts: we're reading from the raw data, and a little bit of data cleaning goes on. For example, we're making zip code a character instead of an integer, and we're throwing out some columns as irrelevant. This is just very simple data cleaning. Then we compute features from this. So now we have some feature files, and I want to briefly talk about the features we're computing. First of all, we want to combine the number of bedrooms and the number of bathrooms; that's why we're taking the cross product of these two columns, and also of condition and living. Another type of feature is a parameterized feature. The idea is that we want to look at the average square feet of living space per room, but we have bedrooms and bathrooms, and we know that a bathroom is a little bit smaller; we're just not sure how much weight we should put on it. That's why we have three features with different parameters, and we're going to come back and tune these parameters later. And we have three other additional features. So these are the feature files. You see I'm saving the feature file here with write.csv, just to make sure that even if my RStudio crashes with some bug, I won't lose the intermediate results; I can just restart RStudio, load the CSV, and continue my experiment. Then let's build some models. So it's building a model right now; let's take a look at the script. It's pretty simple: this is X, and this is Y, because our prediction target is price. We're using all the columns that are not price as the attributes, and price is the target. So we're training the model, calling this train GBM model function from the model script.
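To make the feature ideas concrete, here's a small self-contained sketch. The column names follow the data set described above, but the exact weights and the reading of "cross product" as a simple interaction term are my assumptions, not the tutorial's actual code:

```r
# toy rows mimicking the Seattle housing columns described above
houses <- data.frame(
  bedrooms    = c(3, 2, 4),
  bathrooms   = c(2, 1, 3),
  sqft_living = c(1800, 950, 2600)
)

# "cross product" feature: interaction of bedrooms and bathrooms
# (assumed interpretation of the combined-rooms feature)
houses$bed_x_bath <- houses$bedrooms * houses$bathrooms

# parameterized feature: average living area per room, where a bathroom
# counts as a fraction w of a bedroom; several weights are tried because
# the right discount is unknown up front
for (w in c(0.25, 0.5, 0.75)) {
  col <- sprintf("sqft_per_room_w%.2f", w)
  houses[[col]] <- houses$sqft_living / (houses$bedrooms + w * houses$bathrooms)
}

# persist the feature set so a crashed session doesn't lose it
write.csv(houses, "features_v1.csv", row.names = FALSE)
```

Writing the three weighted variants as separate columns is what makes the later parameter tuning step possible without recomputing everything.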
We're printing a summary of the model, and we're saving the model locally to this folder, saved models. You see the name of the model: we're actually using n trees = 100 and learning rate = 0.01 as part of the file name, because our model building process has parameters as well. We want to make sure that when we're looking at the local model file, we know exactly what parameters we used to train this model. And you can see in this model script that the number of trees and the learning rate are arguments set here, with some default values. When I was building the model previously, I didn't set any default values, but we can tune these parameters later on. So this is great: this is vanilla R, nothing Metaflow yet. I just want to pause here and see if people have questions. Okay, I can't really see the Zoom chat, and there's nothing going on in Gitter, so I'm going to just move on. The second step is about iterating on our features and models. The idea is that, as you saw before, we have parameters for the features and parameters for the models, and sometimes we want to try different versions of the features and different versions of the models. Sometimes they're just parameter changes, and sometimes you want to introduce new ideas into the features, or you want to introduce a new model. Just to simplify the case study, I'm only tuning the parameters in the different scripts. The problem here is that when we're tuning a parameter, we end up with different versions of the compute features and build model scripts, we have different intermediate results files, and we need to save different file names in the saved model directory. Let's quickly do that in RStudio. In compute features, let's say I want a different parameter here, and now I'll make this features V2. I had the previous one; I used that one as V1.
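The parameter-in-the-file-name convention described here can be sketched like this. The helper function name and directory are illustrative, and a plain lm() on a built-in data set stands in for the real GBM model:

```r
# hypothetical helper: encode hyperparameters into the saved model's file
# name, so the file is self-describing when you browse the directory later
save_model <- function(model, n_trees, learning_rate, dir = "saved_models") {
  dir.create(dir, showWarnings = FALSE)
  path <- file.path(dir, sprintf("gbm_n%d_lr%g.rds", n_trees, learning_rate))
  saveRDS(model, path)
  path
}

# stand-in model: lm() on the built-in cars data instead of a real GBM
fit  <- lm(dist ~ speed, data = cars)
path <- save_model(fit, n_trees = 100, learning_rate = 0.01)
print(path)  # "saved_models/gbm_n100_lr0.01.rds"
```

The obvious weakness, which the next step runs into, is that nothing stops two models trained on different feature sets from producing files with the same name.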
So I'm now creating a different feature set, features V2, with compute features. And now I'm building models. I want to try a different parameter for the model as well, so I need to change this file name to make sure I don't overwrite the previous model file, because I'm building the model with a different parameter, and we're also using a different feature set. So I need to change this to features V2, and let me save this as build model features V2. As you can see, every time I try something new, either new feature ideas or new model ideas, I have to somehow change the file names and the variable names in my RStudio session, as well as the names of the intermediate data sets, so that I know exactly which data corresponds to which idea I tried previously. And that's kind of annoying. It's even more problematic than that: if, in the new version of compute features, you try something new but forget to change the variable name, or forget to change the actual local file name, you accidentally overwrite the previous results. So you've overwritten them, and then if you notice something was wrong, but you're not sure what, you change the file name and rebuild the models. At this point, your features V1 file is actually out of sync with version one of the script. This is a pretty big problem, because it's going to create a lot of bugs and annoyances when you are prototyping. Overwriting is particularly annoying when you hit a bug or an error in build model. For example, if your new feature idea is not that great, and it introduces, say, highly correlated features, then your model building might crash.
And at that point, if you want to switch back to the previous feature set, you're probably a little hesitant, because you're not sure if features_v1 is still in sync with version one of the compute_features script. So what you end up having to do is recompute the whole thing from scratch, so you know everything is in sync. Recomputing from scratch is pretty annoying, especially when you have multiple stages and you hit an error in the final stage. So those are the problems in iterating on features and models. To deal with them, people often have to track model and feature versions in a spreadsheet. MLflow offers something like this, which is pretty nice: for every run, it can automatically log the versions into a table. If people don't use MLflow, they have to do it manually, and that whole process is pretty time consuming. Drake also offers a very nice solution: it converts the whole workflow into a DAG, and Drake takes care of the versioning. If you hit a bug in the final stage, say in the report step, Drake can help you resume using the data created in the previous two stages. Drake takes care of the data management and makes sure everything is up to date. So we have some nice tools in our community right now to tackle these problems. Metaflow does this as well, so if you're familiar with Drake, the concepts here should feel familiar. This is how we actually construct the DAG in Metaflow. Don't worry about all the code here; it's actually pretty simple. These are just a bunch of steps. In each step, you specify the step name, the name of the next step, and, in the middle, the R function that you want to execute for that step. So this is the Metaflow DAG.
So in the first step you execute the first function, in the second step the second function, and so on. The interesting thing about Metaflow is that, similar to Drake, we capture the code and data together in each of the steps for each run. Every time you run the script, we create a new data object for each step, and inside that object you have the code and everything created inside the R function. For every new run, we refresh this. This is how Metaflow helps with model tracking and data management; I'll come back to it in the next section. This was only a quick preview of how Metaflow can help. Okay, do we have questions? I don't see anything on Gitter, so let me move on for now. Step three is scaling out the parameter search. As you've seen, we have parameters for the models and the features, and sometimes these parameters are really important. For example, the learning rate of a GBM model is really critical to how the model performs. So people like to do parameter search all the time. And if your model is kind of big, or your data set is big, and you cannot run a lot of parameter searches in parallel on your laptop, what people really like to do is parallelize this on a managed cluster, for example Slurm if they're not using AWS. So what people do is write a script to submit parallel jobs to a managed cluster from their laptop. But this approach is kind of annoying; it involves a lot of DevOps work, because you need to think about how to copy the feature sets or the raw data to the remote machines. These remote machines are not yours to begin with: they get allocated to you by the central cluster, and when they're allocated, each one is just a fresh machine with nothing on it.
You have to somehow figure out a way to copy the project and the data onto the machine, and the data is sometimes very big, so the copying can take a long time. If you're lucky, your cluster has a network file system, and in that case the cluster can directly read from your local directory, which is nice. And after your training finishes, you need to think about where to write the training results, because you may have a hundred instances working, and when training finishes, you need each of them to write its results, for example the saved model, back to somewhere. The problem is you somehow need to manage the write path for each instance and each experiment. It's very important to avoid overwriting past results, and to make sure the instances don't overwrite each other. This is also a lot of DevOps work, and it's error-prone as well. You also have to somehow keep the modeling scripts and the CPU/RAM resource-requirement scripts in sync. The modeling script is your R functions; the requirements script is how you submit work to the managed cluster. You may need to tune settings for the cluster depending on how long the queue is, for example the max wait time, but it's not easy to keep the modeling script and these requirements in sync, and it's also not easy to know how long the wait time on the cluster is going to be. And if your flow is mission critical, for example if you're deploying it in production and your company's customers rely on this model, then you need to make sure the model gets trained reliably every week. You don't want to get stuck in a scenario where your model is waiting in the queue for a long time and some random exception occurs that you cannot handle, for example an internet connection timeout.
So if you waited too long, the question is: how do you programmatically handle this kind of platform exception, such as timeouts? This is also pretty non-trivial if your model is deployed in production. All of this is just DevOps, a lot of model operations and development operations. None of it is data science, but we just want to focus on data science, and DevOps takes too much of our time. Here is a quick preview of how Metaflow can help. If you still remember how we specify a step in our flow, we have the step name, the name of the next step, and an R function. So it's very easy to scale things to the cloud: we only need to add a decorator here saying we're running in an AWS Batch compute environment and we need four CPUs and eight gigs of RAM. You just add the decorator inside your current script; there's no need to change your R function, and no need to change anything elsewhere. And then in the previous step, compute_features, you specify a foreach variable, which takes a list of parameters that you want to search over. When you run this, we will actually start five instances on AWS, each running the same build-GBM-model code, and we will assign the parameters across the instances, so they'll be building models with different parameters. When they finish, the results can be read from the next step. The point I want to make here is that it's really easy in Metaflow to scale any local flow into a cloud flow. We're going to demo this in maybe ten minutes. It's a pretty exciting feature for us; I think this is probably one of the best features of Metaflow, because it literally turns my laptop into a supercomputer.
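The fan-out just described might look roughly like this in Metaflow's R syntax. Treat this as a sketch: the flow name, step names, parameter list, and functions are all illustrative, and the exact `foreach`/`decorator` argument forms should be checked against the metaflow R docs.

```r
# A sketch of a parameter search fanned out to AWS Batch.
# Names and values here are illustrative, not the workshop's exact code.
library(metaflow)

metaflow("ParameterSearchFlow") %>%
  step(step = "start",
       r_function = function(self) {
         # the list of learning rates to search over
         self$rates <- c(0.001, 0.005, 0.01, 0.05, 0.1)
       },
       next_step = "build_model",
       foreach = "rates") %>%                         # fan out: one task per rate
  step(step = "build_model",
       decorator("batch", cpu = 4, memory = 8192),    # each task runs on AWS Batch
       r_function = function(self) {
         self$rate <- self$input                      # the rate assigned to this task
       },
       next_step = "join") %>%
  step(step = "join",
       r_function = function(self, inputs) {
         # gather results from the parallel tasks
         self$rates_tried <- sapply(inputs, function(i) i$rate)
       },
       next_step = "end",
       join = TRUE) %>%
  step(step = "end") %>%
  run()
```

Note the join step takes `self` and `inputs`, which is the exception to the "just `self`" rule mentioned later in the session.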
For me, the development experience feels the same as running locally. I just run the same script, and as you'll see later, the results print out in your console in RStudio, but the code is actually running on AWS, on whatever machine you specify. If your data set is really big, you can specify up to two terabytes of RAM on AWS, which is the biggest instance AWS has right now, and hundreds of CPUs for each of the instances. And you can have hundreds of instances running at the same time, and it feels like it's all just running on your laptop. That kind of experience is really great; I'll come back to this later. So, after you're done with the parameter search, we're back to the sharing section. Say we have a great model here and we want to share the results with our colleagues. First of all, we wonder what the right way is to share results with colleagues just for simple inspection; for example, a colleague wants to inspect my features and final model files. How do I share them? My features can be quite big, hundreds of gigabytes, so I cannot share them via Dropbox. And even if, for example, I share them through AWS S3, I want to keep my colleagues updated on my work: I'm updating the features all the time, and I want them to have access to my most up-to-date features. It's not clear how I share them. I cannot just give colleagues access to my machine and folders, because when they're inspecting my local files, they may accidentally change something, and it's also quite inconvenient for them to log into another machine.
Also, when my colleague writes an inspection notebook, for example some visualization, and makes a point about how one of my features looks off, maybe a heavy-tailed distribution indicating a bad feature, my colleague wants to share that notebook back with me. The question is, how? If he just sends me the notebook, it runs on my colleague's computer, not on mine, because the notebook is reading data from a local file directory that exists on my colleague's laptop but not on mine. So it's not easy for my colleague to share a notebook back with me in a way that I can reproduce and run on my laptop. This process is pretty non-trivial, I would say. And if my colleagues need to reproduce the whole workflow, I need to write instructions on how to run each script, in a certain order, with different arguments. That by itself is not easy, because you also need to document the dependencies and the environment, and if you update the scripts, you need to update the instructions as well. All of this is pretty time consuming and error-prone. Here is a preview of how Metaflow can help. Metaflow actually maintains a global data store of all past runs for all teammates. Instead of sharing results with each other directly, all Metaflow users write their data and code to a global data store. If you just download Metaflow and run locally, in solo mode, the data store lives on your laptop. If you're collaborating with colleagues, we have a guide to set up this global data store on AWS. As long as we all have access to that global data store, every teammate writes data and code to it on AWS. And each user has their own namespace.
When you execute a run, the run produces data and code, and they live in your own namespace. You won't accidentally overwrite a teammate's work, so you can feel safe that everything that happens in your namespace is properly isolated, teammate from teammate. And if you want to inspect a teammate's results, you just need to switch your namespace. Metaflow provides a way to explicitly switch namespaces: you can switch to your colleague's namespace and then inspect their previous results in an immutable manner. You can read data from past runs, but you won't be able to modify it; we save everything on AWS S3 immutably. I'll quickly stop here and see if we have questions. Looking at Gitter, because I cannot see chats when I'm in presentation mode... okay, nothing happening on Gitter, so let me just move on. So this is our case study coming full circle. Again, these are the problems we want to solve with Metaflow: experiment tracking, version control, scaling out to AWS, data management, and reproducibility. All of these are just DevOps and model operations, making the whole thing production-ready. But as data scientists, we only want to focus on the data science part; the other things are not really something we want to spend time on, and it's pretty frustrating to get stuck on all this non-data-science stuff. I just want to come back to our philosophy: Metaflow wants to take care of the lower levels of the stack, so that data scientists can move and iterate very quickly on model development and feature engineering. And this is what time to first production looks like for data scientists at Netflix.
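Inspecting a colleague's results might look something like the sketch below. The flow name and artifact name are placeholders, and the client calls (`set_namespace`, `flow_client`) are written from memory of the metaflow R client; verify the exact signatures against docs.metaflow.org/v/r before relying on them.

```r
# A sketch (assumed API, illustrative names) of read-only inspection of
# a teammate's past runs by switching namespaces.
library(metaflow)

# Switch into the colleague's namespace; their runs become visible,
# but everything read this way is immutable.
set_namespace("user:colleague")

# Look up their latest run of a (hypothetical) flow and read the model
# artifact its build_model step produced.
flow  <- flow_client$new("HousePriceFlow")
run   <- flow$run(flow$latest_run)
model <- run$artifact("model")
```

The key property is the one stated in the talk: namespaces isolate teammates' work, and everything in the global data store is read-only once written.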
With Metaflow, we're able to get a data science project from prototyping to first production within days, not months. In reality, as a company, it can take a very long time to get a data science project to production, because of all the hassles we mentioned above, and we didn't even mention how to make things fault tolerant and reliable in the production environment. That process is really time consuming as well, which is why some projects still take very long. But with Metaflow, the goal is to make time to first production days, not months. Okay, that's it for our case study. In the next section, we're going to talk about how Metaflow can help. This is also the end of episode zero; from here on, we start with episodes one, two, three, and four. If you check out the code in these episodes, we have everything converted into Metaflow. I'll do this conversion on the fly in a demo with you, but don't worry if you cannot follow: I'll basically convert episode zero into episode one to show you how easy it is to turn a vanilla workflow into a Metaflow flow. If you cannot follow along, for example because you have to watch Zoom and use RStudio at the same time, which I know is not easy, just go into the episode-one script and source the file; you should be able to run everything in Metaflow smoothly. So, let's go into episode one. You see, this is our previous workflow in vanilla R, and you can find the scripts in the workshop content folder, episode one. We want to convert this into the episode-one scripts. Let's do it together. The first step is to chain everything together into a DAG. I showed this previously; let's just do it together. Let me first copy this into a folder I'll call workspace, because I don't want to tamper with the original files. Okay, so I have the workspace here.
Everything here, you see, is not yet Metaflow. Let me delete this and start fresh with Metaflow. Let me make the font bigger. First, I'll do library(metaflow). Then I'll call metaflow() followed by the name of the flow. The name of the flow is very important, because this is how you distinguish different projects in our global data store. Then we have the pipe operator, which we use to chain steps together. For the first step, let's call it the start step. The start step doesn't do anything; we just use it as a transition. The next step will be pull_data. In pull_data, we need to specify the R function that we want to run with this step, which will be pull_house_data. We'll make some small modifications to this function later on, but this is the function we want to use. Then the next step will be compute_features. By the way, before we do this, we want to make sure these R functions are visible in this R session, so let me source all the files we need, since we're referring to these functions. The rest is the same: just chain all the steps together. To save time, let me copy run.R from episode one back into the workspace. We were at this step, but just to save time I'm copy-pasting; the idea is the same. You add the two remaining steps here, and you add the end step. And very importantly, at the final step, you have to call run(). By doing this, when you source this file, you actually run the flow.
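The chained flow being built in the demo would look roughly like this. The flow name is a placeholder, and the sourced file names are illustrative; the step names and functions follow the ones named in the talk.

```r
# A sketch of the converted run.R: steps chained into a DAG with the
# pipe operator, ending with run(). File names are illustrative.
library(metaflow)

source("pull_data.R")        # defines pull_house_data
source("compute_features.R") # defines compute_features
source("build_model.R")      # defines build_gbm_model
source("summarize_model.R")  # defines summarize_model

metaflow("HousePriceFlow") %>%
  step(step = "start",                    # transition only, does nothing
       next_step = "pull_data") %>%
  step(step = "pull_data",
       r_function = pull_house_data,
       next_step = "compute_features") %>%
  step(step = "compute_features",
       r_function = compute_features,
       next_step = "build_model") %>%
  step(step = "build_model",
       r_function = build_gbm_model,
       next_step = "end") %>%
  step(step = "end",
       r_function = summarize_model) %>%
  run()                                   # sourcing this file executes the flow
```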
The script is not yet runnable, though, because we need some final small touches on the R functions. Let's take a look at one, pull_house_data. So we've done this part. Each step, when it runs, runs in an isolated environment. A step is a special concept in Metaflow: the flow runs in steps, and in each step we run tasks. Each step runs in an isolated environment; it can be a local process or a remote execution, and it's fine to have some steps running locally and some steps running on AWS. They work together very easily. That's the idea of a step: it's basically about isolation, and you also need some modularization of your project to fit everything into this DAG. So now we're modifying the R functions. The first change is to add the self argument; all Metaflow functions have to take a self argument. We'll see how this is used later on. Let me just add it here. I think this example is build_model; it's called build_gbm_model. Instead of passing in the data frame as before, I'm passing in self. Self is the flow object itself. What else do I need to do? I need to make sure this function is fully functional, that it can run by itself, so it has to take care of its own dependencies. I need to source some files and load dependencies: I need models.R, and I guess I need utils.R, and also load_dependencies. The reason we have to do this is that sometimes we need to run this R function on the cloud, and when we're running on the cloud, the function has to take care of all its dependencies, because we're using train_gbm_model, which is implemented in that script.
And we're using, for example, the data.table package, so we need to import the library in load_dependencies. So basically the R function has to take care of its dependencies by itself; this is a core concept of Metaflow. Then we have a concept called the Metaflow artifact. The basic idea is that all the data created in each step is saved as immutable data objects, and we call these saved data objects Metaflow artifacts. For example, we're training a model, and this function returns the trained model; we save that model data as a Metaflow artifact. The way to create a Metaflow artifact is the self$variable syntax: you just assign to it, and you'll be saving the model as an immutable data artifact in the AWS global data store, or in your local data store. The way to read data is also very simple. The features were created in the previous step, and we read them by just calling self$features. This kind of data reading and writing works across local and remote execution; we'll come back to that later. The idea is that even if we run one step on the cloud and the next one locally, this still works. For example, your model data is created as a Metaflow artifact on AWS Batch, and you can load the model back in your next step, which runs locally on your laptop. This is really nice, and it's possible because we have a global data store on AWS S3. So that's the idea of Metaflow artifacts. Let's convert the previous script to use Metaflow artifacts.
Previously we were passing a data frame in here, but now we don't have it in the function arguments; we have the flow object instead. So we'll just fetch the features created by the previous step as a data table. This part is the same; we've already taken care of it. Because this model is something really important that we want to keep for the next steps, we save it by creating a Metaflow artifact: I assign the fitted GBM model to a self$ variable. And the best thing is that we don't need to worry about file names anymore, because the model is saved as a Metaflow artifact; there's no need to save data locally. It will also be very clear to us what features and what parameters we used for this model, because the data and the code that generated the data are captured together in the step. When you're looking at the model, you can simply inspect the code to check what parameters were used to generate this data. That's why we don't need file names to capture the parameters or the version of the script we used, which is a really powerful feature of Metaflow; it's part of the data management we do. I'm deleting this, and I'm also deleting this, because I'm summarizing in the next step. There's also no need to return the model anymore. So this is how you convert a non-Metaflow script into a Metaflow script: three easy steps. First, add self to the function arguments. Second, take care of the dependencies. Third, take care of the data I/O using Metaflow artifacts. So we're done with the first script. To save time, I'm just going to copy and paste the other scripts.
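Putting the three conversion steps together, the rewritten build_gbm_model might look like this. The file and helper names (models.R, utils.R, train_gbm_model) mirror the ones mentioned in the demo, and the parameter artifacts (self$n_trees, self$learning_rate) are illustrative assumptions.

```r
# A sketch of build_gbm_model after the conversion: (1) it takes self,
# (2) it sources its own dependencies, (3) it does data I/O through
# Metaflow artifacts. Names are illustrative.
build_gbm_model <- function(self) {                 # (1) self is the flow object
  # (2) each step may run in an isolated process, possibly on the cloud,
  # so the function must load its own dependencies
  source("models.R")   # defines train_gbm_model
  source("utils.R")
  load_dependencies()  # e.g. library(data.table)

  # (3) read the features artifact written by the previous step
  features <- as.data.table(self$features)

  fitted <- train_gbm_model(features,
                            n_trees = self$n_trees,
                            learning_rate = self$learning_rate)

  # (3) save the model as an immutable Metaflow artifact; no local file
  # name and no return value are needed
  self$model <- fitted
}
```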
I'm going to copy the other scripts here into the folder, and the process is exactly the same. For example, summarize_model: we source the dependencies, and instead of reading directly from the function argument, we read from the data artifact created by a previous step. You apply exactly the same steps to each of these functions. To save some time, I will simply copy, because after the final editing the workspace should be identical to episode one; we just need to convert the remaining R functions to Metaflow. So I'm going to delete this and copy episode one into the workspace. This workspace is now exactly what it looks like after we're done editing the R functions. Now let's run Metaflow end to end. Set the working directory to workspace. Exciting, it's the first time we're running Metaflow. You can see the versions of Metaflow R, 2.0, and the underlying Python package, also 2.0; in the future the version numbers may diverge. Metaflow R is built on top of our Python package, which is really battle-tested: it has been used widely inside Netflix, and it has been open source for almost a year now. Our R package is built on top of it, so we can enjoy the reliability of that very solid Python package; the R package is really just a binding for the Python package. You can see that when you run it, we print that Metaflow is executing the flow for your user; this is your namespace. In some very rare cases, Metaflow may complain that it cannot find the username. You just need to set the USER environment variable, in R or in your bash, and Metaflow will be able to find the proper username to use to execute your flow.
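For the rare username complaint just mentioned, the fix looks like this; "alice" is a placeholder username.

```r
# If Metaflow cannot determine your username, set the USER environment
# variable before running the flow ("alice" is a placeholder).
Sys.setenv(USER = "alice")
Sys.getenv("USER")   # "alice"

# Or equivalently in bash, before launching R:
#   export USER=alice
```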
We also cover this in our docs Q&A. While it's still running, you see that we have a few stages. The start step is running; the PID is the ID of the process. As I mentioned before, each R function runs in an isolated process, so you don't need to worry about one process interfering with another; they are totally isolated. After the first step, you run the second step, pull_house_data, then compute_features, the fourth step, and then the end step, which summarizes the model. The printing format is kind of screwed up, but if I make the font smaller, it looks much nicer; I made the font bigger for presentation mode, which is why it's messed up. So, great: we have the first Metaflow flow running end to end. My point is that we only need very small changes to convert a current R project into a Metaflow flow. Just to recap, what you need to do is, first of all, construct the DAG using the step syntax and the pipe operator to chain everything together, and then modify the R functions. To modify the R functions, there are three things you need to do. First, change the arguments to self. If it's a join step, which I'll talk about later, it should be self and inputs; but most often, if it's a linear flow, just use self. Second, take care of the dependencies, so that each R function can run independently in its own R process. Third, take care of the data I/O using Metaflow artifacts: write the data you think is important for the next steps into self$ variables, and read data from the self$ variables created in the previous steps. That's it; you should be able to convert your current project into a Metaflow flow.
Running the flow end to end looks like what we just did; it works. And here comes the exciting part: we want to run part of this on AWS Batch. Before that, to make this fun for those of you following along with the demo, let's go through the configuration process and configure the AWS sandbox provided by Netflix. The sandbox runs on AWS and Netflix pays the bill, so you don't need to worry about it. It's an experimental environment where you can run this flow on AWS and feel the power of turning your laptop into a supercomputer. The way to do that is very easy. The font must be too small for you; let me make it bigger. So: configure the sandbox. I already have an existing profile, an existing sandbox configuration; let me just override it. Now we need a magic string, and we have it ready for you. Savin, can you help us post the magic string on Gitter, and maybe in the chat as well? I can paste it here and reconfigure. Okay, great, thank you, Savin. I see the magic string now; I just need to copy and paste it here. And we've configured successfully. Now, just to make sure we're running on the cloud, in RStudio let's ask Metaflow for its metadata. You see, the metadata provider, which points at the global data store I was referring to, is already on AWS, in us-east. Okay, so we're running on the cloud. I want to mention that it's super easy to take any Metaflow flow that works locally and turn it into a flow that runs on the cloud, or partially on the cloud. For example, if the build_model step is really computationally expensive, you may want to run just that step on the cloud.
What you need to do is simply add the decorator. Let me do it together with you: in the workspace, I go to run.R, and I add a batch decorator with four CPUs and 8 GB of memory. That's it; let's run this. The first three steps are the same as in the previous demo, running locally. The interesting thing is that the fourth step runs on the cloud, and you don't need to worry about copying data to the cloud, because the first three steps already wrote their Metaflow artifacts to the global data store, AWS S3. So when the fourth step runs on the cloud and reads Metaflow artifacts, we automatically figure out where the data should be read from, and the data is also in sync with the current code. You don't need to worry about the data management, and you don't need to worry about the specific AWS APIs you would need to read data from different places. If you did this yourself, you would have to configure the permissions for the AWS bucket, which is really painful; we take care of all that infrastructure DevOps work. And you see, things are getting interesting: the build_model stage is running on the cloud. The status is submitted, then it becomes runnable, and then it's starting. Runnable means it's waiting in the queue. We're not requesting a big box, just a small one, and it turned from runnable into starting in about five seconds, which is okay. I want to mention that if you're requesting a big box, it can take AWS anywhere from a few minutes to ten minutes to allocate it to you. That wait makes sense if your script is actually going to run for, say, three hours: locally it might take a few days, but on a big box on AWS it takes three hours.
In that case, it makes sense to wait for the instance to start. If your script actually finishes in five or ten minutes, maybe it's not really worth the wait, and you can figure out a better way, maybe get a big instance on AWS and turn that into your workstation, so you don't have to go through the trouble of submitting and waiting. So it's a trade-off. We're waiting here, but the benefit is that we're not locked into any instance: we can easily tune up and down what kind of instances we need. This is great for prototyping, because in prototyping we don't know exactly how big the instance will need to be; as we add more features and try more complicated models, our project gets more demanding and we want to upgrade to bigger and bigger instances. If you commit to one big box, you can't easily upgrade to a bigger one, and when you do upgrade, you have to copy the whole project structure and everything onto the bigger instance, which is another unpleasant experience. Very importantly, it also costs a lot of money to keep a big box running all the time. Our approach is so-called serverless computing: you don't have a big box running all the time, you only use one when you need it, and the instances get terminated automatically after your program finishes. Literally, you don't have a server running all the time, which is why it's called serverless computing. And you can see: we waited a bit for the instance to start, and then the model trained successfully on the cloud, and we're printing some of the attributes of the model back locally, which is really nice.
I'll come back to this later, but it means that you can inspect, locally in your notebook or in RStudio, the data created on the cloud, which is a really powerful feature. Okay, let me come back to the slides. Yeah, in my demo it actually took longer to start — I was unlucky and there was a very long queue, and we're experimenting with different things, but in our experiments it's usually pretty quick. So, great: we are running part of our flow on the cloud with very minimal friction. There's no need to change your functions — you can see that we didn't change anything in our function; the only thing we changed was to add this decorator here. And you can change it to anything else. Very importantly, this infrastructure requirement is version-controlled together with everything else. This is called infrastructure as code, and it's a very nice, easy way for your colleagues to reproduce your work, not only at the business-logic level but also at the infrastructure level — because by doing this, you don't have to write a README instructing people that they need this kind of instance to reproduce the project. Okay, I want to pause here and see if we have questions. Please post questions here. I see a question about installing Metaflow. For this one, you just need to run the Metaflow install with user = FALSE. And if you run into any other trouble with the installation, check out our docs: go to docs.metaflow.org, then v/r, and click "Installing Metaflow." At the bottom there's a troubleshooting section you can check for each kind of error. For the screenshot we got on Gitter — the error "cannot perform a --user install" — just do this and it should be fine. If not, please let us know. Yep.
So let me just go back to the presentation. Okay, I think I'm back in presentation mode. Oh yes, someone made a really great point: the sandbox is shared among everyone, so don't use any sensitive data, and pick your own username. All of your experimental work stays in your own namespace, so pick a namespace that's unique to you. Okay, let's move on to the next feature in episode one. I want to demo a very powerful feature called resume. This is a very similar idea to drake — drake has this as well. The idea is that, because we are capturing the code and the data in each step, and artifacts are created for each one, we are able to rerun the flow from any intermediate step rather than from scratch. So if you want to resume the flow from build model — for example, if an exception happened in build model — you can resume from build model without recomputing from the start, and we will automatically reuse the results precomputed in the first two steps and start directly at the build model step. Another situation where you'd want to resume is when you're making changes: there's no exception, but you just made a change to build model and want to rerun the flow. At that point, since you know you didn't change anything in the previous two steps, it doesn't make sense to run the whole thing from the start — you can just resume from this step. We have a feature called resume for exactly this. In Metaflow, you can think of steps as managed checkpoints, and managed checkpoints are easier than doing the checkpointing yourself.
If you're doing manual checkpoints, you have to take care of the file names; you have to, for example, create a different folder for each experiment, and within each experiment create a folder for each step, for each R function, and make sure the file names don't overwrite each other. So managing checkpoints yourself is not that easy to get right. With Metaflow, you get managed checkpoints automatically, because for each step we are persisting the code and the data. Let's try it — resume is very easy to use, and there are two ways to do it. You can just do resume = TRUE; in that case we'll try to figure out the last step where you had an exception and resume from that failing step. Or, if you just want to resume from the build model step — in this case I didn't have any error, I just made some changes and want to rerun from that step — you can do that too. If you remember, we had this run() call at the end of the flow declaration; you just add the argument resume = TRUE inside run(), and that should work. Let me get back to RStudio. So, I'm in my workspace. Okay, this is run. I'll just do resume — let's resume from build model. To make it run quicker, I'm going to delete the batch decorator for now, so we don't have to wait for instances to start; running locally just makes things faster for demoing resume. Source this file, and some interesting things are going to happen. Let's take a look together. You see, this is different: it says "gathering required information to resume run." In the backend, we're comparing the current flow with the previous flow.
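As a sketch of the two resume variants described here — the exact argument shape may differ slightly from the released API, so check docs.metaflow.org/v/r before relying on it:

```r
# Variant 1: resume from the step that errored in the last run.
# The run() call at the end of the flow declaration gains one argument:
flow %>% run(resume = TRUE)

# Variant 2 (signature assumed): resume from a named step when there was
# no error but you changed that step; everything upstream is cloned from
# the previous run instead of recomputed.
flow %>% run(resume = "build_model")
```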
We're figuring out which parts we need to copy from the previous run and which step we actually need to resume from. Then you see "cloning results" for the start step of our previous run — it's not doing any recomputation of the start step. It's cloning the results of pull data as well, not recomputing that step, and cloning the results of the third step, build features. And then we actually start building the model. So that's how resume works. Wait a moment — it's going to take some time to build the model, because it's a GBM model with hundreds of trees. It will look like this: you'll see the task finish, and it will print the results as well. The steps after the resumed step still run: we figure out which steps come after the resume point, and if those steps depend on the resumed step, we won't clone their results — we'll rerun them, because we assume something changed in that step. Just check back... yes, same as the screenshots: we're running this computation again, and it's done. So that's resume. If you have an exception in build model, you can do the same thing. A very typical workflow for me personally, when I'm iterating locally, is to debug the flow: if something goes wrong, I fix that step, and after fixing it I resume from that step. It's a really powerful feature for local iteration, and drake does this very nicely as well. I'm going to pause here to see if we have questions — let me take a look at Gitter. Okay, nothing new. Let me go on to the final part of episode one.
So this is the notebook — the result-sharing part I talked about previously. In Metaflow, you can actually inspect previous runs in a notebook. You can set the namespace; you can do all kinds of things to check previous runs and previous steps. I'll demo this later with the notebook. I want to reiterate the points I mentioned before: Metaflow maintains a global datastore of all past runs for all teammates. Teammates have their own namespaces, isolated from each other, but everyone can switch namespaces like I did here — you can set your namespace to a colleague's namespace and query their runs. And it's safe: you won't overwrite their results in any way, and they can continue to run experiments and run their flow in production. You can inspect their production results without worrying about tampering with production runs. And your colleagues can do this from any machine, not just their laptop — they can run it from a cloud workstation, and the same notebook runs everywhere, because as long as you're connected to the Metaflow global datastore, it works everywhere. This is very different from the experience without Metaflow: without it, what people usually do is read data from a local directory, and that kind of thing won't necessarily work if you run it in a different place. So let me demo this with episode one, switching back to RStudio. I have this R Markdown file — a quick review. First of all, you load the metaflow library. Then we have a flow client, a step client, and a run client, because we have an object hierarchy: the highest-level object is the flow, and inside a flow we have runs.
For every run, a run object is created — every time you source run.R, you create a run. Inside a run object we have step objects, and inside step objects we have task objects. In most cases each step has just one task, but if your step is a foreach step, which we'll talk about later, your step may run the same code with different parameters, as I mentioned before; in that case your step may have multiple tasks, each running the same code with a different parameter. And the top-level object is the flow client. Let's run this. Before executing the second block, let's check all the previous runs. I can just do this and print all the previous runs — these are the run IDs of previous runs. Let me check the most recent run, 31, in my own namespace. I'm running this and creating a run client, and this is how you specify a run: the flow name, a slash, and the run ID. Then we can print the run's artifacts, query one of them to inspect the features, and also query and inspect the models. Let's run this. Yep, nice. Let's check this out in preview mode. So, the namespace, and these are the artifacts we have for the run: we have features, we have models, and dt is the raw dataset we created — it probably needs a better name. We can query this feature set, and we're querying it in an immutable manner, so we don't need to worry about tampering with the feature set our colleagues generated. By doing this, we can also stay up to date with our colleagues' results: we just check what the most recent run is, print it out, and look at the data table.
Then we can check out the model we created before and print it — see, it's a gradient boosting model, with the number of samples and predictors. We can do some prediction; we can cross-check the model on some holdout data, right in this notebook. And I want to highlight a few more features of our Metaflow client. For each run, you can check timestamps — for example, when it was created and when it finished, finished_at — so we have timestamps for all the runs, and timestamps for all the steps as well. You can also check whether the run finished successfully — yes, this one was successful. And we have a built-in function to print a summary of the run: it was successful, it took 1.4 minutes. You can call summary on run objects, step objects, or the other objects. So that's the Metaflow client; you can read more in our docs — we have a section on the Metaflow client. Let's pause here for questions. This is also the end of episode one, so my colleague Brian will take over and talk about episodes two and three. Okay, there are no questions, so I'm handing the stage over to Brian. Let me stop sharing. Okay, Brian, feel free to share your screen right now. Brian, can you talk? There we go. All right, that was amazing. So let me share my screen. Everyone can see this, right? And we're good. All right, for a bit of history: I was the first person to take a stab at these R bindings for Metaflow, and a few years ago I spoke at useR! about it, and then Savin did last year. I'm really excited to share this with the community. I've had a lot of ideas for Metaflow.
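Putting the client pieces from this episode together, a notebook session might look roughly like this. The accessor names follow what was shown in the demo and may not match the released API exactly; the flow name TrainingFlow is a stand-in, and run ID 31 is just the example from the demo:

```r
library(metaflow)

# Object hierarchy: flow -> run -> step -> task.
flow <- flow_client$new("TrainingFlow")   # flow name is a stand-in
print(flow$runs)                          # IDs of all past runs in this namespace

# A run is addressed as "<flow name>/<run id>".
run <- run_client$new("TrainingFlow/31")
print(run$artifacts)                      # e.g. features, models, dt
model <- run$artifact("model")            # pull an artifact back locally

run$finished_at                           # timestamps for the run
run$successful                            # did it finish successfully?
summary(run)                              # built-in one-line summary
```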
And I'm excited that other people can fix a lot of my API decisions. Okay, we're now going to introduce a few concepts. One is branching, and the other is parameters. Branching, as Jason alluded to earlier, is a parallel process that fans out, and then there's a join step that pulls everything together. Parameters allow you to adjust different variables at the command line, and they're kind of the magic behind a lot of the foreach horizontal-scaling features that are so powerful. So, great. When we look at our flow, there's really not much we have to change here. As you can see, we're just introducing another new model, and the only difference toward the end is that we need to specify that there's a join step. These branches don't have to be just one level deep — they can keep going down. This example today uses caret, but the tidymodels framework lends itself really well to this, because at the end of the day the resamples in that set of packages are really just lists and nested lists. So with tidymodels, it would really just be a matter of specifying the inner loop of the core modeling here, and then it would be relatively easy to nest those on top of one another. And now that our community has embraced the functional workflow of the purrr package, it shouldn't be too difficult to take a single model and really flex some Metaflow on it. So, to reiterate: we're just introducing a new function for this lasso model, and at the end we just have a join flag. Cool. And, you know, Metaflow has been touted as a machine learning tool for really CPU-intensive modeling.
But when I was initially working on this, I felt it was also going to be very powerful as a workflow scheduler, kind of like Airflow, where at the end of the day you're really just writing R code. And I've always felt that Airflow is pretty opinionated about how you set things up. The core construct of Metaflow is that you're just writing in your native language, and all of these best practices are handled automatically behind the scenes, without much overhead of moving your functions into a particular paradigm. So again, we're just adding a new function and joining at the end. Cool — I'm going to try to do this live. All right, hopefully my RStudio isn't too obnoxious for everybody. We're going to the files; we're in branches. So here's the completed run step, and these are the new features here. If these didn't have defaults, they would error out if you ran them as before. But the magical bits — let me run this as a job. So these changed, and then we have this new lasso model here. And to do this parallel fan-out, this branching of the different models, it was extremely lightweight: you just had to add another step to the DAG. Here it is taking our two parameters, and it has the same signature as the other model, though that's not required. These parameters can be anything at all. And if you were going to run this on the command line — not sourced in — anything that is a parameter immediately becomes available as a flag. It's probably not going to like this, because I don't think I have a Python environment activated... oh, it did. Cool. So, let me show you again. Nothing really too tricky here: we're just adding another R function to the flow and adding this join step at the end.
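For orientation, the branching flow being walked through here has roughly this shape. The step, function, and parameter names are stand-ins modeled on the demo rather than the exact tutorial code, and the parameter() call sketches the command-line-flag behavior Brian mentions:

```r
library(metaflow)

metaflow("BranchFlow") %>%
  parameter("alpha", default = 0.5) %>%                # surfaced as a CLI flag
  step(step = "start", r_function = pull_data,
       next_step = "build_features") %>%
  step(step = "build_features", r_function = build_features,
       next_step = c("build_gbm_model", "build_lasso_model")) %>%  # fan out
  step(step = "build_gbm_model", r_function = build_gbm_model,
       next_step = "select_model") %>%
  step(step = "build_lasso_model", r_function = build_lasso_model,
       next_step = "select_model") %>%
  step(step = "select_model", join = TRUE,             # the join flag at the end
       r_function = select_model, next_step = "end") %>%
  step(step = "end") %>%
  run()
```

On the command line, the parameter would then be overridable with something like `Rscript branches.R run --alpha 0.9`.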
And another thing I wanted to highlight about the client API is that these flows are really just nested lists, so it can be really easy to pull out what you need using purrr, for example. I really agree with the philosophy of Metaflow, where you expose the tools and the developer can take it in a direction and extend it how they want. When I was working on this at Netflix, I really thought there would be a lot of Metaflow-powered Shiny apps, for example, since the ETL for a Shiny app can basically be done once, thrown into a flow, and scheduled. So, yeah. Anything on parameters or branching? Jason did a really great job of introducing them at the beginning. All right. And similarly with the foreach fan-out: it doesn't take any modification to your actual R code. You're just specifying a few different flags to take what you had prototyped on your laptop and put it on a bunch of machines in the cloud, if you have that set up. I think I'm running from an old one, but let me jump into the foreach. So, this didn't have anything changed outside of this line, and I believe the join step remained the same as well. Metaflow really exposes basic constructs where you can express fairly complicated things really easily, just like writing R code. And in this one, the actual model didn't change at all. So let me run this guy. Yeah, I really think there will be a lot of interesting future directions people take with this. For example, I think a lot could be done with Shiny and the RStudio API — say, using the API to surface a lot of the client exploration as a background job in the viewer pane. And I don't know —
I have a lot of out-there ideas, and I'm really glad that Netflix finally open sourced this and put talented people like Jason on it to take the R package to the next level. So in this one, again, we didn't have to change any code — we just changed our flags here. These parameters here are hard-coded, but if we combined this with the parameter feature from the earlier lesson, they could be specified at runtime, which makes it a great tool for more data-engineering-type workloads. All right. So those are two really powerful features that align well with functional programming. I think you can do a lot with small composable chunks of R code and let Metaflow take it from there. And I think the opportunity for reproducibility in data science work — it's really going to be a great tool for that. It almost felt like cheating for this kind of work, compared to what the tooling was back in 2017. So, all right, we've got a question. I'm going to stop, and I think Jason's going to go deep on the AWS integration. — Okay, thank you for the nice introduction to branches and foreach. Those are really nice features that I really, really love about Metaflow, because they let me easily scale out even locally. For me, it's not easy to write parallel programs in R, but with foreach and branches I can always do that easily, even running locally. In the next episode, episode four, I'm going to talk about how to do foreach branches on AWS. Let me just share my screen. — Yeah, Jason, we have a question in the Q&A, if we have time for it. — Let me take a look at what he's asking. I'm just looking at the chat and the Q&A. Oh, okay, there are already a lot of questions in the Q&A that I missed while I was presenting.
I think the most recent question is about what happens when the input to the next step is a vector — the foreach input vector. Yes. So basically, let me share my screen and I can show you how this is done. Okay, I have the Q&A floating on my other screen, so I can see it now. To answer the question — I think this is a great question — let's go back to the code for branches, in the scripts. We're doing the branching in build models: we're building two models, and then select model is the join step. In the join step we have this extra argument called inputs. You can do inputs, dollar sign, step name — that gives you the object for that previous step. Then, because we have a model artifact in that step, you can call dollar sign model to get the model created in that step. So you can call inputs$build_lasso_model$model and get the model built in the lasso step. This is how you refer to each of the branching steps inside the join step. And it's called the join step because we have to specify join = TRUE here — you can see select model is a join step — and it takes two arguments, self and inputs, and then you can use inputs$ to refer to any of the previous steps in the branches. Yes. While I was looking at that, I also saw some questions about installation. I want to mention that I know a lot of our users are on Windows. The Metaflow package was initially developed on macOS and Linux, but we tested it and confirmed that it can indeed run on Windows, though only with WSL 2 support.
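The inputs$ access pattern just described can be sketched like this. The step names mirror the demo, and compare_r2() is a hypothetical helper standing in for whatever selection criterion you'd actually use:

```r
# A join step takes self and inputs; inputs$<step name> refers to one branch.
select_model <- function(self, inputs) {
  gbm   <- inputs$build_gbm_model$model     # artifact from the GBM branch
  lasso <- inputs$build_lasso_model$model   # artifact from the lasso branch
  # compare_r2() is a stand-in for whatever metric comparison you prefer
  self$model <- if (compare_r2(gbm, lasso)) gbm else lasso
}
```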
WSL is a newer Windows 10 feature that's basically a built-in Linux kernel for Windows. You can install WSL 2, and with WSL 2 you can install an Ubuntu app from the Microsoft Store. Inside Ubuntu, everything is then the same as on Linux or macOS. We have a guide in our installation documentation under Windows support — check it out if you're using Windows. Yes. Let me go to the next episode. This is the final episode, and also the most exciting one. Here we're going to do foreach on AWS Batch, because this is a very frequent use case. In statistics we often want to do a lot of bootstrapping, and for each bootstrapped dataset we build a model; or we want to tune parameters — parameters for the model, parameters for the features. Each parameter setting, or each bootstrap step, is going to be a big computation step, because it's almost a full model-building process, just with different parameters or different sampled data. I myself find it extremely useful to do bootstrapping with Metaflow foreach branches, and when I don't have enough resources locally — sometimes I can run four branches of my foreach locally, but my laptop isn't big enough, and if I want hundreds of branches I have to scale out to AWS. And here's how we do it. The point I want to make is that it's the same — super simple. Assume we already have this foreach flow running nicely locally — let me get into presentation mode — then we only need to add this decorator to the foreach step so that each of its tasks runs on AWS remotely. Let me do this: the AWS episode. The script is exactly the same as episode three, the foreach branch. The only difference is this.
If you're editing locally, you just need to add this decorator to your foreach step. Then let's source this, and it should run on the cloud — episode four — and source it. Yeah. I want to mention that only this step runs on the cloud; the previous steps do not. And this is something we like to do, because, for example, in the feature computation stage, or the earlier data cleaning stage, we want to do a lot of data cleaning and exploratory analysis locally. It probably doesn't require that many resources — it doesn't need bootstrapping, it doesn't need parameter tuning — so we just run it locally. Sometimes we prefer to run things locally because of the overhead of running on the cloud: as you saw in the previous demo, sometimes we need to wait two to five minutes for the cloud instances to start, so there's a small overhead there. Let's wait a little bit — this is in the start step. While we're waiting, I can come back to this. Okay. So the idea is that in the previous step, compute features, we assign the different values of the parameter we want to search over. Let me just check the script — oh, my RStudio is getting stuck. It's fine; let's wait a moment. I think it's just that this run is executing and RStudio is stuck. So, in the foreach branches — in the compute features step — we specify the name of the parameter we want to search over, and there's a foreach attribute. That step comes before the foreach step, and when the foreach step runs, we're actually able to run it with multiple tasks. Again, I want to re-emphasize the object hierarchy.
So, we have a flow object, named after the flow — that's the top-level object. Inside the flow object we have run objects: every time you source run.R, you create a run object with a run ID. Inside a run object we have step objects — you see, these are all step objects. This one is also a step object, but it's different, because inside it there are task objects as well. We create five task objects inside this step, each task taking a different parameter. So inside build gbm model, you can use self$input to fetch the parameter assigned to that task. And in the actual flow structure, we specify that lr is the parameter we want to fan out over in the foreach step, as you can see here. So this is not the foreach script — I want to pull up the foreach script, but it looks like my RStudio is somehow stuck. Yeah, "searching for definition" — it's just stuck. Let me wait a moment; if it doesn't come back, I'll just check out the code... oh, it's finished. Is it finished? Yes, it's actually finished — it was just stuck for a moment. Let's take a look. Wow, there's a lot of information. I know this is going to be hard for you to read, but if you look at the structure, you can see that for the foreach we created five tasks: five tasks starting, with the same step name but different task IDs. And each task runs on a different cloud instance — this is the ID for the cloud instance. Each of them goes from SUBMITTED to RUNNABLE to STARTING. So at this point we have five instances starting on the cloud at the same time, and after starting, they start crunching numbers for us.
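The foreach wiring being described looks roughly like this; lr and the step names mirror the demo (the learning-rate values are illustrative), and the batch decorator line is the only change needed to fan the tasks out onto AWS Batch:

```r
library(metaflow)

metaflow("ForeachFlow") %>%
  step(step = "compute_features",
       r_function = compute_features,   # sets self$lr, e.g. c(0.01, 0.05, 0.1, 0.2, 0.3)
       next_step = "build_gbm_model",
       foreach = "lr") %>%              # one task per element of self$lr
  step(step = "build_gbm_model",
       decorator("batch", cpu = 4, memory = 8000),  # each task gets its own instance
       r_function = build_gbm_model,    # reads its own parameter via self$input
       next_step = "select_model") %>%
  step(step = "select_model", join = TRUE,
       r_function = select_model, next_step = "end") %>%
  step(step = "end") %>%
  run()
```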
And then, yeah, all this information is about setting up the tasks and their environments — the tasks are starting, because we have five tasks running, so this output is kind of verbose. You see the tasks all finished successfully, and then we have select model. What select model does is check all of the models that were built: in select model we do a for loop over the inputs, and each item in the inputs represents a branch — a branch of the foreach step here. We call dollar sign model to get the model actually built in that branch, by that task on the cloud, and we check its R-squared; if it's bigger than the best so far, we update the best model. By doing this, we select the best model according to R-squared. And then the final step is summarizing. So that's how I like to do parameter search or bootstrapping on the cloud very easily with Metaflow. I also wanted to mention that the basic syntax for specifying the foreach parameter is like this — I think Brian may have mentioned this previously, but I just want to re-emphasize it a little. You specify this, and it could be a super long list, maybe a hundred parameters. You specify the foreach parameter inside the compute features step: you say, I want to search over this with foreach, and the next step will do the foreach computation. After the foreach branches, we have a join, same as with branching, and then we select the best model. And then, yeah, that's it. So this is our very powerful cloud experience. Let me come back to the slides. — Right. Yeah, Jason, I just want to add one thing. Can you go back to the join step code? Yeah, cool. So what would be really neat to do, in kind of the purrr way, would be to turn these into list columns.
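The select model loop just described looks roughly like this in R; r_squared is an assumed artifact name for the metric each task saves, so treat the artifact names as illustrative:

```r
# Join step after a foreach: iterate over inputs, one element per branch,
# and keep the model with the best R-squared.
select_model <- function(self, inputs) {
  best_r2 <- -Inf
  for (inp in inputs) {
    if (inp$r_squared > best_r2) {   # metric saved by each task (name assumed)
      best_r2    <- inp$r_squared
      self$model <- inp$model        # model built by that task on the cloud
    }
  }
}
```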
So you can imagine your return object as this kind of nice tibble with all the nested objects in it. And so I think Metaflow combined with a lot of the purrr tools has the chance to be really, really great for productivity. — Yes, so the inputs object itself is actually a list. You can just index into it and pull out the model and the R-squared. And we also have a function called gather_inputs — there's a Metaflow function called gather_inputs — and we can use it to gather a given results variable from all of the branches. That's also available. We can check it out — for example, gather_inputs... that's not quite right, I should do metaflow::gather_inputs. Yeah. So you can see: you pass in the inputs, and then the artifact name — alpha in this example, or in our case you can do model — and it creates a list, a vector of those objects, the Metaflow artifacts from each of the branches. Then you can do subsequent data wrangling on the models created in each branch. So yeah, just as Brian pointed out, the inputs object is really just a list, and you can use all the tidyverse packages to play with it. Yep. Let me post this one and go back here. Let me take a look at the Gitter. I don't see the Q&A anymore... oh, I see it now. There are no new questions. So I'll move on to say a bit more about Metaflow. As Savin talked about at the beginning, the goal of this talk is simply to get you excited about Metaflow — a gentle introduction to the key features. We have some other nice features too: for example, you can execute a flow in a fault-tolerant fashion, and you can use tags as namespaces.
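The gather_inputs() usage being demoed can be sketched as follows; the artifact names model and r_squared are assumptions for illustration, and the exact return type (list vs. vector) may differ from this sketch:

```r
# gather_inputs() collects one named artifact from every foreach branch.
models <- gather_inputs(inputs, "model")              # list of fitted models
r2s    <- unlist(gather_inputs(inputs, "r_squared"))  # vector of metrics
best   <- models[[which.max(r2s)]]                    # ordinary R from here on
```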
For example, you can tag your runs with different names. Say you have an idea for feature engineering; then you can tag your run with, say, clever_feature_engineering_idea, and then you can query all the runs that belong to a certain tag. The tags behave somewhat like a namespace. You can check out our docs for the details. And if you go back, this section is about the details of Metaflow: our language features, those are the basics. Let me just make it a little bit bigger. For dealing with failures, we have different mechanisms. We have retry, and we have catch to catch platform exceptions. We cannot really rely on `tryCatch` inside R, because imagine your step runs on AWS: if something happens to your AWS instance, say the instance gets interrupted or AWS has an error, then your code never gets a chance to run that `tryCatch`, because there's nothing wrong with your code; it's just your hardware that ran into a problem. So we have a higher-level mechanism for dealing with failures, which is essentially a try-catch at the infrastructure level. This way we can maximize the safety of running things on the cloud. And then in the organizing-results section, you can see that users each get a namespace for their runs: two runs can have the same flow name, PredictionFlow, but belong to different teammates, and that's totally fine. You can switch namespaces like this, as I showed previously. And you can run stuff with a tag, say crazy_tests, and then query all the runs labeled with that tag. This is a way for you to organize your runs. Yeah. So that's it for our tutorial; just to recap the philosophy of Metaflow.
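The infrastructure-level retry behavior can be illustrated with a small plain-R analogy. To be clear, Metaflow applies retries outside your code, at the scheduler level, precisely so that failures your R process never sees (like a lost instance) are still handled; this sketch only mimics the observable behavior, and `with_retry` and `flaky_step` are invented names, not Metaflow functions.

```r
# Sketch of retry semantics: re-run a step function up to `times` attempts,
# treating any error as a transient, retryable failure.
with_retry <- function(step_fn, times = 3) {
  for (attempt in seq_len(times)) {
    result <- tryCatch(step_fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)  # success: stop retrying
  }
  stop("step failed after ", times, " attempts")
}

# A flaky step that fails twice (simulating interrupted instances),
# then succeeds:
attempts <- 0
flaky_step <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("transient failure")
  "trained model"
}

result <- with_retry(flaky_step, times = 3)  # succeeds on the third attempt
```

The key difference from this sketch is that Metaflow's retry survives even when the R process itself dies, because the re-launch decision is made by the platform rather than inside your script.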
So basically, by doing this tight integration with AWS, we want to provide first-class support for the lower layers of the stack: the data warehouse, compute resources, job scheduling, project architecture, and versioning, the model-operations concerns. And then we want to let data scientists move very fast on the top two layers, model development and feature engineering, which is where they can add the most value. Yeah. So we're open source on GitHub. You can learn more about us in our docs, and chat with the team; this is our Gitter channel. We are there almost 24/7, I would say, and we'll try to answer questions as soon as possible. And with this, I'm handing the stage back to Savin.

All right. Thanks, Jason and Brian. So we are right on time, but if you have any questions, we are happy to take those live. Please use the raise-hand mechanism in Zoom, and we'll unmute you if you have any questions for us. And as Jason and Brian mentioned before, all the documentation is available at metaflow.org, and it also has links to our Gitter chat channel. So at any point in time, if you feel that Metaflow is something you can use in your workloads, please do reach out to us; we'll be happy to engage with you. If you feel there are features that Metaflow doesn't address today but that would be helpful for you, please again reach out to us. We are formulating our roadmap in this space, and your input, your feedback, is going to be really, really helpful for us. Let's see if we have any questions. So we do have a bunch of questions in the Q&A. One question is: is there a quick way to say, give me the artifact called `model` from the most recent successful run? Yes, exactly. So we do have a Metaflow client, and in that client you can very easily reference a specific flow, and then we have pointers to, say, the latest run or the latest successful run.
And then you can very easily access all the artifacts of that run, or a specific artifact. If you're interested in more, there is a section in the documentation called Inspecting Flows and Results that has all the necessary detail. The next question is: is the global datastore required to be on Amazon S3, or is it possible to replace this backend with something else? So currently the backends that we support are your local file system or Amazon S3. But Metaflow has a plugin architecture, so you can in theory bring your own datastore and plug it in that way. If you have a specific requirement, please do reach out to us; we'd be happy to engage. There has been some progress on the GCP front as well. Numerous individuals, and organizations too, want equivalent support for GCP, so if that's of interest, we do have open GitHub issues where we would like you to weigh in. I have a question on that: would it be possible to extend this to something like MinIO, which adheres to the S3 API but is kind of its own thing, right? Yes. As a matter of fact, you can actually do that for MinIO, which provides S3-compatible APIs. We do have a few individuals who are using Metaflow in that way: rather than using Amazon S3, they authenticate to their MinIO cluster, and they're able to do that. All you need to do in your Metaflow configuration is, rather than pointing to an AWS S3 bucket, point to your MinIO cluster, and it will just work. All right. I see that we don't have any other questions. It was really great for all of you to join us. I know that in certain parts of the world, the time just didn't align well. But as Jason and Brian said before, if at any point in time the team can be helpful to you in any of your Metaflow needs, please give us a shout. And thank you.
And thanks to the organizers in the useR! community as well for setting this up. All right.