I'd like to thank everyone who's joining us. Oh, your slide's popped up. There you go. All right. I'd like to thank everyone who's joining us today. Welcome to today's CNCF webinar, Scalable ML Workflows with Advanced Data Management on Kubeflow. I'm Jorge Castro, Community Manager at VMware and a Cloud Native Ambassador. I'll be moderating today's webinar. We would like to welcome our presenter today, Vangelis Koukis, CTO and founder at Arrikto. A few housekeeping items before we get started. During the webinar, you'll not be able to talk as an attendee. There's a Q&A box at the bottom of your screen there in Zoom; it says Q&A. You can click on that and submit your questions. Feel free to drop your questions in there throughout the webinar and we'll get to as many as we can at the end. This is an official webinar of the CNCF and as such is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of that Code of Conduct. Basically, please be respectful to all your fellow participants and presenters. I hope we're all ready, so with that, we'll hand it over to Vangelis to kick off today's presentation. Take it away.

Thank you, Jorge. Thank you. Today's presentation is about scalable ML workflows with advanced data management on Kubeflow. What does this mean? What's the problem we're going to be talking about? Standing up an ML stack, a full ML pipeline, is very hard, and doing it in production is even harder. And then if you want to do multi-cloud, that's when the difficulty skyrockets. And by multi-cloud, we also mean working on your laptop, working on a local deployment. If someone is working on-prem, in an on-prem Kubernetes cluster, and on the cloud, then it's a multi-cloud deployment.

A common misconception is that to build an ML product, you need to focus on ML code. You write your code, write your model, train it, and then you're done, and someone will take care of the little details around actually writing the ML code. But the reality is that it's a DevOps problem: configuration, data collection, verification, managing resources, managing processes, serving infrastructure, monitoring. These are much more important and take up a much bigger percentage of time compared to actually writing ML code. So we need a platform for ML, and this is where Kubeflow comes in.

So why Kubeflow? Kubeflow is an open source project started by Google. It's an end-to-end solution for building ML platforms and ML workflows on Kubernetes. It aims to containerize workloads and allow users to experiment and explore with state-of-the-art AI. It focuses on easy onboarding; we have contributed code and a packaging for Kubeflow called MiniKF to ease the process of onboarding new users. I'm going to be talking more about this in this presentation. And Kubeflow itself has an outstanding community and strong support. It's a vibrant community: lots of commits, lots of new features, and the code is evolving very fast.

So what is MiniKF? It's a packaging contributed by us, Arrikto, to the Kubeflow community that packages Minikube, Kubeflow, and Rok, our data management platform, in a single VM image so you can spin it up very quickly on a single node, presumably your laptop or a very big desktop machine you may have. It does require quite a lot of memory because Kubeflow has lots of components. It features the latest Kubeflow version, 0.6.x, currently 0.6.2.
It is, we think, the fastest and easiest way to spin up a full Kubeflow deployment and start playing within minutes. So this presentation is going to be about deploying MiniKF and then running a well-known example on top of it, with the advanced data management that comes from Rok.

So how do you install MiniKF? There's a previous webinar we did, so you can watch the recording; there's an installation video, there's the documentation, but the TL;DR version is: you use Vagrant to initialize a new virtual machine, you start the virtual machine, and this is it. And this is what I'll be doing now. So I'll be switching desktops and I'll be spinning up the virtual machine. The machine is up, I go to this URL, and I'm shown the MiniKF landing page, so I start it up. What happens now is there's a short provisioning phase that essentially manages the virtual machine, provisions Minikube as your Kubernetes substrate, sets up all Kubeflow components, and deploys Rok so that you can have storage and manage it alongside Kubeflow. Eventually, when the whole platform is up, it gives you easy-to-use links so you can access the Kubeflow dashboard and Rok. This will take a few minutes; on the order of five to ten minutes depending on the speed of your machine and your CPU utilization. So I'll switch back to the presentation and come back to see what happens. So this is essentially the process of starting MiniKF: you initialize a virtual machine, then you start it, and this is it.

Let's focus on what Kubeflow does. This is a page from a well-known Google paper on TFX. TFX focuses on building an ML workflow with these steps: data analysis, transformation, validation, training the model, tuning the training of the model with hyperparameters, evaluating the model, and eventually serving it. Kubeflow comes with these libraries as containerized components, so you can use them as they are. It also comes with a containerized hyperparameter tuner that can run multiple trials.

Okay. Then you need an integrated front end for job management, monitoring, and debugging; data scientists like notebooks. Kubeflow comes with a dashboard and a notebook-based UI, Jupyter notebooks. The notebook manager is actually part of what we have contributed to Kubeflow. Then you need a shared configuration and job orchestration framework. Kubeflow has taken a hard dependency on Kubernetes for this, so Kubernetes is your shared job orchestration framework. What this means is that all notebook servers, all training components, the hyperparameter tuner, and whatever else you do run as pods on Kubernetes.

And finally, you need a way to actually store your data, control access to your data, and use it in a reproducible way. This means that when you have actually run a training pipeline, you have a record of having run this pipeline and you're able to go back in time and rerun it. When you have a model that's the output of the pipeline, you can go back in time and see exactly what you used as input to train the model. This is where we come in. We integrate Rok, our data management component, with Kubernetes and Kubeflow to give you an end-to-end platform for reproducible machine learning. This is what we'll be talking about today. If we go back to the deployment, we're deploying the database, waiting for it to come up; it's progressing. So moving on, what exactly does Rok do?
When you run an ML workflow, different steps of the workflow usually happen at different places; this is the typical scenario. Data scientists like to experiment on their laptops, on-prem, locally. But for training, you usually need to move to a bigger deployment with GPUs, for example on Google Cloud, where you can spin up GPU-enabled instances. And production, actually serving the model, happens elsewhere. Sometimes it happens at multiple locations: it happens in self-driving cars, it happens at IoT sensors, it happens at many different places. So how do you move the process from experimentation to training to production and keep it reproducible?

First part: we contributed code to Kubeflow to make it data-aware in general. This means we extend Kubeflow components so they use persistent volume claims provided by Kubernetes, storing their data in vendor-agnostic volumes. Our software integrates with Kubernetes via a mechanism called the Container Storage Interface, CSI, to provide implementations of PVCs, of persistent volumes. So when Kubeflow stores its data, this data is managed by Rok (there's a minimal sketch of such a PVC a bit further down). And then Rok forms a peer-to-peer network among all locations and allows you essentially to start from your laptop and keep your data locally. But when the time comes for training, you take a snapshot of this data, move it transparently to the training location, and run training. Training will produce a model; you take the model and synchronize it with all the serving locations. And because this is snapshot-based, everything is immutable; you'll find similarities with Git throughout. There's a live demo later on. So when you have a model, you essentially know the exact input dataset that led to this model being trained in a specific way. If something goes wrong with the model, you can go back in time, see the input to each individual step of the pipeline, and experiment with it and explore it.

So what's new in the latest MiniKF, based on the latest Kubeflow version? It uses Kubeflow 0.6.2. It enables multi-user authentication, so you can have multiple users spinning up their notebooks in an isolated way; there's authorization for notebooks as well. Since it's integrated with Rok, you can do near-instantaneous restore of snapshots. So when you have a notebook server, you can snapshot it, shut it down, then come back one or two days or one or two weeks or one or two years later and spin it up almost instantly. We have significantly improved the time to perform this snapshot, and we also have the ability to snapshot every step of a running pipeline, exactly for being able to go back in time and see the input to each pipeline step.

This is the landing page you saw earlier; that's how you control MiniKF. Let's see how the deployment progresses. So, provisioning: I switched back to the MiniKF desktop. Provisioning of Kubeflow is almost there; we are now waiting for the Kubeflow resources to come up. Rok is already up. When Kubeflow is up and running, this button will also turn green and then I'll be able to connect. So what I did is I started the virtual machine, initialized provisioning, and I'm waiting for the 39 pods of Kubeflow to come up. Kubeflow does have a lot of components. Okay, so let's give it some time. Starting up.
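While the deployment starts, here is roughly what the CSI integration described a moment ago looks like from the Kubernetes side: a component simply requests a PersistentVolumeClaim, and the Rok CSI driver provisions the volume behind it. This is a minimal sketch using the official Kubernetes Python client; the storage class name "rok" and the "kubeflow-user" namespace (which appears later in the demo) are assumptions, not taken from the slides.

```python
from kubernetes import client, config

# Connect using the kubeconfig that MiniKF sets up inside the VM.
config.load_kube_config()

# A 10Gi claim against the Rok-backed storage class (class name assumed).
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="notebook-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        storage_class_name="rok",
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="kubeflow-user", body=pvc
)
```

Because the claim goes through the standard PVC API, the same request works unchanged on a laptop, on-prem, or on a cloud provider; only the provisioner behind the storage class differs.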
Why did we build MiniKF? Because people need an easy way to onboard, and people like to experiment locally. So by packaging it for a single-node deployment inside a virtual machine, we give a very easy way to spin up Kubeflow and start experimenting with it. But because it's still Kubernetes and it's still Kubeflow, you have the exact same user experience. You can use the same APIs, you can submit the same Kubernetes resources, and you can use notebooks exactly as you would on a cloud provider. And when you're ready, because it's Kubernetes plus Kubeflow, you can move everything to a bigger deployment where, for example, you'll train, without having to rewrite things.

So our goal was a unified user experience. If you're on the cloud, if you're on-prem, if you're on your laptop: you run Kubernetes, you deploy Kubeflow on top, and you deploy Rok for managing your volumes, your storage. This gives you the ability to use the same APIs and find your data where you need it. Adoption has been going quite well since we first released this around March 7th; it's reached around 4,500 downloads. Please try it out, let us know how it goes, and give us feedback on the Kubeflow Slack. Initialize the machine, start the machine, this is it. The requirements are kind of hefty: you'd need at least 12 gigabytes of RAM to get the virtual machine up and running, because Kubeflow is quite resource hungry, and a bit more if you want to run ML training, depending on the size of your data sets. And because it's virtual-machine-based, it will run on Linux, macOS, or Windows in the same way. It uses Vagrant for managing the virtual machine and VirtualBox for actually running it.

So let's move on. I like how this synchronized: provisioning is complete, Kubeflow is up, Rok is up, and I can connect. I'm logging into Kubeflow using a standard IP address accessible only from the local host. This is the authentication part I mentioned, and this is the Kubeflow dashboard. I'm logged in in my own namespace, kubeflow-user, and I'm the owner of this namespace. MiniKF has a single user; Kubeflow in general, in enterprise deployments, supports more than one user. I can access my pipelines, my notebook servers (Jupyter notebooks), hyperparameter tuning, metadata, and Rok as the snapshot store.

So what are we going to use MiniKF for? We're going to use it to run a live demo of the Chicago taxi cab example. This is a TFX example, run on-prem on our laptop with MiniKF. What exactly is the Chicago taxi cab example? The City of Chicago produced a data set of more than 100 million trips; you can download the original data set from their site. This data set records quite a few details about each individual trip, including how much the trip cost, when it started, where it started, where it ended, how long it was, what means of payment the rider used, and whether there was any tip. And the result of this example is a trained model that can predict whether a random trip will result in a tip that's more or less than 20% of the fare. So we take this data set, feed it to a model, and build a classifier that can decide whether a random trip produces a tip that's more or less than 20%. This is going to be the workflow that will be demoed. OK, this is a summary of the input features: essentially, start and end location, means of payment, and the amount paid.

And what is our demo going to be? We'll start from a notebook; we'll use Kubeflow to create this notebook. We'll create a new volume as storage to hold our data. We'll bring in our data set from an external data source.
We could download the full data set from the City of Chicago site, but we'll download a subset of it from GitHub so we can run the example to completion. Once we've done that (this is the ingestion of data), we'll take a snapshot of our notebook volume. Why do that? So we have a data commit, so we know exactly where we started from, so we can rerun the pipeline with exactly the same input data. So we know that whatever the result of the pipeline is, this is where it started from. At this point, we can continue working with this volume. We can continue exploring things in the notebook, and the pipeline is not going to be affected, because the pipeline will run from this snapshot; it will run from a clone of this snapshot. This is going to be the starting point for the pipeline.

Then we will run the pipeline: a series of preprocessing, training, analysis, and eventually serving steps. We will also snapshot a step in the pipeline, the final step, so we can keep it and explore it by cloning it sometime in the future into a new notebook. This shows that we can take any individual step of the pipeline, snapshot it, and explore it again using notebooks. And this process can start all over again: I explore the data, make some fixes, ingest new data, adjust my algorithm, adjust parameters, snapshot again, build a new pipeline, start over. And when I have something that works, or when I have something that doesn't work, I can go back in time and see exactly where I started from. Essentially, we're using Rok as Git for our data.

So this is what we're going to do step by step. We're going to be creating a new notebook and adding a new data volume to it. So we move to the Kubeflow dashboard, notebook servers. We can also, I forgot to show you, log into Rok as well. OK. I'm going to be creating a new notebook server. It's going to be named webinar1. It's going to have a small volume where I'll be storing my libraries and my code. I'll also be adding a new volume; it's going to be the data volume, and I'll make it a 10-gigabyte volume. I'll also use a custom image we have just created, and for this, I'll consult my cheat sheet. This is the custom image I'm going to use, and I launch the notebook.

So at this point, I'm launching a notebook server, using Jupyter notebooks, from a custom image. It might take some time to actually download the custom image. Oh, OK, this was fast. The notebook is up. It's running as a pod inside Kubernetes; if I used kubectl, I would be able to see the pod running (a rough Python equivalent is sketched below). And I can connect to it. It's got a few secret volumes, my data, and my workspace. I'm connecting to it. I'll refresh to give it some time to actually come up. It's complaining; that shouldn't be happening. So I'll give it some time for Istio, the service mesh, to realize that the notebook is up. Or I'll debug it for a while. Let's debug it. I'm going to my namespace, looking at my pods. OK, invalid image name. If the image is invalid, then it shouldn't be showing as running, so this may actually be a bug. The image is shown OK. OK. Yep, I think I messed up; I messed up by including this, and I think it's going to work if I omit the HTTP part. But it shouldn't be showing as running, so this is definitely a bug. I'll remove this. Goodbye, webinar server. And start again. So webinar2 it is. I'm going to be using this custom image, making sure to actually omit the HTTP part. OK. Again, I'll have a workspace and a data volume.
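As an aside, the kubectl check mentioned above can also be done from Python. A minimal sketch with the official Kubernetes client, assuming the kubeflow-user namespace from the demo:

```python
from kubernetes import client, config

# Inside a notebook pod; use config.load_kube_config() from the VM instead.
config.load_incluster_config()

# List the pods in the user's namespace, as `kubectl get pods` would.
for pod in client.CoreV1Api().list_namespaced_pod("kubeflow-user").items:
    print(pod.metadata.name, pod.status.phase)
```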
And we do have a question, if you want to address it while this is firing up. Of course, sure. Swapna asks: why do we need to take a snapshot in the pipeline? It's not mandatory to take a snapshot in the pipeline, but it's best to take one so that you don't have notebook operations messing up what the pipeline sees. That's a very good question, by the way. If it was just a shared directory, and you had the execution of your pipeline actually messing with the same volume, the same storage, that the notebook or notebooks were seeing, then you'd have your work or your colleague's work messing up the result of the pipeline. So you can't just use an NFS share; it's like working in the same Git directory when you have more than one person, right? What happens is you can't be sure about the result of the execution. The result will essentially be random, as people are changing the input data. When working with Git, you can say: I started from this Git commit, so I can reproduce the bug, I can reproduce the result no matter what, because the commit is immutable; it never changes. It's the same thing with a snapshot. If you've started from a snapshot, no matter what your colleagues or you do in a notebook, it won't affect the execution of a pipeline. So this is why we focus on working with snapshots for having reproducible ML. I hope this answers the question.

So the pod is shown as running, but let's not believe it; let's actually go see the pods. The pod is running. Okay, and we have an upstream bug to fix. So the notebook is up, and I can connect to it. It's a nice JupyterLab, and I can close these extra tabs. This is my data folder; this is where I'll be storing the input data. I have a terminal, and I can increase its size. And for the purposes of this demo, what I'm going to do is use my cheat sheet again to download a notebook that we'll use, so I won't have to type too many commands. So this is a notebook, and it has the commands to run the individual steps embedded in cells.

So what I've done so far is I've created a notebook and added a data volume. I'll now ingest data from an external source. Ingesting the data means first downloading the actual source of the pipeline. This is the pipeline we will be running later on for training; I'll show you the code, it's stored here. Then we'll bring in the actual subset of the data set to use for training. So I'm cloning the Git repository and then copying the subset of the data I need into my data directory. This is essentially simulating ingesting data from an outside data lake: HDFS, Snowflake, S3, the City of Chicago site, wherever I find my data, assembling it all into my storage. And then I'll have a reproducible snapshot of this storage so I can start training. So I copy the data into my data volume here; I can actually see the data, that's the data.

And what we'll do now is move back to the presentation. I have ingested the data. I will now compile the Kubeflow pipeline; I'll show you the code and compile it. This is essentially a description of the steps that are going to take place. And then I'm going to snapshot the notebook. So, compiling the pipeline creates a tarball. We're using a newer version of Kubeflow Pipelines, which produces a few warnings. Roughly, a compile cell looks like the sketch below.
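This is a minimal sketch of compiling a pipeline with the KFP SDK; the module and function names are hypothetical stand-ins for the pipeline source downloaded from GitHub.

```python
import kfp.compiler as compiler

# Hypothetical names: the actual module/function come from the cloned repo.
from taxi_pipeline import taxi_cab_pipeline

# Produces the tarball that gets uploaded to Kubeflow Pipelines.
compiler.Compiler().compile(taxi_cab_pipeline, "taxi-cab-pipeline.tar.gz")
```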
Let's take a look at the pipeline itself. This is the pipeline specification, the training pipeline that we're going to use, written in a domain-specific language specified by Kubeflow Pipelines. Let me increase the font size a bit. Okay, so this defines the individual steps of the pipeline as components, containerized components, mentioning specific container images, taking arguments and producing outputs that live in volumes. So by snapshotting individual volumes, we can snapshot the state of the pipeline at each step. And why is this important? Because the most difficult thing when running a pipeline is having insight into its execution. What happens if the pipeline fails? What happens if a step fails? If I'm in a notebook, I can experiment quickly; if a pipeline step fails, how can I go see what happened? The easiest way is to take a snapshot of its input, clone it into a new notebook, and go see exactly what happened. I can actually run the exact command that would be used to produce the data and see what fails.

So at this point, I have my input data here and I have compiled my pipeline, so I can now snapshot my notebook. For this, I'll move to Rok. There's a bucket here; buckets are where you store your files, your snapshots. I can create a new one; let's call it webinar. This is going to be where I'll store my snapshot. I'll take a snapshot of my JupyterLab; it's going to be a full snapshot of the whole lab. I could take a snapshot of a single dataset, a single volume, but I'll snapshot this whole lab, this notebook server. I'll name my commit "initial commit" and describe it as, I don't know, "just ingested data, compiled pipeline". This will be the input to the pipeline run. And then snapshot it.

So what happens is a new asynchronous task starts; this is a group of tasks, actually. One task is to snapshot the data; the other task is to snapshot the workspace. Why do we snapshot both volumes, both disks? So we can be reproducible. You have seen multiple times that the result of an execution is not just a function of the input data; it's a function of the libraries you use as well. There are bugs that only appear with certain libraries; there's behavior, desired or undesired, that only appears with certain libraries. So to be reproducible, to be able to allow a colleague to see exactly what you did, you need to snapshot both your home folder, your workspace, where you store your libraries and what you pip install, and your data. And that's what we did here in a few seconds. So I now have a snapshot that contains snapshots of both disks, combined in a group. If I go here, this is my snapshot from less than a minute ago. It has a snapshot of the data and my workspace. And, let me go back here, I can see more information about it: it has all the information I gave when I actually created the commit. So I have snapshotted my notebook and created a new snapshot.

At this point, I'm ready to create a new pipeline. So what do I do? I go back to the notebook, or rather to Kubeflow Pipelines: this is the central dashboard, pipelines. I have a few sample pipelines, and I need to upload the compiled pipeline created here. To upload it, I first download it to my box and then upload it to pipelines. This is kind of a cumbersome step, and we can also automate it from within the notebook itself and upload it via a command-line interface; a sketch of scripting this with the KFP client follows below. So I'm now moving to pipelines, upload pipeline, choose a file, downloads, this file, and I call it the webinar pipeline. Upload. This is it. So now we'll be creating a run of this pipeline.
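The cumbersome manual upload just mentioned can be scripted from a notebook cell with the KFP SDK client. A minimal sketch, assuming a KFP SDK of that era; the pipeline parameter name "rok-url" and the experiment name are assumptions based on the demo, and the snapshot URL placeholder must be filled in from Rok:

```python
import kfp

# Inside the cluster the defaults usually suffice; otherwise pass host=...
client = kfp.Client()

# Register the compiled tarball under a human-readable name.
client.upload_pipeline("taxi-cab-pipeline.tar.gz",
                       pipeline_name="webinar-pipeline")

# Start a one-off run in the default experiment.
experiment = client.create_experiment("Default")
run = client.run_pipeline(
    experiment.id,
    job_name="webinar-run-1",
    pipeline_package_path="taxi-cab-pipeline.tar.gz",
    params={"rok-url": "<snapshot URL copied from Rok>"},  # assumed param name
)
```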
Run the pipeline. These are all the pipeline steps. The first step is to create a volume. Why do we need to create a volume? So we can have a clone of the snapshot we start with. Every other step depends on this volume, because every other step manages this volume and creates files in it. Then we'll validate the data, preprocess, train, deploy, predict, analyze, and we have a final snapshot-volume step so we can actually show you how the cycle ends and we can start over with a new notebook.

So at this point, I'm about to create a run. Create a run; the run name is going to be, I don't know, webinar run one; no description. I'm going to use the default experiment, and it's going to be a one-off run. Now we need to specify the parameters of this pipeline. One of the parameters is a Rok URL, which says: what is the input data, where do I start from? Let's choose. This pipeline will start from my input data, the data I snapshotted four minutes ago. Okay. And where is the output data going to go? Where is the snapshot going to go? Let's have it here, in a new file named, I don't know, webinar snapshot. Okay, I now start the pipeline.

So what happens is a new run is created. This is now the runtime graph; we'll be seeing the steps exactly as they happen, and this will auto-refresh. The first step is to create a volume. So we now have a volume that is independent from the volume seen by the notebook. I can use my terminal, go to data, and mess up the data, for example, you know, create a new file. This new file will not be part of the pipeline. The fact that I'm now changing the data does not have an impact on the execution of the pipeline, because the pipeline runs from a clone of this data. And this answers the earlier attendee's question. For example, let's touch this file; it will not appear at the end. Okay, where were we? The pipeline is running. The validation step is complete; preprocessing runs. So let's take a look at the steps. Eventually the pipeline will snapshot a step into a new snapshot, and we'll clone it if we have the time.

So the steps are: validating the data using TFDV, a TFX library; this step has already completed, and it runs from a clone of the notebook data. Then we transform the data in a preprocessing step. We transform both the training part of the data set and the evaluation part, the input that we have kept so we can then evaluate the model. Then comes training, which will lead to a trained model, which will also be stored inside the volume. Let's see, is this progressing? The preprocessing is done; we now run training. Then comes the TFMA library, which is going to analyze the model and produce artifacts; we'll be seeing these artifacts in the snapshot of the pipeline that's going to be produced eventually. Prediction and serving. So training is happening, and when training is done, we'll have prediction, serving, and we'll end up with a snapshot that we can use in subsequent steps.

So why are we doing this? We extend Kubeflow so each component can store data in a vendor-agnostic way, in persistent volumes, in file systems that are mounted. First we created a JupyterHub-based spawner with persistent volumes for 0.4; we replaced it with the native notebook manager that you saw in 0.5. And we have also extended the Pipelines domain-specific language so it can access persistent volumes and snapshots. If you look at the pipeline, I can actually show this. This is the pipeline, right? We specify that we want the output to be this volume. Let me find the volume; every step is using this volume.
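In code, the DSL extensions being shown look roughly like this. This is a condensed sketch using the VolumeOp and VolumeSnapshotOp constructs of the KFP DSL, not the actual taxi-cab pipeline source; the container image name and the "rok/origin" annotation key (how the provisioner is told which snapshot to clone) are assumptions.

```python
import kfp.dsl as dsl


@dsl.pipeline(
    name="taxi-cab-on-prem",
    description="TFX taxi-cab example backed by Rok-managed volumes",
)
def taxi_cab_pipeline(rok_url=""):
    # Create the pipeline's working volume as a clone of the Rok snapshot
    # we start from (annotation key is an assumption).
    vop = dsl.VolumeOp(
        name="create-volume",
        resource_name="pipeline-data",
        size="10Gi",
        modes=dsl.VOLUME_MODE_RWM,
        annotations={"rok/origin": rok_url},
    )

    # Every step mounts the same volume, so outputs accumulate in it.
    validate = dsl.ContainerOp(
        name="validate",
        image="taxi-validate:latest",  # hypothetical image name
        pvolumes={"/mnt/data": vop.volume},
    )

    # When the step is done, snapshot the volume so its final state can
    # be cloned into a new notebook later.
    dsl.VolumeSnapshotOp(
        name="snapshot-final",
        resource_name="final-state",
        volume=validate.pvolume,
    )
```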
So we created extensions to the Pipelines domain-specific language so we can create volumes as clones of existing snapshots and then use these volumes in each individual step. And eventually, we'll also produce a snapshot; let me find this as well. We have extended the DSL so you can also produce snapshots of a volume, with dependencies. So when the final steps of the pipeline are done, a snapshot of this volume will be produced, as the snapshot-volume step, in the location specified. Let me move back to the pipeline. Training is still happening; this is the most expensive part of the process, so let's give it some more time.

Why use Rok for your pipelines? Step one starts from a notebook: you start from a snapshot, you clone it, you make some changes, you snapshot it. Step two starts from a clone of step one: make some changes, snapshot it again. Step three, and so on. If this breaks, you can take the snapshot, examine it in a notebook, and start over.

Hybrid pipelines: you have a pipeline that you want to run in two locations. You want to run a few steps in location one, then you want to run serving or training in location two. Steps one, two, and three run in location one; you snapshot step three, you move the snapshot using Rok to another location, and at this point, you can run the second half of the pipeline in location two. So this is a hybrid pipeline. It's essentially a meta-pipeline, with one part running in location one, another part running in location two, and Rok bridging the gap by moving data from one location to another. If you want to reproduce a pipeline, you start from the same input data, rerun the pipeline in another location, and you end up with the same results. Rok is in between, syncing data and state among all locations.

Back to the pipeline: it's almost done. No, it is done; the snapshot step is done. Everything has completed successfully. I move to Rok, and this is my snapshot, created less than a minute ago. So the very last step, and then let's move to questions, is to start from this snapshot. I go back to my notebook servers and create a new server; let's call it webinar three. This time I'm going to add a volume that's an existing one, and we'll be using the snapshot I just created. I'm going to call it data again, not notebook. So this notebook will now start from the snapshot that was created in the pipeline, and I'll be able to explore the output data of the pipeline in a notebook. Let's give it some more time; behind the scenes, Rok is essentially cloning the output data.

Yes, Jorge? Sorry, just reminding you that you have four questions in the chat there. Okay, let me connect to the notebook first; I'll speed it up a bit. Go to data. You can see there are actually extra files that have been created, so this is not the input data; this is the data after the execution of the pipeline. Move to analysis, output display. This is the result, let me see. It should appear; it doesn't. Something may be broken here, or let me refresh. Anyway, this is the result of the analysis. I cannot show it right now, but I can see that this is the data that was produced by the pipeline. So this is more or less it. Let's move to questions, and then I'll continue.

Sure. So, Jordan asks: how difficult is it to share snapshots between users? Okay, that's a good question. We didn't cover it in this presentation. We've made it as easy as possible using Rok.
You essentially publish a bucket to another service we call the registry, which gives you a link; any other user can then subscribe to this link and follow your changes as they happen. What we mean by this is: you go to your buckets and publish them. This gives you a link you can share with others, and others can then create a subscribed bucket and follow your changes. So whenever your pipeline runs and creates a new snapshot, or whenever you create a new snapshot of your notebook, any follower sees it. This is how you share snapshots across locations.

Okay, next question: can Hyper-V be used on Windows instead of VirtualBox? Minikube does support Hyper-V; that's a good question. Right now, MiniKF does not support Hyper-V. We are thinking about supporting it, but we are focusing our efforts on moving MiniKF so it can run on public clouds as well. So yes, Hyper-V is an option, but we're focusing our efforts on offering MiniKF as a packaged product on a public cloud, so you can run it without having to worry about your RAM or CPU limitations. But this is feasible, yes.

Okay, Joan asks: how enterprise-ready is the Kubeflow/Rok stack in regards to identity management and being able to do audits? That's a good question. Kubeflow has authentication. Authorization is not fully there yet: it is there for notebooks, but not yet for pipelines, hyperparameter tuning, or metadata handling. We are actively contributing to multi-user pipelines right now, and we hope to have it as part of the next Kubeflow version. Please get in touch with us on the Kubeflow Slack and we can talk more; maybe you can also test an initial implementation that we presented today at the community meeting. That's what we're doing right now. To be enterprise-ready with full support for authorization, there are quite a few things to be done, but we're moving towards this.

Is the volume that's getting created of type Cinder or some shared storage, so that all stages can access it? That's a good question. Rok manages whatever primary storage you provide it with. This demo was using local disks, so there is no Cinder or other API underneath; we just gave the local disks on each individual node to be managed by Rok. Rok carves individual volumes out of said disks, and you can be as fast as possible and have super low latency because of this: you just work over local disks, you can work over local NVMe. That's what we are targeting. And you can still withstand node failure, because you can be snapshotting your notebooks once every five minutes, for example. The snapshot operation, as you saw, is very light, and we snapshot thinly. This means we have an internal mechanism that tracks whatever blocks you have touched and only copies those to produce a new snapshot. I hope this answers the question; please follow up if it doesn't.

Yep, next one: instead of copying a snapshot into a notebook volume, would it work if the notebook itself were mounted on object storage, with shared folders, et cetera? I'm not sure I understand this question. I think you mean: what would happen if I mounted my object storage as a volume in the notebook? Maybe I could use a user-space mounting mechanism to access S3, for example, an S3 bucket, as a volume in my notebook. This has two main issues. One is you lose reproducibility, because it's like using a shared NFS share.
While you're making changes in the bucket, you don't know which pipeline executions are actually using the bucket right now to read data from it. So if you want to perform an exploration and run an experimental transform on the data, you can't do it, because you don't know who else is using the data at this point in time. You'd actually have to copy all of the data to make sure you're working with a private copy, but then this becomes too expensive. So this is one thing. The other thing is, if you compare performance and latencies when working with a local volume to working with S3 mounted in a notebook, the performance difference is abysmal; the latencies skyrocket. We have measured on the order of 800 megabytes per second of bandwidth and microsecond latencies, because we work with local NVMe drives with an internal data path. This is orders of magnitude faster than going to an object store every time.

Okay, Vishnu wants to know: what is the storage class used, and how about on-prem? Okay, that's a good question. The storage class used is ours: Rok exposes itself as a storage class, we provision volumes using this storage class, and Rok underneath carves up volumes as you need them. For on-prem, Rok is deployed as a DaemonSet. It has its own operator; you spin up a CR, a custom resource. I think I can show what we've done here as well. A Rok cluster, yep, this is it: this is a deployment of the RokCluster CRD in the rok namespace. And our operator, which is this, will take care of spinning up whatever pods are needed. We spin up a DaemonSet, so for on-prem deployments we run a pod with Rok on each individual node, and this manages your storage.

Okay, we're running out of time. Since you're in there, do you support encrypted data at rest? Since you just happened to be in there, we can knock this question out. Yes, yes, yes. Okay, good. Definitely. The internal data path, we can set it up so it also encrypts data at rest, yes.

Okay, two more questions. How are large data sets handled; does the scheduler work like a Spark scheduler? I'm not sure I understand this question. We've tested this with a few hundred gigabytes of data. We just carve up local disks, and we've made lots of optimizations to make sure we only touch the changed blocks. So when you have a very big data set and only change a few parts of it, you can still snapshot it very efficiently.

Okay, can this solution work with Spark, Hive, and/or Presto? If you're asking, could you please elaborate? And that is our last question. I assume it can: as long as you can mount a persistent volume, as long as you can access a file system from said jobs, it can be used with these solutions. But please get in touch with us on the Kubeflow Slack, and then we can talk more about your specific use cases. Please join the Kubeflow Slack; there's a link to join on kubeflow.org. Please follow the link. I actually have a slide about this. So this is how you run Kubeflow Pipelines on-prem with MiniKF and Rok; this is a nice image. And this is what we have for the future: GPU support and support for running on public clouds. Please try it out, join us on the Kubeflow Slack, and let us know how it goes.

Awesome. Thanks, thank you, Vangelis, for a great presentation. Thank you. That's all the questions we have time for today. Thanks for joining us.
The webinar recording and slides will be online later on today and we are looking forward to seeing you at a future CNCF webinar. And with that, thank you everybody and have a nice day. Thank you. Goodbye.