Hello, everyone. Welcome to this session on machine learning at scale using Kubeflow with Amazon EKS and Amazon EFS. My name is Suman Devnath, and I'm a developer advocate with the EFS product team. I'm excited to share some insights about how you can run your machine learning workflows using an open-source tool called Kubeflow.

Here is what we're going to do in the next 30 minutes or so: we'll talk about why machine learning on containers, which might sound a little odd if you're new to the container space; we'll then dive a little into Kubeflow; and then we'll jump straight into the demo, which is more interesting than boring slides.

First things first: why machine learning with containers? If you think about it, this isn't really specific to machine learning; we get the flexibility and all the benefits that any application gets from containers. If you look at the whole machine learning stack, there are different tools like TensorFlow and PyTorch, and we need different kinds of infrastructure to run training, so things get very complicated once you think about all the packages, dependencies, and configurations. What containers help us do is package our training code along with all its dependencies in a much more modular way. That way our ML environment becomes lightweight and very portable.
So you can run your machine learning training jobs and other tasks independent of the platform.

One reason it's even better to run this on Kubernetes is composability: you define your training jobs and machine learning tasks in a granular way, so you can run them in different places, and if you make a change, it doesn't affect the other jobs in your pipeline. Another is portability: you can start today on-premises in your own Kubernetes environment, and you don't have to change anything if you later migrate to AWS and run the same training job on the cloud, because it's packaged as a container on Kubernetes. And obviously you don't have to think about scale, because Kubernetes takes care of the infrastructure. Whether your training job needs two instances, three, or ten, since it runs as a container you don't have to worry about the backend infrastructure, which is very valuable for any machine learning engineer, or any infrastructure engineer for that matter.

The best part is that you don't have to manage the Kubernetes cluster yourself. You can use Amazon EKS, our managed Kubernetes service on AWS, which gives you the control plane to which you attach your data plane, the compute nodes. You can decide which version of Kubernetes you want, and you natively get the upstream Kubernetes experience.
It will look and feel exactly the same as if you had installed and configured Kubernetes on-premises or on EC2 instances you manage yourself. It also integrates with a lot of other AWS services, which we're going to see in a while: we'll run a machine learning job using Kubeflow on EKS, and we'll save our training dataset on EFS.

Another thing, irrespective of Kubernetes, is that you don't have to build all those training container images from scratch. We offer pre-packaged Docker container images, the Deep Learning Containers, which are fully configured, validated, and rigorously tested, so you always get a well-tuned configuration and image you can make use of. You can just write your own training script, or take one of the template training scripts and make the relevant changes based on your needs and requirements, and you can always customize those container images. We support different frameworks like TensorFlow, MXNet, PyTorch, and so on. And the best part is that you can use these Deep Learning Containers not only with EKS, but also with ECS, Amazon SageMaker, and EC2 instances.

Let's talk a little bit about Kubeflow before we jump into the demo. You can think of Kubeflow as a machine learning toolkit for Kubernetes. It comprises various components such as Jupyter notebooks, pipelines, training operators, and inference, or model serving.
If you have ever seen Amazon SageMaker, Kubeflow gives you a similar kind of platform. It may not have all the fancy features SageMaker offers, but if you want to run your machine learning workflow on Kubernetes, and you want control over your workflow at a more granular level, Kubeflow may be a good fit. It's an open-source project, so you can always contribute to it, and we're going to see in the demo how you can create a Jupyter notebook and start a training job on Kubeflow.

One important advantage of using Kubeflow on AWS is that you get a lot of flexibility and can leverage the integrations AWS has with other services. When you run Kubeflow on EKS, you get all the goodness of the service integrations we have with EKS. In this case we're going to use EFS with Kubeflow, but there is a good ecosystem here: you can integrate other managed services for a better experience for your machine learning workflow when running on Kubeflow.

Since we're going to use EFS in our workflow, let's talk about EFS a little; we've already learned about EKS, our managed service for Kubernetes, so let's spend a couple of minutes on EFS. EFS is a simple, serverless, set-and-forget kind of file system. You just create it, you don't have to specify the size of the file system, and you can use it almost anywhere: you can access it from an on-premises machine, from an EC2 instance, from a Lambda function, or from a Kubernetes cluster. We're going to use EFS to save the training dataset for our machine learning workflow. It's also very elastic and performant; we recently announced sub-millisecond read latencies.
That means in general you will see read latencies of around 600 microseconds, which is pretty amazing when you think about a shared file system on the cloud. It is highly available and durable, and there are different storage classes you can select from.

As we discussed, you can access EFS from various compute services, on-premises machines and EC2 among them, but it doesn't stop there. You can always use EFS with your containers, whether they run on ECS, our managed container service, or on EKS, which is what we're going to use in our demo.

So how do you get started with EFS on Kubernetes? We're not talking about Kubeflow yet, because we'll do that in the demo; I just want to give you an overview of how to get started with EFS on Kubernetes. The first thing you need is a Kubernetes cluster. In this case we're creating an Amazon EKS cluster, but this could very much be your own installation of Kubernetes on a bunch of EC2 instances that you manage; we're just taking the example of a managed EKS cluster, which means you don't have to manage the cluster yourself: patching, version upgrades, and updates are all taken care of by Amazon. So first you create the EKS cluster; second, you create a security group so that the EFS file system can be accessed from the EKS cluster; and then you create the EFS file system. We're going to do this through code in a while, but that's what the workflow looks like. And the most important step is to install the EFS CSI driver. This storage driver doesn't come out of the box, so once you have your EKS cluster, if you want to attach EFS storage, you need to install this CSI driver.
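Condensed into commands, those setup steps look roughly like this. This is a sketch, not the exact script used in the demo; the cluster name, region, and the $VPC_ID, $VPC_CIDR, and $SUBNET_ID variables are placeholders you would fill in for your environment.

```bash
# 1. Create the EKS cluster (eksctl is one common way).
eksctl create cluster --name ml-demo-cluster --region us-east-1

# 2. Create a security group allowing inbound NFS (port 2049) from the
#    cluster VPC, so pods on the EKS nodes can reach the EFS mount targets.
SG_ID=$(aws ec2 create-security-group --group-name efs-from-eks \
          --description "Allow NFS from EKS nodes" --vpc-id "$VPC_ID" \
          --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
          --protocol tcp --port 2049 --cidr "$VPC_CIDR"

# 3. Create the EFS file system, plus a mount target in each cluster subnet.
FS_ID=$(aws efs create-file-system --query FileSystemId --output text)
aws efs create-mount-target --file-system-id "$FS_ID" \
          --subnet-id "$SUBNET_ID" --security-groups "$SG_ID"

# 4. Install the EFS CSI driver on the cluster (Helm is one documented option).
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
          --namespace kube-system
```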
The CSI driver is also an open-source project, so if you'd like to contribute, please have a look; we've made a lot of improvements to it in the recent past.

Once you have created the EKS cluster, created the file system, and installed the CSI driver on the cluster, you can define a storage class. In the storage class definition you provide the file system ID of the file system we just created, and that's all. After that, you run your application by creating a persistent volume claim (PVC) that refers to the same storage class we defined before: here the storage class name we've given is efs-sc, and you just refer to that same storage class in your PVC definition. Once that's done, you can mount the PVC in your application, in your pod. In this case we're using the persistent volume claim named efs-claim, the same claim name we used when creating the PVC. That's all; next time you want to run another application, you just create another PVC and use it in your pod. You don't have to go back to a storage admin over and over, because we're using dynamic provisioning, and the CSI driver takes care of creating the access points, which are the technology behind provisioning these PVCs. So that's a little bit about how you can make use of EFS with Kubernetes.

Before we jump into the demo, here is the architecture of our demo: we're going to use EFS to store our training dataset.
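Put together, the three pieces just described (the storage class carrying the file system ID, the PVC referencing it, and a pod mounting the claim) look roughly like this. The file system ID shown is a placeholder, and the parameter names follow the EFS CSI driver's dynamic provisioning example:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0  # placeholder: your file system ID
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany                   # EFS can be shared across many pods
  storageClassName: efs-sc            # the storage class defined above
  resources:
    requests:
      storage: 5Gi                    # required by Kubernetes, effectively ignored by EFS
---
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "df -h /data && sleep 3600"]
      volumeMounts:
        - name: efs-volume
          mountPath: /data
  volumes:
    - name: efs-volume
      persistentVolumeClaim:
        claimName: efs-claim          # the PVC defined above
```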
In our demo we're going to download the training dataset and run a training job on Kubeflow, and that training job will access the storage on EFS through the CSI driver. For the training job we'll build an image and push it to ECR, our container registry (think of it as Docker Hub, but within AWS). Then we'll start the training job on our EKS cluster using Kubeflow; it will pull the image from ECR and run against the training dataset saved in EFS.

Okay, let's jump into the demo and see it all in action. I have my AWS console open, and you can see I'm inside Cloud9, an IDE for writing your code on AWS. It gives you a nice IDE environment running on an EC2 instance, which makes it very easy to write code as you go along; you don't depend on carrying your laptop or workstation, you can write code from anywhere as long as you have internet connectivity. You can see I have two environments here, and I've already opened this one. You can always create your own by clicking Create environment; it asks just a few questions about the type of EC2 instance and the operating system you need, and you're all set. If you click Open IDE, you land on this page. I have a lot of code here, a clone of the GitHub repo, which I'm going to share in a while, but this is the kind of interface you'll get.

First things first: I already have a Kubernetes cluster up and running. If we run kubectl get nodes, we see an EKS cluster with five nodes. I have also installed Kubeflow, and we can see all the pods running as part of it, because Kubeflow, as we learned a while back, is a collection of different services; those pods are what's actually running Kubeflow.

The first thing we need is an EFS file system. At this point we don't have any EFS file system created, and if you run kubectl get storageclass (or sc for short), you see only the default storage class, backed by EBS volumes, which is what you get by default when you create a Kubernetes cluster on AWS. So the first things we need to do are create the EFS file system, then install the driver and create a storage class. We don't have to do it all manually: we have a script, located inside this directory, which has some dependencies. If you look at this auto EFS setup script, it uses some external libraries, and for that we have a requirements file, so first let's install all the packages. I already have them installed, so it will just skip the installation.

Next, let's run the script, and then we can go over what it's doing. Before I hit Enter, I want to show you a few parameters we're passing: the region, meaning the region in which we're creating the file system, which is obviously the same region as my EKS cluster; the cluster name, which I've saved in an environment variable; and the file system name. Also before I hit Enter, if you go to the EFS console and hit refresh, you'll see there is no file system yet; hopefully after this script executes, we'll have a new one.

While this runs, let me quickly show you what the script is doing. It does just a few basic things: it checks a few prerequisites, then it creates the IAM role so that the cluster can access the EFS file system; then it installs the CSI driver, which we just talked about in the presentation, so that the Kubernetes cluster can talk to EFS; then it creates the file system; and after that it creates a storage class. This is the storage class we're going to use in Kubeflow for training our job, and even for the notebooks, to create the data stores that keep our training datasets.

Creating a file system and a storage class is something you can do repeatedly, as and when you need, but setting up the CSI driver and the IAM role is a one-time activity. It will take a couple of minutes, so let's wait for it to complete.

Okay, as we can see, it has created the file system, it has created the mount targets, and it has provisioned a storage class. To confirm, we can run kubectl get sc, and we see the storage class that was created. It is still not the default one, so if we create anything on Kubeflow, say a Jupyter notebook, it will use the default storage; but if you explicitly mention this storage class, storage will be carved out from EFS.

Now, before we go ahead: if we go to the EFS console and click refresh, you'll see the file system got created, my-efs-1, which is exactly the name we passed as the file system name in the script. So we're all good now. If you go inside this file system and look at the access points, which are the entry point for applications into EFS, you'll see there is no access point created yet, because we've only created the storage class; no PVC has been claimed or created yet.

The next thing is to create a Jupyter notebook. To do so, first we need to run the dashboard service; I think it's already running, but let me just stop and start it again. Now if I go to Preview and open the app in a separate tab, and close this off, the dashboard is up and running, and we can see it here. And if you see
here, we have notebooks, TensorBoard, volumes, pipelines, everything. If you have seen SageMaker on AWS, it's a similar idea; SageMaker has a lot more flexibility and features, but this is a nice environment for you to manage things end to end, with more granularity, and it's all running on Kubernetes, which is great.

Now, if you see here, we don't have any volumes, so let's create one. This is the volume we're going to use for keeping our training dataset. Let's give it a name and a size, say 100 GB, and here we can select the storage class. For the access mode we'll pick ReadWriteMany, which EFS supports; that means you can access this volume from multiple pods. Let's create it. It's going to stay in a Pending state because we haven't yet attached this volume to any Jupyter notebook or any other training job.

So let's go ahead and create a Jupyter notebook. Let's give it a name, say notebook-one. Here we can select the image, but let's keep the default. It's going to create a volume for its own use, basically the home directory for this notebook, and since our default storage class is EBS, that PVC will come from the gp2 storage class. But we're also going to attach one external volume, the dataset EFS volume we just created. Let's attach it and click Launch.

You can see it's already created, and if I go to Volumes now, you'll see the dataset volume's status is now Bound. We also have the other volume for the home directory, which is coming from gp2. And if I go to the EFS console and click refresh here, you'll see one access point that got created dynamically by the CSI driver.

Let's go back to our notebook and connect to it. What we're going to do now is run a training job, but before that we need some data, so we'll use the Jupyter notebook to download a dataset. I already have the location of the dataset; it's a simple dataset containing images of different flowers, and we're going to run a CNN, basically a deep learning training job, which will identify the type of flower given an image. This is a very tiny training dataset, and the focus is not on the machine learning part; what I really want to show you is how you can make use of Kubeflow to run your training jobs. If you go inside this dataset directory, which is coming from the EFS storage, you'll see different types of flowers: roses, sunflowers, and so on. Let's wait for the download to finish, and once it's done we'll go to Kubeflow and start a training job.

As you can see, the training dataset has been downloaded, and we have images saved inside this EFS dataset share, so we're all good to start the training job. Let me close the Jupyter notebook, because the only thing we needed it for was downloading the dataset, which is now stored in the EFS file system through this access point.

Now let's go back to our console and start the training job. Just to recap where we are: if we open the architecture diagram, we have saved the training dataset on EFS, and all we need to do is run the training job. For this, I already created a deep learning image which contains the code for running the training job; we built it locally on the Cloud9 instance, and then I pushed it to an ECR repository, which I'm going to show you. Then we can simply run a training job on Kubeflow, where I've specified the training dataset location as the dataset volume we created, and the image to use as the same image in ECR.

Let me show you the image first. If we run docker image ls, you'll see we have one repository, and inside it this image. All the code for this, including the Dockerfile, is in the GitHub repo, which I'm going to share towards the end. But if you want a quick look at the Dockerfile: we're simply pulling a TensorFlow base image, copying in this training script, which is located here, and setting it as the entrypoint. That's all; it's nothing fancy. Inside the training script we run the actual machine learning training. Now let's go to ECR: this is my repository, in my account, and inside it we have our image; this is the image we're going to use.

So let's run it. The training job we're going to run on Kubeflow is also defined as a YAML file, inside the training-samples directory, in tfjob.yaml. If I open it up, you'll see this is a TensorFlow job (TFJob): here is the name of the job, and we're going to create two replicas, which means when we execute this you'll see two pods created for the training job. Here is the image we're using, and, most importantly, the training dataset, because the training has to run against some dataset: this is the same dataset we downloaded a while back into the dataset PVC. If we run kubectl get pvc (let me just grab the namespace), you'll see this dataset PVC, and it's the same PVC we're mounting on the training job. The training job is going to create pods, and those pods will have our EFS storage attached and mounted inside the train directory; in our training script we've said to go and read that directory for the training data. Let me open train.py, and you can see here we mention where our training dataset is.

Okay, let me go back to our CLI and run the job. We're inside this ML folder, and the training job definition is inside the training-samples directory, so we can simply run kubectl apply with the location of our definition file. Now if we look at the training job, it's in the pod-creation phase; it's yet to start training, but we see the two pods it created. We can also run kubectl get pods -n with the namespace, and you'll see these two pods running; they are the exact pods for the job we named image-classification-pvc: worker-0 and worker-1, because the replica count is 2.

Now we can even look at the logs from one of these workers. Let's run that, and you can see the training job has already completed, because it was a tiny dataset and we only ran it for two epochs. The accuracy is not at all great, but that's okay; the idea is to show you how to use Kubeflow to run your training job without any hindrance. Our training job is over, and if we look at the pods now, you see they're no longer running. And we can even delete the whole job: we just copy this, say delete, and run it.

So what Kubeflow allows us to do is scale our machine learning workflow dynamically, so we don't have to worry about the infrastructure needed for ML training. And with EFS you get the flexibility to attach and share storage across your team, for different data scientists or different users, saving your training dataset in one central location that can be accessed by different people. If you look here in the EFS console, this is the place where I have my training dataset.
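The flow we just walked through (a TFJob manifest mounting the dataset PVC, applying it, watching the worker pods, and cleaning up) can be sketched like this. The job name, namespace, image URI, and mount path are placeholders modeled on the demo, not the exact files from the repo:

```bash
# tfjob.yaml (sketch): a Kubeflow TFJob with two workers, each mounting
# the EFS-backed dataset PVC at /train, where the training script reads its data.
cat > tfjob.yaml <<'EOF'
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: image-classification-pvc
  namespace: kubeflow-user
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                    # -> pods ...-worker-0 and ...-worker-1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: 111122223333.dkr.ecr.us-east-1.amazonaws.com/tf-flowers:latest
              volumeMounts:
                - name: dataset
                  mountPath: /train  # train.py reads the dataset from here
          volumes:
            - name: dataset
              persistentVolumeClaim:
                claimName: dataset   # the EFS PVC created earlier
EOF

kubectl apply -f tfjob.yaml                        # start the training job
kubectl get pods -n kubeflow-user                  # worker-0 and worker-1 appear
kubectl logs -n kubeflow-user \
  image-classification-pvc-worker-0 -f             # follow the training output
kubectl delete -f tfjob.yaml                       # clean up when done
```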
So you can access this not only from your Kubeflow users. Say, for troubleshooting, you want to attach it to an EC2 instance and explore something, maybe look at the training dataset for some ad hoc task: you can always click Attach, copy the command shown there, and mount this file system as NFS storage into your EC2 instance, provided you have all the permissions granted. The idea here is that with EFS, the same storage can be accessed from containers, from EC2 instances, from Lambda functions, and so on, which gives you a lot of flexibility.

And you don't have to provision it beforehand. If you notice, nowhere did we mention the size of the file system; it scales up and down automatically. Although when we created the volume we did specify a size, that was just to make Kubernetes happy: for Kubeflow, or Kubernetes in general, you need to state a size for a PVC, but from the EFS standpoint it is simply ignored, because EFS has no requirement for size.

That's all about it. If you want to go through this whole demo, here is the location: go to the Amazon EFS developer zone, and inside it you'll find the machine learning with Kubeflow on EKS with EFS section. The tutorial will guide you through setting up the whole environment on Cloud9, the training job, and a few other things, so if you want to try it out, feel free to go over there and give it a try. To reach that page, go into the learning page of the Amazon EFS developer zone; if you scroll down, you'll get some information about EFS, what it is and how it works in a little detail, and further down you'll see a section on different integrations. This is Amazon EFS with containers, and here you can see machine learning at scale using Kubeflow; click there and you'll go to the page we've just seen, and you can do it in your own account.

Okay, that's about it, so let's go back to our slides. Now that you've learned a little about how you can make use of EFS with EKS for Kubeflow, there are plenty of other Kubernetes and container-specific tutorials available on the Amazon EFS developer zone, which we saw a while back during the demo. Feel free to access that page and share your experience, and if you would like to contribute, you can send a pull request with your demo and we will add it to the repository.

Thank you so much for your time. I hope you learned a little about Kubeflow, EKS, and EFS, and I look forward to hearing your feedback once you try this in your account and share your experience. Thank you so much once again, and have a wonderful day ahead.