To kick off the series, we have Guillaume Moutier and Landon LaSmith from Red Hat here to talk about JupyterHub as well as other tools running on Open Data Hub with OpenShift Container Storage. If you have any questions, again, put them in the chat. Landon, please take it away.

So my name is Landon LaSmith. I work on the team that provides Open Data Hub, so I'm going to give a quick introduction to it before I turn things over to Guillaume. Open Data Hub is a project to demonstrate how easy it is to run your AI and ML workflows on OpenShift. It's a reference architecture, a project, and a community. We host all of our information on opendatahub.io, where you can find blogs, tutorials, and guides for the different components that we provide as part of Open Data Hub. The main entry point into Open Data Hub is the Open Data Hub meta operator. We call it a meta operator because it's an operator that deploys other operators or components, which can include the different products that are available in Open Data Hub. JupyterHub will be one of the main ones featured during this talk.

It's also a project that we're running internally at Red Hat for our internal data science and AI platform. The goal is to make it easier to control the entire lifecycle, from data ingestion to data transformation to modeling, so that data engineers and data scientists can run their workflows on OpenShift. It's an entire end-to-end process. We have support for object storage through Ceph object storage; we've hosted many workshops and workflows with Ceph as the storage of choice to keep the data hybrid. Spark and JupyterHub with TensorFlow support are the main projects being used by the data scientists, and we're also bringing Open Data Hub in line with Kubeflow, because we want to make sure it's compatible with the upstream projects the community is using.

This is an overview of the main features of Open Data Hub and a breakdown of the different components that satisfy the needs of each target audience. For storage, we support several options: Ceph object storage, plus Postgres and MySQL for database access. You can interact with those using our data catalog component and Superset. And for the data scientists, we have support for a lot of the major libraries used in their workflows. One of the problems Open Data Hub looks to solve is teams of data scientists, developers, and engineers working together: a common platform that takes care of all the deployment headaches and upgrades that come with running these different products. We want it to be easy and intuitive to use while eliminating a lot of the maintenance burden.

Here is a list of the major components available in Open Data Hub. The current version is 0.5.1, and it's available in the OpenShift OperatorHub, so you can deploy it now and test it out. We have Prometheus and Grafana for monitoring, Seldon for serving your models, Apache Spark for data engineering and data analytics, and JupyterHub, one of our core components, for multi-user Jupyter notebooks. There's Ceph for object storage, Kafka for streaming events, and Argo for pipelines. Additionally, with the most recent version we've added Superset for data exploration, and our data catalog, which combines multiple components so that you can query, visualize, and analyze your data.
And here are a few of the upstream communities that we're working with. Open Data Hub, and specifically JupyterHub, has support for GPU workloads, and on opendatahub.io we have docs for enabling GPU nodes in your OpenShift cluster. We're actively working to bring Open Data Hub in line with Kubeflow, so stay tuned for more information about that. By upstream components, we mean projects developed outside of Red Hat that we want to work with; we don't customize them, or at least we attempt not to, and instead work with the pure upstream components. And again, this is on OperatorHub, so in your OpenShift cluster you can find us and deploy the operator to test it out.

Coming to use cases: Jupyter as a service is our main entry point into Open Data Hub. Jupyter notebooks are what a lot of the teams of data scientists and data engineers are using to interact with their data. They have data coming in from multiple sources, from Kafka events to Ceph object notifications, and they want to be able to do model training on it using GPUs. Internally, we're running this at scale: we have 40-plus concurrent Jupyter instances across multiple GPU nodes, peaks of 13,000 events per second in Kafka, and 350-plus gigabytes of data transmitted daily, all using Ceph object storage. It's self-service: with our teams of data scientists and data engineers, we can just point them to a link to access a Jupyter notebook. It's completely customizable, so if a team wants to use TensorFlow notebooks, SciPy notebooks, or any of those notebooks with GPU access, they can. They can go through the whole development lifecycle of modeling, testing, and iterating with full access to the resources available in the cluster. And there's support for multi-tenancy: Open Data Hub runs within a namespace, so if you have different needs, restrictions, or capabilities for different teams, you can run independent instances in separate namespaces within your cluster. As for pooling and sharing of resources, with the OpenShift model you can request the resources available in the cluster specific to your needs; JupyterHub and Open Data Hub have full support for that, and it's multi-tenant.

So that's a quick intro to Open Data Hub. If you want more information, please go to opendatahub.io. We have a lot of getting-started guides and tutorials on GPU enablement, deploying Ceph object storage in your cluster, and using the different components of Open Data Hub. Our Open Data Hub group is currently available on GitLab, so you can follow that to get more information about development of the operator. We have an ODH Kubeflow GitHub project that we're currently transitioning to, where you can find more information about that initiative. And join our mailing list if you want updates whenever we release new versions of the operator, or about the project in general; feel free to sign up for that. We also give a lot of talks and workshops at conferences, and you can find most of our conference videos on the AI/ML playlist on the OpenShift Commons YouTube channel. So with that, I will turn it over to Guillaume.

Thank you Landon. Hi everyone. Okay, so as Landon explained, Open Data Hub is a fantastic tool for your data science platform. But of course, if you want to do data science, you have to get data. And that means you have to store it somewhere.
And that's where OpenShift Container Storage can come into play and help you with this. OpenShift Container Storage was released a few weeks ago, and basically what it does is bring you all the different types of storage you may need directly inside OpenShift. What it is, in fact, is Rook controlling and deploying Ceph underneath, running directly on your OpenShift cluster, plus NooBaa, the Multi-Cloud Gateway, which brings you object storage with other fancy features that I encourage you to look at. What we have to retain here for this demo is that with OCS deployed, you can get block, file, and object storage directly from within OpenShift, using the same tools as you usually do with OpenShift: creating persistent volume claims, or using standard YAML files to provision all the different types of storage that you need.

So what we'll do here is leverage OCS to provide different types of storage directly inside Open Data Hub, and especially inside JupyterHub. Of course, you can do it manually, that is, provisioning object storage for each user one by one. But what I wanted to do in this demo is push it a little bit further, to demonstrate how everything can be fully automated in such a platform. Before someone asks: all the code is available at this repo, so you will be able to reproduce it and adapt the bits and pieces of what I did to suit your case.

What I want to do here is have my Jupyter environment using two types of storage. For my standard files that we see here, the notebooks and some other files, I want to use, let's call it, standard storage. So I will make a persistent volume claim that uses a storage class provided directly by OCS; here it will be block storage, automatically provisioned for a new user. But at the same time, I also want to provide object storage to my user. So I will make an object bucket claim, which will automatically create a bucket through OCS. And as a fancy extra, I will display it directly inside my Jupyter environment, and everything will be fully automated, which means the user won't have to manipulate any access keys or secret keys.

This is what we're going to do. For that, there are only two prerequisites: have OCS installed, of course, and take note of the S3 endpoint that you will use; and have a project where Open Data Hub is deployed, which is quite easy to do (I do it almost twice a week right now) because the operator is so great. It's a very easy way to deploy your data science platform. Once you have both of these, what we will do is use a custom JupyterHub config. This is some configuration that will be appended at the end of the JupyterHub deployment. What this code does is, each time a user logs in and launches their notebook, it creates a new object bucket claim if there is none present, retrieves the configuration and the access and secret keys for that specific user, and injects everything as environment variables into the user's notebook pod. Then we will deploy Open Data Hub itself with some specific configuration: we will use the custom config map that we created before, which does all the things I explained; we will enter the S3 endpoint URL, which for a standard OCS deployment is as simple as the s3 service in openshift-storage, the namespace in which OCS is deployed; and we will indicate the storage class to use when creating standard PVs for users.
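To make that hook concrete, here is a minimal sketch of what such a pre-spawn hook can look like, assuming JupyterHub's KubeSpawner and the official `kubernetes` Python client. The namespace, storage class, endpoint, and claim naming are illustrative assumptions; the actual code in the repo may differ in naming and error handling.

```python
# jupyterhub_config.py fragment -- a minimal sketch of the per-user bucket
# automation described above, assuming KubeSpawner and the official
# `kubernetes` Python client. Names below are illustrative assumptions.
import base64

from kubernetes import client, config
from kubernetes.client.rest import ApiException

NAMESPACE = "opendatahub"                        # assumption: ODH project name
STORAGE_CLASS = "openshift-storage.noobaa.io"    # assumption: NooBaa OBC class
S3_ENDPOINT = "http://s3.openshift-storage.svc"  # assumption: internal S3 service

def create_obc_and_inject_keys(spawner):
    """Create an ObjectBucketClaim for the user if none exists, then expose
    the resulting bucket name and S3 credentials as environment variables."""
    config.load_incluster_config()
    # A real implementation would sanitize the username for Kubernetes naming
    # rules and wait for the claim to be bound before reading its outputs.
    name = f"obc-{spawner.user.name}"
    custom = client.CustomObjectsApi()
    try:
        custom.create_namespaced_custom_object(
            group="objectbucket.io", version="v1alpha1",
            namespace=NAMESPACE, plural="objectbucketclaims",
            body={
                "apiVersion": "objectbucket.io/v1alpha1",
                "kind": "ObjectBucketClaim",
                "metadata": {"name": name},
                "spec": {"generateBucketName": name,
                         "storageClassName": STORAGE_CLASS},
            },
        )
    except ApiException as e:
        if e.status != 409:  # 409: the claim already exists for this user
            raise
    # The OBC controller creates a ConfigMap (bucket name) and a Secret
    # (access keys) carrying the same name as the claim.
    core = client.CoreV1Api()
    cm = core.read_namespaced_config_map(name, NAMESPACE)
    secret = core.read_namespaced_secret(name, NAMESPACE)
    spawner.environment.update({
        "S3_ENDPOINT_URL": S3_ENDPOINT,
        "BUCKET_NAME": cm.data["BUCKET_NAME"],
        "AWS_ACCESS_KEY_ID": base64.b64decode(
            secret.data["AWS_ACCESS_KEY_ID"]).decode(),
        "AWS_SECRET_ACCESS_KEY": base64.b64decode(
            secret.data["AWS_SECRET_ACCESS_KEY"]).decode(),
    })

c.KubeSpawner.pre_spawn_hook = create_obc_and_inject_keys
```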
Of course, all the command lines for this are in the code available in the repo. We will also have to create some roles and role bindings, because our special code will create new config maps and will need access to some secrets, where the access keys and secret keys are stored for the users. So it's only about defining a new role that allows JupyterHub to get secrets and to create object bucket claims, and then binding that role to the service account that JupyterHub runs under. Only a few commands to run.

We will also use custom notebook images. This is not mandatory; the custom notebooks are for displaying the object bucket storage directly inside your notebook. We use the HybridContentsManager and the S3ContentsManager here, from a very interesting open source project, and that is what allows us to show standard PVs and object storage side by side within the same notebook environment; I'll show a sketch of this configuration in a moment. If everything works well, what we should see when we spawn a new notebook is our standard PV connected and providing the usual file system, but also a connection to the object bucket that we created.

So let's see it in action. Here I have my project, ODH, on OpenShift. That's Open Data Hub, and we can see that the operator is already deployed along with a JupyterHub instance, so it's ready for us to use. If I go to JupyterHub, through the route that was created when JupyterHub was deployed, I can sign in with OpenShift, and here I've created a bunch of fake users. We'll start with a new one, Nicole, who has never connected to Open Data Hub. So, hello there. Here you can see the different notebook images we can choose from, and we will choose the custom one we provisioned before, which I called s2i-minimal-s3 because we added an automatic connection to S3. I don't have to enter anything here, because everything will be automatically provisioned and injected inside the environment. Then I spawn my notebook; it will take a few seconds. If we go back here, we can see that there is a new container being created. That's the notebook environment for the user. We'll have to wait a few seconds. Okay, so now it's running.

And there we have it: the environment that was just created for Nicole. We can see that there are no files yet, because it's a brand new PVC, but there is already a connection to an object bucket here, which is labeled "Data Lake" followed by the name of the bucket. It's not very fancy; I should definitely change the code for a better display. But this is the object storage, and of course you can browse into it. If we take a look at what happened behind the scenes, we can see in the storage view that a new PVC was created for Nicole, with the default size of two gigabytes. That's the standard claim that was made to provision Nicole with a new storage space. What also happened is that an object bucket was created. We can see the bucket claim for Nicole, the config map showing that a bucket was created, and also a secret. The secret holds all the information required to connect to this specific bucket, and those are exactly the environment variables that have been injected inside the notebook.
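For reference, the notebook-side configuration that consumes those injected variables can look roughly like this. It's a sketch assuming the open source `hybridcontents` and `s3contents` packages (the contents managers mentioned above); the environment variable names match the hub-side hook sketched earlier, and the "datalake" folder name is an illustrative choice.

```python
# jupyter_notebook_config.py -- sketch of the hybrid contents setup, assuming
# the `hybridcontents` and `s3contents` packages. The environment variables
# are the ones injected by the hub-side hook sketched earlier (assumption).
import os

from hybridcontents import HybridContentsManager
from notebook.services.contents.largefilemanager import LargeFileManager
from s3contents import S3ContentsManager

c.NotebookApp.contents_manager_class = HybridContentsManager

c.HybridContentsManager.manager_classes = {
    # Root of the file tree: the user's regular PV-backed filesystem.
    "": LargeFileManager,
    # A virtual folder backed by the user's personal object bucket.
    "datalake": S3ContentsManager,
}
c.HybridContentsManager.manager_kwargs = {
    "": {},
    "datalake": {
        "endpoint_url": os.environ["S3_ENDPOINT_URL"],
        "access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "bucket": os.environ["BUCKET_NAME"],
    },
}
```

With something like this in place, the root of the file tree stays on the user's PV while the datalake folder is served straight from the object bucket, which is why both show up side by side in the same environment.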
Okay, so here we can see them: we have the access key and the secret key. That's what allows the notebook to connect directly to the object storage and retrieve the data. If we take a look at NooBaa here, we have a list of the different buckets that have been created, and here we can see the one ending in 015: that's the bucket we are connected to. Okay, so there is nothing provisioned in it yet. So what I will do is just change users. I will stop my server here, which will tear down the environment; we can see that the pod is being terminated. And I will log out and log in again as another user, Frank. Frank has already connected before, so his storage has already been provisioned, both the PVC and the object bucket claim. Just wait a few seconds; it's launching, the container is created, and then we have access to his workspace. Here we can see that Frank has already been working on some notebooks, doing some Keras model training and things like that. Of course, we have reconnected him directly to his PV, but we have also reconnected him to his data lake environment. Here is the folder with the bucket that was created especially for him.

What I did also is a little trick, because I wanted to create some object storage space that could be shared between each and every user. So what I did in NooBaa is create a bucket which I called shared-data. Of course, you can also do all of this programmatically. Here I have allowed access to this bucket for this account, which is the one from Frank, and while I'm there, I will also connect Nicole to it. That's what allows me to directly show this bucket, because what the special code we injected inside JupyterHub does is list all the buckets each user has access to, and link and show all those buckets directly inside the environment; there's a sketch of that discovery logic below. So that means Frank can go to the shared-data folder and see that there are already some files he can use. Those are images to train his model for pneumonia detection, and there is a credit card CSV file. It's a great way to have a central point where all the users can share data sets; it keeps people from copying the same data sets over and over for training, and there are standard tools and files that you want to be able to share between people. That's a great way to do it.

So here I'm going to log out of this environment again and go back to Nicole because, as you remember, I have now allowed her to access the shared object store. If I launch her environment again and wait a few seconds, okay, it's running. We see that now she has her own PV, with no files yet in there, and her own object store, but she also has access to the shared-data object store. And that's again a pretty neat way to set up your data science platform so that everyone can collaborate.

So in this quick demo, which of course you can reproduce since, as I said, the code is available (and I will show you all the resources again here), I showed you that it's quite easy to set up a full data science platform with fully automated storage provisioning for your users, with both standard block storage and object storage. We could do the same with CephFS for shared file systems, and everything can be totally automated using standard Kubernetes and oc commands.
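The bucket-discovery trick just described can be sketched roughly as follows, continuing the notebook config from the previous snippet and assuming boto3. Whether discovery runs at notebook startup, as here, is my assumption about how the injected code behaves; the real implementation may differ.

```python
# jupyter_notebook_config.py (continued) -- hedged sketch of the
# bucket-discovery idea: list every bucket the user's credentials can see
# and mount each one as its own folder in the hybrid contents manager.
import os

import boto3
from s3contents import S3ContentsManager

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# A shared bucket (e.g. "shared-data") shows up here as soon as the
# user's NooBaa account has been granted access to it.
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    if name == os.environ["BUCKET_NAME"]:
        continue  # the personal bucket is already mounted as "datalake"
    c.HybridContentsManager.manager_classes[name] = S3ContentsManager
    c.HybridContentsManager.manager_kwargs[name] = {
        "endpoint_url": os.environ["S3_ENDPOINT_URL"],
        "access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "bucket": name,
    }
```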
So that was it for me. Back to you, Karina, and I think we have some time left for questions.

Thank you Guillaume and Landon. I am Karina Angel, and again, this is the first in the series of all-things-data OpenShift Commons briefings. To look at the briefing calendar, go to commons.openshift.org; we'll add more there. Diane runs so many great briefings, so make sure to look at the YouTube channel and watch previous briefings. If anybody has any other questions, please put them in the chat now. Thomas asks: are there decks available? Yes, we'll follow up with you afterwards, so thank you for your question. Great, thank you everybody, and we will see you, hopefully, next week.