Hello, everyone, and thank you so much for coming to my talk today. It's titled "Data Science and Cloud Native Development is Awesome." Just to kick things off, my name is Michael Clifford. I'm a data scientist, a manager, and a project lead with the Office of the CTO at Red Hat. I'm primarily focused on applying machine learning to problems in the IT domain, as well as developing some best practices around cloud native MLOps. I'm also interested in what tools and structures we can put in place to help teams of data scientists work together more effectively in this kind of environment. If you have any questions or want to connect with me, you can reach me at my email here, or even better, on GitHub.

Today is going to be about what are, in my opinion, the benefits of embracing a cloud native approach to your data science workflow and infrastructure, and some of the open source tools that are available to support this kind of workflow. By the end of this talk, you should hopefully know more about cloud native data science and be set up to get going with some data science yourself in the cloud, if you want to.

So let's jump right into it. Before we talk about cloud native data science, let's get really clear on what we mean when we say data science and what the data science workflow is. I'm not going to go through this chart in detail, and I hope most of you have seen something very similar before, but to get started I thought it would be good to ground everyone in this loosely agreed-upon data science workflow. We start with some high-level business problem that we need to formulate into an engineering problem with clear and quantitative success metrics, as best we can. We then move on to data collection and exploration, then feature engineering, then model training, validation, deployment, and monitoring. There are, of course, a couple of places where we might need to loop back to a previous step given the results of a later one, and that's just the iterative nature of data science development.

What I hope you all see here is that, in my opinion, it is the responsibility of a single data scientist or a small team of data scientists to build the ML service all the way from problem definition to data wrangling to model training, serving, and monitoring. Some other diagrams I've seen divide this into three separate roles: a data engineer at the start, a data scientist in the middle, and an application engineer at the end. However, the overhead and development friction that can be incurred from having three separate roles involved here is something I think can be avoided. But how? There are obviously a lot of special skills and specializations needed at each stage that merit individual specialists. So what can be done to empower a data scientist to own a project from inception to deployment? I think it really comes down to having the right tools and infrastructure in place to support a team of data scientists. And what does that look like? Well, it looks something like this graph here. This is an example of how we address the problem on our team: we wrap the whole workflow into the cloud and support it with a suite of cloud native data science tools.
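Just to ground the model training and validation steps of that loop before we get to the infrastructure side, here is a minimal, purely illustrative sketch of training a model and checking it against a quantitative success metric in Python. The data is synthetic and scikit-learn is assumed; nothing here is specific to our projects.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in for the data collection and feature engineering steps.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Model training.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Validation against the quantitative success metric agreed on up front.
score = f1_score(y_valid, model.predict(X_valid))
print(f"validation F1: {score:.3f}")  # if this isn't good enough, loop back to an earlier step
```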
Now, you might be telling yourself: well, that's great, but clouds are expensive and complicated, and I don't know how to set up or operate my own cloud infrastructure, nor do I have any interest in doing that. Well, that's actually fine, because the Operate First project, an initiative that I'm involved with, provides a free and open cloud environment to anyone, including everyone here. There's currently an existing community hybrid cloud with clusters in the US, some in Germany, as well as some on AWS, and we're working on extending this community cloud even further. This is all managed publicly on GitHub through our Operate First organization. And not only is this currently running, there's also an associated community of data scientists that work on open source projects together within this environment.

So what do these clusters do? They run OpenShift. On them, we use GitOps best practices in public repos to manage everything, and we generate operational data that we make public as well. But for the purposes of this talk, the thing to note is that we run applications on top of everything, and the main application we are currently operating is the Open Data Hub, or ODH. I'll talk about that in a minute, but the main takeaway I want you to have from this slide is that there is a cloud already set up and waiting for you, and at the end of this talk I'll go through a demo of how to actually get up and running.

So what is Open Data Hub? Open Data Hub is what we refer to as a meta-project: it integrates multiple open source projects into one project that can be easily deployed by users, to provide an end-to-end machine learning platform on top of OpenShift or Kubernetes. This particular meta-project comprises things like Seldon, Elyra, Kafka, Spark, Grafana, and Prometheus, as well as Kubeflow, among others; it's still an evolving project. All of these things taken together provide, like I said, a comprehensive end-to-end machine learning development platform. So when we say we're running Open Data Hub, it really means we're running all of these services for data scientists. Open Data Hub is a huge project in and of itself, and if you're interested in getting more involved there, I encourage you to go to their website.

So as a data scientist, the Open Data Hub service operated on the Operate First cloud represents our main toolkit for doing cloud native data science development. But why bother with any of this? Why not just use my local machine? Well, for many reasons. As a data scientist, there are a few things that are important for me if I'm really going to be able to own my model from inception to inference and deployment. What are those things? I need compute resources to do my work; not everything can or should be run on my local machine. I need a solid way of collaborating with other data scientists and sharing notebooks and reproducible content. I need a way of creating that reproducible content reliably, ensuring that experiments and environments are always the same. I need an easy path to actually get my models out into the world, with simple deployments of my inference models. And I need ways to quickly and easily build interactive dashboards to share results with stakeholders.
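To make that last point about model deployment a bit more concrete: once a model is served by something like Seldon, which I mentioned as one of the Open Data Hub components, getting predictions back is typically just an HTTP call. Here is a minimal, hypothetical sketch; the host, namespace, and deployment name are placeholders I made up, and the payload follows Seldon Core's JSON ndarray protocol.

```python
import requests

# Placeholder route: a Seldon deployment typically exposes an endpoint like
# http://<ingress-host>/seldon/<namespace>/<deployment-name>/api/v1.0/predictions
URL = "http://cluster.example.com/seldon/my-project/my-model/api/v1.0/predictions"

# One row of input features, wrapped in Seldon's ndarray payload format.
payload = {"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}

response = requests.post(URL, json=payload, timeout=10)
response.raise_for_status()
print(response.json()["data"]["ndarray"])  # predictions come back under the same "data" key
```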
Obviously that list of needs is not complete or comprehensive, but I think it does a pretty good job of covering some of the major touch points needed to complete the data science workflow. So are these needs met by taking a cloud native approach? Well, I certainly think so, especially if you're using a tool like Open Data Hub. It provides us with JupyterHub, which gives an elastic compute environment with a data-science-friendly IDE where you can develop your work and run experiments. We also use GitHub as a place to share our work and collaborate. Beyond that, we employ a particular tool called AICoE-CI, which was developed by our team but can be integrated into really any GitHub repository, and which does continuous integration and delivery based on Tekton pipelines. These tools also help us build custom images, so that we can publish and share reproducible experiments and other content just by making GitHub releases. We can use tools like Elyra and Kubeflow to construct modular pipelines and machine learning workflows directly from our notebooks. We can use tools like Seldon for simple application deployment and model serving. And finally, we can use things like Trino and Superset to share interactive data content. So as you can see from this list, if you want to drive data science development from inception to deployment, there are a lot of cloud native tools available that really support this.
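Since pipelines will come up again in the demo, here is a rough idea of what that looks like underneath. Elyra gives you a visual editor for wiring notebooks into a graph and can submit the result to Kubeflow Pipelines; if you would rather express the same idea in code, a sketch using the Kubeflow Pipelines v1 SDK might look roughly like this. The image names and scripts are placeholders, not anything from our repositories.

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="toy-analysis", description="Two steps strung together into a small DAG")
def toy_pipeline():
    # Each step runs a container image; with Elyra these would be the
    # notebook images built for the project.
    fetch = dsl.ContainerOp(
        name="fetch-data",
        image="quay.io/example/fetch-step:latest",  # placeholder image
        command=["python", "fetch.py"],
    )
    train = dsl.ContainerOp(
        name="train-model",
        image="quay.io/example/train-step:latest",  # placeholder image
        command=["python", "train.py"],
    )
    train.after(fetch)  # simple linear dependency


if __name__ == "__main__":
    # Compile to a definition that a Kubeflow Pipelines instance can run.
    kfp.compiler.Compiler().compile(toy_pipeline, "toy_pipeline.yaml")
```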
Awesome. So that was a really quick and pretty high-level introduction to the benefits of embracing a cloud native data science approach, and some of the tools and resources that are available right now if you want to get involved. There's a lot of content and detailed tutorials around each of these tools and topics, which you can find on GitHub under aicoe-aiops in the data science workflows repository, if you're interested in digging a little deeper.

So now that I've hopefully piqued your interest in cloud native data science, let me show you how to actually access the Operate First community cloud and get started. How are we going to do that? Let's do a demo. I'm going to do a demo here, and all of you should be able to follow along without much issue. This demo is not going to go all the way through the process up to model serving, simply for ease of demo and time constraints, but if you are interested in that, like I said, the workflows GitHub repository and operate-first.cloud have more content and demos that will walk you through all of it.

So let's get started. The first thing we're all going to do is go to operate-first.cloud. This is our main space on the web. If you're interested in learning more about the project in general, whether the cloud management side of things or the data science side of things, this is the place to come. In particular, if you're interested in data science, the data science workflow content also lives here, so you're able to take a look at that. But today we're interested in accessing JupyterHub first. How do we go about doing that? Let's go to the Community Handbook and into the documentation. This takes us to a Jupyter Book that has a lot of information about how the Operate First environment is managed. Then we can come to the Open Data Hub section, which is, of course, the service we've been talking about throughout this presentation. Here we can see the currently deployed components, or at least some of them.

The main tool through which we interact with the Operate First environment and do our development work is JupyterHub, so let's go there. This will ask us to log in; if you have a Google account or a GitHub account, either should work for authentication here. I'm just going to go ahead and select Operate First to log in. What you're seeing now is the JupyterHub spawner page. I'm assuming most people here are familiar with Jupyter notebooks and JupyterLab, but you might not be familiar with JupyterHub. Essentially, JupyterHub is a tool for deploying Jupyter servers for users on top of OpenShift and Kubernetes, on cloud infrastructure.

The first thing you need to do when spinning up your Jupyter server is select the notebook image you want to use. Our team develops and maintains two different types of images. The first is what we refer to as a content image, something like the audio decoder demo notebook image. This is an image built from a GitHub repo that contains all of the requirements for building the image as well as all of the content, meaning the notebooks and the data, or ways to access the data, all packaged together. When you run this image, it comes with all of that ready to go: without error, you should be able to run all of the notebooks, get access to the data, and actually interact with this audio decoder demo without issue. Since it contains content, we refer to it as a content image, and that makes it one of the ways we effectively share our content with others and keep it reproducible. The other type of image we work with is a development image, something like the Elyra notebook image, minimal Python, or minimal Python with PyTorch. These don't come with any content on board; they're just a well-defined stack of packages that we want to work with. Every time I spin up the Elyra notebook image, I know I'm starting from the same base, which makes it really easy to ensure I'm working in the same development environment as a colleague who spins up another Elyra notebook image and starts working, because they're also starting from the same base. So it's really useful.

Today, because it's probably the most interesting, let's go ahead and spin up a content image for you to take a look at. We'll look at the OpenShift CI analysis notebook image. This has some content from a particular project we're working on. I won't get too far into the details, but it's loosely focused on taking some openly available CI data, related to Operate First development, and doing some analysis on it. So let's go ahead and select that. The other thing we want to do, because this is a shared compute environment, is define the size of the pod we're actually going to spin up. Maybe I'm just doing some code reviews and don't need many resources, so I can pick a small instance. Maybe I'm actually going to do some model training, so I pick a large instance, whatever I need at the time. So I can go ahead and do that.
And then finally, there's the option of setting environment variables. If I'm going to interact with S3 or some other remote storage and I don't want to put my credentials directly into my notebook, or have them somehow accidentally uploaded to GitHub, I can add them here so they get injected into my pod as environment variables.

So let's go ahead and start our server. The main thing to note while this is spinning up is that it is basically starting from scratch, deploying the image I selected, but it's also connecting to something called a persistent volume claim, which is kind of like your own little personal drive in the cluster. Any time you spin up a JupyterHub server, it will always connect to your PVC. So even though the server is ephemeral and starts from scratch each time, any work that you do, any code you write or data you generate, will actually persist between sessions.

Great, so now that our pod has spun up, it drops us into a JupyterLab environment. I'm hoping this looks pretty familiar to everyone here. Now that it's up, I can basically interact with it the way I would on my local machine or anywhere else. I can open notebooks and consoles. This particular image also includes Elyra, one of the tools I talked about earlier, which can be used for generating pipelines: we can take our notebooks, string them together into a simple graph, and generate little pipelines from them. It's a very useful tool. I can also spin up a terminal.

Here is also all the data that I'm using. This is my PVC, and all the projects I'm working on are here. We can see, from two minutes ago when we initially spun up the pod, that we actually got the content we're interested in, so now we have this OCP CI analysis repository to take a look at. One thing you'll see here is that it has a lot of folders and files, but the main thing to notice is that we use a project template in our group, so that as we move between projects we always know where the notebooks are supposed to be, where the data is, and so on. That's something you can find in our repositories as well, and it makes things really easy. It also comes with some particular configuration files that we use for our CI and other things, so if we use pre-commit or Prow, any of those things, they're all configured and set up through this template. We also use Pipfiles as our main way of managing dependencies. If we hadn't used a content image, we would have to install all of these dependencies using either pipenv or Thamos, which is an environment management tool that our team developed. But we don't have to do that here, because it should all just work.

So now let's see what we have. Let's go to notebooks, then data sources, then TestGrid, and open the TestGrid EDA notebook. Now we can go ahead and start to look at a particular notebook. Again, I'm not going to go into the particulars here, but assuming everything worked, we should be able to just restart the kernel and run all the cells, and everything here will work. So it's going to go ahead and run, hopefully successfully. Great, it looks like it's downloading some data, which is expected, so we know that it worked. Let me go ahead and stop that; we don't need more data being downloaded.
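Incidentally, this is where those environment variables from the spawner page come into play. A notebook that needs data from S3-style storage can read the credentials injected into the pod instead of hardcoding them. Here is a minimal, hypothetical sketch with boto3; the endpoint, bucket, and key names are placeholders, and the exact variable names depend on what you entered on the spawner page.

```python
import os
import boto3

# Credentials come from environment variables injected by the spawner,
# so nothing secret ever appears in the notebook or ends up on GitHub.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("S3_ENDPOINT_URL"),  # e.g. an internal Ceph or MinIO endpoint
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Download a raw data file into the PVC so it persists between sessions.
s3.download_file("example-bucket", "raw/ci-data.json", "data/raw/ci-data.json")
```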
Cool. So let's say I want to make a change, or I just want to see the changes that have been made to this notebook. One of the difficult sticking points with notebooks is that you can't look at a diff the same way you would with a Python file. That has to do with the fact that notebooks are rendered, and the raw content underneath looks something like this. A lot of things can change, like the execution count or the cell count, that really don't have anything to do with the development itself. So if you want to see what actually changed, there needs to be a way to do that. Another thing to think about with notebooks is that the code might not change much, but maybe the underlying data changes, and so the outputs change. You can make a very small change to the code, like picking a different dataset, which wouldn't necessarily seem substantial, but then the outputs can change dramatically. So if I want to do a code review or collaborate with a colleague, how would I go about doing that? Luckily, we can use the Git tool that's part of JupyterLab to see the differences between HEAD and the changes I just made. Here we can go through it and see that, simply because time has moved forward, we have a slightly larger dataset to access, some of the outputs have changed, and so on.

So we can use this to make some changes. Say we did make a change and we're interested in pushing it to GitHub and contributing it upstream. At this point, there are basically two things that will happen, and I'll share those with you now. The first is that when you push your changes and open up a PR, it will kick off the CI process. I'm not going to push a change here, because there's nothing that really needs to be changed, but let me show you one that already exists. If you come to aicoe-aiops/ocp-ci-analysis, which is where this project is managed, you can look at the pull requests and see one titled "reduce log classification service resources." We have Sesheta, one of the bots we use as part of the AICoE-CI process, here telling us that it's not approved yet for some reason. Then we have a number of tests that get run; on this one, at least, there's both a build check and a pre-commit check. The pre-commit check runs on Prow, so we can go and see the errors that occurred and address them. We can also look at the build check, which actually runs the code beforehand. These aren't saved indefinitely, but we can look at a pipeline run, look at a build request, and make sure everything built correctly. This is really useful for us: before the code gets merged in any way, these checks are kicked off automatically and everything is verified before any human intervention. So that's really nice.

So what's the other thing we can do at this point? Let's say I actually want to share this with somebody out in the world. How do I do that? I can quickly build an image and add it to that list of content images on Operate First. If we come to Project Meteor, at shower.meteor.zone, it allows you to put in any GitHub repository, and it will go ahead and kick off the build pipeline and build it as both a Jupyter Book, if it's configured correctly, and a JupyterHub image that can then be deployed on our JupyterHub instance.
So let me just show you an example. You put in a repository, for example the CI analysis one, pick a particular branch, and determine the components that you want to have built and deployed. And we have a number of available Meteors here. Let's say we're interested in the OCP CI analysis one: you can go here and see a fully rendered website with all the content, which you can use to share your work with folks. And then we can also go to JupyterHub. I'm not going to click on that, because it will just take me straight into JupyterHub since I'm already logged in, but ideally it will take you back to the spawner page, where you'll be able to select the image, Meteor KZ9KW, and it will act just like a content image.

Cool. So that is mostly everything that I wanted to go over with everyone; I hope that was helpful. The idea here is that everyone should have been able to start from nothing, get some compute resources, access a JupyterHub instance, understand how to interact with a content image, and know how to look at rich diffs of notebooks. And once you've made a change that you're interested in contributing, the CI process on GitHub is automatically kicked off. And if you want to share your content, it's very simple to build an image using the Meteor project. So I hope that was interesting and useful to people. If you have any questions, I think now's the time. Thanks so much.