All right. So, Anish, welcome. I was telling people very briefly, while we were getting logged in, that we have a punchy relationship, but I certainly gave you credit for the fact that we were a bit early moving into your talk. Anish was my intern. What did we figure, three years ago now? Something like that, yeah.

You know, as much as you give me old jokes, it's been about three years. I don't even know anymore; ever since I graduated, time has no meaning.

But yes, he was an intern back when I was a TMM for OpenShift. He helped me out at the time, learning about our partners and OpenShift-y stuff, so I'm very glad he came back to Red Hat. It's really cool that our interns manage to stick around, and it gave me this contact to reach out to for this talk, because like I said, we needed some kind of ML talk. There was really no getting away with a whole Python track without one. So with that said, I'm going to turn it over to you to get started.

All right, thanks, Jason. Okay, I think we have it shared.

Yep, sharing looks good, and voice on both of you sounds good. Fantastic.

All right, great. Yeah, I'll kick us off then. Hey everyone, I'm Anish. I'm a software engineer with the AI Center of Excellence, and I've been on the team for about two years now. I'm primarily focused on the Open Data Hub and running it internally at scale.

And I'm Chad Roberts. I've been a Red Hatter for a bit over nine years now. For a little over a year, I've been in the AI CoE, working with the Open Data Hub and AI Library projects.

To give a brief overview of what we'll be talking about today: we'll be discussing some of the problems that data science folks face today and how the Open Data Hub project started internally to address them. We will then introduce what the Open Data Hub is and where it's used, and finally we will do a deeper dive into some of the technical details as well as a roadmap for the near future.

So one of our main focuses is discussing intelligent applications, right? How you develop them, and what role the platform plays in this full picture. "Intelligent application" sounds like a buzzword, but basically they're really just like any other application, except that they rely on machine learning in some way or another. And despite the fact that the ML portion of your code may be a very small contribution to the entire code base, there are a number of additional considerations you have to make during development. Specifically, you need to consider both your data pipeline and your machine learning pipeline, in addition to your standard developer or application pipeline. If you're not careful, these three somewhat independent pipelines can start conflicting with each other and giving you a lot of headaches, so it's always best to keep them aligned from the beginning.

So the data pipeline is really just about gathering and storing data, right? This data could be data sets which are curated ahead of time, data that you're scraping from online sources, or something you came up with on your own machine. And this data needs to be stored; it could be in Ceph buckets spread out across clusters in different regions. And if you actually want to use this data, you need to clean it, perform some data transformations, and then store it in a place that your team can access.
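To make that gather-clean-store step concrete, here's a minimal sketch, assuming pandas with s3fs and pyarrow installed and an S3-compatible endpoint such as Ceph; the bucket names, endpoint URL, credentials, and column names are all placeholders, not something the platform prescribes:

```python
import pandas as pd

# Placeholder connection details for an S3-compatible store such as Ceph.
storage_options = {
    "key": "ACCESS_KEY",
    "secret": "SECRET_KEY",
    "client_kwargs": {"endpoint_url": "https://s3.example.com"},
}

# Gather: pull a raw scraped or curated data set out of a shared bucket.
raw = pd.read_csv("s3://raw-data/events.csv", storage_options=storage_options)

# Clean and transform: drop incomplete rows, normalize a timestamp column.
clean = raw.dropna(subset=["user_id", "event_time"])
clean["event_time"] = pd.to_datetime(clean["event_time"], utc=True)

# Store: write it back as Parquet where the whole team can reach it.
clean.to_parquet("s3://curated-data/events.parquet", storage_options=storage_options)
```

Writing the curated copy as Parquet keeps it compact and cheap to scan, which matters once the data gets big.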
The machine learning pipeline is the second stage of this whole picture, and it has its own challenges. You usually start with an exploratory, research-y kind of phase, where much of the stuff you build or investigate doesn't end up being used in your final product. You'll have expensive training jobs that you have to run, sometimes once, sometimes on a repeated schedule; again, it depends on the application. And finally, you have some sort of trained model that gets included in your application.
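Here's a minimal, self-contained sketch of that train-validate-serialize step using scikit-learn; the iris toy data stands in for your curated data set, and the model choice and the "model.joblib" file name are my own placeholders:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the curated data set produced by the data pipeline.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The (potentially expensive, possibly scheduled) training job.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"validation accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")

# Serialize the trained model so it can be shipped with the application.
joblib.dump(model, "model.joblib")
```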
That then shifts into the application pipeline. Even then, you will still have to monitor the performance of your model and your application, obviously, and then make changes as you need to.

So our focus is really on developing and supporting the development of intelligent applications, and that generally requires unifying these three pipelines. One thing I want to call out is that we don't want to do this for, like, classic machine learning applications, but for ones that are actually cloud native: intelligent applications for the hybrid cloud.

I'm sure everyone here is familiar with the notion of the cloud, and at Red Hat we have a strong focus on cloud development. So on top of everything I just discussed on the last slide, we want to make sure that the intelligent applications we are building, and the pipelines we are developing, are built from the outset to run as a set of containers or microservices, so they can be deployed directly into just about any cloud environment. Why would you want to do this? Well, it feels a little bit silly talking about the benefits of the cloud in 2020, but the key points are that it provides us with greater portability, agility, and scalability for applications. With the addition of the hybrid context, we can scale different parts of our application with different cloud providers, or on our own infrastructure, based on any number of criteria we or the business functions deem important.

So given all these benefits, how do you go about actually building one of these intelligent applications for the hybrid cloud? Well, a bunch of containers don't make an application on their own, and for that you really do need something like a container orchestration platform such as Kubernetes or OpenShift. These contain that last bit of magic sauce for your intelligent apps.

This is the bit where I'd ask people who have heard of Kubernetes to raise their hands, but, you know, times change. Thanks, Brad. So I'm assuming this is about 70-30, but let me actually give the full picture. Kubernetes is an open source project, originally developed by Google, to manage application deployment, scaling, and management. It's referred to as a container orchestration platform. The core conceit of Kubernetes is that the basic unit of an application is a container. And as most of us are aware, containers are great, but they can cause a lot of chaos, especially if you try to manually manage all the bits that go into having an application: how am I linking these containers together? How am I scheduling them? This is where container orchestration platforms come in; they abstract all that hard work away for you and give you one unified view for managing your containers. This makes it easier for us to control the security and resources for applications. It kind of goes without saying for people familiar with cloud-native development, but Kubernetes is very popular in this space now. It just suffers from a few of the same issues that other open source projects suffer from, issues that can keep them from getting widespread adoption in the enterprise space.

To address some of these issues, Red Hat decided to develop our own container orchestration platform called OpenShift, which really is an enterprise spin on, or version of, Kubernetes. The platform itself is open source, so anyone can go to okd.io and try it for yourself. In addition to all the nice stuff that Kubernetes gets you, OpenShift has a few extra bits and bobs for security, for CI/CD, and for DevOps.

Now that I have hopefully given you a quick intro to the hybrid cloud, intelligent applications, and the platforms they're developed on, I'd like to shift gears a bit and talk about how these platforms can benefit data scientists.

So, has anyone here tried emailing, or using Slack to drag and drop, their, you know, IPython or whatever notebooks to share with a colleague? How well does that usually work? Usually your colleague comes back with: "Do you have these ABC dependencies installed on your machine? Are they the correct versions?" Or: "Oh shoot, I forgot that you don't have a GPU on your machine, so this notebook won't work." Or: "Make sure you don't forget to change the code to point at the correct directory for your data." Or: "Make sure you've downloaded all the data onto your system," and with big data, that's not really realistic either. Even when you do get it ported over to your colleague's environment, it's really just a hassle that shouldn't have to exist. There isn't an obvious solution to the problem of sharing reproducible notebooks seamlessly, and that's just between two data scientists. Imagine you have a large team of data scientists who need to work on the same problem, or you need to hand over your application code or ML code to some software engineer to productionize it. This iterative process is just so much slower and less agile than it could be.

On a similar note, there's the need for GPUs or Spark clusters or specialized compute requirements. Companies are probably not going to buy every data scientist they have a high-end deep learning rig with all the latest gadgets. However, they would be more willing to invest in shared hardware that entire teams or organizations can use, and that's one of the core principles of the cloud, right? Elastic compute. That's a big advantage of using the cloud here.

The final thing I want to touch on is time to value. I think in recent years we're slowly moving away from the previous paradigm, but a common workflow has been: some business function comes in and tells the data scientist, "Hey, here's a problem. Look into it for two or three weeks and see if you come up with something." Then the data scientist goes into a cave, comes up with some sort of model, and tries to hand it off to a software engineer who builds an application that uses that model. It's slow. It's prone to errors, because the software engineer is trying to just get it working somehow and may lose key information or insights along the way. We need to shorten that time between notebook POC and deployed prototype; that's very important.

So these are some of the main problems that data scientists, and machine learning as a whole, face in the enterprise sphere, and I'd like to argue that a containerized approach to data science using OpenShift can help alleviate most, if not all, of these issues.
So how can containers and OpenShift help me, a data scientist, solve some of these problems? Well, as far as collaboration goes, notebooks become very portable and reproducible if you're running them as containers. A team of developers can be working off the same shared storage, the same shared data sets, and if I want to forward my notebook to someone, I can just share the image with them, and it should, well, I'd say it will be guaranteed to run correctly. You could also point them at exposed routes where your notebook is hosted, so they can just directly edit it.

When it comes to infrastructure, again: shared environments, the cloud. You have GPUs, you have Spark clusters, and this lets you perform computations which go far beyond what your laptop can do.

Both of these really help reduce that time-to-value piece, because all your ML code should be able to run as a container anyway, and if it can, it's very easy to plug into the larger applications that software engineers have already developed, or even if you just have to demo something to your boss. That, again, makes it much easier to go from POC to an actual deployed application.
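To illustrate that "plug it into a larger application" step, here's a minimal sketch wrapping the trained model from the earlier example in a small web service that could run as one container among many; Flask, the route name, and the port are my own choices for illustration, not something Open Data Hub prescribes:

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # the artifact from the training sketch

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = np.asarray(request.get_json()["features"])
    return jsonify({"predictions": model.predict(features).tolist()})

if __name__ == "__main__":
    # Bind to all interfaces so the service is reachable inside a container.
    app.run(host="0.0.0.0", port=8080)
```

Because the service needs only this file and the model artifact, it containerizes cleanly and can scale independently of the rest of the application.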
Next, I'd like to dive a little bit into the Open Data Hub. Initially, we were aiming to solve some internal problems at Red Hat around machine learning. Instead of each team maintaining their own infrastructure and developing their own support mechanisms for AI, we decided that joining forces internally to create the data hub, as a central place to share data as well as to run AI and ML workloads, made more sense. This way, end users could be freed from worrying about the complexity of these tasks and really focus on experimentation, and on using the results of that experimentation to drive value.

As we started building it out, we quickly found that data scientists and data engineers have very different requirements for their tools than, you know, your DevOps engineers do. They generally prefer workflows that are UI driven, that avoid using the terminal, and they expect the tools to include all their favorite, sometimes very specific, ML libraries. As the data scientists continued using the data hub internally, we started building out the Open Data Hub project based on addressing their needs and the challenges they were facing. We quickly realized that a lot of these challenges were also faced by our customers with their own AI/ML teams. For ML projects, you'll always have a team of data scientists, data engineers, DevOps, SREs, software engineers, product owners, business developers, the whole gamut, and they all need to collaborate and work together. Sharing and collaboration is very difficult, as I mentioned previously, and I think having an operator address these problems really helped speed up our velocity.

That's right. So now to step back a little bit and talk about developing an AI/ML platform. Most companies that are just getting started with adding AI to their products start with small teams that are tasked with investigating AI tools and platforms to use. The easiest path is generally to use a well-known, large, proprietary cloud platform. These platforms have most of the tools needed, and they're a very simple point of entry for users. This is great for that initial prototyping phase, but proves pretty expensive when you move to the production phase of your AI/ML lifecycle. Users also find that their work and their processes are locked into the specific cloud they've chosen. To help alleviate this, one of the guiding principles of the Open Data Hub was to give users more flexibility, by using open source tools and technologies wherever possible, and by allowing users to install the Open Data Hub on the hybrid cloud.

So what is it? The Open Data Hub is a meta-project to integrate open source tools and provide an end-to-end AI/ML platform on OpenShift. To break that out further, a meta-project integrates multiple open source projects into one project that is easily deployed and managed by users.

Now, if you look at the diagram, you can see that there are three main personas, or sets of requirements, or pipelines, that we're addressing. The first is that of the data engineer: all your data pipelines. The AI/ML workflow starts with prepping and ETL-ing the data into some data lake or storage system. This data needs to be stored efficiently if you're operating with tons of data, because otherwise it gets very expensive, and it needs to be easily accessible by data scientists.

The next phase is the actual model development, your ML pipeline. This includes exploration and analysis of incoming data, feature selection, model creation, training, and validation, in a cycle. Once you've gone through it a few times and you're happy with what you've come up with, you actually have to serve your ML model in the production environment. This phase should not be a static, fire-and-forget deployment, but a constant series of optimizations. Once the model is served, you need to continue monitoring its performance and make optimizations when necessary, or in some cases scrap the model and come up with something from scratch. I've never had that happen to me, but I've heard stories of it happening. This cycle of monitoring, optimizing, and serving requires input and collaboration from everyone I've mentioned so far.
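As a tiny illustration of that monitoring step, here's a sketch that re-scores the model from the earlier examples against a fresh labeled batch and flags it for retraining when accuracy drifts; the baseline and tolerance numbers are arbitrary placeholders, and a real deployment would feed this from live traffic rather than a hand-built batch:

```python
import joblib
from sklearn.metrics import accuracy_score

model = joblib.load("model.joblib")  # the served model from the earlier sketch

def check_model_health(X_recent, y_recent, baseline=0.95, tolerance=0.05):
    """Re-score the model on a fresh labeled batch; flag it when it drifts."""
    accuracy = accuracy_score(y_recent, model.predict(X_recent))
    if accuracy < baseline - tolerance:
        print(f"accuracy {accuracy:.2f} is below "
              f"{baseline - tolerance:.2f}: time to retrain or rethink the model")
    return accuracy
```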
So next I'll talk about where we're using it. As I mentioned earlier, we have a lot of internal teams at Red Hat using the Open Data Hub to run AI/ML workloads. We have tons of data scientists working with their own data sets on projects that I'm not necessarily privy to, but we provide the platform for them, and using OpenShift and operators has really simplified our lives; it's made it easier to keep the platform running and for us to adjust things where customers need them, you know: "Hey, I need 64 gigs of memory for something."

Talking about some specific use cases: the Products and Technologies DevOps team has product release pipelines that generate a ton of runtime logs, and all these logs are stored in the data hub. We're engaged with them on some anomaly detection projects, to figure out which logs we actually care about, since I think most of us generally ignore logs. We also have operational metrics for OpenShift clusters stored in the data hub. This data is used for a number of AI initiatives, as well as, simply, for PMs or team leads to make more informed decisions about what work they need to prioritize. We also have customer support data that's used in much the same way, things like support reports and customer feedback forms.

Now, we don't just have it running internally at Red Hat. We also have the Open Data Hub installed in the Mass Open Cloud, or MOC for short. The MOC is a collaborative effort amongst a number of universities in the greater Boston area, as well as some partners in industry, to provide a public cloud based on the Open Cloud Exchange model. It's being used by many data science students and researchers for their research work, and it serves as another proof point that open source systems can be built for these large AI workloads. We also have a number of field deployments in verticals such as telco, finance, and oil and gas, which have been very successful at what they were trying to achieve.

Next, we'll have Chad talk through a little bit about the technical details.

All right, thanks, Anish. So I'll start here with an Open Data Hub centric view, to give some context and help you see where and how Open Data Hub fits into the bigger picture. Being open and flexible enables Open Data Hub to work well with a wide variety of products in real-life use cases, and we're working with an ever-growing list of Red Hat certified partners on integrating their technologies into Open Data Hub, to provide the highest quality tool set possible.

On to more of the techie side of things, I guess. Open Data Hub relies on an OpenShift operator to manage the deployment of the components. For those of you maybe not too familiar with the concept of operators on OpenShift, here's a super bare-bones description. Operators live on your cluster, where they constantly monitor your desired state; in our case, that's a set of AI and ML component installations, you know, like your JupyterHub, etc. The operator compares that against the actual state, what is currently running on your cluster, and from there it manages the lifecycle of your installation, including scaling up or down as required, based on what you've defined in your custom resource. To change your deployment, you can edit your custom resource, and the operator will spring into action and instantiate those changes.
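To make that compare-and-converge idea concrete, here's a purely conceptual sketch of a reconcile loop in Python; it is not the operator's actual implementation, and the component names and stubbed-out actions are made up for illustration:

```python
def deploy(name, spec):  # stubs standing in for real Kubernetes API calls
    print(f"creating {name} with {spec}")

def update(name, spec):
    print(f"updating {name} to match {spec}")

def remove(name):
    print(f"tearing down {name}")

def reconcile(desired, actual):
    """Converge what's running (actual) toward the custom resource (desired)."""
    for name, spec in desired.items():
        if name not in actual:
            deploy(name, spec)   # requested but missing: create it
        elif actual[name] != spec:
            update(name, spec)   # present but drifted: reconfigure or rescale it
    for name in set(actual) - set(desired):
        remove(name)             # running but no longer requested: delete it

# Example: the user edits the custom resource to add JupyterHub and
# scale Spark down; the operator converges on its next pass.
desired = {"jupyterhub": {"replicas": 1}, "spark": {"workers": 2}}
actual = {"spark": {"workers": 5}}
reconcile(desired, actual)
```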
Okay, so some of you may already be familiar with Open Data Hub, maybe fewer here than at some other places we've talked. If you used it prior to about June of this year, you used what we now call our legacy version, which is based on our Ansible operator. Even though we have a new version, which I'll be talking about, it's worth mentioning that our legacy series is still available in OperatorHub, and it continues to work as it always has. One important thing I'll note here is that the legacy version supports deployment to a single namespace, which differs from the newer offering I'll be talking about. Going forward, it's unlikely that we'll add additional features there, but we still certainly will do bug fixes and release those.

A little background for the next part. Around the end of last year, 2019, we made the decision to take part in the Kubeflow project by taking their manifests, and by manifests I mean the cookbook of how to install the components, and making sure that the components could be run on OpenShift. As part of that, we generated a set of manifests that would run Kubeflow on OpenShift and contributed those back up to the upstream Kubeflow. We saw it as a good logical fit for us, given the similarities between Open Data Hub and Kubeflow: both provide great tool sets for running containerized, scalable AI and ML workloads. And it turns out that a lot of the people we heard from who were interested in Open Data Hub were also interested in Kubeflow. So we took that and used it as the inspiration for reworking Open Data Hub a bit.

Starting with our spring release, which I think was in June this year, Open Data Hub is a downstream project of Kubeflow. This relationship really lets us provide the best, most robust set of the in-demand technologies, through embracing the power of open source communities, really. We still carry forward the great set of Open Data Hub components that ODH users have been accustomed to, and we're also able to provide the additional components that are part of Kubeflow. And we're able to do it on OpenShift, which is a big win for those who need the enterprise-level everything that comes with OpenShift. In typical Red Hat fashion, our intent is to be as active as possible in the upstream communities, just as we have been since the inception of ODH, providing insight and solutions so that everyone can benefit.

So given all that, we're now up to Open Data Hub operator 0.8, which is based around the Kubeflow operator and manages the lifecycle of a Kubeflow or ODH installation. Under the covers it uses kustomize, that's kustomize with a "k", against a set of manifest files that define the elements that make up your components: the images being run, environments, other configuration, secrets, what have you. Development of those manifests is fairly straightforward: you can develop locally and just run the kfctl CLI locally, and you're probably in pretty good shape when you try out the operator with them, since they both rely on the same underlying kustomize code. Our intent is to do all the development in the upstream community and only use our own repo for ODH branding, and we've already been able to make some pretty significant upstream contributions.
So the upstream/downstream relationship seems to be off to a good, healthy start.

I noted a couple of minutes back, when I was talking about the legacy version, that it only supports a single-namespace deployment. Here I'll mention that the new operator currently only supports a cluster-wide deployment, so the operator will watch all namespaces, not just a single one. If you're in OperatorHub in the OpenShift console and you go to install it, this version is the beta channel; I'll talk more about that later. Our first steps were to convert all of the existing ODH components so that they could be managed with the new operator, and as of today, I think we've accomplished that.

You've probably seen pictures similar to this several times, but I think it's an effective way to visualize how the various components of Open Data Hub and Kubeflow can come together to cover a wide range of users and use cases. Of course, if you don't see a project that you think you'd like to see here, or that should be here, drop by our community meetings and let us know.

All right, since this is a Python-focused track, I should talk about the part of ODH that might be of the most interest: JupyterHub and the notebooks. Our recent release of ODH includes some enhancements to the JupyterHub installation that's part of ODH. The ODH JupyterHub install is now nicely integrated with OpenShift and OpenShift authentication. You can use your own custom notebook images if you have some, or you can use our curated and tested images, which feature up-to-date versions of popular tools and libraries, things you would expect to find, like SciPy, TensorFlow, and Spark, as well as the newest addition, I think, which is Elyra. If you're into running pipelines, that's a great place to look; you'll be interested in using that.

As you can imagine, we've been super busy getting out our recent release, and we still have lots to do. We've recently beefed up our testing, being able to run inside the OpenShift CI system for all the PRs that you might submit, and in our nightly runs. We plan to add OpenShift Container Storage; that'll be great. Right now we have Kubeflow 1.0 running on OpenShift, and we plan to keep updating all the manifests so you can keep getting the latest Kubeflow bits as well. You can also now mix components from ODH and Kubeflow, and we have instructions for how to do that, so if you want your PyTorch operator running with the other ODH components, you can definitely do that. For our next release at the end of October, we'll have all of the ODH components running on UBI-based images, the Universal Base Image. Then there's disconnected deployment, the air-gapped deployment; that's an often-requested feature, and it's on the list as well. And of course, upstream contributions wherever possible.

If you want to give it a try, it's pretty straightforward. You can go to OperatorHub inside your OpenShift console and search for Open Data Hub. You'll click install, then subscribe, and you'll want to make sure that the beta channel is selected; I think it's the default. You'll create that KfDef object; the one from our repo is here on the slide, and it installs everything, but you can make a custom one easily by just removing the components that you don't need. Then you can head right over to opendatahub.io and try out a tutorial.
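If you'd rather script that last step than paste YAML, here's a hedged sketch using the Python kubernetes client; the KfDef body below is abbreviated and illustrative (the group/version shown is the usual KfDef one, and the namespace and application names are placeholders), so start from the full KfDef in our repo for a working application list:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in a pod

# Abbreviated, illustrative KfDef: the real one from the ODH repo lists
# every component; customizing is just deleting the entries you don't need.
kfdef = {
    "apiVersion": "kfdef.apps.kubeflow.org/v1",
    "kind": "KfDef",
    "metadata": {"name": "opendatahub", "namespace": "odh"},
    "spec": {
        "applications": [
            {"name": "odh-common"},
            {"name": "jupyterhub"},
        ],
    },
}

# Hand the custom resource to the cluster; the operator does the rest.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kfdef.apps.kubeflow.org",
    version="v1",
    namespace="odh",
    plural="kfdefs",
    body=kfdef,
)
```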
So here's the wrap-up slide, I guess, to sum it up. ODH is, of course, an open source community, and we welcome engagement and contribution. Our meetings are every two weeks, on Mondays at noon Eastern US time; it would make my day to see you there. If you can't make it, hit us up on the mailing list or through GitHub; we try to be super responsive. And there's our top-level site, opendatahub.io, with all the docs and tutorials. Thanks so much.