Hello, and thank you for coming to our talk today. My name is Michael Clifford, I'm a data scientist at Red Hat, and I'll be co-presenting with my colleague Tom Coufal, who will be jumping in for the second half of the talk.

So, since we're going to talk about building an end-to-end analytics application using Open Data Hub, let's first make sure everyone knows what Open Data Hub is. Open Data Hub is an open source project that's been developed by members of our team in the Office of the CTO. They've been working on it for a while, and its goal is to integrate multiple open source projects into an end-to-end AI/ML platform for OpenShift. You can think of it as a blueprint for building an AI-as-a-service platform using Red Hat's Kubernetes-based OpenShift Container Platform and Ceph storage. As you can see from the myriad of logos here, Open Data Hub ties together a number of different technologies, such as Spark, TensorFlow, Kubeflow, Prometheus, and Grafana. Today, though, we'll be focusing on the tools we actually used for our specific application: Ceph, Jupyter, Argo, Superset, and Hue. That's all just to say that Open Data Hub is really flexible and has many more components than the ones we're going to talk about today, so I encourage you all to go to opendatahub.io and take a closer look at it.

Now that we know what Open Data Hub is, why would we want to use it for developing machine learning applications? In my experience as a data scientist, I've found that nailing down exactly how to integrate machine learning work into the application development lifecycle is still a bit of an open question. That's partially because, as data scientists, we usually want to work in a Jupyter notebook environment, which is really good for running experiments, exploring data, and sharing and explaining how a specific algorithm or implementation can contribute meaningful results to an analytics pipeline. However, while writing notebooks, data scientists are often focused more on the experimental results than on software development best practices. I think that's generally fine, but when it comes time to integrate the machine learning code written in the notebook into the larger application, it becomes the responsibility of either the data scientist or another software engineer to port that code into a production-ready format. In either case, some translation is happening, things get lost in translation, and this is where I've found errors are most prone to arise.

So, given the tools we have at our disposal with Open Data Hub, we wanted to see if we could smooth out this rough patch in the development process by avoiding the porting step altogether and making the notebooks themselves the core unit of code that's ultimately consumed by our application. And why would we want to do something like this? Primarily, to shorten the time to production for machine learning-based applications. Even if this doesn't result in completely production-ready code, it will very likely produce a high-quality proof of concept to get feedback on quickly.
It also keeps data scientists directly involved in the iterative application development process while letting them stay in their native Jupyter environment. And as I said, those two things together really help avoid some of the common pitfalls in the ML application development process.

Just to be really clear, this talk is more about the process and the tooling we've developed for working on analytics applications. But I always think it's useful to have a concrete example to help explain exactly what we're doing and how we did it. So for this talk, we developed a natural language processing application that regularly consumes data from the Fedora developers' mailing list, runs analysis on that data, and stores the results in Ceph so they can be displayed in an interactive public dashboard.

So what does that application actually look like? At a very high level, I think this image gives a good sense of how the different pieces fit together. We have Fedora's HyperKitty as the external data source that we want to pull from regularly. We then have JupyterHub, which is where all of our Jupyter notebooks and our code development live; this is where we write all of our data engineering, data processing, and analysis code. One thing we need to do here is make sure we're writing small, single-purpose, modular notebooks, to keep automation configuration simple and to make any necessary code changes easy along the way. Just about every one of these notebooks either pulls data from or pushes data to our remote storage in Ceph, for use in a later analysis or to be visualized by Superset. Argo is the tool wrapped around all of these notebooks in JupyterHub: it orchestrates the application workflow, defining how often to run our analyses and how the notebooks should interact with each other. Finally, we have Apache Hive and Superset, which work in tandem as the front end and user interaction layer of the application. As long as the Argo workflow runs and the notebooks push new results to our Ceph remote storage, the Superset dashboard stays up to date. And of course, this is all running on top of OpenShift.

So these are the tools we need to develop this application, but what is the application itself actually doing? Tom will go over this graph in far more detail in a minute, but I wanted to show it here so you can get a quick sense of what the actual Argo workflow looks like and how each of our notebooks interacts to compose our application. Just about every one of these green circles represents a Jupyter notebook that has some specific job, whether that's downloading data, pre-processing it, or running some analysis. Organizing the notebooks this way makes debugging issues or extending our application much easier. If we want to fix an issue with the metadata parsing step, we just make a change in that notebook, push it to our git repo, and that's that. If we want to add a third analysis, say, to the end of this workflow, it's as simple as writing a new notebook, adding it to the workflow, and again pushing it to git. By breaking the notebooks up into smaller, modularized parts, we make our application much more maintainable and extensible while still using the Jupyter notebooks as our core units of development.
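To give a concrete sense of where one of these small notebooks meets Ceph, the final cell of an analysis notebook might look roughly like the sketch below. Ceph exposes an S3-compatible API, so a plain boto3 client works; the endpoint, bucket, key, and environment variable names here are illustrative placeholders rather than the exact ones from our repository.

```python
import os

import boto3
import pandas as pd

# Ceph's object gateway speaks the S3 protocol, so a standard boto3 client
# can push results. The variable names below are examples only.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Results produced earlier in the notebook, e.g. monthly keyword counts.
results = pd.DataFrame(
    {"month": ["2020-01"], "word": ["kernel"], "count": [42]}
)

# Write the results where the next step (or Hive/Superset) expects them.
results.to_csv("/tmp/keyword_counts.csv", index=False, header=False)
s3.upload_file(
    "/tmp/keyword_counts.csv",
    os.environ.get("S3_BUCKET", "mailing-list-analysis"),
    "analyses/keyword_counts/keyword_counts.csv",
)
```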
The only thing that's not apparent from this graph, which we showed earlier, is what the last two notebooks at the end are actually doing. For this workflow to function properly, they must push the analyzed data up to Ceph for visualization.

At this point, I hope the benefit of this approach is starting to become more apparent. However, just getting data into Ceph is not enough for visualization and interaction, so how do we actually get it into a dashboard? We use a combination of Hive, Hue, and Superset to provide interactive user access to our analytics results. This really only needs to be done once for each new analysis, but the process is worth noting. Once you have your data stored in Ceph, you want to create an external Hive table, which is a data table backed by the data stored in Ceph. As your Ceph bucket gets new data, the table is automatically updated to reflect those changes. All you have to do is define the schema, run the CREATE TABLE command like the one shown here (a representative version appears below), and make sure that the location points to the correct artifacts in Ceph.

Now that we have our table, we want to make it accessible in Superset. To do that, we import the table definition using the table definition form, again shown here, and simply provide the correct database and table names. At this point, our data pipeline is complete and we have an automated system that pulls data from our mailing list archive and pushes results to our interactive dashboarding tool. The last thing we need to do is define the charts we want to share and publish a dashboard for end users.

Here is an example of a data exploration page in Superset. We are simply grouping by word and creating a table showing the count of each word, but Superset lets us build all sorts of visualizations: bar charts, word clouds, heat maps, and plenty more. So you can play with your data here, and once you're happy with a particular analysis, you can add it to a dashboard. And here's an example of a dashboard that we're working on. For the keyword analysis, the output is this word cloud. We could also use the table on the right, or the additional filter box that's not visible here, to look at a specific time period, a particular month or time range, or specific words you might be interested in. All of this is interactive for the end user. We've also included a couple of line charts indicating user engagement statistics for the mailing list over time. These too are interactive, so you can filter by specific users or time periods.

So, just to quickly reiterate: we wanted to find a way to develop a natural language processing application, using our Jupyter notebooks as the core development unit, to display monthly trends in keywords and user engagement on the Fedora mailing list. We were able to do so by tying together JupyterHub, Ceph, Argo, Hue, and Superset, all of which are part of the Open Data Hub project. I hope the benefits of this approach are now a bit more apparent, as well as how it could be modified to accommodate many other analytics projects. And if the implementation details are still a little fuzzy, I'll hand it over to Tom now, who will discuss the automation bits in greater detail.
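For reference, the CREATE TABLE statement from that slide is roughly of the following shape; the table name, columns, bucket path, and Hive connection details below are stand-ins for illustration, issued here through PyHive against a Hive server, and they assume Hive is configured to read the Ceph bucket over the S3A connector.

```python
from pyhive import hive

# Connect to the Hive server (host, port, and user here are illustrative).
conn = hive.Connection(host="hive-server", port=10000, username="hue")
cursor = conn.cursor()

# An external table is just a schema plus a pointer at the Ceph bucket,
# so new objects landing under LOCATION show up in queries automatically.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS keyword_counts (
        month      STRING,
        word       STRING,
        word_count INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3a://mailing-list-analysis/analyses/keyword_counts/'
""")
```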
So thanks so much, and go ahead, Tom.

Thank you, Michael. Hi, I'm Tom, and in the next couple of minutes I'll walk you through all the nitty-gritty details of this analytics workflow and how we actually produce the data that is showcased later on the Superset dashboard.

You already saw this image. It shows an Argo workflow, the task graph we use for this particular use case. It's based on a template we use for multiple different automated analysis runs. The workflow itself consists of two groups of tasks. The first group is for data collection, which essentially updates our dataset with the newest data available: it collects the raw input data and parses it into a format consumable by our analytics notebooks. In parallel, we also download the historical parts of the dataset to a shared location, from where the complete dataset can be accessed locally on the pod by each of the analytics steps later. The other part is the analysis group, this one, which executes all the analytics notebooks in parallel. Each analytics notebook performs its analysis and then uploads the results to an S3 location, from where they are consumed by Apache Hive. What is also important to mention again is that each workflow step executes a Jupyter notebook. Once we execute a notebook, we also store it, with its rendered output cells, as a workflow artifact. That means we have all the execution results available as notebooks as well.

So what does it take to build an end-to-end flow like this, one that delivers new analysis results from notebooks on a set schedule? We've defined these four pillars for ourselves. This is no rocket science, but it's important to explain. First, we need a container image with all the notebooks, their dependencies, and the runtime baked in. Second, we need to maintain consistency across all the notebooks: we pass sensitive data via environment variables using well-known common names, and we have an automation-specific flag to enable or disable sections of a notebook based on the environment where it's run. Third, we have common shared storage available to all the notebooks, so they can exchange large amounts of data locally without extra network traffic or repetitive data downloads, which would only slow us down and introduce new points of failure. And as a fourth pillar, we rely on single-purpose notebooks as single-purpose steps, so we know which notebook has to run before which other notebook, and each notebook aims to accomplish a single task.

Now, how do we actually implement those four pillars? For the container images, we use a source-to-image strategy with our S2I custom notebook builder image, which respects the Cookiecutter Data Science repository layout and produces images that are spawnable in JupyterHub on Open Data Hub as well as usable in our automation. We also rely on Pipenv and micropipenv for dependency management. Then we use a reliable GitOps-centric CI/CD process for release management and build pipelines, which delivers the image on a GitHub release. These images are then available in a Quay repository, scanned for security issues, and so on. Now, as you can see in the bottom left, we use environment variables in the notebooks so we can have automation-specific behavior and point the notebooks to the right credentials, shared locations, mount points, et cetera. This leads me to the image above.
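As a small illustration of that convention, the top of each notebook tends to look something like the following sketch; the exact variable and flag names here are examples, not the precise ones from our repository.

```python
import os

# Well-known common names shared by every notebook in the repository
# (the specific names here are examples only).
S3_ENDPOINT = os.environ["S3_ENDPOINT_URL"]
S3_BUCKET = os.environ.get("S3_BUCKET", "mailing-list-analysis")
DATA_DIR = os.environ.get("LOCAL_DATA_PATH", "/mnt/data")

# Automation-specific flag: skip interactive, exploratory cells when the
# notebook is executed headlessly by Argo, but keep them in JupyterHub.
RUNNING_IN_AUTOMATION = os.environ.get("RUN_IN_AUTOMATION", "false").lower() == "true"

if not RUNNING_IN_AUTOMATION:
    # Exploratory work that only makes sense with a human in the loop,
    # e.g. printing or plotting intermediate results inline.
    print(f"Interactive run; reading shared data from {DATA_DIR}")
```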
Each step in our Argo workflows is an abstraction on top of a Kubernetes pod, so it allows us to define volume claims and mount them to the individual steps as shared storage and common exchange points. And on the right, you can see how we define our steps within the Argo workflow. Here you're looking at the data collection section of our workflow, and you can see we have three tasks shown. Each executes a different notebook, which is passed as a parameter to the notebook executor template. We can also define dependencies of a particular task, as you can see on the third task here, and so on.

So how does it all fit together? For this particular use case, we define a recurring workflow with a cron schedule. This workflow executes a set of steps, each of which refers to the notebook executor template. The template consumes a few parameters, one of which is the notebook that should be executed; other parameters allow us, for example, to scale the pod resource requirements for that specific task. The executor template runs a command which is a tiny Python script wrapper around the papermill command (a sketch of this wrapper appears below), and Papermill then executes the notebook.

So what does Argo bring us here? We can do periodic analysis runs, either on a cron schedule or, with Knative eventing for example, triggered from a message bus or other triggers. We have a single, minimal, secure, and versioned container image per project. We manage the deployment in a Kubernetes-native way, so we can roll back to previous releases and so on. Also, all workflow steps use a common base, but each step can be parametrized, so we can scale the more intensive tasks, for example, and other parametrization is available as well.

That's how we do it currently. What about the future? There's another project called Elyra, and I recommend anybody with a similar use case to check it out as well. It is a JupyterLab extension which allows you to define user-friendly notebook pipelines, and it is very end-user-centric. There are a few differences between our requirements and what Elyra currently offers, but we've already started collaborating on a couple of those, and we'll see where it goes. To draw a quick comparison: our approach is declarative and GitOps-centric; we produce a single image that can work in an isolated environment, since all the requirements are already baked in, and we can run the workflow on a schedule. Elyra is currently great for running experiments; our approach, on the other hand, is fully autonomous once set up and compatible with our SRE requirements.

So that was it, I hope I'm on time. Thank you very much for attending our talk, I hope it was useful to you. If you still feel we've only scratched the surface, you're right: we'd gladly welcome you to our website and GitHub repository to learn more about this particular analytics setup, including demo videos, Jupyter notebook workflows, source code, and so on. Thank you very much for your attention, and if you have any questions, now is the time, or reach out to us anytime via email. Thank you very much.
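For reference, the Papermill wrapper mentioned above is essentially just a few lines like the following sketch; the directory layout, argument handling, and kernel name here are illustrative rather than the exact entry point from our repository.

```python
import os
import sys

import papermill as pm


def main() -> None:
    # The workflow step passes the notebook file name as the only argument.
    notebook = sys.argv[1]

    # Execute the notebook and keep the rendered copy as a workflow artifact,
    # so every run's output cells stay inspectable afterwards.
    pm.execute_notebook(
        input_path=os.path.join("notebooks", notebook),
        output_path=os.path.join("/mnt/data/artifacts", notebook),
        kernel_name="python3",
    )


if __name__ == "__main__":
    main()
```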