After this, I figure, let's get started on time and then hopefully we can end on time as well, so that the next set of speakers has the time that was pre-allocated and, of course, so that you can leave the venue on time. Because this evening there's the event at Stubb's Bar-B-Q, which, for those of you who are not from Austin, is something special; you should definitely check it out. So my name is Patrick Titzler. I'm a developer advocate with IBM. I'm based in San Jose, California, where right now it's colder than it is here, so I don't know how you deal with the heat; for me, it's pretty devastating. We had a recording this morning on the rooftop of the hotel, which was quite a challenge, because even that early in the morning it was already pretty hot. Okay, so what am I going to cover in my session today? And I don't know why this is truncated right now. Let me switch this to... okay, this is better. As I mentioned, I'm a developer advocate with the Center for Open Source Data and AI Technologies. We are a group of developers and data scientists working together on a variety of open source projects, most of them focused on data and AI, as the name of our team implies. We are a fairly diverse group. I, for instance, have been working on software for 20 years: I started on business intelligence systems and database tooling, then moved on to cloud services, and for the past five years or so I have actively contributed to open source software as part of my role on this team. I'm currently one of the maintainers of the Elyra project, which I'm going to introduce you to in this session. If you have any questions, feel free to stop me in between. Hopefully we'll also have a couple of minutes at the end should you have any additional questions. So what is Elyra? Elyra is an open source project that extends JupyterLab. I'm not sure how many of you are data scientists, but I'm pretty sure that those of you who have used Jupyter notebooks have at least heard of JupyterLab or used it occasionally. It's probably one of the most well-liked web-based integrated development environments for working with Jupyter notebooks. Of course, there are other development environments that can do a much better job when it comes to, for example, Python programming, but JupyterLab has been around for quite some time. You might be familiar with its predecessor, Jupyter Notebook, which focused exclusively on working with notebooks, whereas JupyterLab takes this a step further. It adds capabilities that you would expect from an integrated development environment: you have a bunch of additional editors that you won't find in the Notebook editor, you have data visualization capabilities, and, what is special about JupyterLab in comparison to Jupyter Notebook, it is extensible via extensions. And this is what we take advantage of, because Elyra extends JupyterLab in two ways. Number one, we add AI-centric capabilities that aren't natively supported in JupyterLab, and we also pre-install a set of third-party extensions that we find extremely useful; those extensions deliver the capabilities that you would expect to find in a full-fledged integrated development environment but that are not included in the base installation of JupyterLab.
Two of those extensions you might be aware of. One of them is the Git extension, which provides you with access to Git-based repositories so that you can sync the workspace of your JupyterLab installation with the content of one or more repositories, extremely useful for various day-to-day tasks. The other extension that we include by default when we install Elyra is the Language Server Protocol support, which probably doesn't mean anything to you by name, but if you have used any type of editor, for example for Python or for Java, you're familiar with features like automatic code completion and on-the-fly error checking; this is what the Language Server Protocol provides, and by installing it, it's available in all of the components within JupyterLab. So whether you open the notebook editor or the Python editor, you have access to all of those features, which you have probably come to like and love over the years. Elyra is a fairly new project; we released the first version about two years ago. We are running a fairly aggressive schedule, trying to release a minor version every month, which opens up a lot of opportunities for us, because as an open source project we rely on a lot of feedback from the community about what to work on next, and this way we can respond in a very timely manner to most of the requests that we receive. As of earlier this week, we had about 50 distinct contributors to the project, and there are about eight of us who are permanent maintainers. The contributors contribute anything from source code for new features to bug fixes and documentation; we even had contributors who wrote tutorials for us. It boils down to having a core set of people who do the bulk of the work, and then the community chipping in for those work items that we can't take up or that are so specific that we can't deliver them ourselves and have to rely on the community for. The most popular feature, and it's the one that I'm going to cover in my presentation today, is the ability to create pipelines from Jupyter notebooks, from Python scripts, from R scripts, and even from custom components. No worries, I'm not going to show you a lot of slides here. I'm just going to go through a brief motivation, then I'll switch to the demo, because I believe once you see this in action, it will be very self-explanatory how it can add value to your day-to-day activities. So, what are some of the goals that we started out with when we started the project? Traditionally, when you have JupyterLab installed, at least when you're starting out, you are running it locally on your laptop or desktop, which of course has limited resources. It might be okay to run a notebook in your local environment as long as it doesn't require a lot of resources, but sooner or later you're going to hit the point where it's just not enough: you might need access to specialized hardware, or to a lot more main memory or disk space, simply because the ML operations that you're running are so resource intensive. And so what we're trying to do is give users, from within the context of JupyterLab and the notebook editor, the ability to run these notebooks in remote environments where you have access to those resources.
So, this is something that can be done in other tools, but we are taking a unique approach, and you'll see in a minute how we actually accomplish it. The second bullet here: at least when I started with data science and working with Jupyter notebooks, I did my regular development, and my notebook kept growing and growing and growing while doing all the things I needed it to do. It was loading the data, it was cleansing the data, it was looking for the features, it was doing the training, and by the end I was done. I had a notebook that was huge, a monolithic notebook, which is nice if I just want to share it with somebody, but if I want to take this into a more production-like environment where I want to reuse assets, that's going to be a problem, because now I have this one notebook that does all these things, while in another project I might only need a small part of its capability. And so, if you break down your work from one big monolithic notebook into several smaller notebooks, suddenly you need to run a set of notebooks at the same time or in a sequence. In essence, what you want to be able to do is create a pipeline that you can just submit, and then as part of this pipeline all of the notebooks or scripts, whatever you want to incorporate in your machine learning workflow, will be executed; and this is what you are able to do with the Elyra pipeline feature. Then, once you are really done with your initial development and you move this over into a QA or production environment, of course you don't want to sit in front of your machine and trigger these pipeline runs manually. So you need the ability to run this periodically, and we give you different means to do this, so that, for example, if you're running in a CI/CD environment, your tools can trigger a pipeline run and then you can use the regular tools to monitor the execution progress of those runs. And then last but not least, and it's also one of the key points here, we didn't want to reinvent the wheel. There are obviously already other tools out there that can execute workflows, so we wanted to take advantage of them. We didn't want to re-implement everything from scratch, because we wanted to provide value to other open source projects rather than compete with them. You'll see on the next slide what those orchestrators are, but before I go into that: on the left-hand side you can see what we do to support those goals. This is probably a little bit too small to see in the back, so if you download the slides, you can see the screen capture more clearly. On the top left, you can see the Notebook editor and the Python editor, which we have extended with a button that is right here, "Run as Pipeline". If you click that button, you select which runtime environment you want to run this notebook in, and we do the rest for you. We have implemented the Visual Pipeline Editor, which, as the name implies, enables you to create pipelines from your notebooks and scripts; that's what I'm going to show you in my demo. And then we have the CLI, which you can use to automate some of those tasks once you're ready to work with those pipelines in a CI/CD environment. So currently we are supporting three different types of runtime environments.
Kubeflow Pipelines, which is part of the Kubeflow ML platform and is based on Kubernetes. It's also a fairly new project, but very, very popular. And then we have Apache Airflow, which is even more popular, but it's more of a general-purpose workflow execution engine. Both of these orchestration frameworks have in common that they run on top of Kubernetes, which makes them extremely scalable. They also both provide a programmatic interface, so in essence, as a user, in order to create a workflow and run it, you have to implement some source code that is then executed by those runtimes. With the tooling that we have, we do that for you, so you can focus on just building the pipelines while we take care of all the nitty-gritty details that have to be handled when you actually run this. The last runtime that I'm mentioning here, local execution, is something we provide primarily for development purposes or for smaller pipelines. The way this works is that we simply run the pipeline in your local environment, which of course doesn't solve all of the problems that I mentioned initially, but just to get started, and for example for demo purposes, this is more than enough. So let me switch over to my demo and show you how we would go about creating a pipeline. What you are seeing here is my JupyterLab installation. You can see here in this section that there's a new category called Elyra, and under this category you have different tiles. I'm going to start off with the generic pipeline editor. Unfortunately, I can't increase the font size any further because then things start to look really funky, so I'm trying to keep it fairly simple here. We have our canvas, and we have our palette on the left-hand side, which allows me to pick from a notebook, a Python script, or an R script. To keep things easy, I can also simply drag and drop the notebooks that I have; let's just open this one up. So this is something you're probably very familiar with, right, just a regular notebook. Here we have the button that I mentioned; we are not going to use it here, but the experience that you have is the same that you would have today when you're just working with individual notebooks. So let me add a couple of those notebooks to my pipeline by dragging and dropping them onto the canvas. And of course, this works the same way irrespective of whether you have a Python script or an R script; I'm just using notebooks to keep things easy for me. And now what I need to do is define the dependencies between those notebooks. The way I'm going to do this is by connecting the different nodes that I have added here, and I probably don't have to explain what the execution would look like in this case. If you read this from left to right, we would first load the data, we'd then run the data cleansing notebook after this first notebook has completed execution, and then, once that notebook has finished processing, these two notebooks would be run in parallel. So this gives you the ability to create a dependency graph between the different steps in the machine learning workflow that you're trying to complete.
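To make the "programmatic interface" point above concrete: the sketch below, written against the Kubeflow Pipelines v1 SDK, expresses the same load -> clean -> two-parallel-steps graph in code. It is not the code Elyra actually generates, only an illustration of what the editor spares you from writing by hand; the container image and the papermill commands are placeholders.

```python
# Not what Elyra emits -- a rough sketch of the kind of DAG code an orchestrator
# like Kubeflow Pipelines expects, using the KFP v1 SDK. Image and commands are
# placeholders for illustration.
from kfp import dsl


def notebook_op(name, notebook):
    """One pipeline step that runs a notebook (placeholder command)."""
    return dsl.ContainerOp(
        name=name,
        image="example.com/runtime:tf-2.8",  # hypothetical runtime image
        command=["papermill", notebook, f"{name}-output.ipynb"],
    )


@dsl.pipeline(name="analysis-example",
              description="load -> clean -> two steps in parallel")
def analysis_pipeline():
    load = notebook_op("load-data", "load_data.ipynb")
    clean = notebook_op("clean-data", "clean_data.ipynb")
    analyze = notebook_op("analyze", "analyze.ipynb")
    train = notebook_op("train", "train.ipynb")

    clean.after(load)     # clean starts once load has completed
    analyze.after(clean)  # these two have no dependency on each other,
    train.after(clean)    # so the orchestrator can run them in parallel
```

With the visual editor, the equivalent graph is produced simply by dragging the notebooks onto the canvas and connecting them, as described above.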
And now all that's left for me to do is configure the individual nodes, because each one of those nodes has properties. The properties, and that's what you see here on the right-hand side, are specific to the type of node that you're working with. So the properties that you see for a Jupyter notebook might be different from, for example, a custom node that I'm going to show you later on. There are some generic properties. For example, if I don't like the name of the node, which is called "load data", I can name it anything I want, but since I'm not very creative, I'm just going to call it "load data" here, and you can see this is then automatically reflected. Then there are the custom properties. The way we execute these pipelines under the covers, ultimately using the orchestration framework, requires us to specify a container image in which the notebook will be executed. The motivation for this is that when you run something like this in production, an important consideration is that you want things to be repeatable, and the only real way to make things repeatable is to provide a runtime environment that is constant across time. So if I specify, for example, that I want to run this notebook in a container image that has TensorFlow 2.8 installed, then every time I run this pipeline it is already predefined which packages are installed in that image, and if I run this pipeline ten times over, let's say, a span of several months, I should always get the same results, because ideally I have no variability in the environment in which this notebook is executed. Depending on the needs of the individual notebooks or scripts that I'm running, I can specify additional hardware resources that I might need; a training job, for example, might require access to GPU hardware to accelerate the processing. I can specify file dependencies and other dependencies that the notebook needs in order to function. I can customize notebooks using environment variables, which is extremely useful if you want to create notebooks that are reusable: where the only difference between two executions is the value of an environment variable, you define the value in your pipeline, run the pipeline, then change the value and run the pipeline again, and you never have to touch the notebook itself. In those cases where your notebooks or scripts require access to confidential or generally sensitive information, like user IDs, passwords, or API keys, you can take advantage of Kubernetes secrets, which need to be predefined in the environment where you're running the pipeline; we then take care of making the information that's encapsulated by the secret available to the notebook, so that when somebody else looks at the pipeline in the editor or looks at the source code, they will not have access to that confidential information. It's only available at the time the notebook is executed. And then, last but not least, I have to specify, at least for notebooks, what the output files are, so that downstream nodes, which are nodes that are executed after the current node, actually have access to any output that the upstream node might have created.
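As a minimal sketch of the environment-variable pattern just described (the variable names and the fallback URL are made up for illustration), a cell in a reusable notebook might look like this:

```python
# A cell from a reusable notebook: configuration comes from environment
# variables instead of being hard-coded. "DATA_URL" would be set as a node
# property in the pipeline; "API_KEY" could be injected from a Kubernetes
# secret at run time. Both names are made up for this sketch.
import os

data_url = os.environ.get("DATA_URL", "https://example.com/sample.csv")  # fallback for local runs
api_key = os.environ["API_KEY"]  # fail fast if the secret was not provided

print(f"Loading data from {data_url}")
```

Because the values are supplied by the node properties (or injected from a secret) at run time, the same notebook can serve several pipeline runs without ever being edited.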
And then data volumes are something that we added fairly recently based on community feedback, because one problem we have seen is that the approach we had been taking for data exchange was limited when you were dealing with large data sets. If the data that needs to be exchanged between multiple nodes is in the gigabyte range, you don't want to send everything over the wire. In that case, what you would want to do is mount a data volume on each one of the nodes so that you reduce the network traffic. So you define these properties for each one of the nodes in your flow, but since this can get very tedious, what we delivered in one of the more recent releases is pipeline defaults, where in essence you can specify one default value that is applied to all the nodes and then override the defaults as necessary. That can be a major time saver, and you can see that, at least for notebooks, this covers most of the properties that those types of nodes require. So that's really all there is to building a pipeline from notebooks or scripts. Now, you obviously want to be a good citizen, so, again, I'm not being creative here, but you want to document your pipeline, so that if somebody else downloads it, for example after you've uploaded it to your source control system, they have the relevant information they need to understand what the pipeline is doing and what some of the constraints are. You can add comments to the pipeline both here on the canvas and in the pipeline file itself, very similar to what you would be doing in a Jupyter notebook. So this is a generic pipeline, and we call it that because this type of pipeline can be executed in every runtime environment that we support: on Kubeflow Pipelines, on Apache Airflow, and in the local environment. Before I run this pipeline, I'm going to show you one more thing, and that is how you would work with custom components. For that, I just need to switch to a different directory. What I have open here now is again the pipeline editor, but this time it is specific to Kubeflow Pipelines, and you can see that the palette now not only has the entries that you have seen before, which enable me to run notebooks and scripts in this pipeline, but also these custom components. I'm just going to pick "download file" here; I'm actually not going to run anything, but the way I'm building the pipeline is exactly the same: I choose the components, I define the dependencies, and then I define the properties of these components. And this is where I probably need to explain when you would use a component versus when you would use a notebook. The advantage of a notebook, of course, is that you have access to the source code, which is ideal when you are working in development mode on a project, but at some point you're probably ready to move this development code into production, and then running it in a notebook isn't necessarily the best way to do it. That's when you start thinking about how you can extract this code and move it into some self-contained piece of executable code that you can make available to users, so that it can be reused over and over in your projects. And this is in essence what I have done here in this example: I have created a very simple component called "download file" that expects as an input, and that's what you can see here on the right-hand side, a URL that points to a data file, and that component will download the file for me. I don't have to know what actually happens under the covers; all I need to know is that I need to provide this input, and then the component will do whatever I want it to do.
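For illustration only, this is how a component like that could be packaged for Kubeflow Pipelines using the v1 SDK's function-to-component support. It is not the component used in the demo; the function name, parameters, and base image are assumptions.

```python
# A sketch of packaging a "download file" step as a reusable Kubeflow Pipelines
# component with the KFP v1 SDK. Not the component from the demo; names and the
# base image are assumptions.
from kfp.components import create_component_from_func


def download_file(url: str, output_path: str = "data.csv") -> str:
    """Download the file at `url` and return the local path it was saved to."""
    import urllib.request  # imported inside so it is available in the component's container
    urllib.request.urlretrieve(url, output_path)
    return output_path


# Turning the function into a component also writes out a YAML specification
# (name, inputs, outputs) that catalogs and editors can work from.
download_file_op = create_component_from_func(
    download_file,
    base_image="python:3.9",
    output_component_file="download_file.yaml",
)
```

The generated YAML specification is what describes the component to tools that did not see its source code, which becomes relevant a little later in this walkthrough.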
Now, granted, this is a simple scenario that I'm showing, but think about those cases where you have implemented code that trains a model. If you have, for example, implemented an NLP model and you want to make it available to other users, you as the expert can create this component, implement it in your environment, test it, make sure it's working, and then make the component available to all the other users in your company. They can use that component without having the same kind of knowledge that you have, simply because all they need to know is what the inputs are and what outputs the component produces, and then they can create their own pipelines, just like I have done here today. So in a way it provides you with an abstraction layer that gives you a lot more flexibility down the road when you are trying to build more pipelines that reuse the same kind of code. Now, how you actually implement these components is specific to the runtime environment that you're working with. In my presentation I have a bunch of links that point to the relevant resources; I'm not trying to give you a marketing presentation here, so all of the links should point you to the appropriate topics in the documentation of the appropriate open source projects. I've already mentioned that these custom components typically implement a machine-learning-specific task in the general sense, but of course your custom components can implement anything; there is no restriction whatsoever on what they actually do. If you want the component to draw some smileys, well, that's all up to you, so you can be creative when it comes to building those components. What is relevant is that we, as Elyra, don't really know up front which components are available in your environment, because all we are providing is the editor. You are the ones who are building the notebooks, you are the ones who are building the components, and we have no insight or knowledge ahead of time of what the inputs and outputs are. The way this problem is solved is that each one of those components explicitly or implicitly defines a specification. An example of such a specification is what you see here on the left-hand side, even though it's kind of hard to read: it defines the name of the component, a description that is then visualized in the UI, and the inputs and the outputs, and all of that is then rendered in the UI, which is what we were seeing earlier in the demo. So the quality of your specification has an impact on the user experience people will have when they actually try to use this component.
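As an illustration of what such a specification can contain, here is a sketch in the Kubeflow Pipelines component YAML format for the hypothetical download-file component; every value, including the shell command, is made up, and loading the text through the KFP SDK simply shows that it is a regular, machine-readable artifact.

```python
# An illustrative component specification in the Kubeflow Pipelines YAML format.
# The name, description, inputs, and outputs are the parts an editor such as
# Elyra's palette and properties panel can render; all values are made up.
from kfp.components import load_component_from_text

DOWNLOAD_FILE_SPEC = """
name: Download file
description: Downloads a data file from a URL and stores a local copy.
inputs:
- {name: url, type: String, description: Location of the data file}
outputs:
- {name: downloaded_path, type: String, description: Where the copy was stored}
implementation:
  container:
    image: python:3.9
    command:
    - sh
    - -c
    - |
      mkdir -p "$(dirname "$1")"
      python -c "import sys, urllib.request; urllib.request.urlretrieve(sys.argv[1], sys.argv[2])" "$0" "$1"
    - {inputValue: url}
    - {outputPath: downloaded_path}
"""

# Loading the text yields a factory that can be used as a step in a pipeline.
download_file_op = load_component_from_text(DOWNLOAD_FILE_SPEC)
```

The name, description, and input/output declarations are exactly the pieces that can be rendered as a palette entry and a properties form.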
The last slide that I'm showing here for custom components: I mentioned that Elyra doesn't include any components of its own, so you have to provide them. We use the concept of catalogs to do this. Think of a catalog as a collection of components that we access via connectors. We have implemented some of these connectors for generic purposes, like file system, directory, and URL, which should be fairly self-explanatory: if you have components stored locally, or stored somewhere on a public website, you can pull them in. For Apache Airflow, we have specific connectors, and we also have a couple of connectors that were supplied by the community and are supported by the community. The MLX connector provides you with access to the Machine Learning Exchange, an LFAI Sandbox project that's also fairly new. The focus of that project is to provide a data and AI assets catalog and an execution engine. Work is currently ongoing towards supporting Artifactory as a repository; GitHub is coming down the line, probably S3 storage as well. Support for components is fairly new, we've only had it for about six months, so we're still working with the community to identify which types of systems or catalogs are frequently being used, so that we can work on enabling those connectors. What you see here on the bottom is how it works conceptually: you have the catalog, which contains the definitions, you have the connector that knows how to read from the catalog, and then all of this is visualized in the palette you were seeing earlier. Okay, we've already covered how to go about configuring nodes, so let's talk about running the pipeline. Running a pipeline is as simple as choosing the runtime environment in which you want to execute it. You can do this either in the pipeline editor, which I'm going to show you, or using the command line interface, again for those scenarios where you want to automate certain processes. We also support exporting a pipeline. For this to make sense, I probably have to explain that the pipeline representation that we create is not something that any of the runtime environments we support understands, so we have to convert it into something that Kubeflow Pipelines understands, or that Apache Airflow understands. When you export a pipeline, this is exactly what happens: we create artifacts that you can then manually import into those systems if you don't want to run the pipelines right away. So let me switch back to the demo, and back to my first pipeline actually, in order to speed things up; I'm just going to open one that has been fully configured, again following best practices. I click the "Run Pipeline" button. Since this is a generic pipeline, I can choose between the runtime environments; I'm just going to run this on Kubeflow Pipelines for now. You can see I have two environments pre-configured that I can use. Normally, you would have a setup for your dev environment, for your QA environment, and for your production environment. I'm just going to use my development environment and kick this run off. What's happening now is that we are compiling the pipeline into something that Kubeflow Pipelines understands, we are creating the artifacts, we are uploading them to a location that is accessible to Kubeflow Pipelines, and the job is then handed over to Kubeflow. For those of you who are Kubeflow Pipelines users, this dashboard here should look very familiar: this is the central dashboard, and on the left-hand side you have access to the different services that Kubeflow provides. What we see executing here is our first notebook, the one that loads the data. We can see that right now this is a work in progress, and if we look at the log file, we can see what's actually happening under the covers. Since we don't really have time to watch this run to completion, I've already done this in preparation for the session. So we can see the pipeline graph, which in most cases is going to look identical to what we created in Elyra, but in some instances it might look slightly different because, under the covers, Kubeflow runs some optimizations.
The other thing I should probably mention is that for certain types of nodes, Kubeflow Pipelines can take advantage of caching. So if you have, for example, run a notebook before, it won't execute it again, which can save you valuable resources, especially for long-running jobs. You might be wondering, can I do monitoring in the Elyra UI itself? The answer is no, we don't provide this. The main reason is that we have limited resources ourselves; it's a small open source team, so we need to focus our energy on those high-value features that you can't get anywhere else. That's why we didn't want to duplicate this type of monitoring user interface, and therefore we delegate this task to the runtimes that we support. One thing that I haven't mentioned: even though right now we are only supporting Kubeflow Pipelines and Airflow, because this is open source software, we have designed it in such a way that, in theory, you could add support for other runtimes as well. We have already received several requests from users who are using other workflow orchestration frameworks and wanted us to implement support for them. In most cases, unfortunately, we have to respectfully decline and rely on the community to do this work, because our manpower is also limited and we unfortunately can't do everything that people need. And so, once this pipeline has executed, what we can do is access the results of this execution. You can see here, and this is awfully small, what I'm showing you is our object storage browser that we invoked; I'm just going to pick one of them. What happens is that everything runs remotely, so we are not pulling the results down to your local machine; we store them on cloud storage. And then, for each one of the nodes that we executed, we store the completed notebook, and we store, for example, an HTML version of the notebook; that's what I'm going to download here. Let me just open it up quickly, and we can see that this notebook has executed. So this is how I would access the results without impacting anything that I have stored locally. That's all there is to running a pipeline in a remote environment. So let me pause here briefly: any questions about what I've shown you so far, how to build a pipeline and how to run it? What I didn't show is how you create this runtime configuration. In essence, it's a very simple step that you would complete ahead of time, and I've done this in preparation for the demo: if I switch to the runtimes view, you can see that I had to pre-configure the connectivity information for the systems that I have access to, and that makes it available to all of my pipelines. Okay, if there are no questions, I need to go back to my presentation. Okay, let's see if we can cover this. So, what are some of the deployment options that you have if you're interested in exploring this? You can build Elyra from source code, which is what we as the maintainers typically do. We publish both on conda-forge and on PyPI, so you can install it within just a few seconds. We also publish pre-built custom container images that you can use to run Elyra, for example, from Docker Desktop. You can use those images if you have a multi-user JupyterHub deployment: you just configure the image that you want to use and you're good to go. We also provide a custom Kubeflow notebook server image, which you can use if you're using Kubeflow as your orchestration framework.
Kubeflow provides the ability to run notebooks within its environment, and with this image you can very easily do that. That's what the screen capture here is showing: I configured a notebook server with our image, I spin it up, and then it looks the same way it did in my little demo here. And then, last but not least, we have incorporated Elyra into Open Data Hub. Open Data Hub is an initiative led by Red Hat; the idea is to provide a blueprint, completely based on open source, that allows you to set up a data science environment, with Elyra as one of those components, along with JupyterHub. The last slide that I have is just some useful links. If you'd like to try it out, we have invested a lot of time in the documentation; on the right-hand side you can see a small summary of what that entails. We have a variety of community channels that we monitor: on GitHub you can open an issue with a question, we have a discussion forum there, and we have a Gitter chat, so there are many different means to reach out to us, get help, or ask questions. And last but not least, if you're interested in trying this out and maybe also want to contribute, we have information in the GitHub repository on how to do that. All right, one last chance for questions; we have about three minutes left here. [Audience question about what the move from enterprise software to open source has been like.] For me, it's a very interesting question, and one that probably takes a while to answer in general, but obviously I don't want to waste your time. I've mentioned that I've worked on enterprise software for almost all of my career, and this change to open source has been fundamental for me. It's a very different experience, because the motivation of your users is very different. There is nobody who is really forced to use the project that you're working on; if they find something better, they're going to use it. So you're always competing with somebody else, you have to bring your best game, you have to be really responsive. We don't have a sales team, nobody buys this, everything is free, and if you don't like what we do, you can just modify it. For me, this was simply the natural next step. It's definitely a challenge on a day-to-day basis, especially when it comes to finding volunteers in the community to help with some of the work. And I was the same way initially: you tend to use open source software, you might appreciate it, you might give it a GitHub star, but you don't necessarily think about the people who are contributing their time to make it happen. You also have to build up a bit of a thick skin, because people do start voicing what I would consider unreasonable requests, especially when those requests come from big companies that are not willing to contribute. It's a game of giving and sharing, and it only works if it goes both ways, because otherwise most open source projects will not be able to succeed. That, for me, was one of the biggest eye-openers, because, like I said, I never really thought about this in my previous jobs. Very good question. Anything else that you would like to know? We have a question from a virtual attendee. DCO asks: is it possible to have unit tests as part of the pipeline CI/CD? Very good question. The short answer is no, there's nothing built in that would support this. The way we are actually doing it as part of our own daily development is that we're using pytest to verify the functionality. The way I, as a user, would approach this is by using the CLI, the command line interface, because there you can run those pipelines in whatever fashion might be required and then analyze the results of those pipeline runs. But native support for those types of tests, unfortunately, we don't have.
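A rough sketch of how that could look in practice, assuming the elyra-pipeline CLI is available in the CI environment and that its run subcommand executes a generic pipeline locally as described in the project documentation; the pipeline path and the expected output artifact are placeholders.

```python
# A sketch of wiring pipeline runs into CI with pytest. It assumes the
# `elyra-pipeline` CLI is installed and that `elyra-pipeline run` executes a
# generic pipeline locally; the paths below are placeholders.
import pathlib
import subprocess


def test_analysis_pipeline_runs_and_produces_output():
    result = subprocess.run(
        ["elyra-pipeline", "run", "pipelines/analysis.pipeline"],
        capture_output=True,
        text=True,
    )
    # The CLI should exit cleanly ...
    assert result.returncode == 0, result.stderr
    # ... and the notebooks should have produced the artifact we care about.
    assert pathlib.Path("data/cleaned.csv").exists()
```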
One more question there. Earlier, you mentioned that folks can download a file, which could be an executable or some sort of binary that you provide. Do you provide signing of that executable? Not at this point, but it's a very good question slash comment, because obviously you can't necessarily trust everybody. At this point, the best answer I can really give is that you should only download components from trusted sources. I mentioned during the demo that components are still fairly new, so even though there are some repositories out there that host components, it's not really very prevalent yet. In fact, we are working with the Kubeflow folks on coming up with more of a standard for those components, including how to store them, and I would assume that as part of those discussions we also have to think about those aspects. It's a natural progression: initially you start building small, but at some point, when you get into the enterprise domain, those types of questions become a lot more important. Okay, good, so we are out of time. If there are any other questions, I'll be around for a little bit longer, and you can, again, reach me via any of the community support channels that we have. And with that, thank you for coming to my presentation, and hopefully you enjoy the rest of the conference.