Welcome back to The Attic. What a third day we are having at this Big Things conference, very interesting and intense. We've already had two of our keynotes, and we're going to jump into our third. I hope you're ready, because this promises to be very interesting as well. Yesterday, if you were at The Attic with us, you might remember, with Vieceslav Kopolajewski and Gonzalo Gaska, how it was possible to turn the popular Jupyter notebooks into production-grade code. Now we turn to Elyra, a set of open-source AI-centric extensions to JupyterLab. Our next speaker believes that Elyra has a great capability that he wants to share with us, as another way to take notebooks right down the pipeline. Let's welcome Nick Pentreath. He's a principal engineer at IBM. Nick, welcome. It's a pleasure to have you, and to have IBM at this Big Things Conference 2020. You know that we offer the audience the chance to ask you questions through the platform, so remember that in the last five minutes we hope to receive questions through my iPad as people send them through the chat. So I remind the audience to write to you, to send the questions for you to answer, within the next 35 minutes. Nick, is that okay with you? Perfect. Sounds great. Excellent. So, whenever you're ready, we're looking forward to listening to you.

Okay. Thanks very much. So, welcome to this Big Things talk on notebook-based AI pipelines with Elyra and Kubeflow. Certainly a very different experience from the previous Big Data Spain conferences that I've been to, but I hope that you're all having a great time, and thank you for joining me today. I'm Nick Pentreath. I'm a principal engineer at IBM, where I work for a team called the Center for Open Source Data and AI Technologies, or CODAIT. I focus on machine learning and AI applications within the open-source ecosystem. I've been involved with the Apache Spark project for a long time, where I'm a committer and PMC member, and I'm also the author of the book Machine Learning with Spark. And I present at various conferences, meetups, webinars and, more recently, a lot of online events around the world about the intersection of machine learning, data science, AI, and open source.

Before we start, a little bit about CODAIT, the Center for Open Source Data and AI Technologies. We're a team within IBM of over 30 open-source developers, advocates, and designers, and our focus is on improving the enterprise AI lifecycle in the open-source ecosystem. We work on foundational technologies for IBM's data and AI offerings. This includes the Python data science stack; Apache Spark, which is obviously a large part of that; deep learning frameworks like TensorFlow and PyTorch; AI fairness and ethics frameworks that have been released by our team, coming from IBM Research; orchestration platforms and workflow engines like Kubeflow; model deployment standards and frameworks, including KFServing, which I'll mention today; and open projects for sharing open datasets and open-source deep learning models, the Data Asset Exchange and the Model Asset Exchange. And of course the Python data science stack, including Jupyter, scikit-learn, pandas, and others, is a core component of that, and that's what we'll really be focusing on today: the intersection of Jupyter and Elyra. So we'll start with an overview of the machine learning workflow, talk about Jupyter notebooks, JupyterLab, and Elyra, give an introduction to Elyra, then hopefully the most interesting part, our live demo, and we'll wrap up.
So, starting with the machine learning workflow: this typically starts with data. We have a lot of data lying around in various places, most of it in fairly raw and somewhat messy formats, and it's not particularly useful in that format. So the first step is to analyze that data. Typically we need to process it into a format amenable to machine learning models. It does not arrive as a nice set of vectors or tensors that we can simply feed into an algorithm; we have to apply a lot of feature extraction, preprocessing, and various other transformation steps before we can actually train a model. We then have the training phase, which is itself a typical sub-workflow where we are trying out different models and different pipelines, training different models to find the one that performs best for a particular problem and our particular dataset. And then it's not much good to have a machine learning model without it actually being used for something of value in the real world, so we need to deploy that model. In reality this workflow is a loop, because typically, once that model is out there and we are predicting on new data, we still need to maintain it. New data is coming in, or the model is creating new training data of its own if it's interacting, for example, with users or the world around it. And that really takes us back to the beginning of the process, so it closes the loop.

Now, this workflow spans teams. The first of those is typically our data engineers, who are working with the data storage and providing access to other teams (access control, data provenance, and governance), making sure that all the other teams and parts of this pipeline can actually access the data they need. Then the data scientists and research teams are typically working on the middle part of this flow, which is taking in the data from all these sources and running their analytics, their preprocessing, and their machine learning pipelines. And then the deployment and maintenance phase is typically the province of your machine learning engineers and your production engineers. It also spans many different tools: various data formats and ways of storing data in data silos; for each phase of the data science workflow, a plethora of different tools, frameworks, and approaches; and then, when deploying, a number of different mechanisms and formats for exporting, and many, many different ways of deploying these models. So any medium to large organization is going to have to deal with all of these.

A core part of this workflow, and in particular this middle bit, the data science and machine learning research type of workflow, is iteration and experimentation. Typically, you don't have a team or a data scientist who can just take the data, immediately throw it into a pipeline, and get a model out. Most of the time, you have to start by analyzing the makeup of the data: what kind of variables and features there are, the characteristics and distribution of the data, and the different data sources that we can potentially combine. So the data scientist goes through this sub-workflow of loading and cleaning data, exploring it, and then interpreting that data. Now, that itself is typically not a one-off workflow. Most of the time, data science does not happen in isolation; it's part of a business need, and it's looking to fulfill a business goal.
So most of the time, that's going to be trying to answer questions for the business or solve a particular business problem. In many cases these questions are somewhat ill-defined, and as answers to the questions arise, more questions are asked, so we do more exploration. So we have this refining step where that analysis, the whole pipeline, is carried out again, incorporating new data, incorporating different ways of analyzing it, updating it, and so on. There's a lot of iterative process that happens here.

The same is true during the machine learning workflow phase. Here we have another sub-workflow, which is typically taking the raw data, extracting features, doing pre-processing and transformation, training a model, and then evaluating it on some sort of evaluation or test dataset to try to get a sense of how that model might perform in the real world when deployed. But again, this is not a one-off process. There's a lot of iteration and experimentation that happens here. The machine learning researchers and data scientists are going to try out different pipelines, different models, different ways of extracting features and pre-processing them, different parameters, and different combinations of all of these components in the pipeline. And over time, they are going to refine these different components and refine the process to get better and better models. And of course, as new data comes in, as the business landscape or the problem domain changes, as the underlying data distribution changes over time, customer behavior, and so on, there's a constant refining and iteration of this process. So that needs to be supported in these workflows.

And we've seen notebooks, in particular Jupyter notebooks, become the standard for this kind of interactive and iterative workflow, one that is also very content-rich. Notebooks have emerged to allow experimentation, data visualization, a lot of interactivity, and the kind of web-based functionality that is available in notebooks. So this is great, and it solves many of the requirements and problems for an exploratory, iterative workflow. But it does have some drawbacks, in particular when trying to move things to production. Notebooks, and the process by which notebooks are iterated on, typically result in a large, monolithic piece of code. One notebook, which starts small as an exploratory notebook, may end up having hundreds and hundreds of cells, doing everything, effectively the whole pipeline, in one place. So it becomes very difficult to actually work with that, to modularize those notebooks, and you can't just throw that monolith over the wall to production. Scaling is another issue: notebooks are typically written on a local machine, and it's not always easy to scale out notebook-based pipelines and make use of cluster resources, GPUs, and so on.

So this is really what Elyra is trying to solve. Elyra is a set of AI-centric extensions to JupyterLab notebooks. The name Elyra is an adaptation of one of the moons of Jupiter, and you can see here Elyra depicted as a moon orbiting in the Jupyter ecosystem. So, some of the key features of Elyra. Firstly, the visual pipeline editor. This is a canvas workspace for creating pipelines that are composed of multiple notebooks and Python scripts.
It's effectively a DAG, a directed acyclic graph, that can be built up to represent workflows, and we can see here an example where we're loading data, doing some data cleansing, and then branching out into multiple types of downstream tasks, including potentially machine learning and analytics. So this allows us to compose these Python scripts and notebooks into a workflow that can then be executed on various targets. Elyra allows you to run these locally, just in the local environment, but you can also run them remotely on Kubeflow Pipelines. At the moment that is the main backing target; there may be more in the future, but this allows you to essentially scale up your resources. Similarly, you can submit an individual notebook or script as a batch job running on Kubeflow Pipelines; this is effectively a single-node, single-stage pipeline, and it lets you push work to cloud resources from your local machine. You can also edit and execute Python scripts against these local or cloud-based resources. There are a few other features, like automatically generating a table of contents, which is a JupyterLab extension; a library of code snippets that lets you edit and maintain snippets that can be dropped into your notebooks; and Git integration, which allows you to check out projects directly from Git within JupyterLab, push changes, and collaborate with your team.

So we've seen how this workflow is made up of these different phases, and the goal here is to really create more modular pipelines with notebooks and scripts via Elyra. Instead of having this huge monolith, which becomes unwieldy, we split the work up into these kinds of logical units, and we stitch them together using Elyra pipelines.

Just briefly about Kubeflow: the idea behind using one of these workflow orchestration engines is to be able to scale up this modular pipeline that we saw before, and to run jobs that aren't limited by the resources of a single machine. Kubeflow is a very popular and increasingly common way of doing this. It's a machine learning platform that runs on Kubernetes, which means it's standardized and can be deployed on-premises, within a hybrid cloud, or on various public clouds, and it provides support for running Jupyter notebooks as nodes, as well as various other components. Kubeflow Pipelines is a platform built on top of Kubeflow that allows you to deploy these machine learning workflows and scale them up. You define a set of pipeline artifacts, which is compiled into a workflow specification, effectively a directed acyclic graph. You can upload that pipeline and then run it on your Kubeflow cluster. This allows you to scale things up across a cluster, and it allows parallel execution of nodes that can run in parallel; the graph structure is what makes that possible. So the idea with Elyra is that once we use the visual pipeline editor to create our pipeline, Elyra will take care of packaging it all up and creating the artifacts and a pipeline definition that will then be executed on Kubeflow. You can also obviously run pipelines locally, and we'll see some examples of that.

So I'm now going to switch over to the demo. We've seen a bit of the presentation around what Elyra is about, but hopefully the most interesting way of seeing how this works is seeing it in practice.
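A minimal sketch, not shown in the talk, of what a Kubeflow Pipelines definition looks like when written by hand with the kfp v1 SDK: a DAG of containerized steps compiled into a workflow specification. The image name, script names, and output paths are invented placeholders; Elyra generates an equivalent definition for you from the visual pipeline editor, so this is only to make the compile-and-upload idea concrete.

```python
# Hand-written Kubeflow Pipelines sketch (kfp v1 SDK). Elyra produces an
# equivalent definition automatically from the visual pipeline editor.
import kfp
from kfp import dsl


def run_step(name: str, script: str) -> dsl.ContainerOp:
    # Each step runs a script inside a container image (placeholder names/paths).
    return dsl.ContainerOp(
        name=name,
        image="example.com/my-notebook-runtime:latest",  # hypothetical runtime image
        command=["python", script],
        file_outputs={"data": "/tmp/output.csv"},
    )


@dsl.pipeline(name="flight-delays-sketch",
              description="Toy DAG: load -> merge -> analyze/predict")
def flight_delay_pipeline():
    load_flights = run_step("load-flight-data", "load_data.py")
    load_weather = run_step("load-weather-data", "load_data.py")
    merge = run_step("merge-data", "merge_data.py")
    merge.after(load_flights, load_weather)   # merge depends on both load steps
    analyze = run_step("analyze-delays", "analyze.py")
    predict = run_step("predict-delays", "predict.py")
    analyze.after(merge)                      # these two branches can run in parallel
    predict.after(merge)


if __name__ == "__main__":
    # Compile to a workflow spec that can be uploaded to and run on Kubeflow Pipelines.
    kfp.compiler.Compiler().compile(flight_delay_pipeline, "flight_delay_pipeline.yaml")
```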
So Elyra runs within JupyterLab, and you can see this is the standard JupyterLab launcher. Here we've got a set of Elyra-specific components, the main one being the pipeline editor, but you can also edit Python files, and we've got a link to the documentation. This is a demo project which is available on GitHub at github.com/CODAIT/flight-delay-notebooks, so you can go and check it out, and the link will also be available in the slides. Here I've got the project, and you can go and have a look at the pipelines that are in it. This is a view of the visual pipeline editor, and as you can see, we have a set of nodes, including Python scripts and notebooks, and they're connected together to represent this workflow, which is in the form of a DAG. You can also add comments to each node to give more context.

For example, a common use case is to modularize and abstract out common patterns, such as loading data from specific data sources. Here we've got two load-data nodes, which are actually using exactly the same script, just with different arguments: one is going to load a set of data about flight delays, and the other is going to load weather data. Our goal here is to try to analyze and predict flight delays and the potential causes of those delays, or the factors involved in whether a flight will be delayed. And we also want to try to enrich that analysis and the model-building process with some extra data, in the form of weather. So we're going to be using two datasets that are hosted on the Data Asset Exchange, which is a project that I mentioned earlier. It's an open repository for freely available and open datasets, many of which come from IBM Research, and you can find it at ibm.biz/data-exchange. These particular two are not IBM-specific datasets; they're common public datasets involving flight delays for US airlines. In particular, we're going to be looking at JFK airport, so we've got a dataset for the weather at JFK airport.

Each of these nodes has a set of properties. Here we can see the file name. We have a runtime image; this applies when we execute on the Kubeflow platform, and this one is going to be running with pandas. There are predefined images for most of the common frameworks, and you can also create your own images and use them. You can see that we can pass in some information in the form of environment variables; for example, here we have the dataset URL where the data is going to be downloaded from. And we can define output files. Now, these output files are not really used in local execution, but when we're running on Kubeflow, we need to define the outputs of a node, and that means those outputs are going to be available to all downstream tasks. As you can see, these two nodes are essentially exactly the same, except for the dataset URL. Once we've done the data loading, we then move on to processing each dataset, and the output of each of those processing steps is fed into the merge node. That's where we combine these two datasets, joining them together, and extract the features that we then want to use for our downstream tasks. And those tasks are, number one, analyzing flight delays, where we're doing some exploratory analysis and visualization, and, number two, predicting flight delays, where we're building machine learning models. Now, in order to execute these, you can simply click on the run button.
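A minimal sketch, not from the demo repository, of what such a parameterized load-data node could look like as a script. The environment variable names DATASET_URL and OUTPUT_PATH and the default output path are assumptions for illustration only.

```python
# load_data.py - hypothetical "load data" node. The same script serves both the
# flight-delay and the weather dataset because the source URL is injected through
# the node's environment-variable properties.
import os
import urllib.request

# Values supplied by the pipeline node's properties (names assumed for illustration).
dataset_url = os.environ["DATASET_URL"]
output_path = os.environ.get("OUTPUT_PATH", "data/raw.csv")

# Declared as a node output, so downstream pipeline steps can pick up this file.
os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
urllib.request.urlretrieve(dataset_url, output_path)
print(f"Downloaded {dataset_url} to {output_path}")
```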
And this gives us a run pipeline dialog. We just need to enter a pipeline name, and we can then run in place locally, or we can point it at a Kubeflow installation. In this case I'm just running locally, but that could also be a cloud-based or on-premises Kubeflow cluster. So for now, we can just execute locally. You can see here that the pipeline is being executed locally by JupyterLab, so it's running in the same process, in the same environment, as JupyterLab itself. As that runs, it's going to be downloading data from an external location and performing processing, and when running locally, that all ends up here in a directory on our local machine. Of course, when you're running on Kubeflow, Kubeflow will take care of that.

While it's running, we can take a look at some of the notebooks. These are fairly typical steps in a pipeline: reading the raw data, performing some cleanup operations, some basic transformation and filtering, and then saving that processed data for downstream tasks. Similarly, when we're merging, we now have access to the data coming from the previous steps in the pipeline. We load both of these datasets, perform the merge by joining them together, and potentially do some further processing; in this case, we simply save it out. We then have an analysis notebook, again following a similar pattern: reading in the data and starting to perform some basic analysis. This particular dataset is a small subset of the entire dataset, which runs to many gigabytes, over 80 gigabytes uncompressed, and we're working with a much smaller sample for demo purposes. It's further restricted to just the JFK airport departures, and we're looking at the departure delay and whether that can be predicted. In this case, a flight counts as delayed when it is more than 15 minutes late. You can see that 80% of flights are on time and 20% are delayed.

If we start digging into a typical analysis that we might want to do as a data science project, we could look at delays over time to see if there are any trends, and we can start analyzing the relative departure delay by different factors in our dataset. We can look at aspects of the flight itself, such as the day of the week, and these are all the great tools within the Python ecosystem that are at our disposal. Then we can also look at departure time brackets, airlines (to see if any particular airlines seem to be more delayed than others), destination airports, and so on. What we really want to do here is also incorporate multiple different datasets into the analysis. So here we can start looking at the different weather features, for example whether drizzle, snow, mist, or thunderstorms have an impact on delays. And it becomes immediately obvious that certain weather features are potentially interesting; for example, thunderstorms and snow, and sometimes mist, may have an impact. The final step is going to be actually predicting these flight delays: can we take this data and create a model where we can actually predict the flight delays? This particular notebook shows the typical approach of creating your training and test data splits, encoding your categorical and numerical variables, and training and evaluating models using standard scikit-learn cross-validation.
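A minimal, self-contained sketch of the modelling pattern just described: join flight records with weather by date, encode categorical and numerical features, and evaluate a classifier with scikit-learn cross-validation. The column names and the tiny inline dataset are invented; the actual notebooks in the demo repository work on the full flight-delay and weather datasets.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny invented sample standing in for the processed flight and weather data.
flights = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"] * 5,
    "carrier": ["AA", "DL", "AA", "B6"] * 5,
    "dep_hour": [7, 9, 17, 21] * 5,
    "delayed": [0, 0, 1, 1] * 5,          # 1 = departed more than 15 minutes late
})
weather = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02"],
    "snow": [0.0, 1.2],
    "wind_speed": [8.0, 22.0],
})

# Merge step: enrich each flight record with the weather observed on that day.
data = flights.merge(weather, on="date", how="left")
X = data[["carrier", "dep_hour", "snow", "wind_speed"]]
y = data["delayed"]

# Encode the categorical feature, scale the numerical ones, then fit a classifier.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["carrier"]),
    ("num", StandardScaler(), ["dep_hour", "snow", "wind_speed"]),
])
model = Pipeline([("prep", preprocess),
                  ("clf", RandomForestClassifier(n_estimators=50, random_state=0))])

# Cross-validated accuracy estimate for the delay classifier.
print(cross_val_score(model, X, y, cv=3, scoring="accuracy"))
```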
And again, we can visualize the results, create our classification reports, and do our analysis of things like feature importance and so on. Okay, so you can see here that the pipeline is still running locally, almost finished. But we can also kick off a pipeline that runs on Kubeflow, and that's as simple as selecting the Kubeflow Pipelines runtime here. You can see that Elyra takes care of processing all the dependencies, packaging them up, and sending them to Kubeflow to run on, so we'll be able to have a look at that once it starts. For now we can go and look at a previous run. This is the Kubeflow Pipelines UI. We can see that our local execution was successful, and we've also got run details for the remote execution; our Kubeflow pipeline run is just beginning and is in progress at the moment. Here we can see that the completed run looks very similar to the Elyra pipeline, and of course we can then go and have a look at things like the logs. If anything goes wrong, this is where we would spend our time trying to figure out what the issues were.

What's interesting here, and this is a bit of a sneak preview of forthcoming features in Elyra (it isn't yet available in the main branch, but should be there in a future release pretty soon), is that, as we saw in our predicting-flight-delays notebook, we often have a set of outputs, for example visualizations or scores. We may have our metrics, our confusion matrix, and the various classification model accuracy metrics that we want to analyze. Now, bear in mind that the notebooks get updated in place when we run locally, but when we're running on Kubeflow Pipelines, they actually get stored to an object storage location. Here we can see an example of that with local object storage, but this could be S3 or Google Cloud Storage or IBM Cloud Object Storage, whatever you want to use. All the artifacts are going to be sent there. So we can then have a look at, for example, the predicting-flight-delays output: we can download it and open it up, and you can see that it gives essentially the same results as we saw in our local execution. This is the output from the Kubeflow pipeline that executed. But this still means that we have to go and find the results in object storage, and we may want to just have a quick look at what happened and get a sense of the outputs, the metrics, and so on for this pipeline.

So Elyra will soon allow you to export metrics, as well as visualizations, to the Kubeflow Pipelines UI. Here we can see we've got some metrics, we've got model accuracy scores, and we've got a confusion matrix, which we can pop out, have a look at, and analyze. This lets us, within the Kubeflow Pipelines UI, take a quick look and see whether things are as we expect, without having to dig through our object storage, find the correct run, download the data, and so on. Okay, so that's the basic pipeline. One more thing which is interesting to look at: there's another version of the pipeline that we've created which also encompasses deploying the model.
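For context, Kubeflow Pipelines can surface metrics in its UI when a step writes a small JSON file with an agreed name, /mlpipeline-metrics.json; a sketch of that convention follows. The metric values are invented, and it is only an assumption that the upcoming Elyra feature builds on this particular mechanism.

```python
# Sketch of the Kubeflow Pipelines (v1) metrics convention: a pipeline step writes
# /mlpipeline-metrics.json and the KFP UI displays the listed metrics for that run.
import json

metrics = {
    "metrics": [
        # Invented example values for illustration only.
        {"name": "accuracy-score", "numberValue": 0.83, "format": "PERCENTAGE"},
        {"name": "f1-score", "numberValue": 0.61, "format": "RAW"},
    ]
}

with open("/mlpipeline-metrics.json", "w") as f:
    json.dump(metrics, f)

# Richer visualizations (for example a confusion matrix) use a similar companion
# file, /mlpipeline-ui-metadata.json, which points the UI at the rendered artifact.
```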
So model deployment is obviously a whole field in itself, but here we have a node that is actually going to deploy the model to KFServing. And here's an example of a run which did that: it has the deploy-model phase, and if you have KFServing running in your Kubeflow cluster, it allows you to deploy the model to that service. You can also do this locally if you're running a local deployment. So this takes the trained model output from the last node of the pipeline and deploys it, and here we see an example where we actually tested this model in the notebook to check that it's working. We can also send some data on the command line and get back a set of predictions, and you can see that this inference service is running in KFServing. All of that information is in the GitHub repo: you can go and have a look at running the Elyra pipeline with model deployment, and there's a set of instructions for it. It's a little bit more involved to get KFServing up and running, but the instructions there tell you how to do that.

Okay, so I hope that has given an overview of the power of Elyra and the way that we can really bridge both the iteration, experimentation, and flexibility that come with notebooks, and the scalability, compute power, and parallelization that come from a workflow orchestration engine like Kubeflow. Okay, so let's go back to the slides. It's really easy to get started with Elyra. You can try Elyra on Binder from the browser without doing any installation, you can run it from a pre-built Docker container on your laptop, and you can also do a full install on your local machine. So check out ibm.biz/elyra-demo, and check out Elyra on GitHub. The GitHub repo for the demo that I've shown today is at github.com/CODAIT/flight-delay-notebooks. There's also another really interesting project that my colleagues on the team have created, which is very similar, using Elyra for an analytics pipeline focused on COVID-19 analysis; that's at github.com/CODAIT/covid-notebooks. I encourage you to get active and get involved in the community: find us on GitHub. There are many, many different ways you can get involved in the project, from suggesting improvements, trying it out, and submitting bug reports, to helping with reviews and joining the community. So thanks very much for joining me today. It's been a real honor to be part of Big Things. You can find out more about what we do as CODAIT at codait.org, and you can follow us on Twitter, GitHub, and developer.ibm.com. I've mentioned the Data Asset Exchange, where we got a couple of the datasets used in the demo today; I encourage you to check out the Data Asset Exchange on IBM Developer, as well as IBM Cloud, where you can run Kubeflow Pipelines and connect Elyra to them. Also, don't forget that IBM has a virtual booth at the conference. I encourage you to go and check that out; you'll find a lot of folks there to talk to about data and AI, and you can ask questions of the various teams on the booth. Thanks very much.

Thank you so much, Nick, for that in-depth explanation. They were actually asking if you could share your presentation, but I believe you've already given the link, right? I don't know if we could go back a couple of slides to show it again. Sure. So people can have enough time to take a picture.
There's the link, and I'm not sure if the chat actually works yet, but I can certainly put it in the chat. Excellent. We can see it now, so we're going to give some time for people to take a picture or write it down. I don't think it's necessary to drop it in the chat, but if you're willing to do so, fantastic. Also, Nick, they're asking about the sneak preview that you mentioned, when it's going to be available. I don't think you said the date. I didn't mention the date. As far as I recall, it is not in the next release that's coming up, but it should definitely be in 2.0. I think we're on 1.4.1 of Elyra now. I'd say it's not very long, I think a few weeks. But in 2020, then? I hope so, yes, I hope so. Excellent. I hope so. Nick Pentreath, it's been fantastic having you, Principal Engineer at IBM. Thank you so much for your time and for this in-depth explanation. I think you've given us almost all your secrets. Nevertheless, we'll stay tuned, and thank you for all those links; we'll follow suit. We hope to see you here at the next edition of Big Things Conference. Until then, all our best regards to you and to IBM. And see you very soon. Thank you, Nick.