Welcome to our joint presentation, prepared for you through a collaboration between Shell and Arrikto. Here we will share the story of our journey building a Kubernetes- and Kubeflow-based machine learning platform at Shell. Let's introduce the speakers. My name is Alex Jankowski. I am a technical leader for ML Orchestration at Shell New Energies, and I'm also a Docker Captain. And I'm Vangelis Koukis, CTO and co-founder at Arrikto.

Before we get into the content, I am required to show you this disclaimer slide, which points out that when we discuss net-zero emissions we rely on estimations and projections. The actual results will depend on many factors, including the choices that society and our customers make.

In today's presentation we will first take a look at the business context and the use cases we aim to address. Then we'll discuss the technical challenges associated with these use cases and share some lessons learned. We will dive deeper into a few of the technical details, and of course we look forward to showing you an interactive demo of an end-to-end machine learning workflow. And last but not least, we'll have a live Q&A session.

Let me start by introducing Shell Global. We're a group of energy companies with more than 80,000 employees in over 70 countries. We use advanced technologies to innovate and help build a sustainable energy future. In 2015 in Paris, France, 195 countries signed an accord to reduce carbon emissions in order to limit the rise of global temperatures to under 2 degrees Celsius above pre-industrial levels. To achieve this goal, humanity must drastically reduce greenhouse gas emissions, reaching a point of net-zero emissions within the second half of this century. Experts agree that global energy demand is likely to double by 2050 compared to the demand in the year 2000. At the same time, greenhouse gas emissions must be drastically reduced in order to get climate change under control. That is why Shell has set itself an ambition to become, by 2050 or sooner, a net-zero emissions energy business. This is accompanied by our ambition to provide a reliable electricity supply to 100 million people in the developing world by 2030.

The world's energy system is changing. Shell is investing in more and lower-carbon technologies. This includes renewables such as wind and solar, new mobility options such as electric vehicle charging and hydrogen, and an interconnected power grid. Shell is investing up to two billion dollars a year in its New Energies business, which focuses on developing more and cleaner energy solutions. I work at Shell New Energies building the energy platform and digital foundation for the business. I invite you to follow the link on the slide for a short introduction video.

In the context of the Shell New Energies digital foundation, there are several areas where machine learning can shine when applied to solving business challenges. Some of these areas are listed here. They range from operational optimization to energy trading to increasing value for our customers. Where there is machine learning, there is a need for machine learning orchestration, and with that come a number of technical challenges that we have been working on solving together with Arrikto. Next, let us discuss some of these technical challenges and our approach to solving them. In the next couple of slides we'll enumerate nine different groups of challenges. I'll elaborate on the context and ask my co-presenter to comment on our solutions.
Infrastructure needs to be cloud-native but also cloud-agnostic, so we don't have to rewrite code if our platform needs to run on a different cloud, or even across clouds at the same time. Deployments need to be reproducible, auditable, and reversible, because we want to know exactly what runs in our systems and how it got there. Scale needs to cover the entire spectrum from micro to hyperscale; our software must run well on a single laptop as well as at large scale in the cloud. Tooling should be web-based, so we don't require anyone to install software locally, and self-service, so we can keep our ops teams focused on automation rather than repetitive manual tasks. Compute must be treated as an ephemeral resource that is available when needed but goes away when not in use. The workloads running on these compute resources must be resilient and reproducible, able to serve their purpose even if the underlying infrastructure changes. Over to you, Vangelis.

So Shell came to us with these challenges, and here are the solutions we came up with and have implemented together. For infrastructure, we opted for auto-provisioned Kubernetes clusters and used the Kubernetes API everywhere. The Kubernetes API is the common language we use to orchestrate workloads regardless of the cloud or region we're speaking to. Then we built on Kubeflow as the de facto way of running machine learning workloads on top of Kubernetes, and more specifically the Arrikto Kubeflow stack, which combines Kubeflow with Rok, our data management software; we'll talk more about Rok later on. How do we deploy Kubeflow on top of different Kubernetes clusters? We follow a GitOps-based methodology. Everything starts from a Git repository. We have opted to steer away from the kfctl tool so we can deploy in the simplest, Kubernetes-native way of applying manifests, and hence we can support seamless upgrades with rollbacks. On scale, Kubernetes can run from a single laptop, for example with MicroK8s. Arrikto delivers its own single-node deployment of Kubeflow, called MiniKF; we encourage you to try it out. With Shell, we run on managed Kubernetes services, more specifically EKS on AWS, where we can autoscale seamlessly either up or down based on customer demand. On tooling, we follow DevOps practices and try to shift left. Users run their own code servers, for example Visual Studio Code. They run their own JupyterLab servers. They manage their own code in a Git repository, for example provided by GitLab. Eventually, they run their own workloads on Kubeflow. And finally, on compute, being able to run with reproducible results means using containers based on Docker images inside Kubernetes pods. But this is not the end-to-end story on reproducibility, because what about your actual data, the things that you use as input? More on this later on. So, more challenges, Alex?

Sure. Storage needs to be fast and cost-efficient, so we can decouple compute and storage concerns. Data should be secured at all times, yet available for authorized users, and versioned, so we know what changes happen over time and can roll back in time as needed. Security must be end-to-end, enterprise-grade, and integrated with our corporate identities. And finally, orchestration needs to be transparent and non-disruptive to our users' work. We also want any user to be able to easily orchestrate reproducible workloads, as opposed to relying on a dedicated orchestration engineer.
So, for storage, we moved from a shared file system over NFS, that is EFS on AWS, to super-fast, locally mounted file systems provided by Rok. We will focus more on this in the next slide, but the basic idea is that local storage is super flexible, you just access local files, and the performance is orders of magnitude faster. But what about data management? Rok sits on the side of this local storage and gives you thousands of point-in-time snapshots. Think of a time machine. Let's say we snapshot once every 10 minutes for your notebook servers, so you can go back in time and reproduce the data for all the experiments you run. Or a snapshot happening at each step of a pipeline, so you know exactly what series of events led to the creation of a specific model, and then you can investigate any biases. This functionality gives end-to-end reproducibility for workloads.

On security, we have implemented single sign-on, single logout, and centralized authentication and authorization against different namespaces. What we generally do is create a private namespace per user, but we can also support shared namespaces where different users share access to the same set of resources, for example the same pipeline runs.

And finally, on orchestration, we implement MLOps. We combine the power of Git with Kale, our open source tool, which converts notebooks into production pipelines and then orchestrates hundreds or thousands of these pipelines for hyperparameter tuning and, eventually, serving. The glue that brings all of this together is the metadata we record at each phase of the workflow. If you want to know more about MLOps, Arrikto has a joint tutorial with Google, "From Notebook to Kubeflow Pipelines to KFServing: the Data Science Odyssey." It's on November 20th at 12:10 Pacific, and I encourage you to attend it.

So let's take a deeper dive into three of the points we just mentioned: GitOps-based deployments, storage, and orchestration of end-to-end workflows. What is GitOps? Here's a deployer on our left-hand side, and here's a Kubernetes cluster where they want to deploy Kubeflow on the right-hand side. GitOps is all about a Git repository; it sits in the middle. The deployer commits their desired state of the cluster as YAML manifests, and they only apply committed manifests to their Kubernetes cluster. They can use standard kubectl apply, or even Kustomize; we'll talk more about Kustomize later on. Why is this important? Because they treat their infrastructure as code. The state of their infrastructure corresponds to a commit in the repository, and their infrastructure goes from commit to commit. But most of the time, the actual manifests come from a vendor. In this case it's us, Arrikto, who produce the Arrikto Kubeflow stack. So let's use ourselves as an example. Arrikto itself implements GitOps. We publish generic vendor manifests in a vendor repository. The deployer clones this repository into a local repository. Then the deployer creates deployment-specific commits, we call them kustomizations, with a k, because we use Kustomize for them, on top of the vendor commits. Eventually, the deployer uses Kustomize to combine these manifests into the final desired state, which they apply onto their Kubernetes cluster.
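As a rough illustration of this flow, here is what it might look like from the deployer's side, with the shell commands wrapped in a small Python script purely for illustration; the repository URL and overlay directory are hypothetical, and the exact layout of the vendor manifests will differ.

```python
"""A minimal sketch of the GitOps flow described above, assuming a hypothetical
vendor manifests repository and overlay directory. Only standard git and
kubectl commands are used; they are wrapped in Python purely for illustration."""
import subprocess


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1. Clone the vendor's generic manifests into a local deployment repository.
run(["git", "clone", "https://example.com/vendor/kubeflow-manifests.git", "deployment"])

# 2. Commit deployment-specific kustomizations on top of the vendor commits
#    (edit the overlay's kustomization.yaml, then git add / git commit as usual).

# 3. Build the final desired state with Kustomize and apply only committed manifests.
run(["kubectl", "apply", "-k", "deployment/overlays/my-cluster"])

# 4. When the vendor publishes a new version, fetch and rebase the local commits
#    on top of it, then re-apply the same way.
```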
Using GitOps makes upgrades a breeze. Let's assume the vendor is at version V2, and the deployer has committed deployment-specific configuration as commit D1 on top of V2. At some point, the vendor produces V3. The deployer pulls and rebases their changes, so they now sit on top of V3 as a new commit, D1 prime. They have essentially upgraded their infrastructure and can now reapply.

Next, let's talk about storage and data management. Shell used to run over EFS on AWS, a managed file system over NFS. This solution has two main drawbacks. First, there is no data management and no backups. Anyone can change the shared state, so there really is no way to reproduce an experiment after a while, because the data has moved on. And secondly, performance suffers. Baseline performance on EFS depends on its size, and it is 50 KB/s per gigabyte; a 100-gigabyte EFS file system, for example, gets only about 5 MB/s of baseline throughput. Compare this to running with Rok, which uses the local storage that comes with your instances. Rok solves the reproducibility problem by giving you automated, thin, application-consistent snapshots, so you can go back in time and reproduce your results. Every snapshot is essentially a Git commit for your data, not just your code. Rok archives these snapshots into object storage, for example S3 in the case of Amazon. On performance, if we compare on a standard m5d.4xlarge instance, the difference is huge. Local NVMe gives you 16 times the I/O operations per second for reads and more than 400 times the I/O operations per second for writes. Bandwidth is 18 to 21 times better. And the aggregate numbers scale with the number of instances.

Let's take a minute to talk about how Rok works underneath. Here's a single Kubernetes node. Rok runs as a pod on the side of your workloads, which also run as pods. These pods have a direct path, path A, via the kernel to local storage. Rok sits on the side and monitors the I/O operations, so it can retrieve the changed data and produce a new snapshot, which it archives into S3; this is path B. At a later time, Rok can restore data from this snapshot, path C on the slide. This is Rok running on a Kubernetes cluster with multiple nodes, all coordinating access to an object store within the same region. But it really becomes interesting when you look at Rok running in multiple regions as independent clusters. In this case, each one of the Kubernetes clusters accesses a local, independent object store. So let's look at this example where Rok runs in an Amazon zone, top left, a Google Cloud zone, bottom left, on-prem, top right, and even on a laptop, bottom right. Our architecture brings all of these regions together via the Rok Registry, and it allows them to synchronize their snapshots in a peer-to-peer fashion over the blue links. So why would you need to synchronize your snapshots, that is, your data commits? Because, generally, different parts of a data science workflow run in different locations. For example, you experiment in one location, locally in this case, then you move to the cloud, for example Amazon, to train at scale. Then you run inference in production, and this happens at multiple locations. We, Arrikto, have extended Kubeflow so it works with Kubernetes volumes directly underneath, and then we take care of synchronizing data commits of these volumes across locations.
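To make the storage numbers above concrete, here is a quick back-of-the-envelope check in Python, using only the figures quoted in the talk; the NVMe multipliers are rough, per-instance comparisons rather than guaranteed numbers.

```python
# Back-of-the-envelope check of the storage figures quoted above.
# EFS baseline throughput scales with file-system size at roughly 50 KB/s per GB.
efs_size_gb = 100
baseline_kb_per_s = 50 * efs_size_gb            # 5,000 KB/s
print(f"{efs_size_gb} GB EFS -> ~{baseline_kb_per_s / 1000:.0f} MB/s baseline throughput")

# The talk's rough comparison for local NVMe on a single m5d.4xlarge:
# ~16x the read IOPS, ~400x the write IOPS, and ~18-21x the bandwidth of the
# shared file system, with the aggregate numbers growing as you add instances.
```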
And with this, let's move to Alex, so we can talk about security and isolation. Thanks, Vangelis. Enterprise standards require that all services and applications are secured. We need to use the same user identity throughout all secured assets. Therefore, we've implemented single sign-on. One lesson we learned in the process is that it is not necessary to integrate all of our applications with the enterprise identity provider. Instead, we can use a self-hosted OIDC provider, which in turn federates with the enterprise IdP. We use GitLab, but other OIDC providers can be used as needed. In the case of Kubeflow, this means that we're able to provide each user their own isolated namespace, as well as a shared namespace where users can collaborate on projects. To have end-to-end, enterprise-grade security, the journey continues a step further. In addition to single sign-on, we have implemented single logout functionality, which allows us to log out directly from our application screens. I'll ask Vangelis to explain how we're able to do this within Kubeflow and then dive into the orchestration details.

Thanks, Alex. So, we have extended Kubeflow so it becomes one more OIDC client, via a component we call the OIDC AuthService. This component integrates closely with Istio, and more specifically the Envoy proxy inside Istio. I encourage you to learn more about this architecture by following the Kubeflow docs and the number of blog posts we have published.

Finally, let's look at an end-to-end workflow made possible with our Kubeflow stack. Here's a data scientist, and they have a description of their ML pipeline inside a JupyterLab notebook. Each one of the cells, or a group of cells, corresponds to a different step of the pipeline, for example data preprocessing, model building, model training, model evaluation, steps one to four. Kale, our open source tool, packages a notebook, or even a Python script, which is what we'll demo later on, into a Kubeflow pipeline. Here is the compiled Kubeflow pipeline. Kale then orchestrates this compiled pipeline for hyperparameter tuning. It spawns hundreds or thousands of runs to find the best combination of hyperparameters. Finally, it chooses the best model produced from hyperparameter tuning, and it serves it via KFServing. Note how we maintain metadata for each part of the process inside Kubeflow's MLMD component. Also note how every step uses Rok-provided volumes to maintain all of its input and output data. This is important because Rok can then snapshot these volumes at each step of the workflow, for each one of the individual pipeline runs, thinly, and it can maintain thousands of snapshots. This means we can reproduce each and every one of these pipeline runs forever.

So why did we choose to work with Kubeflow? Well, the reasons to choose Kubeflow are obvious. It runs natively on Kubernetes in a scalable and reliable way. It is secure and integrates with external identity providers. It runs both on your laptop and in the cloud. You can use the open source version or purchase a commercial version with enterprise support. It integrates with the way you currently do your work, streamlining and accelerating your ML workflows. One example we'd like to share here is a case where we needed to build 10,000 models. Normally, the time required to do this work manually would involve about two weeks of coding and four weeks of execution. With Kubeflow, we were able to write the code in less than a day and build all the models within two hours.

Seeing is believing, so let us jump into my favorite part of this presentation. In the demo, we will use roleplay to demonstrate an end-to-end machine learning pipeline orchestration workflow. I will be a data scientist, and Vangelis will be an MLOps expert. Hi, Vangelis. Hey, Alex. How have you been?
Good. I've been working on this really cool project, and I need your help with it. Okay, interesting. What does the project do? It's a data science project about renewable energy. The short of it is that the state of California has a goal to generate 60% of its electrical energy from renewable sources by 2030 and become the first carbon-neutral state in the U.S. by 2045. I'm really excited about this and built a project that predicts what percentage of California's energy will be generated from renewable sources over the next 30 years. You can see all the details in the Git repo that I shared with you. Wow, that's pretty cool. How can I help?

I need this project to be orchestrated as a machine learning pipeline, so I can run it when I want to check on the progress towards achieving these carbon neutrality goals. I've heard that we need to sit together and iterate on building a container, so you can then build the pipeline, then use the pipeline to run the code on a schedule, and if something is not working, we need to go back to step one. I'm really nervous, because this sounds like it will take a while and I need to be done before 6 p.m. today. Well, don't worry, because what you just described is how we used to build orchestrations, but you don't have to do this anymore. With the new Arrikto Kubeflow stack, you can do everything very quickly by yourself. Oh, that sounds great. Would you walk me through it?

Sure. It's easy. So let's start by logging on to Kubeflow, using your single sign-on to do that. So now, are you logged in? Yes, you are. There's a shared namespace. Top left, the shared namespace is kubeflow-kubecon, yes. So we share it between us. Then go to notebooks. You can always create a new server, but I've already created one. I've cloned your repo, so let's connect to it. Your code should be there. Oh, I see that my code is already here. So have a look at it. What is this pipeline branch? So I've created a pipeline branch to show you how you can transform your code into a pipeline, and I've already sent you a pull request that shows the changes you need to make. Let's take a look at this. So what I see here is that you've imported the pipeline and step decorators from the Kale SDK, and then just used them in the code to annotate my functions. This is it, nothing else. You just annotate the functions in your existing code.

All right, but wait, will my existing code still work the old way I used to run it? Just run it locally. That's all you need to do. All right, let me test that. So at this point, your code runs locally inside your JupyterLab. I see that it looks a little different, but I think that's because it's being run by Kale. That's pretty cool. It ran, and what do I need to do to run this in Kubeflow? You just run the exact same thing, but add --kfp at the end of the command line. This is it. Okay. It seems to be running. What is it doing? You're now running against Kubeflow Pipelines. At this point, Kale compiles your code into a Kubeflow pipeline, and it pushes it to Kubeflow Pipelines. It uses Rok to take a snapshot of your JupyterLab volumes automatically, so your pipeline can be completely reproducible in terms of both your code and your data. You can actually go to Kubeflow and see the result, see the pipeline run. And I'd like to know more about how your code actually solves the problem, right? So let's take a look at the run. I see that the pipeline was generated here, and I recognize all the steps from my workflow. So first, we scrape the data from CAISO's public reports on energy usage, and then we pre-process that data. We split the dataset into training and testing sets. Then we use a few techniques to build models and rank them, and then we train the best model with the full dataset and predict 30 years into the future.
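To give a feel for what those changes look like, here is a minimal sketch of a script annotated with the Kale SDK, mirroring the steps just described. The function bodies are placeholders, and the exact import path and decorator arguments are assumptions that may vary across Kale versions.

```python
# A minimal, hypothetical sketch of the pattern the demo describes: plain Python
# functions annotated with the Kale SDK's step/pipeline decorators. Import path
# and decorator arguments are assumptions and may differ between Kale versions.
from kale.sdk import pipeline, step


@step(name="scrape_data")
def scrape_data():
    # Download the public generation reports (placeholder).
    return "raw.csv"


@step(name="preprocess")
def preprocess(raw_path):
    # Clean the raw data and split it into training and test sets (placeholder).
    return "clean.csv"


@step(name="train")
def train(clean_path):
    # Build candidate models, rank them, and retrain the best one (placeholder).
    return "model.joblib"


@step(name="predict")
def predict(model_path):
    # Project the renewable-energy share 30 years into the future (placeholder).
    print("forecast written")


@pipeline(name="ca-renewables", experiment="kubecon-demo")
def ml_pipeline():
    raw = scrape_data()
    clean = preprocess(raw)
    model = train(clean)
    predict(model)


if __name__ == "__main__":
    # `python pipeline.py` runs the steps locally as ordinary Python;
    # `python pipeline.py --kfp` asks Kale to compile the same code into a
    # Kubeflow pipeline and submit it, as shown in the demo.
    ml_pipeline()
```

The only difference between running locally and running on Kubeflow is the `--kfp` flag on the command line, which is exactly the workflow the demo walks through.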
Let's see the actual pipeline run. So this is the actual pipeline, the one that you just submitted, running. I like this a lot. I feel like I'm in complete control of running my code the same way outside of Kubeflow and inside Kubeflow, and the only thing I need to do to control that is add that --kfp flag to my command. Exactly. Cool.

Now that the pipeline run is finished, we can check the results. What do these results mean? Well, the model seems to show that the goals are pretty ambitious, but the margin of error is very large, because we're predicting pretty far into the future. I think the outcome will depend on the little strides that are made every day towards achieving California's renewable energy goals. It would be interesting to watch and see how this prediction changes over time, and that's why I'll schedule this pipeline to run automatically every month. Vangelis, I can't believe how easy this whole process was, and I'm so happy it didn't take much time at all. I'll be orchestrating all my ML workflows only this way going forward. Thank you so much. I'm happy to help, Alex. I think you can now log off, and we can call this project done, way ahead of your deadline. And this is the kind of experience our teams love.

Today, we shared challenges and lessons learned while building scalable and highly available MLOps infrastructure applied to real-world use cases. We discussed how Kubeflow and the Arrikto stack help us solve these challenges, and demonstrated a data science workflow from zero to hero, made possible by Kubernetes and Kubeflow. We want to say a big thank you to our teams, quickly share a few references, and let you know that you can reach us offline with any follow-up questions. But now we have a few minutes to take live questions from the audience. Thank you.