So for the next presentation, we have Akshay and Russ from Apple. Thank you, guys.

Thanks, everyone, for coming to our talk today. Akshay and I will tell you about scheduling notebooks on Kubernetes. You've seen a lot of Jupyter in other presentations, so we'll tell you a little more about Jupyter and how we use it a bit differently. To introduce ourselves: Akshay is a senior engineer who has been key to the foundation and evolution of the notebooks platform used at Apple. I lead the interactive data science team in our data platform, and our charter is to provide notebooks and complementary technologies for use across Apple, allowing us to gain value from large data. For the agenda today, we'll talk about Jupyter notebooks, our usage of them, executing notebooks on Kubernetes, Airflow on Kubernetes, and then scheduling notebooks with Airflow. To talk more about how we use Jupyter notebooks: we benefit from the open source Jupyter ecosystem by providing a rich user experience for data activities, with notebooks at the core via JupyterLab. Essentially, we put an interactive data development environment at the fingertips of data engineers, data analysts, and data scientists. Our environment allows data engineers to prototype production code for their needs, and gives analysts and scientists the ability to write queries, write code, and experiment. What's great about this? JupyterLab is essentially an in-browser IDE that lets users build narratives and explainable documents in notebooks, using both code and text. Our users can dynamically launch interactive environments to use Python and/or Spark to power their notebooks, and the Jupyter ecosystem gives us good extension points to integrate with other services and expand the user experience. Today our users can collaborate by sharing notebooks via Git or our internal sharing service.
To talk a little bit about our data tool set: there are a variety of data sources, including HDFS, S3, and Cassandra. Additionally, there is the Python and Scala ecosystem, like I mentioned, and we also support multiple flavors of Kubernetes. So there are quite a few technologies at play. Our users are able to launch small or large on-demand interactive clusters for writing code or queries using Python and/or Spark. Given this tool set, our users' workflows can be non-trivial, which makes sharing and collaboration a critical component of our user experience. Now, to talk a little bit about notebooks at cloud scale and complexity: we have a diverse user base using large amounts of compute to access and process large amounts of data across multiple clouds. Essentially, we provide a streamlined, cloud-based interactive data development environment powered by Jupyter, with hundreds or thousands of Kubernetes pods' worth of CPU, memory, and disk available at our users' fingertips. But what's not great today? Since notebooks have mostly been used interactively, it's been a manual process to share the code, outputs, and knowledge buried in our users' many notebook files. Additionally, some data scientists or data engineers have to prototype code in a notebook and then hand that work to a peer, who uses a separate means of promoting the code to a production environment. To reproduce work on an ad hoc or semi-regular basis, a notebook has to be manually executed by an individual or by team members. And to debug a failed data processing pipeline, one has to bring the code back into a notebook to pinpoint the error, fix it, and then manually propagate that fix back into the pipeline. Finally, Jupyter doesn't handle large-scale, long-running code well today, as connectivity issues can lead to lost output. Ultimately, we want to improve the productivity of our user base by addressing these pain points.
So let's talk about taking Jupyter beyond interactive. We want to solve for large-scale, long-running sessions, and also for regular data processing jobs, by allowing notebooks to be run in the background. We're moving towards a world of Jupyter-based applications that let users more easily build data and ML workflows, where a user can run one notebook for data preparation and chain a follow-on notebook for ML training or similar. So let's ask ourselves a question: what would it look like to take data science code, configuration, kernel binaries, and execution outputs towards automated offline notebooks for teams, in a scalable way, with dedicated and isolated cloud compute? To introduce the domain concepts: there are teams of users using pools of resources, interacting with notebook servers to write notebook files and generate outputs. Those notebook files are powered by kernels, and we allow our users to customize kernels as needed. Then there are kernel specs, which are used to configure the kernels and give them a well-defined runtime state. We also offer a central repository to find and use configurations, and we provide the ability to share and administer those configurations for teams. To talk about the Jupyter server ecosystem: on this slide you can see the APIs available in Jupyter's extensible ecosystem, some of which are used by a number of well-known extensions like auto-completion, the Git extension, real-time collaboration, and Jupyter AI. We leverage this extensibility, and we'll open source some of our work for both scheduling and sharing/publishing. Now I'll hand it over to Akshay, and he'll tell you in detail how Jupyter fits into the bigger systems picture, allowing us to gain more value from notebooks. He will tell you how we implemented our solutions to address both the interactive and the non-interactive use cases at scale. Akshay, over to you.

Thanks, Russ.
So this slide shows a visual architecture of our notebooks platform on Kubernetes. Essentially, it's about using the Kubernetes API to orchestrate notebooks. Our architecture consists of control-plane components, which include various orchestrators: the notebooks orchestrator and other data platform orchestrators. Our data plane consists of all the clusters in which users run their large data workloads, and these are owned by various teams. Each cluster is configured to run various system components, like operators or other control-plane components, which are responsible for managing the Kubernetes artifacts specific to that service. For example, the notebooks operator is responsible for managing the notebook deployments and the PVCs that store the notebook files, and the Spark operator is responsible for creating on-demand Spark clusters in any given namespace. Talking about notebook execution: every user is provided with a secure, isolated workspace in a multi-tenant environment, which they use for their interactive use cases. Every interactive session is backed by a remote kernel, which is provisioned dynamically by our orchestrator in a data plane chosen by the user. Decoupling our control plane from the data plane allowed us to scale the data plane across a number of clusters, and also across multiple regions and cloud environments. And giving users the ability to launch a remote kernel gave them the flexibility to use compute that is closer to the data, right from their workspace. Now let's see how we solved remote kernel execution. We have extended the Jupyter kernel spec to represent everything related to the kernel environment, so that the kernel spec can be used to launch interactive experiments and can also be shared with various teams to reproduce the same experiments.
Essentially, the kernel spec consists of various properties, including the kernel image, which is the actual Docker image run as a kernel on Kubernetes. This image is owned and managed by the individual teams, so they can manage the dependencies for their specific use cases. The kernel spec also includes properties like CPU, memory, and the cluster and namespace in which to run the kernel, along with other data access policies. This kernel spec is used by the orchestrator to provision and configure the remote kernel environment: for example, using an init container to configure secrets by accessing our secret store, and applying network policies, ingress policies so that only the user's Jupyter workspace can access the kernel, and egress policies so that the kernel pod can access the various data sources and data systems. Using these policies allowed us to provision a remote kernel that is isolated and secure within a multi-tenant namespace or cluster. And users tend to run multiple versions of Spark or Python notebooks within a single namespace. Taking a deeper look at a Spark cluster: it is itself a large processing unit where users run their workloads. It consists of a long-running driver pod and ephemeral executor pods, which can scale when dynamic allocation is configured. To optimize such large workloads in cloud environments, the orchestrator is responsible for scheduling and managing these pods in different node groups, with different scaling parameters, annotations, and other optimizations, so that we can optimize compute and cost in cloud environments.
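To make the idea concrete, here is a rough sketch of what such an extended kernel spec might contain, written as a Python dict. The field names beyond the standard Jupyter ones (`display_name`, `language`) are illustrative assumptions, not the platform's actual schema.

```python
# Hypothetical extended kernel spec, modeled loosely on Jupyter's kernel.json.
# All fields beyond "display_name"/"language" are illustrative assumptions.
KERNEL_SPEC = {
    "display_name": "Team PySpark Kernel",
    "language": "python",
    # Docker image owned by the team; bundles their dependencies.
    "image": "registry.example.com/team-a/pyspark-kernel:3.4",
    # Where the orchestrator should provision the kernel pod.
    "cluster": "us-west-data-plane",
    "namespace": "team-a",
    # Resource requests for the kernel pod.
    "resources": {"cpu": "4", "memory": "16Gi"},
    # Network and data-access policies applied by the orchestrator.
    "policies": {
        # Only the user's Jupyter workspace may connect to the kernel.
        "ingress": ["jupyter-workspace"],
        # Data systems the kernel pod is allowed to reach.
        "egress": ["hdfs", "s3", "cassandra"],
    },
}
```

Because the spec is just data, a team can check it into a central repository and share it, which is what makes experiments reproducible across teams.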
Essentially, this architecture, where data science teams can configure kernel specs for their specific needs, and the orchestrator is responsible for dynamically provisioning these environments and optimizing them for the cloud, allowed our platform to scale to various use cases for multiple teams at Apple and to operate at scale. Now, let's see how these concepts can be applied to run a notebook in a non-interactive use case. Today, Jupyter notebooks are widely used interactively, but they can also be extended to support non-interactive use cases, where a user runs a notebook on a regular schedule. Jupyter Scheduler is an open source extension that provides the ability to run a notebook once in the background or on a schedule. It also provides an interface for users to define and manage all their pipelines right within the workspace, access all the historical runs of a specific pipeline, and download the output artifacts of any pipeline run in multiple formats. It also enables another powerful use case: users can turn a notebook document into a template by parametrizing it, so that other users can run the pipeline from the same notebook with different parameters, without editing the notebook file. So it essentially aims to provide a simplified user experience for data science teams, letting them create long-running jobs while abstracting away the complexities involved in creating and managing pipelines and dealing with multiple data systems. Let's see how we are using Jupyter Scheduler to run these notebook pipelines at scale.
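The parametrization just described works, in Papermill and tools like it, by injecting a cell that overrides the template's default parameters. A minimal stdlib-only sketch of that idea (this is an illustration of the mechanism, not the actual Jupyter Scheduler or Papermill implementation):

```python
import copy

def parametrize_notebook(nb: dict, parameters: dict) -> dict:
    """Return a copy of a notebook (nbformat-style dict) with an injected
    cell that overrides the template's default parameters. Simplified
    sketch of the Papermill-style mechanism, not the real API."""
    nb = copy.deepcopy(nb)
    lines = [f"{name} = {value!r}" for name, value in parameters.items()]
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": "\n".join(lines),
        "outputs": [],
        "execution_count": None,
    }
    # Insert right after the cell tagged "parameters", or at the top.
    idx = 0
    for i, cell in enumerate(nb.get("cells", [])):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            idx = i + 1
            break
    nb["cells"].insert(idx, injected)
    return nb
```

Because the injected cell runs after the defaults, each run of the same template notebook can use different parameter values without the notebook file itself ever being edited.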
To run these notebooks in a multi-tenant cloud environment, we've integrated the pipeline with the kernel spec we discussed earlier, so users can configure their own pipelines and the orchestrator can create an on-demand kernel to execute the pipeline at runtime. We also provide the ability to link a pipeline with a specific version of a notebook file. This is essential because the notebook document itself is mutable, and linking a specific version of a notebook to a pipeline lets users reproduce their experiments. We've integrated with Airflow to manage, schedule, and run these pipelines on Kubernetes, which we'll talk about later. We've also added support for notifications, so users can subscribe to various notification mediums like Slack or email and receive the output artifacts of a pipeline run, which can be a visualization in a notebook document, an HTML preview of the document with nice charts, or just the notebook file, any of which can be shared with others. Now let's see how Airflow is used to run a pipeline, specifically a notebook pipeline. Airflow is a workflow management platform where users create pipelines with multiple tasks and dependencies. A task is configured with an Airflow operator, which is used by the executor to execute the task. In the diagram on this slide, we see a pipeline of tasks, where task one is configured with a bash operator, which executes a bash script, and task two is configured with a Python operator, which executes a Python function. On Kubernetes, Airflow can be configured to use the Kubernetes executor, which uses the Kubernetes API to create an on-demand pod to run each task. The benefit is an isolated environment for each task in a given pipeline: every task gets its own pod.
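The two-task pipeline just described might look like the following DAG definition. This is a sketch assuming an Airflow 2.x installation; the `dag_id`, schedule, and commands are illustrative, not taken from the platform.

```python
# Minimal sketch of the two-task pipeline from the slide: a bash task
# followed by a Python task. Assumes an Airflow 2.x installation; with
# the Kubernetes executor configured, each task runs in its own pod.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def python_task():
    # Placeholder for the Python function executed by task two.
    print("running the Python task")

with DAG(
    dag_id="example_two_task_pipeline",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task_one = BashOperator(task_id="task_one", bash_command="echo 'bash task'")
    task_two = PythonOperator(task_id="task_two", python_callable=python_task)
    task_one >> task_two  # task two runs after task one succeeds
```

This file is essentially configuration: Airflow discovers it in the DAGs folder and the executor decides where each task actually runs.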
This gives users greater flexibility to configure the resources for each task. Combined with the availability and scalability of Kubernetes, it also lets users scale out parallel execution of tasks, say with a cluster autoscaler enabled. But even though the Kubernetes executor provides an isolated pod for every task, the execution environment itself remains the same, because the executor essentially uses the same base image to run every task. Combining this with the KubernetesPodOperator, which allows users to specify a specific Docker image for each task, lets a new pod be spun up from that image while executing the task. This allows users to customize their own environments, so different tasks can have different package dependencies, and it also lets them run different runtimes, like Scala. This is essential for notebooks, as notebooks are language-agnostic documents which can represent experiments in multiple languages. So let's now see how these concepts are used to execute a notebook pipeline at Apple. We define our pipelines with a custom operator, which is essentially an extended version of the Papermill operator. Papermill is a notebook client which can execute a notebook file. We extended the operator so that it can connect to a remote kernel and execute the notebook file there. Essentially, when the Kubernetes executor spins up a new worker pod at schedule time to execute a pipeline, the worker pod, which runs the task, carries the configured operator. The operator uses the pod identity certificates to launch an on-demand remote kernel on behalf of the user, and this kernel can be a Python kernel or a Spark job.
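The core of what the worker pod does can be illustrated with plain Papermill, the open source notebook client named above; the custom operator wraps logic like this but points execution at the remote, orchestrator-provisioned kernel rather than a local one. This sketch assumes `papermill` is installed, and the file names and parameter values are illustrative.

```python
# Sketch of executing a parametrized notebook with Papermill. The file
# names and parameters here are illustrative; in the pipeline, the input
# is the versioned notebook fetched from the remote store, and execution
# targets a remote kernel rather than a local one.
import papermill as pm

pm.execute_notebook(
    "prepare_data.ipynb",        # input notebook (template)
    "prepare_data.out.ipynb",    # output notebook, cell outputs included
    parameters={"run_date": "2024-01-01"},
)
```

The output notebook, with all cell outputs captured, is exactly the kind of artifact that can then be rendered to HTML and sent out through the notification integrations.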
The kernel is managed by the orchestrator, which configures the environment using the kernel spec linked to the pipeline. The worker pod also uses the versioned notebook file linked to the pipeline: it fetches the versioned notebook from a remote store, executes it against the remote kernel, and publishes a notification with the output artifacts. Airflow also provides other features, like runtime variables, which are substituted by Airflow at runtime; these can be the start date or the run date of the pipeline. And it provides the flexibility to run backfill jobs, jobs that run for previous or past dates. Combining these features with the parametrization of notebooks gives data science teams a powerful way to create and manage pipelines without maintaining separate pipelines for each variation of a use case. And it's even easier to debug a specific pipeline, as we list all the pipeline runs within the Jupyter workspace: users can open a failed pipeline run, reproduce the error in an interactive session, and fix the pipeline. Essentially, we believe these features enhance the productivity of scientists and empower them to move their experiments and prototypes from the Jupyter workspace to production environments without dealing with multiple data systems. So this represents a notebook pipeline execution in a single cluster. To scale these pipelines across multiple clusters and multiple cloud environments dynamically, we integrated with the pipeline service at Apple, which provides an API to define a pipeline spec. The pipeline spec consists of the pipeline's schedule definition, time zone, and other parameters, and defines each task with an operator and the parameters of that operator. Essentially, it provides flexibility.
It also provides features like version control of the pipeline, so users can track changes to a specific pipeline and deploy a specific version of it to a cluster with Airflow enabled. This essentially lets users define a pipeline DSL, or pipeline spec, and gives them a way to deploy these pipelines to a number of clusters. The pipeline service is responsible for converting the pipeline DSL into an Airflow-supported DAG, which is deployed so that the Airflow instance running in the Kubernetes cluster can schedule and execute these pipelines. There's also a great talk from our partner team which describes the architecture of the pipeline service in great detail; the link is on the slide. Integrating Jupyter Scheduler with the pipeline service allowed data science teams to create notebook pipelines dynamically and schedule them onto any Kubernetes cluster of their choice, so the pipelines run where they can access the data. To recap: this architecture allowed us to support both interactive and non-interactive use cases at scale, across multiple cloud environments. The concept of the kernel spec allowed users to define their own kernel environments for various use cases, depending on their needs. And using remote kernels, they're able to access compute closer to the data for both interactive and non-interactive use cases. Essentially, the basic idea is to provide a cloud-IDE-like experience for data science teams, with a platform built on various open source technologies that scales on Kubernetes.
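To illustrate the shape of such a pipeline spec, here is a hypothetical example as a Python dict. Every field name and value is an illustrative assumption, not the actual schema of the internal pipeline service.

```python
# Hypothetical pipeline spec, illustrating the kind of DSL described above.
# All field names and values are illustrative assumptions.
PIPELINE_SPEC = {
    "name": "daily-feature-prep",
    "version": 3,                        # specs are version-controlled
    "schedule": "0 6 * * *",             # cron-style schedule definition
    "timezone": "America/Los_Angeles",
    "cluster": "us-east-data-plane",     # target cluster running Airflow
    "tasks": [
        {
            "id": "run-notebook",
            # Custom Papermill-style operator discussed earlier.
            "operator": "NotebookOperator",
            "parameters": {
                "notebook_version": "a1b2c3d",          # pinned revision
                "kernel_spec": "team-a/pyspark-kernel",  # linked kernel spec
                "notebook_parameters": {"run_date": "{{ ds }}"},
            },
        }
    ],
}
```

The pipeline service's job is then to translate a spec like this into an Airflow DAG and deploy it to the chosen cluster; the `{{ ds }}` placeholder stands in for Airflow's runtime date substitution.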
To further improve the data science experience on Kubernetes, we're also working on various other features. Right now we've mostly focused on creating pipelines with a single task, but we're extending that to support multiple tasks, so users can create DAGs right from their workspace, from multiple notebook files in their file browser: one notebook can be a Spark notebook that preprocesses the data, and another notebook can run an ML experiment using tools like Kubeflow, and users can schedule the complete pipeline directly from their Jupyter workspace and run it on Kubernetes. To solve resource contention issues in multi-tenant namespaces, we're also looking into schedulers like YuniKorn, essentially to give users the flexibility to define their resource requirements per workload type, whether it's an interactive or a non-interactive use case. And to help further on the collaboration side, we're working on real-time collaboration, where multiple users can collaborate on a single notebook document. This can be helpful when multiple scientists collaborate on a single experiment in the same workspace. There's also a talk from our team which explains the real-time collaboration feature in great detail. That's it for the talk, thank you.

Thank you very much for the great presentation. We have some time for questions, and there's a mic in the middle of the room; feel free to use it to ask your questions.

Is it on? Hi, thanks for the great presentation. Just a question on cost optimization. How do you make sure that data scientists, since they have all these pods at their fingertips, don't use giant GPU nodes for something really small? Because that's what at least our data scientists tend to do. How do you organize and manage that?

Sure.
To answer the question on cost optimization: one of the things we did is scale down inactive sessions. Users tend to use notebooks to create an experiment, which can mean a long-running session or leaving the workload alive. So we implemented Jupyter culling, which does a lot of the cost optimization by shutting down inactive kernels. It measures activity using the Jupyter session and will automatically shut down idle kernels.

Thank you. Hey, I was just curious. I think it's a great talk, a very popular research challenge for enabling scientific computing. What drew you towards Airflow? Is it a historical relationship with Apache-based products? Is it just personal preference? Since there are a lot of workflow operators out there.

So Airflow provides the flexibility to use our own operators. That's the essential thing we implemented: we use the same configuration as we do for interactive analysis and wrote an operator so we can get all the benefits of Airflow. That's one of the main things we wanted from Airflow.

Also, just to follow on, we do have Airflow expertise in-house: a sister team that runs Airflow at scale, and that helps us get it integrated.

OK, yeah, I was just curious. Thanks. I know there are a lot of other things out there, Argo and so on and so forth.

But we're kind of an Airflow shop right now. Yeah.

Hi, thanks for the great presentation. I saw on one of your slides you mentioned the scheduler is able to bring the job closer to the data. Can you elaborate a bit more on that?

Oh, yes. Essentially, from a Jupyter workspace, let's say a user is trying to launch a Python kernel, and the workspace, the server itself, is running in a cluster in the West region. The user can still provision a Python kernel in the East region.
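The culling mentioned here is a built-in Jupyter feature, configured through the kernel manager's settings, for example in a `jupyter_server_config.py`. The timeout values below are illustrative examples, not the ones run in production.

```python
# Illustrative Jupyter culling configuration (jupyter_server_config.py).
# The timeout values are examples, not production settings.

# Shut down kernels that have been idle for more than an hour.
c.MappingKernelManager.cull_idle_timeout = 3600
# How often (in seconds) to check for idle kernels.
c.MappingKernelManager.cull_interval = 300
# Cull idle kernels even if a browser tab is still connected.
c.MappingKernelManager.cull_connected = True
# Shut down the whole server after two hours with no activity.
c.ServerApp.shutdown_no_activity_timeout = 7200
```

Because idleness is measured from kernel activity on the Jupyter session, a forgotten notebook attached to a large pool of resources is reclaimed automatically instead of running up cloud costs.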
And we connect the server pod and the kernel pod in such a way that we can establish a session. The session is also secure, because we create an on-demand session key, a shared secret between the server and the kernel.

Thanks. Thank you for the talk. I have a question about scaling jobs with respect to resources. How do you prevent people from over-provisioning or under-provisioning resources?

Yeah. On the Spark side, we have dynamic allocation, which scales up and down when there's idleness, and I mentioned kernel culling as well. But folks can over-provision. Given the native quota mechanisms built into namespaces, we basically confine them to the quota in the namespace: individual users or teams can have quota, and they're bounded by that quota. Now, YuniKorn gives us much more granular control within a namespace, if there's a shared team namespace, such that we can limit the interactive and non-interactive use cases with sub-queues. So we could have a per-user queue where users are not allowed to go past a certain amount, and maybe a swim lane for some users that can operate in a queue with more capacity during off hours. Those are the things we're bringing YuniKorn into the picture for, to understand how we can solve these problems a little more granularly.

Thank you. You're welcome.