Hey everyone, good afternoon. Welcome to our session today. We are going to share about our self-service stream processing platform on Kubernetes. I'm Chen Ya from Apple's AI/ML data platform. Here is our agenda: we'll talk about our journey moving to the cloud and Kubernetes and why we adopted Apache Flink for real-time data processing, then the challenges we hit while moving to production at a large scale and the solutions we have figured out so far.

First, our journey moving to the cloud and Kubernetes. Our users expect cluster resources to scale up and down very quickly to match their real-time traffic patterns, and we cannot make them wait long. We also don't want to preempt some user applications in order to honor others. Deployments themselves have autoscaling, but what about the cluster? From a platform's perspective, it is ideal to decouple ourselves somewhat from the infrastructure maintenance overhead. Moving from virtualized deployments to containers helped us abstract away the physical hardware complexities: containers can move freely across clouds and OS distributions. In a multi-cloud environment, there are a lot of benefits in leveraging the available support from different cloud providers, the hardware resources, and the cost efficiency, and Kubernetes helps us organize these containers into workloads in a consistent way that provides compatibility. There will always be a limited number of admins available to address cluster issues at a very large scale; luckily, we can leverage Kubernetes' automatic recovery capability to solve some issues and keep the cluster resilient and self-healing in case of failures. Many platforms also need to add components on top of the orchestration layer, and Kubernetes has a modularized architecture that lets us integrate with multiple components and extensions.

So that's our journey moving to the cloud and Kubernetes. Now, why we adopted Apache Flink for real-time data processing. Users have very high expectations in terms of latency when they process data in real time: they expect data to be processed immediately upon arriving in the system. Flink supports a pipelined, in-memory execution model, so data can be accessed locally, and when there is a slow operation, Flink is able to adjust its data processing speed: the busy operator propagates backpressure to its upstream tasks, slowing throughput a little but avoiding a system crash overall. When there is very large state, Flink is able to take checkpoints incrementally and asynchronously in the background, and on restore, Flink supports an exactly-once guarantee with a replayable source like Kafka and atomic state commits. Many users also need to connect their streaming pipelines with upstream or downstream data sources and sinks, and Flink has a very rich connector ecosystem, not limited to Kafka and Iceberg but also including Cassandra, Elasticsearch, Pulsar, and more, so users don't need to write extra jobs to complete their entire workflow.
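To make a couple of those capabilities concrete, here is a minimal sketch, in Java, of a Flink job that enables exactly-once checkpointing and reads from a replayable Kafka source. The topic name, bootstrap servers, and the processing step are placeholders, and enabling incremental RocksDB checkpoints is typically cluster configuration rather than job code:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s with exactly-once semantics. Incremental,
        // asynchronous RocksDB checkpoints are usually enabled in cluster
        // config, e.g.:
        //   state.backend: rocksdb
        //   state.backend.incremental: true
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Kafka is a replayable source: on restore, Flink rewinds to the
        // offsets recorded in the last successful checkpoint.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")   // placeholder
                .setTopics("events")                 // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .map(String::toUpperCase)  // stand-in for real processing
           .print();

        env.execute("exactly-once-sketch");
    }
}
```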
We also know users have diverse backgrounds and needs when working with Flink, and luckily Flink offers a multi-layered API. At the very top there is the SQL/Table API, which comes with built-in optimizations and spares users from boilerplate code. In the middle there is the DataStream API, which unifies stream and batch processing with some advanced windowing functions. And at the very low level there is the ProcessFunction, which gives users fine-grained control over their state and time.

Okay, but even with Flink and Kubernetes, we have challenges when moving to production at a large scale. What are some of them? First, we need an automated approach to deploy Flink on Kubernetes, and we also want a control plane that can scale beyond a single Kubernetes cluster. We would like to help our users self-onboard, observe, and troubleshoot their applications without admins in the loop, because there can be so many of them. Also, when supporting users and admins, they need to bulk-operate a large number of deployments; how can we help them? We also want to automatically scale not only stateless streaming applications but stateful ones as well. Managing resource availability can be a challenge too, and so can cluster upgrades for streaming applications in a cloud environment.

First, let's take a look at deploying Flink on Kubernetes. There are multiple options, and the first one that comes to mind is the Flink-Kubernetes native integration, first introduced in Flink 1.10 years ago. How does it work? Let's take a look at this diagram. The Kubernetes client is embedded inside the Flink client to talk to the Kubernetes API server directly. The Flink client then creates the JobManager deployment, which includes the JobManager as well as the Kubernetes resource manager, and the Kubernetes resource manager is responsible for dynamically allocating TaskManager pods through the Kubernetes API server. So we can understand the Flink-Kubernetes native integration as Flink being self-contained: it doesn't need to rely on any external tools from Kubernetes like the kubectl command. But that also makes it a challenge to automate application upgrades in a Kubernetes environment.

We are probably all familiar with these two concepts from Kubernetes: the custom resource and the operator. Can we combine the custom resource and the operator to capture the core responsibilities of a human operator managing Flink deployments? Human operators have deep knowledge of how a Flink application ought to behave and what to do when there are problems. A Flink Kubernetes operator can help us monitor, upgrade, and deploy the applications automatically. And yes, we have the Flink Kubernetes operator, introduced in FLIP-212, with Apple being a major contributor as well. The Flink Kubernetes operator leverages the Java Operator SDK, because it wants to talk directly to the Flink Java client library, and under the hood it still uses the Flink-Kubernetes native integration to launch the Flink deployment in a cluster. Both the native integration and the operator use the Fabric8 Kubernetes client to talk to the Kubernetes API server. From this chart, we can see all the possible states and transitions of the Flink resource lifecycle. Now let's take a deeper look at how the Flink Kubernetes operator works. From a high level, it follows the control-loop principle from Kubernetes.
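Before we walk through each phase, here is a hypothetical, heavily simplified sketch of what one pass of that control loop looks like. None of these types come from the actual operator code base; they are stand-ins invented to show the observe, validate, and reconcile shape described next:

```java
// A simplified sketch of one reconcile pass. The real operator is built on
// the Java Operator SDK and is considerably more involved; the types below
// are illustrative stand-ins, not the operator's actual API.
public final class FlinkReconcilerSketch {

    interface Spec { boolean isValid(); }

    interface Observed {
        boolean hasPendingOperation(); // e.g. a manual savepoint in progress
        boolean isStable();
        boolean matches(Spec desired);
    }

    interface Cluster {
        Observed observe();            // point-in-time snapshot of status
        void deploy(Spec desired);     // execute the upgrade
        void rollBack();               // restore the last successful state
        void markInvalid(String reason);
    }

    /** One pass of the control loop: observe, validate, then reconcile. */
    public void reconcile(Spec desired, Cluster cluster) {
        // 1. Observe: a snapshot only -- anything can change before the
        //    loop finishes, so the recorded status may already be stale.
        Observed observed = cluster.observe();

        // 2. Validate the desired spec before acting on it.
        if (!desired.isValid()) {
            cluster.markInvalid("spec failed validation");
            return;
        }

        // 3. Don't upgrade while an operation (e.g. a savepoint) is pending.
        if (observed.hasPendingOperation()) {
            return; // retry on the next loop iteration
        }

        // 4. If the application became unstable, bring the resource back to
        //    its last successful state instead of pushing the new spec.
        if (!observed.isStable()) {
            cluster.rollBack();
            return;
        }

        // 5. Reconcile: drive the actual state toward the desired state.
        if (!observed.matches(desired)) {
            cluster.deploy(desired);
        }
    }
}
```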
So the user creates a custom resource, and the operator checks the status of this resource in the cluster. The observer records a point-in-time status, but we need to be careful here: before the control loop finishes, anything can change, so the recorded status can also change. After the observation phase, the operator runs a validator to check whether the application's resource spec is in good shape. Are we then ready to do the reconciliation and execute the upgrade? The answer is actually no; there are multiple things we need to check. For example, we want to make sure there is no pending operation for this custom resource. It's very common that we trigger manual savepoints, and they take some time, so we want to wait for any pending operation to finish. Another thing is that the status can change: what if, in the middle, something goes wrong and the application is no longer stable? Then the operator is responsible for bringing this custom resource back to its last successful state.

So that's the operator. Now let's assume we have the Flink Kubernetes operator in production. There can still be challenges to running it at a large scale. First, what if the Kubernetes cluster where the Flink Kubernetes operator runs fails and goes down? Can a user group map beyond one Kubernetes namespace, for multi-tenancy and so on? Authentication and authorization might need to meet extra requirements; how can we support that, and where should we support it? The deployment information needs to be persisted for audit or recovery; what if the cluster is gone and the operator is gone as well? Admins need to operate across multiple clouds, multiple accounts, and multiple clusters. And there can be other services or platforms that want to integrate for real-time data processing.

Lessons from BPG. What is BPG? It's the Batch Processing Gateway service that Apple open sourced earlier last year. Essentially, it helps Spark users launch Spark applications in a Kubernetes environment and manage them there, and there is a set of REST APIs exposed directly to clients and users. How does it work? Let's take a look. The Batch Processing Gateway acts as an intermediary layer between the user request and the Kubernetes custom resources. We deploy a Spark Kubernetes operator in each cluster, and it defines a SparkApplication custom resource. It is then the Batch Processing Gateway's responsibility to transform the user request into a CRD format that the operator can recognize; these custom resources are sent to the Spark Kubernetes cluster, where the operator takes over. Essentially, the idea is that the gateway service takes over all the heavy lifting on behalf of the users. Users only need to worry about specifying a few configurations, maybe from a UI by hitting several buttons, and that's it; they don't need to worry about the underlying infrastructure details or even be familiar with Kubernetes. It also provides the flexibility to support more frameworks in the future: it doesn't have to be Spark, it can be anything behind the gateway.
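Here is a minimal sketch of that transformation step, with illustrative field names (the real BPG and Spark operator schemas differ): a small user request becomes a SparkApplication custom resource that the per-cluster operator understands.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the gateway's core job: turn a user request into a
// custom resource. The apiVersion, kind, and spec fields below are invented
// for illustration, not the actual SparkApplication schema.
public final class RequestToCustomResource {

    record UserRequest(String appName, String queue, String mainClass,
                       String jarUri, int executors) {}

    static Map<String, Object> toCustomResource(UserRequest req, String namespace) {
        Map<String, Object> cr = new LinkedHashMap<>();
        cr.put("apiVersion", "sparkoperator.example.com/v1beta1"); // placeholder
        cr.put("kind", "SparkApplication");
        cr.put("metadata", Map.of("name", req.appName(), "namespace", namespace));
        // The user never sees any of this; they only picked a queue and a jar.
        cr.put("spec", Map.of(
                "mainClass", req.mainClass(),
                "mainApplicationFile", req.jarUri(),
                "executor", Map.of("instances", req.executors())));
        return cr; // the gateway would submit this via a Kubernetes client
    }

    public static void main(String[] args) {
        UserRequest req = new UserRequest("daily-agg", "queue-one",
                "com.example.DailyAgg", "s3://bucket/app.jar", 10);
        System.out.println(toCustomResource(req, "ns-01"));
    }
}
```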
There are more lessons we can learn from this Batch Processing Gateway. Let's talk about the multi-tenancy challenge: users from different teams want to submit their Spark applications to a single cluster, or maybe multiple ones. How can we help them? From this graph, we can see that in the middle, the Batch Processing Gateway defines a list of Spark logical clusters, and each logical cluster can support multiple queues. Very interestingly, queue one here can map to both Kubernetes namespace one and namespace two in the backend, on the Spark physical clusters. Say a user submits a request asking for queue one; both C01 and C02 have queue one included in their mapping, so how do we decide? We have a routing algorithm that considers the weight of each cluster and its current status, plus some randomness; if we want to increase the load on a certain cluster or namespace, we have all the flexibility there (a minimal sketch of this weighted routing appears at the end of this section). Then we help users launch their applications in the corresponding backend namespace or cluster, and they don't need to worry about it; they only need to know: I want to submit this application to queue one.

Besides multi-tenancy, we also know users deeply care about the observability of their applications; they want to troubleshoot and debug. The Batch Processing Gateway service supports a log mover that moves the logs of a Kubernetes pod to persistent storage. Why move them? Because the pod can be gone after the application finishes or completes, and sometimes we really want to recycle these pods: having many terminated pods that the Kubernetes API server has to monitor for health checks creates a lot of pressure. So when the user hits the log REST endpoint, the request can go one of two routes. First, it checks the executor pod directly, if it is still available; if the log is still there, great, we stream it from there. If not, we use a log indexer to search for the executor or driver pod log and load it from remote storage. So there are a lot of functionalities this kind of gateway service can support.

Okay, with all these lessons learned, let's look at our Flink control plane, where we want to support real-time data processing. It's a busy diagram, but let's decipher it. In the middle, we have control plane instances that can be deployed in multiple accounts or even multiple cloud providers, each with its own persistent storage. On the right-hand side, we have the Flink physical clusters running the queues, similar to the Batch Processing Gateway. When a user issues a request to our control plane, it first hits a proxy sidecar to get the identity authenticated. After that, given the identity, the request goes to the gateway service for further authorization, for example, whether this request indeed has permission to use a queue to submit an application. There are distinctive features in managing a real-time application platform: for example, we need to consider how to help applications upgrade or suspend by triggering savepoints; these are different from the Batch Processing Gateway. But at a high level we share a similar architecture, which has the great benefit of reducing the maintenance overhead of multiple platforms and unifying the real-time and batch data processing experience in the long term.
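Here is the weighted-routing sketch promised above. It is a toy version under simple assumptions: each queue maps to a list of (cluster, namespace) targets with static weights, and we sample in proportion to weight; the real gateway also factors in live cluster status.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Toy sketch of weighted queue-to-cluster routing. Cluster names, namespaces,
// and weights are made up for illustration.
public final class QueueRouterSketch {

    record Target(String cluster, String namespace, double weight) {}

    /** Pick a backend target with probability proportional to its weight. */
    static Target route(List<Target> candidates) {
        double total = candidates.stream().mapToDouble(Target::weight).sum();
        double pick = ThreadLocalRandom.current().nextDouble(total);
        for (Target t : candidates) {
            pick -= t.weight();
            if (pick <= 0) return t;
        }
        return candidates.get(candidates.size() - 1); // numeric edge case
    }

    public static void main(String[] args) {
        // "queue-one" maps to two backends; raising a weight shifts load.
        List<Target> queueOne = List.of(
                new Target("c01", "ns-01", 3.0),
                new Target("c02", "ns-02", 1.0));
        System.out.println(route(queueOne));
    }
}
```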
Let's picture several scenarios of how we can leverage the Flink control plane to help our users. The first one is self-service user onboarding. Imagine we have so many users, and everyone wants a queue and wants to increase their resources; we don't have that many admins to help them, so what can we do? From the client side, users can issue a request, maybe just from a UI, letting us know they need a queue or want to increase resources. The request goes to the Flink control plane, which can approve it automatically or, in some rare cases, ask an admin for manual approval, say when the requested resource is too large. Then the control plane automatically updates the cluster resource spec; for provisioning, it goes to the Kubernetes cluster and updates the cluster spec. It's also possible that we need to create a new cluster, and we can do that automatically as well. Importantly, after all this provisioning, the control plane triggers a smoke test to validate that the queue was created successfully and that applications can actually run there, and then lets the user know whether their queue was provisioned successfully or not.

Similarly for admins: how can we help them migrate clusters automatically? An admin submits a request to the control plane: I want to migrate all the workloads from cluster A to cluster B, maybe due to maintenance or a cluster upgrade. It's also possible that the cluster is failing for some reason and we notify the admin to keep them in the loop. The Flink control plane is then able to trigger savepoints for all Flink applications on cluster A and suspend them before moving them. On the other cluster, the one we migrate to, the control plane automatically redeploys all these applications using the previously triggered savepoints and our persisted specs, kept in remote storage or an RDBMS. And then the control plane, again, is responsible for verifying the Flink applications' health before letting the admins know whether the migration was successful. So there are many features and functionalities this Flink control plane is able to support.

Now let's move one step further into our story of automation and cost efficiency. Flink application autoscaling was introduced in FLIP-271, also with major contributions from our group at Apple. The main problem we want to solve is when and how much to scale up or down. Too much resource is costly for users over time, and too little can cause the job to be unstable. But it's hard to scale a Flink application in place, because the application needs to take a savepoint, stop, and resume from the previously taken state with a new configuration. That has an associated cost, especially when there is a data backlog: data waiting, queued up to be processed. So every scaling decision has to consider whether the backlog can be fully absorbed per the user's configuration, or before the data retention time is up.

How does Flink application autoscaling work? Can we simply change, say, the job parallelism of the entire application? The answer is no. Why? Because the workload can be distributed very differently across the Flink job's operators. Our proposed solution is to change the parallelism of a job vertex. What is a job vertex? It's a group of chained operators that can be executed together without data shuffling.
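To make the job-vertex idea concrete, here is a small hypothetical DataStream pipeline in Java. The map and filter share the same parallelism and have no shuffle between them, so Flink chains them into a single job vertex; the keyBy forces a shuffle and starts a new vertex. Vertex-level autoscaling adjusts exactly these per-vertex parallelism values.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative only: the operators and parallelism values are made up.
public class JobVertexSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromSequence(0, 1_000_000)                  // source: its own vertex
           .map(n -> n * 2).returns(Types.LONG).setParallelism(4)
           .filter(n -> n % 3 == 0).setParallelism(4)   // same parallelism, no
                                                        // shuffle: map and filter
                                                        // chain into one vertex
           .keyBy(n -> n % 10, Types.LONG)              // keyBy forces a shuffle,
           .reduce(Long::sum).setParallelism(2)         // so a new vertex starts
           .print();

        env.execute("job-vertex-sketch");
    }
}
```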
Then our goal becomes finding the minimum parallelism of each job vertex that still ensures there is no backpressure in the Flink pipeline. In this diagram, everything is initially running normally: the data processing rate matches the input record rate. Then suddenly our log ingestion rate increases, and in the middle, the map and filter operators cannot catch up; they are almost 100% busy. As we mentioned, backpressure propagates: they push backpressure to their upstream sources, and the sources slow down, reducing their throughput so the application can still run. Then autoscaling comes to the rescue to calculate the scaling factors, and the key here is that we predict the effect of the changed parallelism on the downstream job vertices. We get these scaling factors: 5x, 10x, or 4x. As mentioned before, our goal is to be backpressure-free; it's okay if the operators are still busy, as long as there is no backpressure, because the incoming record rate now matches the true data processing rate.

Okay, so where do we implement this Flink autoscaler? The Kubernetes operator is a natural place. Why? Because it has access to all the deployment metrics, and the way we make predictions relies heavily on metrics; we cannot predict by magic. The Flink Kubernetes operator is also highly available and able to reconfigure the deployment for rescaling. And the autoscaler itself needs to emit its own metrics, so we can review them and generate insights into why certain scaling decisions were made.

Besides autoscaling a Flink application, we also need to be resource-aware of the cluster in general. Both Cluster Autoscaler and Karpenter can automatically scale Kubernetes cluster resources. We chose Karpenter, and we moved from Cluster Autoscaler to Karpenter, because it gives us the flexibility to use a wide range of available instance types: we are no longer bound to just one instance type for a queue, we can specify five or six of them. It also has fewer of the limitations that come with orchestrating node groups: it works with Kubernetes nodes directly, so the retry interval is greatly reduced, to milliseconds instead of minutes. This is extremely handy for getting our users the resources they need in a cloud environment.

To help Karpenter bring our cost down, it needs to recycle nodes, and recycling a node has a prerequisite: there must be no running pod on that node that we don't want to get rid of. This is especially challenging for streaming applications, because moving running pods around can disrupt a streaming pipeline. So what can we do? We want to solve the problem at the source: when we first allocate those pods, those containers, we want to place them on nodes that are already pretty busy. Instead of just finding a random node with maybe two pods running on it, we want to find a node that already has many pods running. This is bin packing: we add a customized scheduler alongside the Kubernetes scheduler to pack pods onto the same nodes and leave more empty nodes for Karpenter to recycle, reducing our cost.
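Here is a toy sketch of that bin-packing placement idea, under the simplifying assumption that capacity is measured only in pod count: among nodes that can still fit the pod, prefer the most utilized one, so lightly used nodes drain out and can be reclaimed. A real scheduler plugin would score CPU and memory vectors and respect affinity and disruption constraints.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Toy bin-packing placement: node names and capacities are made up.
public final class BinPackingSketch {

    record Node(String name, int capacityPods, int runningPods) {
        int free() { return capacityPods - runningPods; }
        double utilization() { return (double) runningPods / capacityPods; }
    }

    /** Pick the most-utilized node that can still fit the pods. */
    static Optional<Node> place(List<Node> nodes, int podsToPlace) {
        return nodes.stream()
                .filter(n -> n.free() >= podsToPlace)
                .max(Comparator.comparingDouble(Node::utilization));
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                new Node("node-a", 110, 2),    // nearly empty: leave it drainable
                new Node("node-b", 110, 100)); // busy: pack here
        System.out.println(place(nodes, 1));   // -> node-b
    }
}
```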
We are almost out of time, so let's do a quick wrap-up. For our self-service platform, we put everything we've mentioned so far into this big diagram. We have the user interface for the user experience and also the admin experience, and they interact with the Flink control plane, possibly through multiple client toolings. On the right-hand side, we have all these big boxes with the goodies: the custom scheduler, the Kubernetes operator with the autoscaling implemented on it, the Flink applications, and so on. And we are happy to open for questions.

Do you mean cluster upgrade? Yeah, that's a great question. As we mentioned, we have automatic cluster migration, and we don't want to interrupt the Flink applications: we want to keep them healthy and redeploy them into another cluster, and after that, we do the upgrade on the original cluster. So basically, they still need to stop with a savepoint, but with minimal downtime the control plane does the redeployment automatically. If you are worried about a massive disruption to a lot of applications, you can do it batch by batch, or even start with one application and then increase the batch size gradually.

That's a good question. We don't use a service mesh; for that, it's only simple authentication. We use a gateway service embedded there to do the identity check.

That's definitely a great question. At this moment, we are running GPU machines as well, for both our streaming applications and the offline applications, and they also fit into the multi-tenancy story we just shared. Ideally, you can have a different mix of resources: it can be a dedicated GPU queue if you want, or it can be mixed with CPU resources. That's actually more ideal, because certain data workloads are more suitable for CPU processing, while for others, maybe ML workloads, you can leverage the GPU queue.

Great question. As mentioned, Karpenter and Cluster Autoscaler both have open source versions, so we are free to use them. As for why, since we are already leveraging hardware resources from the cloud providers, we don't just use a managed service like AWS Kinesis: in our case, we are operating at a very large scale, and a lot of users have customized needs that can be hard to handle with a managed service. We want to keep the control to help our users build customized solutions. Some of them want the entire service deployed into their own AWS account, and some want it shared. Cost can be a concern as well: with so many workloads, the way cost is calculated for a managed service can make it very expensive.

Okay, so a Flink cluster doesn't span multiple Kubernetes clusters; it sits in one cluster. And if that one cluster goes down, as we mentioned, there is a persistence layer for our control plane to manage: we save the specs of all the Flink deployments, and their savepoint information can be saved to, for example, S3. So it's okay if the cluster goes down; we have a way to bring all the Flink deployments back up as soon as possible, or migrate them to another cluster.

Okay, so we are running out of time. Yeah, thank you.