Hello, everyone. My name is Alina, and I work for Apple. I joined the company as a software engineer about a year ago to work on Kubernetes and cloud-native projects. Since then, many people from the open source community have asked me what attracted me most to my new job. The answer is easy: it is the scale of the cloud infrastructure and the variety of applications running on it. Before, I mainly focused on Kubernetes cluster provisioning. I thought that looking at things from the application and user side would help to complete the circle. So here I am.

When you join a new team or a company, getting to know its history helps you understand current decisions and trade-offs. Mesos has been used at Apple for over five years, with an internally built container orchestrator called Jarvis. It had a good reputation as a scheduling platform, and it scaled well to support multiple frameworks running a variety of services: batch, stateful, and stateless. But the container orchestrator soon became a bottleneck when it came to scaling and delegating responsibilities, so we started considering other choices. In addition to Mesos, Apple used some other solutions for compute management. Along with the transition, we wanted to consolidate them and provide a unified platform for better capacity utilization.

We found Kubernetes an obvious winner as an orchestrator. With Kubernetes being more generic and pluggable, we could better divide work among teams. Decisions about which provider to choose for core components like storage, runtime, and networking were no longer life-or-death decisions. We knew each was just a plug-in choice, and the choice could always be reevaluated without refactoring the entire system. Another big point of attraction for developers was the ability to extend the Kubernetes API and core functionality by leveraging CRDs and custom controllers. The Kubernetes community was a huge asset to the platform; its transparency and power were obvious and provided a great level of comfort to developers. The goal was to meet the demands of a diverse user base. We had teams who were already successfully running Kubernetes clusters in their organizations, and we had people who had never worked with Kubernetes before but trusted the technology and wanted to adopt it.

When a change of this magnitude happens, it is important to be honest about the trade-offs that need to be made for a successful migration. It is a known fact that the Kubernetes learning curve is steep, not only for end users but for platform developers, too. All the platform features that were built in the past had to be reevaluated. Is the feature's current design aligned with cloud-native best practices? Is this feature even relevant to the new platform? Asking and answering these questions can be uncomfortable for people, and software is built by people. We had to accept the fact that the current processes would need to change and people would have to adapt. We realized the impact of the decision. We were ready to invest in training resources and provide full support for developers, anything to make the adoption successful. At that point, Kubernetes became our platform choice for compute management at Apple. Our target is to make the majority of Apple workloads run on Kubernetes. We want to maintain an end-user focus while building the platform.
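As an aside on the CRD extension point mentioned above, here is a minimal sketch of registering a custom resource type through the apiextensions API with client-go. The "Widget" group, kind, and field names are hypothetical examples, not an actual Apple API; the point is only that once a CRD is registered, a custom controller can watch and reconcile that type like any built-in resource.

```go
package main

import (
	"context"
	"flag"
	"log"

	apiextv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to a kubeconfig file")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := apiextclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// A hypothetical "Widget" resource with a tiny schema. A custom controller
	// would then watch Widget objects and reconcile them.
	crd := &apiextv1.CustomResourceDefinition{
		ObjectMeta: metav1.ObjectMeta{Name: "widgets.example.com"},
		Spec: apiextv1.CustomResourceDefinitionSpec{
			Group: "example.com",
			Scope: apiextv1.NamespaceScoped,
			Names: apiextv1.CustomResourceDefinitionNames{
				Plural:   "widgets",
				Singular: "widget",
				Kind:     "Widget",
				ListKind: "WidgetList",
			},
			Versions: []apiextv1.CustomResourceDefinitionVersion{{
				Name:    "v1",
				Served:  true,
				Storage: true,
				Schema: &apiextv1.CustomResourceValidation{
					OpenAPIV3Schema: &apiextv1.JSONSchemaProps{
						Type: "object",
						Properties: map[string]apiextv1.JSONSchemaProps{
							"spec": {
								Type: "object",
								Properties: map[string]apiextv1.JSONSchemaProps{
									"replicas": {Type: "integer"},
								},
							},
						},
					},
				},
			}},
		},
	}

	if _, err := client.ApiextensionsV1().CustomResourceDefinitions().Create(context.TODO(), crd, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("widgets.example.com registered")
}
```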
That required us to understand and categorize our platform users to establish common patterns for hardware and software infrastructure planning. One category of users is Java, Python, and Go developers who want to test and run their applications in containers. Another is application SREs, whose focus is on building advanced deployment workflows for their teams. Another example is hardware QA engineers, who are not afraid to try new things and explore creative solutions using cloud-native tools, anything to make it easier for hardware teams to validate devices. There are machine learning and batch workload engineers who use the platform daily, running thousands of batch jobs that use both CPUs and GPUs for training machine learning models. Finance is also a platform user, as they are a big part of placement decisions and capacity planning. All of these users want to adopt cloud-native tools for better debugging, logging, monitoring, and tracing of their apps.

Our responsibility as platform developers is to provide a scalable orchestration layer with secure resource isolation and reliable scheduling. Building high-level abstractions and CLI tools is also necessary for the platform's success and adoption. So we have this big infrastructure pool of compute and storage resources managed by Kubernetes. Providing it to users can be done in two ways: cluster as a service and namespace as a service. Namespace as a service is our primary model today. SRE manages multi-tenant Kubernetes clusters, and a user operates in one or multiple namespaces. This way, users do not have the overhead of infrastructure management and can focus on application deployment and development. However, we realize that in a world where cluster provisioning has been commoditized, some teams may still want cluster as a service.

With the goal of centralized capacity control, we had to abstract resource management into its own system. We wanted to make it as generic as possible to streamline the solution for a hybrid cloud in the future. The system would do dynamic balancing across clusters, priorities, usage, and limits to find the best placement for the user's application. Resource isolation would be handled by assigning a priority and quota to each workload. However, we still had to address the privacy and security concerns of running applications in multi-tenant clusters. That had to be done at both the configuration and runtime levels. Even within the constraints of namespace as a service, we wanted users to have the freedom to choose cloud-native tools to run in their namespaces. Some of these tools require CRDs, a cluster-level resource, so we had to design a proper CRD management flow to make that deployment possible.

Solving these technical challenges would not have been possible without the wide array of cloud-native tools, those we use at the platform level and those we recommend to application developers. The list is not complete and keeps growing as we constantly evaluate new projects and find more areas for development and integration. Even with all the careful planning, only running things at scale could reveal certain issues. We chose the open-source Kubernetes perf-tests ClusterLoader tool to stress our clusters. Developers can always write their own framework, but we found ClusterLoader quite powerful. We wanted to engage more with the community and invest time in fixing and expanding an existing tool that could be used by others. Another area of immediate focus was scheduling.
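Returning to the quota-and-priority model described above, here is a minimal sketch, not our actual tooling, of what provisioning a tenant namespace could look like with client-go. The tenant name, quota values, and PriorityClass are made-up examples; the balancing logic that would pick a cluster and compute these numbers is assumed to live elsewhere.

```go
package main

import (
	"context"
	"flag"
	"log"

	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to a kubeconfig file")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	const tenant = "team-a" // hypothetical tenant namespace

	// Each tenant gets its own namespace...
	if _, err := client.CoreV1().Namespaces().Create(ctx, &corev1.Namespace{
		ObjectMeta: metav1.ObjectMeta{Name: tenant},
	}, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}

	// ...a hard resource quota that caps what the tenant can request...
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "tenant-quota", Namespace: tenant},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourceRequestsCPU:    resource.MustParse("20"),
				corev1.ResourceRequestsMemory: resource.MustParse("64Gi"),
				corev1.ResourceLimitsCPU:      resource.MustParse("40"),
				corev1.ResourceLimitsMemory:   resource.MustParse("128Gi"),
				corev1.ResourcePods:           resource.MustParse("200"),
			},
		},
	}
	if _, err := client.CoreV1().ResourceQuotas(tenant).Create(ctx, quota, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}

	// ...and a priority class so the scheduler can rank its workloads against others.
	pc := &schedulingv1.PriorityClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "team-a-batch"},
		Value:       1000,
		Description: "example priority for team-a batch jobs",
	}
	if _, err := client.SchedulingV1().PriorityClasses().Create(ctx, pc, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}

	log.Println("tenant namespace provisioned")
}
```

The point of the sketch is only that priority and quota are ordinary Kubernetes objects a platform can create per tenant; the values themselves would come from the centralized resource management system.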
We run mixed types of workloads on Kubernetes, including batch workloads, and want to maintain reasonable resource utilization. Participating in SIG Scheduling and contributing back is very important to us, as we want to stay as close to upstream as possible. For some people on our team, it was their very first experience contributing to upstream Kubernetes, going through the process of joining a SIG for discussions, filing their first Kubernetes issues, and recognizing the various ways to help the community. It was all new and exciting.

I would like to share some of our current priorities. In a multi-tenant environment, many customers' workloads run on the same machine. To meet security requirements and provide isolation between mixed levels of trust, the common pattern is to leverage defense in depth. We are seeking to enable this level of isolation with micro-VMs. With that, multiple untrusted tenants can run on the same nodes and share the same control plane, while we continue to provide two layers of isolation: the virtual machine and the container.

Cluster as a service comes with the cost of increased security management, resource utilization inefficiency, and operational burden at a company of Apple's scale. As a result, we started exploring alternative models like virtual clusters. With a virtual cluster, a user self-services a virtualized Kubernetes control and data plane on top of shared Kubernetes infrastructure. It eliminates the majority of the security and operational concerns. With the growth of our deployment footprint and its distribution across data centers comes the demand for multi-cluster workload management. We are researching an API proxy solution to offer improved self-healing, utilization efficiency, high-level abstractions, and better developer workflows for teams that manage workloads across multiple clusters.

We want to reduce the platform's complexity for developers and operators. At the same time, we want to maintain high security standards and low resource overhead, so we are investing in a function-as-a-service solution leveraging server-side WebAssembly to solve this. With the constant addition of new features and user workloads, observability becomes one of the top priorities. We want to keep platform management sane as the platform grows. End-user productivity and happiness is our end goal, and we are always looking into ways to improve it with compelling CI/CD and developer tooling.

Adopting new technology on such a big scale often comes hand-in-hand with organizational changes. You need to change everything and everyone, from managers and finance to tech teams and yourself. It is a long road with a lot of growing pains. Our security team grew and took on extra responsibilities. Ensuring security in multi-tenant clusters is not a trivial task. We had to have a clear separation of concerns between application security and Kubernetes platform security to maintain a shared responsibility model. One of our goals was to rethink capacity planning, so having a dedicated resource management team working closely with the finance department was necessary. Some new teams were formed, some grew in size and responsibilities, and some developed a new set of priorities. The SRE team used to be more operational and systems focused; now it has an emphasis on engineering and a much stronger voice in the platform's architectural design. Having a solid and stable platform made it possible to maintain focus on customers' needs and innovate faster. Innovation is impossible without collaboration.
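To make the micro-VM isolation described earlier a bit more concrete, here is a sketch of how it can be expressed in Kubernetes: register a RuntimeClass and have an untrusted tenant's pod opt into it. The handler name "kata", the namespace, and the image are placeholders, and the snippet assumes the nodes' container runtime is already configured with a micro-VM handler, which is not something this code sets up.

```go
package main

import (
	"context"
	"flag"
	"log"

	corev1 "k8s.io/api/core/v1"
	nodev1 "k8s.io/api/node/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to a kubeconfig file")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// A RuntimeClass that maps to a micro-VM handler (for example Kata Containers)
	// configured in the nodes' container runtime.
	rc := &nodev1.RuntimeClass{
		ObjectMeta: metav1.ObjectMeta{Name: "microvm"},
		Handler:    "kata", // placeholder: must match the handler name on the nodes
	}
	if _, err := client.NodeV1().RuntimeClasses().Create(ctx, rc, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}

	// An untrusted tenant's pod selects the micro-VM runtime, so it is wrapped in
	// a VM boundary in addition to the usual container boundary.
	runtimeClassName := "microvm"
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "untrusted-app", Namespace: "team-a"},
		Spec: corev1.PodSpec{
			RuntimeClassName: &runtimeClassName,
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.example.com/app:latest", // placeholder image
			}},
		},
	}
	if _, err := client.CoreV1().Pods("team-a").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
}
```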
Forming a cross-team technical review board for cloud-native-based projects was invaluable to us. It is a safe place to get peer feedback from people with different backgrounds. Some have more experience developing features for Kubernetes, and some have deep expertise in a particular technical domain like security. Getting their combined input is extremely useful during the design stage. With a shared platform, sharing information becomes a necessity. We are lucky to have a dedicated team managing inner-source and open-source initiatives here at Apple. The team focuses on internal training on cloud native and Kubernetes, advises people who want to make their first open-source contributions, and supports those who have been contributing to open source for a while. The team helps connect people and grow both internal and external collaborations.

The ongoing transformation leaves very little time for reflection, but reflection is always necessary. Migration is not something everyone wants to do, but those who invest in modernizing their applications and the platform benefit in the future by adopting more cloud-native technologies on top of that foundation. Migration at scale has taught us to embrace community best practices and shared learnings. We wanted to give back, and we have realized that engagement comes in many forms: writing code, serving on the CNCF TOC, helping the Kubernetes release team, improving documentation, filing issues, and many more. Innovation is driven by sharing and collaboration. Sharing with the broader community through open source, and sharing with other teams in your company through inner source, helps enable reuse of existing solutions and minimizes redundancy. Even a small change can begin to make larger cultural change a reality. We have realized that, in the end, what the migration changes is not just the technical stack, but the people involved in it. Please reach out to us if you have any questions. Thank you.