Hello everyone, welcome to KubeCon + CloudNativeCon North America 2023. Thank you for spending time attending this talk. This session is pre-recorded, and I am honored to present to you the journey of building our Kubernetes platform: successes, failures, and valuable lessons learned. But first, let me tell you a bit about myself. Who am I? I'm Marianne Falakoli, a cloud engineer at RELEX Solutions. RELEX is a supply chain and retail planning platform that helps retailers unify their planning, from demand and merchandising to supply chain and operations. In my free time I sometimes write technical articles on my Medium account, so feel free to find me there or on LinkedIn; I'd be more than happy to get to know you and get connected. Today's agenda covers the following items. I'll first give you a timeline of the project's history. Then we will look at some technical details and the project structure. After that, we can look at some statistics on the platform's usage within RELEX. Then I'll explain the advantages we think the platform has brought to the company, and after that we can dive into the lessons we learned during the journey of building it. At the end, I'll briefly mention some of our future plans. So, first: the project and its timeline. Let's start from 2019, when the project was born. It was initially created for a specific internal development team that wanted to migrate their workload to Kubernetes at the time. As for cloud providers, Azure had already been chosen by management, for reasons unrelated to its Kubernetes service, so Azure Kubernetes Service was the obvious choice for the team. A few consultants were hired to create the project, and it was born at the end of October 2019. In 2020, the project changed to become a unified Kubernetes platform within the company.
In January, I joined the company, initially as a site reliability engineer in one of the development teams within RELEX. In March, there were already five different teams using the platform, and my team was one of them. In April, the first internal employee joined the team to work alongside the consultants on the project. In October, a massive merge request related to the migration from service principals to Azure managed identities was merged; unfortunately, it became a major incident, and a revert of the change was applied within two days. Also, a second internal employee was recruited and joined the team, and one of the development teams decided to leave the project as a result of the major incident. Let's pause and take a look at the failure scenarios in that major incident: what was not working. The first item was that rights were missing. The developer had the required permissions while they were testing, but users didn't have the access rights needed to delete the service principals when they were rolling out the change. The second point was about cert-manager: it wasn't working properly alongside Log Analytics, because Log Analytics was creating its own managed identity and cert-manager couldn't handle multiple managed identities at the time. The third problem was about ingress-nginx: the ingress controller was timing out because it wasn't able to get an IP address from the internal network, due to insufficient service principal rights. And the last problem was about authentication: after a clean install of the migration change, authentication was not working properly except when an admin login was used. So our overview of 2020 is that it was decided to change the scope of the project from a team-based infra platform to a company-wide Kubernetes platform.
There was so much repeated configuration code among the different environments of different teams that it was decided to use Terragrunt, a Terraform wrapper that helps keep your code DRY when working with multiple Terraform modules. In 2021, the team grew. In February, there were five different teams with their different environments using the platform. In March, I also joined the team working on the platform; tests were developed further, and Robot Framework tests were added to the project. In April, semantic versioning was put in place. Before that, there was no versioning: teams were required to deploy their changes based on the master branch, and imagine that they all needed to do that at approximately the same time, so that the master branch could be released and the platform development team could continue their work. Also, resource tags became mandatory at that time, and a start/stop automation was created for our development clusters, because there was no reason to keep those test and development clusters running 24 hours a day; it helped us decrease our costs. In October, the second attempt at the managed identity migration happened, and fortunately this time it was successful. In November, security patching was standardized; it included Helm chart upgrades, provider upgrades and Terragrunt upgrades. Also, full access privileges for platform developers were dropped; instead we used access packages, rights were segregated into those access packages, and users could activate a package to get the needed rights for a specific time. The 2021 overview is that the project grew and more teams were willing to use the platform, so a dedicated team started forming around the platform project, and at the end of 2021 we had a team of three internal members working on it. In 2022, the project got more mature.
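As an aside, the move to semantic versioning described above can be sketched with a small helper. This is a minimal illustration of semver bump rules, not the platform's actual tooling; the function name is hypothetical.

```python
# Minimal sketch of semantic-version bumping, as used once the platform
# moved from master-branch deployments to tagged releases.
# The bump rules follow semver.org; the function name is hypothetical.

def bump(version: str, change: str) -> str:
    """Return the next version for a given change type.

    change is one of: "major" (breaking), "minor" (feature), "patch" (fix).
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

# A team pinned to 1.4.2 taking a feature release would move to 1.5.0:
print(bump("1.4.2", "minor"))  # -> 1.5.0
```

With tagged releases like this, each team can pin to a version and upgrade on its own schedule, instead of everyone deploying from master at the same moment.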
In January, we started publishing release blog posts mentioning the features, bug fixes and the instructions our users needed. In February, we started migrating environment configurations to a separate repository, because until then the platform code and the environment code were all in the same repo, and it was a hassle: there were problems with access rights between us and the users, so we decided to split it into a separate repository. In March, we tackled the deployment service principals, the ones we were using in our deployment pipelines to deploy platform changes to users' environments. Those service principals had too broad rights over all environments, subscriptions and resources, so we thought it wasn't very secure that way, and we segregated the service principals and their rights. In July, three new permanent team members joined the team, which was a nice thing. In August, we changed our release strategy. Until then, we had a two-week release cycle, but it didn't leave enough time for our users to keep up with the changes. So, based on the feedback we got from our users, we changed it to a one-month release cycle: two weeks for them to deploy the platform changes to their own test environments and do their acceptance testing, and then two more weeks to deploy the changes to their production environments. In September, we had a sort of major incident, but only for one test environment in one of the teams' environments; it happened because one of the platform-created Azure resource groups was modified manually in the Azure Portal by a user. In response, we figured we had to add resource locks to all the resources created by the platform. In November, a start/stop module was developed to decrease costs, and our users could use it to turn off their development clusters during non-working hours.
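The start/stop idea for development clusters can be illustrated with a small scheduling check. The working-hours window and the use of Python here are illustrative assumptions; a real module would trigger `az aks stop` / `az aks start` (or the equivalent API) based on a decision like this.

```python
# Illustrative sketch of the start/stop decision for a development cluster:
# keep it running only on weekdays within an assumed 07:00-19:00 window.
# The real automation would then run `az aks start` or `az aks stop`.
from datetime import datetime

WORK_START_HOUR = 7   # assumed start of working hours
WORK_END_HOUR = 19    # assumed end of working hours

def should_be_running(now: datetime) -> bool:
    """Return True if a dev cluster should be up at the given time."""
    is_weekday = now.weekday() < 5  # Monday=0 .. Friday=4
    in_hours = WORK_START_HOUR <= now.hour < WORK_END_HOUR
    return is_weekday and in_hours

def desired_action(now: datetime, currently_running: bool) -> str:
    """Map the schedule decision to an action for the automation to take."""
    if should_be_running(now) and not currently_running:
        return "start"
    if not should_be_running(now) and currently_running:
        return "stop"
    return "noop"

# A cluster still running at 22:00 on a Tuesday should be stopped:
print(desired_action(datetime(2023, 10, 24, 22, 0), currently_running=True))  # -> stop
```

Stopping an AKS cluster deallocates its node VMs, which is where most of the cost saving for idle development environments comes from.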
The 2022 overview is that the project became more mature, more people joined the team, and it became a team of six permanent internal members by the end of that year. 2023: time for supporting features. In April, the feature testing process was changed. Until that moment we had only continuous-deployment testing based on each merge to the master branch, but we figured that we also had to test the changes coming in the latest version compared to the previous version, so that was added to our automated tests as well. In May, we had some feature requests from users about a service mesh. We did our investigation and arrived at Linkerd as the solution; that feature was started and is under development. Also, fully automated deployments were added. Until then, some manual operations needed to be run by our users before each actual platform deployment, for example a `helm uninstall` command, a `terragrunt import`, or deleting a resource. All of these were moved into an automated script that can be run before the actual deployment pipeline. Then in June, we had another feature request from our users: a shared cluster that can be used by different development teams but maintained by us as the platform team, where development teams can deploy their applications and the different workloads are isolated by namespace. This feature was also added in June and is still in progress. In September, we migrated away from akv2k8s, the project that can be used for syncing secrets from Azure Key Vault into Kubernetes resources such as Kubernetes Secrets. That upstream project was poorly maintained: there weren't many contributors, and it hadn't had many releases recently. So we decided to change to External Secrets Operator. The 2023 overview is that there was enough time for us to add supporting features based on the teams' requirements.
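The fully automated pre-deployment step can be pictured as a small runner that executes an ordered list of preparation commands before the deployment pipeline. This is a sketch under assumptions: the command list, release name and `dry_run` flag are hypothetical, not the platform's actual script.

```python
# Illustrative sketch of a pre-deployment script that replaces the manual
# operations users previously ran (e.g. a `helm uninstall` or a
# `terragrunt import`). Command names and arguments are hypothetical.
import subprocess

def run_pre_deploy_steps(steps: list[list[str]], dry_run: bool = True) -> list[str]:
    """Run (or, in dry-run mode, just report) each preparation command in order."""
    executed = []
    for step in steps:
        rendered = " ".join(step)
        if not dry_run:
            # Fail fast: any broken preparation step aborts the deployment.
            subprocess.run(step, check=True)
        executed.append(rendered)
    return executed

steps = [
    ["helm", "uninstall", "legacy-release", "--namespace", "platform"],
    ["terragrunt", "import", "azurerm_resource_group.main", "/subscriptions/example"],
]

for line in run_pre_deploy_steps(steps, dry_run=True):
    print(line)
```

Running such a script as the first stage of the deployment pipeline removes the error-prone manual choreography users previously had to perform before each release.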
Now we can take a look at the project structure and some more technical details. Here we have a high-level architecture of the platform: you see AKS in the middle of an Azure subscription. Then we have other Azure resources such as a container registry, Key Vault and Postgres databases. For the deployment of the different components within each cluster, we use Helm charts. For example, we have ingress-nginx, which is responsible for exposing services from Kubernetes clusters to the public or private network. The ingress controller is linked to an Azure load balancer, which gets either a public IP address or a private one. Then we have cert-manager, a certificate management controller for Kubernetes: it automatically issues and manages TLS certificates for the Ingress objects, and it uses Let's Encrypt for issuing the certificates. In the platform setup, each cluster gets its own DNS zone, so ExternalDNS monitors the creation of new Ingress objects and automatically creates DNS records for new Ingresses within the cluster's zone. We also have other components, such as Datadog for collecting application logs and Prometheus for collecting metrics from the clusters themselves and from application workloads. These metrics are visualized in the centralized observability solution we have within the company, using Grafana dashboards. I already mentioned akv2k8s. And finally we have Kured, a Kubernetes DaemonSet that performs safe automatic node reboots when required. Here we can also take a look at our shared responsibility model between our team and the development teams, our users. At the very bottom level we have Microsoft, which provides us the infrastructure. Our team provides infrastructure as code, productized features, security of the platform and maintenance of it. Then you can see that we have optional modules shared between us and the development teams.
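To make the interplay of ingress-nginx, cert-manager and ExternalDNS concrete, here is a minimal sketch of the Ingress object these components react to, written out as a plain Python dictionary; the hostname, issuer name and service name are hypothetical examples, not the platform's real values. cert-manager watches the `cert-manager.io/cluster-issuer` annotation and the `tls` section to issue a Let's Encrypt certificate, while ExternalDNS reads the rule host to create the DNS record in the cluster's zone.

```python
# Minimal sketch of an Ingress that ties the platform components together.
# Hostname, issuer and service names are hypothetical examples.
ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "Ingress",
    "metadata": {
        "name": "demo-app",
        "annotations": {
            # cert-manager sees this and requests a Let's Encrypt certificate.
            "cert-manager.io/cluster-issuer": "letsencrypt-prod",
        },
    },
    "spec": {
        "ingressClassName": "nginx",  # handled by ingress-nginx
        "tls": [{"hosts": ["demo.example.com"], "secretName": "demo-app-tls"}],
        "rules": [{
            # ExternalDNS watches this host and creates the DNS record
            # in the cluster's DNS zone.
            "host": "demo.example.com",
            "http": {"paths": [{
                "path": "/",
                "pathType": "Prefix",
                "backend": {"service": {"name": "demo-app", "port": {"number": 80}}},
            }]},
        }],
    },
}

print(ingress["spec"]["rules"][0]["host"])  # -> demo.example.com
```

A single object like this is enough to get routing, a TLS certificate and a DNS record, which is exactly the convenience the platform's bundled components provide to user teams.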
The reason is that, in case development teams have capacity and are willing to, they are welcome to collaborate and build some of these optional modules based on their requirements. The development teams are mostly responsible for their content: their application, containerizing the application, deployment and maintenance of their clusters, application security, and operating the application services. Now we can take a look at statistics on the usage of this platform within RELEX. At the moment we have 13 user teams, with 133 non-production and 47 production environments, or clusters. We have 22 mandatory Terraform service modules, which provide the main core of the platform and are necessary for all the clusters, and 42 optional service modules, which provide different services for different user teams according to their requirements. Now I can talk a bit about the advantages this platform has brought to the company, from our point of view of course. The first item is standardization and unification: the project creates reusable components, tools and documentation that make it much easier for development teams to work together and follow consistent practices. This results in improved code quality, reduced development time and enhanced knowledge sharing among teams. The second benefit is security and compliance: the project implements best practices for access control, encryption and vulnerability management. All platform changes are tested for security before deployment, and security patching and upgrades are also taken care of. The third benefit is cost optimization: the platform project takes care of identifying and implementing strategies to make efficient use of resources. This can mean right-sizing infra, leveraging cost-effective cloud resources, or automating resource provisioning and deprovisioning. And last but not least: faster development cycles.
The project can be used to create and maintain a standardized development environment, so development teams do not need to build their own infra and can start coding immediately. Now we can dive into the lessons we learned during this process. First, I want to emphasize the major incident we had and mention some of the lessons we learned from it. The first is that it was a big change: the merge request itself included the changes for migrating from service principals to managed identities, as well as some other changes for cert-manager and ingress-nginx, so it was a really massive MR. The second point is that there wasn't enough testing at that time: unfortunately, we didn't have much automated testing, and the manual testing scenarios were not extensive enough either. The third point is that there wasn't any rollback plan; the plan was overconfident, and no one thought of a failure scenario. The fourth point is that the deployment procedure had limitations, and there was no versioning in the project at that point. As I mentioned, semantic versioning was added some time later; back then, deployments were happening from the master branch, and there was a dependency between different environments for their deployments. Moreover, deployments to the development clusters happened automatically, while production ones were done manually. The last item is that we think it was a major change at an immature state of the project: the project at that state was not ready for such a massive change. Now we can take a look at other lessons learned, from other aspects. The first item is having an internal team. Just having consultants build such a big project, one meant to be unified across a company, is not enough, because knowledge sharing doesn't happen properly. Having an internal team is really important.
You need permanent members who can share the knowledge and be around at least for a while, until the project gets more mature. The second point is about standardization. Not all development teams have a DevOps person or DevOps knowledge, so it is really important that we as the platform team provide standard procedures; this helps them understand what's happening and how they can operate within their environments. The third point is about documentation. Again, knowledge sharing works really well through documentation, so we try to provide best practices, blog posts and release blog posts; all of these helped our users a lot in understanding everything better: how the platform works, how to follow the instructions to deploy changes to their clusters, and so on. Another lesson was about cost optimization: you should think of different possible approaches for this, for example resource tags and the start/stop automation for clusters that don't need to be running 24 hours a day. Another point is feedback. The development teams are our users, so it was really important for us to hear what they think about the platform and how easy or hard it is to use. Having a product owner helped us a lot with this understanding; because of that, we found out that our release cycles were not suitable for our users, and we changed our release procedure. The next point is preventing incidents: you should use every possible way to prevent them, for example resource locks where possible, or soft-delete policies and backup options for any kind of resource where they are available. And last but not least is security. For example, the deployment service principals should be segregated: no single service principal should have access to everything and all the resources. The second point there is access packages: having those whenever possible to segregate access for different users.
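The service-principal segregation lesson can be illustrated with a small helper that builds per-environment role assignments instead of one subscription-wide grant. The subscription ID, group names and role below are hypothetical; a real setup would run the resulting `az role assignment create` commands or define the equivalent assignments in Terraform.

```python
# Illustrative sketch of least-privilege scoping: one deployer identity per
# environment, each granted a role only on its own resource group, instead of
# a single service principal with rights over everything.
# Subscription ID, names and role below are hypothetical examples.

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"

def role_assignment_command(principal_id: str, resource_group: str,
                            role: str = "Contributor") -> list[str]:
    """Build an `az role assignment create` command scoped to one resource group."""
    scope = f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{resource_group}"
    return [
        "az", "role", "assignment", "create",
        "--assignee", principal_id,
        "--role", role,
        "--scope", scope,
    ]

# One deployer per environment, never a subscription-wide grant:
environments = {"team-a-dev": "sp-team-a-dev", "team-a-prod": "sp-team-a-prod"}
for rg, principal in environments.items():
    print(" ".join(role_assignment_command(principal, rg)))
```

The point of the sketch is the scope string: a compromised dev deployer can then only touch its own resource group, never production.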
And finally, our future plans for the project. I mentioned previously that we started building a service mesh and a shared cluster, and we just added those features; now we are waiting for our users to take them into use. Then we can see what's working and what's not, and improve those features, of course. We would also like to have global load balancing for stateful applications and a web application firewall, and it would be nice for us to enhance our security testing strategies as well. So these are some of our plans for the future. And that was it. Thanks a lot for being here and joining this talk. I hope it's been helpful and informative, and I'd be happy to answer any questions you have in the chat.