Thank you everyone for coming. We're here today to discuss the strategies that CFCR uses to help you deploy production-ready Kubernetes clusters, and we'll try to highlight and share what you can learn from our experience. But first, let's introduce ourselves.

I'm banked. I'm a software engineer on the Cloud Ops team at Pivotal in Dublin, and my team is responsible for maintaining production environments like Pivotal Tracker and other kinds of deployments, including CFCR.

I'm Lorena. I also work in the Dublin office at Pivotal, and I've been on the CFCR team for more than a year.

So we're going to start by trying to pinpoint what's important when you want to set up a Kubernetes cluster that's ready for production, describing the challenges you can face and the benefits you gain from taking care of these aspects. We're then going to describe how Kubernetes poses some obstacles to these efforts, how CFCR uses strategies to give us smoother developer and operator experiences, and finally how CFCR works with Kubernetes upgrades.
Let's get started.

Cool. So when you're planning to provision a production environment, you should be able to focus on all the elements that help you run your workloads as smoothly as possible, coping with real-world traffic and protecting your user data. We came up with four main areas of concern that you should focus on when planning a production environment: reliability, security, up-to-dateness and performance. Let's scratch the surface of each one.

When we talk about reliability, we picture operators who don't want, or don't need, to be actively running the environment themselves all the time. At the same time, the development team, or whoever has access to deploy into production, should be able to do so at a time that makes sense for the business, and not in some kind of deployment window or under a policy that isn't part of running the business. That gives you a requirement: an environment that needs minimal intervention to be kept running.

When we talk about security, we're talking about things such as being up to date with CVE fixes and back-patches. We're talking about having control over who accesses the environment, and about minimizing damage when there's a breach. This means that the tools the environment already has let you focus on workload security, because you already control who accesses your VMs and your containers.

As you can probably tell, there's overlap between security and up-to-dateness, but the latter also involves having access to new tools and features that you can leverage to improve your workloads and your environment as soon as possible. It also means that you don't end up using outdated and unsupported versions of software. So if you have control over your upgrade process, you have boring upgrades, where you don't have to worry about things such as workload downtime or what to do when the upgrade fails.
Finally, by performance we mean that your resources are easily optimizable and that your environment can adapt to the amount of traffic it gets. It means that you have the ability to scale up and down, vertically and horizontally, so you can adapt to your needs.

At this point, if you've tried running Kubernetes, you know that deploying can be the smoothest part, because there are many tools, such as installers, that help you during that phase. Day-two operations, instead, are significantly more complex, and those are mostly the production concerns that we're going to focus on.

This year, one of the special interest groups in the Kubernetes community conducted a survey, and it showed that 18 percent of users were running unsupported versions of Kubernetes: they were at least three minor versions behind. This is not surprising, because when you want to upgrade you need a plan to make sure that all the parts that make up Kubernetes behave as they should during and after the process, possibly without disrupting workload and API uptime. This is not trivial and requires very good knowledge of the Kubernetes internals.

Another crucial aspect is what platform you're planning to run your clusters on. If you choose, for instance, GKE, you have most operations automated. But you might not be willing to be locked into a vendor, or you might already have a contract with some other cloud provider, or you might have on-prem hardware that you want to run your clusters on.

The other thing to take into account is the security model that you're planning to use. Kubernetes has its own security recommendations that are important to keep your clusters safe.

So this is barely scratching the surface of how complex Kubernetes is. Let's take a look at how CFCR helps you achieve all those goals.

CFCR, or Cloud Foundry Container Runtime, previously known as Kubo, tries to answer these questions. By the
way, Kubo is also the name of our mascot. CFCR is an open source BOSH release for Kubernetes, which tries to take advantage of both the flexibility of Kubernetes and the experience and opinions that the BOSH community has built up over time. It's available on GCP, AWS, OpenStack, vSphere and soon on Azure, and it's currently used in production by three customers. We're going to describe how it helps us with production concerns and Kubernetes complexities, focusing especially on what's provided by default and the key takeaways that you can learn from it.

Cool. So what do you get by default when you use CFCR to deploy a Kubernetes cluster? You get three master nodes. The master nodes are the nodes that contain all the processes that make up the Kubernetes control plane. You get a co-located etcd process on each of them; since you have three masters, you have an etcd cluster with three members. etcd is the distributed database that Kubernetes uses to maintain the cluster state. You also get three worker nodes. The worker nodes are the ones that contain the processes that manage the containers that run your workloads. All of these are spread across three different availability zones. You'll see why this is important soon.

Let's start with reliability now.
We're going to describe how CFCR takes care of production concerns. For production readiness you want a stable product. We at Pivotal use test-driven development, and CFCR is fully tested to make sure that every change we introduce doesn't break the existing setup. We have unit tests, integration tests, and turbulence tests, which introduce failure scenarios to verify what happens in those disaster cases. What you want is to make sure that you cover all the code you add, for example if you have custom scripts for your upgrades, and to test all the switches and knobs of your k8s configuration.

Plus, we use the official repository and packages. We don't have a fork of Kubernetes, so we are vanilla, and we run the conformance tests, which are part of a certification program in the k8s community, to make sure that our users' code will run as expected based on the common k8s functionality. So you either want to run these tests against your environment or use a conformant installer.

Continuing with reliability: HA, or high availability, is really important if you want a reliable environment. What CFCR does to help you with that is provide three master nodes spread across different availability zones.
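A short aside on why three is the interesting number here. etcd, which is co-located on the masters, relies on the Raft consensus algorithm, and Raft needs a majority of members to agree before a write is accepted. The arithmetic below is a minimal sketch of that majority rule, not CFCR code; the function names are made up for illustration.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 5):
        print(f"{size} members: quorum {quorum(size)}, "
              f"tolerates {failures_tolerated(size)} failure(s)")
```

With three members the quorum is two, so one member can be lost. Placing each member in its own availability zone means that losing a whole zone costs only one member.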
So even if one of them goes down, you still have a working cluster. It's important to notice that etcd is co-located on the masters, so etcd is spread across the availability zones too. That matters because etcd uses a specific algorithm to maintain its consistency and it needs at least three nodes, and you get that by default with CFCR. And last but not least, you get three worker nodes spread across the availability zones too, so you avoid workload downtime.

Apart from setting up HA components, we take advantage of the auto-healing capabilities that BOSH offers for VMs and monitored processes. These two aspects help reduce maintenance overhead for operators and relieve pressure in case of disasters, for example infrastructure disasters.

Finally, we use BBR, BOSH Backup and Restore, for backing up and restoring our etcd data, and we use the etcd CLI for managing snapshots. You want to make sure you have a strategy for backing up and restoring, both for when you have infrastructure disasters and want to use backups, and for when you're running an upgrade and want to roll back in case of issues.

Cool, going on to security. As I said before, Kubernetes has its own recommendations on security. One of them is that all communication between the processes that make up the cluster should be protected with TLS, and you get that by default if you use CFCR. In the etcd cluster, all the nodes need to communicate with each other, and all that communication is done over TLS. Most of the processes that make up the Kubernetes cluster, on the masters and the workers, need to talk to the API server on the master nodes. The API server in turn needs to talk to the kubelet, which is the process on the worker nodes that manages the containers, and to the etcd nodes to maintain the cluster state. All of this is done over TLS. The dashboard is also protected, so you don't run into the problem that
Tesla had: they had clusters with unprotected dashboards, and hackers were using their resources to do crypto mining. All these certificates are auto-generated and securely stored using CredHub.

More on security: Kubernetes also recommends that you use role-based access control, or RBAC. What CFCR does is bind permissions to specific users and service accounts, so that the cluster admin has complete control over the cluster while the Kubernetes processes have only the permissions they need to run. Finally, we use BOSH stemcells to be always up to date when it comes to, for example, patches in the operating system. It's really important that you test your Kubernetes configuration against the operating system you're migrating to.

Moving on to up-to-dateness. Our pipeline tests upgrades between the latest released CFCR version and the latest changes in our repo, so we catch disruptive changes and can guarantee smooth migrations between consecutive CFCR versions. Our upgrade tests focus on minimal workload and API downtime; we have a 99% threshold. This is especially important when we bump k8s, because we want to catch breaking changes in the latest versions. So it's really important that when you're planning to upgrade you check the release notes, so that things such as deprecations and changes in default values don't come as surprises when you upgrade.

Cool, let's jump into performance. As we said before, it's important to be able to scale up and down, both vertically and horizontally. Using BOSH, that's easy to do: you are just a bosh deploy command away from it. This is a screenshot of a scale-up YAML, and you could just as easily modify the original manifest that you used to deploy the cluster. As I said, this is all a bosh deploy away from you, and in this case we are changing the number of VMs, both on the masters and on the workers, to
five from the original three, which is the horizontal scaling, and we are changing the VM type to have more memory, which is a kind of vertical scaling.

Something else that we expose is a Kubernetes feature called the Horizontal Pod Autoscaler. You can set a threshold on CPU usage and other custom metrics to automatically scale up your pod replicas when the threshold is met. So for performance, the really key aspect is that you are able to scale up and down because you have a repeatable and reproducible deployment process. We get this from BOSH, but you want to find a strategy to have the same kind of reproducibility.

So let's talk about upgrading the cluster. What does a good upgrade process look like? You should always upgrade a healthy cluster, so you should check the cluster health first. Do a backup, and then upgrade the etcd nodes, the master nodes and the worker nodes. After the upgrade, you should check the cluster health again to make sure that the upgrade was actually successful.

So how does CFCR do the cluster upgrade? We start with the master nodes. The first thing that happens is that the etcd instance leaves the cluster. We do that so the etcd cluster is aware that the instance is not part of the cluster anymore, so it can maintain cluster consistency. Now BOSH can safely upgrade the processes that run on the master node: it shuts down the processes, upgrades them and restarts them. Then etcd rejoins the cluster, so you have the three nodes in the cluster again, and the same process goes on for all the other master nodes.

Now it's time for the worker nodes, but they're slightly different because they're running pods. Pods are the smallest deployable unit that you can use to deploy something in a Kubernetes cluster, so you need a different strategy.
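Before moving on to the workers, the master sequence just described (health check, backup, then each master in turn with its etcd member leaving and rejoining) can be sketched roughly as follows. This is only a model of the ordering under stated assumptions: `upgrade_masters`, `cluster_is_healthy` and `take_backup` are hypothetical names, and nothing here calls BOSH, etcd or Kubernetes.

```python
def upgrade_masters(masters, cluster_is_healthy, take_backup):
    """Return the ordered steps for upgrading masters one at a time.

    All names are hypothetical; this only models the sequence from the talk.
    """
    # Only ever upgrade a healthy cluster, and take a backup first.
    if not cluster_is_healthy():
        raise RuntimeError("refusing to upgrade an unhealthy cluster")
    take_backup()

    etcd_members = set(masters)
    steps = []
    for master in masters:
        # The etcd member leaves first, so the rest of the etcd cluster
        # knows this instance is no longer part of it.
        etcd_members.remove(master)
        steps.append(f"{master}: etcd member leaves ({len(etcd_members)} remain)")
        steps.append(f"{master}: stop, upgrade and restart processes")
        # Once the processes are back, the member rejoins.
        etcd_members.add(master)
        steps.append(f"{master}: etcd member rejoins ({len(etcd_members)} members)")

    # Check health again to confirm the upgrade actually succeeded.
    if not cluster_is_healthy():
        raise RuntimeError("cluster unhealthy after upgrade")
    return steps


if __name__ == "__main__":
    for step in upgrade_masters(["master-0", "master-1", "master-2"],
                                cluster_is_healthy=lambda: True,
                                take_backup=lambda: None):
        print(step)
```

The point of the sketch is the shape of the loop: only one master is out of the etcd cluster at any time, and the whole run is bracketed by health checks.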
We use a process called drain to safely upgrade a worker node. The first thing that happens is that the worker node that's being upgraded is made unschedulable, so no new workload can be deployed onto it. Then each pod running on that worker is stopped and rescheduled elsewhere by the scheduler, which is a process that runs on the master nodes. Then BOSH can safely start upgrading the worker node: it stops all the processes, replaces them with the new version and starts them again, and the node is made schedulable again. The same thing goes on for all the other worker nodes: made unschedulable, upgraded, and then made schedulable again. And then you have an upgraded cluster.

Let's look at the differences between the previous cluster and the upgraded cluster. Basically, they look the same; they potentially run a different version of Kubernetes. One thing that you want to be aware of is that the workloads are scheduled differently than in the previous cluster. So keep in mind that, to avoid workload downtime, you should have at least two replicas of each pod that you're running on the workers.

This is just the default way that CFCR deals with upgrades. Maybe you want a different upgrade strategy, especially for scheduling. For example, instead of having just three nodes, you might want to add a node, so that you don't just have the node that's being upgraded sitting there without pods. Or maybe you want to deploy something on top of Kubernetes that deals with scheduling in a better way, instead of leaving the third node empty at the end of the upgrade.

So this is almost the end of the talk. I'm just giving you a quick recap.
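Ahead of the recap, here is a minimal simulation of the worker drain sequence described above, to make the "at least two replicas" advice concrete. It is a toy model under several assumptions: pods are evicted one at a time and rescheduled immediately, the scheduler simply picks the least loaded schedulable node, and pod names follow a hypothetical `<app>-<index>` pattern. The real Kubernetes scheduler and drain behaviour are more involved.

```python
from collections import Counter


def app_of(pod):
    """'web-1' -> 'web' (hypothetical naming: <app>-<replica index>)."""
    return pod.rsplit("-", 1)[0]


def rolling_worker_upgrade(placement, nodes):
    """Simulate drain-based upgrades, one worker at a time.

    placement maps pod name -> node name. Returns the final placement and,
    per app, the lowest number of replicas that were running at any point.
    """
    placement = dict(placement)
    min_running = dict(Counter(app_of(pod) for pod in placement))

    for node in nodes:
        # Cordon: the node being upgraded stops accepting new pods.
        schedulable = [n for n in nodes if n != node]
        for pod in [p for p, n in placement.items() if n == node]:
            # Drain: stop the pod, then record how many replicas are left.
            del placement[pod]
            running = Counter(app_of(p) for p in placement)
            app = app_of(pod)
            min_running[app] = min(min_running[app], running[app])
            # Toy scheduler: pick the least loaded schedulable node.
            load = Counter(placement.values())
            placement[pod] = min(schedulable, key=lambda n: (load[n], n))
        # Upgrade and uncordon happen here; existing pods do not move back.

    return placement, min_running


if __name__ == "__main__":
    final, lowest = rolling_worker_upgrade(
        {"web-0": "worker-0", "web-1": "worker-1", "solo-0": "worker-0"},
        ["worker-0", "worker-1", "worker-2"],
    )
    print(final)
    print(lowest)
```

In this run the two-replica app never drops below one running pod, the single-replica app drops to zero while its pod is being moved, and the last node to be upgraded ends up with no pods, which is the empty third node mentioned above.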
So we said that reliability gives us an environment that needs minimal intervention to be kept running. This is achieved by having a fully tested product, possibly conformant, by having HA components, possibly auto-healing ones, and by having a backup and restore strategy.

As for security, we can focus on workload security because we already have the Kubernetes recommendations baked into our environment: we have communication over TLS and RBAC, and we have access to stemcell and operating system patches.

Having control over our upgrade process means that we have boring upgrades. We do this by testing our upgrades before going to production and by making sure we know what's new in the new k8s version.

And finally, we have a performant environment, which is able to scale because we have a repeatable deployment process and we use tools such as the Horizontal Pod Autoscaler.

So these are just some references: our repos, our Slack and our backlog. Feel free to reach out, and thanks for your attention. Thank you.