Hello everybody, and welcome to my session, "How Not to Start with Kubernetes". My name is Christian Hegelman, and within the next 30 minutes I want to share with you the lessons I've learned running Kubernetes on-prem, in our own data center. This might be a little boring for you if you're already running Kubernetes, because all the things I mention are basically beginner mistakes. I've split this presentation into two parts: operations and infrastructure, and development and deployment. But let's get started.

Back in 2018 I was working for another company, which was highly regulated, and one of our architects reached out to the operations team, where I was a systems engineer, and said: hey, we want to use containers and container orchestration, can you help me run a small PoC? The PoC was Kubernetes versus Docker Swarm; we wanted to compare both solutions. As I mentioned, we needed to run on-prem, in our case on CentOS on VMware and Xen, and we started with Kubernetes version 1.9.

And there's already the first little mistake I made, or we made. It was a small PoC with a small group of people, but when you are building a platform like Kubernetes in your company, you should onboard all the other departments as well. From the beginning, get security involved, get the data center people like the storage and networking teams involved. Because, as mentioned, you're building a little data center within the data center, and the platform will be used by all of your developers and all of your operations people. So plan it with a bigger scope than you expect; that's the key takeaway here from my perspective. And if you have no experience running Kubernetes, perhaps you should get external help. There are companies out there that will take your money and tell you what to do, right?

But let's start with infrastructure and operations topics. As I said, our small PoC was really just a three-node Kubernetes cluster. I provisioned three VMs and installed Kubernetes manually with kubeadm, SSHing into the boxes and running all the commands by hand. And it worked; it worked fine for the first couple of weeks. Then I upgraded to Kubernetes 1.10, and that was still possible. We hadn't had any storage provisioner or ingress controller on this little PoC cluster from the beginning, so it was just plain Kubernetes with some web applications deployed into it. But later, when I wanted to upgrade to Kubernetes 1.11, my upgrade failed and the cluster was in a non-recoverable state. That was not the best thing, right? Perhaps I could fix it now with the knowledge I have, but back in the day the cluster was just dead.

So I switched to Rancher and stopped installing Kubernetes manually. Rancher is quite nice: it provides a good UI, you don't have to think about authentication because you can connect it to your AD, and so on, and it's quite easy to deploy clusters using Rancher. But there are other options. If you're running entirely on VMware you can use VMware Tanzu, you can use Kubespray, and there are a lot of other provisioners out there you can utilize. The key takeaway: don't do manual installations, which leads me to my next slide, automation everywhere.
So try to automate from the beginning, with your first cluster. If you only have a small PoC in place, it might be fine to just install it manually or through the UI of the different tools. But if you need to deploy multiple clusters, or you have to provision clusters automatically — say, to spin up a cluster for a development group that will be destroyed afterwards — then everything should be automated from the beginning. So invest the time in automation, have a CI/CD process in place to provision clusters automatically, and document everything from the start. Don't skip that step: document everything, so people are able to deploy clusters even without your help.

Next, a little more detail on networking: the network CNI and its configuration. There are plenty of CNIs available; listed here are only a few of them, like Calico, Flannel, Weave, Canal, and Cilium. The CNI is a really crucial thing you need to think about before you start using Kubernetes: which CNI should you use, and also which CIDR ranges you want to use for your pod and service networks. In my case I ran into an issue here. When provisioning clusters with Rancher, Rancher uses default subnet ranges for the pod and service networks, 10.42.0.0/16 and 10.43.0.0/16. But those collided with one of our offices. The cluster was running fine, but when people from that office actually wanted to access services on the cluster, they couldn't, because, as I said, we had an overlap there. I was able to mitigate the problem by putting an F5 reverse proxy in front of the workloads, but still: think about things like that when you're setting up your CNI.

Another thing I stumbled across was network policies. My first cluster was provisioned with Flannel, which was basically the CNI provider listed in the example, and it worked fine, so I also used Flannel for my first cluster provisioned with Rancher. But Flannel is not able to enforce network policies. One day I wanted to isolate network traffic in one of my namespaces, and I simply wasn't able to do it. And Rancher wasn't able to change the CNI after cluster creation; you can do that with other provisioners or with a vanilla cluster, for instance, but with Rancher we were stuck with Flannel, at least on that one development cluster.
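As an illustration, here is a minimal sketch of the kind of isolation I wanted back then: a default-deny policy that blocks all incoming traffic to pods in one namespace. The namespace name is hypothetical, and remember this only has an effect if your CNI — Calico, Cilium, and so on, not Flannel — actually enforces network policies:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-team        # hypothetical namespace to isolate
spec:
  podSelector: {}           # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress               # no ingress rules are listed, so all inbound traffic is denied
```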
Then you need to think about whether you want to encrypt data in transit without using a service mesh, for instance. I think Calico is able to offer this, and Weave Net as well. And know at least the basics, like how DNS works in Kubernetes and how service names are resolved inside the cluster. I've seen workloads configured to call a service running in the same namespace by going out through the entire ingress controller, instead of just calling the service name inside the cluster — a pod can reach a service in the same namespace simply by its name, or via my-service.my-namespace.svc.cluster.local from anywhere in the cluster.

Speaking of ingresses, that was the next question after my first PoC cluster: how do we reach our services inside the cluster? I had no experience there either, so I read through presentations and examples, and I ended up deploying MetalLB inside our cluster and using Traefik as the ingress controller. I had a single Traefik ingress controller running, and we split our traffic into external and internal using allow lists. But this is error-prone: a developer could forget to add an allow list to their ingress definition, and then their ingress might be exposed to the internet by accident. So it would probably be better to have separate ingress controllers for external and internal workloads. And in our case it might also have been a better idea to put our existing F5 in front of the ingress controllers instead of using MetalLB — reuse the things you already own, that you already have in your environment.

Then you have to think about security. Do I want to add a web application firewall in front, on the load balancer level? In AWS you can enable a WAF on your load balancer, and I could have reused my F5 if I had used the F5 as my load balancer. Or you can do it on the ingress level: with NGINX you can use ModSecurity, for instance. So think about security in advance. The same goes for how you want to manage SSL certificates. A very convenient way is using cert-manager with Let's Encrypt, because you don't want to end up deploying certificates manually with kubectl apply and taking care of renewing your certificates after two years. You will forget it — trust me, at some point you will forget to renew the certificate in some specific namespace or wherever, and then you will probably have a short outage.
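To make that concrete, here is a minimal sketch of what a certificate request looks like with cert-manager. It assumes cert-manager is installed and that a Let's Encrypt ClusterIssuer named letsencrypt-prod already exists; the issuer name, namespace, and hostname are all hypothetical:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-app-tls
  namespace: my-team
spec:
  secretName: my-app-tls      # cert-manager stores the signed certificate in this Secret
  dnsNames:
    - my-app.example.com      # hypothetical hostname
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-prod    # assumed pre-existing Let's Encrypt issuer
```

cert-manager then renews the certificate automatically before it expires, which is exactly the manual step people forget.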
The next thing I want to cover: a developer reached out to us, or to me, and said: hey, we need to store some files in our cluster for our workloads. And I said: okay, I have no idea, I need to do some research on what to do here. He said he had already found something on the internet, and it was basically a manual on how to install Heketi with GlusterFS inside the cluster. So we just followed the guide, followed the instructions. But what happened after that: every time I upgraded the Kubernetes version and the kubelet service was restarted on a node, all the mounts failed, and I had to restart all the services to get my workloads back online. This was, again, something that was a problem with Rancher: there was an extra bind mount missing on the kubelet service, which caused the issue. And since we didn't have IOPS-intensive workloads — it was really just some files being dropped into a folder — I switched to the plain old simple NFS client provisioner. So if you want to introduce a storage backend, you should also ask your storage team what they can provide; perhaps they already have a solution for you. And you also need to think about how you do backups for your persistent volumes — if there are files in them that you really need, back them up. What kind of workload do you expect, and what does your infrastructure already provide? Take the time to research what you really need to implement here in your cluster.

The next things I want to bring up are authentication, authorization, and role-based access control. Use service accounts from the beginning. One of the mistakes I made: after my first development cluster, I just handed the developers a token with cluster-admin rights, and this token was reused by a lot of people. That shouldn't be the case. It was really hard to get rid of the token afterwards, because it was used everywhere: it was in kubeconfig files, it was in the pipelines, and so on. So don't share credentials across teams. Have separate accounts for deployment, monitoring, and operations tasks; everybody should have their own user. Then you can also lock down which resources the different departments can access within the cluster.

The next topic I want to cover is logging and monitoring. It sounds obvious, but having proper monitoring in place is also crucial. Don't just monitor the services you deploy inside your cluster; you should also monitor the infrastructure: the node level, the cluster services like your etcd database and the API server latency, and the services running in the cluster. Kubernetes events should also be covered by your monitoring, like OOMKilled and CrashLoopBackOff events. The actually requested and used resources are also something you should monitor with your stack. And the funniest question I kept being asked by developers was: hey, where should I write my log file? Because every time I redeploy my pod, my log file is gone. So you keep telling them: you don't write log files, you write to standard out; our Fluentd process will then grab the logs and forward them to Splunk, and in Splunk you can see all the logs of your pods and be happy with that. So log monitoring is really crucial as well, and there are plenty of open-source solutions out there, like Prometheus in combination with Grafana for metrics, the ELK stack for logging, and so on. Take the time to set up proper logging and monitoring for your clusters.

Next topic: I was pinged one morning with "hey Christian, all ingresses are down, nothing is working anymore, everything is completely down, please help, please help!" And when I checked the latest ingresses that had been deployed to our clusters, I noticed that one of them had no hostname defined. What the ingress controller then did was interpret the missing host field as a wildcard, so all traffic was being routed to the one service behind this ingress definition. But you can prevent such errors in advance with policy engines and cluster policies; there are plenty of tools out there, and the most popular are Kyverno and Open Policy Agent. Deploy cluster policies from the beginning, so that no workload gets deployed — even on the dev clusters — without proper validation: the ingress validation I just described, pod security settings like low privileges, that your deployment has resource limits and requests defined, that health checks are configured for your service. So start with policies straight away.
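As a sketch of what such a guardrail can look like: a minimal Kyverno ClusterPolicy that rejects any Ingress whose rules don't define a host, which would have caught exactly the wildcard outage I described. The policy name is hypothetical, and you'd want to verify the exact field spelling against your Kyverno version:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ingress-host       # hypothetical policy name
spec:
  validationFailureAction: Enforce # reject non-compliant resources instead of only warning
  rules:
    - name: check-ingress-host
      match:
        any:
          - resources:
              kinds:
                - Ingress
      validate:
        message: "Every ingress rule must define an explicit host."
        pattern:
          spec:
            rules:
              - host: "?*"         # each rule must have a non-empty host
```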
The last thing I want to mention on the operations and infrastructure side — I titled the slide "kube-control, kube-cuddle, or kube-C-T-L", however you want to pronounce kubectl — is that operators and developers should at least be familiar with common kubectl commands, like kubectl get pods, kubectl logs, kubectl describe pod, and so on. I've seen a lot of people struggling when, for instance, the Rancher web UI was not available, and they were not able to troubleshoot deployment issues. These are the basics you should be aware of, so again, invest in training here so that people are familiar with the tools.

So, next up is the development and deployment part. First: running workloads in Kubernetes for the wrong reasons. I've seen this multiple times. When you ask somebody "hey, why did you deploy this application in our Kubernetes cluster?", you get answers like "I run it in Kubernetes because I didn't want to request a VM." There are other examples: back in the day we had Puppet and Hiera for configuration management in our company, and people used to deploy things into Kubernetes because they wanted to skip the merge request in our Puppet configuration. That's also not the way to decide whether a workload should run in Kubernetes. It's like running a static website inside a pod instead of just putting it in an S3 bucket. Think about why the workload should run in Kubernetes, and whether it makes sense; it should be the architects who decide what kind of workload runs in which environment, and whether it should run in Kubernetes at all.

The next one: local development. A lot of the time I'd get pinged with something like "Kubernetes is down, my deployment is not working!" You'd check the deployment, see a lot of restarts on the container, pull the image to your local computer, just run it with docker run, and see that the container wasn't starting at all. So developers working with Kubernetes should be testing their containers on their local machines first. There are multiple options for running Kubernetes on your local machine, like kind (Kubernetes in Docker), minikube, and k3s, so you can also test your deployments there. I've also seen a lot of people building Helm charts with a commit history of dozens of commits just to test their Helm deployment. You don't need a lot of commits, and you don't need to run an entire CI/CD pipeline every time, just to try out your Helm chart; this can be done locally, and faster, in my opinion. So get familiar with the local development tools and just use them.

And one of the main reasons deployments or CI/CD pipelines were failing: they were using latest tags. I can tell you, never ever use the latest tag, neither in your CI/CD pipeline nor in your Kubernetes deployment. It can break your entire workload when your image gets rebuilt. Let's assume you're building a Node.js application using node:latest; all of a sudden Node.js releases a new version, your application is not compatible with the new Node.js version anymore, your image gets built on the new base image, and that basically breaks your workload. One concrete case I had with the Heketi storage provisioner: in the example I used to install Heketi with GlusterFS in the cluster, Heketi was deployed using latest, and every time Heketi was rescheduled it pulled the newest image. Then we had a version discrepancy between GlusterFS and Heketi, and that was one of the bigger problems, because we weren't able to provision any persistent volumes anymore. So please, don't use latest.
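As a minimal sketch of the fix, with a hypothetical application, registry, and version: pin an exact, immutable tag in your deployment manifest (and likewise in your Dockerfile base image), so that a rebuild or rescheduling never silently pulls something new:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-node-app                  # hypothetical application
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-node-app
  template:
    metadata:
      labels:
        app: my-node-app
    spec:
      containers:
        - name: my-node-app
          # pinned, immutable tag from our own registry -- never :latest
          image: registry.example.com/my-node-app:1.4.2
```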
But speaking of images: private registries and base images. I think it's really crucial to have a library of your own base images and your own private registry in your environment. You keep control over the images, and you can scan them for vulnerabilities — Harbor, for instance, is an open-source registry you can use, and it can scan your images continuously. You can also add additional configuration to the base images; Java 8 was one example for us. And there's no need to pull everything from Docker Hub and hit the rate limit.

Also, don't use the image pull policy Always by default. There are some use cases where imagePullPolicy: Always is useful or good, but normally you don't want to pull the image every time the pod gets rescheduled onto another worker node if the image is already there, because reusing the image that's already cached on the machine makes starting up your application much faster. And if possible, use block and allow lists for your developers. You don't want something like the example on the slide: a docker pull from some untrusted source, perhaps with a crypto miner in it. I've seen people pulling images like alpine-curl from really untrusted or unknown sources. Why not just build your own Alpine image with curl included, if you need it?

Another problem I've seen frequently is not using the power of Kubernetes, like the quote on the slide says: "Why autoscaling? I have two pods." When you just deploy one or two pods of your application, why run it in Kubernetes at all if you don't use the features Kubernetes is offering you, like autoscaling — really cool stuff. So use the features Kubernetes gives you.

Another problem in the development lifecycle: health checks. Please make sure your application is configured with proper health checks — a readiness probe and a liveness probe — and again, you can use Kyverno or Open Policy Agent to prevent applications from being deployed to your cluster without them. And if you ever get asked "hey, please restart my container, my application, in Kubernetes", then you have to ask the developer: why do we need to restart it manually? That shouldn't be the way you work in Kubernetes; your service should recover and restart automatically. A proper health check in place is really, really crucial.

Next: forgotten resource limits and requests. I've seen a lot of deployments that hadn't defined any resource limits or resource requests for their application. What can you do against something like that? Again, you can reject such workloads with policies, or you can add default requests and limits per namespace or per cluster. A question I got asked quite frequently was: "yeah, but I don't know how many resources my application uses." Here I refer back to the local development section: just run your image locally and you will directly see how many resources it consumes. And the opposite happens too: I've seen small microservices written in Go that just requested 12 CPUs and 128 GiB of memory. Then you should ask the developer: hey, isn't that perhaps a little much? Because Kubernetes will try to reserve the requested memory on the worker node, and it also schedules your application based on your resource requests. So monitoring these things is beneficial as well.
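Pulling those last two points together, here is a minimal sketch of a deployment with both probes and sensible requests and limits. The application name, health endpoint, and all the numbers are hypothetical; profile your own workload to find yours:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                       # hypothetical application
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          image: registry.example.com/my-api:1.0.3
          ports:
            - containerPort: 8080
          readinessProbe:            # gates traffic until the app is ready
            httpGet:
              path: /healthz         # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:             # restarts the container if it hangs
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          resources:
            requests:                # what the scheduler reserves on the node
              cpu: 100m
              memory: 128Mi
            limits:                  # hard ceiling before throttling / OOMKill
              cpu: 500m
              memory: 256Mi
```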
The next thing I want to talk about is environment variables for configuration and secrets. You can totally put some configuration into volumes and environment variables, but I've also seen deployments with, I don't know, 50 or 60 different environment variables just for configuration. So use ConfigMaps for configuration, and never, ever put secrets in plain text into environment variables. Create a standard for how your applications should be configured, and use a secret store: you can use Vault, for instance, or something like Sealed Secrets from Bitnami, which makes things more secure. Kubernetes Secrets are not encrypted in your cluster, so if you can read a secret with your service account or your user account, you can just run it through base64 decode and you have the value of the secret. Keep that in mind, and also make sure, when creating service accounts, that you don't allow them to read all secrets.

And one more thing: use a template engine. What I've seen in our environment is that people were reusing the same YAML files for deployments, for ingresses, and for services over and over again. That makes it really hard to change all those YAML files across the different Git repositories if you want to introduce something new, like a new annotation, or some monitoring annotations for your applications, and so on. So perhaps provide a Helm template for your developers that they can use, because most of the deployments in our organization were basically the same. I created a Helm template they could use, with simple switches inside the template: expose my service externally, yes or no? Do you want a persistent volume, yes or no? They didn't need to think about storage classes, they didn't need to think about ingress configuration; it was just a matter of the template. A developer could then easily deploy their application from their CI/CD pipeline with a helm upgrade --install, providing the image URL and the image tag, setting a few more parameters, and they were good to go. This was a huge time saver for us, or for me.

So, as a summary of my talk: when you're starting your Kubernetes journey in your company, invest in training if you have no idea, like I had back in the days. Train and document how to use the platform; documentation is also key here, so you don't run into the same issues over and over again, and so the developers and the operations team have guidance on how to use Kubernetes and the platform. Provide templates to avoid common mistakes like the forgotten health checks, and try to reject everything as early as possible that doesn't meet the required standards for your deployments — again, with things like the policy engines. And involve your security and operations people from the beginning, involve all the other teams from the beginning. As I said, you're building a really big platform here that will later be used by all the people in your organization.

That's all from my side. I'm happy to be available for some Q&A; you can reach out to me via LinkedIn, on Twitter, or shoot me an email. I hope you enjoyed the session, and see you later in the Q&A. Bye bye!