Hello everyone, my name is Adam Kozłowski and I'm here to present "Your Own Kubernetes Castle", a presentation which tackles the very important topic of building production-ready Kubernetes clusters using open source components. Shortly about me: I have worked at Grape Up for over five years as a technical leader and cloud solutions architect, and I help big enterprises with their Kubernetes adoption and with promoting their clusters to production. So let's think about the topic I put there: why a castle? I imagine that a production cluster is very similar in its properties to a medieval castle. Why is that? I think they share common requirements. When someone built a castle, it was built for years. It wasn't a temporary project meant to work for only a few weeks or months and then be torn down. It had to be reliable and resilient and work for years. This is also important for production clusters. They shouldn't be built with just a few weeks in mind, because in general that's not how production works. A cluster cannot be a temporary solution. The other important thing about a castle is that it has to be secure; for example, it could be built on a mountain. This also applies to Kubernetes clusters. A production cluster has to be secure. It depends on the environment: a development cluster doesn't have to be as secure as production, but for a production cluster I think security is one of the most important topics, maybe the most important one. The next thing is that there should be supporting infrastructure. A medieval castle needed a clean water supply, or blacksmith and carpenter services, and something similar, maybe not exactly the same, applies to Kubernetes clusters. Kubernetes is just an orchestrator.
It's a tool with a specific use case: orchestrating your workloads, orchestrating your containers. To make it production-ready and production-viable, you need to install supporting components which help with the topics not handled directly by Kubernetes, like storing and displaying logs, storing and displaying metrics, or simply the automation for CI/CD. And the last topic, which is rather important and also shared between a castle and a Kubernetes production cluster, is that in the medieval era access to the castle depended on your role in society. The king had one kind of access, a knight had another, and the blacksmith had yet another access to the inside of the castle. The same is true for a Kubernetes cluster: the user you are and the group you're in determine what you can do in the cluster. If you are an admin, you can create any kind of resource, just like you can do anything in the castle. But if you're just a developer, maybe you are not able to configure, for example, ingresses or storage classes. It really depends on your group or your role. So that's the first part of the topic, but there is also a second part: why open source, and why open source components? Open source is a rather broad term. It may refer to the distribution model, to the license, or just to the open source movement. To describe why open source, I will use the definition that the open source model is decentralized and focused on open collaboration, because this is why I picked open source. When you use open source components, everything is based on open collaboration. Anyone can propose a change and create a pull request. Not all pull requests are accepted, but that's not the end of the world: you can always create a fork, either for your own use or shared with the community, and use that solution in your product. So it's not closed source.
You can make any change at any time and adapt it to your needs. There is community collaboration: people are trying to solve the issues together, it's not just a single person trying to make it all work, and there are even companies sharing source code and helping with the development of open source tools. And the affordable pricing. That sounds a little funny, because in most cases we think of open source as free tooling, and in most cases it is free. But it's important to remember that open source does not automatically mean free; it only means the source is publicly available. So to know whether it's free in your scenario, for your needs, it's better to check the license. There are different licenses, and most of them are in general very permissive, but if you think about changing the code, you need to make sure the license does not, for example, require you to publish your changes. Some licenses do. So what are the bricks, the last part of the topic? I would divide them into two main aspects: the tools and the configuration. In this presentation we'll focus on the tools, which cover multiple topics like observability, automation, and security. The other important part, which is not covered by this presentation, is the configuration: role-based access control configuration in your cluster, kubelet fine-tuning, authentication and authorization, like implementing OpenID Connect or LDAP for access to the cluster itself, and making sure the underlying infrastructure is resilient, safe, and backed up. This is very important. To make the solution truly production-ready, you need to combine both: good tools and components, and great configuration. So in this presentation I will show you multiple open source tools and components which I would recommend for your Kubernetes production environments. But how did I select them?
All of them were tested by Grape Up and by me in our projects, and they are proven to work great in production clusters. They are all open source licensed and have active community support. This is very important, because you can always find an open source tool which works, but the last commit was a year ago and the last answer to an issue or a bug report is three months old. That's not really useful. Active community support is really important for this kind of tool, because if there is a security issue, even if you can fix it yourself, someone still has to accept the pull request, right? The other important thing is that all of them are part of the CNCF, the Cloud Native Computing Foundation, landscape. So you can always find them there and check how they compare with other tools which are not present in this presentation. Okay, so let's start with the first topic, observability, and what I mean by observability. Observability is the ability to see what happens inside your cluster. The core components for that, which we will go through in a moment, are logs, metrics, and the network, which is observable through a service mesh, for example. Those are the three important aspects of observability. And why do we need it? Obviously, you always need to know what happens in your cluster when it's in production. You need to make sure it's operating correctly, and you need to get alerts when something might go wrong in the near future, or has just gone wrong a moment ago. So the first one will be logs, and for logs I selected two systems. One is ELK, which is Elasticsearch, Logstash, Kibana; another often used name is EFK, which is Elasticsearch, Fluentd, Kibana. This is a widely used tool.
Very common, very popular, and very capable of handling extremely large amounts of logs and storage, but also quite complicated to configure, especially if the setup has to be HA, highly available, and you store a lot of logs. It might be rather complicated to configure ELK if you have no experience. The huge advantage is the query language, which provides full-text search, and that is not that common. This is something ELK is great at, and it makes searching logs much easier than in some other tools. The disadvantage of ELK itself, maybe not a huge disadvantage but a notable limitation, is that security is not part of the open source offering. There is a workaround for that, called Open Distro; it was created, I think, by Amazon, and it extends the basic ELK tool set with security and other important features not present in the basic installation. So if you need the security part and multi-tenancy, you can use Open Distro for that. But sometimes, even for a production cluster, the ELK stack is too big or too complicated to configure, and for that situation you have Grafana Loki. It has a much lower resource footprint than ELK, and it shares its UI with Grafana, which normally serves the metrics. It's very easy to install, too. The disadvantage is that it's much simpler: it doesn't have the visualizations that Kibana has, and its query language is very similar to the Prometheus query language, which means it's limited. It's not full-text search; it's more like regex matching. So it is limited in that respect, and it might be harder to find the log line you are looking for. But both are great; it just depends on your needs. Next topic, metrics. And for that I actually picked just one solution, because it's so popular and so widespread that I think I can recommend it easily.
And this is the set of Prometheus, Grafana, and Alertmanager. Most of the installers, for example the Prometheus operator, install the full set at once, because all of them depend on each other. Prometheus is the tool which gathers and stores the metrics, following a pull or push model depending on the configuration. The metrics are stored in the TSDB, a time-series database which holds the metrics sorted by timestamp. To display the metrics and the dashboards for the user, there is Grafana, a very nice, highly configurable tool for creating dashboards. It reads directly from Prometheus, so it doesn't really require much storage or a database of its own; it just stores the dashboards, and the metrics data is in most cases fetched directly from Prometheus. It depends on your caching strategy, too, but in most cases the data comes straight from Prometheus. The metrics are also the basis for alerts, and there is a tool called Alertmanager, which is also part of this tool set. It is able to alert and notify you about problems or alarms created based on the metrics, either the metrics coming in right now or historical ones. You can, for example, create an alert when the CPU usage is too high, when the amount of available storage is getting low, or when the bandwidth for some specific service degrades. It's highly configurable: you can use any metrics you want, and very complicated mathematical formulas if calculating a specific alert requires them. As a tool set, it's really great, especially at the start when you need to monitor your cluster, because if you use the Prometheus operator installer, though you're not limited to it, you can configure the whole stack easily using CRDs.
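As a sketch of what that looks like in practice, the CPU alert mentioned above can be declared through the Prometheus operator's PrometheusRule custom resource. The resource name, threshold, and the `release: prometheus` label (which has to match whatever rule selector your operator installation uses) are illustrative assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-cpu-alerts          # illustrative name
  labels:
    release: prometheus          # assumption: must match the operator's ruleSelector labels
spec:
  groups:
  - name: node.rules
    rules:
    - alert: HighNodeCPU
      # fires when average CPU usage on a node stays above 90% for 10 minutes
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} CPU usage has been above 90% for 10 minutes"
```

Once applied, the operator picks the rule up and reloads Prometheus automatically, and Alertmanager handles routing the resulting notification.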
But the CRDs are not just for installing Prometheus, Grafana, and Alertmanager; you can also use them to create alerts and dashboards dynamically, which is great, and you can even allow users to add their own dashboards to these tools. Most of the tools in this presentation, and most of the open source tools from the CNCF landscape, already have examples or ready-made Grafana dashboards, which is also great. So support for Prometheus and Grafana is very often built in. Sometimes you need to scale up the metrics system, and there are two open source solutions for that as well: Cortex and Thanos. They're very similar; the difference is that Cortex was designed for scalability, while Thanos was designed with a small footprint in mind. Cortex works as a centralized system, while Thanos is deployed as a sidecar. So with Thanos there are more Prometheus instances, but smaller ones; Cortex is a single big one. They also differ from the query and storage perspective, because Thanos has to query all the Prometheus instances and then merge the results, and its query layer is well optimized for that, which makes it very fast. Cortex, on the other hand, has centralized storage: all metrics are sent to a single store, which makes it, for example, easier to back up. So those are two different tools for when you need to scale up your metrics system. The other thing you might need for observability is a service mesh. A service mesh is a dedicated layer for making service-to-service or container-to-container calls. The idea is to solve the challenges developers face when they need to call remote endpoints or endpoints inside the cluster, like making the calls secure by default or adding service discovery. A service mesh is also a set of proxies which abstract the network inside the cluster.
And because all the traffic goes through those proxies, observability is very often built in, which makes it easier to gather metrics about the bandwidth being used and the number of connections failing or succeeding. So that's a very important topic, too. For this topic I picked two solutions, and they're very similar; there are not that many differences. Istio is very popular, I think the most popular service mesh right now, and it has a lot of examples, code snippets, and material already out there, especially in terms of documentation and articles. It also has multi-cluster support. But compared to Linkerd, Istio is slightly less performant when the traffic is high, so under high load with large amounts of data being transferred. Linkerd was built with performance in mind. It doesn't have multi-cluster support, and some of Istio's features, like circuit breaking, are missing, but its resource footprint is very small, which makes it much faster when there are large amounts of data to be transferred. So the next topic is automation, or continuous integration and delivery. Every cloud-native solution, every production cluster, has to consider this aspect of deploying and developing applications: making sure the builds are repeatable and observable, having proper version control, and, mainly for production clusters, having an application delivery system which is also reliable. Let's start with the rather complicated topic of GitOps, and why GitOps is complicated. It is complicated because it tries to solve a very important aspect of development: the problem that a developer has to be able to deploy their applications automatically, from development to production. And the problem with that is that there has to be a single source of truth.
For example, a Git repository which holds all the configuration, and which is then polled for changes by Argo CD or Flux in this case. If the state of the cluster differs from the repository, the cluster has to be updated. In theory this is a rather easy concept, but there are caveats that are very hard to solve. For example, the secrets. Secrets are not really safe in Kubernetes if you store them as a Secret resource, because they are not encrypted by default. So the GitOps tool has to read these secrets from somewhere, and you obviously shouldn't put them in the Git repository. This part of the configuration is very often challenging; otherwise the configuration is rather simple. And why did I pick these tools? I tested both of them and they're both nice, and there are also small differences between Argo CD and Flux. Argo has a very nice UI, so it's easy to just look at it and see how it behaves, and it has great multi-cluster support: for each project, each component you configure there, you can set the target cluster, and there can be multiple targets. So Argo properly supports a multi-cluster design. Flux, in contrast, is able to read only one remote repository and serve one target cluster. That's a limitation, though not a big one, because in most cases you can live with having just one Flux in your cluster. It also doesn't have a UI, but it has a nice CLI for management. Both of these tools are only continuous delivery GitOps tools; there is no continuous integration part. Continuous integration is the next part, and for it I picked two tools: Jenkins and Concourse. Both of them are great. Jenkins is widely used, it has huge adoption; I think almost every company, or every developer, has used Jenkins at some point of their journey. And it has tons of plugins available, so you can install almost everything as a plugin. But it's also a little bit monolithic.
It's harder to install and configure than the other tools, and configuration as code is a little bit strange compared, for example, to Concourse, because Jenkins is configured partially through the UI and partially through code. The lightweight alternative to Jenkins is the already mentioned Concourse. Concourse is very easy to install, and it has a really great system of pipelines, which are deployed through its CLI, called fly, and written in YAML. The UI is very clear, and since all pipelines are described in YAML, there is literally no way to change the configuration through the user interface. It was also designed so that the workers are very lightweight and fully isolated: each task in Concourse runs in a container which is fully isolated. In Jenkins this is also possible, but it is not how Jenkins was designed initially, so it requires somewhat more sophisticated configuration. The next important topic for your cluster is the ingress controller. What is an ingress controller? Let's start with that. An ingress controller is a combination of a load balancer and a proxy, which is responsible for reading the Kubernetes Ingress resources and, based on those objects, routing the incoming traffic for your cluster to a specific service or set of services. For this I picked three possible ingress controllers. There are many more of them, I think twelve or so, so I'm not saying these three are the best; these three are the ones I have tested. The first one, which is part of the official Kubernetes documentation, is the Kubernetes Ingress Controller, or NGINX Ingress Controller, because it's based on NGINX. And here lies a small disadvantage of this tool, because there are actually two ingress controllers: one is the Kubernetes Ingress Controller and one is the NGINX Ingress Controller.
And even though the Kubernetes Ingress Controller is based on NGINX, they are not the same thing. So if you go through the web pages and try to figure out what the difference is and which one is which, it's slightly more complicated than it initially seems. But both of them have great advantages. A lot of people know NGINX; it works great and has nice configuration options. It requires some knowledge to write the NGINX configuration correctly, but if you have that knowledge, you can configure almost everything. And many of these configuration options are available on the Ingress object as annotations, so you can take advantage of most of the configurability of NGINX just by using annotations on the Ingress objects, which is great. The difference between the Kubernetes Ingress Controller and the NGINX Ingress Controller is not big: the NGINX one provides query parameter support as an extension beyond just route and path, while the Kubernetes one additionally has an authentication part; basic authentication, for example, can be configured with the Kubernetes Ingress Controller. The two alternatives I picked are Traefik, which is now at version two, and HAProxy. Traefik's advantage is its very nice user interface: if you need to quickly look at what's wrong, at what's going on in the ingress controller, Traefik is great for that. You have an easy-to-use user interface, and you cannot make any configuration changes from there, which is actually rather good, because you can safely give developers access to the UI. And it's really easy to install. As for the disadvantages: when it switched from version one to version two, it lost support for some of the features; I think they might add them back later in development. And while it supports the native Ingress resource, it also uses its own CRDs, which is slightly confusing when you're just starting out. And then there is the HAProxy alternative.
I put it here because it's very performant compared to any other ingress controller; in some of the tests and comparisons it may even be the most performant load balancer. It has a lot of configuration options, you can configure a lot of things, but it doesn't have a user interface, and the configuration may be slightly harder than with the Kubernetes and Traefik ones, because there are fewer resources available for it. But if you need a really performant ingress controller, HAProxy may be your choice. So, security. Security in the cluster is a very important topic. As I said earlier in this presentation, the configuration part of the cluster, like making sure only authorized users are able to access it, and role-based access control, RBAC, for making sure only specified people can make specific changes, is very important. But you're not limited to those few aspects of Kubernetes security; there are tools you can install that help you make sure the cluster is secure. The first important topic is an OpenID Connect, or OIDC, provider. These providers are just an identity layer for verifying end-user authentication through an external authorization provider, a third party. Both of the ones I'll show are very easy to use, and it's just nice to have an OIDC provider inside your cluster, because you can easily switch the third party which is used for authorization and for verifying the identity of the user. So that's great. Here I picked two of them, and they're also very similar. Dex is simply a smaller tool than Keycloak. It's easy to use, it's simple, it's just an OIDC proxy: it proxies your authentication and authorization requests to another provider. But it has somewhat limited capabilities compared to Keycloak, and it is just a proxy: no automation, no custom claims, nothing like that. It just proxies your requests.
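To give a feel for how small Dex really is, here is a rough sketch of a Dex configuration that proxies authentication to a corporate LDAP server. The issuer URL, hostnames, DNs, client entries, and secrets below are all made-up placeholders, not values from any real setup:

```yaml
# Dex configuration sketch -- every host, DN, and secret here is a placeholder
issuer: https://dex.example.com/dex
storage:
  type: kubernetes          # keep Dex's own state as CRDs inside the cluster
  config:
    inCluster: true
web:
  http: 0.0.0.0:5556
connectors:
- type: ldap
  id: ldap
  name: Corporate LDAP
  config:
    host: ldap.example.com:636
    bindDN: cn=readonly,dc=example,dc=com
    bindPW: placeholder-password
    userSearch:
      baseDN: ou=people,dc=example,dc=com
      username: uid
      idAttr: uid
      emailAttr: mail
      nameAttr: cn
staticClients:
- id: kubernetes
  name: Kubernetes
  redirectURIs:
  - https://kubectl.example.com/callback
  secret: placeholder-client-secret
```

And that's essentially everything Dex does here: translate incoming OIDC requests into LDAP lookups and hand tokens back.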
But if you need more than Dex provides, you can use Keycloak, which is very extensible, advanced in configuration, and has a nice UI, so those are two nice things to have. It also allows you to create custom flows and two-factor authentication, so you can configure a more secure system with it. The disadvantage of this solution is that it's harder to configure than Dex and requires an additional database for storing all this configuration, the user claims, and all the custom profile changes. So even conceptually it's just bigger to deploy; it's a bigger solution with a bigger resource footprint than Dex. Next, backup and restore. For me personally, backup and restore is either the most important topic in security or the second most important. This is because a lot of people take backup and restore for granted: many solutions provide this kind of functionality, and a lot of companies never test the restore part. As long as the backup works, it is tested once at the beginning, or not even that, and that's all. Yet the backup and restore toolkit is probably the most important part of your production-ready deployment, because you cannot rely on the assumption that what you created is so great it will survive any kind of disaster. The underlying infrastructure, like the AWS or Azure cloud, is so resilient that it doesn't fail often, but there is still a small chance that it may fail. And if the underlying infrastructure fails and you have no backup, you'd better have a very good disaster recovery plan to recover from that. If you have a backup, recovery is just as easy as restoring the backup.
The bigger problem is how to configure backup and restore correctly, especially if you're running an on-premises Kubernetes deployment or using a less widely used and supported cloud. For example, some managed offerings like EKS use volumes from the cloud provider for persistent volumes, and those can be configured to be backed up automatically by the provider, but you might still need to copy the data somewhere else in case something happens to that provider. Velero is a solution which in most cases is independent of the provider, because it has support for all the common clouds: AWS, Azure, Google Cloud, anything you can think of like that. It also has a tool built in called Restic, which allows you to make an image of a persistent volume whose snapshots are not supported directly by Velero. So if you have a provider, say OpenStack, which might not be supported by Velero, you can use Restic to make a backup of the volume, just a plain bit-for-bit backup, but still a backup, and have a working backup and restore strategy that way. The disadvantages are that it doesn't have a user interface, which is probably not the biggest disadvantage in the world, and that the backup metadata is stored without versioning. So you may be able to break the mechanism just by removing or altering a file by mistake, which is not really recommended, so let's say: make a backup of your backup tool. But there is also an alternative, which is not an external tool but a part of Kubernetes itself. It's called VolumeSnapshot. VolumeSnapshot is a new resource in Kubernetes: it was introduced recently and it's supported now, but it's really new, and it requires not just a recent Kubernetes version but also a CSI driver version that supports it for your infrastructure.
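Before going deeper into volume snapshots, here is a sketch of what a daily scheduled Velero backup with Restic volume backups can look like. The resource name, namespace, and retention period are assumptions, and the Restic flag shown matches Velero releases from roughly the 1.5 era:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production-backup   # illustrative name
  namespace: velero
spec:
  schedule: "0 2 * * *"           # cron: every day at 02:00
  template:
    includedNamespaces:
    - production                  # assumption: workloads live in this namespace
    ttl: 720h0m0s                 # keep each backup for 30 days
    defaultVolumesToRestic: true  # back up pod volumes with Restic, bit for bit
```

The matching restore is a separate Restore resource, or `velero restore create --from-backup <name>` from the CLI, and that restore path is exactly the part worth rehearsing regularly, not just the backup.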
So the infrastructure provider has to implement support for VolumeSnapshot in the CSI driver for that infrastructure, for that cloud provider, and only then will VolumeSnapshot work. But if you are able to run volume snapshots, it's a great solution, because it's native: it's supported natively by Kubernetes, it's easy to configure, you can configure the backups as part of your cluster configuration, and it's natively supported by the CSI driver, so there is no external tool like Velero to monitor to make sure it works. If it works, it's controlled by Kubernetes. To finish the security part, there is one more tool I wanted to talk about, and this tool is Open Policy Agent. What is Open Policy Agent? It is a tool which extends the role-based access control abilities. With role-based access control you can configure, for a specific user or group of users, what they can do, and anything you don't configure is denied. So it's a kind of whitelisting way of configuring security, and it operates only on the Kubernetes resources a role covers: you can say a user is able to create a deployment, create a pod, or delete or update an ingress, but you cannot say a user may only create a deployment which contains a single pod. That is not possible with RBAC, but it is possible with Open Policy Agent and the Gatekeeper tool that works with it. What Open Policy Agent, or OPA, provides is a component which can evaluate policies, written in the Rego language, against JSON documents. You can configure Open Policy Agent, or Gatekeeper, as an admission webhook in Kubernetes, and then each change to the cluster is sent to OPA to be verified and either accepted or denied.
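As a sketch of how such a policy can look with Gatekeeper, here is a constraint template modeled on the community's allowed-repositories example, plus a constraint applying it. The template name, enforced namespace, and registry prefix are illustrative placeholders:

```yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
      validation:
        openAPIV3Schema:
          properties:
            repos:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sallowedrepos

      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        # the container image must start with one of the allowed registry prefixes
        satisfied := [ok | repo = input.parameters.repos[_]; ok = startswith(container.image, repo)]
        not any(satisfied)
        msg := sprintf("image %v does not come from an allowed registry", [container.image])
      }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: images-from-internal-registry
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    namespaces: ["production"]    # assumption: enforce only in this namespace
  parameters:
    repos:
    - "artifactory.example.com/"  # placeholder internal registry prefix
```

With this in place, the admission webhook rejects any pod in the matched namespace whose image does not start with the listed prefix.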
So for example, you can create a policy in the Open Policy Agent service, using the Rego language, which says you're only allowed to create a deployment or a pod using our own container registry. Say you have Artifactory installed locally in your network and you don't want people to use containers coming from the internet, because the internet connection is restricted. There is always the possibility of using a firewall to limit this kind of behavior, but with Open Policy Agent you can simply say that the only allowed container registry is our Artifactory, using a regex against the image name, and this works great. It also allows you to limit this behavior to a specific namespace or set of namespaces. For example, you might want to have a space for experiments which requires access to a different repository, an external one or just a different one. Then you can have the policy check the namespace too, or you can just annotate the namespace to make sure it's not handled by Open Policy Agent. So for all the more sophisticated policies, for more sophisticated security, OPA is a great tool to expand what you can do with RBAC and to make sure the actual behavior of people is more controlled than what you can achieve with RBAC and firewalls alone. So that's all. I hope you have learned something about those tools, and you have seen that while open source may not initially seem like the greatest solution, it contains everything you need to create a working, reliable, and secure production-ready Kubernetes cluster. That's it. Thank you, and have a good day or a good night. Bye.