Thank you for coming to this KubeCon talk on how we used vclusters, Crossplane, and Argo CD for multi-tenancy. I am Kostis Kapelonis, developer advocate at Codefresh, and this is Ilya, developer at Codefresh. The problem we will talk about today is how to host Argo CD at scale. Codefresh is the company behind a CI/CD and GitOps platform; if you want to learn more about Codefresh, we have a booth here, but today we will talk about a very specific problem we wanted to solve. Customers can go to Codefresh's website, sign up, and a few minutes later get their own hosted Argo CD. We don't know in advance how many customers we will have, and everything has to be automated: we don't want to create a cluster by hand, we don't want a human in the loop, we want to automate everything. That is the problem we wanted to solve: offering GitOps for everyone, fully automated. So what are the possible solutions?

If you are familiar with Argo CD, Argo CD on its own has an installation mode where you install it on a namespace. You split the installation into two parts: first you install only the CRDs on the parent cluster, and then for each namespace you install an Argo CD instance that works only on that namespace. After you do that, the customers come in and they can connect their own clusters to their own Argo CD instance. So essentially each customer has a namespace on a shared cluster. Now, this could work in theory. It's very easy for us because we have a single cluster; it's centralized. It's also resource efficient: we can use all the autoscaling methods that we have for Kubernetes. Creating a namespace is super fast, so it's very easy for customers to get what they want right away. But it's not secure by default. There is no isolation between namespaces; we would need to set up policies, set up quotas, and do something for isolation. You also have the usual problems with resource starvation: if somebody is doing something that gets stuck, the whole system will have issues. And, specifically for us, because we are installing Argo CD and we want each tenant to get Argo CD: Argo CD has its own CRDs, so we would have problems because we can only use one CRD instance for everybody. Essentially everybody is locked to the same cluster version and the same Argo CD version, which is maybe not the optimal way to do things.

Now, on the other end of the spectrum we could say, okay, let's give each customer their own cluster. As soon as somebody signs up, we launch a cluster just for them, we install an Argo CD instance on that cluster, and then we give them full access to it. So this model is a cluster per customer. On the one hand we get total isolation: everybody owns their own cluster and they can do whatever they want. We are also free to run different versions, so if somebody wants a different Kubernetes version or a different Argo CD version, we can do it, and we don't have any conflicts with Argo CD CRDs. As far as customers are concerned, it's perfect for them. But for us it's super expensive, creating a cluster for each customer. Remember, I said we don't know how many customers we have in advance. Creating a cluster is also super slow; for some cloud providers it's 15 or 20 minutes. And it would be difficult for us as well: we would have so many clusters and we would need an easy way to manage them.
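As a rough sketch, the namespace-scoped installation described above splits into a cluster-admin step and a per-tenant step (the paths follow the upstream argoproj/argo-cd manifests layout and may change between versions; `tenant-a` is an illustrative name):

```yaml
# Sketch of Argo CD's namespace-scoped installation mode.
#
# One-time, cluster-admin step: install the shared CRDs on the parent cluster:
#   kubectl apply -k "https://github.com/argoproj/argo-cd/manifests/crds?ref=stable"
#
# Per tenant: a namespace plus a namespace-scoped instance (Roles, not ClusterRoles):
#   kubectl create namespace tenant-a
#   kubectl apply -n tenant-a \
#     -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/namespace-install.yaml
```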
So these are the two choices that we had. Essentially you can see that nothing is perfect: each one has advantages and disadvantages, and somebody might say "cost is more interesting for me, so I would pick this one", or "scalability is more interesting for me, so I would pick that one". But there isn't a clear winner. Specifically in our case, Argo CD is not just an application but an application to which you connect target deployment clusters. So this has to be super secure, because if a tenant gets access to the Argo CD instance of another tenant, they get access not only to the Argo CD instance itself but also to all the clusters that somebody might have connected there. A security issue would compromise the clusters of the customers as well, not only our own infrastructure.

So we said maybe there is a third option, and this is when we discovered vcluster. vcluster, if you're not familiar with it, is an open source project. There is a website at vcluster.com, and the source code is on GitHub. It's managed by Loft Labs, and essentially it gives you the capability to deploy a cluster within a cluster: clusters all the way down, clusterception. It's fully Kubernetes compliant, and we're going to talk about this. In this model we have a root cluster like before, but in each namespace we deploy another Kubernetes cluster, a real one. You have many choices: you can install K3s, K0s, and I think even EKS. It's a real Kubernetes cluster that passes all the compatibility tests like any other Kubernetes cluster, with its own API, fully compatible with standard Kubernetes. Then we install an Argo CD instance there, and Argo CD thinks it's in a real cluster, but it's not. Each Argo CD instance gets its own CRDs, so it's a standard installation, not the namespaced one. From that point onwards the process is the same: customers come in and they connect their own clusters.

So this third choice is a bit better, because we get the best of both worlds. We have good isolation; I'm not saying perfect, and we will see why. Each customer thinks they have full access to a cluster. It's cost effective for us, because we still have one parent cluster and we can do autoscaling and make sure that everybody has the resources they need. There is no problem with CRDs or conflicts anymore. It's very easy to share resources if we want. Creating a virtual cluster is faster than creating a real cluster; maybe not as fast as creating a namespace, but still really fast. And it's very easy for us to give capabilities to customers if they want a different Argo CD version or even a different Kubernetes version: it's possible that the parent cluster is, let's say, 1.24, while the virtual cluster inside has a different version. There are some issues: you need to do some hardening, so it's not perfect, and you still have a single point of failure. If the parent cluster breaks for some reason, you lose all the child clusters. So that's the theory, and now Ilya will talk about the implementation.

Okay, hello. So we'll get a little bit technical now. Let's look at the solution architecture from above a little bit. We have our host cluster. Inside our host cluster we deploy a namespace for each customer. Inside this namespace we deploy a vcluster, and onto the vcluster we deploy the virtual Argo CD.
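As a hedged sketch of that per-tenant vcluster piece, deployed with the official Helm chart (value names differ between vcluster chart versions, and `tenant-a` and the image tag are illustrative, not taken from the talk):

```yaml
# values.yaml for one tenant's vcluster (sketch; check the vcluster docs
# for the exact keys in your chart version).
vcluster:
  image: rancher/k3s:v1.26.4-k3s1   # the tenant's Kubernetes version,
                                    # independent of the host cluster's
sync:
  ingresses:
    enabled: true                   # example: sync an extra resource type
# Installed into the tenant's namespace on the host cluster, roughly:
#   helm upgrade --install tenant-a vcluster \
#     --repo https://charts.loft.sh --namespace tenant-a --create-namespace \
#     --values values.yaml
```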
The customer then connects their own target clusters to this Argo CD, and the Argo CD syncs the resources onto their clusters. This is how the architecture looks in general. Now let's dive into the components.

Let's explore some vcluster concepts: how vclusters operate and how they are deployed. vclusters are deployed onto the host cluster like any other Kubernetes manifest, like any other workload. You can use plain manifests or their official Helm chart, and vclusters are entirely namespace-scoped, so you don't need cluster-admin privileges to deploy a vcluster. The way they operate is that high-level resources are entirely virtual: they exist only in the realm of the vcluster, in its API. The low-level resources that are absolutely necessary to execute workloads, like pods, and the secrets and config maps mapped into the pods, exist on the host cluster; they are synced there by the syncer, which we will see in the next slide. So this is how vclusters operate: the vcluster pod has two containers. One is the vcluster itself, which serves the Kubernetes API, and the other is the syncer container, which is responsible for syncing API objects from the vcluster onto the host cluster. By default it syncs only the absolutely necessary resources (secrets, config maps, pods, and so on, what I said before), but you can also configure it to sync any other resource you might want, for example Ingresses, if you want vcluster workloads to be accessible from outside.

Like any SaaS solution, and especially ours, because we don't know how many customers we will have or how many of those vclusters we will need to provision, we need to worry about scaling and proper automation. To deploy a single instance of our virtual Argo CD we need to deploy two things: the Helm chart for the vcluster, and then, onto this vcluster, the manifests for the virtual Argo CD. But the challenge here is that the vcluster has its own kube API, so it's not just like deploying any other two Helm releases side by side. Basically you need to deploy the vcluster, get the kubeconfig from the vcluster, and then deploy the Argo CD onto the vcluster. If we look at this in a different way, the vcluster is actually a piece of infrastructure: it's as if I deployed an EKS cluster in AWS or in Azure, whatever, and wanted to deploy some workloads onto it automatically. So this takes us to the realm of infrastructure-provisioning tools, and here comes Crossplane. We were looking for a Kubernetes-native solution to deploy all this, and what Crossplane does is utilize Kubernetes to serve as a general-purpose control loop. That means you can use Kubernetes to manage any type of resource: as long as there is an API that some controller can access to create the resource, you can manage it with Crossplane. So you can manage any non-Kubernetes resource, even pizza orders, as we will see in one of the next slides; you get one control plane to rule them all. So how does Crossplane provision infrastructure? This is a very simplified way of describing Crossplane, but it uses providers and resources. Resources are represented using Kubernetes CRDs and describe the resources that we want to create.
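Reconstructed as manifests, those pieces might look roughly like this (a sketch; the package name, API groups, and values are illustrative assumptions based on the community AWS provider, not taken from the talk):

```yaml
# Sketch: the Provider (controller), ProviderConfig (credentials), and a
# managed resource (the VPC). Package version and values are illustrative.
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/crossplane-contrib/provider-aws:v0.39.0
---
apiVersion: aws.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds        # secret holding AWS credentials
      key: creds
---
apiVersion: ec2.aws.crossplane.io/v1beta1
kind: VPC
metadata:
  name: demo-vpc
spec:
  forProvider:
    region: eu-west-1
    cidrBlock: 10.0.0.0/16
    tags:
      - key: owner
        value: platform-team
  providerConfigRef:
    name: default
```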
For example, if we want to create an AWS VPC, we will have a spec that would contain, for example, CIDR blocks. Providers are the actual Kubernetes controllers that are responsible for creating those resources through the third-party APIs; for AWS we would use the AWS provider. Provider configs are the configurations that define how the provider should create the resources; for example, they handle authentication and would reference a secret that authenticates me against AWS. So here is a simple example of using Crossplane to provision infrastructure. On the left-hand side we have the Provider, which defines that we are using the AWS provider; the ProviderConfig defines the credentials we would use to access the AWS API; and the VPC is the actual resource that we are going to create. You can see that under its spec it has a region, a CIDR block, tags, and so on.

One of the most powerful features of Crossplane is the possibility to create compositions and composite resources. What that means is that we can create our own CRD that utilizes multiple providers, defining our own kind of resource. For example, if "my cluster" needs to include Argo CD in it, and I want to provision it on Google Cloud, I would use the Google Cloud provider to provision the GKE cluster and, for example, the Helm provider to deploy Argo CD onto it, and I would call this "my cluster". Whoever wants to create one of those creates a resource claim, and they will be able to deploy such a cluster. There is such an example in the Crossplane repo; you can access it later, and the presentation is of course uploaded to the CNCF website. And if you want to learn how to order a pizza from Domino's with Crossplane, you can scan this QR code; it will take you to a really nice blog post. The guy basically created a controller that accesses the Domino's API, and by accident he made a mistake and ordered half a dozen pizzas.

So let's see how it all comes together for our solution in Codefresh. When a new hosted Argo CD is deployed, our platform commits a resource claim of the type "hosted runtime" (which is what we call our virtual Argo CD) into a git repository. There is an Argo CD on the host cluster that syncs those resources onto the host cluster, and there we have the Crossplane composition, which utilizes mainly one provider, the Helm provider. It deploys the vcluster Helm release and then creates a ProviderConfig (you can also use Crossplane compositions to create Crossplane resources); the ProviderConfig holds the kubeconfig for the vcluster, and we use that to deploy our Argo CD onto this vcluster.

So how does the end-user experience look? A user goes into the UI, clicks "install hosted runtime", and sees this nice progress bar; everybody loves those. Once this is done, they see that the hosted runtime is active, they can see its components, and once their target clusters are connected they can create applications and sync resources to their target clusters. So what are the benefits? For the users, they get a one-click Argo CD installation with zero configuration and zero trouble, flexibility in managing multiple Argo CD versions and our commercial versions, a friendly management UI, enterprise-grade features, and all of that. And how does it benefit us? We have a centralized setup, and it's very cost effective: the cluster grows and shrinks. If customers join, it grows; if we shut down runtimes, they shrink back.
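To picture that flow, here is a hedged sketch of a tenant claim and the Helm-provider resources a composition like this could produce. The `HostedRuntime` kind, its API group, and all names are invented for illustration; only the `helm.crossplane.io` resource shapes follow the public Crossplane Helm provider, and the `vc-customer-1`/`config` secret reference follows vcluster's documented kubeconfig secret:

```yaml
# Hypothetical tenant claim; the composition expands it into the resources below.
apiVersion: platform.example.com/v1alpha1   # invented API group
kind: HostedRuntime
metadata:
  name: customer-1
spec:
  argoCDVersion: "2.6"
---
# 1. The vcluster Helm release, installed on the HOST cluster...
apiVersion: helm.crossplane.io/v1beta1
kind: Release
metadata:
  name: customer-1-vcluster
spec:
  forProvider:
    chart:
      name: vcluster
      repository: https://charts.loft.sh
    namespace: customer-1
  providerConfigRef:
    name: host-cluster          # Helm ProviderConfig for the host itself
---
# 2. ...a ProviderConfig pointing the Helm provider at the new vcluster,
# via the kubeconfig secret (vc-<name>, key "config") that vcluster writes:
apiVersion: helm.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: customer-1-vcluster
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: customer-1
      name: vc-customer-1
      key: config
---
# 3. ...and the Argo CD release, deployed INTO the vcluster through it.
apiVersion: helm.crossplane.io/v1beta1
kind: Release
metadata:
  name: customer-1-argocd
spec:
  forProvider:
    chart:
      name: argo-cd
      repository: https://argoproj.github.io/argo-helm
    namespace: argocd
  providerConfigRef:
    name: customer-1-vcluster
```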
We get security isolation out of the box: we don't need to worry about isolating pods so they don't access other namespaces. And it allows us to have different combinations of Kubernetes and Argo CD versions, so we can test out, for example, new versions of Argo CD without affecting other customers.

And how do we monitor all this? Because pods provisioned by vclusters are available on the host cluster, you can basically use your favorite tools that you already use to manage your Kubernetes workloads. We use the Prometheus and Grafana stacks, so everything is scrapable: we can scrape the /metrics endpoints and we get all the metrics that we would get as if everything were running without vclusters. In addition, we built our own proprietary exporter to monitor runtime health from the platform side: some business metrics, whether a runtime is synced, how many of them we have, and so on. So let's see how it looks. This is a Grafana dashboard that represents the health of a single hosted runtime. At the bottom we can see Loki, and the dashboard panel from Loki shows the logs from all the components that we have inside; we can see the pods, all their logs, their statuses, whatever you would do for pods. And this is the dashboard from the proprietary exporter: you can see that at the point in time when we took the screenshot we had 64 active runtimes, all synced and all healthy.

OK, so now it's time to show a little demo, and because Upbound are sponsoring the Wi-Fi for this event, I'm going to yolo it and do it live, so let's hope. (They own Crossplane, by the way.) The demo is inside this repository; you can access it by scanning the QR code or clicking the link, and there is a readme with instructions on how you can run this demo yourself in a few easy steps. So let's do it. In the demo I'm going to show, behind the scenes, how it looks when we provision and deprovision hosted Argo CDs. Here we see the repo; let's go over it a little bit. We have three folders: Argo CD applications, Crossplane resources, and virtual Argo CDs. Let's start from Crossplane resources. Here we have all the resources that are required for our Crossplane setup: the Helm provider and the Kubernetes provider (the latter is used for a small thing, so that the namespace gets deployed and removed), and the composition for the virtual Argo CD, which is the most interesting part. It consists of the definition of how the CRD looks, and the composition itself. In the composition we define a list of resources: one is the vcluster release, then we have the object that deletes the namespace and the ProviderConfig for Helm on the vcluster, and of course the last one is the Argo CD on the vcluster. So this is the Crossplane resources part. The other folder is the virtual Argo CDs: this is where we hold the resource claims for those compositions, so basically this is where we hold our customer Argo CDs; every instance that gets created is created here. At this point in time we have customer 1, which is currently deployed and which we will see in a moment, and customer 2, which is commented out. The Argo CD applications folder contains the manifests for the Argo CD applications; we have three, and I will show them in Argo CD. The first one is Crossplane, which is basically just the official Helm release for Crossplane; Crossplane Resources syncs the folder with the manifests that we saw earlier; and Virtual Argo CDs syncs the other folder.
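One of those three Application manifests might look roughly like this (a sketch; the repository URL and destination namespace are placeholders, not the demo repo's real values):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: virtual-argocds
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/vcluster-argocd-demo   # placeholder
    targetRevision: HEAD
    path: virtual-argocds          # the folder holding one claim per customer
  destination:
    server: https://kubernetes.default.svc   # the host cluster itself
    namespace: default                       # illustrative
  syncPolicy:
    automated:
      prune: true    # removing a claim file deprovisions that runtime
```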
Let's look at the virtual Argo CDs. At this moment it looks like this: we have one customer, one virtual Argo CD, which expands into three resources, the ProviderConfig and two Helm releases. Let's add another customer and see what's going to happen. We go back, go into the virtual Argo CDs folder, uncomment this, commit, go to Argo CD, refresh it, and we should immediately see what happens inside the cluster. We can see that a new namespace was deployed for customer 2, and inside we can see the vcluster starting to spin up: this is the vcluster pod, and vcluster also includes a CoreDNS deployment for name resolution. Once that starts, the composition will know that the vcluster is deployed and it can start deploying the Argo CD. In a moment we should see Argo CD start: there, we start to see the pods for Argo CD being deployed. And if we check for deployments in this namespace, even though we know for sure that Argo CD has deployments, it's all empty. Why is it empty? Because deployments are high-level resources. If we exec into the vcluster pod and do `kubectl -n argocd get deployments`, we can see they're all here, inside the vcluster realm. Now let's try to remove our customer and see what happens. Go here, comment it out, commit, sync Argo CD again. We can see it disappeared, and if we look at the namespace, everything is getting removed; in a second the namespace will also disappear. OK, so provisioning and deprovisioning work, and the Wi-Fi hasn't failed us.

So here we have some resources for you if you want to learn more: you can go to vcluster.com by Loft and read the vcluster documentation there, and to crossplane.io if you want to learn more about Crossplane. On our website we have a training for an Argo CD certification that you can go and have a look at. If you want to give us feedback, we would really appreciate that, and if you have any questions, we're happy to answer now; you can also find us at the Codefresh booth later.

Yes? Hi. So, since everything is deployed — the vcluster is deployed using Helm charts, and you deploy that using Crossplane — how do you make sure someone doesn't accidentally delete all the vclusters and all the Argo CDs in them? Do you have any protections around that, or testing?

So we use the GitOps model, of course, because the only way you can add or remove those is... let me repeat the question, because they encourage that for the recording; otherwise people might not hear it. The question was how we make sure that no one accidentally deletes the entire thing, because we have one composition that controls everything: if you remove all of those compositions, the cluster becomes empty and we lose all of our vclusters and all of our hosted Argo CDs. In the GitOps model it's basically modeled with git permissions: the code owners of the repository are the bot user that's used by the platform, and ourselves. And no one, not even our devs, has cluster-admin on this cluster, the hosted-runtimes cluster; they have no reason to be there. So this is how we make sure of that.

Yes, can you pass the mic? One question about the vcluster storage: where are the resources for the high-level objects, deployments and anything like that, currently stored? You showed that you have access inside the vcluster pod, but what happens if I restart the pod?

OK, so vclusters are actually StatefulSets; they have PVCs, and if you restart the pod, the Kubernetes API reads back from the PVC and continues from the same spot, so you don't lose anything; it's not stateless in that sense. They also have the possibility to use an external Postgres database, and you can then scale it out.
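To recap the demo's distinction between virtual and synced resources in command form (a sketch; the namespace and pod names are illustrative — a vcluster pod is typically the `-0` pod of a StatefulSet named after the Helm release):

```yaml
# Recap of the demo, as commands (sketch; names are illustrative):
#   kubectl get deployments -n customer-2        # empty: Deployments stay virtual
#   kubectl get pods -n customer-2               # pods ARE synced to the host
#   kubectl exec -n customer-2 -it customer-2-0 -- \
#     kubectl -n argocd get deployments          # present inside the vcluster
```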
So, under the GitOps model, where you basically store everything: if I press the little button in your platform, you basically just create another YAML file in your big repository and everything gets deployed? Is that the case, or is there any separation?

That's exactly the case. When you click the button, our platform — the bot user inside the platform, one of the microservices — just commits this YAML file; that's it. And the repository is really locked down, so nobody can tamper with it. OK, cool, thank you.

One question: you mentioned that you hardened the vclusters; could you give some more details about what you had to do? So, just to make sure, the question was what additional hardening we use so that, say, one customer cannot access another customer. The hardening is basically that we make sure inter-namespace network calls are impossible — so network policies, plus quotas for resource isolation. And because each vcluster has its own API, there is no chance you would be able to read, for example, a secret for a target cluster of another tenant. Thank you.

Just a question: how did you solve the single point of failure caused by the root cluster? (I didn't hear. OK.) What happens if the root cluster has a problem? So the question was how we deal with the single point of failure, which is the host cluster in this case. Everything is fully GitOps, so basically everything is stored in git: our platform as well as the virtual Argo CDs. Even if the cluster gets deleted, we have DR and we can spin it up on another cluster. We would have some downtime — there will be some downtime, yeah — but it's a question of balancing risk against how much effort you want to put into mitigating that risk.

Do you really have just one vcluster for each customer, or do you have vcluster replication, in the sense that on a failure of one vcluster you can switch to another vcluster instance or replica? Yeah, the question is whether we have one vcluster per customer: each customer has their own vcluster, and vclusters can scale out. You can use an external database, so it has more than one pod; if a pod restarts, everything survives, basically like any other Kubernetes workload.

Did you solve the problem with the noisy neighbors that you referenced in the single-cluster approach? The question was how we solved the noisy-neighbor issue. So vclusters have isolation modes and they are completely resource-isolated, and because each one is a different kube API, the neighbors don't really affect each other. If, for example, we deploy the CRDs for Argo CD in one vcluster, we can deploy a completely different set of CRDs with the same names in another vcluster, and it wouldn't disrupt anything. The noisy-neighbor problem would really be prominent if we had plain namespaces for each customer, where cluster-wide resources would conflict, and stuff like that. There are some scenarios that maybe we don't cover right now, but it's the same answer as before: it's a question of how much effort we're going to put in versus the risk we're going to avoid. So if somebody does something strange, or we have malicious actors, maybe they will do something.
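The inter-namespace isolation mentioned in the hardening answer can be pictured as a deny-by-default NetworkPolicy per tenant namespace (a sketch of the idea, not the exact policy used):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: customer-1     # one such policy per tenant namespace
spec:
  podSelector: {}           # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # only pods from this same namespace may connect
```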
But do you have any problems with the hardware part? With vclusters you're still sharing the nodes, right? So how do you maintain that isolation between the vclusters? So the question is how we manage the case where one customer runs on just one node: if you lose the node, all the workloads are lost. We use pod topology spread constraints on the pods; those are also synced to the host cluster, and you can spread the pods out across different nodes. So we spread those out, one customer doesn't get affected by a single node failure, and everything remains up. Thank you.

So the question is: you said you could deploy multiple Kubernetes versions as vclusters; is there a compatibility issue when you deploy something that is deprecated in between? For example, PSPs in the host cluster while the downstream cluster has the Pod Security admission controller — is there some compatibility issue there? The question was whether you have deprecation issues, for example if you deploy something on the vcluster that's deprecated on the host cluster. You might get those if you sync some resources, but usually what you sync is pretty basic resources, like pods and secrets and config maps, which have been mature features forever. But in theory, yes, the deprecations can affect you: if, for example, you use Ingress v1beta1 and the underlying host cluster is Kubernetes 1.25, which doesn't support it anymore, you could get that, of course. vcluster is a fairly new project on its own, and maybe this question is better answered by the vcluster people, but essentially their recommendation is that you might have some issues, and they suggest staying within one version forward or backward: if the parent is x, you should have a child that is x plus one or maybe x minus one, but not x plus three or x minus three, I think.

So you use XRDs in Crossplane to manage all this; how do you test these XRDs? The question is how we test compositions, the so-called XRDs. Basically you can create another version of the composition, and we do this a lot, because we have our own hosted virtual Argo CD deployment that we use internally. We deploy a new version of the composition, switch our own claim to use the new version, test it on ourselves, run it in our production for a couple of days, and then roll it out to users. We also have end-to-end tests for all of the components; we use Cypress, like any other end-to-end testing. Any more questions?

Thank you for the talk. How do you manage the lifecycle of the virtual Argo CD inside the vcluster? (Did you hear the question? Sorry.) How do you manage the upgrade, when you want to upgrade the Argo CD inside a vcluster? So, the question is how we manage upgrades of Argo CD, am I correct? Basically by creating a new composition version with the new Argo CD in the resources: we just bump the Argo CD version, again test the composition, apply it on ourselves, and roll it out to customers if we need to. Really easy upgrades, basically. Any more questions?

Does the upgrade of a vcluster have some downtime? The vcluster's kube API, you mean? (Again, louder please.) Does upgrading a vcluster's kube API cause some downtime to that kube API? Well, we don't normally upgrade it; we do upgrade if we want to bump the version, but these are rolling upgrades: you update one pod, then another. You might get a short downtime. It's Kubernetes all the way down — think of it that way, Kubernetes inside Kubernetes, standard Kubernetes tools. You might get a short downtime, but if the vcluster's API is down for, like, 30 seconds, because the pods all run on the host cluster, you wouldn't feel it. A problem could arise with some CRDs and stuff like that, but for short periods of time it really doesn't affect customers; you don't feel it.
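Going back to the node-failure answer for a moment: pod topology spread constraints look roughly like this inside a pod spec (a sketch; the label selector is illustrative):

```yaml
# Fragment of a pod spec (sketch): spread one tenant's synced pods across
# nodes so a single node failure does not take the whole tenant down.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname      # one failure domain per node
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        tenant: customer-1                   # illustrative label
```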
Thank you for the great talk. I was wondering, have you considered using Cluster API, and if yes, why did you choose Crossplane rather than Cluster API? The question was why we didn't choose Cluster API. As far as I know, Cluster API is good for deploying Kubernetes clusters on providers; I don't think there was an option to do something like deploying vclusters with that tooling, and we really needed the vcluster part — it was essential here. I think Crossplane, when we did the evaluation, was a superset of what Cluster API was offering. I think there is vcluster support for Cluster API today (there was a presentation from Adobe, I think), but we started maybe one year ago, and maybe this didn't exist yet.

Thank you for the talk. We sometimes have a hard time troubleshooting and debugging problems with Crossplane compositions, especially when they are not ready or healthy; we especially struggle with sub-resources of the compositions. You mentioned that you created your own exporter for Prometheus — do you monitor those Crossplane compositions with it? How do you manage monitoring, and what do you do to debug those? So the question is about monitoring. Basically, the proprietary exporter just exposes business metrics; it reads metrics from our platform, it doesn't access the vclusters. Even the vclusters we monitor with Prometheus, and we haven't had any issues with them, really; we didn't see it as necessary to monitor the vcluster itself, as it has always been stable and we had no such issues. But in theory you can also deploy collectors inside the vcluster if you want to, and have them report somewhere externally. I was thinking that if we need to, I'll add an OpenTelemetry collector inside as one of the components, and we will send the metrics to some external Prometheus.
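That last idea — a collector running inside the vcluster and shipping metrics out — could look roughly like this as a Prometheus agent configuration (a sketch; the remote endpoint is a placeholder):

```yaml
# Sketch: a Prometheus agent inside a vcluster, forwarding metrics to an
# external Prometheus (URL is a placeholder).
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: argocd
    kubernetes_sd_configs:
      - role: pod          # discovers pods via the vcluster's own API
remote_write:
  - url: https://prometheus.example.com/api/v1/write
```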