Greetings! It is our pleasure to meet you at this KubeCon 2020 virtual event. My name is Huamin Chen; I work in the Office of the CTO at Red Hat. And my name is Marcin Franczyk; I work as a software engineer at Kubermatic. This talk is organized as follows. First, we go through the KubeVirt project. KubeVirt is a CNCF project that manages virtual machines the same way Kubernetes manages containers. Then we introduce the Gardener project. Gardener is an open source project that manages the Kubernetes cluster lifecycle and offers day-two operations. It runs on a number of infrastructure platforms, including KubeVirt. We explain how KubeVirt features can help Gardener deploy high-performance and highly secure Kubernetes clusters. We focus mostly on networking and storage features: how to use Multus CNI for high-performance and fully isolated networking configurations, and how to use data volumes and clones to accelerate virtual machine deployment at scale.

KubeVirt joined the CNCF sandbox over a year ago, in 2019. It enables workloads that run inside VMs to be deployed on the same Kubernetes cluster where containers are running. It uses Kubernetes native objects, such as pods, persistent volume claims, and resource requests. It provides a convenient way to describe virtual machine configurations and their states. In this example, a virtual machine consisting of a virtual network interface and virtual disks is described in the devices API of the virtual machine spec. Specifically, the interfaces API describes how the network interfaces are configured and which networks they are attached to, whether that is the Kubernetes pod network or additional networks defined in network attachment definitions. The virtual disks are described in the disks API. The information includes details such as the size of the disk and whether the disk is a data disk or a cloud-init disk. Notably, a disk can refer to an existing persistent volume claim. In addition to the virtual machine APIs, KubeVirt also provides APIs to manage virtual machine templates, data volumes, and virtual machine states. KubeVirt can be conveniently installed on OpenShift and Kubernetes as an operator that is available on OperatorHub.

Gardener is an open source project that delivers fully managed Kubernetes clusters at scale. In simple words, imagine that you can create new Kubernetes instances as pods. We can call it Kubernetes as a service. Let's take a closer look at the architecture. Gardener runs on Kubernetes. We can split the diagram you see into three separate clusters. The first square on the very left side is the garden cluster, where all the core components run, like the Gardener API server, Gardener controller manager, and Gardener scheduler. The second one, in the middle, is the seed cluster. As you remember, I said Gardener is capable of creating Kubernetes clusters as pods; thus, the role of the seed is to run the control planes of freshly created Kubernetes instances. Each seed must have a running component called gardenlet. As you can observe, there is an analogy to the Kubernetes core components: gardenlet to kubelet, and the Gardener API server, controller manager, and scheduler to the Kubernetes ones. Last but not least, the shoot cluster, on the right side, where all the worker nodes run. The cluster itself must be created on one of the cloud providers listed below. Each provider has a Gardener extension that knows how to start a new worker node.
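To make the devices description above concrete, here is a minimal sketch of a KubeVirt VirtualMachine manifest of the kind discussed in the talk; the names (example-vm, example-root-pvc), the memory size, and the cloud-init contents are hypothetical, and the exact fields may differ by KubeVirt version.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm                    # hypothetical name
spec:
  running: true
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 1Gi
        devices:
          interfaces:
          - name: default
            masquerade: {}            # interface bound to the Kubernetes pod network
          disks:
          - name: rootdisk            # data disk backed by an existing PVC
            disk:
              bus: virtio
          - name: cloudinitdisk       # cloud-init disk
            disk:
              bus: virtio
      networks:
      - name: default
        pod: {}                       # the default pod network
      volumes:
      - name: rootdisk
        persistentVolumeClaim:
          claimName: example-root-pvc # hypothetical existing persistent volume claim
      - name: cloudinitdisk
        cloudInitNoCloud:
          userData: |
            #cloud-config
            hostname: example-vm
```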
To summarize, a newly created Kubernetes instance is stretched between the seed, where the control plane is present, and the shoot, where the worker nodes are available. Communication between them takes place via load balancer services and a VPN connection. Recently, we added the KubeVirt extension, which allows customers to use any bare metal environment that supports Linux KVM and to start many Kubernetes clusters on premises. I must admit that I find this use case very interesting, as we run KubeVirt VMs as worker nodes. It brings some challenges, and part of them we will describe in the following slides.

Multus CNI is a meta-CNI that allows Kubernetes pods, and thus KubeVirt VMs, to attach to networks other than the default pod network. In order to use Multus, you have to install the necessary daemon sets and CRDs. In OpenShift, these are all pre-installed for you. In this example, we create a Linux bridge based, VLAN-enabled network attachment definition. There are multiple plugin types available with Multus. In this example, we use a bridge; in other examples, you may see macvlan and ipvlan. The bridge has to exist first. If you do not have it, you have to use some other automation mechanism to create the bridge on the nodes where you want the attachment to happen. The Linux bridge uses VLAN tag 1234. In some environments, especially when you are running on public clouds, you may not have access to native VLANs. In that case you can create a tunneling mechanism: on top of the tunneling device, say VXLAN, you create the bridge and then use VLANs on top of the tunneling device. The IP address management used in this example is Whereabouts, which allows you to manage IP addresses globally across the nodes. There are other types of IP address management, such as static, which only assigns specific IP addresses, or host-local, which is limited to the host where the VMs or pods are running.

KubeVirt uses Multus for a number of use cases. In this use case, we want to separate the networks, so the virtual machines can use the default pod network to access Kubernetes native services such as the API server, while using another network for data traffic. By separating the two networks, the VMs can ensure that the traffic from each network will not interfere with the other. For such a configuration, we first define a network attachment definition that describes how the network interface is constructed and how IP address management is done. The network attachment definition is then referenced in the virtual machine's networks API. On the right side of the YAML, the networks API supports a number of types. The pod type is the Kubernetes pod network. The multus type references the network attachment definition using the namespace/name format. It also supports a number of attachment mechanisms, such as bridge as well as SR-IOV. Once this YAML is applied and the virtual machine YAML is applied, in the background the virt-launcher pod will create the necessary networking and QEMU/KVM environment to configure the VM just as desired. Allowing the VM to access both the pod network and other networks is great for accessing all types of services provided by Kubernetes and the other networks. But in security-sensitive environments, you may not want the VMs to be exposed on the pod network, because anyone could ping or otherwise attack the VMs without proper authorization. Network policy is surely going to help, but it's not efficient.
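As a rough sketch of the kind of network attachment definition just described: a Linux bridge plugin with the VLAN tag from the talk and Whereabouts IPAM. The attachment name, namespace, bridge name, and IP range are illustrative placeholders, not values from the slides.

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: storage-net              # hypothetical name
  namespace: default             # hypothetical namespace
spec:
  # Bridge CNI plugin on a pre-existing bridge (br1), tagging traffic with VLAN 1234,
  # with Whereabouts managing IP addresses cluster-wide from the given range.
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "bridge",
      "bridge": "br1",
      "vlan": 1234,
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.100.0/24"
      }
    }
```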
So we have another configuration for full isolation, where the VM is completely disconnected from the pod network. If you look at the YAMLs and the configurations, the only difference is that in the virtual machine's networks API, the pod network is completely removed; only the Multus network stays. In this configuration, when the VM starts, it has no access to the pod network, so any network traffic from the Kubernetes pod network will not reach the VM. This ensures that the VM is fully isolated in its own network. It also saves an IP address on the pod network, so it's efficient too. We are going to see both configurations used in different use cases in Gardener.

With that information in mind, let's now look at some of the KubeVirt network configuration use cases. The first is the high-performance network configuration. As you know, virtual machines often work with cloud native storage, such as storage provisioned by Rook. Rook is another CNCF project that recently graduated. It's a storage operator that provisions Ceph, as well as Apache Cassandra. One of the recent features in Rook is that it allows a Ceph cluster to use separate networks defined by network attachment definitions, so that the front end, the public network that the Ceph clients interact with, has the best possible bandwidth, while the cluster network, where the OSDs communicate with each other to replicate and rebalance data, is separated from the public network. In this use case, the VMs and the Ceph public network reference the same network attachment definition, so that the traffic stays isolated on the same network. In this configuration, we can ensure that the traffic does not interfere with other networks and the performance is guaranteed.

Separately, for the full isolation configuration, the VMs are protected in their own VLANs. In this configuration, we use a Linux bridge as the connectivity mechanism in the network attachment definition, with the VLAN number embedded in the definition. Again, for environments that do not support native VLAN tags, you can build a tunneling overlay underneath using VXLAN and create a bridge device on top of the VXLAN. The VXLAN will emulate a switch trunk that tunnels all the VLANs between the different endpoints. So this is one of the configurations you can reference for network isolation.

Now, let's put our learnings into practice. We will use the Gardener and KubeVirt integration as a case study to show how we can use the two Multus CNI configurations to provide the best performance as well as the best isolation. As we said, the Gardener project provides Kubernetes as a service. It provisions Kubernetes clusters on a number of infrastructure platforms. It does so by abstracting the underlying resources, including the cloud environments, the infrastructure environments, the operating system, the network, worker node management, and so on. These abstractions allow Gardener to be extended programmatically and administratively to different platforms. The Kubernetes cluster with worker nodes created by Gardener is called a shoot. We have two shoot clusters in this example. The blue shoot cluster has an infrastructure config with two networks. One is the shared network. In this case, the shared network refers to the network attachment definition created by Rook for Ceph's public network.
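A minimal sketch of the fully isolated variant, shown as an excerpt from a VirtualMachine spec: note the absence of any pod network entry. It assumes the hypothetical storage-net network attachment definition from the earlier sketch.

```yaml
# Excerpt from a VirtualMachine spec: only the Multus-backed network is listed,
# so the VM never joins the Kubernetes pod network at all.
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
          - name: isolated
            bridge: {}                        # bridge binding to the secondary network
      networks:
      - name: isolated
        multus:
          networkName: default/storage-net    # namespace/name of the NetworkAttachmentDefinition
```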
The reference here, given as the Rook Ceph namespace and the network attachment definition name, refers to an existing network attachment definition. The VMs will use this network attachment definition to attach to the same network as Ceph. The tenant network, on the other hand, provides a set of information that will be used to create a network attachment definition on the fly. This is the cluster's own private network attachment definition that will not be used by other clusters. To add another layer of isolation, the network attachment definition also uses VLAN 1234 to ensure that the VMs are running in an isolated VLAN that is not accessible by VMs on a different attachment. The second shoot cluster, the green cluster, also refers to the same Ceph shared network for high-performance purposes. It creates its own tenant network by providing a different set of configuration for its network attachment definition. The VLAN configuration in this case is different: it's 2345. So by having two network attachment definitions and two VLANs, the green and blue networks are separated, physically and logically, as different Kubernetes clusters. This just shows how the Multus configuration used by KubeVirt can provide flexibility for different use cases.

Before I start talking about data volumes and clones, I must say a few words about the project behind them, the Containerized Data Importer. It's among my favorite add-ons in Kubernetes. The primary goal is to provide a declarative way to build virtual machine disks. However, you can use it as well to initialize Kubernetes volumes with data outside of the KubeVirt context. CDI includes a custom resource definition that provides a data volume object type. Those objects are an abstraction on top of persistent volume claims and are very helpful for data imports and uploads onto PVCs. This is a way to automate virtual machine disk management; without it, you would have to prepare a PVC that contains the disk image yourself. One of the good outcomes of the CDI integration with KubeVirt is that you can specify a concrete data volume that is tied to the virtual machine lifecycle. You can do it via a data volume template in the VM spec. When you delete the virtual machine, the data volume is destroyed too, together with the provisioned storage, so there is no requirement for the user to take care of the cleanup.

On the right side of this slide, we can see a sample data volume definition. There we have two interesting parts: source and pvc. The source determines whether data is supposed to be cloned from another persistent volume claim or downloaded from some address; in the latter case, the source would have an http option with a url value. Below, there is a pvc section. As the name indicates, it's about PVC settings, like what kind of storage class you want to use and how big the storage should be. Having all these details in place, the data volume creates a new persistent volume claim for our disk. The add-on makes our lives much easier. I encourage everybody to check it out if you haven't yet. I'm going to show the possible usage strategies in the next slide: virtual machine image management. Here we can see two diagrams that show two different strategies of data volume usage, ad hoc and pre-allocated data volumes. For now, let's focus on the first one, the ad hoc approach on the left side.
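A minimal sketch of such a data volume with an HTTP source, of the kind just described; the URL, storage class, name, and size are placeholders rather than values from the slides.

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: fedora-base                 # hypothetical name
spec:
  source:
    http:
      url: "https://example.com/images/fedora.qcow2"   # placeholder image URL
  pvc:
    storageClassName: standard      # placeholder storage class
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
```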
As I mentioned in the previous slide, we can download data directly from the internet by changing the source from a persistent volume claim to HTTP with an appropriate URL. Now consider the scenario where we tie the data volume to the virtual machine by using a data volume template in the spec. That means each time we create a new virtual machine, we download the disk again. This can cause many problems if you want to schedule hundreds or thousands of VMs. First, you can have performance issues: you might have a quite slow network connection, some rate limitation, or simply heavy network load. Let's say you must download a 4 GB disk for each VM and you run many of those. Second, the endpoint you are using might disappear, causing reliability issues. Somebody could say that instead of downloading stuff from the internet, you could create, let's say, a local HTTP server, maybe using nginx, and serve images from there. Yes, you could, but that doesn't fully solve all the problems I mentioned, and you can still cause network latency on your cluster.

Now, let's focus on the pre-allocated data volume, as it helps to get rid of the problems I spoke about. In the diagram on the right side, the pre-allocated data volume that is visible in the middle is a source for other data volumes. You download the disk image only once, and you use that volume as a source for the other virtual machines you want to bring to life. This is the strategy we follow in the Gardener KubeVirt extension implementation. We get rid of the problems I described previously and we reduce network latency. Data volumes support two types of cloning, and I'm going to describe them in the next slide.

We have two ways of data volume cloning: host-assisted and smart cloning. The left side shows a diagram of the traditional cloning of a persistent volume claim, which is host-assisted. It means there is a data stream from the source PVC to the target PVC; the entire volume is being copied. It is a traditional, heavyweight approach. It is not really necessary to waste storage space and put additional load on disks just to copy the whole image. On the right side, I show another strategy of volume cloning: a feature called smart clone, which is executed whenever possible. In order to improve the performance of the cloning process, the Containerized Data Importer team introduced smart cloning, which uses volume snapshots. Of course, to use such a feature, the CSI plug-in you use must support snapshots. The YAML structure of the data volume is still the same. CDI automatically checks if smart cloning is possible. If yes, then it creates a snapshot of the source PVC. Next, the snapshot is used to bring up a new persistent volume claim. Finally, the snapshot itself is deleted. Additionally, if you want to use the feature, the requirements are that the source PVC is in the same namespace as the target PVC, the source and target PVCs use the same storage class, and there must be a snapshot class associated with that storage class. Depending on the storage backend, you can really benefit from this, especially when snapshots are implemented as copy-on-write, which is often the case.

On the next slide, I'm going to show you in detail how we deal with disk and volume management in the Gardener KubeVirt extension. The shoot cluster that I was talking about at the beginning is defined as a custom resource object. You can see an example on the left side: the shoot YAML definition.
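As a sketch of the pre-allocated, clone-based approach just described, here is a data volume whose source is an existing PVC; the names are hypothetical. If the CSI driver supports snapshots and the same-namespace and same-storage-class requirements are met, CDI can perform a smart clone instead of a host-assisted copy.

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: worker-node-1-root          # hypothetical name for a cloned worker disk
spec:
  source:
    pvc:
      namespace: default            # must match the target namespace for smart cloning
      name: fedora-base             # the pre-allocated source volume from the earlier sketch
  pvc:
    storageClassName: standard      # must match the source's storage class for smart cloning
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
```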
As you can see, there is a worker section where users can define volumes. The first one, the volume object spec, defines the root disk of the virtual machine, and the second one, data volumes, defines additional blank data disks. The worker pool definition is processed by the worker controller in the KubeVirt extension. Based on that, an appropriate virtual machine specification is prepared. The size parameter is an obvious indication of the storage size, and the type determines which storage class should be used. Somebody could ask: why didn't you name the parameter storage class instead of type? We didn't because Gardener supports many cloud providers and this is the common API. Then, following the arrows, we can see that volumes are translated to data volume templates, so the data volume is closely tied to the virtual machine lifecycle. By default, the extension uses the pre-allocated data volume approach, which means that the first virtual machine created will download the image and the other virtual machines from the worker pool will clone the volume. We use blank data volumes to provide additional disks and storage space. These are completely blank disks, and it's up to the user to handle them. One way would be to run cloud-init and format the disk according to their needs, although I must admit that it is not yet possible to pass user cloud-init scripts to the extension. The reason is that we are using cloud-init to join nodes, and we still have to decide how we want to deal with the custom part of cloud-init. But for sure, at some point, this feature will land in the extension.

On this slide, I'm going to talk about highly isolated clusters and how to upload disk images there. It happens that customers demand highly isolated environments. Not so long ago at Kubermatic we had such a case: an environment where we could not pull KubeVirt images directly from the internet. There was only a bastion host from which we could access the cluster and expose a few services outside, but still, those services were only visible to the bastion host, and each pod couldn't reach the internet. Somehow we had to provide virtual machine images to the cluster. We could create a local image repository; however, with the Containerized Data Importer, it is quite easy to upload an image. We can benefit from that and prepare a data volume which will be a source for others. The upload is done over the CDI upload proxy. First, users have to expose the upload proxy service, and it must be accessible from outside the cluster. Next, it is required to create a data volume with the source set to upload, similar to the one on the slide. And finally, we must request an upload token, which can be done via the upload token request custom resource; the CRD for this object comes from the project itself. When all the mentioned steps are complete, it's time to upload an image. You can obtain the generated token from the status section of the CR. Then, you have to pass the token in an HTTP request with the disk image. There are two ways of uploading: synchronous and asynchronous. The synchronous connection can be closed unexpectedly because the conversion or resizing process can take some time. With the asynchronous approach, the connection will be closed as soon as the disk image has been transmitted; however, you have to check yourself whether the process of resizing and conversion has finished. For more details and examples, I encourage you to visit the Containerized Data Importer GitHub page.
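A rough sketch of the upload flow just described: an upload-source data volume plus the token request resource. The names are hypothetical, and the exact API versions may vary with the CDI release in use.

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: uploaded-image              # hypothetical name
spec:
  source:
    upload: {}                      # populated via the CDI upload proxy instead of http or pvc
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
---
apiVersion: upload.cdi.kubevirt.io/v1beta1
kind: UploadTokenRequest
metadata:
  name: uploaded-image-token        # hypothetical name
spec:
  pvcName: uploaded-image           # PVC created by the data volume above
```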
It's a very nice project which solves many issues regarding volume and disk management. This concludes our talk. We are excited to share with you the optimizations in KubeVirt using Multus and data volumes for high-performance and high-security configurations. There are even more features in KubeVirt related to storage and networking, and we look forward to sharing them with you at future opportunities.