Hello, and welcome to this tech demo. My name is Joshua Moody. I'm a staff software engineer at Rancher Labs, which is now part of SUSE, and I work on the Longhorn project. Longhorn is a distributed block storage engine built for and on Kubernetes. In today's tech demo, I'm going to show you why you would want to use Longhorn's persistent storage for your edge deployments, to make sure that your data is safe.

Let's start with a quick overview of how Longhorn works. Conceptually, Longhorn consists of two components: the control plane, which we can see on this slide, and the data plane, which we can see on the following slide. The main component of Longhorn's control plane is the Longhorn Manager. It's responsible for orchestrating all volumes, creating backups, initiating restores, and so on. We also have two auxiliary components, the Longhorn CSI plugin and the Longhorn UI; both of these interface with the manager via the Longhorn API. The Longhorn CSI plugin acts as a bridge between Kubernetes and the Longhorn Manager. This way, Kubernetes can request volume provisioning, attachment, snapshot creation, and so on.

The Longhorn data plane consists of two parts: the Longhorn Engine, which you can think of as the volume controller, responsible for processing all IO requests and for communicating with the replicas, which you can think of as the volume's data. On this slide, you can see two workloads, each with its own volume. Each volume has exactly one engine and, in this example, two replicas. The number of replicas per volume is configurable, and each volume can have a different count depending on your application workload's requirements. You can also see that the replicas were scheduled onto different nodes, so that if one of your nodes fails, your data is still safe.

If you look on the left, you can see my messy test setup. If you want to replicate it, you will need four Raspberry Pis; I used the 4 GB version. You also need four SD cards of 128 GB or larger to provide storage for the operating system as well as the cluster. Next, install a 64-bit ARM operating system; in my case, I used Ubuntu Server. If you want to use RWX volumes, you need to ensure an NFSv4 client is available. On Ubuntu you can do this by installing the nfs-common package; other operating systems have equivalent packages. Once your Pis are prepared, install K3s on each of them to form a cluster. At this point, you can optionally add that cluster to your existing Rancher installation.

Let's start with the tech demo. We begin by logging into Rancher. Here you can see my existing clusters; for this tech demo, we're going to use the K3s edge cluster. Let's have a look at the nodes: these are my four Raspberry Pis. You can verify that your cluster is working correctly by checking the system information. Now that we've verified the cluster is working correctly, let's install Longhorn onto it. We can use Rancher to deploy the Longhorn app from the catalog. Now that Longhorn is up and running, we can have a look at the Longhorn UI.
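As a quick aside: when you provision volumes through Kubernetes rather than the Longhorn UI, the per-volume replica count mentioned earlier is typically set through a StorageClass. Here is a minimal sketch, assuming a Longhorn release that ships the CSI driver; the class name is just an example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-two-replicas     # example name
provisioner: driver.longhorn.io   # the Longhorn CSI provisioner
parameters:
  numberOfReplicas: "2"           # replicas per volume, configurable per class
  staleReplicaTimeout: "30"       # minutes before a failed replica is cleaned up
```

Any PVC that references this class gets a Longhorn volume with two replicas; a second class with a different numberOfReplicas covers workloads with different durability requirements.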
For this cluster, we want to configure a specific setting that allows Longhorn to delete pods on failed nodes. By default, Kubernetes will not delete pods on failed nodes, since that is the responsibility of the kubelet. But if the node is down, the kubelet isn't running, so it cannot delete the pod from the API server, which means the pod is stuck, and any volume used by that pod is stuck as well. So in Longhorn, we provide options for deleting StatefulSet pods, Deployment pods, or both. In this case, we're going to choose both. Besides pod deletion, we have an earlier, less aggressive implementation that only applies to Deployments: the volume attachment recovery policy, which leaves the failed pod on the API server and only cleans up the Kubernetes VolumeAttachment object. Since on this cluster we enable pod deletion for both StatefulSet and Deployment pods, we don't need that, so we can disable the feature.

Now that we have configured our settings, let's go back to Rancher to deploy our application workloads. Let's start with the data aggregator workload. You can deploy it by applying the YAML file with kubectl or by using Rancher's import YAML feature; for this example, we're going to use Rancher's import YAML feature. The data aggregator consists of a service, a persistent volume claim, and a StatefulSet. In this case, we only need a single instance of the data aggregator; if you needed multiple instances, you would use a persistent volume claim template as part of the StatefulSet specification. The only interesting aspect of this data aggregator configuration is that we set the termination grace period to ten seconds and override the default tolerations that Kubernetes sets, to allow for faster failover. By default, Kubernetes sets these tolerations to five minutes, which means a pod will not be evicted from a failed node for five minutes. You can also configure these defaults via the kube-apiserver command line or during the K3s installation.
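For reference, here is a sketch of how the failover-related parts of such a StatefulSet could look. The names and image are hypothetical, and the toleration values are illustrative; only the ten-second termination grace period and the idea of overriding the default tolerations come from the demo:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: data-aggregator                   # hypothetical name
spec:
  serviceName: data-aggregator
  replicas: 1                             # a single instance is enough here
  selector:
    matchLabels:
      app: data-aggregator
  template:
    metadata:
      labels:
        app: data-aggregator
    spec:
      terminationGracePeriodSeconds: 10   # shut down quickly on eviction
      tolerations:                        # override the 300s Kubernetes defaults
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 10
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 10
      containers:
      - name: aggregator
        image: example/data-aggregator:latest  # hypothetical image
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-aggregator-pvc       # the Longhorn-backed PVC
```

With the default five-minute tolerations, a pod on a failed node sits in Unknown status for five minutes before eviction even begins; shortening tolerationSeconds is what makes the failover later in the demo fast.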
Let's wait for the data aggregator to come up. We can see that the volume has already been provisioned, and the data aggregator is now up and running. Let's continue by deploying our data collector workload. The data collector is a stateless DaemonSet that runs on each of the Raspberry Pi worker nodes and is responsible for collecting sensor data and transmitting it to the data aggregator service, where it can be processed and stored.

Now that both of our application workloads are deployed, let's take a minute to discuss why one would want to use persistent storage for the data aggregator service. With local storage, should the worker node where the data aggregator is deployed fail for whatever reason, you would lose the data stored up to that point. With Longhorn, since we use three replicas by default with anti-affinity, each replica is scheduled on a different worker node, so we can simply reschedule the data aggregator service onto a different worker node and continue processing the previously stored data without any loss.

Let's show this in action. We start by identifying the worker node where the data aggregator is currently deployed; then we ssh into that node and turn it off. Let's wait for the node to be marked as failed. The time required for the node to be marked as failed can be configured by tuning the node monitor grace period as well as the kubelet's node status update frequency; you can do this during the K3s installation. Now that the worker node has been marked as failed, let's have a look at our deployment.

Currently the data aggregator is still in Unknown status because we're still within the toleration seconds. Once the data aggregator gets marked for eviction, Longhorn will force-delete the pod, which allows it to be rescheduled onto a different worker node. Now that the data aggregator is up and running on K3s worker 3, let's have a look at the Longhorn volume. We can see that its status is now Degraded; this is because we have a failed replica on worker node 1. If our cluster were bigger, Longhorn would now rebuild another replica on a different worker node. But since we only have three worker nodes in this cluster and replica anti-affinity is hard, we remain in Degraded status with only two replicas.

Let's recap. As you have seen, a single Raspberry Pi failure did not lead to any data loss. Besides that, you have also seen that Longhorn cleaned up the workload pod so that Kubernetes could reschedule it onto a different worker node. Had you used local storage, you would have lost your data, and your data aggregator would have been stuck.

Let's make this even better. By setting up a recurring backup schedule for the volume, we can ensure that even if the whole cluster dies, we will not lose any data, because we can restore from our backup on a different cluster. In the background, you may notice that I have replaced the faulty Pi, which allowed Longhorn to rebuild one of the volume replicas and transition the volume back into a healthy state. Let's continue by setting up our backup target. Longhorn supports any S3-compatible endpoint as well as an NFSv4 server. For this demo, we're going to use Amazon S3. We set the S3 endpoint as well as the previously created S3 secret. We can now set up the recurring backup schedule for this volume. For the schedule, we want to take a backup daily, and we want to retain seven backups so that we have backups for the whole week.
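In the demo this schedule is configured through the Longhorn UI. For reference, on Longhorn v1.2 and later the same daily, retain-seven schedule can also be declared as a RecurringJob resource; a minimal sketch, with an assumed job name and the default volume group:

```yaml
apiVersion: longhorn.io/v1beta1
kind: RecurringJob
metadata:
  name: daily-backup     # assumed name
  namespace: longhorn-system
spec:
  task: backup           # take a backup (as opposed to a snapshot)
  cron: "0 0 * * *"      # once a day, at midnight
  retain: 7              # keep a week's worth of backups
  concurrency: 1         # how many volumes to back up in parallel
  groups:
  - default              # applies to volumes in the default group
  labels:
    interval: daily      # custom label attached to each backup
```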
To start off, we're going to kick off a manual backup, and we're going to add a custom label. You can use labels to categorize different backups and differentiate between them. We can see the backup is in the process of being created. Let's take a look at our backup store. You can see that I share this backup store with other clusters, so there are already backups for other volumes. Let's look for our volume and the backup that we created. We can manually restore this backup to a new volume inside this cluster; had the original volume been lost, we could also recreate it this way. As you can see, the volume is currently not ready for workloads because it's still being restored. We can watch the restore progress, and once it has completed, the volume will be available for workloads.

Let's also demonstrate setting up a DR (disaster recovery) volume for one of the volumes that is not currently present in this cluster. You can see this is a different volume that doesn't exist in this cluster yet. The volume information is recorded in the backup store, so we can use the same name, size, and replica count that were used on the other cluster. You can see this marker, which shows that this is a DR volume. A DR volume continuously monitors the backup store and always restores the latest backup of its volume. If the other cluster fails, we can activate this volume, and it will become available for workloads.

This concludes this tech demo. If you want to learn more about Longhorn and its upcoming features, come join us on the Longhorn CNCF Slack channel.