Hello there, and thank you so much for joining us here at our KubeCon talk, where we will share the learnings of growing up fast on K3s. My name is Anaïs Urlichs, and I'm a Site Reliability Engineer at Civo. Before joining Civo, I worked for several years as a developer advocate, first in the blockchain space, and then transitioned into DevOps. I also started a challenge called 100 Days of Kubernetes, where we aim to learn something new related to Kubernetes across 100 days. If you're curious, you can find more resources on my Twitter, linked on the slide.

Hi there, I'm Alex Jones, a Principal Engineer here at Civo. I'm also a technical lead for the CNCF's Technical Advisory Group for App Delivery. I've worked at companies such as Microsoft, BSkyB, JPMorgan, American Express, and many more, and you can always reach me at alexjonesax on Twitter if you want to chat.

So what is Civo? What are we actually basing the experience we share in this talk on? Civo is a managed Kubernetes provider, meaning you can spin up Kubernetes clusters in 90 seconds or less. We are highly community focused and community driven: a lot of our offering is based on community input, and our platform provides several different ways for community feedback and community-created resources to contribute and help us grow the platform. Additionally, Civo is based on K3s, a separate Kubernetes distribution that enables a lot of the features we will talk about throughout this presentation.

So why is there a need for yet another managed Kubernetes provider? Well, first of all, the market is growing. More and more people want to use Kubernetes, either for their personal projects or for their company's needs, and they want to be able to use it as quickly as possible, with little hassle and without having to read any books on how to actually do it. Secondly, Civo is based on K3s, and there is not yet a managed Kubernetes provider based on K3s; we will dive into more detail on what K3s actually is in a second. And lastly, there is a need for a cloud native Kubernetes service provider. Civo is cloud native first. A lot of the CNCF projects that you will hear about throughout this KubeCon are integrated with Civo: you can spin them up alongside a Kubernetes cluster, so you don't have to dig through all the documentation yourself, and you're up and running quickly with the tools and platforms you want to use on top of your cluster.

So what is K3s? K3s is a lightweight Kubernetes distribution. Upstream Kubernetes has a lot of provider-specific components added to it; those have been removed from K3s. You only need about 512 MB of RAM to run it on a server, meaning you can run K3s on a Raspberry Pi and other small devices. It's extremely lightweight, and it comes packaged as a single binary, so you can install it as such. And the best part is that it comes with core technologies such as CNI and CSI and so on built in, so you don't have to install them yourself on your Kubernetes cluster; a quick sketch of checking those bundled components follows below.
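K3s's bundling is easy to see for yourself: on a fresh node, everything the distribution ships with lands in the kube-system namespace. Below is a minimal sketch, assuming the official Python kubernetes client (`pip install kubernetes`) and the default K3s kubeconfig path `/etc/rancher/k3s/k3s.yaml`; the component names it checks (CoreDNS, Traefik, the local-path provisioner) are stock K3s defaults, so treat the list as illustrative rather than authoritative.

```python
# Sketch: confirm a fresh K3s node boots with its bundled components running.
# Assumes `pip install kubernetes` and a default K3s install, which writes its
# kubeconfig to /etc/rancher/k3s/k3s.yaml (run with privileges to read it,
# or point config_file at a copy).
from kubernetes import client, config

config.load_kube_config(config_file="/etc/rancher/k3s/k3s.yaml")
v1 = client.CoreV1Api()

# Everything K3s bundles (DNS, ingress, storage provisioner, ...) lands in
# kube-system, so listing that namespace shows what we got "for free".
pods = v1.list_namespaced_pod(namespace="kube-system")
for pod in pods.items:
    print(f"{pod.metadata.name:55s} {pod.status.phase}")

# Illustrative spot-check of a few stock K3s defaults.
expected = ["coredns", "traefik", "local-path-provisioner"]
names = [p.metadata.name for p in pods.items]
for component in expected:
    found = any(n.startswith(component) for n in names)
    print(f"{component}: {'present' if found else 'missing'}")
```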
So beyond these specs, why would we want to use K3s over upstream Kubernetes? Well, first of all, K3s has reduced installation complexity: because it comes as a single binary, it's really easy to install and spin up, and you can try it out yourself on most machines. Then it also has higher cluster density per compute node, meaning we can spin up more Kubernetes clusters on the same hardware, because it requires fewer pre-allocated resources upon installation. It also has a simplified backup and recovery system and, as mentioned earlier, faster launch times overall: we are able to get launches below 90 seconds.

So now that we've learned about the value and benefits of K3s, let's connect it back to the infrastructure we have at Civo. You can see here on this image one of our super clusters with the compute nodes within. Our super clusters, our infrastructure, are Open Compute compliant. On the next slide, you can see a diagram, an illustration, of those super clusters. In each super cluster we have 32 compute nodes, of which four are the control plane. Our infrastructure is hyperconverged, meaning all of those compute nodes have the same storage and the same specs and can all run workloads; you can schedule workloads, you can schedule clusters, on any of those compute nodes. The best part about our infrastructure is that it's zero touch, meaning you can roll one of these super clusters into pretty much any data center and just plug it in and go.

So let's have a look at how the tenant K3s clusters actually look within our super cluster at Civo. First and foremost, we have the K3s process, but we need to run that in a safe way. We chose to use libvirt, which, together with KubeVirt, allows us to run a virtual machine within a pod. This virtual machine as a pod is set up with several others to form a K3s cluster. So how does that actually work? Well, fundamentally, when you start up a virtual machine, it gets loaded from a base image, and then libvirt on the host super cluster node creates a VM that is attached through a VirtualMachineInstance. This is extremely powerful because, with hot-plug technology, we can attach PVCs and allow storage to be provided to that K3s tenant cluster. And with that, we can reach 90 seconds or less for our cluster launch times: we load up the base image, we spin up the virtual machine, we run cloud-init, load the dependencies, and spin up our K3s Kubernetes service.

Let's talk about networking for tenant K3s clusters. As you can see from the diagram, we have a namespace for tenants, and in that namespace they can have multiple clusters. What they will have, however, is a single router pod. This router pod is an L3 network bridge, and it has two NICs: one on the private tenant network and one on the public network. We have the ability to control ingress and egress traffic with firewall rules and iptables exclusions. This means that as you resize your clusters and recycle nodes, we can dynamically update those rules so your traffic is always ingressing to the right place, and equally, we can see how much utilization is being required by a tenant at any one time. A toy sketch of that kind of rule regeneration follows below.
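To make that dynamic firewalling concrete, here is a toy sketch of how a reconcile loop might regenerate a router pod's NAT rules as a tenant's node IPs change. Everything in it, the `TENANT-INGRESS` chain name, the `render_nat_rules` helper, the sample addresses, is a hypothetical illustration rather than Civo's actual implementation; a real router pod would apply the rules (for example via iptables-restore) instead of printing them.

```python
# Toy sketch: regenerate a tenant router pod's ingress NAT rules whenever the
# tenant's clusters are resized or nodes are recycled. All chain names, IPs,
# and ports are hypothetical.
from dataclasses import dataclass

@dataclass
class TenantCluster:
    name: str
    node_ips: list[str]        # private-network IPs of the cluster's K3s nodes
    exposed_ports: list[int]   # TCP ports the tenant exposes publicly

def render_nat_rules(public_ip: str, clusters: list[TenantCluster]) -> list[str]:
    """Render nat-table ingress rules for one tenant's router pod."""
    rules = ["-F TENANT-INGRESS"]  # rebuild the hypothetical chain from scratch
    for cluster in clusters:
        for port in cluster.exposed_ports:
            # DNAT each publicly exposed port to the cluster's current first
            # node; recycled nodes drop out of node_ips on the next reconcile,
            # so the rendered rules always point at live nodes. (A real
            # implementation would also balance across node_ips, e.g. with
            # iptables' statistic match, and meter per-tenant traffic.)
            rules.append(
                f"-A TENANT-INGRESS -d {public_ip} -p tcp --dport {port} "
                f"-j DNAT --to-destination {cluster.node_ips[0]}:{port}"
            )
    return rules

if __name__ == "__main__":
    tenant = [TenantCluster("demo-k3s", ["10.0.1.11", "10.0.1.12"], [80, 443])]
    for rule in render_nat_rules("185.0.0.10", tenant):
        print(rule)
```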
Because of the way that we've designed our K3s tenant clusters, with virtualization using libvirt, we can have multiple clusters within the same super cluster tenant namespace. This is an exciting proposition because it keeps things simple. On top of that, because of the lightweight footprint of K3s, we can pack more K3s clusters in per physical compute node. That, combined with the fact that every single node pool from a tenant cluster is distributed across our physical infrastructure, means that reliability and resilience are built in.

So, having described the technologies that Civo has built and some of the challenges around virtualization and creating tenant clusters within our super cluster, it's important to reflect on what we learned in our beta period once our MVP was pushed out the door. It's been really, really insightful to learn from the community-led feedback, and also from some of the things that we discovered along the way that we didn't know when we went into this.

Firstly, deployment maturity. When we got things started, it was a mixture of a bunch of different scripts. We had Terraform and Ansible, we had distributed deployment processes, some of which were manual, and we had very, very little control over how fine-grained we could be with our deployments. Everything was at a regional level: if you had something you had to distribute to London 1, the entire region would be updated through Terraform or through Ansible. This also made it really constrictive for multiple team members to work on the same code base; even with infrastructure as code, you have to be extremely careful with how you position that, and it was extremely difficult to roll back.

So what did we learn from this? Well, we moved to a workspace-based model and to DRY configuration with Terraform and Terragrunt, built on the idea that you can apply micro-changes: every single region is composed of modules, and you can update those modules independently of the work of other team members. This went hand in hand with a large-scale change to gated pipelines, so that we deploy through our test regions and then to production, all the while performing linting and testing along that process. This has now matured to the point where we have automation to build clusters, to check that they work, and to then continue on that journey of continuous deployment.

The next part was observability maturity. When we went into this, we were effectively flying blind. We had some raw Prometheus configuration and some very basic Prometheus alerts, in the sense that they were node-exporter alerts and default Kubernetes API alerts. We didn't really know how the Civo stack was doing, and we didn't really know what our customers were doing. We couldn't answer basic questions like: is a customer cluster working? Are the tenant nodes working? Are customers facing loads of problems inside their clusters? There was a huge manual effort in having to jump in and look at each Kubernetes cluster.

However, we've now moved in the direction of generating and collecting the three different signals. We have aggregated signal collection and the ability to have a holistic view over all of our telemetry across our regions. This takes the shape of Prometheus with distribution into Thanos; of Loki collecting our logs, from the syslog level all the way up to the tenant cluster level; and of Jaeger tracing to look at how our own operators perform. Are they hanging? Are there particular function calls that are becoming unoptimized? As we build out this observability control plane, we're also starting to mature our model of what kind of data is meaningful to us. We mentioned that our cluster launch times are an extremely important part of our proposition, and so we have a big dashboard that displays how we're performing across regions. How does this particular containerd or storage layer change impact our ability to perform and our ability to deliver? A sketch of the kind of launch-time query that could drive such a dashboard follows below.
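As an illustration of the kind of launch-time telemetry that could feed such a dashboard, the sketch below asks Prometheus (or a Thanos query frontend, which speaks the same HTTP API) for the 95th-percentile launch duration per region. The `/api/v1/query` endpoint and the `histogram_quantile` function are standard Prometheus; the metric name `civo_cluster_launch_duration_seconds` and the URL are hypothetical stand-ins.

```python
# Sketch: pull a per-region p95 cluster launch time from Prometheus or Thanos.
# The metric name and URL are hypothetical; /api/v1/query and
# histogram_quantile() are standard Prometheus. Requires `pip install requests`.
import requests

PROM_URL = "http://thanos-query.example.internal:9090"  # hypothetical endpoint

# p95 launch duration over the last hour, grouped by region, assuming launches
# are recorded as a histogram named civo_cluster_launch_duration_seconds.
QUERY = (
    "histogram_quantile(0.95, sum by (region, le) "
    "(rate(civo_cluster_launch_duration_seconds_bucket[1h])))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    region = series["metric"].get("region", "unknown")
    seconds = float(series["value"][1])
    flag = "OK" if seconds < 90 else "OVER TARGET"  # the 90-second goal from the talk
    print(f"{region}: p95 launch {seconds:.1f}s [{flag}]")
```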
I mentioned also that because we've had this distributed approach, where we now have federation of our observability, we have much more of a holistic viewpoint. The most recent thing we've moved towards is Thanos, to enable us to see how our clusters are performing over time, and we're now starting to build out forecasting telemetry so that we can detect problems before they happen.

The final part of this key set of maturity initiatives was to change the way our SRE culture performs as a group. And I say change, but really, when many small companies go into this, there is that very organic culture that usually has a few very intuitive but particularly autonomous individuals who just get things done. The problem with that kind of culture is that a lot of that tribal knowledge is lost; a lot of those behaviors are specific to the individual, and it's very difficult to transfer them to a new team member in an effective and reproducible way. This was self-evident in the way incidents were handled early on. We didn't really have a process of post-mortems and learning from those incidents, of really trying to embrace the SRE mindset of spending 50% of our time developing, solutionizing, and improving those systems.

So we're now moving towards the stance of treating SRE culture as being as important as our development or engineering culture, and having the rituals that matter. Let's cherry-pick the things that make sense to us as a company. We've started moving in the direction of having more rituals around: okay, let's look at the backlog, let's decide how we distribute the work. Let's drop the rituals that don't make sense, such as picking up arbitrary items on the board that won't reduce our toil. And let's look at the things our active feedback mechanisms are telling us; that early-warning system I mentioned earlier has been a good indicator that we have some toil or some tech debt in a certain area. Is there a particular type of activity that customers are participating in that we can't really see? Do we have very poor granularity on which node sizes are most successful for the customers we're building for?

We should also facilitate the ability for our development teams to iterate faster and faster. As they push operators out, version by version, we should have the ability, with our traces, our logs, and our metrics, to tell them whether they are compliant within their error budget, or whether we need to put the brakes on and think about whether we're going to introduce risk into our production platforms. A toy version of that error-budget check follows below.
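To show what that error-budget gate might look like in practice, here is a toy calculation of the kind of check our signals could feed: given an SLO target and the failures observed in a rolling window, how much of the budget has a new operator release burned? The SLO, the window size, and the 50% release-gate threshold are illustrative assumptions, not Civo's published numbers.

```python
# Toy error-budget check: keep shipping, or put the brakes on?
# The SLO target, window, and gate threshold are illustrative assumptions.

SLO_TARGET = 0.999           # e.g. 99.9% of operator reconciles must succeed
WINDOW_TOTAL = 1_000_000     # reconciles observed in the rolling window
WINDOW_FAILED = 620          # failures observed in the same window

def budget_burned(target: float, total: int, failed: int) -> float:
    """Fraction of the error budget consumed (1.0 means budget exhausted)."""
    allowed_failures = (1.0 - target) * total
    return failed / allowed_failures

burn = budget_burned(SLO_TARGET, WINDOW_TOTAL, WINDOW_FAILED)
print(f"Error budget consumed: {burn:.0%}")  # 620 of 1000 allowed -> 62%

# Illustrative policy: gate risky rollouts once half the budget is gone.
if burn > 0.5:
    print("Put the brakes on: freeze risky rollouts, invest in reliability.")
else:
    print("Within budget: keep iterating.")
```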
So we've tried to explain to you in this talk some of the reasons we really think there's huge potential in the market for technologies like K3s and providers like Civo that are community driven. We've spoken about the hardware that we use, some of the technical challenges that we had to solve, and some of the interesting things that we've managed to accomplish. On top of that, we've tried to be very open with the learnings about the things that went really well versus the things that were a little bit tougher to achieve. We're not sure what the future holds, but we're really happy that we picked K3s as our core technology, because we know that we can iterate on it really fast and sidestep the more traditional Kubernetes challenges that you face when you're trying to scale up really quickly. This also means that we can stick to our promise of sub-90-second launch times, with sub-60 seconds becoming tantalizingly close. So, from Civo's perspective, you're going to see a lot more interaction with the community and a lot more excitement in terms of where K3s is going as an independent distribution of Kubernetes. We hope you enjoyed this talk. Thank you so much for listening. We will be available for Q&A now, and you can also reach out to either Alex or myself on our Twitter handles, as well as to Civo Cloud.