Okay, welcome to all of you. We would like to tell you something about fearless automation and what we accomplish with Yaook on OpenStack. Let me quickly introduce us: we have Felix here from Schwarz IT, he's a cloud engineer; we have Jonas, who is a team lead of cloud development at Cloud&Heat; and I'm Marvin, a product owner at STACKIT. We would like to talk about what STACKIT is and what the Schwarz Group is, then show you some use cases for what we are using OpenStack for and the growth of our cloud platform, and then we talk about Yaook.

To start with the Schwarz Group: it's a company with multiple sub-companies, mostly famous for the retailer brands Lidl and Kaufland. We also have a recycling company called PreZero, and we have multiple production facilities, for example Bon Gelati, which makes ice cream, or MEG, which fills water bottles, for instance for Lidl. Within this group, Felix and I are part of Schwarz IT, which is, let's say, the heart of all of this, where all the IT solutions are developed and operated. Within Schwarz IT we created STACKIT as a sub-brand, and there we created the STACKIT Cloud, which we offer for the DACH market in Europe. Our target customers are mostly small and medium enterprises in the DACH market. For such a challenge, we think it's easier with a companion, and that's where Cloud&Heat comes in, which helped us to develop Yaook.

Now a few words about Cloud&Heat: it is a small company which started about ten years ago with the mission to provide a sustainable and holistic cloud stack, built upon innovative water cooling and converged infrastructure, using OpenStack as the software stack, which is why we're here after all. We have recently also expanded into the managed Kubernetes market, and we have a partnership, or rather
a joint venture with secunet, which is SecuStack, a hardened version of OpenStack; there are also some people from that project here. And we're supporting Schwarz in running their OpenStack cluster and developing Yaook.

Okay, then I will continue a little bit and talk about our use cases. On the one hand we're using the OpenStack environment that we have inside the Schwarz Group a lot, and on the other hand a little bit for external customers. The nature of providing a cloud service is that we definitely don't know all of our customers at all, because from an infrastructure perspective they are all just VMs. But we have two big users that I want to highlight a little bit.

We have a use case named App Cloud, based on Cloud Foundry, where a lot of internal and external apps are running. Cloud Foundry basically allows you to ship applications using some predefined buildpacks, so you don't need to care about Dockerfiles, deployments or things like that: you just push your code and the system handles the rest. This consists of two environments in total, with around 3,000 virtual machines, 6,500 vCPUs and 20 terabytes of memory, basically amounting to 25 to 35 percent of our whole environment, depending on whether you count instances or vCPUs.

If we take a look at the other use case, it's the Schwarz Kubernetes Engine. There we don't provide one big Kubernetes cluster, but we provide our users with individual Kubernetes clusters for each use case. So you can go there, say "I need a Kubernetes cluster", specify what kind of workers you want, and get it. In total we are running 250 clusters there, with 2,500 instances, 10,000 vCPUs and 25 terabytes of memory, and with that they are definitely our largest user if you count based on vCPUs. Over the last year,
we had a lot of growth there: we grew from 13,000 vCPUs to 24,000, mostly driven by the Schwarz Kubernetes Engine, and we also increased the memory footprint that we gave to our users by 60 percent. And yes, you can see that our OpenStack Prometheus exporter died in the middle of the graph.

Now the question is: how do we run all this? We do that using Yaook, and I'll hand over to you for that.

Yeah, for this part we want to start with a little bit of motivation, because when you come up with a new OpenStack lifecycle management these days, you need to have a really good reason for it, and we hope we have that. We made three core decisions here.

The first one was that we wanted to run OpenStack on Kubernetes, so to have Kubernetes as a layer below. Firstly because declarative things are really nice: they allow you to define how your infrastructure should be, and then there is a thing which ensures that the infrastructure actually is how you want it to be. In addition, this gives you some real goodies which are useful in the OpenStack world, for instance having replicas of your APIs and being able to scale them up easily. That usually doesn't come with a, let's say, classical lifecycle management like Ansible: it's not easy to say "I want ten replicas of the Keystone API running on this node", or "I want five". That's generally not that easy.
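The declarative scaling and health checking described here can be illustrated with a plain Kubernetes Deployment (a generic sketch with made-up names and a placeholder image, not an actual Yaook manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keystone-api          # hypothetical name
spec:
  replicas: 10                # "I want ten replicas" is a one-line change
  selector:
    matchLabels:
      app: keystone-api
  template:
    metadata:
      labels:
        app: keystone-api
    spec:
      containers:
      - name: keystone
        image: example.org/keystone:yoga   # placeholder image
        ports:
        - containerPort: 5000
        livenessProbe:        # failing pods get restarted and taken
          httpGet:            # out of the Service rotation automatically
            path: /v3
            port: 5000
```

Kubernetes continuously reconciles the actual state against this declaration, which is exactly the property the talk argues classical tooling lacks.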
Classical lifecycle tools are not that flexible, and Kubernetes is much more flexible in this regard: you get the replicas, you get networking, and you also get liveness probes, which are of course really important for running a reliable service, so that things which are not working get recycled and taken out of rotation, out of the load balancer, and API requests are not failing all the time, or failing because of a temporary outage.

As a bonus, when you run stuff on Kubernetes, it's easier to port it to other platforms, and you get easier testing, because all you need is a Kubernetes cluster, and you can get that for instance from a Gardener instance or from a Cluster API provider. You can just set it up and then you have that cluster for testing, which improves development drastically. In our classical lifecycle management that was the major problem: you always had to get physical hardware and go through the installation procedure, which is really daunting. And finally, you also get observability from Kubernetes, because you have all the status information and it continuously checks the status of services, so we can monitor things more easily without having to write specific checks for each service.

So that is how Yaook was born: it is yet another OpenStack lifecycle management on top of Kubernetes, and you might get the drift where the name comes from by now. It is a complete tool with the holistic background that Cloud&Heat has: installing and upgrading from the bare metal on up, using OpenStack Ironic as the bare metal layer so to speak, then installing Kubernetes on top of that using Ansible, and then OpenStack on top of that.

How do we install OpenStack?
It's not with openstack-helm. This is the second decision, and drawing back to the title, fearless automation, automation being the keyword here: we took the idea from Kubernetes of having control loops which continuously check the state of the system, because in an unattended system the breakage can only increase. So we have these operators, one or more for each OpenStack service, which take the declared state of the service, like "I want a Keystone here with three API replicas and a database which is backed up regularly", and so on and so forth, and they check the state of the cluster against that. Is the Deployment (we are of course using Kubernetes Deployments) set to three replicas? Do we have the database? Is the database healthy? Are the users created in the database? And so forth. We check this regularly, and also based on watches, to orchestrate the entire rollout and to handle Kubernetes lifecycling like upgrades in the same fashion.

Taking Glance as an example: if you know Kubernetes, you know what this is. This is a custom resource where we define the Glance deployment, so to speak, and because we have operators, we have a separate resource type for each OpenStack service component, which is referred to by the kind. Then you put in all the specification of what makes up a Glance: how many replicas does my database have, how many replicas does my API have, the backend configuration, the policies you want to inject, your target release, which is obviously Yoga, what else would it be. Definitely no Queens Glances in production, I assure you, and no Pike either, or Rocky, or Train. None of those, it's all Yoga.
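To give an impression, a Glance custom resource of the kind described might look roughly like this (the field names here are illustrative assumptions, not the exact Yaook schema; the Yaook documentation has the real one):

```yaml
apiVersion: yaook.cloud/v1        # assumed API group
kind: GlanceDeployment            # one resource type per service, selected by kind
metadata:
  name: glance
spec:
  targetRelease: yoga             # definitely not Queens
  api:
    replicas: 3                   # API replica count
  database:
    replicas: 2                   # database replica count
  backends:
    ceph: {}                      # backend configuration goes here
  policy: {}                      # policies to inject
```

The operator watches resources of this kind and reconciles the cluster toward the declared spec.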
No none of those it's all yoga and And you can of course also inject arbitrary config to some extent I will get to that in the decision 3 And also self configuration and that is then fed into the operator which reads that and gets notified on updates So when you edit this it gets a watch from another an event from the Kubernetes API and then it ensures that your Specification is actually put in place in the cluster During all of this it goes through this fun graph, which I won't go into detail. Just it's just an illustration So for instance, we have the config generation step below here, which feeds which has some input So this is really a state machine so to speak or a dependency network rather So we have for instance the config thing down here, which of course depends on the keystone information It needs to know where keystone is to get keystone user credentials It needs to have its database to generate the database URLs which grants needs the user password and so on Which is it's just sent fat into a database sync job and to the into the meter depth loading of glance Which is also a separate job and only when that is done It will install or update the API deployment and then it will create the Kubernetes service and then from that we have the service monitors, which is the monitoring and so and so forth and the point of this is that this dependency network is Directed as a quick graph, which is nice because that means we can go through it in a defined order and have it Have this be our core of the reconcile loop and we don't have to write all of this by hand That would be crazy But instead we have a declarative approach within yahook itself in the code which allows us to write this in a more sensible manner and The loop then on every iteration checks where the things are which have divergences for instance When I change the configuration then this will mark as unready and the configuration will be applied And then in the next iteration it will note. 
"okay, the configuration has changed, I need to run the DB sync", and so on, until it eventually reaches the API deployment. That also means that if the configuration is invalid, it will never reach the API deployment.

What does invalid mean, though? This isn't free, of course: we introduced Kubernetes into the stack, we introduced operators, and that's a lot of moving parts. So we have to eliminate complexity somewhere else, otherwise this gets completely unmanageable, and configuration management is hard. For Glance it may be rather easy, because you generally only have this one Glance config and then you're happy. But for something like Neutron or Nova you might have node-specific configuration: on your gateway nodes, for instance, you need to inject more VLANs, which are provisioned via layer 2, to make your provider networks accessible. They all need to be configured, but not everything is node-specific. So the question is: how do you merge all of this together without being completely lost when you need to track down where an incorrect configuration is coming from?

For that we use CUE, which is a language for working with data, and we use it for working with configuration. CUE doesn't allow you to have two conflicting values.
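The unification idea can be sketched in a few lines of Python (this is an illustration of the principle only, not CUE itself and not Yaook code):

```python
def unify(a: dict, b: dict) -> dict:
    """Merge two config trees; raise on conflicting scalar values."""
    merged = dict(a)
    for key, value in b.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = unify(merged[key], value)  # recurse into subtrees
        elif merged[key] != value:
            # Two inputs set the same option to different values: refuse
            # to pick a winner, unlike "last one wins" config merging.
            raise ValueError(f"conflict at {key!r}: {merged[key]!r} != {value!r}")
    return merged

base = {"DEFAULT": {"debug": False}, "database": {"max_pool_size": 5}}
override = {"DEFAULT": {"debug": False}, "database": {"max_pool_size": 10}}
# unify(base, override) raises ValueError because max_pool_size differs
```

The key property is that a conflict halts the merge with an error pointing at the offending key, instead of silently letting one input win.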
It's that easy: if you have two inputs to the same configuration option and they say different things, then it will just error, the operator will halt and tell you that you need to fix it, and it won't hand anything down to any service before you do.

That's just one example of complexity elimination, but it's a really nice one, because I think this is one of the banes of other configuration management and lifecycle management tools: you have multiple inputs to the same thing, and if they differ, well, one of them wins, but it's not clear where the result came from. It might be clear which one wins when you know all of the inputs, but if you only look at the output, it's hard to track down. So an incorrect configuration is hard to track down. That isn't the case here, because only one value can be valid, and otherwise it will tell you there's a conflict.

Yeah, that was our road to Yaook. There's obviously a lot more to it; in the end, a slot is only that long, although
I think we do have some time left. So if you have any spontaneous ideas you want to talk about... Otherwise we'll let your questions guide us, that works too.

Maybe just a quick rundown first: in 2018, STACKIT was started with the OpenStack endeavor, then using a different platform I should probably not name. In 2020, support by Cloud&Heat was brought in and we started supporting them with operating the platform, and in mid-2020 we had the initial idea and concept of how we should do Yaook. Nearly a year after that it was open sourced, and in September 2021, I think, we pulled Glance into production, and that worked nicely, and then more and more stuff was pulled into production. We hit 100 compute nodes deployed in parallel the other day, which was nice. And yeah, you are asking whether it was nice, but yes, it mostly worked; there were some hitches along the road, but that's what you get when you run a thing for the first time.

We have a website for Yaook, it's all open source on github.com, that's the main deal, and we are now open for questions.
I think, yeah. For the questions there is a microphone; otherwise I will call on you and you can just ask your question, and I or any of us will repeat it so that it's on the stream too. Yes, you were first, I think.

Yeah, you're right about that. So the question was: how does the bare metal part of Yaook work, especially day two and storage? I will maybe go over day one first a little bit to get some context, and then we'll talk about day two. As I already hinted, we are using OpenStack Ironic, obviously. We have a small component called the metal controller, which essentially ties together Netbox, OpenStack Ironic and HashiCorp Vault: HashiCorp Vault for the IPMI credentials and some onboarding of the nodes, Netbox to know which IP addresses the nodes should get, and OpenStack Ironic for the obvious part, actually getting the machine to do what it's supposed to do. The node then gets a config drive which nowadays essentially contains the full Ansible repository that deploys Kubernetes, as well as some additional configuration, though not the join token, which is fetched from Vault, but you get the drift. That is run on the first boot using cloud-init and a run script, the node is joined into the Kubernetes cluster, and from there on it is mostly Kubernetes things.

Node upgrades are currently not fully defined within the Yaook project. I'm not sure what you are doing... We are doing it by, yeah, kind of running Ansible on top of the nodes. Okay, you do the same thing. So the Ansible playbooks exist for Kubernetes, and they can also be used to upgrade the nodes and to upgrade the Kubernetes clusters. There's not much automation in that regard; that's something which is being worked on, also to integrate and communicate with the operators, to let them know beforehand so that they can evacuate the workload when you're starting to
So this is also maybe one key thing about you that the operators are Communicating about things like this. So if you change the layer to configuration for instance It will generally proactively try to evacuate the node first if compute the workload is running on it because it might be disruptive And it tries to avoid that Regarding storage right that was the second part of the question and Brooke obviously We are not tied to theft though. They are running this with net up Yeah, just obviously completely different, but the ansible playbooks for the communities class I also cover Installing Rook if you want it optionally otherwise you have to integrate differently You can run it next to it doesn't matter just as long as it's reachable And basically what you saw as a configuration for theft is basically in the root format So if you use Rook you can directly link to it and the format just yeah So it expects kind of the Rook format, but you can use different stuff to generate the same thing Yes, you were next I think no behind you. Sorry Sorry, I didn't catch it. Can you try louder? Yeah, that's easy. That's or it was a fun story to begin with The question was basically how we handle the conflict but potentially between Kubernetes networking and OpenStack networking We in our environment were using IP tables firewalls rules previously and try that out and that didn't end that well the solution there is basically to switch to the open-v switch firewall driver and Then basically avoid this whole kind of issue. 
Yeah, just don't use IP tables in OpenStack The question was about the format of the CRD's and how they are getting there the CRD's are basically handcrafted at that point in time The configuration validation and afterwards is going against the conflict definitely defined by also conflict Maybe to be more specific The TRD's are not completely handwritten that would again be a daunting task But they also generated via Q-lang it can also output JSON So we are effectively just putting snippets together of these here these so like we have the API section there We use it for all services obviously, so that's not always copy-pasted, but it's Composed together, so but yeah, so and yeah the other part is Oslo Which isn't in the CRD's that there the CRD's are actually freeform and the generations the Validation against Oslo is only happening in the operator So you can actually apply an invalid configuration for instance a list instead of a string But the operator will then reject it and tell you that doesn't work. You need to change it The question was whether a CNI is used or whether we are using host networking exclusively If CNI is used that works planetly actually our default is Calico, but we are not sold on that I think you can use any CNI which provides the basic stuff Yeah, but there's also the side effect that potentially weeks LAN ports might overlap between the CNI and what mutual does Then IP tables rules to do migration might rescue you, but they are horrible Do we have any further questions? Yes The question was whether the yoke open stack layer so to speak can be used standalone without all the stuff below that The answer is clearly yes. So for instance, we run our CI on a gardener cluster Others use well, okay, that's again Yoke Kubernetes. So yeah, that doesn't count But it isn't tied to the Kubernetes layer. 
It just needs basic Kubernetes functionality. We have a list of requirements in the documentation, actually, but it isn't that steep: you need an ingress, cert-manager and a storage class, but other than that any Kubernetes cluster will do. I'm actually running it on k3s on my laptop, and that works as well, although it drains the battery pretty fast.

Anything else? Okay, if there's nothing else, then I would say thank you. You can meet us here or in the Yaook channel on OFTC IRC, right? That's OFTC, yeah. Thank you for your attention, I guess. And also, if you would like, we have stickers here.