And also a welcome from us. We are very humbled to be the first session of the day. We are all from Linux integration engineering, and when I learned that the topic for this DevConf would be cloud and hyperscale, I couldn't resist inviting these awesome engineers, who are working on fascinating projects, for a panel discussion about how we are doing hybrid cloud. Actually, when we started planning this, we first had to sync on what hybrid cloud actually is. We had to look it up on the internet and find the precise definition to get on the same page, so that was quite an experience for us. So let's start the introductions. Ondra, please introduce yourself. Hello, I am Ondra, and I work as an engineer on the Image Builder project at Red Hat. I'm one of those people on our team who are not afraid of touching ops, so I'm here and I can tell you something about our operations. If you are wondering what Image Builder is, it's a tool for building and uploading images. Currently that's primarily cloud images, but we can also do ISOs, containers, and much more, I think. And we do it for RHEL, for CentOS Stream, and also for Fedora. We have basically two big operations things going on. One of them is our nice CI, which runs hundreds of jobs every day to test that we are building the right thing. And then we also have our production service, which is online, and every one of you can try it. Basically, you go to console.redhat.com, then to Insights and Image Builder, and you can build your own customized RHEL image. Fedora images are not there yet, but they will be, definitely, and CentOS Stream too. It's just the best way to get a cloud image of our stuff, so feel free to try it. And let's go to Michael. Thank you. I'm Michael. I'm part of the CKI team and responsible for infrastructure there. CKI stands for Continuous Kernel Integration, so we are providing kernel testing as a service.
So independent of what other people say, the kernel is actually tested. We provide testing for upstream, integrating into the awesome email-based workflow of the upstream kernel developer community. And we also provide testing for internal Red Hat kernel developers, who nowadays live on gitlab.com in merge requests and treat the kernel as a normal software project. So CKI does everything you would expect from CI for a software project. We run builds of the kernel for multiple architectures. We test them in labs, mostly inside Red Hat. And then we also look at the results of those tests, trying to detect new issues, alert kernel developers about them, and figure out what to do with those new issues. Because the kernel is a bit special, you might have issues that you know about but can't immediately fix. So that's where we are working on the workflow and improving it to make that possible. Okay, thank you, Michael. Okay, hi, I'm Pavel. I'm an engineer and team lead on the Copr team. We work on the Copr build system, on the underlying Mock utility, and on several other, let's say, package-maintenance-oriented pieces of software. The Copr build system is a tool for building third-party RPM repositories, integrated on top of Fedora, Red Hat Enterprise Linux, and other distributions. And by integrated, I mean it's super trivial to build with us: you just get an account, you create a project, and you can build. But it's also very easy to consume content from us: you just run the command `dnf copr enable` with the name of the project and you can start installing. Thanks to our cloud sponsors, we have quite some power nowadays. We run over 300 VMs in parallel today, and we could scale even more if we needed to.
And as such, it's kind of expected and desired for you to use us for continuous builds and for things like building RPMs in pull requests and so on, either using our webhook support or, for example, the Packit service. I checked today, and we have about 17 terabytes of package data, so, yeah. Wow, that's impressive, thank you, Pavel. Nothing like having notes in front of you and still not saying what I wanted to say. I think I did not introduce myself: I am Tomasz. I also work for Red Hat, and you probably saw in the schedule that there were four people, and suddenly there are three. Miro got sick and cannot be with us, so maybe he's watching: hi, Miro, and please get well, we need you. Okay, so we should probably talk about how you are running your applications. So let's deep dive into the clouds, I mean, fly up to the clouds. Please describe what infrastructure you use, whether it's VMs or containers. Maybe we can start with you, Pavel? Yeah, we have two Coprs. One is public, one is internal for Red Hat purposes, and I would like to talk about the Fedora Copr one, because that's much more interesting. Everything there is public; we don't use any private stuff in our lab. We have on-premise machines in the Fedora infrastructure lab, where we run libvirt. We run machines in AWS: spot instances, and on-demand instances just in case something goes wrong. We use IBM Cloud for s390x, and we use Oregon State University for PowerPC little endian. And we start VMs using a tool separated from the Copr code, a resource allocator, which divides our VMs into pools. Those pools are kind of homogeneous sets of builders, always dedicated to one infrastructure, and each VM in the system is tagged, for example with architecture tags.
So when a user, or the Copr build system itself, comes and wants to use, let's say, an s390x machine, it says so using the tag, and it gets the machine either immediately or after some time, once one gets allocated. To save waiting time within the budget, we pre-allocate a certain number of VMs in advance, so people don't have to wait for us to allocate them, and we allocate more if there is demand for them. There is also prioritization between pools: we try to use the VMs we have in our on-premise lab first, because we already paid for them, so why use the cloud; then we fall back to other pools, like AWS spot, and then we go to the on-demand ones, because they are simply more expensive. Starting and stopping takes some time, so we try to recycle VMs, but that's kind of tricky, because building RPMs is a privileged operation. Not the building itself, but to build RPMs you need to install other RPMs as build dependencies, and for that you need to be root. So we give all of our users full power over those VMs, and as I said, anyone can build in Copr, and it would be dangerous if we allowed users to hack each other. So we recycle a VM only if it is absolutely safe, for the same user in the same project; otherwise we shut the machines down. So normally we start thousands of machines every day, maybe tens of thousands these days. Okay, thank you, Pavel. So please don't try to hack other users; it shouldn't be possible. Okay, Michael, please tell us about the kernel CI, that has to be pretty interesting to work on, too. So yeah, we also use hybrid resources, both cloud and on-premise, trying to split this depending on the properties of those environments. In cloud environments like AWS or GCP, you have highly reliable infrastructure, managed professionally, which can more or less endlessly scale, or at least that's what you would expect. So that's what we use cloud resources for.
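As an editorial aside, the pool-priority fallback Pavel describes (on-premise first, then cheaper spot instances, then on-demand) can be sketched roughly as follows. The pool names, sizes, and tags are made up for illustration and are not Copr's real configuration:

```python
# Sketch of priority-ordered pool allocation: try the already-paid-for
# on-premise pool first, then spot, then expensive on-demand instances.
# All pool definitions below are hypothetical.

POOLS = [  # ordered by priority: cheapest option first
    {"name": "on-premise", "tags": {"x86_64", "ppc64le"}, "free": 0},
    {"name": "aws-spot", "tags": {"x86_64", "aarch64"}, "free": 5},
    {"name": "aws-ondemand", "tags": {"x86_64", "aarch64"}, "free": 50},
]

def allocate(arch_tag):
    """Return the name of the first pool, in priority order, that has a
    free VM matching the requested tag, or None if the caller must wait."""
    for pool in POOLS:
        if arch_tag in pool["tags"] and pool["free"] > 0:
            pool["free"] -= 1
            return pool["name"]
    return None  # wait until a VM is released or pre-allocated

print(allocate("x86_64"))  # on-premise is exhausted here, so spot is used
print(allocate("s390x"))   # no pool in this sketch carries that tag
```

In the real service, pre-allocation (keeping a few VMs warm per pool) sits on top of this so that the common case returns a machine immediately.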
So really, stuff that needs to be stable. We have an OpenShift cluster running there that hosts our microservices. We have a messaging cluster running in AWS, because that's based on VMs, a very stable environment. We use Lambdas as webhooks, so there are basically only three megabytes of code that can break, and no VMs in the middle. And then we also use cloud resources for anything that needs to scale. If you're building kernels, there might be no build job running while all the kernel developers are asleep, and then at four in the morning they hit their productive time and suddenly you're spinning up hundreds of them, right? Doing that in a responsible fashion is only possible if you go to a hyperscaler; otherwise you might starve other teams of resources. We use on-premise resources for other reasons. For example, if you need interesting architectures like IBM s390x or PowerPC, there are cloud providers that can give them to you, but it's actually a bit harder to use them; the Copr team does, but we don't. So we fall back on local machines for those. The same goes for the testing side of the kernel: if you're testing RDMA support, you need an RDMA lab, you need special hardware. That's something you will only find on-premise, because you built it yourself. And the third reason for doing stuff locally is again the cost factor: you might have a workload that is long-running and pretty beefy, so you need big machines, and they would be really expensive to rent from somebody like AWS or GCP. So it depends; we're trying to be both cost-effective and mindful of our customers, who expect a stable service. Okay, thank you, Michael. Ondra, please tell us about Image Builder. So as I told you, Image Builder is a hosted service where you can just call the API or visit the front end and build an image.
And yeah, it's a customer-facing service, so it needs to be highly available, and there needs to be zero-downtime redeployment. It just must work for all of you whenever you want to build an image. So we need something stable, which is OpenShift, you know, running multiple containers so we have redundancy in case something fails. And we run it in AWS, because that's what our SRE team supports. The sad part of this story is that currently you just cannot build an image in a container, for various reasons: the container would have to be privileged, and you would need to open a lot of paths into it. So it's just not possible currently. So we also run a pool of machines in AWS, and they do the actual heavy lifting. That's already a kind of non-standard situation, and we will talk about it later, hopefully, right? I hope so. We have time. Yeah, that's it. Our CI is also very interesting, because we run a lot of machines, and we would like to run most of the stuff in AWS, because, you know, at hyperscale you can run as much as you want. But we also need to test these images, and for that we need virtualization. And in EC2 it's very hard to get virtualization; there's an option, but it's extremely pricey, so we don't want to go there. So instead we are going hybrid and using a local OpenStack. Then you need to solve all the issues of handling resources in two clouds; we use Terraform to abstract it, and it eats so much memory. So yeah, it's painful. Okay, so the next topic is not painful. You already said OpenShift, so let's talk about OpenShift. For those of you who don't know it, it's an amazing product by Red Hat, the Red Hat OpenShift Container Platform, our flagship product for running containerized applications. And honestly, even in our project, I can't imagine we would run on anything else, because it has just saved us so much time and everything. It's based on Kubernetes. So let's hear it from the panel.
How do they use OpenShift? Ondra, you have the mic, so you start. Okay, all right. So yeah, in our production, basically the API layer runs on OpenShift. And for us, I think it's a success story, because it does what it should do: it hosts the platform in a stable manner, and even while the cluster under the application is being upgraded, it just works. We have probes and metrics to measure everything, and so far, I think, we haven't managed to break production. So that's pretty nice. But also, you did? Not yet. Not yet. Oh, not yet. Yeah, not yet. I haven't checked in a long time. Anyway, what did I want to say? Ah, I know: that it can be tricky. That's something to realize. It's a big, asynchronous platform; it can contain I don't know how many nodes, and sometimes you need to be careful about timing. In one of our stories, we were killing pods while requests were still going to those pods, which lowered the success rate quite a lot. So yeah, that's not a great thing, and we needed to play with it for some time to understand all of its concepts. Your application needs to be aware of how Kubernetes and OpenShift work. Okay, thank you. Michael, are you using OpenShift? Yeah, for production. So the CKI team, we run everything we can on Kubernetes, and by extension OpenShift, which kind of makes it consumable. We tried to run Kubernetes by ourselves, but it's far more involved, even after you know what all the pieces are that you need. So OpenShift really abstracts that away. But then there are some limits on what you can run on Kubernetes. Kubernetes has certain ideas about workloads, coming out of what Google designed Borg for at the time, I suppose. It has this idea that workloads can get killed and then you can just spin them up again, that it's stuff that can be parallelized. But we have, for example, long-running jobs that need to stay up for a couple of days.
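The pod-termination timing issue Ondra mentions (requests still being routed to pods that are already being killed) is usually solved by draining: stop accepting new work, then wait for in-flight requests to finish before exiting. A rough, framework-agnostic sketch as an editorial aside; this is not code from any of the projects discussed:

```python
import threading
import time

class DrainingHandler:
    """Track in-flight requests so shutdown can wait for them to finish."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._accepting = True

    def handle(self, request_fn):
        # In a real pod you would also start failing the readiness probe
        # once draining begins, so the service stops routing traffic here.
        with self._lock:
            if not self._accepting:
                raise RuntimeError("shutting down, not accepting requests")
            self._in_flight += 1
        try:
            return request_fn()
        finally:
            with self._lock:
                self._in_flight -= 1

    def drain(self, timeout=30.0):
        """Stop accepting work; wait for in-flight requests. True if drained."""
        with self._lock:
            self._accepting = False
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(0.05)
        return False
```

You would call `drain()` from a SIGTERM handler or a preStop hook, sized so the drain timeout stays under the pod's termination grace period.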
So that's not something Kubernetes was really designed for or can deal with nicely, so we keep those things out of Kubernetes. We have other workloads that don't quite fit the feature set but could be made to work. For example, we build kernels, so we need to support architectures like s390x and PowerPC, but a Kubernetes cluster is normally one architecture. So there are some limitations: if you want to build container images, you can't build them for s390x on an x86 Kubernetes cluster nowadays; maybe that will be fixed. So we're hitting those limits, and then you need to solve those problems outside of it. It might be more painful; it would be nice if it were on OpenShift, but it's not at the moment. And the third thing we don't put on an OpenShift cluster is stuff that needs to scale extensively, like build jobs, which is similar to what the Copr team deals with. You could scale a Kubernetes cluster by putting more nodes in it and then putting workloads on those nodes, but it's much easier to just spin up a VM in a hyperscaler directly and run the workload on those machines. It's easier, more cost-effective, and has fewer failure modes. So there, yeah, we're not going Kubernetes or OpenShift. Okay. Thank you. And Pavel, can you tell us your OpenShift story? Yeah, we are not using OpenShift yet, and we do sometimes break production; maybe that's the reason. But we certainly plan to use it. We already have a work-in-progress pull request for OpenShift deployments, and there is yet another pull request from someone outside doing a customized Kubernetes deployment. And yeah, the question "how am I supposed to start my own RPM build system, because I cannot use the Fedora Copr one, for legal reasons or because I have some proprietary stuff I need to build?"
Such questions come up again and again, and moving our infra to OpenShift will not only simplify our own work but also kind of connect us with the other folks doing these deployments; we will kind of standardize it. So that's it, yeah. Builders, though, builders are still the problem. Even if we have the infra in OpenShift, we will still have to spawn VMs, because we need to give our users full privileges, as I said. In this case, I think OpenShift Virtualization could help us a bit, but we still need all the architectures, and even though we use multiple clouds now, none of them provides all of the architectures we need. So the allocation part and the flexibility part will likely have to stay with us. Okay, thank you. So about the builders, Pavel, would it help you if OpenShift had support for user namespaces? Yeah, definitely, to some extent. It probably wouldn't solve the architecture problem, but having user namespaces... I mean, in the Mock tool that Copr uses, we already support rootless user namespace containers, so you can run Mock there, and it would really change the way RPMs can be built. We could move some of the stuff to containers; that would be really awesome. Okay, thank you. I hope we'll get there one day. Okay, next topic. We are still in the OpenShift area, but we've already heard what works and what doesn't. So Michael, what's missing for you in OpenShift? It's mostly the multiple architectures, I think.
We are not an application team; we provide CI pipelines, so that might not be the usual case, but I think the three of us have similar issues: we want to provide support for multiple architectures out of a common infrastructure. For us, for example, that means we want to build container images for multiple architectures. It's possible: you can build a multi-arch image with Docker or Buildah or whatever, and the user doesn't really need to care about the architecture their machines have; they just pull an image and it just works. But actually producing those images is harder to do, so I think there are still pieces missing from the multi-arch story. And it becomes interesting, I think, also for customers who want to migrate, for example, to Arm machines or Arm clusters; they might hit similar issues, where the migration story becomes really hard to implement in practice, because you're stuck on the native architecture while you're building your application, and actually moving over is more painful than it needs to be. Okay, thank you. So I'm glad we have one panelist who works on building images. Ondra, what's missing in OpenShift for you? Sadly, it's not user namespaces, because of the way the kernel works; we are dependent on the kernel. For example, nowadays it's not possible to build a Fedora image properly on RHEL, because Fedora uses a special file system. If you can guess the name, I will tell you where you can get cheap butter. And yeah, so for us it's basically support for virtualization. In our case we run on AWS, and there is an option for virtualization in OpenShift; it's called OpenShift Virtualization, or the upstream name is KubeVirt, right? That would be nice, but in AWS, if you want virtualization, it's extremely expensive.
So in our case, if I can dream, it would be amazing if we could spawn EC2 instances as, basically, pods. The lifecycle would be the same, and it would be very nice to integrate it into the whole Kubernetes workflow with transparent networking and things like this, but one can only dream for now, I guess. No, you need to file that feature request. Right. Okay, thank you. So, audience, how are you doing? Are we too technical? Are we boring you? Because the next topic I'd like to ask about is deployment and testing. Is that exciting for you? Okay, so, Ondra, how do you deploy and test Image Builder? Okay, I can talk more about the deployment. Our SRE team made us a wonderful GitOps workflow, so we can just store all the configuration for our service in Git and we can, you know, revert; they even run some tests on PRs, so we know whether a change will break things even before we deploy it, which is very nice. And it handles both the EC2 instances that we have, using Terraform, probably, yeah, Terraform, and the OpenShift part, which is a custom pipeline built on Tekton and Jenkins, I think. So that works wonderfully. And we have staging and production environments for the service, which is amazing. The only difference is that stage is just at a smaller scale, like, I don't know, four times smaller. And that's nice, because whenever we had an issue somewhere, it appeared in stage, and we just fixed stage and everything was on track again. Also, a rule for all of our production stuff is that we try not to have SSH keys there, or really any access. So if we want to know what state a machine is in, we have to look at logs or metrics, and we push all logs from all of our services into one Splunk instance; it's just super nice that we can see everything there. Very nice. So Michael, how about you: deployment, testing, staging and production?
We don't have staging; we just break production. So yeah, we use GitOps as well, so everything is somehow managed, either by having YAMLs that can be deployed into OpenShift or Kubernetes clusters, heavily templated to abstract stuff like monitoring and logging. I think we ended up with a similar setup. So everybody is a platform engineer, I think, kind of trying to come to a common layer that encapsulates the common parts, especially if you're on microservices. And for all the rest we use Ansible. There are bare-metal machines, VMs, AWS DNS zones that need to be updated; for all of those miscellaneous things, it's Ansible. We had an architect who started doing these things, and before he could get to Terraform, he left for the Image Builder team. So we are kind of stuck with Ansible, and they got the cool stuff. And then, for testing, we mostly do unit testing on one level. We don't have a staging environment, not because we don't know how to spin one up, but because up to now it hasn't provided enough value; we have a partial one. But for most other things, because we run kernel build pipelines, we have to change components of those pipelines while we work on them. So what we can do is spin up canary pipelines, where we take a known-good pipeline, basically a kernel revision that we already built and tested, and spin it again, but now with the updated components, before we merge those changes. So we replace these individual pieces and then look at whether it still works right and still gives the same results as before. And that actually works very well.
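The canary check Michael describes comes down to rerunning a known-good revision with updated pipeline components and diffing the per-test outcomes. A hypothetical editorial sketch; the real CKI tooling is more involved, and the test names and outcomes below are invented:

```python
def canary_diff(baseline, canary):
    """Compare per-test outcomes of a known-good baseline run against a
    canary run of the same kernel revision with updated pipeline pieces.
    Both arguments map test names to outcomes such as "pass" or "fail"."""
    changed = {t: (baseline[t], canary.get(t, "missing"))
               for t in baseline if canary.get(t) != baseline[t]}
    new_tests = {t: canary[t] for t in canary if t not in baseline}
    return changed, new_tests

baseline = {"boot": "pass", "net/rdma": "pass", "mm/ksm": "fail"}
canary = {"boot": "pass", "net/rdma": "fail", "mm/ksm": "fail"}
changed, new_tests = canary_diff(baseline, canary)
# 'changed' flags net/rdma: the updated component altered a known-good result,
# so the pipeline change should not be merged yet.
```

Note that a known failure (`mm/ksm` here) staying a failure is not a regression of the pipeline change itself, which is why the diff compares against the baseline rather than requiring all tests to pass.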
But it's not that we consider this the be-all and end-all. There are improvements we would like to make to how we deploy, for example on Kubernetes, using something like OpenShift GitOps, Argo CD, those reconciling tools, where something actually checks that whatever you deploy is really running in the cluster, instead of just throwing it over the wall and hoping it will all work out. So that's one. The other aspect is the way we manage secrets. It's okay, but we have, I counted, something like 300 of them. They're all nicely stored. But then if somebody comes along and says, oh, you got compromised, please rotate them in the next two hours... I mean, we don't even know where we got them, right? So the whole secrets-rotation story, rotating secrets on a schedule, rotating them when somebody leaves or on incidents, is something we are working on. There's a lot of stuff that needs to be done. Okay, thank you. So Pavel, let's hear from you. Yeah, we also have production and development stacks, but somehow it still happens that production goes down; I don't know what I'm doing wrong. For development, no fancy GitOps, unfortunately; we only do continuous builds and builds from PRs, linting, and stuff like that. For local development, we have a Podman Compose setup, so we can run the Copr stack locally and test, develop, and so on. Yeah, so that would be it. One specific thing is building the golden images. We need our VMs to start very quickly, so we cannot simply use the images that Fedora provides, because they are too old. We need to update them, install some software, configure them, and so on, and doing that at VM startup would simply take too long.
Therefore, we build our own golden images, based on the Fedora images, using virt-sysprep and tools around it, glued together with some scripts, and they are very ugly. And you can imagine what supporting multiple clouds with golden images is like, yeah. We are really looking forward to Image Builder support, but currently Fedora with all the architectures we need is not in scope. Okay, okay, thank you, Pavel. We are actually slowly running out of time; we have about five minutes left. I can't imagine doing this panel with Miro as well; we would have run out of time already. So, final topic: current challenges. You already spoke about some; do you have more? Yeah, I think moving to OpenShift will be one of them, because we are three folks; counting full-time, it's two and a quarter, and that's not much. So we would like to minimize the maintenance cost even more, and OpenShift should help us with that. We are looking forward to Image Builder support and moving that work to them. And yeah, all the problems we are talking about are kind of similar, so maybe take it as a challenge to organize something and do it together, solve the problems in one way for all of us. Yeah, like a panel discussion. Actually, while we were preparing, we already synced on so much stuff; it was amazing. Thank you, Pavel. Michael? Yeah, so the issue we see, next to the other stuff we already mentioned, is around scaling.
Introducing testing might initially be a problem, because people don't like it, but after a while they get into the mood, and then stuff starts to increase in scope and amount. So when a service scales, it depends on where it runs: if it's on-premise, we hit scaling issues where we run into the limits of statically allocated resources, like a testing lab that only has a certain number of machines; if you throw more jobs in its direction, at a certain moment that will simply not work out very well. On the other hand, if you throw more jobs at a hyperscaler, the hyperscaler is kind of happy about it and will charge you for it. So there are costs associated with scaling, and it exposes, in our case it does expose, assumptions about what causes cost, and how much, that turn out not to be valid; all those budgets you had to formulate at the beginning of the fiscal year can be exceeded quite easily. So you need to keep track of the costs those scaling aspects cause and mitigate them. It's a continuous battle to keep track of it and work on it to stay within budget. Yeah, thank you. So I think our upcoming challenge is scaling our service, because we currently use a plain Auto Scaling group in AWS, but with the size fixed at a certain number of machines. As we get more customers, we need, of course, to scale the service automatically. And this is a completely new challenge for a software engineer, because you need to collect metrics, decide which metrics you will scale on and by how much, and then somehow validate it before you put it into production. Otherwise, it may happen that you end up with zero machines in production because you failed at writing a proper scaling script. So yeah, that's the next big challenge, and we'll see; maybe we can share some information about it.
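The scaling decision Ondra describes (pick a metric, compute a desired size, and never let a bad script scale production to zero) can be written as a pure function that is easy to validate offline before it ever touches production. As an editorial sketch, with all names and numbers invented:

```python
import math

def desired_capacity(queued_jobs, jobs_per_worker, floor, ceiling):
    """Desired Auto Scaling group size from a queue-length metric.

    A floor greater than zero guards against the failure mode mentioned
    above: ending up with zero machines in production because of a bad
    metric reading or a buggy scaling script."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    need = math.ceil(queued_jobs / jobs_per_worker)
    return max(floor, min(ceiling, need))

# Validating the policy offline against scenarios, before deploying it:
assert desired_capacity(0, 4, floor=2, ceiling=50) == 2     # never zero
assert desired_capacity(100, 4, floor=2, ceiling=50) == 25  # tracks demand
assert desired_capacity(1000, 4, floor=2, ceiling=50) == 50 # budget cap
```

Keeping the sizing logic as a pure function like this means the risky part, pushing the number to the Auto Scaling group, stays a one-liner, while the decision itself can be unit-tested against recorded metric histories.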
And I was even thinking, Pavel, you said it would be nice to share information more; I was thinking during this talk that maybe the organizers should do an OpsConf or something like that, for operations. Anyway, that's it from me. Okay, thank you. So I'm being told we are out of time, so unfortunately we can't take any questions right now, but please reach out to the panelists when you see them around DevConf; I'm pretty sure they'll be happy to talk to you. Thank you, panelists, for taking up the challenge, sitting here, and talking about your experiences. Thank you everyone for joining; please give a round of applause for our panelists. Thank you.