Hello, hello, hello, everyone. Radek, come here. Wow, I'm so nervous. Same. We haven't done this for a couple of years. I haven't seen so many people for a while. Yeah, but it's great. You arrived and we were able to meet in person, but also welcome to everyone who is joining us online. All right, so, yeah, welcome everyone to DevConf. Let's do some introductions first. So, folks, my name is Radek Vokal. I've been at every DevConf so far, and this is Dorka. Hi, my name is Dorka Vokala. I am responsible for the organization now. I have not been at every DevConf; I think I've been at the past five DevConfs. I started as a volunteer and now I'm standing here. So, every volunteer, this is your chance: you'll be standing here one day. Who's here for the first time? Raise your hand. That's quite a few people. I'm amazed. That means that I will have to explain to you now what DevConf is about, right? So, here's the thing. How much did you pay for the ticket? Zero, right? Perfect. Okay, so that means that whoever paid zero for the ticket is automatically a volunteer at the conference. You're helping with the conference here. You're making sure that this is a great event, right? So, make sure that everyone has fun here, but we also need to leave this room and this venue clean and in the same shape as we got it, right? So, please help us out. Make sure that this place stays as it is. And that's the main thing about DevConf anyway, right? This whole event is organized by volunteers. It's organized by a large group of people; I stopped counting at some point how many people participated in organizing this thing. And also all the speakers. It's a mix of first-time speakers and people who've been presenting for a while. A lot of them haven't presented to such a large audience for a while either. So, please be patient with them. Give them feedback as well. Help them out during their talks. Ask questions. Interact, participate. This is the main thing that we want to have here: we want to have a discussion. And you should really enjoy the conference. But there are some housekeeping things that we need to go through, right? We can do that now. Yes, we can. So, the first and most important thing is the schedule. Please have a look at it; it's the main thing. We made some last-minute adjustments, so if you were looking at it yesterday, have a look again. Thank you. I need my notes to remember everything that I want to say. The important thing is that in this part, in these several sections, there are talks and workshops. On the other side of the main street outside, there are meetups and activities. For example, we have physiotherapy every day, so if you want to learn something new or exercise, that's your chance. Another thing: we're transitioning to Matrix. So, people who are online might already be there, but you in this room, joining us in person, can join the DevConf.CZ 2023 space as well, participate in Q&A and watch the streams from the other rooms. I guess it's up to you. The thing that everyone has been asking about all morning is the social event. Yes, we're hosting a social event tomorrow. You have to wait for it a bit still; we want you to attend the talks and then talk about the talks during the social event. So, it's held tomorrow. It's going to be outdoors. Speakers already got tickets, but we have many more for everyone in here, almost everyone, and we are handing them out tomorrow at 12, near the registration area. And the venue is still secret, right? The venue is...
Yes, why not? It's not that far from here and you'll be surprised. Yes, it's on the way from here to the city center. It's outdoors, and we'll update the schedule so you'll know where it is tomorrow, probably after 12 or around 12 noon. Another thing: we have an IT museum at this faculty. They were very nice to us and let us use it, or rather show it to you. It will be open at somewhat uncertain times, so we'll let you know either on Matrix or by word of mouth here. You might see open doors and arrows pointing there. We have an EMT present, a first aid station; it's near the coffee station outside. If you need any help, want to report something or want to suggest something for us to improve, you can stop by the registration or you can talk to our volunteers, who wear light blue t-shirts, or to me, even though I don't wear the light blue t-shirt. And last thing, please respect our code of conduct and enjoy the conference. I'm going to mention one more thing. You don't have to give this to me. As you probably remember, those of you who've been here for some time, the last session of the conference is interesting: you should come because you'll be able to win some prizes. But it's not going to be for free, right? You'll have to do something. But it's going to be fun. So, show up at the last session as well and join us for some interesting prizes. Before we jump to the keynote, we need to thank the faculty here. We've been organizing this event at Brno University of Technology for about 10 years, is my guess. This is just such a beautiful venue. I'm still mentioning to people that when I studied here, it was a monastery back in the day, and it looked like one when I was here, unfortunately. And it's so nice right now. So, I want to give the word to our friend here, Vyacheslav, from the university. Hello. Thank you. Every time I'm wondering, what is this applause at the beginning for? I didn't show anything. I didn't say anything. Just, I'm here. As you mentioned, we are in a monastery from the 13th century. I hope you will have time to go around and see the beautiful parts of this monastery. In a later era the monastery was closed and given to the army; it's usual to exchange the soul for the rifle, unfortunately. I think from the 13th century, this was the place where people who were thinking beyond the present horizon met together, thought, and shared. Time was a little bit slower than today; to go from Paris to Brno took 20 days. I think they talked about their present problems, they shared their ideas and they were thinking about the future. And I think today you do the same. We have moved forward with technologies; let's try to think about how much we have moved in the other important parts of our lives. So, I would like to thank you, Red Hat, for doing this conference here. Thank you very much. I think we should officially start, right? Yes, officially. Okay, let's do it. Let's start right away. So, we're starting with our keynote speakers. Come on up, guys. All right, so, we're right on time, almost. Let's start with our first keynote. I'm just going to quickly introduce the folks here. The keynote is about open source services. Let's give people a few minutes to enter. Oh, yeah, absolutely. This is amazing. We're having DevConf again. The door closed. The door closed. I think that's a sign, right?
Okay, we'll be talking about open services, and this is an interesting group of people, because we have folks who've been with the communities in different projects for a while, who've been in different positions as well, right? Dealing with customers, dealing with developers, doing some real work and developing the services. So, it's something that we're all passionate about. We'll be talking about this topic for a while, also during the conference, so not only at the keynote. I think we should do a really quick round of introductions for people who don't know us. My name is Radek Vokal, and I'm currently working at Red Hat as the product manager for a set of services that we call Insights. I've been at Red Hat for almost 20 years now. It's a little crazy. And, yeah, Stef, I've known you for ages as well. Yeah, I've been involved in open source for over 20 years. This is not my real hair, in case anyone's wondering, but if I don't wear it, I get booted off stage. I lead a lot of the RHEL engineering, Linux engineering, and Satellite teams. So many people that I have the joy and pleasure of working with are here as well. Thank you. Hi, I'm Simon, and this is my real hair. So, yeah, I've been working in open source for something like 15 years, and I've been at Red Hat for two years, and I've had the amazing opportunity in these two years to work with the very talented people in the image builder team. And, yeah, we are the first service of this Insights group of RHEL services to be available publicly, and we're very proud of that, and we're working to make the service even more open going forward. Hello, I'm Roberto Carratalá. I'm a cloud services black belt, even though I don't have a clue about Kung Fu or anything else, and I've been working for Red Hat for more than seven years, almost eight, and it's a pleasure to be here. Hello, I'm Tomas Tomecek. Radek and Ondrej Vasik actually hired me 11 years ago. At DevConf, right? Oh, yeah, at DevConf. And I never thought that I would be doing a keynote here, and I like to make short introductions. Thank you. All right, so let's give you a first introduction, right? So why are we even talking about open source services? Why do we care? Why do we care about this? I mean, come on. We've all been part of a massive, monumental change that has made the world better. We have brought open source into every corner of software development. I mean, there are proprietary systems that now assume that somewhere in the stack is a component that someone in open source has worked on, someone here. There are open source services, systems, operating systems, everything you can imagine; open source has permeated throughout everything. And this has brought humanity much further than proprietary software could. So when you walk into a Red Hat office and you see one of these signs with the Mahatma Gandhi quote, you can easily assume that, okay, we won. We're done. This is great. We have Utopia now. It's all wonderful. Let's go home. And we're here to tell you that this is not the case. There is a challenge ahead of us. There is a challenge to open source that, if we don't address it and adapt to it, will become a threat. And I'll explain to you why that is. Think back on the first open source change that you made to a component or something you were running. Now for many of you, this has been around for maybe the whole time you've been working on software. But for some of us, it blew our minds.
Think back to the first time you could actually change something, change the behavior of something on your computer. Maybe you changed the color of something, maybe you put in a printf, maybe you logged something and you changed the output of a command. It just blew my mind that this was possible. I'm surprised you found my first patch. This is your first patch. Very good. I couldn't find my first patch. I mean, I didn't keep it. It was garbage. It was changing, I forget, something to pig latin, and it was in fetchmail, I think. But think about what makes this actually possible. What makes this possible is that you have a copy of the software running on your computer. This is what made this possible for me, and the source code was shared in a way that I could actually change it. I could rebuild it and so on. And the problem that we face is that a lot of people, a lot of us, but a lot of people in the world, no longer actually want to run a copy of the software. They want you to run that crap that you wrote. They don't want to run it. And they want to experience the output in the form of a service, in the form of an API that they call, in the form of infrastructure as a service. In one way or another, they want to use the software without actually having to run it, much less make a copy of it themselves. And so we run into a paradox here, a conundrum, and I'll walk through this with you. The first thing is that open source thrives when it can convert some small percentage of the people who are using that software into contributors. Now, in some projects this happens a lot. This happens at a fantastic pace where half the people become contributors, maybe in developer tools or things like that. In other cases, it's a small fraction, but there's some function here where people who are using the project actually decide to make a change or help in some way. And conversely, it starves when that can't happen. So if you prevent that function of users becoming contributors to the software, open source starts to atrophy, to regress. But our open source practices, most of them, require that you operate a copy of the software in order to change it, in order to even get the idea that you could make a change, in order to play with it, in order to introspect it, in order to understand what's going on. But at the same time, the users of a service literally chose not to run the software themselves. They're using the service mostly because they want someone else to run it, or in some cases, it doesn't make sense to run this thing in another place; it makes sense to run it in one place. They're either unable, unwilling, or it just doesn't make sense for them to copy the software and operate it. So we're at a paradox, a place where, when we put all these things together, it doesn't add up. It is very hard to contribute to open source services. It's not natural. And the mechanism that underpins much of open source, licenses and copyright, is about copying software. And in services, you don't need to copy a service in order to have that software be successful, to have that software be used. So although we can still use all those ingredients, we're not going to throw away open source licenses, for example, they are not sufficient to solve this paradox. And so I want, imagine a world, we're not there yet, but yes, imagine a world where you can actually go and look at what's running in the service that you're using. You can see the code, you can see the software.
Imagine when you call the API, you can understand what the hell is going on under the hood. You can see it the same way you can with a stack of Python or Node.js on your own machine. Imagine you could make a change and experience that change without operating it yourself. You could make that printf or that change, or you could translate it into Pig Latin, or you could change the color of something in a service, and you can actually see the behavioral difference. That is a world where open source actually works with services, with software as a service. So... I think we need some guidelines, right? We actually... Yeah, well, exactly. And this is real today. So that's what we're here to talk about. We wanted to introduce the challenge, the paradox, but also what we are doing to start to address this problem. And so many people are involved, so there are so many different ingredients to this. One of them is that together we figured out what the basic requirements for an open source service are. It's not just an open source license. It's not just sharing the code, although that's important. The first requirement is that you do share the code, the same as with a project, that all components and all the assets in the service are shared under an open source license and available to the public. That's fundamental, of course, but it's insufficient. The second fundamental part that we need is that others can contribute in the same way as the team working on the service can. Whatever the mechanism is that you use to work on the service, to deploy, to review pull requests, to accept those changes, to run tests and so on, others need to be able to do it in that same way. And by meeting these basic requirements for an open source service, people can then take it further. They may not take it further, but they can take it further to perhaps operate it somewhere else or add capabilities and ways of working. This is all great, but is someone already doing all this stuff? Yes, we're gonna blow your mind about it. This is happening today. I mean, it's not all perfect, but that's why we need to work on it together. So... I think Tomas is actually working on something. He raised his hand immediately when we started saying that someone is actually doing this for real, right? This is the slide. Yeah, we definitely need to make it and take it further. Okay, so I think you might be confused by looking at the slide, but I'm afraid that was my intention. But my colleagues said that maybe I should explain what's in there. So I tried to collect a few techniques we are using in our open source services and explain them. So on the bottom, you can see, and that's my favorite: database dumps from production. They are super helpful for everyone on the team and especially for outside contributors. When you are working on your change locally, you can load it up with the production data and see how the change will feel in production. That's really amazing. Just one pro tip: don't forget to remove all the passwords from the dump. And the whole top is filled with documentation. I can't stress enough how many times this saved my butt when I was trying to work on our service and I didn't know which OpenShift command I should run or how to access this or that. And then I open our documentation, which is really perfect for deployment and for the architecture, and I immediately knew what I needed to do to keep the production service running and not trash the secrets or something like that.
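As an illustration of that pro tip, a sanitized-dump workflow could look roughly like this. This is only a sketch assuming PostgreSQL; the database and table names (service_prod, api_tokens, secrets) are made up for the example:

  pg_dump --no-owner --exclude-table-data=api_tokens --exclude-table-data=secrets service_prod > prod-dump.sql
  grep -iE 'password|token|secret' prod-dump.sql    # double-check nothing sensitive slipped through before sharing
  createdb service_dev && psql service_dev < prod-dump.sql    # load the dump into a local dev database

The point is that the dump a contributor receives already has the sensitive tables stripped out, so sharing it does not leak credentials.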
So it's really essential, especially when you are struggling or there is some problem and you need to solve it as soon as possible. Make sure that your deployment documentation is flawless. Yeah, on the top right, sorry, that shouldn't be there; Simon will talk about it. But as Stef said, the minimum requirements are that all the assets are open and everyone in the public can contribute. And it's really just a minimum. I mean, when you meet that, when you open source your code and contributions start piling in, that's just the beginning for you. And there are many things you can do. For example, in our service, one thing that really helped us was opening up our planning process, because we would always get these questions: so what do I work on next? Or what's the current epic? Or even within the team, we sometimes didn't know what we wanted to work on. And as soon as we made a Kanban board in our GitHub namespace and everything was open, all our epics, what we are working on right now, it made so many things much easier. So before I hand over the microphone, I'd like to challenge everyone to think about what should be on this slide, because I collected a few screenshots here, and now that I see it on the little screen, I know there are many things missing, and I would really love to see more solutions and make it easier for everyone to, as Stef said, experience a change I'm working on without needing to deploy anything; someone, some system, will do it for me. So think about that, and maybe when we do this in one or two years, this slide will be so full that you won't be able to see anything. Right, and Tomas, this sounds way too easy, right? I think there must be something missing. It's not just about the code of the service, right? Do you still have to run it, deploy it somewhere? What are the things that we're missing? Roberto, help me out. When we are talking about software as a service, it's not only about the code itself. The service itself is much more than just the code. The code represents, for example, the tip of the iceberg; we have much more value in the service than that. Below the surface, we have the best practices that are used to manage the different services at scale. We have the automation and the infrastructure that are running as well. We have the operational processes, the interconnected services, and all of these components and pieces are backed by open source projects and repositories that enable the users to influence and to make contributions in order to change the software and the services that they are using. For example, Azure Red Hat OpenShift is a jointly designed service that runs in Azure itself and is designed by Microsoft and Red Hat. It's open source, so you can go there, you can contribute, you can open issues and influence the different software that you are using at scale. Roberto, I have a challenge. I know that there are a bunch of services out there. I'm running some of them. I'm using some of them. How can I tell that this is an open source service, and that I can just start contributing to it? Simon, you have an answer, right? Yeah, I mean, that is a very, very good question, a very valid one, because it ties into all the other things that were said before me, like the thing that Stef said. We're used to a certain way of working with open source software. We download a software package. We have an idea of where the source is.
I mean, we've just downloaded it. There is a disconnect, though, with services, where we don't necessarily know where the source is, or we don't know where the documentation is, or we don't know how to find the standard operating procedures or the best practices of the specific team that designed this very service. But you don't even know that the service is open source. Yeah, exactly. You might not even know that a service is open source. It's even more fundamental. And the only way you can do this is by embedding something in the service that guides the user to how to introspect the code, or that shows at first glance that this service is open source. The same way that a fork-me icon on GitHub tells you, oh, I can fork this code. I can actually work with it. It's open. And this is, I think, something that we really need to work on, something we're piloting on some of the services that are open already, as you can see in the screenshot. And the idea is to really connect the two dots, you know, the running service down to the source code, and to give users a way to, first of all, inspect the code and understand why an API is maybe not behaving the way that you're expecting, or the way the documentation told you it would behave, or even how you could change it and how you could maybe introduce some typing or something into the API and make sure it doesn't break. But, I mean, one thing that we haven't really addressed is why should any business care about this? Like, why should anybody invest money into this? So, here's the thing, right? At the end of the day, the services that we're all working on and that we want to open source, someone wants to monetize them. Someone wants to benefit from them. And I have an observation as a product manager about services that, from my perspective, are becoming successful. There's a component that we didn't mention here, and I want to highlight it. A lot of these services and projects that you see, open source projects here, are actually opening up their ecosystem for further contribution. Something I would call a mid-stream, right? Where you can build extensions, integrations, plugins, different connectors. And this allows different people to pick up these services, pick up these projects and extend them for their specific use cases and different purposes. Extend them beyond what the initial authors of the service even thought about, right? If you look, again, at some of these examples, you basically realize that these are projects that were very much focused on a single use case at the beginning, a single purpose again. But because they thought about plugin infrastructure, extensions, and additional things, they became hugely popular. And again, this mid-stream idea, where different people can contribute on top of the service, they have access to APIs, they have access to a sort of playground, mock data, and these kinds of things, is a huge thing. And again, that's where I see the services being successful, but also then solving problems and solving challenges for some of my customers that, again, the initial authors of these services haven't even thought about. So I think that's something that we should all think about as well: how to open source not just the service itself, but also the ecosystem around it, so others can easily contribute on top of your service, and within your service as well. We've talked about a lot of things here, and we've touched on these things, right? But we need to go deeper.
And because of that, we actually have a couple of sessions that we're doing during DevConf, and we welcome you to join these and challenge us and tell us how wrong we are, or whether you're already doing something else and it works for your community, your use cases. Let's do a really quick introduction of all these talks that we're going to be doing. So Tomas, you and Neil are doing the first one, right? Oh, yeah. So we spoke about open source services right now for a few minutes, and you can actually experience the contribution process tomorrow, I don't know in which room, but the title of the talk is up there. So we have prepared a workshop for you where you can try contributing to our open source service. It will be just the front-end, not the back-end, and yeah, we invite you there. For any successful contribution, we have prepared some nice presents for you, and with the people who come there, we would really like to think about how we can extend the process to the back-end as well, for example, so that anyone can make a contribution and experience it very easily and not build thousands of Docker images or whatever, and try to set it up themselves. So please come by. Perfect, perfect. I see Neil over there. Hey. Roberto, you're doing a talk with Marcel here, Torsten as well, right? You're joining the session too? No, just Marcel. What is your talk about? Yeah, if you are wondering how you can start contributing to these open source services, focusing, for example, on the cloud services, well, Marcel Hild and I will be talking about how much open source is in the cloud services, and we will analyze and discover the different open source projects that are used in real production, managing the different managed clusters that Red Hat cloud services manages across the world, and we will also present these different repositories and projects that you can get to and influence, and you can get inspiration, or not, for your own projects, so you can get some ideas in order to make them better. Perfect, perfect. Simon, you're going into more detail about this small icon, right? Is it just about that, or is there more? Yeah, that's it. So my talk will be very short, it will last like two minutes maybe, max. Yeah, no, so my talk is on Sunday afternoon, so the first challenge, of course, is for you all to be there, because it's Sunday afternoon. Well, that's a challenge, everyone. So people don't know that your service is open source. Many of you run open source services. No question, you meet those requirements. People can contribute, and your code is shared. But everyone who's working with it, the users, don't know that it is open source. They don't know what code you're running, and they don't know how to contribute. So come to Simon's talk. Thank you, yeah, very much so. That's what you're doing? Okay, I will make it longer than two minutes, but not too long. Yeah, but the idea is, of course, you know, to figure this out. We have a proposal, we're piloting something already, you could see it in the screenshot. It's real, it wasn't a mock-up. Yeah, and I'm very interested to hear your thoughts. Yeah, thanks for the spoilers, Stef. The important thing is that during the conference, you'll find some other talks. I was going through the schedule. I found these interesting talks from Elad and Nikolas. You guys should join these. David, I see you over there. You're talking about some of the services as well.
I've seen some other people here who are passionate about this topic as well, so go grab them too, join their talks, challenge them as well. That's the main thing here. And we've thought about doing some sort of competition around you contributing to open source services. I think we should still do this at some point, because we'd love to see your contributions as a result of DevConf. So after DevConf, if you contribute to any of the open services here and you send us your patch, we might actually send you something, some gift, if you do so as a result of this keynote, right? Then we know we're successful. That's the way, right? Exactly. And let's not pretend that this is easy. Solving this paradox is hard, but we do hard things, and we do them amazingly well, and at scale. We have changed things before: open source wasn't an easy thing either. We all worked hard to make that true. This challenge is an exciting one, but it's difficult. And when you think, oh, this is a problem that's going to prevent this from progressing, for example, with accessing data, that one comes up a lot, that just tells us that we need to work together to solve that problem. Join that workshop, start to figure out with Tomas and with Neil how to solve that aspect. Bring an idea, work together on this, and we will be able to solve it together. We'll solve this paradox. Perfect, perfect. I'm going to say these are the famous last words, right? So again, join us in these sessions. If you have any questions right now, we still have a couple of minutes, so bring them up. Okay, there you go, first. Just yell it, and we'll repeat it. Just quickly repeat that, please. So the question is, how do you solve the development environment problem, where you can quickly bring up your change and not have to wait days for an environment that you can run it in? I would say we already have the solution: it's containers. I mean, we have the recipes for how to build them, so create them, and then use tooling to get them all together, hopefully even with that production data inside, and you can have the development environment locally. For me, this is already solved. I know that it can still be improved over time in the future, but for me, it's done. So I'm really curious what everyone here thinks about it, so feel free to approach me about it. All right. The truth is that, as you mentioned, some of the services are too huge to be run on your laptop. You need a playground, you need a testing environment. So join this workshop, because that's literally the thing that needs to be solved, and that, like Tomas said, already has a working solution for the front-end, but we really need a deeper solution. That is literally the nut to crack. Totally agree, and I'm glad we're working on it together here. All right. I see one more question in the back first. I'll get to you, Lukas. Someone was raising their hand. You had your hand up, yeah. So it works for some projects. That's good to hear. Okay. Lukas. Perfect. Perfect. Thanks, Lukas. Anything else? Any last questions? Oh, there's more. Okay. Hold on. I'll hand the microphone over to you. Thank you. I noticed that licensing didn't come up much during the talk. Do you think there's a perception that licenses for open source services are maybe not so effective? If there is a perception like that, should it be changed? How could it be changed? How do you think open source and free software licensing fits into the picture that you're painting here? This is a good topic.
So the challenge that we have is not that open source licenses don't work. They work fine. Open source licenses work. The GPL works for services. There are licenses that are even more aggressive for services, like the AGPL. But they are insufficient. Because you don't copy, or you don't need to copy, the code that's running in a service, they are not the enforcing mechanism; they're not the thing that actually underpins open source in services anymore. And that is the problem. So they're good, but not enough for what we need to pull off here. I think that part of the challenge is also that, for someone who contributes, let's say, to a live system, say the contributor is a bank or something like that, they want to really make sure that what they have contributed is actually what is running on the other side. So I think that if you want to do that correctly, you also need to have a really good chain of trust all the way. And I'm not trying to advertise the talk that I'm giving in one hour, but just in case. So chains of trust are important. I think we are starting to know how to build that at the VM level, at the workload level, etc. I'd like to advocate for building that into the programming languages themselves, and that's hard. Don't get me started on that, but that's really something that we need to put some emphasis on. So really integrating that at the low-level API level, when you have an RPC mechanism, so that this RPC mechanism can not just send data over, but send a proof that the other side of the API was this version, that it executed in this environment, and give you a cryptographic proof that this went well, otherwise it fails spectacularly and you know it failed. Okay, thank you. So this was a very good advertisement for your talk. So thank you for solving one of the challenges. Yeah, data security, data residency, these types of things, those are all challenges for services. And yes, we do have some additional talks about this type of problem here as well. Folks, anything else? Any other question? If not, we'll let you go right now, grab some coffee. These are going to be three exhausting days, I can promise that. And thanks a lot, you've been a great audience.

I guess we can get started. So hello everyone. My name is Vitaly. I am from the virtualization engineering team at Red Hat, and I work on various public clouds and third-party hypervisors, making sure Linux is a first-class citizen there. I am also a sub-maintainer in KVM, working on things like Hyper-V enlightenments and various x86 things. Today I'm going to talk about supporting the new thing in the cloud, confidential VMs, and what we can, or should, or must do in Linux to make this all work. So if you've been following the cloud landscape over the last couple of years while sitting at home, you could have noticed that all major hyperscalers released something which is called a Confidential VM. I think Google was the first one with an AMD SEV option in 2020. AMD SEV gives you memory encryption and basically that's it. Then we have an offering from Microsoft Azure, since last June it's even in GA, and they do AMD SEV-SNP, which is an advanced version of AMD SEV that not only gives you memory encryption but also register encryption, or register protection, and integrity protection. We now have Intel TDX, which is a technology very similar to AMD SEV-SNP from a customer's perspective. The implementation is very different, but the customer-facing characteristics are similar.
Amazon has just released an SEV-SNP option for their existing C6a, M6a, and R6a instance types. So you must have been wondering what these confidential VMs are about and what they give you as a user, right? Why are they better? Why would you want to use them? Basically, what they promise is that these VMs will keep your data confidential. It's a bold claim, but what does it really mean? Well, data confidentiality means that nobody but the owner of the data can get access to the data, right? And in the case of VMs there is always the host which runs your VM, right? So confidential VM technology is something which allows you to remove the host from the trusted computing base. You don't need to trust your host anymore, right? So even a malicious or compromised host cannot get access to your data. However, all these confidential VM technologies give you no additional protection from within the VM. So if you have some application running in your VM and this application is compromised, hacked into, it will still allow for data leakage, right? Confidential VM technology protects your VM from the outside, from the host it runs on, but it doesn't protect processes inside the VM. Another thing that you must also remember is that none of the confidential VM technologies give you any guarantee that your VM will actually run, because if the host which runs your VM is compromised, somebody can easily disrupt the execution of your VM, basically terminate it or just not allocate any time slots for it to run. This is always possible, right? So it only protects your data. It doesn't allow data to leak to the host, but nothing else. So when we're talking about data protection, normally we talk about protecting data at runtime, at rest, and in transit. I'll start with transit because it's not really confidential VM specific. We've been protecting data in transit for years, because we never trusted our networks in the first place, right? We are working over public Wi-Fi, or who knows who administers all these routers on the way of your data. So there are cryptography-based solutions for how to protect your data while it is in transit. What confidential compute technologies give you on a CPU level is something which helps to protect the data at runtime, right? Because when you have your app with your confidential data being executed somewhere in the cloud, where is your data? Your data is likely in memory, because you need to process the data. You have some of this data in CPU registers, because you cannot really operate on memory all the time, right? To perform an operation on your data you need to put it in the CPU. And these confidential compute technologies, which are basically newer CPU features, allow you to hide all this from the hypervisor. As I've already said, AMD was pioneering this with SEV. And yes, I have to make a caveat that I'm only talking about the technologies which protect the whole VM. There were technologies in the past which were designed to create confidential enclaves, and these are technologies like Intel SGX, but they don't protect full VMs. They protect certain applications which are specifically built to work with these enclaves. So there was AMD SEV, and AMD SEV protects your memory, so your memory is always encrypted and only the guest can decrypt it.
However, it gave you no protection for CPU registers, so whenever your data is in CPU registers it can be seen by the hypervisor, which significantly limits the protection, because basically the hypervisor can stop your VM every cycle and take a look at what is in the registers. So it's really not hard to steal your data, even when the whole memory is protected, just by observing CPU registers, because once in a while you will have all your data in CPU registers. Then AMD came with an upgrade which was called AMD SEV-ES, which is Encrypted State, where they also hid all the registers. Now the hypervisor cannot observe the registers. What was missing there is integrity protection. So the hypervisor can do things like: imagine you have two encrypted pages and the hypervisor can just swap them for your VM. So it cannot see the data, but the behavior of your VM is likely going to change, and you may, for example, do I/O from memory you didn't want to do I/O from, right? So the latest, SEV-SNP, gives you these integrity protections, so the hypervisor will not be able to do any of this. And last but not least is protecting data at rest, because once you've done processing some piece of data you will likely be putting it to storage, and the storage comes from the host. You must be sure that it's also protected. So how are you going to do that? Yes, as I've already said, there is support for almost all of these technologies on the guest side in the Linux kernel nowadays. There is still something missing on the Microsoft Hyper-V side to support these technologies with their hypervisor. It's coming. The thing is that Microsoft always do things differently from the rest of the world, but yes, we are getting there. Now I'd like to talk a little bit about protecting data at rest. First, let's think about what we really want to protect. It's kind of obvious that you want to protect your sensitive data. The data your application is processing, this must be confidential. This is great, but then you have your operating system, and if we are talking about a generic Linux operating system, it has a number of things which you would like to protect from the host. Some of this data, think about binaries: they are built from open source, they're not any secret. Everybody can get the same binary of bash as you have. You don't want to hide it from the host, but you want to at least write-protect this data. But some of the data, even operating system data, must actually be read-protected from the host. Think about your SSH host key. If the host is able to steal this, it can try to impersonate you and present some other VM which will look exactly like your VM. So for general purpose operating systems, to resolve this problem we normally want to do full disk encryption, because encryption also gives us integrity protection. Even though we have plenty of stuff which we don't need to hide, it's easier to think about it as: let's just encrypt everything and be done with it. We've done this with Linux for years. We have things like LUKS, which is great, and which works, and which is able to create an encrypted volume for you. You can be using this on your laptop: in the installer, you provide a password and then everything is encrypted. So what needs to be done there? The problem with confidential VMs in the cloud is that you need to think about how the guest is going to get the password or the key or anything, because these must also be protected from the host.
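For reference, on an ordinary machine that LUKS workflow looks roughly like this (a sketch; the device name is an example and the passphrase is entered interactively at the prompt):

  cryptsetup luksFormat /dev/vda3              # create the encrypted volume, prompts for a passphrase
  cryptsetup luksOpen /dev/vda3 cryptroot      # unlock it, prompting for the passphrase again
  mkfs.xfs /dev/mapper/cryptroot               # put a filesystem on the unlocked mapping
  mount /dev/mapper/cryptroot /mnt

The question for a confidential VM in the cloud is how to replace that interactive passphrase prompt with something automated that is still safe from the host.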
So you may want to do something like: when my VM boots, I'm going to go to the console and enter some password and decrypt my root volume. But first, it's really inconvenient. You don't really want to have to go to the console every time you're starting a VM in the cloud. But that's not the main problem. The main problem is that this is inherently insecure, because this virtual console is emulated by the host. So the host can easily see what's going on there. So once you enter this password on the console, you're done. The host can easily get to your volume after that. So we must come up with a way for the guest to receive the key in an automated fashion, but we need to make sure that we are giving the key to a true confidential VM, so that, for example, its memory is fully protected from the host and the host won't be able to steal it. And we must be sure that everything which was executed in this virtual machine before that point is trusted, because if the host managed to inject something untrusted into your boot chain, then you cannot give any sensitive information to such a VM, because it will be stolen from you. So how are we going to do this? Namely in Linux. So let's take a look at how Linux normally boots. That's a very high level, generic picture of what's going on. You have some platform firmware, which in most cases is UEFI firmware for x86. Then you will have a bootloader, or actually a chain of bootloaders, there. After that you boot your Linux kernel. The Linux kernel usually takes an initramfs, and then it has all the drivers to mount your root volume, switch there, and start executing from there. So we want to do full volume encryption. Obviously we cannot encrypt everything, because then we get into this chicken and egg problem: if the whole disk is encrypted, who is going to decrypt it? You still need to have some code which is going to decrypt it, and you cannot delegate this to the host. You don't trust the host anymore. Of course there are solutions like host-based volume encryption, but you cannot use them in confidential VMs. So which options are on the table? The first obvious option is: let's basically keep our bootloader unencrypted and encrypt the rest of the system. Maybe good from some perspective, because we are reducing the unencrypted surface, the code which we must verify in some different fashion, as we cannot rely on the full disk protection for the parts which are not encrypted. But then we need to do quite complex tasks in the bootloader, and normally we want the bootloader to do as little as possible. So another option would be to make Linux do the unlocking. So let's keep Linux and its initramfs unencrypted, and the main advantage is that we can use standard Linux tools. Linux already knows how to work with encrypted volumes; we don't need to do much. But then we get into a situation where all these three artifacts need to be somehow verified, and this includes the Linux initramfs. So summarizing these two options: we can do the unlocking in the bootloader, but then we need to have a bootloader we trust, and in the open source world we unfortunately don't have many options for that. We would have to teach the bootloader new tricks, like working with complex devices like TPMs in the way we want. The problem is that when we are writing a bootloader we cannot use any existing library from Linux, because it's a very different environment. We would have to either borrow the code into the bootloader and support it ourselves, or write it from scratch.
Usually we end up writing everything from scratch in the bootloader, saying we are going to get away with something very simplistic, then it grows, and then it has its problems, and remember that this part needs to be completely trusted by you, because this is what's going to protect your data. Unlocking from Linux and using standard Linux tools seems to be a bit better, and at least some major Linux vendors seem to be converging on this, because we don't need to reinvent anything, we already have the tools. But then, as I showed you, we have the bootloader, we have the Linux kernel and we have the initramfs, which all remain unencrypted. We need to make sure that they are trusted somehow. So how are we going to do that? Well, how do we usually check the integrity of the boot chain in Linux? We have two main technologies. One is called secure boot, without a space; the other is called measured boot, with a space. Don't ask me why. So, a crash course for those who don't know what these things are. Secure boot is basically establishing a trust chain from the hardware, where we check the signature of every binary we load before executing it. We start with the Microsoft certificate which is embedded in hardware, then we load some bootloader, then the bootloader loads the kernel and so on, and every artifact which loads the next one is supposed to check the signature on it, that it's a good thing. Then we have measured boot, and measured boot means that everything we load, and every significant fact about system boot, gets measured into a TPM, which is a chip in your system, either physical or virtual, which can basically record a sequence of events. And this is great, but none of these currently cover the initramfs or the kernel command line, for that matter. So to make things better, in the systemd community people came up with a concept called the unified kernel image, and a unified kernel image is a very simple thing. It is basically: let's take our kernel, initramfs and kernel command line, bundle them all together in one UEFI binary, and sign it; that's the most important part. Once we do that, if we load this and the signature is correct, we can trust this whole thing, right? Sounds great. So how do we build it? We take something which is called systemd-stub, which is a really, really simplistic stub loader which will basically extract from itself the kernel, the initramfs and the command line, put this all in memory, check that the signature matches, and launch it. Nothing else, right? We can use secure boot and now also cover the initramfs, right? So we can fully trust it. We are going to grant the key to our data to it. On Fedora and RHEL systems, if you use them, you can build a UKI with the standard dracut tool, the same way you are building your initramfs now, but you can say --uefi and it will build a unified kernel image for you, and as it's a UEFI PE binary you can load it directly from your firmware or from something like the shim bootloader. You don't need complex things like GRUB. Although we are actually working on adding support to GRUB to load UKIs even on BIOS-booted systems, just to be able to reuse the same unified kernel image everywhere, including on BIOS-booted virtual machines, right? Because otherwise we would have to provide two separate images, one for UEFI and confidential VMs and another one for BIOS-booted machines. We don't like that. The patches are around on the mailing list, and eventually we hope they are going to be merged into GRUB. So how does this all work, right?
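To make the dracut command just mentioned concrete, building and signing a UKI could look roughly like this. This is a sketch: the output path and the signing key and certificate file names are examples, not fixed conventions.

  # build a unified kernel image (kernel + initramfs + command line in one UEFI PE binary)
  dracut --uefi --kver "$(uname -r)" /boot/efi/EFI/Linux/uki-$(uname -r).efi
  # sign it for secure boot, e.g. with sbsigntools and your own enrolled key
  sbsign --key MOK.key --cert MOK.crt --output uki-signed.efi /boot/efi/EFI/Linux/uki-$(uname -r).efi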
Just to give a high level picture of how we boot and how we get the key, right? So we start with the UEFI firmware, and then we rely on secure boot. We check the signature of the first-stage bootloader, which in Linux is shim. Shim is signed by Microsoft, if you didn't know. It doesn't change very often, it's very simplistic, and its purpose is basically to carry a vendor certificate in it, so we don't have to go to Microsoft with every kernel build we do, and every UKI build we do for that matter, right? So as I told you, yes, you can boot your UKIs directly from the firmware, but if you want to do it with secure boot enabled you would have to go to Microsoft and ask Microsoft to sign your UKI, which may not be what you want. So we do it through shim, the same way we do with GRUB, but then from shim we check the signature of the UKI, which in our case, for example for Fedora and RHEL, is going to be signed by Red Hat already. Every binary we load gets measured into PCRs. These PCR registers, and I'm sorry for not explaining much about what PCRs are, but think about them as extendable hash functions: the previous value gets extended with the next one, so the only way to arrive at the final value is to go through the chain of extensions with the same hashes. When the UKI boots, at this point we think that we are safe; we know all the measurements, so we can create a policy which basically says, okay, if your system is in this state, this is a good state we know, and then we can give the root volume key to this VM. A great concept, but it comes with limitations. Your initramfs is now static, right? Previously you were building it on the target system; now it's built by your vendor, so you cannot put more stuff there anymore, or I mean you can, but then you will probably have to sign your own kernel, right? Which defeats the purpose if you do it. So for example now in RHEL and Fedora we ship a package called kernel-uki-virt, and 'virt' is there for a reason: we put in all the drivers which we think you might need on popular cloud and virtualized environments, like virtio, VMBus, NVMe, stuff like that, and we say this should be enough for the major use cases. If you need more, unfortunately you have to talk to us, you have to open a Bugzilla: please put more drivers in the UKI. Yes, you can rebuild it yourself, literally, but you have to deal with signatures; shim offers you this MOK mechanism, so basically you can enroll your own keys which are going to be used in secure boot mode. That's one thing. Another now-static artifact we have is the kernel command line. Previously we used to put things like root=something on the kernel command line, right? Because we were creating it on the target system. We cannot do this anymore; we need to have a command line which works for everyone, so honestly we can't really put much there, and basically Fedora and RHEL ship with something like console=ttyS0 console=tty0 and that's it, right? So we need other mechanisms for very standard tasks, like how to find your root volume. What's going to be the root volume if we cannot pass this parameter anymore? We rely on some features from systemd which allow for auto-discovery of the root volume. You may still need to modify your kernel command line in some cases.
For example, we realized that we have things like crashkernel=something on the kernel command line. If you don't know what it is: you reserve some memory for a crash kernel, and in case a crash happens you boot into this special kdump kernel, which is just going to save your memory to a file, so you can file a bug with your vendor and say, oh, my kernel crashed, here are the logs, here is the memory dump. And we cannot come up with a one-size-fits-all solution there, because different systems may need different sizes. So there is a recent development in the systemd project: for systemd-stub there is a signed extension mechanism, so basically you can produce a signed file, put it on your EFI system partition unencrypted, and then it's going to be sourced if the signature is correct. Of course these will either come from your vendor, or you will again have to deal with issuing your own secure boot keys. Yes, and I've talked a little bit about the TPM policy for when we release the secret, when we release the root volume key. What would this policy be? What do we need to check to say that your system is in a good state? First, as we heavily rely on secure boot, you must be sure that secure boot was actually enabled on your system, so at least this must be included. Then you must be certain that the artifacts which you trust were booted, and there you have options: you either trust the certificate which signed these artifacts, basically saying everything built by Red Hat is trusted, or everything built by Canonical is trusted by me, and then your policy will only contain the certificate; or you can be more strict and say, I want to make sure that these exact binaries with these hashes were used in the boot, so I only trust these binaries which I know. Currently, for example, for Azure confidential VMs we use the second approach, and the reason for that is that we at Red Hat use the same certificate to sign our kernels and our UKIs, so they are indistinguishable from the certificate perspective. This means that somebody could take our normal kernel with any initramfs he wants and boot that, and that would pass the policy if we just checked the certificate; from a random initramfs he would be able to steal the password for your data. We don't want to allow that, so we actually bind the secret to the hash of the UKI and to the bootloader before it. So, all right, you trust your vendor, or you did everything yourself and put it on the cloud, and your VM has booted. Can you start using it? It must be confidential, right? The web UI shows green, I booted well. But ask yourself a question: how do I know that this is actually the confidential VM and not something completely different? Think about this attack: the host creates a non-confidential VM somewhere else, on non-confidential hardware, but changes everything inside it, so when you log in and ask, am I confidential? it will tell you, yes, yes, I am, I am confidential. You need to find a way to attest the system, basically to prove its properties: that it's a true confidential VM, that it was using the image you expected, that all the technologies we just mentioned were actually used, and that it's not some completely different image.
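Coming back to the policy for releasing the root volume key: on a systemd-based guest, one common way to express such a TPM-bound policy is systemd-cryptenroll. This is only a sketch, not necessarily how any particular cloud implements it, and the device path and PCR selection are examples:

  # enroll a TPM2-bound key slot for the root LUKS volume; the TPM only releases the key
  # if the selected PCRs match, e.g. PCR 7 (secure boot state) and PCR 11 (UKI measurements)
  systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7+11 /dev/vda3
  # at boot, the volume can then be unlocked automatically, without an interactive passphrase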
This process is called attestation. I encourage you to come to the next talk by Kristof, Kristof is over there, who is going to talk about establishing a chain of trust, to make sure that what you are using is what you expect you're using. There are some other solutions, like attestation servers, cloud-based services. There are some from your cloud vendors, like Microsoft Azure, and there are some coming from hardware vendors: Intel is working on something called Project Amber, which is a cloud-based attestation server. And yes, I also wanted to talk a little bit about the remaining potential attack vectors, but I think we are already a little bit over time, so I'd rather go to Q&A right away. I hope we still have some questions. Thank you. Yeah, so to rephrase the question: what do we do about the firmware, the UEFI firmware, right? It's also in the picture, it's what gets started first in the VM, and we need to have some trust in this artifact. Two ways. First, for example, what Microsoft does with Azure today: they have some sort of attestation mechanism in their infrastructure, and the result of the attestation is that the firmware gets the private key, basically the state of the vTPM, loaded into it. So the only way for it to obtain the key is to pass the attestation, which means that whenever we see that there is a vTPM with a certain private key, we trust that it passed the attestation. In this case, we trust Microsoft as an organization that they have established this, that it is not happening on the host, and we trust them on that. Also, I mean, not Microsoft, but I heard that various cloud providers are working on something which is going to be called bring-your-own-firmware, where you would be able to come with your own firmware binary, but then again, you will need to build something like attestation for your firmware: the firmware will load, talk to a server and say, hey, I'm good, I can prove that I'm good, please give me some private state so I can continue executing. So this is a tough question with firmware, but we're getting there, and I mean, cloud providers are actually interested in making these confidential VMs confidential. They're on your side. They don't want to have access to your data, because it helps them a lot to say: we can never get to your data; you can use our infrastructure, pay us, but we have no responsibility. Go ahead. Yes, yes, there are. Yes, the question is what about other, non-x86 architectures, namely IBM System Z. Yes, they do have the second generation of their technology. I personally haven't looked much at how they do things, although we just had KVM Forum and I chatted with them, and they do something very similar to UKIs, they just don't call it UKI, and they were asking me, how can we get in there? I said, just start calling your thing a UKI, because what you are doing is already like a unified kernel, and you are done. Right, and there is also this confidential thing, I forgot the name, in the ARM world, which is coming in the Armv9 specification. So yeah, it's not exclusive to x86, and namely systemd-stub we can now use to build UKIs for both x86 and ARM, at least for what gets booted through UEFI. Yes, it's just that without all these confidential extensions it's a pointless exercise, right? You are not going to do much. Thank you, yeah.

All right, we are on. Okay, so hi everybody. I'm Timothée Ravier and I'm from the CoreOS team at Red Hat. I'm Sherin Kuri and I'm from Red Hat as well, in the customer focus team.
Yeah, all right, we are on. Okay, so hi everybody, I'm Timothée Ravier and I'm from the CoreOS team at Red Hat. I'm Sherin Kuri, and I'm from Red Hat as well, in the customer focus team. Hello, I'm Alessandro from the multi-arch team at Red Hat. And hello, I'm Christian, I'm on the OKD Streams team in the OpenShift org at Red Hat. All right, so we're here to talk about how we're building Kubernetes distributions, and doing that the cloud-native way, using OKD, using pipelines, using Tekton pipelines. So first we start with what OKD is, for those of you who don't know: it's a sister distribution of OpenShift, which is a Kubernetes distribution, and we're not just bundling Kubernetes, we're bringing a lot of things on top. One of the main things that we bring with OKD is that we bring the operating system as well. OKD is based on Fedora CoreOS, so when you set up an OKD cluster you set up everything at the same time: you set up the system, you set up Kubernetes, you set up the applications on top, the operators, all very nice things to make developing on top of Kubernetes much easier, and you manage the whole thing as a single entity. When you update your cluster, you update everything at the same time, including the system. So this is a very nice experience of a managed system here in a Kubernetes distribution, and that's the main thing about OKD. OKD is a community project; like everything here, it's open source, and so far it's been based on Fedora CoreOS. So the core, essentially, of OKD, the operating system, was Fedora CoreOS, which is based on Fedora; it's an official Fedora variant. The main thing behind Fedora CoreOS is that we have automatic updates, where we try to bring new content, updates, fixes, security fixes and new features, all the time as they land in Fedora, and you get them. And the other benefit of Fedora CoreOS is that you are using immutable infrastructure: you are doing provisioning via Ignition, you write a config with what you want to have on your system, and from the first boot you get a system provisioned and configured as you like, with your containers and your configuration in your containers. That's what we've been using. Fedora CoreOS is available on a lot of platforms, a lot of architectures, a lot of cloud platforms; you've got a list here, and I'm not going to repeat all of those. We have support for almost four architectures right now: x86_64 of course, aarch64, s390x, and PowerPC coming real soon.
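As an illustration of the Ignition-based provisioning just described, a minimal Butane config (which transpiles to Ignition JSON) might look like this; the user, SSH key and hostname are placeholders, not taken from the talk:

```yaml
# Minimal Butane sketch; transpile with: butane config.bu > config.ign
# The SSH key and hostname below are placeholders.
variant: fcos
version: 1.5.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA...replace-with-your-key
storage:
  files:
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: my-fcos-node
```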
So let's take a quick step back and look at the enterprise Linux ecosystem. If you look at the distributions that we have, we have Fedora, which is upstream, which is changing rapidly, which has a lot of new features getting in, new changes, experiments sometimes, things that may or may not work, all those kinds of things. This is where the community experiments and does things; it's where we learn new code, new software, new features. That's Fedora. Then we get things into CentOS Stream, which is a shared space, which is where we want to try to define where the next version of enterprise Linux is going to go: which features do we actually want to have there, which changes, and so on. And then finally you've got the Red Hat product, Red Hat Enterprise Linux and all the other variants, which is a product, something made by Red Hat and sold. But what about CoreOS then? Where do our Fedora CoreOS and RHEL CoreOS editions fit in this picture? So far we had only two: we had Fedora CoreOS, which is based on Fedora, as I said, and we had RHEL CoreOS, which is based on RHEL and is part of OpenShift. And we are now introducing a third one, which is CentOS Stream CoreOS, the version in the middle. It's keeping the same idea of having a minimal operating system with just what's needed, the container stack, what's needed for Kubernetes, and putting that on top of CentOS Stream. So the ecosystem looks like this right now: we've got Fedora CoreOS, the version specifically for running containers on top of Fedora; CentOS Stream CoreOS, which is based on CentOS Stream; and finally RHEL CoreOS for OpenShift. So OKD on FCOS has existed for (thank you) a couple of years now, and basically Fedora CoreOS is two to three years ahead of RHEL CoreOS. That leads to situations sometimes where features land in FCOS and they make some OKD components break, and the maintainers of those OKD and OpenShift components don't really have the bandwidth to look at it, basically. That's really sad. So if we look at OKD on SCOS, it's a win-win situation, because SCOS is going to be about six months ahead of RHEL CoreOS, and the component teams really benefit from running their components on SCOS, because they get a really early signal of what's going to happen when they land on the new RHEL CoreOS for OpenShift. And we are living in really interesting times for this, because infrastructure changes and features are coming very quickly and we want them as early as possible, so we need to deliver the new software, and that means we need to deliver the OS faster now. This wasn't the case a few years ago. If you think about the last version, OpenShift 4.13, we delivered it in nearly the same time frame as RHEL 9, and that was the first time; we're still learning. And we here, and other colleagues, are not from the same team; we just gathered together to make these OKD Streams. Our goal is not to build OKD; our goal is to build the tools that would be needed by the community to build OKD, and that means OKD from the ground up. So starting from CoreOS, any flavor of CoreOS, whether it's CentOS or something else, we want to test up to the OKD components release, and later the operators that run on OKD; why not also go crazy and test replacing some component of OKD with something of your own, to experiment. So when we started on this journey of building these tools, we had an easy option, which was to use Prow CI. I don't need to introduce Prow to you; it's beyond needing any pitches, and it's developer-centric. The problem with it was that the Prow CI setup that we use for building OKD on FCOS is not accessible to the community, and that's why we chose to switch to Tekton. Tekton, firstly, is cloud native, and secondly, it has a very powerful and active community around it; you can see that just by looking at the quantity of tasks that are ready for you to use in the Tekton Hub. And since it's cloud native, all of the resources that you usually use, like secrets, volumes and config maps, are ready for you to use, and they come with a really low learning curve. That's why we switched to it. So here I'm going to introduce two of the pipelines that we have. The first one we're seeing here is the one that builds the CoreOS. It can run on any Kubernetes distribution, including kind if you want. All you need to do is clone the repo that you have at the bottom there, use kustomize to apply everything to the cluster, install the Tekton controller, and you have everything ready to build CoreOS.
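Roughly, the setup just described would look like the following; the repository URL is a placeholder for the one shown on the slide, and the Tekton install manifest is the upstream release one:

```sh
# Hedged sketch of the setup steps described above.
# The repo URL is a placeholder for the one on the slide.
git clone https://github.com/example/coreos-build-pipeline.git
cd coreos-build-pipeline

# Install the Tekton Pipelines controller (works on kind as well).
kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

# Apply the pipeline resources with kustomize.
kubectl apply -k .
```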
Most of the tasks that you see here are based on a container that we call the COSA container, the CoreOS Assembler container. It's basically just a toolkit around building CoreOS, so it comes with everything that you need: a wrapper around rpm-ostree, building extensions (we build extensions for the live ISO, for bare metal, for OpenStack), and that's available for you to use on S3. The next pipeline that we see here is the pipeline that we use to release OKD. Before we get into that: today, so far, the OKD components are still built in the Prow system, but Alessandro is actively working with other colleagues to deliver an OKD payload pipeline, so you'll soon be able to build those components in Tekton as well. This pipeline is fairly simple: it just queries the release controller on the Prow CI cluster to get the tag and digest of a valid, verified release, and basically signs the release, mirrors the components, generates all the release notes and other things that we need, and updates the channels so that you can upgrade your clusters as well. And next I'm going to hand it off to Alessandro. OK, so what Sherin introduced so far are some of the components that we use to build CentOS Stream CoreOS, the base OS. We are working on the pipeline to build the payload itself. And what triggers the pipelines? Today we are still using Tekton as the base for that, and Tekton provides a triggers controller with a few custom resources under the hood, and we are able to define, in an event-based fashion, how to trigger your pipelines. The main resources are the event listeners and the triggers. The event listeners essentially expose an HTTP handler where the event producer can send requests with a payload and the information about what you want to do based on those events. They are made of three other resources, which are trigger bindings, trigger templates and interceptors. Trigger bindings allow you to map from the incoming payload to the pipeline run that we are going to create; trigger templates define the pipeline runs, the task runs, the Tekton objects that you want to create; and the interceptors are able to either map or filter whether or not to run your pipeline runs. The events can be whatever: they can be events from the repository, or they can even be periodic. What we currently do is run a periodic job for building the base OS.
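To make the three trigger resources concrete, here is an illustrative Tekton Triggers wiring; the resource names, the service account and the payload field are hypothetical, not the project's actual definitions:

```yaml
# Illustrative EventListener / TriggerBinding / TriggerTemplate wiring.
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: coreos-build-listener
spec:
  serviceAccountName: tekton-triggers-sa
  triggers:
    - name: build-trigger
      bindings:
        - ref: coreos-build-binding
      template:
        ref: coreos-build-template
---
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerBinding
metadata:
  name: coreos-build-binding
spec:
  params:
    - name: stream
      value: $(body.stream)        # mapped from the event payload
---
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: coreos-build-template
spec:
  params:
    - name: stream
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: coreos-build-
      spec:
        pipelineRef:
          name: coreos-build
        params:
          - name: stream
            value: $(tt.params.stream)
```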
And what we want to get to in the next steps is to leverage these triggers in order to deliver multi-arch images for CentOS Stream CoreOS. What's the problem there? Actually, what you can do now is run the pipelines on a cluster of any architecture; there are no architecture-specific bindings that would prevent you from running the pipeline on another architecture, let's say AMD64. And the SCOS manifests, in the repository from which we get the manifests, the configuration for CentOS Stream CoreOS, have all the architecture-specific information, if any, that you need in order to build for that specific architecture. What you get from the pipelines are separate cloud and boot images, one for each architecture, for bare metal, and separate container-native images, which are single-arch container images. What we want is a single manifest-list container-native image, and we want to achieve this by using triggers, because Tekton does not provide any way to add node selectors or node affinities and select which nodes your pipeline should use for its task runs; you can do that only at the level of pipeline runs and task runs, which are the instances of the abstract pipeline that Sherin described so far. So the simple way is to just run the pipeline two times, let's say for two architectures, from the cron trigger that I was talking about before. But then you continuously get single-arch container images that you want to compose into a manifest list. We want to leverage an interceptor and another event listener, essentially, by feeding a config map each time a build is successful for a given architecture, storing this into an array, let's say, and triggering the event listener so that the interceptor, the filter within it, can understand whether it's time to build and compose the manifest list. When it's time, when all the single-architecture images have been built, the compose-manifest-list pipeline will trigger and compose the manifest list from the single-arch ones.
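The composition step itself can be sketched with standard tooling such as podman's manifest commands; the image references below are placeholders, not the project's real ones:

```sh
# Hedged sketch: compose a manifest list from single-arch images.
podman manifest create quay.io/example/scos:latest
podman manifest add    quay.io/example/scos:latest docker://quay.io/example/scos:latest-x86_64
podman manifest add    quay.io/example/scos:latest docker://quay.io/example/scos:latest-aarch64
podman manifest push   quay.io/example/scos:latest docker://quay.io/example/scos:latest
```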
Still talking about next steps, we are working on introducing the Fedora CoreOS layering approach into CentOS Stream CoreOS, because as of now, what you get when you get CentOS Stream CoreOS 9 is an OKD-specific CentOS Stream CoreOS 9 that has packages and configuration within it which are specific to OKD. What we want to get to is a model in which we become, let's say, first-tier providers of the CentOS Stream CoreOS base images, and we also consume these base images at the OKD level by extending them through a Containerfile like the one that you see at the bottom of this slide, adding layers, packages, configuration, anything which is specific to OKD, to be then published with the OKD release pipeline that we were discussing before. This means that we get to a model in which we have this base image and any second-tier provider can consume it and do whatever they want with it. You can essentially leverage any of the features offered by rpm-ostree, with the robustness delivered by the CentOS Stream base, and deliver them to users using container registries as the transport mechanism. As Sherin was saying before, we are doing all of this work to be cloud native, to be Kubernetes native, and if you want to try out our pipelines, you can run them locally on any kind of Kubernetes cluster, for example on kind. What we usually do at production-grade level is to run it on a host provided by the MOC Alliance, in a collaboration that Christian will now introduce. Yeah, so we have our build farm on the MOC Alliance, which is the Mass Open Cloud, or Massachusetts Open Cloud, Alliance. MassOpen.cloud is their website, and we have started essentially a little joint venture, a collaboration with them, from the OKD Working Group, the CentOS Stream Cloud SIG (or the CentOS Cloud SIG), the OKD Streams team within OpenShift, and then the MOC Alliance. The MOC Alliance is a research-focused, education-focused cloud, so what they essentially provide is infrastructure for their students; it's a group of universities and other research-related projects that make up this alliance, and they use it for a whole lot of things, their main thing being providing infrastructure for their students and their folks to run things on, experiments or anything, really. They essentially approached us needing the ability to spin up OKD clusters on their infrastructure on demand, so their students could spin up a cluster very quickly, and we've been working on enabling this (oh, I hope you've been hearing me), but we've been working on enabling this. And really, well, they were donated, I think, about 2,000 servers, and they racked them up somewhere, turned on the power and gave us an IP range, essentially, to go crazy and enable this thing. Under the hood, what they did is put an OpenStack-like API, called ESI, Elastic Secure Infrastructure, on top of their bare metal pools to manage those bare metal machines. So what we've ended up doing is essentially implementing a new platform for the OpenShift installer. We are actually aiming at using the agent-based installer to do this, which is kind of the new way of installing OpenShift on really any kind of platform, and specifically on bare metal platforms. So they provide us with the bare metal infrastructure, with an on-demand API that we can provision nodes with, and then it's essentially our task to run the OKD installer and spin up a cluster. And not only have we enabled this platform, we have also been given a build-farm cluster to use ourselves. So essentially we are now home on the Mass Open Cloud; our cluster was actually down last week for maintenance, and it was brand new, so we hadn't configured it properly with everything, but I think we're almost there again, the build infrastructure is almost up again, and we'll essentially use that as our official build farm for the CentOS Stream CoreOS artifacts, which is the OSTree native container image: just the OSTree shipped as a container image. It includes the kernel and everything, but you can actually manipulate it and change it like a container, and we just saw that earlier on this slide, the CoreOS layering: you import the OSTree with a FROM directive here, and then you run rpm-ostree within the container, and that spits out another OSTree native container image, which you can also run as a container, or you can actually tell rpm-ostree to rebase your whole operating system to that OSTree that was shipped within the container image. So lots of interesting stuff.
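The layering pattern just described looks roughly like this; the base image tag and the extra package are illustrative, not what the OKD pipeline actually installs:

```Dockerfile
# Hedged sketch of CoreOS layering: derive a new OSTree native container
# image from a base CoreOS image. The package added here is illustrative.
FROM quay.io/fedora/fedora-coreos:stable
RUN rpm-ostree install htop && ostree container commit
```

And on a running machine you could then, roughly, rebase to the resulting image:

```sh
# Rebase a running rpm-ostree system to the layered image (placeholder ref).
rpm-ostree rebase ostree-unverified-registry:quay.io/example/my-layered-coreos:latest
```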
We are now part of the MOC Alliance with CentOS Stream and OKD, and really, not only is our build farm going to be there, there's also going to be a secondary build-farm cluster for OKD Working Group members. So anyone who's participating in the OKD Working Group and has some experiments to run will be able to request a namespace on that secondary build-farm cluster. And then, even more importantly, the ephemeral clusters, which is end-to-end testing: we will be able to run end-to-end tests on that platform. And also, obviously, since they're ephemeral, or available on demand, that's also what the MOC university students will use, spinning up a cluster, having it run for as long as they need it for their project, and then tearing it down again. So, to give you the entire picture of the whole cake (it's all just pieces of cake here, multiple layers): the base layer, the infrastructure, is the Massachusetts Open Cloud and OpenStack. It's not a full OpenStack thing; it's an OpenStack-like API that actually manages bare metal nodes. Then we use OKD's agent-based installer to set up clusters on top of those nodes, and especially for our build farms we use GitOps with Argo CD, and that runs our Tekton pipelines. And yeah, that's what we run. That's essentially our product, or what we own as the OKD Streams team: we make the OKD pipelines that produce the OKD artifacts. We don't necessarily own the operating systems therein; we just own the build system, and we want to really enable a kind of self-service. It must be super easy for anyone to replicate a build or to change something up, replace an RPM or add additional RPMs, and run an rpm-ostree compose yourself. And this is with CoreOS Assembler, and this is what our pipeline does. So you can create your own rpm-ostree-based operating system yourself, or a derivation of Fedora CoreOS or CentOS Stream CoreOS, or even, if you're a paying Red Hat customer, RHEL CoreOS. So it's a really flexible and powerful tool, a toolkit, really, and we're looking forward to you using it, giving us feedback and telling us what's not great yet, what needs to change. And also, obviously, we'd like to hear what works well, what you like. So this is essentially the call to action. If you're a developer, you might want to try OKD on top of SCOS, CentOS Stream CoreOS, because it's more stable than the Fedora variant, just because, as was mentioned, the kernel isn't as far ahead as the Fedora one. For staging, if you already have an OpenShift cluster in your company and you want to run a preview of what's going to come to OpenShift in six months or so, you can use a CentOS Stream cluster for that. And then obviously for labs, for any kind of experiments, for students, educational projects, open-source community projects: this is where we want to see OKD used, really. So with that, I think we're through and we can move to questions. Thank you very much. Any questions? Don't be shy. There's a question. Yeah, I think there are probably multiple reasons. Oh, yeah, so the question is why we chose to go with a bare-metal-based infrastructure to set this up instead of a public cloud. And I think the reason is that we just have a really good connection to the MOC Alliance. They need us to implement this, and we need them for the free infrastructure, right? They don't charge us for using this; it's a quid pro quo, essentially.
And if there were a public cloud operator who would give us those resources and say, look, you don't have to pay for it, just do the work, we'd probably do that too. But this was just the first kind of joint project that we did in this way, and since we're from the community side of OpenShift and the MOC Alliance is an educational research project, it was just a perfect fit, yeah. So, Argo CD, oh yeah: the question was, why do we have Tekton on top of Argo CD? Why isn't it the same layer? Is that right? So essentially, Argo CD, the GitOps operator, which is added as a GitHub app to the repository, watches the contents of our GitOps repository and applies anything, whether it's a Tekton resource or any other kind of resource, on the cluster it's supposed to land on. So really, Argo CD is a more agnostic controller for resources in a GitOps-y way. It's not just Tekton, but we do control the Tekton pipeline runs through Argo in that way. You can... oh, the question is, can I use this to deploy microservices, my own microservices, right? And well, yes, you will be able to in the future. As I mentioned, we are still ramping up our build farms and the community cluster on the MOC, but once we have the MOC community cluster up, you'll be able, as an OKD Working Group member or an interested party who presents themselves to the OKD Working Group and says hello, to request a namespace and then use that to run your experiments. We don't have any rules yet on how long these things can run and how compute-heavy they can be. I guess if it's abused at some point, we'll kick people out again, but we don't have any tenants yet and we're looking forward to our first tenants. So essentially, yeah, if they have something they want to prove out, make a proof of concept or run an experiment, the community build cluster would be the place, yes. How many architectures are we building? Currently, we only build x86, because we don't have access to ARM builders. We don't do virtualized builds, so it's all native builds, and we don't have access to ARM machines at the moment. That is something we've been wanting to do for a while now, adding multi-arch builds with ARM as a second architecture. It's hopefully coming soon, but we still need to find an ARM builder to actually run those. Yeah, we could definitely do that; it's just that we have essentially decided not to pursue virtualized builds, since all of our productized downstream builds are also native builds, and we just don't want to do something that we can't use downstream. So all of this is also meant as inspiration for our colleagues within the OpenShift org, to show them that you can use these tools to build OpenShift or OKD instead of the ones that we have internally. It's not meant to replace anything, but it's meant to show folks that yes, we can change the build system, and then see progress there and move forward. And with the Tekton tasks, that's just one reason why we chose Tekton: it's so easy to share these tasks. There are Tekton bundles, or catalogs, that you can reference, so the actual task YAML doesn't have to live in a local repository. You can just reference it, and the actual task definition or pipeline definition can live somewhere else. So that's really, really flexible.
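For illustration, referencing a task from a bundle rather than a local YAML file can look roughly like this with Tekton's bundles resolver; the registry reference below is a placeholder:

```yaml
# Hedged sketch: a task reference resolved from a Tekton bundle.
taskRef:
  resolver: bundles
  params:
    - name: bundle
      value: registry.example.com/tekton/catalog/git-clone:0.9
    - name: name
      value: git-clone
    - name: kind
      value: task
```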
Any other questions? All right, I think that's it; thank you again for having us. We have a big break now. Okay. Hi, everyone; it's after lunch, so we are going to walk you through hot plug in virtual machines today. My name is Eddie. I work for Red Hat on a project called KubeVirt. And this is Andrea; he's also from Red Hat, and he's working on a project called libvirt. Who knows it? Anyone heard about libvirt? Oh, everyone, you're famous. Anyone heard about KubeVirt? So if you know KubeVirt, then you also know Kubernetes, so we are good. Okay, so a little bit of background and context. In the beginning we just had a virtual machine, and life was really, really simple, right? We just had to manage that one. Then we had many virtual machines on many nodes, and it was getting difficult, so we invented management: we had to manage the virtual machines, and there are projects that you may already know, like oVirt and OpenStack, that manage virtual machines, and others. And then came containers, which are a kind of soft virtual machine: lighter and nicer, and you can run just an application inside them. And the same phenomenon happened there: we had a lot of containers, so we had to manage them as well. Then came the big players and invented Kubernetes, and Kubernetes started to manage pods, which are the lowest entity there. The pods contain one or more containers, something like that, so you can consider them as containers as well; it was just a specific implementation. And then came KubeVirt and said: the ecosystem for managing Kubernetes is very similar to the ecosystem for managing virtual machines, so let's put VMs in that ecosystem and put them in pods, which sounds ridiculous, and combine them both. So we use all the scheduling, all the nice management features that we had for pods and the ecosystem, to do the same thing for VMs. This is KubeVirt. And from here on we will work towards the hot plug topic, but first, just to expand: in order to define a virtual machine in KubeVirt, we just define a manifest, which is like a specification, and the whole system creates the virtual machine inside a pod for us. It is powered by the usual stack; I mean, OpenStack and oVirt in the past, and KubeVirt as well, implement virtual machines using libvirt and QEMU, because you already know what that means. If you look at this slide here, we have three levels of abstraction: we have the manifest, which is how KubeVirt looks at the virtual machine; then we have libvirt, which manages the lifecycle and is an abstraction API over QEMU; and we have QEMU itself, which is actually the application that emulates the virtual machine for us. This is how a virtual machine manifest looks in KubeVirt. I will not get into the details; it's just an example here. It's declarative, and that's the whole point of Kubernetes.
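A minimal VirtualMachine manifest of the kind shown on the slide looks roughly like this; the name, sizes and container disk image are illustrative, not the example from the talk:

```yaml
# Minimal KubeVirt VirtualMachine sketch; values are illustrative.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm
spec:
  running: false
  template:
    spec:
      domain:
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 1Gi
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/containerdisks/fedora:latest
```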
So we got to the hot plug thing. Previously, if I take oVirt as an example, we had support for hot plug there. Does anyone know why we needed hot plug in the first place? For? No, not for that; why? Why do we even need it? Okay, but why is it useful for someone to hot plug something in the middle of things, like taking a physical machine and putting a PCI card inside it while it is running? Yes. So I think one of the main use cases where someone wants to do a hot plug is, in networking for example, you just want to connect to some other network on the fly, or you want to change some network parameters that you cannot change from outside the VM, like in the external network. That's one option. Or maybe you want to add more disks to your virtual machine, and for all of these operations you don't want to disturb the application that runs in your guest; you don't want to shut down the VM and then power it on again, so you want to do it on the fly. It also allows you to scale things later: you could start with something small and then maybe you find out that you need more things, like more disks, more storage, so you want to hot plug things in to get that. And it's not limited to interfaces or disks; it can be CPUs, it can be anything. So what are our challenges with hot plug? There are a lot. It starts with Kubernetes itself. If we are talking about devices, PCI devices for example, the easiest example I can give is SR-IOV: if I want to push an SR-IOV device in, I first need to move it inside the pod, right? So the VM can consume it. In Kubernetes there is a device plugin, a component that allows us to specify a specific device and ask for it to be moved inside the pod so it can be consumed. And in networking, for example, we also have another part, the CNI. The CNI goes into the pod network namespace and can configure it with all the needed tweaks to have the network interface inside the pod and for it to access the node. But this is a Kubernetes thing: the device plugin can only act at the start of the pod. You cannot do it while it is running; once the pod is already active, you cannot use the device plugin anymore. And with the CNI, recently, in the last half a year or a year I think, using Multus, we can now hot plug things into the pod while it is running. So this is the new thing, but it is hot plug. We have a way to overcome the device plugin problem, like with SR-IOV, for example: in KubeVirt, what we do is unplug, I mean, we need migration. When we do migration in KubeVirt, the destination pod is created and we can do everything else there, so the device plugin can work on the target node. So what we do in KubeVirt for SR-IOV, for example, is unplug everything at the source, all the SR-IOV interfaces and devices, and plug them in on the target afterwards. It's kind of a workaround for hot plug, using migration. Now, the KubeVirt challenge. I'm not going to get into this mess here, but KubeVirt has a lot of components, and if you want to do one thing there, you will most likely need to synchronize them all. For example, here, the request comes to a component called virt-api, and from there it goes to the manifest; the manifest gets reconciled, and virt-handler is asked to start doing some privileged stuff on the node itself. And, for example, virt-handler needs to go inside the virt-launcher that you see inside the pod and do the networking stuff. Then it reaches virt-launcher, which is the KubeVirt representative in the pod that does all kinds of things. And there, the continuation of our talk today: it touches the domain configuration, it talks with libvirt in order to do whatever is needed. We are going to talk mainly about this part from now on; Andrea will continue. So, yes. Quick switch. Quick... it will not be quick, probably. So, can you hear me fine? Yes, good. So, when it comes to hot plug at the libvirt level (by the way, if you have any questions, raise your hand; we'll have time for questions later as well).
So, the problem when it gets to libvirt, the problem with PCI hot plug, is that it requires planning. This is the case for a Q35 machine type, which is the default in KubeVirt and the recommended one. You cannot just hot plug devices willy-nilly; you need to prepare for it in advance. The way this works is that you have your machine, and this part here we can consider part of the machine, and nothing here can be hot plugged. So you have your root bus, that's the PCI root bus; it cannot be hot plugged. You can plug devices into it, but those devices will be considered integrated devices, so you will not be able to hot plug or hot unplug them. In order to have hot plug working, what you need to do is add some additional PCI controllers called root ports. You plug those into the root bus, and those cannot be plugged or unplugged, but the devices connected to them can, and at that point you have hot plug, which is what you want. So here we have two devices that can potentially be unplugged at runtime. If you want the ability to expand your virtual machine later down the line, you just create a few spare root ports, as many as you need, and then you can do hot plug. When it comes to libvirt and how it facilitates hot plug on Q35, it does a bunch of things for you. If you have this XML, which is a very simple XML that describes a single network interface, you can take it and provide it to libvirt, and libvirt will add some other XML to it. All of the stuff in yellow is what libvirt adds automatically. It's a bit complicated, so we're going to go through it step by step. The first controller is the PCI root bus that we were talking about, in blue. On top of that, you have one root port, and then you have your device. All of the stuff with address type is just information that libvirt needs to record the relationship between the various devices and controllers, basically the vertical lines. So this happens automatically: you provide the device, you get the PCI controllers. So that means hot plug works, easy, right? No, of course that's not the case. There is a problem with this, and can anyone guess, did anyone spot the problem? Go on? Right, close. So, yeah, I'm going to repeat the question: he said that there is a limited number of slots that could be used for hot plug. Yes, that is correct. More precisely, or more to the point, libvirt can only automatically add PCI controllers for devices that it knows about, and the devices that you are going to hot plug, by definition, libvirt cannot know about ahead of time, so it cannot automatically add the controllers for them. That's why I'm saying that you need planning. So the question is, how do we solve this? How do we manage to convince libvirt to give us all of this PCI controller goodness without it knowing the devices in advance? The solution that we have come up with is that of using placeholder devices. We'll have an example here. This is a standard, very simple KubeVirt virtual machine with just one single network interface, and this will result in KubeVirt generating this XML, which is the same as we've seen before. KubeVirt will also add another interface that is marked as a placeholder; you can see it here, placeholder, right? So when this definition is fed into libvirt, the result is that libvirt will add a bunch of controllers.
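To make the spare root port idea concrete, a libvirt domain XML fragment with pcie-root-port controllers looks roughly like this; the addresses and the extra spare port are illustrative, not the exact XML from the slides:

```xml
<!-- Illustrative Q35 fragment: root bus, root ports, and a hot-pluggable
     device; one root port is left empty as a spare slot. Addresses are made up. -->
<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='pcie-root-port'/>
<controller type='pci' index='2' model='pcie-root-port'/>  <!-- spare, empty -->
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</interface>
```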
Right, so, resulting in this PCI topology. And you can see that there are two root ports, because libvirt realizes that it needs room for two devices. At this point, we take this definition that libvirt has augmented with additional information and we remove the placeholder, but we don't touch any of the PCI controllers. So now we have one empty slot, which is the goal that we had in mind. This virtual machine can now be booted, and it has room for plugging in one device at runtime. We have decided that four is the magic number; you're going to get four. There is no particular meaning behind this number, it's just a small number that we feel will be useful to people without being overwhelming, and we can change it later. So for now it's four. This is what we have implemented today in KubeVirt. Before going with this route, we went through a number of approaches that we considered and ultimately decided not to follow. The first one was to ask the user to manage the controllers explicitly, which is what libvirt users have to do. That is fine for libvirt, where you need to have very detailed control of the PCI topology of your virtual machine, but KubeVirt is a much higher-level tool, so we felt it would be asking too much of users. Users of KubeVirt should just be concerned about how many network devices they want, not about whether those are going to be plugged into whatever kind of controller and all of the requirements that come with it. So we rejected this idea pretty quickly. Another approach that we considered is the use of the PCI bridge controller, which is a PCI controller that looks a bit like the root port, but it has a number of slots in it and all of them are capable of hot plug. So it sounds like it would be a great solution for this problem. However, the slots on a PCI bridge are not PCI Express, they're conventional PCI, and libvirt will not use them by default on a Q35 machine type. You can convince libvirt to use them, but it basically requires you to allocate all of the addresses manually. So KubeVirt would have to get into the business of picking the PCI addresses for all of the devices, which is extremely complicated, and KubeVirt, understandably, doesn't want to get into the business of doing that when libvirt has all of this logic implemented. Plus, the devices would not show up as PCI Express inside the guest. So there are a number of drawbacks, and we rejected this option as well. Another option that we considered was, instead of just saying four, to allow the user to specify exactly how many placeholders they wanted to have. This is actually what we implemented at first, and then we decided that most users should not have to worry about this; we didn't want them to need to worry about it. So we scrapped the interface and just hardcoded four. This is up for debate; maybe we will change our minds, we will see. In terms of future work: this has been merged, as I mentioned, so it works today in KubeVirt. One thing that we could do in the future is take this general concept of using placeholder devices to create PCI slots for hot plug and extend it to other kinds of PCI devices. The first example that comes to mind is disks, as the most obvious one. Today, KubeVirt implements hot plug for disks; the way it does it is through the use of the virtio-scsi controller, which works fine, but there are some drawbacks to it, as well as some advantages, so it's kind of a toss-up. Thank you.
So maybe we could have this extended to disks and make it possible for the user to choose virtio-blk instead of virtio-scsi. It's interesting; we will explore it and see what comes out of it. Another idea is to use a much larger number of PCI slots, like 32, to sort of match what you would get out of the box on the PC machine type. This sounds like a good idea. There are, however, some drawbacks in terms of resource usage: every time you add a PCI controller to your virtual machine, you negatively affect the memory usage and the boot time of the guest operating system. So maybe four is enough and 32 would be too much, but maybe the overhead is not big enough that it matters in the context of KubeVirt, and a number like 32 would give enough headroom that most people would never have to worry about it, while not having such a big impact on performance. So it could be a good development; again, we're going to explore this and see what comes out of it. And this is the end of the presentation, so, any questions? So the question is about memory ballooning. Memory ballooning is a completely different topic, because the memory balloon is a PCI device, but it's a device that you provide to the virtual machine up front, and the ballooning doesn't happen through plugging and unplugging PCI devices; you just inflate and deflate the balloon. So you can create the balloon when you define the machine, and it will just be there throughout the lifetime of the virtual machine as you inflate and deflate it, so you have to do none of these shenanigans. Yes, please. So the question is, can you change the number of PCI slots when you do migrations? Not as implemented today. Theoretically, you could do it: you could have a migration hook that changes this, but then you get pretty deep into the inner workings of the PCI topology and you get to a very low level, so you're basically on your own, kind of, but you can do it. Question: is it a limitation of KubeVirt or libvirt? This is, I believe, a limitation of the PCI spec. You cannot hot plug root ports at the QEMU level, and as far as I understand, this is because of the way PCI works: the number of PCI controllers is detected by the guest operating system at boot time, and once you are inside the guest operating system, it is capable of detecting that new devices are attached to an existing controller, but it cannot figure out that there is a new controller coming in. So, PCI spec, as far as I know; I could be wrong about this, I'm not 100% sure, but I think it is correct. What about CPU hot plug? Again, like ballooning, it's a completely different topic, because CPUs are not handled by plugging in PCI devices. We are aware of some work that is happening with regard to enabling CPU hot plug in KubeVirt. So we know that it's happening, but neither of us knows the details, so I'm sorry, but you can search and probably find the open merge request or pull request, or maybe it's been merged already, I don't know. It's merged. Okay, okay, so CPU hot plug is a thing, just not this thing. Yes. What about making the PCI bridge PCI Express compliant? I think that was raised at some point: why don't we do that? I don't know the answer. There are various PCI controllers. I think, okay, I'm probably misremembering details, but I think the idea behind all of that was that conventional PCI and PCI Express, although they share most of their name, are actually extremely different technologies.
And so PCI is a bus-based topology, whereas PCI Express is a point-to-point topology, okay? And so what you would get by having a PCI bridge that is PCI Express would ultimately be 32 PCIe root ports in a row and nothing more than that. So in that sense, this idea would be the idea of having 32 root ports in every virtual machine instead of four, right? So in a sense we are considering it; it's just a different controller. So the question is whether these limitations are inherent to KubeVirt, or, since they are in libvirt, whether they are shared by any virtualization option built on libvirt, and the answer is yes: they are shared by any virtualization solution built on libvirt and QEMU. And again, I think anything that uses PCI has this limitation, because it's not really an implementation problem in libvirt; in libvirt we are exposing the limitation in QEMU, and my understanding is that QEMU is simply complying with the limitations of the spec. So as long as you're still on PCI, this is what you have to deal with; you've got to plan ahead of time. I think there was a question there, yeah. The question is how you plug a new volume into a running pod, and I'm not the right person to answer that, so I will pass it over to Ed. And not even Ed, so apologies; he's mostly a network guy. Yeah, so we know it's possible to do it, we just don't know the details of it. I know a bit about how it works at the libvirt level, but not at the pod level, I'm sorry. So the comment from Ed is that it's challenging. Do you want me to? No, I just... maybe you could translate this. Sure. Right. Right, so basically, as I was mentioning earlier with the PCI controllers, there's an escape hatch for all of this stuff. In KubeVirt you can expose the domain through a hook: there's a migration hook, which is just for migration, and it's just a sidecar, so you can do sort of arbitrary transformations to the libvirt XML, inject your own custom things. Of course, if you break something, you get to keep all of the pieces. So ideally you will not need to do it, but the option is there in case you have no other options. I think we might be good, unless... last call: any last-minute questions? Oh, okay. And thank you. Okay, so hi everybody. I'm Arik Hadas, and in this session we are going to see how volume populators are being used when migrating virtual machines, and we will try to answer the question that appears on this slide: whether volume populators also fit virtual disks, the disks of virtual machines. A bit about myself: when I joined Red Hat I started contributing to oVirt, the upstream project of Red Hat Virtualization. Then I contributed to KubeVirt, the upstream of OpenShift Virtualization and the successor of oVirt. I had a short break from virtualization in which I worked on other OpenShift stuff, and in 2020 I switched back to working on oVirt, and now I'm working on Forklift, which is the upstream of MTV, the Migration Toolkit for Virtualization. Forklift is an extension to Kubernetes that enables users to migrate virtual machines from traditional, legacy virtualization management systems to KubeVirt or OpenShift. In Forklift 2.4, the available sources are vSphere, oVirt and OpenStack, and these days we are working on adding more sources, OVAs (virtual appliance files), as well as KubeVirt itself, in order to enable migration between KubeVirt deployments. Generally speaking, Forklift is targeted at mass migration; in more detail, the idea is to simplify the process of migrating virtual machines at scale to OpenShift.
With just a few simple steps, which are listed here; let me show you how simple they really are. On the right-hand side of this slide, you see the Forklift UI, the new UI that is integrated within the OKD or OpenShift console, and the process of migration using this UI looks as follows. We need to define a source provider. In this slide, you see the complete form for defining vSphere as a source provider. If we migrate to a remote cluster, not the cluster that Forklift runs on, then we also need to define that cluster, but in this example, let's say that we migrate to the local cluster where Forklift runs, so we will skip the next step and get to the creation of the migration plan. In this form, we need to specify general things like name and description, and also pick the source and target providers as well as the target namespace that the virtual machines will be defined in within the target provider. Once we do that, we click Next and we see a list of the virtual machines that exist in the source provider, along with more information that users need to know before triggering the migration. We select the VMs that we want to migrate and click Next. This takes us to the network mapping view, where we can easily map networks that exist in the source provider to networks that exist on the target provider. With a similar user interface, we also define the storage mapping that maps storage domains, devices, types or data stores, depending on the type of the source provider, to storage classes on the target OpenShift cluster. Once we do that, we will see our plan in the Plans view, and we can simply start it by pressing the Start button. Once the migration is done, we will eventually see the migrated VMs in the target namespace within the target provider, and we can start using them. But what happens in between, during the migration itself? If we migrate VMs that ran on a foreign hypervisor, not KVM, then we need to convert the disks: we convert the format, replace the drivers, replace guest tools, et cetera. Then we need to copy the rest of the data from the disks to the target storage, and then we need to change the configuration of the VM in order to adapt it to KubeVirt. In this session, we will focus on the first two steps, the conversion and copying of the disks. So that's about MTV. Now let's talk about the volume populators feature, which is a recent feature that landed in Kubernetes and OpenShift. The goal, the purpose of this feature, is to import data from remote sources into our cluster. I will demonstrate how it works with resources that were shared in a blog post written about this feature. On the left-hand side, we see an example of a PVC. When looking at its spec, we see the first part of the volume populators feature, which is the data source ref. This allows us to point to another CR, in this case to a CR called example-hello, of kind Hello. We can see this custom resource on the right-hand side of the slide, and specifically we see that in its spec section there are two fields: the file name, which is set to example.txt, and the file contents, which contain a hello world.
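That pair of resources, roughly as it appears in the upstream volume-populator demo, looks like this; the API group and field names follow that demo, and the values are illustrative:

```yaml
# Roughly the "hello" example from the upstream volume-populator demo.
apiVersion: hello.example.com/v1alpha1
kind: Hello
metadata:
  name: example-hello
spec:
  fileName: example.txt
  fileContents: Hello, world!
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  dataSourceRef:          # points at the custom resource above
    apiGroup: hello.example.com
    kind: Hello
    name: example-hello
```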
When we try to use this PVC, for example with the job that we see on the left-hand side, we can see at the bottom that it uses the example PVC that we saw before, and we also see that it runs a container that dumps the data of the example.txt file. On the right-hand side we see that once the job is completed and we look at its logs, we see the content that we saw in the spec of the CR, in this case hello world. So that is the functional point of view, what it does. But let's talk about how it's done. When we post a PVC that refers to a CR, there is a controller that detects it, and this brings us to the second part of the volume populators feature, which is the library called lib-volume-populator that facilitates the implementation of such controllers. The controller detects the PVC, it reads the custom resource that is being referenced, and based on the data it takes from those two places, it creates a prime PVC, which is similar to the original PVC, and this prime PVC is used by a populator pod. The populator pod then writes the data into the PV that was allocated for us. Once that's done, the controller attaches the PV that we saw to the original PVC, and in our previous example, the populator wrote the content that we saw in the spec, and now that the PV is attached to the original PVC, the job was able to get access to it and dump it. In Forklift, we implemented two kinds of populators, one for oVirt and the other one for OpenStack. We read the data from there and write it to the PV. And if this picture looks familiar to you: we had a similar design in older versions of Forklift, and also, if you use KubeVirt along with data volumes that refer to remote sources, in both cases we use CDI, and with CDI we get a similar picture. The only difference is that the pod is named differently, it's called the importer pod instead of the populator pod, and the sources are a bit different: we don't have support for OpenStack, but we have others; in this case, I added vSphere to the slide. So it looks similar, but let's talk a bit more about CDI. CDI was the solution that was introduced in KubeVirt, before volume populators were implemented, to import data from remote sources. It is based on the DataVolume CRD. We can see a DataVolume CR on the right-hand side, and we can see in its spec section that it connects to oVirt, because of the imageio section. And yes, as I said, CDI also supports other sources; you can find them in the documentation of CDI. Before Forklift 2.4, we used CDI in all migration flows, and in Forklift 2.4, the latest version, it was partially implemented with volume populators. So I said it looks the same from the diagram that we saw, but looking more into the details, we see that there is a trade-off between the two. With CDI, there is an extension to Kubernetes with the DataVolume CRD, while with volume populators we get a solution that is integrated within Kubernetes: the data source ref section exists in all PVCs since Kubernetes 1.24. We talked about the naming of the pods, which is different. Another point is multi-stage transfers, which allow us to do what we call warm migration in Forklift: the ability to copy snapshots of the data periodically, so that when the user asks us, we can shut down the VM, copy just the remaining data that has not been copied yet, and start the workload on the target environment, and that way we minimize the downtime. We have no support for this with volume populators. Next, when we want to add another source to CDI, we basically need to extend the code base with more logic written in Go to support this new source, while with volume populators we have a pluggable design.
We can add a plugin that runs a populator pod that can basically run anything inside, regardless of the language it is written in, and it even allows us to leverage native tools that are provided by the source system. And lastly, CDI is an integrated solution in KubeVirt, and it is tailored to virtual disks; I will talk about that a bit later. Volume populators are not yet there: there is work in progress to integrate them into CDI, but it's not there yet, and there's no notion of virtual disks or virtual machines with volume populators. The reason we chose to go with volume populators is two-fold: one, we wanted to leverage the pluggable design, and second, we wanted the ability to run native tools inside the populator pod. When we planned Forklift 2.4, we realized that in order to be able to deliver the feature of migration from OpenStack in time, we needed to minimize the risk, and that also meant trying to avoid changing CDI. So we looked at volume populators, but we also realized that it would take us some time until we got to the point where we could transfer data, because we first had to create the inventory, and that took time. So the way we approached it was by starting with the volume populator for oVirt, which replaced some of the functionality that we had before with CDI. Let's see how that works. There is a controller in Forklift that posts three resources: a secret that holds the credentials and properties of the source provider, a CR called OvirtVolumePopulator, and the PVC that will hold the content of the virtual disk. Then another controller, a populator controller that is based on the lib-volume-populator library, creates the prime PVC and the populator pod, as we saw before, and inside the pod we run the command that we see below, which uses ovirt-img, a native tool that is provided by oVirt to interact with ovirt-imageio to upload and download disks. And then we write them into the PV. The OpenStack volume populator works in a similar way. This time we post a CR called OpenStackVolumePopulator; again, the controller creates the populator pod and the prime PVC, and this time the populator pod runs code written in Go, using a Go cloud library to connect to OpenStack, get the data from there and dump it to the PV. With these populators, let's start with oVirt: we were able to accelerate the data transfers; in our testing, in some cases we reduced the execution time by half, so it was significant. We were able to introduce another feature, transferring the data over an insecure channel, more easily, because we didn't need to change CDI. And we were able to introduce an alternative implementation to the one we had with CDI, and now we can deprecate the relevant code in CDI. And even more importantly, when it comes to OpenStack, we were able to introduce this new functionality of migration from OpenStack in time, without changing CDI, as I said, and it really works very similarly to the oVirt functionality.
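For a feel of the native-tool step mentioned above, the invocation of ovirt-img inside the populator pod is roughly of this shape; the engine URL, credential paths, disk UUID and output path are all placeholders, not Forklift's exact command line:

```sh
# Hedged sketch of the ovirt-img invocation inside the oVirt populator pod.
ovirt-img download-disk \
  --engine-url https://engine.example.com/ovirt-engine \
  --username admin@internal \
  --password-file /run/secrets/engine-password \
  --cafile /run/secrets/engine-ca.pem \
  aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
  /dev/volume   # write straight onto the provisioned PV
```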
Some challenges and insights that we saw in the process. The first one was that we initially planned to use CDI to allocate the PVs: we planned to create a data volume with a source set to blank, and then the populator pod would kick in and copy the data to the DV that we get. And that didn't quite work. I mean, the populator pod started and wrote the data into the PV, but afterwards the data was overwritten by the importer pod, because it tried to give us a blank volume, like we asked for with the source set to blank. So that didn't quite work, and we had to skip using the data volume and instead post the PVCs directly. The problem with this approach is that we lost the logic that we have in CDI for this. First, CDI knows how to pick the right properties for the virtual disk in terms of volume mode and volume access modes, and we had to duplicate that. And also, when the PV is allocated on a file system, the file system itself has an overhead, which leaves less space for the virtual disk, so we need to take this into consideration and allocate more space than we need for the virtual disk itself. This is also logic that we had to add to the core code of Forklift. Next, let's talk about progress reporting. VM migrations can take time, even a couple of hours, so it is important for us to reflect the progress of this operation. We tried several things. Our first approach was to make the populator pod update the CR, putting the progress on the CR itself. The problem with this approach is that it required another service account in order to let the populator do that, and that complicated things for us. Another approach we tried was to push the updates from the populator pod to the populator controller. The problem there was that we had to propagate the endpoint that the data would be sent to down to the populator pod, and that also complicated things for us and didn't quite work in all scenarios. So we chose not to use that one, and we tried another approach of pulling the data from the populator pod to the populator controller. This approach, which is similar to what CDI does (and we actually implemented it in a similar way), works by running a metrics server inside the populator pod; the populator controller then pulls the progress data as metrics. And that worked well for us. Next is dynamic volume provisioning. When we played with our implementation, it worked nicely in our development environments; however, we got reports from QE that on testing environments migrations sometimes fail. When we looked into it, we saw that it happens when the target storage class doesn't support dynamic volume provisioning, but only statically provisioned volumes. So we figured out that it's an issue in the lib-volume-populator library, and we reached out to the maintainers there to explain it and try to find a solution, but we didn't find a solution we agreed on. And since it's not that common to have such storage classes, and since it's not that important for us, we chose instead to block the use of such storage classes in flows that use a populator pod. The next topic is the conversion of multi-volume disks. A bit of background: the conversion applies when we migrate from a foreign hypervisor, and in that case we run virt-v2v, which allocates a local overlay volume that is backed by a remote volume. virt-v2v then inspects the data of the disks and modifies it, as we talked about earlier. When the disk is composed of a single volume, it works well with volume populators. However, when we have disks that are backed by several volumes, this design breaks, because volume populators expect one PV per populator pod, right?
They expect all the data to be written to a single PV, and in this case we need more than one, and we also need to specify information about the source volumes, and that doesn't quite fit. So again, we reached out to the people who work on the lib-volume-populator library; we filed a bug and had some discussion, but again we didn't agree on a solution. Our idea here, and we have a plan for it, is to change the code base of the library to support this, but we need to play with it and see how it goes. When it comes to migrations to remote clusters, which I mentioned before, we need to post the CR on the target cluster we migrate to, and the populator pod also runs there, so we need to track the progress it reports. This raises two questions. One, who should deploy the CRDs to that remote cluster? And second, how do we get the progress from there, and which component should do it? Our first attempt was to let a controller that already runs in Forklift on the source cluster create the CRDs and also fetch the data, but that turned out to be more complicated than we expected, because CRDs are usually deployed by the operator and we didn't want to duplicate that logic. And when it comes to progress reporting, propagating it from another cluster can be challenging. So we chose to avoid that and instead delegate this to CDI, so that CDI on the target remote cluster deploys the CRDs and also tracks the progress there. That's a work in progress; we're working with the CDI team on it, and I hope it will get into Forklift 2.5. And lastly, some more adaptations we had to make to the lib-volume-populator library. We usually try to create as many of the per-migration resources as we can in the target namespace. That eases debugging, and it also allows us to set owner references to improve the cleanup code. The library doesn't do this, so we needed to patch it. The feedback we got from the library maintainers was that they chose not to do it because they wanted to hide the prime PVCs and the populator pods from users, but that's fine for us; we actually do want them visible. So that's something we changed. Another point is limiting the number of restarts of the populator pods. We don't want them to keep retrying forever; at some point we want to declare that the migration failed and be able to inspect the logs more easily. So we limited the number of restarts of the populator pod to three. Also related to cleanup, we want to correlate the per-migration resources with the migration itself, and again we needed to change the library in order to set this label. And last but not least, when it comes to VM migration it might be important for users to select the network the transfer will run on. That can be because we want to isolate it from the traffic of the other VMs, or because we want to use a better, faster network. Either way, propagating the transfer network to the populator pod also required modifications. So to sum up, when we look back at the original question of whether volume populators also fit virtual disks, I would say the answer is generally yes. They allowed us to implement the things we originally planned, the basic functionality, and we're also working with the CDI team to integrate our populators, and volume populators in general, into CDI. That's a good indication that the basics are there and it fits. But when it comes specifically to VM migration, we see that it's not a complete fit.
We see this in the amount of changes we had to make to the lib-volume-populator controller, which eventually made us fork it and pull the code into our own code base. It was also challenging to get the more advanced functionality working: the remote migrations I mentioned, warm migrations, and the conversion of multi-volume disks. Those are all things we think are possible, but they require additional effort, and we're working on them these days. And that's all from my side. Now it's your turn, questions. I will repeat the question. The question is: when the prime PVC allocates the PV, and then at some point after the write is done the PV is reattached to the original PVC, how come the PV is not removed in the process? This is part of the logic of the lib-volume-populator itself. I'm not that familiar with the details, but basically it catches that point and just changes the reference, so it's done before anything is lost: it kicks in, makes the change, and then the PV ends up attached to the original PVC. Any other questions? Okay, if not, then thank you, everyone. All right. Hello, everyone, and welcome to this talk about FinOps and observability. I first talked about this in November, here in Brno, and that talk also went under the FinOps title, so this time it's the same, which is cool. I'm first going to reflect a bit on that November talk about FinOps, and then we'll move on to some more technical details. This first part was already covered in November, so I'll go through it a little faster. The whole engagement started when my friend Andy Thompson and I were engaged on a project with a large company that needed us to sort out their FinOps effort. Back then I was a DevOps engineer; I didn't have any actual FinOps experience. So Andy and I started exploring what FinOps is, what it means for a company, how to do it properly, and all that. At the same time, they had a first task for us: the biggest cost they were seeing was related to EBS volumes. Everything runs on AWS, and the overall spend was anywhere between 5 and 7 million US dollars monthly, with a big pile of that money going to EBS volumes. So as engineers we started exploring what was going on with EBS volumes. We looked into the usage patterns, how teams are using EBS volumes, what they do with them, and so on. At the same time, as I said, we were exploring FinOps itself: what it is, how it is managed, how it is defined, and what we should be doing as the new DevOps-slash-FinOps team. Working with them was really a great experience, because we could see firsthand how engineers in this large company were struggling with the idea of being in the cloud and not being able to use the cloud the way they wanted. The promise of the cloud is that you get your resources when you need them, how you need them, configured the way you want. That wasn't happening in this enterprise, because the enterprise had strict rules about how to use cloud resources, how to manage them, when to dispose of them, and so on. And that was frustrating for the engineers.
At the same time, we as a team were in a lot of conversations with management, so we could tell that management was struggling with similar issues. They were struggling to find ways to keep engineers from overspending in the cloud. So it's a clash: on one side you've got engineers who want to do stuff in the cloud, experiment and try out new things, and on the other side you've got management, which wants to control the spend, which is natural. Both positions make sense. While preparing for all this and exploring EBS and FinOps, obviously FinOps.org is the place to go if you want to learn about FinOps. And this is the beautiful definition on the FinOps.org site, which is almost completely inaccurate. It's really nice when you're in management and you want to explain how FinOps is the next big thing: all parts of the company working together toward the common goal of making the world a better place, spending less money, everyone happy and collaborating. But in reality, what we found after some exploration is this: FinOps is purely about saving money in the cloud. The whole story about changing the culture in an enterprise so that every engineer thinks about cost did not happen. At least I didn't see it, and I was on that project for three years. The reason I didn't see it was the miscommunication between management and the engineering teams. And that is what guided my colleague Andy and me toward building a system that we hoped would help facilitate this communication, because that was the biggest challenge and the biggest disappointment. On both sides you have people who want to do something good for the company; they just speak different languages and have different goals, and that was always the problem. Going back to engineering (and during this talk I will keep jumping between the engineering side and the management side), what you see here is a flat, simple model that we used to analyze what is going on with EBS volumes. We said, okay, we want to rely on AWS giving us the data, so we need pricing from AWS. Then we want some metrics about how EBS is used. If you want to claim that an EBS volume is unused, say because nothing is being read from or written to it, you want to support that claim with actual facts. If you can pull metrics and prove to an engineering team that not a single byte has gone to or from that EBS volume for some amount of time, that should be good enough for them to accept that the volume is actually unused, I mean physically, actually unused.
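As a rough illustration of that check, this is more or less what "prove the volume is idle" boils down to with CloudWatch. The 90-day window and the zero-bytes threshold are example values, not the exact rules we used.

```python
# Minimal sketch: sum the read/write byte metrics for one EBS volume over a window
# and call it idle if nothing moved. Window and threshold are illustrative.
from datetime import datetime, timedelta, timezone
import boto3

def volume_is_idle(volume_id, days=90, region="eu-central-1"):
    cw = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    total = 0.0
    for metric in ("VolumeReadBytes", "VolumeWriteBytes"):
        stats = cw.get_metric_statistics(
            Namespace="AWS/EBS",
            MetricName=metric,
            Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
            StartTime=start,
            EndTime=end,
            Period=86400,          # one datapoint per day
            Statistics=["Sum"],
        )
        total += sum(dp["Sum"] for dp in stats["Datapoints"])
    return total == 0.0            # no bytes in or out for the whole window

print(volume_is_idle("vol-0123456789abcdef0"))
```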
There were also all kinds of enterprise-specific conditions for declaring whether something is used or not, or whether it can be used or not. For instance, they had a lot of AWS accounts that were not claimed as production accounts. They were not labeled as production, but they were used as production. How that happens is another story. The short version is: a team would push a new service, find themselves a client, not get approval from the company to create a production account to support that client, but the client wanted to pay, so they asked management, management said yes, onboard the client, and they onboarded the client onto the development account. Then they just kept rolling new clients onto that dev account, and in the end you've got a dev account supporting production. And there are all kinds of things you find when you dig deep into how these accounts were used. We were communicating with around 400 different teams running all kinds of projects in this enterprise, so we had to come up with custom rules to keep the whole thing under control. On the other side, we would collect the data, apply AWS pricing to it, apply some rules and some logic around that, and we were able to generate two types of results. This was very important for us. The first type of result went to the engineering teams: we would meet with a team and present the data. We would say, this is what we found about your EBS volume usage; this is the list of EBS volumes; these are the ones that are not used, these are the ones that are not used properly, and these are the ones that are already optimized and you don't need to touch. And we would support all of this with actual data. The teams would get back to us, and we would get into all kinds of heated discussions about what is wrong with their EBS volumes. You get stories like: we can't touch this EBS volume. Why? Well, a colleague of ours created that volume and then left the company, and we don't know why this volume exists in the first place. And you would say, look, there is no traffic on that volume, no bytes going to or from it for three months, so maybe it's safe to delete it, or at least snapshot it and then delete it. But they would be reluctant, because they would say, well, we've got contracts and SLAs with customers, and we would rather keep paying for that volume, even though we don't use it, than risk some service outage. So again, these are real-life stories, not something we read in books about how to do things. One fun fact: management didn't like this plain diagram; they preferred the other one. It's the same stuff, but with colors and different shapes, it looks a little more modern, and that resonated well with management. When we had to present to teams what we do, we would use the simple model, and engineering teams responded well to it; it's a very simple design. But when we went to talk to management, the fancy one was the one to use, because then we could say, yes, something magical happens here, then we process some data, then we generate some results, and so on. Apart from the first report, which was a set of recommendations for engineers, we generated a second report that went to management. What we did there was present recommendations at the company level. These recommendations would read like this: there is an opportunity to save X amount of dollars per month if you are willing to perform these, these and these activities. We have already talked to the teams and already have estimates of how much time they would need to implement the changes. Usually that would be something small, like open the Terraform file, change something, apply the changes, and that's it. But we went through the conversations and collected the data on how long the changes would take. So now we could prepare a report for management and say: if you want to save this amount of money, it is possible; you need to free up engineering resources to spend this amount of time executing these changes. And now management actually had something to base a decision on, enough data to make a decision. It was palpable, not tacit anymore: you could actually anticipate how much time you would need, how much effort, and how much money you would save.
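Just to illustrate the shape of that management report: at its core it was nothing more than an aggregation like this, with the field names and numbers here made up for the example.

```python
# Tiny sketch of the management-facing summary: "save X dollars/month for Y engineer-hours".
findings = [
    {"team": "payments", "monthly_saving_usd": 4200.0,  "effort_hours": 6},
    {"team": "search",   "monthly_saving_usd": 900.0,   "effort_hours": 2},
    {"team": "platform", "monthly_saving_usd": 13000.0, "effort_hours": 24},
]

total_saving = sum(f["monthly_saving_usd"] for f in findings)
total_effort = sum(f["effort_hours"] for f in findings)
print(f"Opportunity: save ~${total_saving:,.0f}/month "
      f"for roughly {total_effort} engineer-hours of changes.")
```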
This is one of the people I love to bring up in these talks: Niklaus Wirth. Wirth is most famous as the father of the Pascal programming language, but some of his work goes well beyond that. I personally like "A Plea for Lean Software", a paper he wrote in 1995. It's a five-page paper that explains the benefits of a simple design. You can find more modern books on this, like David Farley's "Modern Software Engineering", which is a great take on how you simplify a design: how you start from something simple, how you add things to it, and how it becomes more and more useful. The whole approach is: don't try to be modern, don't do tech for tech's sake, but try to build tech that is actually useful, that solves or deals with some problem. So that is the influence behind how we built the models. After the initial success of the approach, the company said, as a friend of mine likes to say, this is all nice and dandy, but can you do it for some other resources? We've got DynamoDB tables, we've got RDS instances running around, we've got EC2 instances running around, and we have no idea how to save money on those. And again, my apologies to all the people dealing with actual software built to generate metrics or observability in its various definitions: this is a different kind of observability. We are not digging into how many processes run on a given instance and how much memory or processing power they use. It's a different approach; we wanted to sit in the middle, between engineering and what is good for the company. So we said, okay, if we're going to do this for another resource, EC2 or DynamoDB, let's say, we would like to build a framework, something reusable that we can extend, where we can just add new pieces into play. We would have an engine, and then we would, let's say, add plugins to this engine, so the engine becomes aware of how to analyze DynamoDB and so on. We started simple. One of the colleagues in a previous talk spoke about ECS, and it is the simplest form of running containers on AWS that you can still orchestrate a little. What we wanted to achieve here, now getting into the technical stuff, was to minimize the need for orchestration and increase the choreography. We wanted a lot of independent services running around that don't need to be orchestrated, because we wanted to avoid that. If it comes to orchestration at some point, we'll deal with it then.
But until we have to, we'd like to avoid it. So we said, okay, let's treat this as a microservices architecture. Let's put the analysis engine for EBS in one container, then put another one in a second container that analyzes DynamoDB, a third that analyzes, I don't know, RDS instances, and so on. So we started adding more and more services following the same pattern. Every service would have a different model, but the model would contain the same kinds of things: get me the AWS pricing, get me the metrics I need to assess whether this resource is used or not, and some custom parameters, like the things I mentioned before, for example whether an account is a production account, so that certain rules can't be applied to it. Again, bear in mind that metrics and observability in the sense that, let's say, the observability community uses the terms is different from this; this is observability at a pretty high level, let's call it that. We rely on AWS to give us the data. So what we did is put these containers in place. Then we needed to somehow start them. The idea was to run them every day, collect the data, and once a month generate the reports that we push to the teams targeted for optimization and to management, so they can decide whether they want to invest resources into optimizing the infrastructure and eventually saving money. Since we don't have a lot of orchestration, we kept it simple: we created scheduled, timed events. In the morning these containers get scheduled, they spin up and do their work. It was serverless, Fargate-based, so all of that was fine. After that, we realized that a lot of these containers go and fetch prices from AWS, and we needed secrets, which we had to store somewhere in AWS. Somewhere around this point we started thinking: how about we make everything idempotent and serverless? Idempotent meaning that I can run it many times a day and generate the same results, without messing up my previous data. And serverless, well, scheduled events, ECS Fargate containers and so on, that's still serverless, so that was fine. Then we proceeded and said, okay, we want to go through the accounts and analyze what's in them. We need permissions, we need some account-roaming mechanism that goes from account to account. And around this point we started wondering whether this microservices-style organization was really the most optimal one: could we pull out the piece of code that walks through all these accounts, have it do this for us and just hand us the token, and we continue from there? That is doable as well, but once we put it in motion, it wasn't really a good use of our time. We found that even though it's elegant in terms of code and architecture, it wasn't elegant in terms of us having to maintain it. And we always had in mind that at some point we would hand this work over to the team that would continue maintaining it. So we wanted to keep it simple and stick with the original approach. So we said no: every container will have its own engine that roams through the accounts, so we can keep them even more separated and avoid the orchestration.
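To sketch that "same model, different resource" idea, plus the per-container account roaming: something along these lines, with class and method names that are illustrative rather than the project's real code.

```python
# Sketch of the analyzer pattern: every analyzer gets pricing, pulls metrics,
# and applies custom rules; each one roams into member accounts on its own.
from abc import ABC, abstractmethod
import boto3

def session_for_account(account_id, role_name="CostAnalyzer", region="eu-central-1"):
    """Assume a role in a member account and return a session scoped to it."""
    sts = boto3.client("sts", region_name=region)
    creds = sts.assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName="finops-analyzer",
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

class Analyzer(ABC):
    def __init__(self, session, custom_rules):
        self.session = session
        self.custom_rules = custom_rules  # e.g. "this dev account actually runs production"

    @abstractmethod
    def fetch_pricing(self): ...
    @abstractmethod
    def fetch_metrics(self): ...
    @abstractmethod
    def findings(self): ...

class EbsAnalyzer(Analyzer):
    def fetch_pricing(self):
        # The Pricing API lives in us-east-1; filters omitted for brevity.
        return self.session.client("pricing", region_name="us-east-1")

    def fetch_metrics(self):
        return self.session.client("cloudwatch")

    def findings(self):
        return []  # apply the rules to pricing + metrics and return recommendations
```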
Scaling was another reason we avoided a shared roaming service: it would have introduced a piece of orchestration, and we couldn't have scaled the analyzers independently in the same way. So we kept the isolation. Then we said, let's keep the serverless approach and write some of the data to Aurora Serverless. It's still idempotent: the data written to the database is not kept longer than a day, because we don't need it longer than a day. If I rerun a task, it cleans up the data for that task and regenerates it, so the data is fresh for that run. That's why we could keep the serverless approach; we didn't need to keep this data for who knows how long, and that was fine. Also, obviously, with anything on AWS you need to log things and send information somewhere. And this is the final picture. In the end, the only piece of orchestration we had to introduce is that once all the tasks finish, we push the data into an S3 bucket through a reporting mechanism, and from that point on the data gets picked up by Power BI or Tableau tools, or loaded into Snowflake, reorganized, or whatever else needs to happen with it. At that point the company asked: can this be used for something else? And that something else we didn't do, because the contract was expiring and they didn't want to go through with the whole idea, but the question was: can we use this approach to make the company more advanced in the market? How do you do that? Well, you make your company serverless, or at least use more serverless. This company used a lot of EC2 instances, a lot of monoliths running around. So if you want to be a more modern, more competitive company in the marketplace, from the tech side of things as we see it, you should approach the way you build software a little differently. This is the slide we prepared for management when we went to a meeting, and it says: if management wants to be innovative and competitive, and if tech has a strategy to support that. So: I as a company want to be competitive, cool; what can engineering do? We can build serverless, cool. So we've got two ifs. Then, score teams by their current level of serverless adoption. And finally, management needs to approve resources to re-engineer, re-architect and rewrite these applications. That's how it goes. And the final point is this: when we went to the meeting and discussed it with them, you know what it turned out? The company was more keen on saving money than on becoming competitive, because the company was already dominant in the market. The idea of being more competitive was based on: okay, we can do it without a lot of effort and create a bigger gap between us and the competitors. That was cool. But then you introduce them to how this works in an engineering mind: you tell us what you want, we propose how we can assist you on that journey, we give you the estimate of how much time, resources, money and people we need to make it happen, and then you say yes, we want to make it happen, and you follow it through. And then you can support this approach using this tool, which gives you a proper analysis of how serverless each team already is and how capable each team would be of following the whole thing through.
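Going back to the idempotency point for a second: the pattern is simply "delete your own slice, then rewrite it", keyed by task and day, so reruns converge to the same state. A tiny sketch, with sqlite3 standing in for the Aurora Serverless database we actually used:

```python
# Each run wipes and regenerates its own slice of data, so rerunning is harmless.
import sqlite3
from datetime import date

def run_task(conn, task, rows):
    day = date.today().isoformat()
    with conn:  # one transaction: delete today's slice, rewrite it
        conn.execute("DELETE FROM results WHERE task = ? AND day = ?", (task, day))
        conn.executemany(
            "INSERT INTO results (task, day, resource_id, monthly_cost) VALUES (?, ?, ?, ?)",
            [(task, day, r["id"], r["cost"]) for r in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (task TEXT, day TEXT, resource_id TEXT, monthly_cost REAL)")
run_task(conn, "ebs", [{"id": "vol-abc", "cost": 12.5}])
run_task(conn, "ebs", [{"id": "vol-abc", "cost": 12.5}])   # rerun: same final state
print(conn.execute("SELECT COUNT(*) FROM results").fetchone())  # -> (1,)
```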
And this, for those of you who are roughly my age, let's say: you remember the Hitchhiker's Guide to the Galaxy and the Babel fish, and that was the picture in our heads. You create a tool that is a good tool for communication between engineers and management, who usually don't know how to talk to each other. I'm sure all of you have been in situations where you wanted to talk with management, but it doesn't really go both ways. So providing this tool felt like providing the Babel fish. On one side, we were able to explain to management what can be achieved with a certain amount of resources; on the other side, we were able to give the engineering teams what needs to be done, what should be done, and why we need to do it. Unfortunately for us, they didn't go through with the second part of the idea, modernizing the approach and making something better out of the whole thing. But the first part was good enough, so the Babel fish, in our case, really worked for that first piece. And it was very much thanks to Niklaus Wirth and the whole idea of keeping things simple. There is one thing I want to mention at the end. It's really tough to build software that you believe will create benefit for the company without sneaking in some nice, flashy technical things you can brag about, because we as engineers like to be proud of what we do. And if you're not using the latest and greatest... you didn't see Kubernetes in this presentation, you didn't see any actually fancy stuff. It was simple, plain engineering that did some good for the company. But I have to say it's not easy to defend this approach in front of a purely engineering audience. You get all kinds of questions: why didn't you organize your microservices differently, why didn't you use this AWS service or that AWS service? And it's hard to defend when your explanation is this: we knew which team was going to continue this work after us, we were there for a limited amount of time, and we wanted to build an application that would have a long life, that could keep operating after we were done, and that could be successfully managed by the team we hand it to. That influenced a lot of the decisions we made during the architecture phase. If you're building a product or a service that you want to offer commercially to someone else, that's a different thing; you're building it with a different goal in mind. This was made specifically for this company, with the idea that they would then take over the service from us and keep using it, and it really saved a lot of money. Well, it's not that difficult to save money when the company spends five to seven million dollars per month, but it was fine, it was cool. In the end, I think the arrangement in the last year basically became: if Andy and I as a team don't save more money than we cost, you don't have to pay for the two of us at all. And that was a really easy sell to management. But I want to re-emphasize this one more time, and this is the end: if the company has a strategy, and if the company is able to communicate that strategy, and if the company is able to stick with that strategy, then the tech people will follow.
Otherwise, you will have all kinds of clashes between the management side of things and the engineering side of things. And this is it. And once again, thanks to Niklaus Wirth for his appearance here. Okay, questions. Yes. Oh, to repeat the question: I started as a DevOps engineer, then moved into FinOps, and then I worked on establishing communication between the engineering side and the management side, and the question is whether there was enough management knowledge in the company to utilize, to ride on this experience. I would say yes. The first piece of the puzzle that we put in place is a nice proof of that. The company management really went along with the approach. The EBS work, the first thing, was the key; that's the key we used to unlock the company. When we proved that we could give them an estimate and the amount of money they could save if they followed it, if they facilitated what the estimate needed, and when the company actually had results to look at, that was the moment the company said, okay, this can be done. But that was the piece of work that followed the company's strategy: we want to save money, and that's cool. The moment we got to the second part of the puzzle, when we said, okay, you want to innovate, or you want to be more competitive, that was the moment the company had to say, no, we are not really into that, which is cool, which is fine. The bad thing is when a company wants to propagate such an idea but in reality doesn't really want to follow through: it wants to save money and wants to look good, so it talks about these things because they sound good, and a lot of people can align with the ideas, but not in reality. So my answer is yes, because it was, I think, properly aligned with the current company strategy, and that is why it worked. It was still amazing how these two, I would say, gangs in the company didn't communicate well. Everyone wanted their own thing. If you talked with engineers, they would moan about not being able to use the cloud freely, to create new clusters and do stuff, and they would moan if someone warned them that they need to clean up after themselves and all those things. On the other side, management mostly just wanted control. You know what they did? They imposed strict policies on how much money you can spend. And you know how engineers reacted? They said, okay, we will build something that stays below that limit, and as long as we are below it, no one will bother us. And within that threshold they could do all kinds of things, and not a lot of them were good. So while analyzing things, we discovered that a lot of the accounts that were below the budget limit were actually very much not optimized, because nobody cared: they only cared about doing the daily work and being done with it, because they were below the radar. The company didn't even notice these accounts, because they were below the limit. Some manager leading these teams was a good enough fighter, I would say, to go to a meeting with upper management and fight for a good budget, and then he would bring that budget back and say, hey, I fought for this budget, we have enough, now go and play. And these teams were really happy, but still not optimized. This helped the company actually analyze and isolate those accounts as well.
So again, back to the question: I think yes, because of this. The strategy in place was correct, it was about saving money, and engineering was able to follow. The model was simple and easy to translate. A lot of it was written in Python, some of it in Node.js, both very popular languages, so it was really easy to bring new engineers into the game, and they picked things up pretty fast. Small, isolated pieces of code, so a relatively simple approach. Yes. Why we didn't use Cost Management, yeah. That is an absolutely good question. The problem with that approach is that they were using it before us, and properly investigating 400 different accounts with AWS Cost Management required them to spend a lot of time going through it over and over. And in a company that big, things change on a daily basis, not weekly or monthly; there are drastic changes up and down in the use cases, in the usage, in the patterns, in everything. They wanted something that would automate the whole thing. So Cost Management is a good approach, but it also fails to cover all the custom stuff. Like I said, they would have a production load running on a dev account, and you can't just unwind that load, you can't just take it out and move it to a production account. There were a lot of these gotchas. One example: they wanted us to introduce something they called the Big Red Button. You know what the Big Red Button is? They wanted us to create a UI with a big red button that they would click on the 15th of December, and it would shut down all kinds of resources throughout the company, because they would go into low-usage mode until, I don't know, the 10th of January. That was also part of the whole activity. But we started with one idea, and ten days later we had all kinds of exceptions to that rule, because again, in a large organization a lot of managers were able to fight their way to an exception. So we had to work around them. And a lot of these AWS solutions are really good, and you can use them for this as well, but when you go into an enterprise with so many different tweaks and special cases, if you are not willing to dig in and build something custom, it's not really going to work for them the way they expect. In the end, we were using the same underlying things as Cost Management: we would execute API calls against the AWS Pricing API, and we would execute CloudWatch calls to get the metrics about EBS volumes. Underneath, it's all the same stuff; it just wasn't really usable for them at the company level in that form. They tried two or three different specialized FinOps products, products that are fully baked with all kinds of reports and everything you can think of. The answer would always be: tag everything properly and we will handle it for you. And "tag everything properly" does not go well with that kind of company. Tag everything properly with two teams, three teams, five teams, that's really cool. But when you think about 400 teams and you have to go to each of them... we also had a situation where we went to a team and said, hey, you need to tag things properly so the tool can pick things up from your account.
And they would say to us, and I quote: you come and do our job every day, when you also have to manage the existing infrastructure and deliver new features in this amount of time, and then you come and tell us to implement this. And they would just refuse. They knew they were below the budget limit, they were bringing money into the company, and they would just say, no, not interested. And that's the reality. It's one thing to talk about it, but this is the real stuff out there. So thank you for your time, I hope you enjoyed it, and see you next year in Brno again, about something else. Thank you. Test, test, test. Hello, can you hear me? Very good. So out of curiosity, not a full room, half, maybe a quarter of the room: how many people here have used Ceph before? Almost everybody, okay, cool. How many of you are still using Ceph today in production? About half of you, okay. And how many of you have spent time benchmarking Ceph in the past? Three, four people, five people, okay. And how many of you enjoyed benchmarking Ceph? One person, okay. Okay, cool. So today I'm going to be talking to you about sibench, which is a new way to benchmark Ceph. It's an open source project written in Go. I'm going to give you a bit of background about the project and why we did it, but first let me talk a bit about why benchmarking is painful. It's painful for a number of reasons. You need to run tests many times. You need to run tests for a minimum length of time for them to be good tests. There are lots and lots of variables, especially with Ceph. When you make changes to a workload, you should only be changing one of these variables at a time as you go through the tests, so you can figure out and diagnose which changes are having which impact. And then, when interpreting the results, even if you've only changed one variable at a time, there's a good chance that more than one thing has changed and you just didn't realize, so that's always fun. With Ceph specifically, benchmarking becomes even more complicated. The first reason is that Ceph has many interfaces: we're not interacting with Ceph in just one way, we're interacting with it in many ways. You have block, file, object, or you can talk directly to the RADOS API. With Ceph you have massively varying workloads: the different Ceph protocols all have different workload characteristics, so you need to take different approaches and measure each one independently, and figure out, okay, for RBD this actually works really well, but for S3 this is terrible, or whatever. So that's another issue. Another issue is that distributed systems need distributed benchmarks. Because Ceph's performance scales linearly as you add OSDs to the cluster, you also need to scale your benchmarking tooling in a similar way, to make sure you're not limited by your benchmarking architecture. The drivers, or worker nodes, that you're using to benchmark (client nodes, as some people in the benchmarking world call them) must not be the bottleneck; you need enough client nodes to fill up the Ceph pipe. Another issue is that workloads can be invisible to you, because many people operating Ceph are like, oh yeah, I'm operating this as a service, and they go and talk to their customers and say, hey, what are you using Ceph for?
And their customers don't tell them; they just say, we want an object store, and they don't say for what, because in many cases the customers don't know yet what they're going to be doing. A lot of these people operate as infrastructure-as-a-service providers, so the eventual need is kind of opaque, and you figure it out over time. That's a bit of a challenge. And then another issue with Ceph, which is always fun, is that the background work in Ceph can also get in the way. Ceph is doing all this stuff in the background: deletes, scrubbing, self-healing checks. Those are things you have to be aware of as well when benchmarking Ceph, to make sure that when you look at the changes you've made, you haven't inadvertently introduced another issue. So first, I work for a company called SoftIron, and we use Ceph as a core part of our product. Our last product, HyperDrive, which was an appliance for Ceph, was deployed in dozens of sites across a very large customer base with loads of different workloads, and we needed a way to figure out, when we architect a new HyperDrive cluster, what that's going to look like for a customer. Today we offer a similar product, a full end-to-end cloud, and we have the same problem: a customer is going to be using Ceph in a certain way, so how do we estimate what type of hardware to provide them? Are we providing NVMe, SSD, hard disk, and in some cases what combinations? We needed to figure this out; it was a critical blocker for our business. So we started with a tool called Cosbench. Anybody heard of Cosbench? Also the same guy that loves benchmarking, and a few more people, okay. It was originally created by Intel, it's open source, written in Java, and for a while Cosbench was the only tool you could benchmark Ceph with. It was targeted at S3 and object storage, but it also had some limited support for RADOS, and it was originally designed to benchmark other object stores too, so you could compare, say, your Ceph deployment to an S3 deployment, or your Ceph deployment to Google Cloud, or something like that. We used this for a while, and we didn't really want to reinvent the wheel, but there are some problems with Cosbench. The first is that the JNI is expensive. If you want to use Cosbench to measure performance for anything other than S3 (Amazon does have a pure Java implementation of S3), everything else has to traverse the Java Native Interface, and that's very expensive, so it completely defeats the point of benchmarking with Cosbench. There were a number of other problems with Cosbench. First of all, it didn't really have a maintainer and hadn't seen code contributions for years; the Intel guys kind of abandoned it after a while, I guess they either moved on with their lives or stopped using it. It also uses this thing called OSGi, a horrid, bundle-based Java framework. It was very much built as a monolithic application, and Cosbench was originally targeting lots of different object protocols, which may have made sense for that, but it really was a very fragile structure and didn't work for us.
So there were a number of issues with Cosbench: the manual workflow, no real build or install system. We spent a bunch of time with Cosbench trying to figure out how to make it better, trying to build it with Maven, package it, document it, use it in a sensible way, but at some point we just abandoned that effort because it was more work than it was worth, and we ended up writing our own benchmarking tool, and that's sibench. So what were the goals of sibench? We wanted a tool that is simple and lightweight, easy to read, easy to run, easy to debug, so you know what's going on. We wanted it to be linearly scalable in the same way that Ceph is linearly scalable, so it doesn't get in the way; we didn't want to benchmark sibench itself, we wanted to benchmark Ceph. We wanted to benchmark all the different Ceph protocols, so this is a tool designed from the ground up for Ceph. It had to be efficient and low level, so we wanted something that lets us call out to C libraries like librados without performance implications, and we wanted it to have similar performance to FIO, because FIO is kind of the industry standard for benchmarking and is pretty low level itself; FIO is great, frankly, and we wanted to be able to look at FIO's numbers for Ceph, look at sibench's numbers for Ceph, and see something that made sense. So those were the main goals, and the final thing is that we wanted a framework that gives us control over the data we use to run the benchmarks, so we also control what data we're generating, not just using /dev/random or something like that. So what's the architecture? It's written in Go, so it's almost free to call out to C. It's both a daemon and a CLI tool: you use it like a command line tool, but it's also running on your driver nodes as a daemon. It handles auth, so it takes the Ceph keys where necessary and the S3 keys where necessary as arguments and passes them to the monitors or the gateways as needed. It's multi-threaded, which is pretty easy in Go, and by default every sibench driver spins up a thread per CPU core on the worker node; you can control how many threads you want and play with that as a variable too when you're benchmarking, which can be useful for specific workloads. It reports both bits and bytes, for the networking people who love measuring things in bits and the people who love measuring things in bytes; no one cares, divide by 8, whatever. It also has this notion of ramp time, which many benchmarking tools have: you specify a ramp-up and a ramp-down, and it won't measure, say, the first three seconds or the last three seconds. And finally, sibench focuses only on the benchmarking; it doesn't do all the setup, talking to the monitors, capturing the data and figuring everything out. We wanted something that just did the benchmarking, and we ended up writing another tool, which I'll talk about in a second, called Benchmaster, and that's a tool that helps you orchestrate your benchmarks, run sweeps, run multiple things. So I'll talk about that in a sec, but sibench is just about running one workload and doing it well. So what does the architecture of sibench look like?
You can see on the right here you've got a Ceph cluster, and then librados, librbd, libcephfs, RADOS Gateway, and librbd and libcephfs a second time for the mounted cases. From a sibench worker you can either talk to RADOS directly, use RBD images, use a mounted file system, use RADOS Gateway, or use the last two, which are kind of native: either a native file mount or a native block device. So you could theoretically use sibench to benchmark stuff that isn't Ceph; you could benchmark anything. That's roughly what the architecture looks like. For librados, you just provide a Ceph pool, a Ceph key, and a monitor address. For RBD you provide the same things, and it handles the RBD images. With RADOS Gateway, and this was something that Cosbench did very well, you don't have to worry about load balancing HTTP requests or figuring out HAProxy or something similar in order to benchmark S3, because every worker can talk to a different RADOS Gateway server. So there's some inherent, built-in load balancing just from the fact that you have lots of workers and you can give them lots of endpoints, and it will spread that for you. So that's pretty cool. Some of the other things you can currently do: bandwidth limiting. Some customers have requirements such as, hey, we want a 100 millisecond response time while doing 30 gigabytes a second of traffic. So this is a good way of getting latency numbers at a specific bandwidth; when you max out the bandwidth, that's not always a good indicative measure of latency. So it's a way of keeping workers from maxing out the pipe, and you get a really good view of the latency implications of a given load. It has a slice generator, so by default it generates random data, but that's not very useful when you're trying to measure things like compression or deduplication, so you can put that random data into a buffer and then use the same data again for future workloads in order to see the impact. You can do read/write mixes, so it's not just a read workload or a write workload; you can also do a 30/70 or 50/50 split at the same time, because most real workloads are going to be combined. sibench also has support for individual statistics, so you can write out all the stats from a worker and then go and do statistical analysis and a whole bunch of other investigation to figure out what actually happened if the numbers don't make sense. This ends up being a really, really big file, so it's only worth doing if you're really confused. And finally, sibench doesn't delete anything; it doesn't clean up after itself by default, though you can tell it to. That's because deletes in Ceph have a huge impact on performance, and they're also silent: it's hard to tell when they're happening. So we don't delete by default, but you can turn it on, because that can give you a more representative view if your workload does a lot of deleting. So, Benchmaster. Benchmaster is a small wrapper for sibench, and also for Cosbench, because we had a migration process moving from one to the other. It's for running a series of benchmarks rather than just a single one, so it lets us provide a set of options to sweep over, so I can say, hey, go and run a workload for 1k object size, 16k object size, and then come back and give me all the answers.
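The sweep idea itself is nothing magic; stripped of all the real plumbing, it's basically this. The run_workload function here is a stand-in, not the actual sibench or Benchmaster API.

```python
# Sketch of a sweep: run the same workload once per object size, collect one row per run.
def run_workload(object_size, runtime_s=30):
    # Placeholder for "tell the workers to run this job and return its stats".
    return {"object_size": object_size, "write_mb_s": 0.0, "read_mb_s": 0.0}

def sweep(object_sizes, runtime_s=30):
    rows = []
    for size in object_sizes:   # change exactly one variable per run
        rows.append(run_workload(size, runtime_s))
    return rows

for row in sweep([1024, 16 * 1024, 1024 * 1024]):
    print(row)
```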
Benchmaster also writes to Google Sheets, so I can generate a Google Sheet and have all my workload data end up in there, which is just a very easy way to draw graphs or use the output collected over time; it's much more organized. So I want to show you what a sibench worker looks like. Hopefully the conference... can you still hear me? Hopefully the conference internet will allow me to do this, because I'm actually doing it on a remote machine. Is it here somewhere? There it is. Cool. So, you can see I have this Ceph cluster. It's just a small Ceph cluster that we have in one of our Berlin labs. It's actually backing an OpenStack cluster, so I could theoretically make lots of people's lives miserable if I do the wrong thing. That's fun. It has 36 OSDs, just three nodes. You can see they're all hard disks across three nodes, and they're all up; it's looking fairly healthy. There's some data on the cluster, not too much, about three terabytes used. So that should all look pretty familiar. So, what about sibench? The first thing is, I want to show you that sibench is actually running. For this demo we're not really interested in the actual numbers, the benchmarking data, because it's a very small test cluster, so it's not about hitting really big numbers; this is just about showing you how it works. So here sibench is running as a daemon, and that's how it works, and then we also have the command line tool, which has a fairly big help menu. Basically, for any of the sibench commands I can provide a list of servers, and those servers will be the worker nodes, and you can run the command from any of the worker nodes; it doesn't really matter. As long as the daemon is there listening, it has an API that we just talk to, and it will send out the benchmark. Other things to point out: you can see the different sections; it has a man page as well, which is always nice, with some information about benchmarking with sibench specifically, some guidance, and more detail on every command and option for the tool. There's also a website with information on how to download it and package it and so on. Is that better? Not big enough? I hear that a lot. Next, we're going to do a basic benchmark. I've got some historical commands, some of which worked, some of which didn't, so I'll run the ones that hopefully do. This is going to be ramp-up 1, ramp-down 1, just a five-second workload, and we're not going to measure the first or last second. I'm providing it the Ceph key (I'm just including the command to print the key), and then I'm giving it the monitor at the end. Let's see what happens. Very simple: we've got a write stage, a prepare stage, and a read stage. The prepare stage was actually skipped because I didn't clean up the data, so it could just go back and read the data that was already written. You can see that the reads were faster than the writes. Very simple. The next thing I'm going to do is use Benchmaster to create a sheet. I've actually done this before. This is the Benchmaster help page; you can see here I have sheet create, and I'm going to give it a sheet name and an email address, and it will create a Google Sheet and share it with me. Let's try that. Benchmaster... oh, can't spell. Sheet create devconfcz, and then I'll send it to myself.
I just got an email; let me put it up. That's a very small spreadsheet... oops, that's a very big spreadsheet. There you can see the spreadsheet, and now let me go away and run a benchmark. I'm going to juggle seeing things on this screen and that screen. Now I'm going to run a benchmark, and the sheet I'm going to provide is the one we just created, devconfcz. You'll see here that I've added a new option, which is the object size, and I've added 128, 512, and 1 meg, so we can see the different speeds at these different sizes. I'm going to make this a 5-second runtime as well so it doesn't take absolutely ages, because it's going to do three runs. I'm going to leave that to work for a bit, and while it does that I'm going to talk you through the rest of the slides, if I can figure out how to manage this. Some ideas for the future with sibench. We thought about doing a workload generator. This is something where, over time, we realized it's quite hard to map a customer workload, a real user workload, to a benchmark that actually represents it. There's also the invisible workload problem: we had service providers that said, we don't know what people are running on our system, sorry, so just give us something that works for everybody. And it's like, well, that's not really how benchmarking, or computers, work. We wanted something that could sit there, like a generator, and say, okay, here's what the workload has looked like on average over the last month, and here's a sibench benchmark you can run to test other potential clusters against that workload. That's something quite cool that we haven't got around to doing yet. We also wanted to do sweeps over OSD counts, which is quite cool. I actually wrote a script that did this, but I never patched it into Benchmaster. The idea is that if you want to prove to yourself that the Ceph cluster you're deploying actually scales linearly in performance, you can remove a whole bunch of the OSDs from your Ceph cluster and then add them back in gradually, node by node. Say you have 20 nodes: you scale back to three nodes, so you can just do basic triple replication, run a benchmark, add the fourth node, run a benchmark, add the fifth node, run a benchmark, and then watch. If it's a linear graph, you're happy, because it's scaling linearly as was promised by Sage and the other gods; if not, then you're in trouble, right? I actually did this many times, and it worked really well, but it just never got into Benchmaster, which is a shame, because it was really nice. Then meta operations: there's a whole bunch of stuff in Ceph that isn't just reads and writes, other things you might want to do, and it would be cool to look at how you measure the impact of, say, a snapshot, or other operations. Then there was support for Kubernetes. Kubernetes obviously has a fairly mature CSI driver for Ceph; it works really well, both file system and block. So it would be cool to figure out, hey, can we have Benchmaster go away, spin up a Kubernetes cluster, spin up a bunch of pods, map the storage class or the persistent volumes into the pods, run the benchmark, spin it down, and then spit out the results, and make that a repeatable process. That was something we never really got around to either.
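On that OSD-sweep idea: the "is it linear?" check at the end is simple enough to sketch, by fitting a straight line through the per-node-count results and looking at the deviations. The numbers below are made up purely for illustration.

```python
# Fit a line through (node count, bandwidth) points and report the worst deviation.
from statistics import linear_regression   # Python 3.10+

nodes     = [3, 4, 5, 6, 7]
bandwidth = [1.1, 1.5, 1.9, 2.2, 2.6]       # GB/s per run, hypothetical values

slope, intercept = linear_regression(nodes, bandwidth)
worst = max(abs(bandwidth[i] - (slope * n + intercept)) / bandwidth[i]
            for i, n in enumerate(nodes))
print(f"slope={slope:.2f} GB/s per node, worst deviation {worst:.1%}")
# Small deviations mean the cluster is scaling roughly linearly as promised.
```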
And then, yeah, if you guys have any ideas, leave an issue or a merge request, patches welcome; it would be cool to see what you come up with. So while I've been talking, hopefully, if I can find my cursor again, we can see that this benchmark has completed. Remember I ran three benchmarks with Benchmaster, so it will have told sibench three times: go and do this, go and do that, go and do this, and then written the results into the spreadsheet, and there we go. So we have three results. We can see they were run with sibench, the different sizes, the time, and we can see that the write bandwidth differed massively between the three object sizes. Same with the read bandwidth. Interestingly, it seems to have gone down between 128 and 512, which seems kind of odd to me, but that's probably because I did a 5-second run or something. And yeah, that's the benchmark. So, any questions? So the question is, can you take an FIO configuration file and use it with sibench? There isn't a way today, but I actually do really want that as well, because doing it on the command line is very annoying. That would be a very trivial change to add. It's a very good question, and I think we should do it, so that's another idea for the future. It does, it does. Oh yeah, the question was, is it also possible to count the IOPS? Over here you can see there is a latency value, which you can translate to IOPS; it's basically IOPS, just expressed as latency in milliseconds. Cool. Any other questions? Pavar? So the question was, have you finally given up on CBT? And I never gave up on CBT. CBT is a wrapper around lots of tools, so in a way it's kind of like the Benchmaster thing that we did. I actually gave this talk at one of the Ceph Days a few months ago, and the guy who wrote CBT was in the room, and he said, hey, we should talk about integrating sibench into CBT. So I think that would also be a very cool thing to do. But I haven't played enough with CBT to compare it properly; I know it has support for FIO, it has support for rados bench, it has support for Cosbench, and I think it has support for another one of the Go-based benchmarking tools. So I think that's a worthwhile thing to look at, but it wasn't useful for us, for our purposes; we wrote this very much for what we needed. But that's a good question. Cool. Any more questions? Well, thank you very much. Should we start? Okay. So, hello, good afternoon, everyone. My name is Miguel Duarte, and I'm here with my colleague Daniel Mellado. We both work for Red Hat: I work in the OpenShift Virtualization networking team, and he works on OpenShift monitoring. We're here to present a talk titled Mayday, CNI Overboard, which will make sense in a few minutes, I hope. So, the first thing we'll do is explain a little bit the current state of the CNI project and what actually led us to care about this. Then we'll introduce CNI and Multus so that we all share the lingo and can properly specify the problem. From there, we'll get into a new enhancement proposal about multi-networking and what's coming in CNI. Well, it's not exactly an upcoming CNI release, because right now it's just a set of requests for enhancements, just a list of issues.
That is something that is in scope to be worked on for CNI 2.0, but it's very important. And then we'll finish with what we've learned and where we see this going forward. So, the first thing we should spend a little time on is the relationship between CNI and Kubernetes. The Kubernetes networking model is extremely simple: it basically just says that every workload, every pod, gets a single interface with an IP address, and every pod, wherever it is scheduled, will be able to communicate with any other pod in the system through that single IP on that interface. That interface is created and configured by CNI, which stands for Container Network Interface. Now, if you look at it like that, you get the impression that these two things are bound together in some way. The reality is that this thing in the middle does not actually exist. What you do have is that Kubernetes understands something called CRI, the Container Runtime Interface, and that is the thing that actually speaks to CNI. What this means is that there is no way for CNI to communicate with Kubernetes in any way, nor does Kubernetes know anything about CNI. It really does not know it even exists. On top of that, another thing that is missing from here: let's say that your workload, for whatever reason, requires more than one interface. You have one, Kubernetes gives you that and manages it, but it only gives you one. What if you need more than one? For that, well, we didn't create it, but there's a project called Multus that is responsible for that. Its value proposition is that it gives the pod multiple network interfaces. On top of that, Multus actually understands Kubernetes: it speaks its API, and it also speaks the CNI API, which we will see later on in detail. So we have Multus, which is responsible for these two things: it speaks Kubernetes, and it grants a pod the ability to have multiple interfaces. These multiple interfaces don't have to be only virtualized interfaces. Sometimes you actually need to tap into a physical host interface, for instance with SR-IOV. For that you need to add more stuff into the picture: you need a device plugin, you need the SR-IOV network operator, which requires Multus, by the way, to be able to give your workload an SR-IOV interface. So the picture is getting more and more complex depending on your use case. The more things you need, the worse the picture becomes, and the complexity increases. This is what we have today, all these things. And there are new initiatives coming up, like, for instance, DRA, Dynamic Resource Allocation; that got merged, I think as alpha, in Kubernetes 1.26 or 1.27. And now we have a new, emerging Kubernetes enhancement proposal for native multi-networking, and this is what we'll focus on later in this presentation. So, with all that set, I'll hand this over to... Can you hear me just fine? Okay. So, Miguel introduced us a little to what CNI is, its current status and its evolution, but I just want to go a bit deeper into that. First of all, CNI: everybody thinks that it's basically Kubernetes networking, so we've got all the plugins, but there's so much more to it.
So, I don't know if you have ever wondered why, if this is Kubernetes, a CNI plugin doesn't have, let's say, a native config. In case you don't know, you may be wondering: why JSON and not YAML? Why not a CRD? And, even more, why is there no daemon? This is a quick overview of what's going on in a Kubernetes node when we have all the components there. Currently, a CNI plugin, in the end, is just a binary which speaks JSON, and it's a binary that's run by the kubelet or, rather, by the container runtime, via CRI; that's what Miguel was saying before. What do we want out of that? In the end, we just want a network namespace with an interface, or more than one, but we'll get to that. In case you're familiar with any other virtualization infrastructure project, let's compare that to OpenStack. If you're familiar with OpenStack, as I know a lot of people here are, you can get a VM with whatever number of subnets and interfaces you'd like, but you can't do that natively here, and you don't want to do it all on your own. So yes, as we were saying, this is a binary which is installed by a DaemonSet, which means you get a copy of the binary on every node of your Kubernetes cluster. In the same way, it has a CNI plugin path, which basically stands for "where the hell is my config": when the binary runs, it looks for a config file. So what is the CNI spec? So far you know that CNI, okay, is it a plugin? It is a protocol, it's an API. In the end, it's just a specification. It gives you four primitives, which are CNI ADD, DEL, CHECK, and VERSION. And even more, out of those four, most of the CNI plugins you'll see around only implement two of them: CNI ADD, which means "okay, give me a port", and DEL, "okay, delete that". And there's some complexity that falls out of that, because, as I was mentioning before, this is not a daemon, so all the implementation gets pushed down to the CNI plugin writer. If you want to have a controller on top of that, okay, go ahead and implement it yourself. The CNI spec is totally fine with that; it only expects to read a JSON config file and some environment variables from the system, and it will just execute the binary and give you a result. So I think it's worth taking a quick look at this. You've got a couple of environment variables here: CNI_COMMAND — what the hell am I running, I'm talking about these CNI ADD and DEL commands — then basically which network namespace I'm going to be using, which interface name (we call it just one, like the Highlander, but, you know, there's more to that), and then the container ID. And if you look again here, I can point it out, it's over here: this is just some JSON in which you have the cniVersion. Depending on the version that you're using you may have some limitations, but I won't get into that unless somebody specifically asks. And you get your CNI plugin name and type, and then you can do whatever you want. Once it's there, it's going to be executed, and it will give you an output. If you have been to some other sessions, you may have heard about prevResult, and that's because you can also do CNI plugin chaining, which means that you can put several of these in a pipeline. And basically, if a thing goes well, you get an exit code, like a zero, everything goes well, okay, go on. Then, Multus. What's the problem here? I've been saying that natively we only get one single interface per pod.
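Before moving on to Multus, here is a minimal sketch of the protocol shape just described: a do-nothing plugin that reads the CNI_* environment variables and the JSON config from stdin and writes a result. It is a simplified illustration, not my KubeCon example and not any real plugin, and the result it prints is deliberately trimmed down rather than the full CNI result schema:

```go
package main

// Sketch of the CNI calling convention: the runtime sets a few CNI_*
// environment variables, pipes a JSON network config on stdin, executes the
// binary, and expects a JSON result on stdout plus an exit code.
import (
	"encoding/json"
	"fmt"
	"os"
)

// A heavily trimmed view of the network config the runtime sends us.
type netConf struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Type       string `json:"type"`
}

func main() {
	cmd := os.Getenv("CNI_COMMAND") // ADD, DEL, CHECK or VERSION
	containerID := os.Getenv("CNI_CONTAINERID")
	ifName := os.Getenv("CNI_IFNAME")
	netNS := os.Getenv("CNI_NETNS")

	var conf netConf
	if err := json.NewDecoder(os.Stdin).Decode(&conf); err != nil && cmd != "VERSION" {
		fmt.Fprintln(os.Stderr, "bad config:", err)
		os.Exit(1)
	}

	switch cmd {
	case "ADD":
		// A real plugin would create the interface ifName inside netNS here.
		fmt.Fprintf(os.Stderr, "ADD %s/%s in %s for net %q\n", containerID, ifName, netNS, conf.Name)
		// Minimal, made-up result; a real one follows the CNI result schema.
		json.NewEncoder(os.Stdout).Encode(map[string]string{"cniVersion": conf.CNIVersion})
	case "DEL", "CHECK":
		// Nothing to clean up or verify in this sketch.
	case "VERSION":
		json.NewEncoder(os.Stdout).Encode(map[string]interface{}{
			"cniVersion":        "1.0.0",
			"supportedVersions": []string{"0.4.0", "1.0.0"},
		})
	default:
		os.Exit(1)
	}
}
```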
That's quite an acute limitation, especially if you're working in telco environments, because the most basic telco workload is a firewall, and how do you do a firewall if you've only got one interface? Good luck. So Multus is a project which aims to be a kind of meta-plugin. It handles, first of all, several interfaces, and it allows you to use a CRD — okay, hey, finally this starts to look Kubernetes native — but don't get me wrong: in the end this CRD, an object in Kubernetes called a NetworkAttachmentDefinition, is just a wrapper around this old-style JSON. That's something we would like to change in the upcoming CNI 2.0, but I'll get to that later. For now, if you take anything away from this: Kubernetes natively only allows a single interface per pod; use Multus and you get several. This is a quick example of it. Here we happen to use Flannel, but I don't really care which plugin we're using; you can use any plugin from the community — Flannel, OVN-Kubernetes, whatever. Using Multus you get, if you look here, besides the eth0 interface, a net1, a net-n, so you have several of those, and with those you can start doing things that are much more interesting. So what are the pros of the current approach in CNI? First of all, it's super simple. Even if I said there are four primitives, basically you don't care about anything but two: I want to create a port, I want to delete a port. Well, CHECK and VERSION are interesting, they're cool, so you know which version of the plugin you're using, but if you look at, let's say, some fancy CNI plugin and do a git blame — you may prove me wrong — Cilium doesn't even have CNI CHECK implemented, if I recall correctly. So it's nothing fancy. But again, this is not a daemon, so every developer has to reinvent the wheel and create their own reconcile loop for each CNI plugin, which kind of sucks, because why would you want to reinvent the wheel for every CNI plugin? And for the same reason, its lifecycle is somewhat limited: you get one CNI ADD, it doesn't give you an acknowledgement, it just gives you a result, and good luck if you fail while deleting something. You may be leaking resources, and you may not even know it unless you implement your own solution in your CNI plugin. Also, although I think CNI is a super cool project, the community is getting somewhat smaller, which is really bad, so feel free to go join and contribute if you want to. We are totally happy to get patches, pull requests, docs, comments, whatever, and the docs in particular could really be improved; again, requests accepted. There are also newer implementations somewhat replacing parts of this, and, as you saw in the original diagrams Miguel was showing, it's not really Kubernetes native. You may say: why? You're telling me it's a CNI, a Container Network Interface, and what? That's because the original CNI was meant to be used with rkt, which some of you may know; it was a container runtime, one of the other implementations we used to have. So it goes a long way back, and it's not really that native. Why haven't we just migrated it to YAML and proper CRDs?
Well, it just evolved that way. We plan to change that; new aspects of the spec are coming, and we want to keep things super simple. What happened here? Okay. Thank you. So, what's the deal here? As I was saying, this is super simple, just four primitives. It's still simple to write a CNI plugin at a basic level. There are even some examples of that: if you go to my GitHub and search for the KubeCon CNI, you can see I wrote an example CNI plugin for the KubeCon in Valencia. It does nothing, but just fork it and it gives you a fully working CNI plugin skeleton of your own. But it also has limitations, and in the end it's a two-trick, let's say, pony: ADD, DEL, and that's it. So now I'm going to hand back to Miguel, because I want him to walk you through a KEP, a Kubernetes Enhancement Proposal, that is currently in flight, which is about, okay, let's make this really Kubernetes native: we would like to use CRDs, we would like to use objects. There's a caveat to this, though, because the current proposal aims to be implementation agnostic. That means it could evolve in a way that totally ignores the current CNI, and we — or at least the CNI maintainers and CNI developers — would like to avoid that, because we would like it to be backwards compatible and to work fully, so that CNI gets evolved rather than, you know, substituted. And also, if some of you are working in support: if this goes forward, it may mean having to do some migration path anyway. So I'll let Miguel explain a bit about this proposal, and we'll get back to CNI later. Okay, next slide. So I'm going to explain the current state of this Kubernetes enhancement proposal for multi-networking. The first thing about it: it's actually being split into three different proposals. The first focuses exclusively on use cases. It's quite a thorough list of use cases that takes into account lots of things we think are missing from the original specification. For instance, it has as use cases things like hardware devices, and it mentions hot-plugging into the pod, which basically means you'd have to react and introduce a new interface into a running pod. That's something that's not very Kubernetes-like, but there are use cases that require it. The second one is about defining the API — that's the one we are currently on — and finally there would be a final KEP that would introduce the actual code changes. As Daniel said before, this will be implementation agnostic. So every time in these discussions you ask for something specific — for instance, you want to plug some runtime information into a pod: a pod requests a particular IP, or a pod requests a particular MAC address — the reply usually is that the proposal is implementation agnostic, so it might not care about that, it's not about one single implementation, and it seems to be a little bit cloud focused. But on the good side, if you look at this, there are a lot of things you get for free. You get a Kubernetes-native way to interact with the network plugin. You get a Kubernetes-native way to have multiple interfaces on your workloads. This means that the entire ecosystem I showed in the beginning, with Multus and all that — half of it is actually not required. And you have things like dynamic interfaces featured in the Kubernetes enhancement proposal.
Hardware-backed devices are also mentioned as an objective, which would simplify the original diagram by a lot — half of those boxes would disappear — and with them the complexity of the solution. And finally you get native integration with things like network policies and services, which are of course native to Kubernetes. I'd just like you to take a look at this. Here on the left is what the Multus CRD looks like: you give it a YAML with a packed JSON string in it where you can put pretty much anything, so if you forget a comma in there the thing will not work. It's unparsed JSON stuffed inside a YAML. It really looks bad, it's error prone, and it's hard to get right. While on the right, what you have is plain YAML — simple, very easy to understand, and obviously easy to get right. Now, another thing that happens nowadays with CNI: whenever you have to address one of its shortcomings, for instance reconciling resources — if you are an IPAM plugin, you manage IP addresses, you need to reconcile those IP addresses — this is done case by case. Every plugin you have will need to do that itself. If you want to have dynamic interfaces, you need to find yet another way to do that. For instance, what we did was create a controller that looks at the annotations of the pod, sees an annotation change, computes the delta between the interfaces the pod has now and the interfaces you want it to have, and then either adds or removes interfaces (a small sketch of that delta computation follows below). So we had a new controller, we had to redesign Multus so it could receive more inputs — it's a huge amount of work whenever you need CNI to do something it was not built to do. Again, the one-trick pony knows one thing; this one actually knows two, but that's all it does. For instance, for SLAAC there's no way for you to do this natively, and there are people who are after these things. Now, not everything is bad: you have the upcoming CNI 2.0. As I said in the beginning, all you have right now is a list of issues, but you can weigh in on them, you can give your opinion, you can try to raise their priority, and there are lots of things being considered right now that will make this easier and better. It is considering daemonization, so instead of it being a binary file sitting on the host file system that gets invoked, it would run in a pod managed by a DaemonSet.
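Here is the small sketch of that delta computation I mentioned — just the set logic, with made-up interface names; this is none of the actual Multus or controller code:

```go
package main

// Simplified sketch of the "hot-plug" delta idea: compare the interfaces the
// pod currently has with the ones requested in its updated annotation and
// decide what to add and what to remove. Names are illustrative only.
import "fmt"

func diff(current, desired []string) (toAdd, toRemove []string) {
	cur := map[string]bool{}
	for _, i := range current {
		cur[i] = true
	}
	des := map[string]bool{}
	for _, i := range desired {
		des[i] = true
		if !cur[i] {
			toAdd = append(toAdd, i) // requested but not present: plug it in
		}
	}
	for _, i := range current {
		if !des[i] {
			toRemove = append(toRemove, i) // present but no longer requested: unplug
		}
	}
	return toAdd, toRemove
}

func main() {
	current := []string{"net1", "net2"}         // what the pod has now
	desired := []string{"net1", "net3", "net4"} // what the updated annotation asks for
	add, remove := diff(current, desired)
	fmt.Println("add:", add, "remove:", remove) // add: [net3 net4] remove: [net2]
}
```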
Daemonization is a lot better: it gives you proper lifecycle methods, it will be easier to deploy, and of course it's Kubernetes native. You'd get an enhanced lifecycle — there's a particular issue for this — and the most important thing here is that they're planning on adding a garbage collection verb. So instead of you having to write a controller to reconcile your IP addresses, for instance, you'd have a verb in CNI that does that for you, which would simplify your life quite a lot. Again, this is just an idea right now; it's being planned. Another thing: plugin events. This is actually a way for CNI to feed information back into Kubernetes. For instance, let's say CNI sees that your pod got a new IP because of SLAAC; it could report that new IP address to the kubelet. And finally, device interactions. Okay, as conclusions: what we have right now is that this multi-networking enhancement proposal is very strong in terms of use cases. It basically addresses all the current limitations of CNI, at least as we see them, and it pretty much covers the entire Multus feature set: hardware devices, being native to Kubernetes, and adding more than one interface — come on, it's multi-network. All of this put together means that, well, CNI 2.0 had better be good. It has one chance to do things right, or it will probably become extinct, or at least that's what we are concerned about. And you can help: you can give feedback on the existing issues. If you, as a CNI developer, think your life could be easier in some way, just comment, ask for it, and it will quite probably be taken into account. And yeah, we have no doubt this is where we are: Kubernetes users will go on and live a long, happy life, but CNI is freezing in the water and will probably die. The fun thing is that there's probably room for everyone; they can still fit together and find a way to be happy. And yeah, thank you. Any questions? Oh, that's very important: there's a meeting, roughly bi-weekly, so about twice per month, of this multi-network community around the Kubernetes enhancement proposal, and all the KEPs are online; you can look at them and comment, so you're welcome to do that. If you have any questions, now is your time, and if not, well, thank you for your time. Oh, sorry, no, totally true — I need to repeat the question. The question was very long, and I'm really sorry; I would really hope we could make this a bit more interactive, because there's one thing I notice: if you ask me that question, it's because I did not explain this properly. I'm not saying that this KEP has anything to do with CNI 2.0; those are two different things. CNI 2.0 is on one track, trying to do something, and at the same time this KEP is trying to do kind of what these guys are also doing, but in a native way, and they're trying to specify it. So both of these efforts are trying to address the shortcomings of what you have right now, and one of them is native while the other one is an improvement over what you have nowadays. So, would you like to rephrase your question with this in mind? Because I'm not saying the KEP is about CNI 2.0; it's really not. Yeah, so the question is whether we are scared of the multi-network KEP, given that we sound pro CNI 2.0. The thing is, I think there's room for everyone involved.
First of all, I really think the advantages you get from this KEP are real, and that's the real way to move things forward. What we want is for CNI 2.0 to stay relevant, and we think it can still be relevant even if this KEP lands, because part of this multi-network KEP's power is also its weakness, or the other way around: it's implementation agnostic, so it could end up being just a wrapper over the existing CNI, or it could be something totally different that replaces it or ignores it. You could combine them, it could be a wrapper — there are plenty of ways this could go. Second — wow — so the question is: this multi-network KEP sounds ambitious enough, and traumatic enough to the existing code base, that you would actually need to bump a major release of Kubernetes. And the answer is: I really don't know. It might. In the meetings we've attended that was never addressed, as far as I remember, not at all. Then again, as far as the default network goes — actually, I think the answer is no, because the kubelet only cares about one interface, right? That interface will be configured the same way; this proposal just gives you the ability to add more things on top of it, so you can preserve exactly what you have today. I guess that eventually they will adopt the common, or pod default, network as just one of these multi-networks — I guess that's the overall direction they're going — and once that happens, it might make sense to do what you're saying. Plugins — do you want to take it? I really don't know. So the question is: with this KEP, will hardware devices still use the device plugin to grant the pod access to the physical network device? We really don't know, because that is only listed in the use cases. But I do think the device plugin still has a role to play. Probably the thing that Multus does nowadays, acting like a man in the middle to instruct the CNI plugin which device was allocated — this implementation-agnostic thing will probably have to do something quite similar, but we really do not know. It's listed as a use case, so it will probably be addressed; when and by whom, we don't know. Thanks for your time. Okay, so hello everyone, good afternoon. My name is Miguel Duarte. I'm here with my colleague Kike. We both work for Red Hat in the OpenShift virtualization networking team, and we're here to present a talk titled "KubeVirt VMs all the way down: a customized networking solution for the Cluster API provider KubeVirt". Okay, so before all that clicks into place and you understand what we're talking about — because we really don't know how savvy you are in KubeVirt, the Cluster API provider, and all that — we're going to introduce these three projects: KubeVirt, the Cluster API provider KubeVirt, and OVN-Kubernetes. Once we have a common understanding of those three projects, we can explain our motivation — why we care about this and what problem we're trying to solve — and the goals for the network plugin we want to develop. After that, Kike is going to walk us through the implementation details of this solution and show us a demo of it. Okay, so the first thing: we're going to introduce KubeVirt. KubeVirt, first of all, is a Kubernetes plugin. It allows you to run virtual machines and pods on the same platform. It essentially runs a libvirt/QEMU process inside a pod, and that's pretty much what it does.
The tricky thing here, if you spend a few seconds thinking about it, is that you have a virtual machine, which is inherently a stateful entity, scheduled and running inside a pod, which is essentially a stateless entity on the cluster. Those two things together will make for some tricky situations later. And one last thing we should keep in mind in this scenario: the networking requirements for virtual machines are a lot tougher than the ones for a pod, mostly because of live migration. That's the feature we will live or die by; live migration is the bread and butter of this presentation. So let me introduce this project, the Cluster API provider. Cluster API is something that, in their own words, provides a declarative, Kubernetes-style API for cluster creation, configuration, and management. All this means that the same thing you can do with, I don't know, Ansible or Terraform or whatever to provision a new cluster, you can do with this tool, and it will hand you a new Kubernetes cluster. It has different types of providers — AWS, Google, Azure — and there's also a particular provider, the one we care about, which is KubeVirt. This means the cluster you get is implemented using KubeVirt virtual machines as the Kubernetes nodes. This begs the question of why you would want to do this. One of the reasons is cluster scale: you can have one very dense, huge cluster with thousands of nodes and tens of thousands of pods, but I'd say that's really hard to manage and you won't see many of those. It's a lot more common to have a lot of smaller clusters interconnected between themselves. And our use case for this is having a cheap cluster provisioner that you can use for things like: you want to test your feature, or you want to test how your application survives a DNS upgrade or something — you just create a cluster, run your test, tear down the cluster at the end, and you're done. Finally, let's introduce the OVN and OVN-Kubernetes projects. OVN is essentially an SDN control plane that orchestrates a bunch of Open vSwitch instances running on your worker nodes. Its value proposition is allowing you to use higher-level abstractions than what you get from Open vSwitch: instead of managing OpenFlow directly, you manage things like logical switches, logical routers, and ACLs, and these are afterwards compiled into OpenFlow on the nodes of your cluster. So if that is OVN, we then have OVN-Kubernetes, which is a CNI plugin that provides an opinionated topology and essentially translates Kubernetes objects into OVN logical entities. Let's say you provision a network policy on your cluster: OVN-Kubernetes will translate that into a set of ACLs, and those ACLs will essentially be translated into OpenFlow that gets installed on the nodes. Same thing with, say, services. That's its task: translate from Kubernetes objects to OVN logical entities. Okay, with all these things in mind, we're good to go on to the motivation. Our thing is: we want to decouple the infra node updates from the tenant cluster VMs, using live migration. What do I mean by this? Remember that the Cluster API provider KubeVirt gives you Kubernetes clusters and implements their nodes as KubeVirt VMs — KubeVirt being a Kubernetes plugin — so you essentially get Kubernetes inside of Kubernetes. We call the topmost cluster the infra cluster, and the bottom ones, the ones being provisioned by this tool, the tenant clusters.
So let's say you want to upgrade your infrastructure cluster. We don't want that to impact the workloads of the tenants underneath; that cannot happen at all. For that we will rely on live migration the entire time, and essentially what we have today does not provide live migration; it simply does not give us what we want. That's why you see the wacky thing that Kike came up with. Why OVN in the middle of all this? Why should we go for OVN? Well, other projects like OpenStack are already using that technology, with really good results: with some improvements you get a migration downtime of around 100 milliseconds, which is extremely good, and we want to strive for those numbers. That's what we're aiming for. Okay, so we know what we want, but now we have to set explicit goals for our network plugin. The first thing is that the TCP connections established on the node — basically for the kubelet and for the workloads of your tenants — must survive the migration. Once the Kubernetes node, which is essentially a KubeVirt VM, migrates from one place to another, those connections must survive the move to a different node. Another thing: the IP and gateway configuration on that worker node must remain the same; it cannot be updated during the migration. Why? Well, for instance, the kubelet is bound to that IP address. If the IP changes, the kubelet will basically go bananas and your workloads will be impacted. Another goal we have is that a tenant cluster cannot access anything on another tenant cluster unless that tenant exposes it via services. Also, a tenant cluster cannot access anything in the infrastructure cluster unless it is also exposed via a service, and we need to do that for two types of services: NodePort and LoadBalancer. And now I'm handing this over to Kike. So, hello, my name is Kike, I'm a software engineer working on KubeVirt networking, and we have tried several approaches to get some kind of live-migration proof of concept. What we are going to see right now are the big points we need, implementation-wise. What we need is to implement migration on the cluster's default network — not on a secondary network, and not with the multi-homing feature coming in OVN-Kubernetes, but on the default network; we will see later why we want it there. We also don't want to set the IP address on the pod: we want to bypass the networking inside the pod as much as possible and pass all the IP information to the VM, so the pod is kind of not in the middle. For that we configure DHCP options from OVN on the logical switch port, which means we essentially prepare a DHCP server so the VM can consume this IP configuration. Also, we copied a mechanism from Calico: they use something called proxy ARP. What it means is that on some ports of the topology you can configure a parameter so the port answers ARP for a foreign IP address, one that doesn't correspond to the subnet at that level. This is how we implement ARP for the default gateway, so the VMs always see the same default gateway independently of the node where they are running, and the neighbor cache is exactly the same. Okay, so now we're going to look at the topology of the north-south communication. The most northern part is going to be exactly the same before and after the migration, but the important parts here are these IP addresses, here and here.
These are the addresses we use to redirect traffic, to do the point-to-point routing during the migration, as you see here in these tags. Next we see the lower part of the topology — the previous slide was the upper part — and this is where the point-to-point routing is done. For that we have two important OVN resources to configure, for egress and for ingress. For egress we have something that in OVN is called a policy: we say, okay, for traffic coming from this IP, we want it to go via this node, with this IP — and this IP is the same address I pointed out on the previous slide. Then we have another thing, which is a static route, and we use it for the ingress direction, so it's kind of the opposite: if traffic is destined for this IP address, it goes via this port. Another thing we configure in the topology is the ARP proxy, and, as you can see — this is very important — the ARP proxy is exactly the same on both nodes. With this, the VMs have exactly the same default gateway regardless of the node, the neighbor cache stays the same, so there is no need to update the neighbor cache, and the downtime after live migration stays low. The idea is to keep the network configuration as stable as possible across live migration. Okay, another important part, as we said, is the configuration of the DHCP options. This is OVN terminology for starting a kind of DHCP server: what it does is serve the IP configuration to the VMs over DHCP. And another important detail here: the port — the interface — is not configured at all, so this is just L2 communication. The interface doesn't make, let's say, any noise during live migration; it's just an L2 link between the VM and OVN, and the VM receives its IP address over DHCP. Okay, then let's look at how this bottom part of the topology looks after live migration. As you see, it's basically a mirror. The only thing that changes, for the egress traffic, is this IP, which means: okay, now, on this node, I want my egress traffic to go via the node that has this IP address. And the same for the ingress traffic: we keep the same IP, but we say, okay, now redirect this to the correct node. And that's the topology. Now we're going to do a real demo. I'm going to explain it with a couple of slides, very quickly. The demo we're doing is super simple, but maybe it's good for illustrating things. What we have is a pair of nodes in the infra cluster and the worker VMs. We have a client pod that opens one TCP connection to a server running a very dumb little program we put there — the "tcp proof" thing. We just open one connection, and if it gets broken, the test fails, so it's easy to see whether the TCP connection survived, which is very important for us. Then what we do is, in the infra cluster, the VM's pod goes from one node to the other, and we watch the TCP connection. All right, so let's go to the demo. Okay, before I start, let me explain what we have here. The first part is the latency between request and response on the client. And in the bottom part, what we see are the two pods that implement the migration in KubeVirt, because in KubeVirt, as Miguel said, all the VMs are backed by a pod, and during a live migration you have one pod on one node and another pod on another node.
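Before the migration actually runs, here is a recap of the OVN pieces from the last few slides — the egress policy, the ingress static route, and the DHCP options — sketched as ovn-nbctl invocations driven from Go. Every router name, port name, address, and priority in it is made up for illustration, and the actual proof of concept does not shell out like this; it talks to the OVN northbound database directly:

```go
package main

// Rough sketch of the OVN configuration described above, expressed as
// ovn-nbctl calls. All names, IPs and priorities are invented placeholders.
import (
	"fmt"
	"os/exec"
	"strings"
)

func nbctl(args ...string) string {
	out, err := exec.Command("ovn-nbctl", args...).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("ovn-nbctl %v: %v: %s", args, err, out))
	}
	return strings.TrimSpace(string(out))
}

func main() {
	vmIP := "10.244.1.5"   // the VM (tenant node) address that must not change
	nodeIP := "100.64.0.3" // address of the infra node currently hosting the VM

	// Egress: a logical router policy that reroutes traffic *from* the VM IP
	// through the node that is currently running it.
	nbctl("lr-policy-add", "cluster-router", "1004",
		fmt.Sprintf("ip4.src == %s", vmIP), "reroute", nodeIP)

	// Ingress: a static route so traffic *to* the VM IP is sent to that same node.
	nbctl("lr-route-add", "cluster-router", vmIP+"/32", nodeIP)

	// IP configuration for the VM itself: a DHCP options row attached to the
	// VM's logical switch port, so the guest gets its address over DHCP and
	// nothing is configured on the pod interface.
	opts := nbctl("dhcp-options-create", "10.244.1.0/24")
	nbctl("dhcp-options-set-options", opts,
		"server_id=10.244.1.1", "server_mac=0a:58:0a:f4:01:01",
		"lease_time=3600", "router=10.244.1.1")
	nbctl("lsp-set-dhcpv4-options", "vm-port", opts)

	fmt.Println("configured point-to-point routing and DHCP options for", vmIP)
}
```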
There is one moment where they transfer the state from one pod to the other, and then one of the pods dies and the migration ends. So we will see the latency here, the KubeVirt pods here — one pod on one node, the other on the other node — and here we see the state. Okay, let's start. Okay, now the latency starts, and at some point we'll trigger the migration. All right, and then you see here in the status that the target pod on the target node is running. Now they are transferring the state between the two nodes, using the libvirt mechanisms, like memory pages; they are communicating. And now the old pod on the old node goes not-ready, and the migration is done. What's happening here is that this proof of concept is not perfect, but what we wanted to achieve is that the TCP connection is kept, which is good enough for us. It's not perfect, but it's good enough. The rest of it is just that we do another live migration, so we can also see the same IP address, all right, and the same thing happens but in the opposite direction, so we're back where we started. I know it's not perfect, but it's good enough for us for now. We have some ideas to improve it: for example, in OpenStack they use something they call multiple requested-chassis, and we'll see what we do with that. And that's it. Okay, conclusions. We explained why we're doing this on the default network instead of a secondary network, where we'd have more freedom to change things: well, in the world of tenant clusters we need to use a lot of Kubernetes mechanisms to implement communication with the API server that runs in the infra cluster, the management cluster, so by using the default network we get a lot of stuff for free — we have access to services, we have isolation, we have network policies, we have all of this. Then we have seen that, using point-to-point routing on the primary interface, we can make the TCP connection survive and keep a consistent IP address that follows the VM during the migration. And with these points, what we've discovered is that we now know how this proof of concept works, and we can start to implement and improve it little by little. And that's it. Questions? I mean — so the question, if I understand correctly, is about what happens if you try to access the pod during live migration, right? It depends. If the connection is already open before the live migration, as we've said, the connection is not going to break, but you are going to see some extra latency on the packets. If you try to establish a new connection during the live migration, it's possible that your client will retry until it can establish the connection, so it's something like half a second right now. I know it sounds super bad, but it's just a proof of concept; that's how it behaves today. Okay, thank you. And no, because in OVN — sorry, in OVN-Kubernetes — you have different logical switches per node, so the L2 traffic doesn't escape the node. The default gateway can have the same MAC address and the same IP address because L2 is cut at the node; it doesn't traverse to the other node, since each node has its own switch. Even if the ARP is the same — when you use the default gateway, what really happens is that the destination MAC address gets replaced — that never reaches the other node, because L2 stays confined to that node.
And you have the distributed router on top of it, so it makes sense. Yes — so the question, if I understand it correctly, is that this feels a little like we are breaking what is expected of Kubernetes networking: you have two different pods, they should have different IP addresses. We are not exactly breaking it; that's why we use the point-to-point routing, because, let's say, the pod ends up on a node that only understands its own subnets, and that's why we need these mechanisms. We are, I don't know how to put it, making it more flexible, bending it a little. It's not a general-purpose pod; it's for a very specific thing, a backend for KubeVirt. I don't know if we'd use this for anything that isn't KubeVirt, but we know people are happy about it, because we prefer to have live migration, even if we have to be a bit liberal about this one point. Let's see. Okay, anything else? Okay, thank you. I am an engineer working in the OpenShift networking team at Red Hat, and this is Patrick, my colleague; he works in the same team, and we are here together today to talk about OVN-Kubernetes, which is the default CNI in OpenShift from the 4.12 release. This is the brief agenda we'll be covering today. We will start with the basics of Kubernetes and OpenShift networking. Can we have a show of hands — who has used OpenShift before? Yeah, that makes my job easier, then. Once we finish those basics, we'll go on to what OVN is and what OVN-Kubernetes is, and then why we moved to OVN-Kubernetes — and I realize I have not explained what the acronym OVN stands for, so we will also look at that in a minute — and what the differences are between OVN-Kubernetes and its predecessor, the legacy CNI plugin, OpenShift SDN (software-defined networking); we will look at what that plugin is too. Finally, we'll talk about how this all works: how OVN works and what all the components under its hood are. We'll try to show a live demo of how everything works together and how you get networking on your cluster. So, Kubernetes networking. The fundamentals are: every pod must have a unique IP, so you must allocate an IP to a pod that's unique across the entire cluster. The pod must be able to talk to other pods in the cluster, to the nodes in the cluster, and even to external entities outside the cluster. So basically, the Kubernetes networking model says that you need to have networking set up as soon as the pod comes up, and that's the baseline. However, we all know that Kubernetes is just a bunch of APIs, so it does not really provide a networking solution by default, and that is where we need pluggable network plugins that can work with a Kubernetes cluster. Such plugins, which provide networking to a Kubernetes cluster, are known as CNIs — Container Network Interface. CNI is an upstream community that defines a set of standards for how a CNI plugin is supposed to be written, and OVN-Kubernetes is written to those CNI specification standards, so it is one such CNI plugin that provides networking in a Kubernetes cluster. There are plenty of others out there, like Calico, Cilium, and Antrea, that you might have heard of; OVN-K is just another CNI alongside them. OpenShift networking: like I mentioned, OVN-Kubernetes is the default CNI in OpenShift. OpenShift is Kubernetes plus a bunch of operators, and even for networking we use an operator. It's called the Cluster Network Operator, and it is in charge of deploying all the DaemonSets and Deployments that are required for the network infrastructure to come up.
The CNO supports two plugins, like I mentioned: the legacy one, which is OpenShift SDN, and the new, shiny one, OVN-Kubernetes, which is the default from the 4.12 release. Between the CNO and the specific CNI plugins we have another piece, which is Multus, and Multus is what we call a meta CNI: it allows a kind of inception, because it invokes other CNI plugins — it is a CNI that invokes other CNIs. Let's say, for example, you want to set up multiple interfaces on a pod. You can invoke the OVN-Kubernetes or the SDN CNI to configure the primary interface for the pod, and then you can have a secondary interface using SR-IOV — maybe you want a secured, isolated connection to storage — so you can use another plugin, and Multus lets you call multiple plugins to get networking set up for a pod. These are some of the components in OpenShift that are involved in networking, and we thought this slide was pretty cool because it shows all the CNIs that are officially supported by OpenShift. The first two are the plugins that Patrick and I — our team — manage; the rest are third-party, vendor-supported CNI plugins. Coming to OVN-Kubernetes: what is OVN-Kubernetes? Let's start by expanding the acronym OVN — but before I move to OVN we need to look at SDN, because I told you that we are moving from SDN to OVN in OpenShift, and that is our story; we are here to see why that happened. So let's take a look at SDN first. SDN stands for software-defined networking. It's the legacy network provider in OpenShift, it is also written to the CNI standards, and it uses two main networking technologies: one of them is iptables, and the other is OVS, Open vSwitch, a virtualization technology that gives you a multilayer distributed virtual switch — you get a virtual switch on each node in your cluster, to put it simply. Using OVS, SDN provides networking for a pod: it hooks up and creates the necessary entities, the flows, everything you need to get the pod up and running on the networking side. But we also have services in Kubernetes, right? Cluster IPs, NodePorts, LoadBalancers. How are they implemented in SDN? That's where iptables comes into play: SDN uses iptables — a lot of nested iptables rules — for setting up services, and OVS for setting up pod networking, and with these two technologies it manages the entire networking lifecycle of a pod in a cluster. Another interesting point: when you have pods spread across nodes, you need them to be able to communicate with each other, and pod-to-pod communication across the cluster, often known as overlay networking, is achieved in SDN by using VXLAN, the Virtual Extensible LAN protocol, and this carries the east-west traffic. Now that we've seen what SDN is, let's move towards OVN-K, which is the star of our show. It's an open source project which provides a robust networking solution for Kubernetes clusters. It has OVN — the Open Virtual Network stack — on top of OVS, and that's the extra piece here, which is really what you should take away: SDN also uses OVS, like I said, but what OVN-K does extra is bring in this new piece called OVN on top, an abstraction layer over OVS that lets you create simple logical network constructs — it maps things like nodes to routers and switches.
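To give a feel for what that mapping means in practice, here is a tiny sketch of the kind of translation just mentioned — a node becomes a switch plus a gateway router, and a pod becomes a port on its node's switch. The type names and values are invented for illustration; these are not the real OVN-Kubernetes types:

```go
package main

// Illustrative-only sketch of the Kubernetes-object-to-OVN-entity mapping.
import "fmt"

type LogicalSwitchPort struct {
	Name string
	IP   string
	MAC  string
}

type LogicalSwitch struct {
	Name  string
	Ports []LogicalSwitchPort
}

type NodeTopology struct {
	Switch        LogicalSwitch
	GatewayRouter string
}

// translateNode: one switch and one gateway router per node.
func translateNode(nodeName string) NodeTopology {
	return NodeTopology{
		Switch:        LogicalSwitch{Name: nodeName},
		GatewayRouter: "GR_" + nodeName,
	}
}

// translatePod: a pod turns into a port on its node's switch.
func translatePod(topo *NodeTopology, podName, ip, mac string) {
	topo.Switch.Ports = append(topo.Switch.Ports, LogicalSwitchPort{
		Name: podName, IP: ip, MAC: mac,
	})
}

func main() {
	node := translateNode("worker-1")
	translatePod(&node, "default_web-0", "10.244.1.7", "0a:58:0a:f4:01:07")
	fmt.Printf("%+v\n", node)
}
```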
That makes it easy for administrators, infrastructure developers, and even users to see virtual entities that look very much like a traditional networking stack — what you see day in, day out as a network administrator. That layer of abstraction in OVN-K gives you a huge advantage, whether you're a developer or an operator; that's key, and in fact one of the main reasons we moved. Another subtle difference is that we use Geneve as the tunneling protocol, because it offers some advantages over VXLAN — for example, its header can carry more information. If you were to compare these two technologies at a high level: both of them use OVS at their base; both of them have an IP address management controller — don't get too worried, it's just an IPAM controller that allocates an IP to each pod in your cluster, so pod IP allocation is centralized; both of them ultimately use OVS flows to implement network policies or whatever else you want; they use either VXLAN or Geneve for communication across nodes, but that's just a subtle detail; SDN uses iptables rules, like I mentioned, for implementing services, while OVN-K uses OVN abstractions and constructs — and I keep using the term "OVN construct" over and over, we will see what that is in a moment; and NAT, network address translation, which every Kubernetes cluster needs so that pods can talk to the outside world, to the internet — this is achieved again using OVN constructs in OVN-K, versus iptables SNAT rules in SDN. Now that we've seen the what, let's look at the why: why did we move to OVN-Kubernetes? OVN is the new abstraction layer — I think we already mentioned this — and it allows for easier development: as developers, if you want to add new features on top, we have a whole engine that lets you create these abstract constructs in your network topology, and that helps you add new features easily; versus SDN, where you had to touch OpenFlow flows directly, which is complicated, requires deeper understanding of yet another technology, and made things convoluted. The other thing is the flexible and expandable architecture that OVN-K gets from OVN: you're just connecting nodes across the cluster using routers and switches — we will show this in detail in the demo — versus SDN, where you were using iptables for services and OVS flows, which was too restrictive; you could not do many of the things you'd want to do for telco use cases. The final thing is open source: SDN was created specifically for OpenShift, and we wanted to take our project in an open source direction. OVN-K has contributors who are not just from OpenShift, not just from Red Hat; we have a vibrant upstream community, and we intentionally wanted to move in that direction. Those are the three wins I'd say we get out of OVN-K. And finally, the how part of our story: how does all of this actually work under the hood? As with any component in a Kubernetes cluster, the networking component has some parts running on the control plane nodes and other parts running on the data plane. The control plane parts — the pods in violet or purple here — are the ovnkube-master, that's the OVN-Kubernetes component, and the pink ones are details you can even ignore, but they are the core OVN pieces; they are the bits that provide the abstraction I mentioned.
What the ovnkube-master does is mainly two things. One, it hands out IPs to pods — IPs across all the pods in your cluster. Two, it continuously watches all the Kubernetes API objects being created, translates each of them into OVN logical entities, and stores those entities in a database. This happens in a centralized manner on the three control plane nodes. Once these entities are created, OVN and OVS do their magic: they take all these entities, do the necessary plumbing, and create the flows that are required for networking to function on the cluster. We talked about OVN constructs, and this is where they come in — the logical entities. Take a node in a Kubernetes cluster as an example: when a node is created, the corresponding thing ovnkube-master does is translate that node into a specific switch, a router, some policies; it does that abstraction for you. You have a switch for this node, now I need a router for this node — and then imagine multiple nodes in the cluster: you're going to have multiple routers, all connected by switches, so you get a whole logical network topology at the end of the day, very similar to what you'd see in traditional on-prem networking. Another example: a namespace, a very common entity we create. A namespace is just a bunch of pods, and we might want to track all the IPs of those pods, so we create a logical entity called an address set, which is a collection of IPs, and that's used to implement specific features. If you take a pod as an example, a pod is translated into a logical switch port — a port on a switch. There's a switch created for each node; all the pods on that node are ports hooked to the same switch, and the switches on different nodes talk through a router. It's basic networking. The same goes for services and endpoints: each of them is translated into entities called load balancers, and Patrick will talk about those in detail. Those are the things happening on the control plane. Once the control plane aspects are dealt with, on the data plane — running on each worker node in the cluster — there is also a lot going on. Like I mentioned, the master has allocated an IP to the pod. Once the IP is allocated, the CNI side — the actual binary that handles the CNI ADD event — lives on the node. The ovnkube-node pod is continuously watching for the IP annotations that the master writes, and once the master has finished its job, the node takes over: it creates what the pod actually needs, setting up the plumbing inside OVS that is required for the pod to have its networking up, and so on. The CNI executable commands, done as per the CNI specification, are all executed by the node component. And on the node we also have those pink magic containers that translate the entities we created in the database on the control plane into what actually runs on each node — I'm simplifying here. Now that I've covered the fundamentals and the theory, we go to the core part, the demo, where Patrick is going to show everything I said, but on an actual live cluster. Over to you, Patrick. Hi everyone, my name is Patrick, I'm from Red Hat, and I work with Surya on OVN-Kubernetes and OpenShift networking in general.
What we have prepared here is how everything Surya described ties into an actual cluster. For our general upstream development we use kind, so in the recorded demo I'll first clone our upstream repository, which all of you can easily do, and then we'll move on to actually starting up a cluster. As I said, we are using kind — kind is Kubernetes in Docker — so for every container that Docker creates you will see a Kubernetes node. We've prepared, in our upstream repo, a very neat script that allows you, with one command, to create a kind cluster that actually uses our CNI, OVN-Kubernetes, as its networking. As you can see, there is a ton of options you can choose from for scenarios like IPv4, IPv6, dual stack, and all the different features available in OVN-Kubernetes. To set up the cluster you just need to run one command. It takes around five minutes; I'll fast-forward here so we don't waste too much time. As you can see, we build all of our containers, we set up the Kubernetes cluster, and at the end we apply simple YAML files that take care of deploying OVN-Kubernetes into a cluster that initially doesn't have any CNI on it. Okay, just tying it back to kind, in case anyone is not aware: as you can see, we have three nodes — ovn-control-plane, ovn-worker, and ovn-worker2 — and these are of course represented as Docker containers. We can exec into any of those Docker containers, and you can see that there is CRI-O and there are additional containers running on our worker node. Now, moving on to OVN-Kubernetes: as Surya said, we have a control plane and a data plane. The two top pods contain our control plane components. One contains the OVN databases that we use to store the logical entities — that's this one here — and the second is ovnkube-master, the main piece that does the translation between Kubernetes and OVN, so it's the component that creates logical entities in our NBDB, which I'll talk about more in the next slide. Of course we also need to provision the data plane components. The next piece is ovnkube-node: this is a DaemonSet that runs on every node in the cluster — you can see that in the nodes section — and it contains the ovnkube-node pod, which runs the actual CNI. So when a pod gets created, the kubelet calls the container runtime, which then makes a call to the CNI, and we take that call and provision pod networking for every pod on the cluster network. Additionally — and this is upstream specific — you can see that there are OVS node pods; these run OVS itself upstream, and they are responsible for setting up the actual flows and applying them to actual packets. So after all these layers of abstraction, coming from Kubernetes down to OVN, we configure flows that get applied to every packet that flows through the cluster. This is not something we have downstream; downstream we have a systemd service that runs on the host instead, but pods work fine for a simple kind cluster, since the nodes are just Docker containers. Also, please note that we do not have HA here — I tried to keep it simple for demo purposes — but of course you can enable it, and then all of the control plane components would be replicated to provide HA. The keen-eyed among you will notice that all of these pods are actually on the host network. That's because we are the CNI: there is no CNI underneath to provide us with any cluster-network-level networking.
That's something you will often see in OVN-Kubernetes: our data plane pods and control plane pods are on the host network, and you can see that they share the same IPs as the nodes. Let's get back to the slides for a second and talk about how we represent a cluster in OVN. Okay, as Surya said, the cluster is represented with a bunch of switches and routers, and that's it — it's very simple to grasp if you know what a switch and a router are. The dotted boxes represent each node in the cluster; here we only have two, but everything inside a dotted box you would see repeated on every node of your cluster. Up top you can see the underlay network. This is the network that connects the nodes and provides external connectivity to the cluster; it's not something we configure, it's provided to us, and we use it to join the nodes. You can see here that every node in the cluster gets a host subnet: this is a subset of the whole cluster network, it's specific per node, and all of the pods on that node will only get IPs from that host subnet. Starting at the bottom: as Surya mentioned, every pod gets its own logical switch port. During pod creation we create a veth pair; one end gets connected to the pod's network namespace, and the other gets plugged into an OVS bridge, which is then represented as a logical switch in OVN. This gives us the advantage that if two pods are on the same node, they use that local logical switch, so there is no need for the traffic to traverse node boundaries. Now, if the pods are on different nodes and need to communicate, they need something that connects them, and for that we use the central cluster router. This is a single router — there's only one per cluster — which gives us a way to connect logical switches from different nodes, and additionally it gives us a path to north-south connectivity. If a pod wants to talk to the internet, it goes through the gateway router. The gateway router is a node-specific router, and it is responsible for providing the ingress and egress functionality into and out of the cluster. As you can see at the top here, it actually takes over the default interface on every node: when you provision OVN-Kubernetes, the interface that was the default interface connecting the nodes gets taken over and managed by us, and thanks to this we can provide external connectivity and direct traffic as we need to. One note I wanted to make here: if your pod is on the host network, which means it doesn't need a cluster network IP, it's not going to be represented in OVN, and most of its traffic flows to external entities will not use OVN. Thanks to that approach, our CNI can set everything up before we start handling pods. Let's see how all of that looks in an actual cluster. I've prepared a trivial YAML file that contains a client pod and a service backed by a deployment of three replicas. I did that to showcase the most minimal setup you could use to see all the components OVN configures to provide service and pod-to-pod connectivity. So how do you actually see what's there? We applied the pods, we can see them here, and they are all running. Now, to look at the logical representation in OVN, what I do is go into one of our pods.
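Coming back for a second to the host-subnet idea from the topology slide, here is a tiny sketch of carving per-node subnets out of a cluster CIDR, so every pod IP on a node comes from that node's chunk. The CIDR values and node names are placeholders, and this is not the actual allocator OVN-Kubernetes uses:

```go
package main

// Toy illustration of per-node host subnets: split a /16 cluster network into
// /24 chunks and hand one to each node. The CIDR values are placeholders.
import (
	"fmt"
	"net/netip"
)

func hostSubnet(clusterCIDR netip.Prefix, nodeIndex int) netip.Prefix {
	// Bump the third octet of the /16 to get the nodeIndex-th /24.
	a := clusterCIDR.Addr().As4()
	a[2] = byte(nodeIndex)
	return netip.PrefixFrom(netip.AddrFrom4(a), 24)
}

func main() {
	cluster := netip.MustParsePrefix("10.244.0.0/16")
	for i, node := range []string{"ovn-control-plane", "ovn-worker", "ovn-worker2"} {
		fmt.Printf("%-18s %s\n", node, hostSubnet(cluster, i))
	}
	// ovn-control-plane  10.244.0.0/24
	// ovn-worker         10.244.1.0/24
	// ovn-worker2        10.244.2.0/24
}
```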
So how do you actually see what's there? We applied the pods, we see them here, and they are all running. Now, to look at the logical representation in OVN, what I do is enter one of our pods so we can directly interact with the database. For the remaining part of the demo you can treat it like this: the left side contains what's in Kubernetes, the right side represents what's in OVN, and we'll see what OVN configured to allow the connectivity between pods and services.

As I said on my previous slide, we have three nodes, and all of them get a node-specific gateway router that we use to provide external connectivity, basically. Additionally, as you can see, there is just one central OVN cluster router that gives us the possibility to connect pods, and the services backed by those pods. There are also logical switches: every node gets its own logical switch, and additionally we use external switches for the interface we take over from the host, and one join switch that basically connects the gateway routers to the cluster router.

OK, moving forward. On the left here you can see the pods in Kubernetes, where you can additionally look at the IP addresses that we've allocated and the nodes the pods are running on. Here I show the ovn-worker switch, and on that switch the first port you see is one of the backing pods for the example service I created. You can also see that on the same switch we have more ports that are not related to that deployment; all of the cluster network pods on a particular node will show up here. We of course have a port that connects us to the router, and a management port that provides cluster-network to host-network connectivity. The same goes for the other switch, which hosts the client pod and additionally one of the backing pods for our service. You can compare that to what we see in the Kubernetes world and it matches, and additionally we see the IP addresses and MAC addresses that we've allocated in OVN-Kubernetes.

Let's take one of the logical switch ports as an example. This is the object that we store in the database to represent a single logical switch port, and there is one interesting thing I would like to point out here: we are using OVN port security, which is a feature that allows us to restrict the traffic that's allowed on the switch port. Anything other than this IP coming from this particular pod won't be allowed, and the same goes for the MAC address. So if some rogue user enters the pod's network namespace and changes the IP address on the interface in that namespace, we won't allow that traffic; it will get dropped. That's a neat feature. Additionally, we can also see the port status here, which is useful if there are any issues with that port; here you can see that it's up, so it's functioning correctly.

OK, so we've talked about pod connectivity and how those ports are connected to a switch, but what happens if a pod wants to connect to the internet? What we do on the gateway router is create an SNAT rule, so we translate the source IP of that particular pod into the IP of the node. To the outside world, the IP that's egressing the node will be that of the node; that's the IP address we take from the default networking interface. And then we un-SNAT, so we translate it back, when the reply comes back from the external side.
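If you want to poke at the same northbound objects yourself, something along these lines should work. The pod and container names below are assumptions about a typical upstream kind deployment, so adjust them to whatever kubectl -n ovn-kubernetes get pods shows in your cluster.

# Open a shell next to the OVN northbound database (the pod and
# container names are deployment-specific; this is one common layout)
kubectl -n ovn-kubernetes exec -it deploy/ovnkube-db -c nb-ovsdb -- bash

# High-level logical topology: the central cluster router, per-node
# gateway routers, per-node switches, the join switch and the
# external switches
ovn-nbctl show

# Ports on one node's logical switch: pod ports, the router port
# and the management port
ovn-nbctl lsp-list ovn-worker

# Full record for a single logical switch port, including the
# port-security addresses and the "up" status
ovn-nbctl list Logical_Switch_Port <port-name>

# SNAT rules on a node's gateway router: pod source IPs are
# translated to the node IP for egress traffic
ovn-nbctl lr-nat-list GR_ovn-worker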
As I said at the beginning of this part, we've also created a simple service, and you can see the details of it here: it's a service with that IP address, listening on port 80, and these are the backing pods that are the endpoints for the service. Let's see how that behaves in OVN. In OVN we have a construct called a load balancer, and it basically load-balances traffic from this IP address on that port to all of the endpoints that we've defined in the service. The name of this entity matches the service that we've created. So as you can see, you can easily get the networking representation of what's actually in a Kubernetes cluster: you don't need to look at particular OVS flows, you don't need to dig through layers of iptables, and it gives you an inside view of how the networking is configured in OVN.

Of course, all of this logical representation of the network is not something that will actually handle your packets. What you can see here is a dump of the flows that we configure in OVS: every logical switch, every router and every NAT eventually gets translated into an OpenFlow rule that is applied by OVS and its module in the kernel. So it's extremely valuable, when you are developing features and doing it fast, to have an abstraction layer that lets you not deal with all of that, because it is really difficult to dig through all of the layers and steps that a packet takes when it traverses your cluster. And that concludes the demo.

Yeah, I'll just briefly go over the rest. OK, we do have some additional features in OVN-Kubernetes, but we unfortunately don't have time to go over them, sorry. You can check them out on the slides, or in our OpenShift or OVN-Kubernetes documentation. And that would be it, thank you all. Do you have any questions?

So the question is: is OVN better at throughput compared to OpenShift SDN? That's an extremely difficult question, because to answer it you would have to specify the exact traffic flow you use to measure that throughput. It's different when there are multiple hops between pods, when there are different traffic paths, and especially when you consider packet sizes, protocols and all that. At the lowest level, when everything is configured, it's OVS, so traffic will traverse the OVS data path, and then, based on the connectivity between the nodes, that will be the engine that handles the packet. OVN itself is an abstraction that in the end still configures OVS flows, so I would consider them comparable, but to get any detailed results you would have to measure; it's not easy to answer that.

We use Geneve, and with either VXLAN or Geneve there is an overhead. It's something you can alleviate with hardware offload, and as far as I know it's more difficult to get hardware offload for Geneve. But yeah, there is an overhead, and I wouldn't think Geneve is much better than VXLAN in terms of overhead, especially since we pump more data through Geneve, because we use the optional Geneve header options, which increase the packet size. Yeah, you just need to use jumbo frames and one traffic path and you'll be good. It seems we are out of time.
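For reference, here is roughly how the load-balancer and flow-level views from the demo can be reproduced. The service name refers to the illustrative manifest sketched earlier, and the pod and bridge names assume a default upstream kind deployment.

# OVN load balancers; one of them carries the ClusterIP of the
# example service and maps it to the three endpoint IPs
ovn-nbctl lb-list

# Compare with what Kubernetes reports for the same service
kubectl get svc backend
kubectl get endpoints backend

# The logical topology only describes intent; packets are handled
# by OVS.  Dump the OpenFlow rules programmed on a node's
# integration bridge from inside that node's OVS pod:
kubectl -n ovn-kubernetes exec -it <ovs-node-pod> -- ovs-ofctl dump-flows br-int | head -n 20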
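And as a rough illustration of the encapsulation-overhead point from the Q&A; the exact numbers depend on the IP family and the Geneve options in use, so treat these as ballpark figures.

# Back-of-the-envelope Geneve overhead on an IPv4 underlay:
#   outer Ethernet (14) + outer IPv4 (20) + UDP (8) + Geneve base (8)
#   + the Geneve TLV options OVN adds  =>  several tens of bytes,
# which is why OVN-Kubernetes defaults the pod MTU below the underlay
# MTU (for example 1400 on a 1500-byte network), and why jumbo frames
# on the underlay give the encapsulated traffic more headroom.

# Check the MTUs actually in use (on a kind node container and in the
# sketched client pod from earlier):
docker exec ovn-worker ip link show genev_sys_6081   # kernel Geneve tunnel device
kubectl exec client -- ip link show eth0             # pod-side MTU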