Ok, hello. Ok, one minute. Ok, thank you, thank you. Hello, welcome to DevConf 2023. Thank you so much for being here, and thank you to the organization for the opportunity. We would like to talk to you today about progressive delivery and Argo Rollouts. The idea is to review, more or less, the basics around progressive delivery with Argo Rollouts.

My name is Asier — Asier Cidon for Spanish speakers, Asier Cidon for English speakers, both are ok. I'm a senior cloud architect in Iberia, based in Madrid right now. I joined Red Hat four years ago — four and a half years ago — and I have been working for 12 years, more or less, on open source related projects. At the beginning I started as an infra guy, working with different infrastructure and automation solutions — for example OpenStack, OpenNebula, Ansible, Red Hat Virtualization. But in the last few years I have been spending my time working with containers, with Kubernetes, with OpenShift for sure, and with all the products and solutions on top of OpenShift for deploying microservice architectures. Well, today with me...

Hi, hello everybody, thanks for coming. I'm David Severiano. I'm a DevOps architect in Iberia, based in Madrid, like Asier. I grew up as a backend developer, mainly with Java applications, but now I'm pretty close to Kubernetes, CI/CD, DevOps. I'm very happy to be here and I hope you enjoy this presentation.

Well, the agenda for today is more or less simple. We will review some basics around progressive delivery and the Argo project, and then David will give us a demo — a real, real-life demo, ok? Fingers crossed. Then we will review the Red Hat OpenShift GitOps roadmap — spoiler: it's Argo CD — and then we will have the Q&A, ok?

Well, progressive delivery. Before starting to talk about progressive delivery, the important thing here is to talk about deploying new application releases, ok? That's more or less the key aspect of this presentation. I would like to review with you some basics around CI/CD, just to be sure we are all on the same page. Continuous integration, as you may know, is a development practice where developers integrate code changes into a shared repository, with the idea of including new features or something like that. You have an automated process to review this code — unit testing and so on — and the idea is to generate an artifact, ok? In Kubernetes, this artifact is an image, a container image. After that, you have continuous delivery and continuous deployment. The idea there is to take this artifact and promote or deploy it across all your environments — development, preproduction, production, for example. The general idea of these procedures is to deliver new product releases faster, improve customer satisfaction, and improve the innovation process. And from my point of view, another important thing is to be sure that our application complies with the required levels of quality and performance as well.

Well, what is progressive delivery? In summary, progressive delivery is an evolution of continuous delivery, ok? Basically, with progressive delivery you deploy a new version of your application and select a subset of production users to use this new version.
Then you need to analyze the behavior of this new version, and if everything is ok, the idea is to increase the traffic — the number of users — on the new version and decrease the traffic on the old or current version. You repeat this process until you have all the traffic redirected to the new version. That's more or less the idea of progressive delivery.

How is it possible to implement this strategy? You have two options: blue-green and canary. In a blue-green deployment, the idea is to deploy the new version in parallel, in offline mode. Why in offline mode? Because you need to test this new version — that's more or less the idea. When you are sure your new version is working fine, you move all the users from the current version to the new one, ok? At a specific moment. You have different pros and cons in this strategy. Regarding pros, for example: it's more or less easy to deploy the new version and to perform rollbacks, because you have both applications running in parallel with the same configuration. In contrast, you have to duplicate the compute resources, because you need to move all the users at the same time, and for this reason you need the same capacity on the new one.

Canary is a little different. You deploy the new version, select a subset of users — a minimal representation, I mean 5% or 10% of the traffic — and the idea is to redirect some traffic to this new version, analyze and observe the current behavior of the new application based on metrics or service level indicators, and if everything is ok, increase the number of users on the new version and then decrease the number of users on the old or current version. The idea is to make this change in steps — for example 10%, then 20%, then 50%, then 80%, then 100%. That's more or less the idea. Of course, you have pros and cons here as well. Regarding pros, you have zero downtime — that's the theoretical idea, more or less — and it's a cost-effective strategy, because you only have the replicas required for the actual traffic, ok? In contrast, you probably have some complexity in terms of managing all of these tasks — deploying the application, verifying the new version, managing the traffic. It takes time, because you need to perform several steps and analyze the metrics at each one. And you need backward compatibility, because you have users on both versions using the same database, for example, or sharing other resources.

Well, what is Argo? Argo is an open source suite of projects that helps developers in their day-to-day software operations, ok? It's a trendy project, I think. Argo was born five years ago, and it joined the Cloud Native Computing Foundation in 2020 as an incubating project. As you may know, the Cloud Native Computing Foundation supports and helps open source projects grow, ok? Right now Argo has graduated status. Graduated status means the project is considered stable and production-ready, it has thousands of contributors, and it is designed to be used in production.

Well, Argo CD is probably the best-known project in the Argo community. Basically, Argo CD is a tool that supports continuous delivery strategies based on GitOps.
In GitOps you have a code repository, and the idea is that Argo CD is a controller that manages the configuration: it obtains the configuration from these repositories and then uses it to configure Kubernetes clusters — Kubernetes, OpenShift, you know, different clusters. That's more or less the idea of Argo CD. Of course you have different features: you are able to read this configuration, you can detect config drift, you can perform rollbacks, you can implement quick and easy recovery plans — you have different features with Argo CD.

But Argo is not only Argo CD. In the Argo project you have a lot of projects. You have Argo Workflows, for example, which is the continuous integration engine. You have Argo Rollouts, which we will review in the next slide — an advanced Kubernetes deployment strategies tool. And Argo Events, ok, which is for implementing event-driven architectures.

Well, regarding Argo Rollouts: it is a Kubernetes controller that manages advanced deployment strategies, ok — blue-green and canary, of course — and, the most important thing, everything automated. That's the main reason for it. As mentioned before, progressive delivery strategies involve a lot of tasks, no? For example, you need to deploy the new version, you need to manage the traffic, you need to observe and analyze metrics in order to understand the current behavior — you have to do a lot of tasks. Argo Rollouts helps you perform all of these tasks, ok, thanks to many integrations — with Kubernetes, for sure, and with third-party solutions: Prometheus, for example, to analyze the behavior of the applications, Istio for traffic management, NGINX... you are able to integrate with third-party solutions in order to perform all of these tasks.

Regarding the architecture, Argo Rollouts is a single controller, written in Go, and the idea of this controller is to manage some custom resource definitions, CRDs. Regarding the workflow, you have two important objects or CRDs. You have the Rollout object: this object includes all the information a regular Kubernetes Deployment has, plus an extra field to define the specific strategy, ok — to configure the deployment strategy. When you create a Rollout, the Argo Rollouts controller takes the information from the CRD, creates a set of ReplicaSets, and manages the Kubernetes Services with this information, ok. We will review this procedure with David. And then, on the other hand, you have the AnalysisTemplate, ok, where you define the analysis strategy: how to get the different metrics and make sure the application is working fine, in order to decide whether to promote the new version of your application or not, ok. And that's all from my side. David?

Ok, demo time — great. But you will have to wait a little bit more, ok. Let me introduce the demo that we are going to execute. We are going to execute a Kubernetes deployment — you already know what that is, as Asier has already explained, so I will go on. We have developed a very simple application: a frontend application built with React that calls a backend application built with Quarkus — it's just an API that answers with the version of the application.
So what we are going to do: we are going to deploy a new backend version. We will start with version 1, we will deploy version 2, and we will see how the version changes. Ok.

In order to deploy this demo, for the backend, instead of using a Kubernetes Deployment we are using a Rollout kind, ok, also with its API version. The Rollout kind is almost the same as a Deployment; the only difference is that we have to add this strategy part, ok. So we have the same as a Deployment, but adding this strategy part. For this demo we have decided to use a canary deployment, ok. Here we also have the analysis template, which we will talk about later. We also have traffic routing. For this demo we are going to use Istio, but Argo Rollouts is able to execute a canary deployment without traffic management, ok — Istio is not necessary. But using Istio, or another traffic management solution, is the only way to achieve the exact amount of traffic that we want at each step of the rollout, ok.

So, for example, for this demo we have defined three steps, ok. In the first step we will send 10% of the traffic to the new version, for 60 seconds. Sorry — let me do it, let me do it. Spoiler, spoiler, spoiler. Ok, here we are. First step: 10% of traffic. Second step: 50% of traffic for 60 seconds. Third step: 75% of traffic for 60 seconds. And finally we will have 100% of traffic on our new version. Ok.

Then, the analysis template. Our backend application is sending metrics to Prometheus. What we are going to do with our analysis template is collect them with a specific query and define a condition. If the condition is a success, the rollout will continue. If we have errors — if we have bugs in our new version — the condition will fail, and Argo Rollouts will automatically do the rollback: it will automatically send 100% of the traffic back to the old version, ok. That's very, very important, because this allows us to execute deployments more safely.

Ok, this is the architecture for the demo, a bit more organized. We have our frontend calling our backend inside the service mesh, and we will start with version 1 with 4 pods, ok. Then, when I deploy the new version, Argo Rollouts will do two things, mainly, ok. First, it will create a new ReplicaSet, but with only one pod. Why? Because I have said: first step, 10% of traffic to the new version. So Argo Rollouts makes its best effort to achieve this percentage of traffic. But because we have the service mesh and the virtual service, Argo Rollouts will do a second thing: it will also change the virtual service, setting 10% of traffic to the new version and 90% to the old version. We will see that in the demo.

So, I think it's more or less clear. Let's go with the real demo. Let me show you. First of all, we have already deployed our application using, of course, the Argo CD GitOps model. Here we can see the Deployment for the frontend. We can see our Rollout, instead of a Deployment, for the backend, with a ReplicaSet that has 4 pods. We have our analysis template here, and all the Istio objects — the DestinationRule and the virtual services. Great. Then, this is our frontend application, and we can see that it's answering with version 1. So what I'm going to do is set this frontend application to execute one query against the backend every second. We can see that it's changing: the time is changing, and the answer is always version 1. Great. We can also see that 100% of the traffic is going to version 1, and this is the Argo Rollouts UI.
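For reference, a Rollout along the lines David describes would look roughly like this — a hedged sketch, not the actual demo manifest; the names, labels and image are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: backend
spec:
  replicas: 4
  selector:
    matchLabels:
      app: backend
  template:                            # same pod template a Deployment would have
    metadata:
      labels:
        app: backend
    spec:
      containers:
      - name: backend
        image: quay.io/example/backend:v1    # hypothetical image
        ports:
        - containerPort: 8080
  strategy:                            # the extra part compared to a Deployment
    canary:
      canaryService: backend-canary    # Service pointed at the new ReplicaSet
      stableService: backend-stable    # Service pointed at the current one
      trafficRouting:
        istio:
          virtualService:
            name: backend              # VirtualService Argo Rollouts rewrites
            routes:
            - primary
      analysis:                        # background analysis for the whole rollout
        templates:
        - templateName: backend-success-rate
      steps:
      - setWeight: 10
      - pause: {duration: 60s}
      - setWeight: 50
      - pause: {duration: 60s}
      - setWeight: 75
      - pause: {duration: 60s}
```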
Here we can see that the stable version is version 1, we are in the first revision, and we have 4 pods for the stable version. So, let's make the change. I'm going to deploy... sorry, not this. I'm going to change my Helm value to deploy version 2. Great. I'm going to make the commit and the push. And, just to make it faster, I will refresh Argo CD manually, and we will see the magic.

Argo Rollouts has seen that there is a new version and has created, as I told you, a new ReplicaSet with only one pod. It has also changed the virtual service: if we look at the differences, we had 100% to stable, but Argo Rollouts has changed it to say 90% of traffic to stable, 10% of traffic to canary. I didn't do anything — it's Argo Rollouts. If we watch the UI, from time to time version 2 should appear, because we are sending only 10% of the traffic to the new version, version 2. And also in Kiali — Kiali has a little bit of delay — we can see the traffic that is going to version 2 right now. And soon, after the 60 seconds, Argo Rollouts will continue with the rollout, and I have to do nothing else. It's the easiest demo I've ever done, because Argo Rollouts will do everything for me. I can sit down here and watch.

Argo Rollouts is now in step 2. It has scaled up the new ReplicaSet, and it is also changing the virtual service — we can see here 50/50 — and we can see in the UI that version 2 appears more frequently now.

Also important: the analysis run. There is an analysis run that starts at the beginning — I forgot to tell you — and it will be there for the whole rollout. It is getting metrics and comparing them against the condition, and if everything goes well, it will allow the rollout to continue. If the metrics are not a success, Argo Rollouts will automatically do the rollback. And that's all I have to do.

After 60 seconds we'll see Argo Rollouts go to step 3. Remember that step 3 is 75% of traffic to the new version and 25% to the old one — here it is. I'm not doing anything. Easiest demo. So here we have three replicas. Also very important: Argo Rollouts, as you can see, is deleting pods from the old version, because it knows the amount of traffic. With the weights we have defined — we are now in step 3, so it is 25% to the old version — it sets only one pod for the old version. So it is doing the scale up and scale down automatically too. And again, it is changing the virtual service to achieve the exact amount of traffic that we want on each version. We can also see in Kiali that the traffic to version 2 is going up — we are already at 80% of traffic on version 2.
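As an aside, the VirtualService that Argo Rollouts keeps rewriting during these steps would look roughly like this at the 50/50 point — again a sketch; the hosts and route name follow the Rollout sketch above and are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: backend
spec:
  hosts:
  - backend
  http:
  - name: primary              # the route named in the Rollout's trafficRouting
    route:
    - destination:
        host: backend-stable
      weight: 50               # Argo Rollouts keeps rewriting these two weights
    - destination:
        host: backend-canary
      weight: 50
```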
And after another 60 seconds, we will see how Argo Rollouts automatically creates the fourth pod — here it is: the last pod. It will send 100% of traffic to the new version, and Argo Rollouts will also delete all the pods of the old version. Now, as you can see, we don't have differences in the virtual service, because Argo Rollouts has set version 2 as stable. Remember, at the beginning we had version 1 as stable; now the rollout is finished and we have version 2 as stable. Also, the analysis run has finished successfully — because we were lucky and everything worked — and in the virtual service there are no differences, because we are sending 100% of the traffic to the stable version. Soon we will see in the Kiali UI that 100% of traffic is going to version 2. That's all.

Let me review a couple more things with you, and then we will go to the Q&A. Ok, the GitOps roadmap. Version 1.9 of OpenShift GitOps was released a few days ago, and in that version we have Argo Rollouts as tech preview. Ok, so, great: we can start playing with it more safely, we can start talking with our clients about it — but take into account that this is still tech preview. But it is there already: it means that you are able to install Argo Rollouts with OpenShift GitOps.

Last but not least: Saturday at 15:30 we have a workshop about Argo Rollouts. It's very similar to the demo that we have done already, but we will deploy cloud native applications using Argo CD and Helm, we will use the GitOps model, and we will have three exercises. One is a canary deployment without a service mesh — you can execute a canary deployment with Argo Rollouts, and it is not necessary to have a service mesh or traffic management — and in the third exercise we will use Argo Rollouts, the service mesh and everything together. So if you want to enjoy and play a little bit with Argo Rollouts, come next Saturday. And that's all. I think we are on time for the questions, so please don't hesitate if you have any questions. I think we have to switch — we are online. We have questions online, no?
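The Q&A that follows comes back to the analysis template, so here is roughly what one of the kind used in the demo looks like — a hedged sketch: the Prometheus address, metric names and threshold are assumptions, not the demo's actual query:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: backend-success-rate
spec:
  metrics:
  - name: success-rate
    interval: 30s
    successCondition: result[0] >= 0.95    # the condition the rollout checks
    failureLimit: 1                        # a failed check triggers the rollback
    provider:
      prometheus:
        address: http://prometheus.example:9090    # hypothetical endpoint
        query: |
          sum(rate(http_requests_total{app="backend",status!~"5.."}[1m]))
          /
          sum(rate(http_requests_total{app="backend"}[1m]))
```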
Ok, perfect. I'm not sure if I got your question... Ok, so the question is: if the application doesn't behave well, how do we notice it in an automated way? You have different integrations — you have the integration with Prometheus, and we do it with a query, a specific query. You have different ways. Here we have the query that we want to execute against Prometheus, and — the most important thing — you have the provider here; this one is Prometheus, but you have different providers, and then you define the success condition based on the metrics. Yeah, right now in this example we are playing with Prometheus, but we have different options. There is another provider, the web provider, where you call an API and, based on the answer, you decide what to do. So, I see there is a question — and another one.

Sorry, I will answer the first one. I think he is asking about this: we can see here a password, and this is something that has to be improved. Right now there is no way to tell Argo Rollouts to get this password from a Secret, so it's something that still has to be improved. Right now this is the only way.

Another question: is there multi-cluster support? So — the question: Argo manages multi-cloud environments perfectly well, so in the case where we also have multi-cluster environments, and we use an analysis template with an external Prometheus cluster, how should we solve that? First, he is asking about multi-cluster. Argo Rollouts right now does not support multi-cluster, so if you want that, you have to go to each cluster and install the Argo Rollouts controller. Imagine that both controllers, both applications, have the same Git repository: Argo CD plus Argo Rollouts will pick up those changes, but the rollouts will be independent. You can point to a Prometheus outside — we still have the problem of the password — but each rollout will proceed independently, so it could happen, for some reason, that one rollout finishes and the other does not, even if the metrics are the same. This is my guess: I think they will make it work for multi-cluster. It also ships in the same operator as Argo CD, which is multi-cluster, so I guess that sooner or later it will also be multi-cluster — but right now it's not.

Remember that the Rollout kind is the same as the Deployment, so you have your replicas — 4 for the demo — and Argo Rollouts plays with this number based on the weight. It makes its best effort based on the number of replicas that you have defined and the weight that you have defined: 50/50 here, with 4 replicas, is 2 and 2, ok? This is how it works without traffic management: when you don't have traffic management, Argo Rollouts makes its best effort with the number of replicas. So if you have 4 replicas and you want to achieve 10% of traffic, the only way to really achieve that 10% is traffic management, with Istio or whatever. And there are a lot of configuration properties where you can say more replicas, or how long the old replicas stay around for a rollback — there are many things that we don't have time to review today. If you want, check the Argo Rollouts page in the Argo project; you will find a lot of information. And come to the workshop. We are out of time — thank you very much, everybody, for coming.
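For the record, the web provider mentioned in that answer looks roughly like this — a sketch; the URL and the shape of the JSON answer are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: backend-web-check
spec:
  metrics:
  - name: web-check
    provider:
      web:
        url: http://checks.example/api/v1/backend    # hypothetical endpoint
        jsonPath: "{$.status}"     # pick one field out of the JSON answer
    successCondition: result == 'ok'    # decide based on the answer
```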
Is that the mic I'm supposed to use? Ok. So, no streaming — sorry for the folks online, you don't have the slides on the streaming, sorry about that, blame Max. So, hello, my name is Christophe de Dinechin — can you hear me ok at that level? My name is Christophe de Dinechin; try to say that three times fast. I'm working for Red Hat on various things, including confidential computing, and today I'm going to talk about chains of trust in confidential computing: how you build them, and knowing and verifying what you run.

The things we are going to talk about today include a quick overview of confidential computing, what attestation is, and various use cases for confidential computing. Who knows what the picture on the right is? What, only one hand? Are you kidding? It's a hike: going from root of trust to actual trust. Then various platform-specific details and supporting technologies — and this one is harder to find as a picture; any idea? Anyone who was not at KVM Forum? — these are the foundations for all of the above. I'm going to talk relatively fast and brush over a number of topics, because this is really a summary of a blog: there is a series on the Red Hat blog, so you can scan that QR code, or look it up later on the website, and find the links with more details.

So, what is confidential computing? Confidential computing is mostly about protecting data in use. There is this quote from a guy named Kevin Mitnick — maybe you have heard about him — saying: I compromised the confidentiality of their proprietary software to advance my agenda of becoming the best at breaking through the lock. Hackers typically try to do that, but the problem with infrastructure, as we shall see, is that your infrastructure today is where you run your stuff — we saw that in the keynote today. But why should your infrastructure see your data? Your software now runs on someone else's computer, also known as the cloud. So you have some sort of virtual machine host, for instance, and it has various resources it is going to provide for you — networking and so on — and these resources are used to run various workloads; typically, that could be containers that run inside your host.

Now, there are various sandboxing technologies, and they are essentially designed to make sure that your containers cannot escape to the host, cannot damage it, cannot do whatever they want with the resources. They are not really designed for the other way around: for someone on the host peeking inside your workload, inside your container. And that means that if two competitors want to run on the same machine, they will typically be unhappy, because they are not sure that some rogue admin was not paid by the other guy to peek into their data. We have some technologies that have been established for a long time for this — disk encryption, network encryption, you are familiar with those. What is really missing is that what is in your memory is essentially an open secret to a host admin: they can read it. It's usually secret because it's in memory, but it's really, again, an open secret. So what if we added some memory encryption to protect against that? The memory encryption doesn't need to be super strong, but it has to be something that happens all the time, for all the data that goes to memory. Then, later, some technologies were added to protect the integrity of the CPU state, because if you run that in a virtual machine, you don't want anyone to be able to change the register state of your virtual machine, to jump anywhere in the code or
stuff like that. And finally — that's going to be most of the talk — you need to make sure that what you run is exactly what you want to run. Attestation is there to prove that what you are running is running in the right context, in the right environment.

So first, let's start with roots of trust. You're probably familiar with something like a TPM, for instance: that's a root of trust that measures and launches the next step, and then again and again, until you reach, typically, your workload. Each step, as it goes, measures the next one and records that in some physical device that is going to keep a hardware-enforced record of what happens.

This leads to the idea of trust domains. In the case of confidential containers, which is illustrated here — ok, a lot of stuff is misaligned there; what is the resolution of this screen? sorry — you have a number of security domains that you need to consider. The first is the trusted platform, which offers a number of security guarantees that are essentially enforced by hardware cryptography. The platform will just give you these guarantees, but it doesn't really know what data is in there. Then you have the host, which provides resources — physical memory, devices and so on — and again, the idea is that it just provides the resources but doesn't have access to what's inside, because the data is encrypted all the way. And finally you have the tenant, or owner, which is all the stuff that you care about — and it's not just running on this piece of infrastructure. You see on the right something called the relying parties: that could be something on premise for you, or something in a place you trust, maybe another enclave, and that's where you're going to do things like verifying what you are running in your trusted system.

So what kind of guarantees does confidential computing really provide? Well, the thing that we really care about, as the name implies, is confidentiality. What this means is that we will protect data in use from leaks, from tampering, from things like that. However, we will not protect against crashes — as a matter of fact, we might make them more frequent, because there are cases where we will just say: stop, we won't go there. We do not protect disk or network data — that's really your role. There is no guarantee of service, and again, there is an increased risk of not having any kind of forward progress. And finally, it's all hardware-based cryptography, typically, for memory and so on — so with sufficient effort, if you have NSA-level access to the system, then maybe you can actually decrypt stuff. More importantly, it's really highly implementation-dependent: what you see on the right is a chart that is just for AMD, and different generations will give you different levels of protection. So the bottom line is: you don't get automatic security out of this. Right, and the next question is: how fast was this car driving?

So what is attestation? It's essentially proving what you run: that you run exactly what you want to run, and where you want to run it — specifically, in a confidential environment. Let's start with a little bit of terminology, with the RATS model that the IETF has established for us. You have a component called the verifier; the verifier is really in charge of checking your policies. This starts with an
endorsement process, where an endorser is going to say, for instance: this hardware — I put some trust in this hardware, for this or that reason, for instance because I built it. Then you have reference value providers that are going to provide reference values to the verifier — for instance: this is the list of hashes for the software that I accept to run. So I've measured my software ahead of time, I know this software can run, and the reference value provider hosts that. And finally, the verifier owner is going to put appraisal policies into the verifier that say: this evidence is accepted, this evidence is not accepted. For instance, while we develop confidential containers, we have an appraisal policy which is: anything goes — because we are developing it, so we essentially accept whatever workload runs. And then you have an attester that will try to prove who they are: it sends some evidence to the verifier, that goes through these various steps, and then the result of this attestation process goes to a relying party, where it can be used — for instance, based on other appraisal policies, this time set up by the relying party owner — to decide, for instance, to release some secrets, or do something like that.

So the basic concept, again, is that you are offering some proof about the configuration of a system. Generally speaking, attestation is really proving some kind of property. In our case, what interests us is: is this system actually running with encrypted memory on, with the right firmware, with this kind of properties? For those of you who saw Vitaly's talk just before this one, he was mentioning some of the properties that are verified.

Now, one important kind of attestation is remote attestation, and that's when you decouple the evidence from its verification. You all know that when you have a lock, and you give the key to, let's say, your girlfriend, and then you no longer trust the girlfriend, you change the lock — and you've decoupled the two. Just a random example. There are two big models for verifying this evidence. The first one is the passport model, where you present the evidence exactly as you present a passport at the airport: the attester sends its evidence — who am I, etc. — to the verifier, the verifier then issues some kind of ID, and you present this ID to the relying party. In a cloud, typically, that would be some sort of secret internal to the cloud that you present each time you want to use an API in the cloud. The other model is the background check model — closer to an actual background check — where the attester presents the evidence directly to the relying party, the relying party presents a variation of this evidence to the verifier, and the verifier says: yeah, this guy can go through.

So, in order to compare various ways to analyze evidence and do attestation, I suggest we use a relatively simple pipeline that I call REMITS, just to model the chain of trust. The R stands for root of trust: that's typically certificates or hardware components. The E is for endorsement: typically you have a signing key that is issued by, and validated through, a certificate. Then the M is for measurements: in the case of a TPM, that's the hashes of the data that you're looking at — so you decide to run some kind of boot loader, you measure this boot loader, you hash it, and that's what is recorded in your TPM. I is for identity:
that's, for instance, a reference value that you would pass to your verifier. T is for trust: all the aspects related to policies — I decide to accept this evidence or not. And ultimately, what you get out of that, typically, is secrets: this could be passwords, this could be decryption keys, and so on.

So let's see a couple of examples to understand how this works. If you look at secure boot: the root of trust is the TPM, the endorser is the manufacturer of the device, what you measure would typically be something like a firmware or boot loader or stuff like that, what identifies your device is some kind of signed attestation, and whether you trust that or not is defined by the policy that you have in your system — you can decide to boot without secure boot at all if you want. And typically the secrets you get out of something like this — Vitaly was explaining how, in his case, you would get a disk encryption key that you can only get out of the TPM at this specific step — but you can get cloud API secrets and so on.

Now, to get to something maybe a little more familiar: when you go through selling a property, you have these same steps, except it's a notary that has signed records, and there is a deed that describes what you are talking about. You get a property description, the trust policy is: do I hand over money based on what I know, or not — and the secret is: you get the keys of the house. And it's really the historical basis for money as well, in the sense that gold or silver is really the root of trust — at least initially, that was supposed to be what the bank notes represented. The entity endorsing this root of trust is a government that has some of this gold somewhere in its banks, the market value is what you measure, the identity is the number you see, for instance, on the bank notes, and handing over cash is how you accept the policy or not — you know, I want to buy this; well then, I hand over cash and I'm happy with that. And of course, the secret you get in that case is the secret recipe of grandma that tastes so good. That's the REMITS view for money.

So attestation, again, in our case, will be what you measure, and you do that using cryptography — the same three domains we had before. Now, there are multiple ways to do that. For instance, AMD with SEV started with something called pre-attestation, and what pre-attestation does is essentially measure some specific components: you measure your payload before you even start running it. So the hypervisor, you know, does some operations, and then you decide: I launch that, or I do not. I'm going to have a talk later today where I show you in practice how this works. Post-attestation is slightly smarter; you can automate things better, because you can essentially measure from the guest itself, and get the measurements for your own identity that you can then transmit over the network — for instance to the relying party or whatever. And ultimately, you probably care most about the workload, so at some point there will be some kind of workload attestation — though very often the mechanisms for that are not very different from what you use today, like: having this specific container hash is what I trust, etc.

Now, confidential computing is a relatively large field, and there are many ways to deploy it and use it — various use cases, essentially going from virtual machines to complete clusters. The base technology is, in general,
today, based on virtual machines. There are some confidential computing technologies, like SGX, that work at the process level, but the ones that really interest us are based on virtual machines. From that you can build functions that boot very fast, etc., and the best example of this today is something called krun, for confidential workloads: you essentially boot something very fast, you do one thing, and then you exit. Now, if you want to orchestrate at a larger level, then you need to integrate that within an ecosystem like Kubernetes, OpenShift, etc., and that's the purpose of something like confidential containers. In that case you will use confidential virtual machines as a runtime for your containers, and you'll be able to deploy your containers the usual way and do all the scaling and all the things you're used to. And the whole enchilada is if you decide to put the whole cluster inside confidential virtual machines: in that case even the control plane itself is in confidential virtual machines — and you have to be very rich to do that.

So the base technology behind it all is the confidential virtual machine. If you're curious about the picture on the top right, that's what you get from DuckDuckGo — at least that's what I got last week — when you search for "confidential virtual machine". I have absolutely no idea what this is; it's confidential even to me. First, you need new hardware and firmware binary interfaces that expose the new features; this is highly hardware-dependent at the moment. The host kernel is no longer trusted; however, it has to expose these new features, so you trust it to the extent that you can access something, but the real trust will be in the cryptographic operations that happen. Same thing with the hypervisor: the hypervisor needs to expose new features — it needs to know new ways to do I/O, new ways to expose measurements and so on — but it is not trusted, even if it does all that. The boundary of trust is inside the VM: the VM becomes the confidential enclave that you are starting to trust, and inside it you have a guest firmware and a boot sequence that is typically measured, as well as a guest kernel — same thing, you want to measure that and make sure that you know what you are running.

I mentioned confidential workloads for things like very simple functions. Here is krun in action — in that case, on this specific laptop, it's not the confidential version that you are seeing, but this is a real-time recording. The way this works is essentially that a VM is exposed as a library to the host, through a project called libkrun, and then you have a direct integration with podman — so if you are familiar with podman, you can download images the usual way and so on. It runs relatively well, and that project got very early support for SEV compared to the rest of the projects I am talking about here; in particular, they were the first to do working remote attestation. It's interesting: confidential containers as a project had defined a protocol, and we were not the first ones to implement the protocol — the krun project was.

So for confidential containers, in our use cases, we essentially use Kata as a basis. Kata was already running workloads and containers as essentially a virtual machine per pod, and the Kata runtime forwards the creation of containers, etc., so that it happens inside the virtual machine. So there was already some level of isolation from Kata itself, and confidential containers builds on that to use the confidential computing technologies.
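From the user's side, the Kata-based approach mostly surfaces as a runtime class on the pod. A minimal sketch, assuming a handler name the confidential containers tooling might install — real deployments add more on top, such as encrypted and signed images:

```yaml
# RuntimeClass exposing the Kata-based confidential runtime.
# The handler name varies per deployment; "kata-qemu-sev" is an assumption here.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-qemu-sev
handler: kata-qemu-sev
---
apiVersion: v1
kind: Pod
metadata:
  name: confidential-pod
spec:
  runtimeClassName: kata-qemu-sev    # run this pod inside a confidential VM
  containers:
  - name: app
    image: quay.io/example/app:latest    # hypothetical image
```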
These are the components that are impacted, essentially. The next step up, as I mentioned, is confidential clusters. One example of this is a project called Constellation, by Edgeless. In that case I cannot show it in real time — the real thing takes something like 40 minutes to bring up a cluster — but then you have a cluster that is completely confidential, including the control plane. You say: I want to start with, for instance, 3 worker nodes and 2 control nodes, and you get 5 confidential VMs that have authenticated each other; after that, it works mostly like a standard Kubernetes cluster. So you make the whole cluster confidential, and it works at the cloud provider level — you start your session by saying I am on Azure, I am on whatever. You can see what it looks like, and if you are familiar with Kubernetes you will probably recognize a few things here: what you see here is going inside a container and checking that SEV is active there.

One interesting aspect of this project is that they rely on something called attested TLS, which is essentially a way of attesting the other side of a TLS transaction. That's interesting because it lets you build things that span more than one container. They also added something called a join service, which is a way, when a new node wants to join the cluster, to make sure that the new node itself is confidential. And they have, at the user level, a verification service that is user-facing and lets a user check: before I deploy my workloads there, I want to make sure that the cluster is known to be confidential.

So how do we build actual trust, and keep the trust alive along the way? How does attestation really work? You start, as I said, by doing some cryptographic measurement — for instance of your confidential VM, or some enclave that you care about. This is done by hardware or firmware, in a way that cannot be tampered with; that's the important part. Once this is done, you get some sort of reduced version — some ID, some condensed version or hash of that measurement — that you can send to the attestation service, and that can then be compared with whatever is in your database of identities. If the attestation service is happy with it, it will typically request something from a key broker service or something like that, and send you back the keys — that can be a decryption key, that can be whatever.

Now, of course, the interesting point is that you can say no — and you can say no to something you said yes to before. That's another important aspect of remote attestation: let's say you discover that something has a flaw in it and can expose data; then you can decide to exclude it from your database, and it will no longer boot, even though it was accepted before.

The actual flow is slightly more complicated than what I just showed, because the attester first does a request, but to avoid replay attacks and things like that, the relying party typically responds with a cryptographic challenge, with a nonce in it and so on. You have to encrypt with that nonce, which prevents responding by replaying something, or stale data. Then you present your evidence, encrypted with that nonce in it, the evidence is relayed to the verifier, and the attestation result comes back in return. If that passes, the secrets are retrieved from the secret broker and sent back to the attester. So we have the same REMITS flow there. In terms of how this happens in practice,
I've tried to make it general, because the names of these various components are different between AMD and Intel, but roughly it's the same process.

Another interesting aspect of attestation is that different kinds of proof are needed for different kinds of consumers. What I mean by that is that, for instance, when you're booting the system, the thing you care about is: my firmware and Linux kernel — are they the ones I want, are they not compromised, are they versions that I trust? In that case it's system-facing: you're trying to build a trusted execution environment and make sure that it is trusted. User-facing is: you want to prove to some user of the system — hey, is that system actually safe? — and of course you have to have a way for the user to verify that, so typically that relies on some endorser publishing a public key somewhere, and you can validate with the public key that what you got was actually emitted by that endorser. We all know how this ends: some day the private key is leaked and the whole process is invalidated, but at the moment this is safe for the existing technologies. Another kind of attestation is workload-facing: checking whether the runtime environment of a specific workload is valid. Peer-facing is two workloads that want to check one another, to make sure they are not talking to the wrong guy and leaking through the other side. And cluster-facing is essentially nodes that want to join a cluster: before I admit that node into the cluster, I want to check that it's actually running what I expect.

Now, there are plenty of platform-specific details that I will really skim over. The vendor landscape today for confidential computing looks roughly like this. You have AMD Secure Encrypted Virtualization, which is sort of the first implementation of this that was really widely available. It relies on a separate processor — essentially a small ARM core on the side — which stores all the really important stuff, and you have to go through that processor to do things like getting a new key, getting encryption activated and so on. That's AMD's approach: having this separate core on the side that does this kind of thing. There are two later iterations: SEV-ES stands for Encrypted State, and adds encryption of the CPU register file, which otherwise, in plain SEV, the hypervisor could modify almost at will; and SEV-SNP — Secure Nested Paging — is a larger change that adds some integrity protection and things like protected interrupts, etc.

The Intel equivalent is called Trust Domain Extensions, or TDX, and it's a very different approach: it's essentially a new mode in the processor called SEAM, Secure Arbitration Mode, and the processor goes into that mode — where only Intel stuff runs — when it needs to do something complicated. IBM does it completely differently, with something called Secure Execution. POWER has something called the Protected Execution Facility, which is a little closer to the Intel than to the AMD implementation. And Arm has something called the Confidential Compute Architecture, which, like Intel's, is based on having a layer below that the operating system cannot touch. What these technologies share is that they are all based on virtualization — but they all work differently.

So, AMD was the first one, and the first iteration was somewhat flawed. There was memory encryption through hardware, which was good, and it was
built on top of virtualization — unlike that other memory encryption technology called SME. As I said, it relies on a separate security processor and only features pre-attestation. The problem is that several vulnerabilities were discovered relatively quickly, which gave it a bad reputation that persists to this day. The mop-up crew were SEV-ES and SEV-SNP. ES, as I said, protects the CPU state from tampering, with no major impact on attestation. SNP adds protections against physical access to pages, against the hypervisor remapping pages and so on — but more importantly, you can get attestation data from within the guest. Another interesting thing is something they call VMPL, VM Privilege Levels, which lets you build enclaves that are more privileged than others, to implement things like virtual TPMs, etc.

Intel TDX relies on SGX to create secure enclaves where they do a lot of the stuff. It is virtualization-based, but there is no separate security processor: essentially, you use the Secure Arbitration Mode to invoke services that are provided by SGX enclaves on the side. Various binary modules provided by Intel expose the required services, and they have to be measured at boot and so on. In particular, attestation is provided by a quoting enclave.

In terms of supporting technologies: well, you need host and guest Linux kernel support, hypervisor support, guest firmware support, host provisioning and supporting tools — like, for instance, sevctl for setting up a system; I'm going to show that later today — generic key broker services, and attestation compatibility layers that try to mimic what exists, so you can reuse the infrastructure that exists — like, for instance, a virtual TPM that mimics real TPMs. This is provided by something called the Secure VM Service Module, or SVSM. The blog details all of this — that's why I'm really skimming through; go to the blog and you'll have pointers to the various documents and references, etc.

So my conclusion is essentially that attestation means different things; we really only scratched the surface. It's a large collection of technologies, and even in a given context, attestation can mean different things, so folks tend to talk past one another — you have to be careful about what you're talking about. Preserving the chain of trust correctly requires really careful thinking — and again, if you see my talk later today, you'll see a big surprise there about what can go wrong. Technologies are not consistent across the board: Intel and AMD do not think the same way. Again, you can take a picture of that if you want more details from the six-part blog series. Now it's time for questions — sorry, I was a bit longer than anticipated, so only two minutes.

Yeah — I'll try to summarize the questions for the people online, and you correct me if I summarize wrongly. So the question is really about what kind of hardware provides this, how can I get that hardware, and can I go without it — is there an alternative I can do in software? For the first question: these are typically new instances that are being deployed as we speak. For instance, there are new SEV-SNP instances on Azure that you can order today, where they will have this pre-configured for you, and in that case you go through the attestation mechanisms provided by Azure. Regarding whether you can implement this in software: you can mimic some of it in software for development purposes; however, what software cannot do for you is protect against someone on the same machine being able to peek at the memory. And I'm going to show examples — the
other talk is at, I think, 5:15 or something like that, and I'm going to show you how you can actually dump the memory of a VM, see what's inside, and search for root passwords and stuff like that — and you find them. So the question is whether I could encrypt the memory in flight. Encryption, by definition, at some point has data in the clear, and that data today has to be stored in memory — unless your whole message fits in registers, and let's say you have 16 registers on 64-bit x86; that's not a very long message. And, yeah, I'm out of time, but the short answer is: I don't think so. I don't think you can do it safely, except for very, very short messages.

I came from Australia, by the way — it's a very, very long trip, so I'm super exhausted — but let's get started. Because we don't have many people here today, we can do this in an interactive mode: feel free to ask questions as we go, instead of questions at the end, or we can do a combination of that. Is that clear? Cool, excellent, let's get started.

So, my name is Dan Teller, I'm a security engineer at Red Hat Australia, and I've been doing information security stuff for about a decade: IR, DevSecOps, development of some cool stuff, as well as ethical hacking on some of Australia's largest companies. Currently at Red Hat we are part of the information security department, focusing on internal security — so this is internal security, not the customer-facing side. Within that team — it's a very large team, globally — we are managing security vulnerabilities, and this is a new team we have established: we have new processes, new technologies, as well as new challenges, and among those challenges is dealing with a lot of information at once. We'll show you how we managed to overcome some of them by creating a single automation script, which then evolved into an asset mapping tool. And once we had these two, we decided: let's take it to the next step and create an entire hybrid cloud security solution, similar to an attack surface mapping tool. That is our vision for what we have now.

Our vision as a team was to get assets at Red Hat — systems, servers and virtual machines — into our security tools, so we can see vulnerabilities: specifically, operating system and third-party software vulnerabilities, things that are installed on the systems, from an infrastructure perspective. Well, at Red Hat, as at other large companies, finding the owners of systems is a big challenge. Companies have CMDBs, but often it's very hard to get the exact data you need, getting in touch with teams is a big deal by itself, and the assets themselves just keep changing all the time, right? So we were looking for alternatives — options to get that mapping sorted. We started looking at different solutions that exist — open source, commercial tools — and they just required a lot of customization to get the exact mapping of owners and assets at Red Hat that we needed, and some that did sort of fit just didn't have the exact features we needed and still required a lot of work. So that didn't really make sense at the time, and we decided: let's put some scripts together.

Ok, the agenda: today we're discussing what my team does from a higher-level perspective, we'll go through some demos, we'll have some screenshots of the asset mapping tool, and finally the vision of what we'd like to achieve — and then we can do some Q&A. Internally at Red Hat this is an ongoing process: it's a new process, a new team, where on a daily basis we handle a large amount of vulnerabilities from assets. The
main goal here is that we need to discover vulnerabilities affecting assets at Red Hat and tell the team: there's a serious thing, go and patch it quickly, before it poses a risk to the company. The main vulnerabilities we look at, as I mentioned before, are operating system and third-party vulnerabilities; on our roadmap we have looking at more container and application vulnerabilities, etc. A bit more on our high-level process for the team: once the data is loaded into the security tools, we have visibility on the vulnerabilities, we assess the data for the risks, and we engage the teams: go ahead and patch that thing, it's critical for the company.

So, with those challenges, one of the objectives, at least, was to get the assets into our security tools. The process initially looked like this: we had to contact the business owner, usually a manager at Red Hat; then he referred us to the technical owners, who usually knew where the assets were and their state at that time; we had to exchange more technical requirements and specs with the technical owners — tell them, hey, we need to install security tools to get visibility, etc.; then we exchanged more agreements, and finally we managed to onboard some of the assets into our security tools. It was a lot of manual work for us to achieve, and obviously in a company as large as Red Hat that's very time-consuming and not scalable: assets keep changing all the time, teams keep changing all the time. We really started with just a spreadsheet to track the information we had — who we contacted, how many assets they had — and we just put it in a spreadsheet like that and started coloring it. That's not very scalable; it was a lot of hard work for our team to get through all the teams at Red Hat. So there was definitely a need for automation, no doubt, and we started looking at how we could really script it.

What you see in front of you here is the design of the current system. On the right-hand side is where we are collecting data from different sources: we have different environments — AWS, Azure, different clouds, different infrastructure, you name it — pretty much everything Red Hat has. We have also been collecting information from internal services — naming a few there — and some third parties, such as security tools that we use at Red Hat. All that information is collected, in part, to a dedicated server here, where we have different containers running the business logic, or functions. The way we designed it, these containers act as plugins to scale the system: as Red Hat adds more services, we create more plugins — one plugin per service to pull that data — and then push it to the destinations here, marked in blue: Google Sheets and Splunk, at this stage. In addition, we also scan the external perimeter. So what we really have here is rules that match different IDs to find owners and their assets — because there are so many services at Red Hat, it's a big challenge to find that. The way we managed to get some of that mapping: there were cases with just IDs, very straightforward; in other cases we had to scrape names from tags and labels, or even combine multiple services, just to get a mapping of who owns what and where.
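The talk doesn't show the tool's internals, so purely to illustrate the plugin-per-service design, a configuration for it might look along these lines — entirely hypothetical; none of these keys come from the actual tool:

```yaml
# Hypothetical config for the asset mapping tool: each plugin pulls from one
# source and pushes to shared destinations (the blue boxes in the diagram).
plugins:
  - name: aws-assets
    source:
      type: aws
      accounts: ["111111111111", "222222222222"]   # made-up account IDs
      collect: [instances, tags, owners]
    schedule: "0 * * * *"                          # pull hourly
  - name: qualys-assets
    source:
      type: qualys
      collect: [registered_hosts, vulnerabilities]
destinations:
  - type: google-sheets
    sheet: asset-owner-mapping    # the owner/asset view the team works from
  - type: splunk
    index: asset-inventory
matching:
  # rules that join source records into "who owns what, and where"
  - join_on: aws.account_id == cmdb.account_id
  - fallback: scrape owner names from resource tags and labels
```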
I prepared a quick demo and gathered some screenshots; if I can run it... where is the mouse... here we go. In this example specifically, we are getting data from AWS and pushing it to Google Sheets, and here in the middle there is a tool called Qualys, a security tool, and we combine all that information to get a mapping. Here you can see us getting the account IDs and all the assets, and that is being pushed into Google Sheets. Let me pause it here. So what you see in front of you are the accounts and the assets from AWS; we can see which account has how many assets at a given time, and their state. With this information we started engaging the teams: we can see you have 20 assets, let's start getting them into our security tools. That was just the starting point. Then, with the information we already had, asset owners and their assets, we mapped that against our existing security tools, and here in this case we can see how many systems are registered in our tools compared to the number we had been told they have. In this example specifically, let's say there was an owner here who was telling us he's got 20 systems; in reality he's got more, and he also told us that all the systems had already been registered, but we saw some are missing. Now, why is it really important that no systems are missing from our tools? Well, any of these systems could potentially contain a serious security vulnerability, and if we don't have visibility in our tools, that's a big risk for Red Hat. But now we are able to see it and tell him: dude, no, not all systems are registered, go and fix these ones as well. So we managed to increase the coverage of systems in our tools. Further, with the same mapped information, we started scanning the external perimeter of Red Hat, and as you can see here, some systems we found to be exposed to the internet. In the white section here, the majority of rows, the systems are not exposed to the internet; in the blue part in the middle, we found systems which are exposed to the internet; and the two rows at the top highlighted in red are systems that are exposed to the internet but also connected to the internal networks of Red Hat, so that's a higher risk for the company, and we managed to find a few of these. Finally, with all that mapped information, we also had the actual security vulnerabilities from our tools, so we could see how many assets an owner has, their state, additional metadata, whether a system is exposed to the internet, and whether it contains a security vulnerability, all of that together in one view. What did this all give us? In this chart, covering a period of one year, you see the number of assets that we managed to register into our tools, month by month. In the first six months, with a lot of that manual work, we didn't achieve much, but once we had the tool in place you can see the amount of registered assets increasing. So this tool really gave us two main things: as mentioned, it increased the assets in our tools, and secondly it gave greater visibility and coverage of the risk profile for Red Hat. And finally, just before we wrap up, our vision is to expand the existing asset mapping tool into an attack surface solution. This is the interesting part, taking it to the next level: we want more visibility, we want to see exactly the weak points this tool can detect in the existing data. In this example specifically, once we have all this mapped data in our records, system ABC is found to be exposed to the internet: a developer has been doing some work on it, he opened the ports, and now that system is exposed.
We can also see that the system has vulnerability XYZ, in this case, and who owns it at that given time. Now that system we can automatically flag as a weakness, and that's the important thing: the system is exposed, with a vulnerability, and that's a high risk. This is what we would like to achieve, a system that automatically detects these kinds of weaknesses for the company. Alright, so just to conclude what you've seen today: the last slide was our vision, and we did manage to accomplish a few things here with a lot of hard work and data. We did manage to find, and are still in the process of finding, risks associated with internet-facing systems at Red Hat. Going further, we would like a simplified view to find these weaknesses end to end, the entire chain an attacker could exploit, and finally to reduce the overall risk for Red Hat and our customers. Well, thank you very much, that's the last slide for today. Any questions? OK, so the question is how we detect an internal IP for a system, what's the mechanism and logic behind it. There are so many systems, so many environments, and each one has its own case; it's case by case, there's no single solution for all systems at Red Hat. For several environments we managed to get the IPs from the actual network devices themselves, they contain them; there's another service at Red Hat which is a complete NAT solution; and in more modern environments such as AWS it's very straightforward, you can just get the IP. For some others it's simply a lot of work to scrape and find it, and then once you find the IP, you find the other values associated with it to get the complete mapping information. Good question. Any other questions? Alright, I hope you enjoyed the presentation.

Hello, happy to see you today, thank you for coming to our talk. My name is David, and co-presenting with me will be Erico. About me: I love free and open source software, I have worked with it throughout my life, since I was about 15 and started playing with Ubuntu, everything from GNOME to Linux kernel libraries and such, and I like it a lot. I will be giving this talk mostly because I love to optimize for performance, and in recent years also for simplicity, because having fast code isn't always the best thing if no one understands it. Let's go to the next slide, the introduction of this talk. First I will cover what software and hardware we are using, then I will go very quickly through level 1 and level 2, the basic usage of CI, the stuff you usually do if you have ever used CI; in level 3 I get to some interesting things like devices under test; and Erico will continue with level 4, which is running a farm of real hardware and taking care of it. Let's go then. So, what is Mesa3D CI? Mesa3D itself is the graphics drivers for Linux: you have one part of the driver inside the Linux kernel, and everything which takes care of rendering, OpenGL and Vulkan is in Mesa. Mesa3D CI is a solution built on GitLab CI, because we use GitLab, we have our own instance, the freedesktop GitLab, and it's built on that to test on real hardware. Before we start: my talk will be mostly focused on pre-merge testing. We do other testing as well, but what is important is that when a developer submits code to the project, we need to quickly test the code and ensure it doesn't break anything; on the other hand, we don't want to block other developers or their merge requests for too long, so we have to do it
in a reasonable time frame. Our goal is about 20 minutes for the testing, and of course making developers happy, because sad developers and broken testing, that never goes well. So, level one: we do the basic things. We build containers for a few distributions, we're using Debian, Fedora and Alpine right now; for the testing later we use only the Debian images, but we build the other distributions as well because, for example, Alpine has musl libc, so it's a different environment and we want to build on it, and Fedora, you know, because why not. For this we're using ci-templates, which help us prepare images and such, and are provided by the freedesktop GitLab infrastructure. On top of this we're using GCC and Clang, and we have around 20 build tests, just building different combinations with different options like LTO, and we use address sanitizers and memory sanitizers, basically everything we can run offline without hardware. For linting we use Rust linting and clang-format, and for our CI scripts we're using ShellCheck; we will get to why. Level 2 is testing without hardware. After we compile Mesa, we run some simple jobs: unit tests for the basic functionality of the library, and shader testing with shader-db, where we fake the GPUs, we say we have a GPU, and we test shader compilation on the backends, so we know the generated shaders are OK. Then we do runtime testing, and we can do that without hardware because we have a few drivers which run on the CPU, just emulated, so we can test Vulkan and OpenGL on the CPU. Thank you. So this is our small pipeline; some people have 8K monitors, so they can fit the whole pipeline on the screen, but you probably won't be able to read the individual jobs, don't worry. This is around 200 to 250 jobs, and not every job is even on this slide, I had to squeeze to fit it. This is kind of level 3: most of these jobs are for the devices. So what do we use? We have multiple solutions for testing, because many companies contribute to this, so we have many approaches integrated into GitLab CI. First we have the LAVA farms. LAVA is the Linaro Automated Validation Architecture; originally it was built by Linaro for ARM devices, to do their testing, and these days we're using it for AMD64 too, for everything. These farms have advantages: you have monitoring on top of them, you can set priorities for the jobs, so you have a lot of options for handling things, which we use in our CI. For example, we use priorities a lot: since we allow developers to manually run jobs and test if they need to, and since we have Marge driving the merge pipeline, the one that runs before the code gets in and needs to fit into 20 minutes, someone excessively testing their jobs on the CI would block the other people who want to merge, so the prioritization in these farms is very useful for us. Then, for example, we have the Valve farms, which use a slightly different approach, containers: they boot into a minimal interface and just load a container with the tests, and those tests are the same as we use on the other devices, just packed differently, in the container; we will talk about containers versus rootfs, which we use on other devices, very soon. Then we have bare-metal devices, which have no prioritization and such, and there are multiple farms: some people run these devices at
home without a farm, just for their own testing, and some companies run them bare-metal as well, Google for example, because they didn't want to use LAVA. Every farm handles things a little bit differently, so on every device you have to deal with a slightly different environment; when you write a test, or want reliable results, you always have to test against everything. What is good is that one test can run everywhere, the tests are shared, so if we write tests, they are the same everywhere. Sometimes devices have different kernels: for example, Raspberry Pi has custom kernels, you know, for reliability and everything, but on the other hand, if you want to enable some kernel feature that is very useful for our testing, you cannot do it, because those kernels are shipped and you cannot update them. For this part of the hardware we also use some smart logic: for example, if you push code into our repository and it only touches Intel code and not the shared parts, only the Intel code gets tested, so you don't excessively waste cycles, and you don't waste so much energy, power and time testing everything. We also have kill switches for the farms; we currently have 7 of them, so when a farm starts failing, because of the network, because it's out of space, because something breaks randomly, we can shut down that farm and continue working and testing without having to disable the whole CI. About the environment: a job will probably run a little bit differently on every device, so you need to test on almost every device when you're developing for it, and you have to take extra care with variables and such, because a variable might not be set in your test on that device; there is some extra complexity there. Containers or rootfs: we have two approaches. One is the container, where you just load the container on the device; Valve, for example, uses it, and the non-hardware jobs use it, and it's great because a developer can just download the image and run it locally on their computer without setting anything up; on the other hand, it's a little bit slower. For the LAVA farms and some bare-metal devices we use a rootfs, whose advantage is performance: you just unpack the rootfs on an NFS server, or send it to the device, and that's it, no overhead. Probably over time we'll move to containers, because for developers they are more useful. Every test suite is different, because we need to cover a lot of topics: every testing suite has different inputs and outputs, so you need to handle that somehow; some handle flags differently, some handle failures differently, the reports are different. For some things we wrap them in something that provides a uniform output; for some tests we adapt the test a bit, so we send patches upstream, or keep patches on the side, but we always try to keep the number of patches as small as we can, because we have to maintain them, and that's a huge pain. Let's switch to the next slide... nice. The most interesting part, in our terms, is stability, because when you test graphics hardware, stability isn't its strong suit. First, let's talk about parallelism. We use a huge set of tests, which would take, for example, 8 or 10 hours; if you want to run them before the code gets merged, in 20 minutes, that can be kind of an issue. So the first thing we do is use parallel jobs: we shard the tests over 8 or 10 devices, so we also shard the time needed to run them.
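As a toy illustration of that sharding idea, and an assumption rather than Mesa CI's actual mechanism, here is a deterministic split where every runner can compute its own slice of the suite independently; the dEQP-style test names are made up for the example.

```python
# shard_tests.py - toy deterministic sharding: shard i of N picks its slice
# of the test list without any coordination between runners.
import hashlib

def shard(tests: list[str], shard_index: int, shard_count: int) -> list[str]:
    def bucket(name: str) -> int:
        # Hash the test name so the assignment is stable across runs.
        digest = hashlib.sha1(name.encode()).hexdigest()
        return int(digest, 16) % shard_count
    return [t for t in tests if bucket(t) == shard_index]

tests = [f"dEQP-GLES2.functional.test_{i}" for i in range(100)]  # made-up names
for i in range(4):  # e.g. 4 boards, each running roughly a quarter of the suite
    print(i, len(shard(tests, i, 4)))
```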
That's the first part. The second part is parallelism inside the jobs: the GPU tests usually don't utilize the system at 100%, so we use parallel runs even inside the runners, for example we run the tests in 8 threads. There is a cost to that, and it's flakes, because the tests are usually not meant to be run with that much parallelism, so sometimes something fails and it's very hard to debug what failed and why: we know what failed, but it may be just that one job happened to run alongside another job once. That's a hard thing to handle, and we handle it; we'll get to that with the flakes point. So, flakes. You fix something, it works, but once in 100, once in 1,000 runs, it doesn't. We figured out we cannot get into a state where everything works every time, because we run many thousands of tests, so we have multiple layers of handling this. First, GitLab is a wonderful piece of software, but sometimes, you know, you just say: retry the job when it fails. This is one level of handling our flakes, because when we want to merge something and it fails once for some inexplicable reason, we don't want to block the developer, so we retry at least once. But just recently GitLab had a bug where a retried job would get stuck in a queue until you sent another job; and since we are constrained by time limits when merging, that means a failed pipeline and an unhappy developer. So what we did was send dummy jobs just to unstick it, and recently GitLab fixed the issue. That's the first level. Then you have the infra level: we have farms in different locations, connected over the internet, and the internet sometimes fails, even a data center fails, some switch somewhere fails, so sometimes you have issues transferring the rootfs or the test jobs; sometimes storage fails, sometimes the GitLab runner itself fails. It happens, and that's still handled by retrying the job. Then you have the device level, where you're getting data over the serial port: originally, when you boot the device, you get data over the serial port, and the adapters which convert the serial output from the devices to USB, towards the machine which takes care of it, sometimes fail or misbehave, which is very unpleasant, and it happens only from time to time. So, for example, my colleague recently implemented SSH support: we use the serial port just at the beginning, and as soon as we can, we switch to SSH, to be sure we get all the input and output of the machine correctly and can parse and understand it. And then, of course, you have GPU-level flakes: as I said, you run a lot of stuff in parallel, and on some rare occasion the driver hits an unhandled corner case, and one test out of ten thousand fails and you have to rerun. We are able to mark these tests, usually inside the testing suites: we just mark them as a flake, and if a flake happens, it gets reported, but it doesn't fail the job, which is nice. What is very useful is that we monitor everything: every day we have reports of which tests were flaking and how often, and based on this we can update the expectations, report to the developers, handle it somehow. When this got into place, it was the most useful thing for increasing reliability, I think.
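A minimal sketch of that expectations idea, under assumptions and not Mesa CI's actual implementation: failures on a known-flakes list are reported for the daily stats but don't fail the job, while any other failure does. The test names here are just examples.

```python
# flake_filter.py - simplified sketch: known flaky failures are reported
# but don't fail the job; unexpected failures do.
KNOWN_FLAKES = {"dEQP-GLES2.functional.flush_finish.wait"}  # example entry

def job_status(results: dict[str, str]) -> tuple[str, list[str]]:
    """results maps test name -> 'pass' | 'fail'."""
    hard_failures, flaked = [], []
    for test, outcome in results.items():
        if outcome == "fail":
            (flaked if test in KNOWN_FLAKES else hard_failures).append(test)
    status = "fail" if hard_failures else "pass"
    return status, flaked  # flakes still get reported, for the daily monitoring

status, flaked = job_status({
    "dEQP-GLES2.functional.flush_finish.wait": "fail",  # known flake
    "dEQP-GLES2.functional.color_clear.single": "pass",
})
print(status, flaked)  # pass ['dEQP-GLES2.functional.flush_finish.wait']
```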
The conclusion of my part is: if you have a CI like ours, for developing GPU drivers, you need at least one dedicated CI developer. The community can help you a lot; many people who are not CI developers just come to us and send patches, like, can you approve this dependency bump, fix this thing, improve this script, which is amazing, but you still have to have some people working on it full time. There is another thing: because several companies collaborate on our CI, if you're merging some bigger change across all the devices and all the farms, it takes much longer, and you have to be aware that the longer you wait to get merged, the bigger the chance that someone else pushes a merge request that breaks your changes; the CI is still changing very fast, even though it's really huge and already covers almost everything. And what is most important: reasonable reliability of the CI can be reached. We still get some failures, some issues, but at the scale of what we're testing, it's still great; the developers are feeling plus-minus happy, they are getting their code in, so everything is fine. Anyway, thank you for your attention, and I will pass over to Erico. So, hi everyone, I'm Erico, I work for Red Hat, and I'm going to talk about my project, which is to run my own farm, actually run it at my home. This work is not necessarily related to my job; it's something I classify more as community work. A few years ago, maybe four or five years ago, I started participating in the Lima project, which, I'm not sure if anyone has heard about it, is a community driver for the first generation of ARM Mali GPUs. It's a GPU that's a little bit older at this point, but it's still used by a number of embedded devices: for example, maybe you went across the hall and came across the PinePhone, a pretty popular device these days; it happens to run with this GPU driver that we developed with the community. There was fairly active development on this, both in the Mesa space and in the kernel space, over the past years; I mean, we got SuperTuxKart to run, so I guess that's basically job done. So now we're in a situation where the driver is more or less stable, we have very good coverage in the OpenGL ES compliance tests for this device, so development has slowed down a little; we got games to run, so we could say job done. But there's still one thing we can do as a contribution, which is to care about the regressions, because people are actually using this. If the developers are no longer pulling it every day and running the tests every day, something that would be really, really great to have is coverage in CI, so that whenever people push code to the shared infrastructure of Mesa, or new code to implement new features or fix a bug in the driver, we have coverage for it. But who is going to maintain it? There's no company backing this up, nobody's getting paid to do this work. At one of the conferences I attended, we were actually giving some of those boards to the speakers, and there were a few left, and I got offered to bring them home and maybe set up a CI farm somewhere. This is actually the device; I have a stack of those in my home, and they are the farm that runs the jobs. Can you put it on the stand? Nowadays we are at the point where this is getting tested as part of the big matrix pipeline that David was talking about: if you can see here, there are the Lima Mali 450 jobs, and they are running on this, at my home. We
can also see here some of the parallelism things that David was talking about: for example, these Piglit tests take a long time, and we have this rule that we should try to keep jobs somewhere under 10 minutes, so we split them, and they actually run on two boards, half of the tests on each, so we can reduce the run time. So what did I have to set up to get this working? I decided to use a LAVA farm, because I didn't want to implement yet another thing that powers the board on, connects to the serial port, reads from the serial port and types in some commands, that kind of thing. Also, these boards don't have a lot of storage, so they use an NFS root and they need to download the kernel over TFTP, and all these kinds of things I basically get for free by setting up LAVA. And because Mesa is a GitLab project, I'm running a GitLab runner as a separate runner as well. To set up the hardware side, I needed the actual boards, which I happened to have because I got them from the community; I needed to run a LAVA host somewhere and a GitLab runner somewhere, so a separate server runs these as a couple of virtual machines sitting next to those boards; and you need some solution to power the boards on and off, since they are not actually on all the time: every time there's a new job from Mesa, they need to be powered on, and they need to pull the kernel and everything. My current solution for the power, instead of going for some super expensive power control device, is one of those WiFi-controlled plugs: you make an HTTP request to it and it turns on or off, so that's like a one-line script I can run, and then I can just put this script into LAVA and LAVA will take care of that part. I also need a serial connection; for this I'm using those USB cables, and I have a picture on the next slide. And then there's the whole network infrastructure, which I'm going to talk about later. This is my view accessing LAVA directly; it's not visible outside, because I don't even have to expose LAVA to the internet, but this is what I can see: the jobs being run by different merge requests that people are submitting, or sometimes people testing their own branches. In the last month it ran 1,682 jobs, not counting the jobs LAVA runs itself just to check that a board hasn't disconnected or anything. Some of the challenges I had to deal with: setting up the initial secure network. Basically, I'm putting these boards on the internet, and by definition they are downloading code from the internet and running it inside my home, and that could be a little bit insecure, I guess, running code from the internet. So I set up full isolation for this network: these devices are blocked by the firewall, and at the switch level as well, so they cannot see any of the other computers connected to the same network. Some of the other challenges were related to infrastructure reliability: all those things I have listed there have failed at some point, and nowadays, since it runs as part of the actual CI, if my network is down and someone wants to merge something that touches the common part of Mesa, it will not be merged, because the tests are going to fail. So it took some time to get to a reliable state; I actually set up the boards way before I put them online as part of the pipeline.
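Going back to the power control for a moment: the "one-line script" could look something like the sketch below, wrapped so a tool like LAVA can call it. This is an assumption about the device, a Tasmota-style plug whose HTTP API is `http://<ip>/cm?cmnd=Power On|Off`; the address is a placeholder, so adjust for whatever plug you actually have.

```python
# power_cycle.py - toggle a WiFi smart plug over HTTP (Tasmota-style API,
# an assumption; vendor APIs vary). LAVA can invoke this to power boards.
import sys
import time
import requests

PLUG = "http://192.168.7.42"  # placeholder address on the isolated lab network

def power(state: str) -> None:
    r = requests.get(f"{PLUG}/cm", params={"cmnd": f"Power {state}"}, timeout=5)
    r.raise_for_status()

if __name__ == "__main__":
    action = sys.argv[1] if len(sys.argv) > 1 else "cycle"
    if action == "cycle":
        power("Off")
        time.sleep(3)          # let the board fully drain before cold-booting
        power("On")
    else:
        power(action.capitalize())  # "on" / "off"
```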
Each one of those things, and I don't have time to go over all of them now, eventually had some flakes, and I had to replace or change something. People ping me on IRC, like, hey, your lab is not responding, and if I don't fix it in something like 10 minutes, they just disable the farm, which is completely fine. As for the results I got from this: I actually got a lot of new developer engagement. Sometimes someone is enabling some new feature in the common part of Mesa, and they don't really care much about the driver we're developing, but they're like, hey, you know what, I'm developing this feature and I just added this one line to your driver, and they actually enable the feature for you as well, and CI is happy about it. So we get a lot of new patches by doing the work of having the tests upstream, which is something super cool that we have now. We started with just the super simple OpenGL ES 2 tests, but they have since grown, to the EGL tests and the Piglit tests as well. So many regressions were prevented, and I figured out a couple of kernel bugs that nobody had noticed before, because CI was running the tests every day; some of the kernel bugs we actually found because we tried to bump the kernel version in Mesa CI and it failed. And the nice thing is that it's very easy for anyone to disable the lab. I actually checked: the lab has been on for two years, and people disabled it six times over those two years; it happens that my network is down or something, someone can quickly flip a switch, and I'm not blocking anyone, and I think that's a good thing. Hopefully I can inspire someone: if you participate in some project that needs some super specific hardware, it is actually possible to do this, even in a project as big as Mesa, with so many contributors from many big companies. I was talking to someone this week and said that LAVA was actually something relatively pleasant to work with: I basically followed what's in the LAVA documentation, I set up my own LAVA lab, and it has not been a hassle since then, so I'm kind of happy working with LAVA, and it takes care of all the boring things I didn't want to care about. Mesa has some documentation as well, coming from the other LAVA lab, which I think is basically Collabora's at this point. I did have to learn some new stuff, especially about the network side of things, for example to provide isolation I was happy enough with to put this on the internet. But, as I mentioned a few times already, having a good way to disable me in case I'm causing trouble is perfectly fair, and I think it's a must-have if you actually go forward with something like this. And network isolation is really a must: we had incidents, and I can tell you about them afterwards if someone is interested, but fortunately these boards are not even able to see what's in the rest of my network. And with that I'm going to pass the word back to David. Thanks. So, I had to put up a slide about what's coming next. There are a lot of things coming, but what we are working on right now is, for example: for Mesa, when we want to add a new test or update some dependencies, we have to rebuild a roughly one-hour-long pipeline on at least three architectures. It started with developers saying, OK, I need this dependency inside the Debian image to be a little bit newer, let's compile this and let's compile that, and after some time you have a one-hour pipeline where you're compiling ccache'd software, and it
takes a long time. So right now we're trying to split it, so that if developers need to change some dependencies or tests, not Mesa itself but, for example, the libraries we use, they can have it in a few minutes, not in one hour. We're also trying to increase testing coverage: for example, we have tracing, we have traces which you can imagine like a replay of a game or application, where we just take one or two frames from the game and feed them to the GPU; we don't replay the game itself, we don't have to run it, we just run the frames that go to the GPU. So, Q&A, and of course adding more devices, because more devices is always better. Let's switch to Q&A; that was quick, I'm pressing the wrong button, too far, a little bit back. So, do you have any questions for us? Here is one question I will answer. Alright, to repeat the question: what happens when the GPU crashes the kernel? First, we usually enable testing in our CI when the kernel drivers are already in place and at least somewhat stable, but for some other devices which never had great kernel support coverage, and which we still test, it sometimes happens. There are two types of crashes. One is where the driver crashes but the kernel keeps working; that's completely fine, and we can even continue if the device supports restarting the GPU. When the device itself crashes, it's not a problem either, because we have timeouts set up: when the console, the SSH or serial connection, doesn't print anything for 5 minutes or so, we just power the device down and rerun the tests. And if it's crashing continuously, the developer has to fix their bugs. Does that answer your question? Yeah, kernel crashes under GPU load are not really a problem, because every time we download the kernel again and restart the board, so it's not like one test runs and then the next test is affected by it. Also, Mesa maintains its own kernel to run on the boards, and every time we update this kernel, we rerun the whole pipeline multiple times to make sure the kernel we are publishing for Mesa CI is stable on all of the boards; part of the CI work is maintaining this kernel as well, to make sure it doesn't cause any problems. Other questions? Yes. So the question was how we define the priorities for Marge and get the right jobs run at the right time. For example, for LAVA, we currently use an approach where the GitLab runner which serves LAVA has a kind of cache: let's say we have 80 devices, then we have something like 24 job slots open, and everything that gets pushed goes into these 24 slots, and each job has its own priority. Marge has the highest priority, user tests have a lower priority, and the nightly runs, or the runs on the main branch after the code is already merged, have the lowest priority. The trick is that when a job gets handed to the GitLab runner, the runner picks up the one with the highest priority around, and the others have to wait longer; that's for our LAVA farm. Yes, so thank you for your time and thank you for coming. One last question? Should I close the door so nobody comes in? Or, if you don't mind people coming in, the stragglers can come in. OK, perfect.

So I would like to welcome Ram Mohan for the presentation about who broke the build.
He will have a demo, and he asked me to ask you not to do any streaming, so that the demo fits into the time, because there will be some containers to download. Good evening everyone, or, from where I come from, good afternoon everyone. My name is Ram Mohan, and today's session is who broke the build: using kuttl to improve end-to-end testing and release faster. OK, welcome to this talk. How many of you here are developers? Managers? Unfortunately, you know, have you ever heard this from your manager? I hope everyone here as a developer has experienced breaking a build, and then the developer asks, how can I actually fix that, is there any better way of fixing broken builds? So let's get started. A quick intro about me: I am Ram Mohan Rao, I am a software engineer at JFrog India, I am passionate about open source, and I love table tennis. A quick introduction to what JFrog does, for people who don't know: JFrog was founded in 2008, it's a publicly listed company with around 1,100 employees across around 9 locations, a 7,000-customer base and 6 products, covering everything from Git to Kubernetes, hybrid, OSS and multi-cloud. We have a universal DevOps stack, and most of our developers are community champions as well. As you know, software runs everywhere; consider your mobile, I think most of you frequently upgrade your phone or Android, and Google and Apple use our backend to push those software updates to you. JFrog's mission here is to power all the software updates in the world: whenever you use any device, you want updates, so that the latest versions are actually being used. To set the context, I think most of you have come from these backgrounds: before 2000 people used a waterfall model, then around 2010 people were still using agile, and don't be misled by the slide saying it's still 2010, but after that most companies started to release daily; that's what matters. So let's get to today's agenda. As most of you here are developers, the major concern people tend to have is: how can I actually push my changes to production much faster? I will give an overview of what a development environment looks like, and then cover end-to-end testing, because most developers run unit tests locally but not end-to-end tests. My talk is focused on how developers can leverage a tool to run end-to-end tests locally: we have a tool called kuttl, we discovered it, we were fortunate to get in direct contact with the maintainer, and we were able to understand it and integrate it into our CI. Then a quick demo, which will be very interesting, and a quick summary of what we will have learned. OK, so an ideal development environment: say I'm a developer joining a new team, and I might be given a Windows laptop or a Linux desktop or a Mac, I don't really care; when I want to set up my dev environment, I should be free to set it up in such a fashion that it runs on day one. The onboarding side of the developer experience is something we need to focus on. When I joined as a developer in most of my previous companies, we used to have a Confluence wiki page with instructions to set up a dev environment; you would copy-paste all the instructions, it would take a day or two, most of the steps were manual, I would say, and it would run into issues. So let
me quickly get into what a development environment should look like. It should be single-click automation, it should let you develop and test locally as a developer, and it should be the same as the production environment. As I said, it generally used to take a couple of days to set up a dev environment; with automation you can set it up in 15 or 20 minutes. So what does automation save here? Time, and resources as well; having no manual steps implies being error-free. As a developer, I would also like to quickly reload my changes locally: I don't want to deploy to an external server to test my changes, I should be able to deploy them fully locally, and the dev environment should be very much equivalent to the production environment. OK, let's understand the problem we have with current feature branch development. I think most of you use Git as a source repository. When you develop a feature, as a developer you first analyze the requirements, then create a feature branch, do the coding, write unit tests, and push the code to the Git repository as a pull request. But when you test, you only run the local unit tests, not the end-to-end tests, because end-to-end tests are such a pain that you can't run them locally. Say, as a developer, when I raise a pull request or a merge request, the CI triggers the end-to-end tests on that branch, generally on a remote CI/CD server, and it takes at least a couple of hours to run them; if something breaks there, the developer loses a couple of hours understanding what actually broke. So we were thinking: how can we solve this problem? Let me iterate the steps: say the end-to-end tests fail on a remote CI/CD server. What the developer does is fix it, commit, push the changes to the Git repository, and see if it works now, but there is no environment to test against locally. The pain point is that the developer is unable to test everything locally, so this round trip continues; how can we avoid it? Let me quickly take you through the remote end-to-end flow: when a developer's unit tests break something, he fixes them, commits the changes again and raises a pull request; the CI/CD server runs those tests, the time to get feedback is huge, and only when the CI tests pass can you review or merge the code. So is there any solution we could think of? Just a heads-up that most of our development happens on Kubernetes, cloud native, so we were evaluating a couple of tools to see if something could help us reproduce the same thing. What we thought was: instead of running end-to-end tests remotely, why can't we leverage a tool that enables us to run the end-to-end tests locally? How would this help a developer? Instead of pushing and hitting environment issues on a remote CI/CD server that sometimes just doesn't work, you have those end-to-end tests locally; if tests fail locally, you can run them, commit, amend, run again and fix. So what I'm proposing here is: instead of running end-to-end tests remotely, run those tests locally. How can we achieve
this? As I said, we discovered a tool called kuttl. It's a cloud native tool; it lives under kudobuilder, the org behind the KUDO tool, and there is a specific Slack channel in the Kubernetes Slack called kudo you can refer to; you can also check the references at the end. So what is kuttl? kuttl is a Kubernetes test tool. It is basically used for writing tests, mainly designed for testing operators, custom resource definitions and controllers, or, simply put, for declaratively testing any Kubernetes objects. It's YAML based, so you don't need to learn any new language here; if you are comfortable with Ansible or similar, it should be easy enough, and it accelerates your ability to set up an environment. To get started, how can you install kuttl on your desktop? If you are using a Mac, you can run brew tap kudobuilder/tap and then brew install kuttl-cli; if you are using Linux, you can use kubectl krew install kuttl. kuttl also provides API integration: if you are a Go developer and would like to integrate it directly into your Go code as an end-to-end test framework, you can go get github.com/kudobuilder/kuttl as well. What is kuttl used for? If you are an application admin who wants to automate the creation of Kubernetes clusters, you can still use Ansible for that; this is more for testing, especially testing Kubernetes-operator-specific code, testing Kubernetes applications on different versions, or for a developer who would like to easily test operators without writing any operator-specific Go code. Let me explain what the kuttl framework looks like. It has three main objects. One is the test suite, which you can see here: it's a custom resource definition where the apiVersion is kuttl.dev/v1beta1 and the kind is TestSuite, and you can see startKIND set to false here. kuttl can use kind as a local cluster to run all those tests, so if you have Docker Desktop, you can use kind to start your test suite without needing any external Kubernetes cluster; I will demo the entire test suite and how it works. You can see it also has specific commands you can add as a pre-setup step. Next is a test step: you have the suite declaration, then a test step, say, I would like to install some specific app; and to check whether that succeeded, kuttl has assertions as well. A step installs something, or does some API testing, and an assert checks the expected values. For example, you probably know Artifactory: as part of my test I will install Artifactory and then assert that it comes up with a single replica. This is the test suite structure I have for my demo; you can see an end-to-end test directory that you add into your project structure. kuttl also provides parallel tests by default, so you can run 8 parallel tests: install is one test, scale is another test, and they run concurrently. And there is the kuttl test command that I have already shown you. So let me quickly run through the demo of how kuttl works. Can you see my screen? OK, so here is the same structure I showed you, and I would like to quickly run kuttl test. What it does is, let me actually show
you the code as well. You have the structure that I showed you; I have set startKIND to true, and I have my Docker Desktop running, which means it will use a kind cluster to run those tests. You can see here it has actually created a kind cluster, so let me quickly copy this configuration: I export this kubeconfig file, and I use a tool called k9s, an open source tool which makes it easy to view Kubernetes clusters. What I have done is run two tests as part of my end-to-end suite for the demo, and each test runs in parallel, in its own namespace in Kubernetes. You can see there are two namespaces being created, kuttl-test-extract-goldfish and kuttl-test-short-cowbird, and both of them try to install Artifactory and then run an assertion on top of it. The first is the basic install, which asserts on the replica count; the next is the scale test, where you install Artifactory with multiple replicas, the replica count is two here, and then assert that multiple replicas came up. This is a simple example of testing a basic Kubernetes object or application, but it can be combined with API tests or anything else; it is language agnostic, I would say, so whether you are a Go, Java or Python developer, you can still integrate end-to-end tests using kuttl into your project. Let's quickly walk through how the demo is running: first it runs the test suite, and you can see it ran a helm repo add, then it tries to run the two tests, install and scale, in parallel, so it is creating a namespace and then installing Artifactory; you can also watch in k9s how the installation is going. The reason I asked you to stop streaming on the WiFi is that the Artifactory image is around 700 MB, which takes probably 5 to 6 minutes to download and install, so bear with me for a few minutes. In the meantime, let me show you the kuttl documentation. As I said, this is a cloud native, free project, built by the KUDO team, and it has a Slack channel called kudo; you can see the release cycle, and it is very active. This is not maintained by JFrog, by the way; it is something we have been using, and we are working with the maintainer, Ken Sipe, who is based in the US, so if you have any issues with kuttl, feel free to reach out to him on the Kubernetes Slack. There is documentation specific to how kuttl test steps and test assertions work, and there are settings for integrating any Kubernetes cluster, not only running locally using kind. Let's go back: the Artifactory image finished downloading, and the tests ran successfully as well. Once a test is completed, the end-to-end run automatically deletes the namespace too. Let me quickly go back to the configuration of the test suite, where you can see the timeout we set is 900 seconds; it took us around 6 minutes to complete the end-to-end test locally. The main idea behind showing this kuttl demo is that with a simple setup we were able to run two tests in parallel, without any complication or complex logic, simple and easy to use.
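To make that demo layout easier to follow, here is a small Python scaffold, a sketch under assumptions rather than the presenter's actual repo: it writes out the kind of TestSuite, TestStep and assert files just described, with field names taken from the kuttl v1beta1 docs; the Helm repo URL, chart and workload names are placeholders, not the actual Artifactory chart invocation.

```python
# scaffold_kuttl.py - writes a minimal kuttl test layout: a suite that starts
# a kind cluster, plus one "install" test case with a step and an assert.
from pathlib import Path

TESTSUITE = """\
apiVersion: kuttl.dev/v1beta1
kind: TestSuite
startKIND: true          # spin up a local kind cluster for the run
testDirs:
  - ./e2e
timeout: 900              # seconds, matching the timeout shown in the demo
commands:
  - command: helm repo add example https://charts.example.com  # placeholder
"""

INSTALL_STEP = """\
apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - command: helm install my-app example/my-app --namespace $NAMESPACE
"""

INSTALL_ASSERT = """\
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app
status:
  readyReplicas: 1        # assert the install came up with a single replica
"""

def scaffold(root: str = ".") -> None:
    base = Path(root)
    (base / "kuttl-test.yaml").write_text(TESTSUITE)
    case = base / "e2e" / "install"
    case.mkdir(parents=True, exist_ok=True)
    (case / "00-install.yaml").write_text(INSTALL_STEP)
    (case / "00-assert.yaml").write_text(INSTALL_ASSERT)

if __name__ == "__main__":
    scaffold()
    print("Now run: kubectl kuttl test")
```

A "scale" case would follow the same pattern in e2e/scale, installing with two replicas and asserting readyReplicas: 2.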
This can be integrated into any language-specific code, whether Go, Python or Java; we started using it in our internal projects and saw the leverage it gives, and it completely cleans up the namespaces as well. Let me quickly go back to the presentation. The kuttl references: you can see the kuttl.dev docs, there is the GitHub reference I showed you, and there is the very specific Kubernetes Slack channel called kudo where you can post any queries. And there is the tool I showed you called k9s, which you can install with brew install k9s on a Mac, and which gives a graphical view of Kubernetes clusters and lets you access and manage them. Summary: kuttl is an open source tool, and you can contribute as well. It is used for local end-to-end testing, and with it we were able to minimize the builds we broke with our code: when you run end-to-end tests locally, most of the code you produce during feature branch development will work, and with fewer broken builds we can release much faster than before, which means happy developers. Any questions? Happy to take them. Could you please repeat? Yes, no, these run locally, using Docker Desktop as a kind Kubernetes cluster. OK, the question is: in the test suite we configured, where we provided the helm repo add, is that running in a specific container or on the local machine? It's running on your local machine. Any further questions? Right, so the question is: how can developers run local setups, how can we set up dev environments that are truly equivalent to production? We did this in my previous company, where we had a long setup: our production used Ansible to deploy RPMs onto production VMs, and we were able to replicate that using Docker containers. We used a CentOS Docker base image and some shell scripts we had written ourselves that would set up the Docker containers and install the RPMs on top, replicating the same thing. kuttl does that very effectively on the Kubernetes, cloud native side; if you want to set up Kubernetes in Go, you need to write Go code, set a context, create a namespace, and so on. I don't think most companies open source the dev setups they have locally, but a few do, and you can probably try those out; most of them use containers. Any further questions? OK, thanks everyone for joining this talk.

We have at least 34 attendees registered, so I should do my best. Hello everyone, let me welcome you here, thank you for coming to this session, and let me welcome our speakers, Akanksha and Heema. In case of Q&A, please save your questions for the end of the presentation; you can ask either in person or via our Matrix channel, and our speakers will answer them at the end. Thank you. Hello, can you hear me OK? OK, I'll try to be loud. So hey everyone, thank you for coming in. The topic we are trying to address here is optimizing long test run times using AIOps. Often, as contributors to open source projects, most of us have made pull requests to GitHub repositories, and they go into CI/CD tests, which often take
up so many resources that we cannot possibly run all the tests, or we just have backlogs. To answer how we deal with that, and where we come from and why we are doing this, we are here to talk about the optimal stopping point tool. I would like to introduce my colleague Heema. Hello everyone, I'm Heema Viradi, and I'm working as a senior data scientist at Red Hat. Both Akanksha and I are part of the emerging technologies data science team, which is part of the CTO office at Red Hat, and I'm based out of California in the United States. And my name is Akanksha Dughal, I'm also a senior data scientist in the emerging technologies group, and I come from Boston. To give you an overview of what we are going to talk about today: we start with AI for CI, which is the main project, and the optimal stopping point tool is a subset of that project. I'm going to go over the motivation behind the project, the solution we've come up with, the data sources we are using along the way, the workflow, some insights, and a final demo where we'll walk you through all the steps we've taken. To start introducing this AIOps toolkit called AI for CI, I would like to mention the problem we're trying to address: there is a need for automated, AI-driven monitoring when it comes to testing data. There is a huge amount of open source testing data being generated but not collected or put to use, and we see a lack of AI-driven metrics around open source community health. That brings us to the opportunity of leveraging the data sets being made available by many open source communities; there is a lot of open operations data, for example TestGrid, Prow and GitHub have so much data that we could put to use to find interesting metrics that could in turn benefit community health. I lost my cursor, guys, sorry. So the solution we've come up with is an open source AI toolkit that helps in collecting and analyzing all the CI/CD data that has been made available. As part of this project we've also built machine learning models for various use cases. One of them is called time to merge: every time you make a pull request, it often takes very long to be reviewed or merged into the main code base, so this model makes a prediction, posted as a comment on your PR every time you open one, of how long it will take for your pull request to be merged. Similar to this, the main topic of this presentation is the optimal stopping point classifier, which helps you understand how long it should ideally take for your test to finish running, and what the optimal stopping point is, after which it should be technically safe to just terminate that test, or maybe restart it, or inspect why it was acting like that, and save some resources, be they cloud resources or engineers; those are also resources you want to save for the right set of things. Another ML service we have is called the build log classifier: we all know there are tons and tons of logs being generated in Prow, and it's often difficult to go through so many of them, so what we would like to do is classify these Prow logs by the type of failure they belong to. Also, as part of this larger project, we have KPI and
metric dashboards: these are interactive dashboards you can look at to see all the metrics gathered so far. All in all, the aim of this project is to foster an open AIOps community that leverages all these amazing data sources. As part of this project, we have so far collected lots of data from sources like TestGrid, Prow and GitHub, collected metrics and created KPI dashboards; we have machine learning services that support CI/CD processes and are also integrated with GitHub projects, and they are easy to use. Finally, it's a resource for everybody in the community: there are templates, notebooks and scripts that anybody can put to use directly, basically for free. To come back to the topic we're trying to address, let me talk about why we are doing this. This is a graph of tests collected from a GitHub repository called CodeQL, showing the distribution of the run duration of all these tests. We see that most of the tests finish running within the first minute, which makes us believe that these tests technically shouldn't take long to complete. However, when we looked at the failing tests specifically, we saw a couple of outliers that take up all the resources, sometimes taking a near-infinite amount of time to finish. So wouldn't it be a good idea to come up with a point after which we know the test is probably just holding resources, or blocked somewhere, possibly by an outage or some other cause? The aim here is to find the bottlenecks and point out where things could be going wrong. The ML solution we've come up with is: when these tests take long, we would like to find an optimal stopping point after which the test is most certain to fail, so that in turn we can save and allocate our resources in the best fashion. Talking about the data sources: we all know GitHub is home to a lot of open source projects, and many people make contributions to these repositories on a daily basis, so we are collecting our data from GitHub. We've also looked at Prow, which is a Kubernetes-based CI/CD system; it holds a lot of data, and many of the checks that run on PRs are reported back to Prow, which makes it a good place to scrape data from. Then there is TestGrid, another platform that helps people visualize their CI processes; many communities, even beyond Red Hat, keep their data on TestGrid to visualize it, but we still think we can build better tools by scraping all this data from the back end and coming up with more insightful metrics and KPIs. Now I'll hand it over to Heema; she'll elaborate on the solution approach we've taken. Thanks Akanksha. Now that we have a brief understanding of the problem at hand, let's look at the approach we took to come up with this optimal stopping point model. First off, given the different data sources Akanksha went over, the first step is of course to start collecting this kind of data from our CI/CD tests. The main data source we are looking at right now is GitHub, and in GitHub you will have noticed that whenever you open a PR against a repo, it usually shows a bunch of checks that happen at the back end as part of the review process.
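The talk doesn't show the exact collection code, so as a hedged sketch, this is the kind of call the public GitHub Actions REST API supports for pulling run metadata; OWNER/REPO point at the CodeQL repo used as the running example, and a GITHUB_TOKEN environment variable is assumed to be set.

```python
# fetch_runs.py - pull workflow-run metadata via the GitHub Actions REST API
# (illustrative; the project's actual collection pipeline isn't shown here).
import os
import requests

OWNER, REPO = "github", "codeql"  # the repo used as the running example
API = f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runs"
headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

runs = []
params = {"per_page": 100, "page": 1}
while True:
    resp = requests.get(API, headers=headers, params=params, timeout=30)
    resp.raise_for_status()
    batch = resp.json()["workflow_runs"]
    if not batch:
        break
    for r in batch:
        runs.append({
            "workflow_id": r["workflow_id"],
            "conclusion": r["conclusion"],    # success / failure / ...
            "started": r["run_started_at"],
            "updated": r["updated_at"],       # duration ~ updated - started
        })
    params["page"] += 1

print(f"collected {len(runs)} runs")
```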
For example, you have pre-commit checks, you have your file linting checks, and apart from these there can be many other checks, and for any PR to get merged these are prerequisites of a sort: a test has to be successful in order for the PR to eventually get merged. All of these are defined under workflows, which are part of GitHub Actions, and that's how we're getting these data sets from various repositories. As an initial experiment, we look at a particular repository of interest, we check whether GitHub Actions has been enabled for that repo, and then we look at the different workflows that have been set up for it; we use an API that lets you extract a particular workflow ID, and all the checks within it can also be obtained through that API, and that's how we get all those test durations and the features we start seeing in that data. Once we have all of this data ready, we move on to feature engineering: in any ML modeling approach, you need to identify the important, relevant features that can be used as inputs for your model. Here we look mainly at the test durations Akanksha showed in that plot: for a given test, we look at the entire distribution of how long that test takes to run, and we bucket the runs into different time range intervals, for example 0 to 10 seconds, 10 to 20, and so on, all the way up to the completion of the test. Within each of those intervals we find how many tests are likely to fail in that time range, so we basically calculate the percentage of failures over time. Once we've calculated those durations and split them into buckets, we finally try to reach the optimal stopping point prediction. To do this, on top of the percentage of failures per time interval, we've defined a threshold, here 75%: if in some time range we see that the percentage of failures is more than 75%, we say that anything beyond that point means the test is likely holding up resources and is likely to fail. This threshold is just a default value we came up with; it's customizable, and we can tweak it as needed depending on how the workflows and checks have been set up, but it's the initial approach we implemented to define what that point in time looks like. Once we have that final interval, the time slice at which the test has a greater chance of failing, we eventually want to integrate this back into GitHub Actions, with the help of GitHub bots. What the bot will do is, whenever you have a PR open, run the model at the back end and then leave a comment on the PR saying: hey, this test is likely going to fail at this particular timestamp, you should probably stop it so that the rest of the checks don't get affected. It's still a work in progress, but that's ultimately the plugin we would like to enable for repositories where you have a lot of checks to pass, where it's a big PR, maybe, and a lot of developers are being put on it to review it.
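Here is a minimal pandas sketch of that bucketing-plus-threshold logic, an illustration under assumptions rather than the team's actual code: durations are bucketed into fixed 10-second slices, the failure rate is computed per slice, and the first slice at or above the threshold is suggested as the stopping point. The toy data is made up.

```python
# optimal_stopping_point.py - sketch: per-bucket failure rate vs a threshold.
import pandas as pd

def optimal_stopping_point(df: pd.DataFrame,
                           bucket_seconds: int = 10,
                           threshold: float = 0.75):
    """df needs a 'duration' column (seconds) and a boolean 'failed' column."""
    buckets = (df["duration"] // bucket_seconds) * bucket_seconds
    failure_rate = df.groupby(buckets)["failed"].mean()
    over = failure_rate[failure_rate >= threshold]
    # First time slice where >= threshold of the observed runs fail.
    return None if over.empty else float(over.index[0])

runs = pd.DataFrame({
    "duration": [4, 7, 12, 15, 41, 55, 62, 70, 88, 95],
    "failed":   [False, False, False, True, True, False, True, True, True, True],
})
print(optimal_stopping_point(runs))  # 40.0 -> "likely to fail beyond ~40s"
```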
Those reviewers often don't really have an idea as to why a certain test is failing, and you probably want to move on to the next phase of your review; eventually, that's the pain point the bot's comment is trying to overcome for your developers. So how does this entire workflow look in the end? You have your PRs being opened against a repo, and we have the optimal stopping point model running at the back end. In this graph, you have all those test duration buckets on the x axis, and the percentage of failures on the y axis. The first range is 0 to 10 seconds, then all the other ranges beyond that until the test completes, and over time the percentage of failures tends to keep spiking and rising. So anything beyond that 70% we say is eventually going to fail, or if we set the threshold to 60, then it's probably going to fail beyond the 50 to 60% range; and if we map that back onto the x axis, we know the time interval at which it was going to fail, maybe around one minute ten seconds or so. That's how the output is going to look, and finally this is what the GitHub Actions bot will eventually integrate as part of our service: it will leave a comment saying that it predicted that this particular check should be terminated beyond 50 seconds, else it's going to hog your resources. The idea is that we can also allow users to take actions based on that, so we would like to have some capability of asking the user, should I go ahead and stop this test, or giving them a set of actions they can take as part of their CI/CD process. But as a very low-hanging fruit, we basically want to leave a comment to provide more feedback for the repositories and for the PRs being opened in them. And now we can move quickly to a small demo. In this demo we're just going to go over the code and a little bit of the workflow that we follow, so I'll give it to Akanksha to start with the initial part of it. Awesome, alright. Just for demo purposes we've started to look into this repository called CodeQL, and like any of your repositories, this also has a couple of pull requests. If you take a look at any of the open pull requests here, we see that there are some checks; some haven't completed yet, some completed a couple of seconds ago. Things like these often take a lot of time, because there are so many checks and workflows being run. If you go to Actions, you can see more details about each of these checks and each of the runs that were made, how many failed, and so on.
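For context, a rough sketch of pulling the same information programmatically rather than browsing the Actions tab; the REST endpoint is GitHub's documented one, while the duration logic is a simplification of ours, not necessarily how the project computes it.

```python
import requests
from datetime import datetime

def fetch_runs(owner: str, repo: str, token: str):
    """Fetch recent workflow runs and derive a coarse duration/failure row."""
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runs"
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    rows = []
    for run in resp.json()["workflow_runs"]:
        # Timestamps come back as ISO 8601 with a trailing "Z"
        started = datetime.fromisoformat(run["run_started_at"].rstrip("Z"))
        updated = datetime.fromisoformat(run["updated_at"].rstrip("Z"))
        rows.append({
            "workflow": run["name"],
            "failed": run["conclusion"] == "failure",
            "duration": (updated - started).total_seconds(),
        })
    return rows
```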
So what we aim to do here is to get all of this data into Jupyter notebooks, which is the home ground for most data scientists to start exploring the data and finding insights in it. If we look at this notebook, the main agenda is just to scrape all the information we get from the Actions tab on GitHub. After we get that information, we put it into a decent format that we can use to perform some evaluations, and once we have collected all of it, we classify it into passing and failing tests, because it's easier to make a prediction once we know how both data sets look. Once we have the passing and failing frames ready, we split them into train and test data and move on to the optimal stopping point prediction. Next we'll go over the approach we followed to find the optimal stopping point. So, once we collected the data in the previous notebook; and just to mention, Jupyter notebooks, if you're not familiar with them, are basically an interactive way of writing your Python code. Why we call it a notebook is because you have everything broken down in a cell format like this: this would be the first cell that you run, and the outputs get printed one after the other, depending on how you write your code. It's a preferred tool for most data scientists, so if you're not familiar with it, go read more about Project Jupyter and you'll see how the tool is used. That's how this code looks in the notebook format. Moving on to start our analysis, we take those CSV files that we saw in the previous code, and we get two different sets of data: one for all your passing tests and one for all your failing tests. We read them separately, and we looked at the problem from two different approaches. The first approach was experimental; we don't use it actively right now, but it was something we researched a bit and did some analysis on, so I won't go too much into detail. The idea in this approach is to look at the statistical distribution of the run times: you again have the run duration on the x axis, and we see how many tests fall within the different buckets, so we have about 20 in the zero to less-than-three-seconds range, and so on. That's where we figure out the distribution pattern, and we do the same thing for both passing and failing tests. For statistical distributions, there are libraries in Python which will automatically pick the best distribution based on the values you have; here it's trying to fit these different types of distributions to our data set, and then it tells you which one fits best. Based on those distributions, we find the intersection point of your passing distribution and your failing distribution, and that intersection point is what we map to as our optimal stopping point on the x axis. So that was approach one, but we moved on to a different approach which is more favorable for us, based on the probabilities of tests failing rather than just looking at the distribution, because ultimately that's what you want to predict.
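A sketch of that first, experimental approach: fit a distribution to passing and failing run times and use their intersection as the stopping point. The choice of a log-normal here is an assumption for illustration; in practice a library that auto-selects the best-fitting distribution, as mentioned above, would pick the family.

```python
import numpy as np
from scipy import stats

def intersection_stopping_point(pass_durations, fail_durations):
    # Fit one distribution per outcome class (shape, loc, scale)
    pass_dist = stats.lognorm(*stats.lognorm.fit(pass_durations))
    fail_dist = stats.lognorm(*stats.lognorm.fit(fail_durations))
    # Scan a grid of durations and find where the failing density
    # first overtakes the passing density.
    upper = max(np.max(pass_durations), np.max(fail_durations))
    grid = np.linspace(0, upper, 1000)
    diff = fail_dist.pdf(grid) - pass_dist.pdf(grid)
    crossings = np.where(np.diff(np.sign(diff)) > 0)[0]
    return grid[crossings[0]] if crossings.size else None
```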
As we were talking about those buckets, here you see some tests are even going all the way up to an effectively unbounded timestamp, which is not something you want; you want to get rid of those long-running tests. After you understand that distribution, we plot the percentage of failures, and that's the ultimate approach we want to focus on. We set a threshold of 70 after observing a lot of tests. This is just one test we're showing it for, but over time you're going to have multiple repositories, multiple checks, multiple tests, so each data set is going to look different; in this particular code it's just one test for one particular repository. Keep in mind that threshold may not make sense in some situations, but at least it's a starting point. Then, to visualize it better, and more from a data scientist's perspective, we normalize the values rather than look at the raw values; we scale the values and things like that. These are just a couple more ways you can normalize your values and do the analysis on top of that. You'll see that the graphs look a bit different just because you've normalized the values, some more intervals appear and so on, but nothing too drastic. Ultimately we apply the threshold, intersect on the x axis, and find out what time duration or timestamp that corresponds to, and that's the point beyond which you should not let your test run. That's the overall goal of this particular code. Coming back to our slides: if you want to engage more or learn more about this, you can scan the QR code here, it'll take you to our GitHub repo. We track all of our work there, so if you have suggestions or feedback, we're open to any contributions, or even just opening an issue if you want to learn more. We also have another project called AI for CI, which Akanksha mentioned at the beginning of the talk. That's a larger open source initiative, a collection of all these models we're building: one of them we presented today, another is the time-to-merge model that some of our colleagues worked on. In that model we predict the amount of time it takes for a certain PR to get merged in a project. We don't want this to scare people into thinking it's going to take long for a PR to merge; the motivation is that new contributors to an open source project might be hesitant to participate and contribute code because they don't really know if their PR is going to be reviewed. For those first-time contributors, if you can predict and tell them, hey, this PR is going to get merged in a couple of days, it maybe gives them more confidence to contribute to your project. The second advantage is that it can help community managers look at their project and say: ok, it's taking a lot of time to review PRs, do we need to change something, how do we make it more efficient, things like that. That's how we came up with these different ways to consolidate these models, so if you want to learn more, I would encourage you to look at the AI for CI repo.
You'll find all of these resources there: the notebooks, the dashboards we've built, the data sources and so on, so do take a look. These are again some more references to all of our repositories, and with that I would like to stop here. Thank you all for attending, and if you have any questions, we have a couple of minutes to take them. Yes... yeah, that's a good question, repeating it: the question was how well this would work for a smaller GitHub repository. So if you have a smaller repo, a less mature project, I'm assuming, where you have smaller unit tests, how accurate would this be; is that your question? Yeah, how much data do you need. For any machine learning model the generic answer is more data is better, so I don't want to default to that answer, but I would say it might not be very accurate at first; at least over time, if we keep trying and see that it's predicting at somewhat near accuracy, that would be a good start. Of course, if that particular check or test is also found in other repos, we can use that as training data too, so it doesn't have to come just from your particular repo. If it's a test that is very specific to your repo, then the training data set may not be as large as expected, but the approach is generic enough that we can try to find data sets from other repositories which might have similar checks and use those to better train the model. That would be one way to approach it. Just to add to that: the more we retrain, the more feedback we get from the repositories, and the more tests we run, that's the feedback we look for in any machine learning model, so that we know how well the model is predicting, and we can retrain it on the new data being made available by new PRs and test runs. I think that would be useful. Does that answer your question? Yes... so, repeating the question: if you are taking tests from different repos and using them to train your model, how do you figure out whether the data is accurate? Right, so some of these checks, for example in that screenshot you saw a check for file linting, or a pre-commit check; those are more generic, standardized to some extent, so even if you're collecting from different repos we can have some confidence that the data is going to look similar. But yes, some of the other checks are more customized for a particular repository, so they might not work well for others; in those situations we might have to eliminate that data, so the model is not training on the wrong thing. It's kind of an experiment of checking which data sets it is able to learn from better versus which ones it is not, so that's something we'll have to look at further. Yes... yeah, that's definitely a good question, I'm going to repeat it, please correct me if I understand it wrong: you mean to ask that at this point we're making predictions for long-running tests, but what about the shorter-running tests; is it even accurate that a test finished so early, right, is that the question? Short-running tests and false negatives. That's a good question, but as of now we are just looking at the long-running tests; that is one of the use cases we would also like to address.
It's partly a job for the Prow logs, because if you don't get any information from the run duration, the best place to look for more info is Prow, right? Like, where do you go look for your logs, particularly for these tests; do you have logs being generated for these tests, is there a place you go to? Ok. For most of the tests, especially for OpenShift repositories, they are linked back to Prow, and often, when they don't get an answer, the engineers working on these projects go back to Prow, look at the logs and try to understand why a particular thing was happening a certain way. We also have a very early project on a Prow log classifier, which helps you understand why certain things are happening: it classifies these logs into categories, in terms of failures and passes, and what the possible reason could be if they behave a certain way. So that's another use case, but this one is mainly focused on long-running tests. Go for it. Yeah, thank you for that plug, Oindrila. Also, to add on to that: at this point it's a work in progress, but what we aim to do with this initiative is that anybody who wants to use this tool can have a customizable file where you just put in what you want the tool to do, starting from the threshold you would like to specify, anywhere between 75 and 95, whichever you want. The next step would be a tool that automatically terminates these tests, but that is only something we can do if the user or the repository owner lets us; it's something we want to add, but definitely something the owner has to take a call on. You can specify all these parameters, and once you do, the GitHub Action will automatically run on all incoming pull requests in the future. That's the plan; most of it is still work in progress, but we are happy to present what we have so far. Looks like we are out of time, so thank you all again for joining us, and please feel free to reach out with questions even later during the conference. Thank you.

So, today we'll be talking about Ansible, and more specifically about some tips and tricks we have collected over the course of the last few years around how to think about automation in a way that I think is a little bit more sensible, or sane. I've been roughly 20 years in IT at this point, I have been using Ansible for 10 years, and we'll be talking a little bit about the project that was actually my first Ansible project. It's a 10-year-old project, but I still think we did some things right, and I learned a lot from it. Now I work at Red Hat as a specialist solution architect for Ansible. The project that started my experience with Ansible was a little bit messy, let's put it this way. I arrived at this company as a consultant, and they started to describe the complex situation they had. They did not really feel that it was a problem, but they felt that every time they tried to do something, it was a little too complex for some reason. So, what was the situation? First of all, we're talking about a Java environment where they had roughly 300 GlassFish installations. GlassFish is an application server, where basically you deploy your application on top. The way they did their business was that
they had one application server per customer, so roughly 300 customers for this specific application, which means roughly 300 servers. And they were not all running the same version of GlassFish, because they started with 4.0, then some customers had issues with 4.0, so they rolled out 4.1 to some customers, obviously only to the customers that actually complained about the issue; then they found other issues, so they started to roll out 4.1.1, but again only to the customers that actually had issues. It was kind of the same application, in the sense that the application was the same, or at least that was what they told me, but not the same version. They had 250 of those machines, of those applications, running on EC2, roughly 10 per machine, and then 50 on bare metal in a completely different situation, on CentOS 5; the others were CentOS 6. So, different versions of everything. And 300 instances of MySQL, because, you know, why not: you have one application server per customer, why not also have one MySQL per customer. Now, interestingly, they were installing MySQL from RPM and they never updated any machine, which meant they had probably 60 different versions of MySQL running. And to cover it all, they had 5 scripts per instance to manage the instance lifecycle, and those were kind of the same, but only kind of: when you start to read them, you discover that they were not exactly the same. As you can see, it's not a very linear situation, but I am sure that, with some slight differences, you can find exactly the same situation in your organization. I have been a consultant for many years, I have worked at Red Hat for many years as a consultant and as a presales person, and I have seen a lot of different cases where this is roughly the same situation everywhere. So we started to search for a solution. We were looking for an automation system; that was the only thing clear at the beginning, and very quickly we started to add requirements. The first one was that it had to be simple, because why add more complexity to something that is already complex? Second, it had to co-exist with whatever was already running, because we are talking about a business in production: all those customers run these applications in production, and they run their whole company on this application, it's mission critical for them, so everything needed to move smoothly to automation without any interruption or issues for the end customers. It needed to keep the same security model, because we are talking about PCI DSS applications, applications that manage credit card numbers; if you change the security model, you have to re-audit everything, which might be a little more complex than we would like. And it had to be kind of self-documenting, in the sense that we wanted just one thing, not two, code and documentation together, so that over time there is no drift between the documentation and the code. And there was a requirement that I added for the customer, in the sense that it was not a requirement for them, but it became clear to me that it was a very important point: it had to be idempotent. Now, what do we mean by idempotency? I think idempotency is a great concept, very abstract sometimes, but very useful. It's a property of certain operations
that ensures that only the first time the operation actually changes something; from the second time onward, nothing will change, with some nuances. Some examples of idempotency: x = 100. No matter the value of x at the beginning, after the first run it will be 100, and from that moment onward it will always be 100, so effectively it will not change. Another one is x = x^0: x can be any number, then it becomes 1 and stays 1. Going into IT territory: echoing a string to a file, replacing the whole file, so with a single greater-than sign (`echo test > file`), is idempotent, because no matter the content of the file, after the first execution it will be stable, and it will just contain "test". Now, what is non-idempotent? An example would be x = x * 2, because the first time x is, let's say, 1, and then it becomes 2, 4, 8 and so on, so effectively it's not idempotent. Another example is very similar to the previous one, but with two greater-than signs (`echo test >> file`), which means we are appending to the file, so every time we execute it, the file will have one more line than before. Now, there are some edge cases to be aware of around idempotency. The first one is `yum update`. `yum update` is not strictly idempotent, because it relies on the repositories you fetch your RPMs from, and if the repository content changes, so if there is a new version of an RPM available, then you will have a change. So I would not exactly define it as idempotent; in math idempotency is easy, it's either right or wrong, but in the real world it's a little more mixed. For the same reason, `yum install something` can be idempotent or not based on the state of the repository. All of this can be fixed, or mitigated, if you manage your own repositories, so you don't use upstream repositories directly but something like Satellite, for instance, which allows you to be the one who creates the changes in the repositories. Even then it's not strictly idempotent, but it's something you can control and manage. wget, same thing: with wget you can have different idempotency issues. The first is that the source changes, which might not be ideal, and the second is corruption; wget should fix it or fail, but it can still be tricky to talk about idempotency when you wget from an unknown source. If you wget from your own server it's easier, and if you manage your own server and put the version in the file name too, then it becomes way easier to ensure idempotency. Another example of something that can be thought of as idempotent but actually is not: we only have one greater-than sign, so in theory it should be idempotent, but we are dumping a variable, and that variable's value can change. I have seen servers with terabytes of data completely wiped out due to this kind of issue, not with Ansible but with bash, and the problem was the same: `rm -rf $VARIABLE/*` becomes `rm -rf /*` if the variable is empty. So be very aware that variables can be very useful, but also very dangerous.
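To tie this back to the tooling: the same one-greater-than versus two contrast, expressed as Ansible tasks. A minimal illustrative sketch, not from the original project:

```yaml
# Idempotent: the file always ends up with exactly this content (one ">")
- name: Ensure the file has exactly this content
  ansible.builtin.copy:
    dest: /tmp/example.txt
    content: "test\n"

# Not idempotent: every run appends one more line (two ">>")
- name: Append a line to the file
  ansible.builtin.shell: echo "test" >> /tmp/example.txt
```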
Ansible was the solution that we found, otherwise it would not be in this talk, but why? First of all, it was agentless: agentless means we didn't have to install anything on the individual machines, just create a new service to run the automation from. Second, it connects to the machines via SSH; it can also connect with other protocols, but in our case it was SSH only, so SSH was obviously the choice we went for. And it does not care about the state of the rest of the machine, which means that if you tell Ansible to handle a specific file, for instance, it will only care about that file; all the rest of the machine, who cares. This really helped us do that slow migration to automation. A big advantage is that it applies changes in a sequential way: if you look at a playbook, the first thing you see is the first thing Ansible will try to do, then the second, the third and so on. This makes the whole thing very simple, and it can seem obvious that something executes in a linear way, but in a lot of automation systems this is not the case, because they try to optimize by parallelizing operations, and that creates a much more complex situation. It also has a very gentle learning curve: as we'll see with a little bit of YAML code, YAML is fairly straightforward. I would argue that Ansible is not configuration as code, it's configuration as data, because I would not really define YAML as code, and an Ansible playbook can be easily read even by non-technical people, like auditors. We were in a PCI DSS environment, so we had auditors every three to six months asking questions, and being able to tell an auditor "this is exactly what we are doing" helps a lot. And it's very simple to set up. Another big advantage, and it was a deciding factor in the end, is that it's a Swiss Army knife: you can use it to provision new systems, for deployment, for configuration, for whatever you want. So the initial setup is simple, but how simple exactly? You first create SSH keys, if you don't already have them; you should already have them, but you can maybe create one specifically for Ansible, which could be a good idea. Then you distribute those SSH keys, you create a git repository where you will put your Ansible code (yes, you can do it without git; you really don't want to do that), you create the first inventory where you list all your machines, and you are done, you are ready to actually write playbooks. Now, which playbook should you write, now that the whole environment is ready? You need to select a process to automate, your first process. How to select it? First of all, it should not be a critical operation; if you try to do something critical as the first thing, it will probably not go too well, so start with something very easy. It should be a very well understood process, by you or by whoever is writing the automation, because if you understand the process, it's very easy to automate; if you don't, it's going to be more complex, also because if you do the operation yourself it's interactive, so even if you make a mistake, the console will tell you: wrong parameter, wrong whatever. With automation it's a little more indirect, so if you don't know the process exactly, the effort to automate it can become much longer. It should be easy to test: whatever process you select, you should have a clear understanding of the process and of how to prove that it was successful or not. It could also be a boring thing that you should be doing, where you say, I don't really want to do this, but I can automate it, which can be fun. And you should have a little bit of time:
the time to automate a process is usually a few times longer than the time to execute it. It should not be millions of times longer, but still, if it's a process that you usually run in 5 minutes, it will probably take you 30, 60, 90 minutes to automate, so if you are in a rush, it's not a good time. The first process that we automated was the provisioning of those application servers, because we had new customers in the pipeline with the requirement of having new servers deployed. The first part is understanding the process. What they described when I asked which process they were following was basically a 30-step-long process. We worked first on the process, changed it, made it sensible, and at that point we automated the sensible version of the process: much easier, much simpler, much quicker. So: install Java from RPM; create the glassfish user, because for their own reasons they wanted a dedicated user; download Payara, which comes in a zip file, which is why we had to install unzip; unarchive Payara; set the Payara file ownership to glassfish; and create the systemd unit to start it. This is the process we used for the bare metal servers; for the EC2 ones it was slightly different, but same idea. How do you automate it with Ansible? You create a playbook such as the one sketched below, and as you can see, with very little change in wording you can find the previous steps in the task names: ensure we have Java installed, ensure we have the glassfish user, ensure we have unzip installed, ensure the Payara installer is present, and so on. Then you have a few lines in between that are the actual Ansible code, and the two are very similar; obviously each task has a module and some parameters, but if you read one or the other, it feels like the same thing, and if you have both available, it's way easier to understand both. In the second part of the same process, as you can see, we had 7 points here, 4 steps here, 3 steps here, 7 again: we started from a process that had 7 steps, we ended up with 7 steps of automation. This is a good thing. If you start with a process that has 3 steps and you end up with more than 3 steps in the automation, your process was wrongly mapped; you have to map your processes very well to ensure that your automation follows exactly what you predict. Some considerations around this: first of all, if allowed, redesign your process before you automate it, or while you are automating it, because it's going to be easier. Second, simpler is better, so if there are steps that are no longer required, remove them.
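A hedged reconstruction of what such a provisioning playbook can look like; module choices are standard Ansible, but package names, paths, the download URL and the template are illustrative, not the original customer's values:

```yaml
- name: Provision a Payara application server
  hosts: appservers
  become: true
  tasks:
    - name: Ensure we have Java installed
      ansible.builtin.yum:
        name: java-11-openjdk        # illustrative package name
        state: present

    - name: Ensure the glassfish user exists
      ansible.builtin.user:
        name: glassfish

    - name: Ensure unzip is installed
      ansible.builtin.yum:
        name: unzip
        state: present

    - name: Ensure Payara is unpacked and owned by glassfish
      ansible.builtin.unarchive:
        src: https://example.com/payara.zip   # hypothetical URL
        dest: /opt
        remote_src: true
        creates: /opt/payara                  # skip when already unpacked
        owner: glassfish
        group: glassfish

    - name: Ensure the systemd unit is present
      ansible.builtin.template:
        src: payara.service.j2                # hypothetical template
        dest: /etc/systemd/system/payara.service
```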
Another thing we worked on was the automation of users. The life cycle of a user in every company is slightly different, but roughly it's like this: you create a user on certain machines, you add the user to certain groups, you add SSH keys for this user so they can log in, then time passes, and at a certain point you delete the user on all the machines it has access to, because that person maybe does not work there anymore, or for other reasons. Same as before, we can very simply automate the creation of the user: we have the built-in user module, we pass the name, the shell, the groups, with the state already set to present; we ensure SSH keys are available by copying the SSH keys to the target machines, those are the public SSH keys, not the private ones. And then we can call it in this weird way, which is basically passing a bunch of extra variables to make explicit which user, which groups, which UID we want, and which target hosts this user should be present on. And we can remove users in a very similar way. Now, some considerations around this: it's very easy with Ansible to create distributed batch scripts, which is exactly what we have shown now, but that is not idempotent. It's very good to know that you can do it, it can even be good to do it, but be aware that it is not idempotent; it is not something you should aim for, but it can be a good stepping stone, and it will improve the consistency of your environment even in a non-idempotent way, though you are not leveraging the whole tool. Automation is a step-by-step process: don't try to automate everything at once, try to create something easy that solves your first issue, and then over time improve your scripts, your code, to make them even better. Don't try to big-bang the whole environment at the same time, because it will not really work. So we wrote a second version of the user automation; the first one was not idempotent, which was okay at the beginning, but not in the long term. We decided to create a YAML file that describes the users we want to have in our environment, so we have, for instance, the user "file" and the user "admin", and we could have many more of them; then we changed the first script we saw to use with_items and iterate over that file, as sketched below. This gives us an effectively idempotent way of creating users and ensuring that all users are always present on the right machines. This still does not cover the delete part of users; that was something we continued to handle in a non-idempotent way, because it's a little more complex and there are security concerns around removing users, so that was not changed over time. But it's a good example of something that started as a distributed script written in Ansible and then became something idempotent.
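A sketch of that second version, with users described as data in a vars file and one task iterating over them; the file layout and field names are illustrative (with_items is what we used back then; loop is the modern spelling):

```yaml
# users.yml: the data file describing the users we want (illustrative)
users:
  - name: admin
    groups: wheel
    uid: 2001
    pubkey: "ssh-ed25519 AAAA... admin@example"
---
# playbook tasks iterating over that data
- name: Ensure users are present
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: "{{ item.groups }}"
    uid: "{{ item.uid }}"
    state: present
  with_items: "{{ users }}"

- name: Ensure their public SSH keys are deployed
  ansible.posix.authorized_key:
    user: "{{ item.name }}"
    key: "{{ item.pubkey }}"
  with_items: "{{ users }}"
```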
Wrapping up: Ansible can provide you with a simple way to automate distributed processes, if that is your stage of automation. Start with some low-hanging fruit and build on top of that. If possible, rethink processes: don't try to automate whatever process someone tells you is "the way we do things here"; try to understand what they are really trying to achieve, and then create a good process to achieve that result. Aim for complete life cycle automation, but work over time, don't rush it; and, in a way, work towards complete company automation, automating all processes within a company, organization, group, whatever, but same as before, don't rush it, it's an over-time process. Automation has a huge impact on people, even more than on processes, so it's critical to explain to people what will change, how it will change, what the impact on them will be, and to repeat these things many, many times, because a lot of times people are scared of change, and when we talk about automation it's even scarier, because the first thing they think is: great, I'm now going to be replaced by this thing and be fired. That is usually not the case; obviously there are some cases where it is, but 99.9% of the time, in my experience, automation such as Ansible does not mean people being fired; it simply means that people's roles will change, how they work will change, their processes will change, but over time everything is fine, no one is going to be fired because of this. And there is a big point that is critical: be intelligent, be specific about what you are automating. If you have a long process, that process has a certain speed, and that speed is probably the speed of the slowest component, the slowest step of the process. If you automate or speed up anything before or after that step, it will just be a waste: if you want to increase the speed or the efficiency of a process, you have to identify the weak link and work on that part, not on the rest of the process, otherwise it's just waste. So thank you, and if there are any questions, we still have some minutes. So, the question is how big roles and playbooks should be to be useful, not too big, not too small? Well, my answer is that there is no right answer, in the sense that it's absolutely okay, and that is usually how it works, that you start to create a role and over time it becomes bigger and bigger, and at a certain point it's like: okay, that's too big, let's split it out. That is exactly the iterative process you should be going through, because there is no right answer like "it's 57 lines"; it really depends. I have seen roles that made sense and were below 20 lines, others that were more than 1000 lines and still kind of made sense; it really depends on your specific environment. My rule of thumb is that a role should be as big as possible while still being able to run exactly the same in different environments without any change. That means that if, for instance, you are installing Linux and MySQL in the same role, it probably does not work well, because there will be cases where you only want Linux or only MySQL; but if it's a very big role that is configuring, say, an Oracle RAC, then it's fine, as long as every time you install Oracle you are installing an Oracle RAC. So that is a little bit my rule of thumb. Yes... so the question is about which text editor to use for YAML. This is absolutely a personal choice; my personal choice is vi, I think it's the best tool ever, but I do understand that 90% of the people who use Ansible or write YAML do not use vi and are very well off without it. I personally find vi very comfortable because I can grep stuff on the command line, and in vi I can open multiple tabs, multiple windows at the same time; this is my workflow. I think that over time you have to find the workflow that works for you, and it could be VS Code, it could be vi, it could be something else; it really depends on how you are used to interacting with your tools. Yes... so I think the question is about the best way to handle a lot of playbooks automating a big environment, and I think the real point is that it depends on which stage of automation you are in. If we are talking about early-stage automation, where usually you are running playbooks yourself, having multiple playbooks can be okay. Over time, when you reach the point where you can effectively nuke your whole environment and recreate it from scratch just with automation, and you can run the automation over and over because it is completely idempotent and so on, I would expect you not to have many playbooks; or maybe you do have many playbooks because you want to do something specific in some cases, but roughly your environment should be described in roles more than playbooks, and then
I would usually have one playbook that simply calls every single role for every single environment, and that would also be part of my CI/CD system, where every time I commit something, it triggers that playbook to be re-run, and therefore my environment is aligned to my latest change in the git repo. But those are very different situations based on the level of maturity of the automation in that environment. Yes... so, there is no easy way, because it's basically a problem of proving the impossibility or the non-existence of something, so it's a little tricky. The way we handled that in this case was basically that, within a very short amount of time, we had one user per machine, and it was the Ansible user; that was the only user allowed to connect to any production system. We decided at a certain point that that was the right strategy; before that, for a certain amount of time, we had a number of users that were non-root, non-sudo users. It's an over-time process; I think that ultimately the only way to really solve the users issue is not having users on the machines at all. Yes... and that is a big problem. If, for instance, it's the absence of something within a file, then I would use a template for the file and replace the whole file with exactly the content I want. With users, in theory, you can do the same, because users are just entries in the users file, so effectively in theory you could do the same there, but handling users that way can be critically problematic if something goes wrong, or in other situations like that. So, the question is about git structures for Ansible playbooks and roles, and whether I recommend sticking with the recommended layout: my answer is yes, I do recommend starting with the recommended layout and going from there. Those are not all strict rules; some of them are, for instance the folders within a role have to be named a certain way, but roughly the whole organization of files is not a strict rule, and there is a reason for that: you have to be able to change it if it does not fit you. If you want to move away from the standard, read the standard very well, try to understand why they did what they did, and decide whether it really does not apply to your situation; but if a different structure makes sense for you, definitely have a different structure. So thank you, and with this we are out of time, but I will be around, probably just outside the door. Thank you.

Let's talk about Linux System Roles. Hello everyone, welcome. My name is Fernando, I am a senior software engineer at Red Hat, I work in the Networking Services team, mainly focused on network management, and today we are going to talk about Linux System Roles. So let's go to the important part: what will you learn from this talk? Basically, when you walk out of the room you should have learned what Linux System Roles is, what Nmstate is, how we can mix them, how to navigate the API of Nmstate and the network role, and then the new features of Nmstate. Ideally this should help you configure your networking using Ansible and using the role. So let's go with the first part: what is Linux System Roles? Linux System Roles is a set of roles that aim to configure Linux components for different subsystems. It is a set of roles, but it is also available as an Ansible collection, so if you want to use the Ansible collection, feel free to do it. We must say that it always tries to use
native libraries, sorry, native libraries from the respective subsystem, instead of using CLI commands; whenever we can, we use the libraries directly. And it supports a lot of subsystems, like Podman, SELinux, storage, network, SSH, FreeIPA, many of them, but today we are going to focus on the network role. Ok, the network role. The network role supports two providers: one is NetworkManager, and the other is initscripts. We are going to forget about initscripts and focus on NetworkManager. It supports a lot of features, for example a lot of interface types, like Ethernet, VLAN, bond, bridges, plenty of them; IPv4 and IPv6 address configuration; DNS configuration; routes; routing policy; DNS option configuration; and some other utilities, like 802.1X authentication or wireless. Yeah, plenty of them. Let's check first some examples of a Linux System Roles configuration file. The main thing to notice is that when using the NetworkManager provider, we focus on connections: as you know, NetworkManager has its own configuration files, which are called connections, so we define a variable, which is network_connections, and in this variable we define the properties. In the end, what we are doing is translating a keyfile into a YAML file: we have the name of the connection, the state, the type, the interface name, and some other configuration we would like to use, for example the IP address, or bond options like the mode, or other options like miimon or whatever. In this specific example we are defining a bonding interface and attaching an Ethernet interface to it, so we have the controller property here, and we need to define the eth1 connection and then the bonding connection, something like the sketch below. I am showing this to make the difference with the Nmstate configuration clearer later. On the other side, we have Ethernet with a VLAN on top of it: we have the Ethernet with the main connection properties, IP, Ethernet, autoconnect, the state up, the interface name; here we are just putting a VLAN on top, we could place whatever we would like, without the DHCP configuration; and then we have the VLAN configuration with VLAN ID 100. It's quite simple, because I wanted to use simple examples to show the difference better. So basically this is the network role when using the NetworkManager provider and the network_connections variable.
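An illustrative network_connections input for the role's NetworkManager provider, a bond plus one Ethernet port attached to it; the values are made up, and some property names are from memory, so treat this as a sketch rather than the role's exact schema:

```yaml
network_connections:
  - name: bond0
    type: bond
    state: up
    interface_name: bond0
    bond:
      mode: active-backup
      miimon: 100
    ip:
      dhcp4: false
      address:
        - 192.0.2.10/24

  # The port has to be defined explicitly and pointed at its controller
  - name: eth1
    type: ethernet
    state: up
    interface_name: eth1
    controller: bond0
    port_type: bond
```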
Now let's talk a little bit about Nmstate. So, what is Nmstate? Nmstate is a library, with an accompanying command line tool, that manages host networking configuration, and it does it in a declarative manner. It is written in Rust, and we provide a native Rust library, but we also have plenty of bindings: C, Golang, Python and others. One important thing, and the difference here, is that when defining Nmstate configuration files, we focus on the device. We are not going to focus on the NetworkManager details, we are not going to focus on how the specific properties are going to be written; we only focus on what the user wants configured. So the user can define, for example: I want a bond interface, with type bond, state up, and the different bond properties, like the mode, options like miimon, and the ports, and this has the same effect as the previous example. This is the main difference: there we are talking about a connection name and an interface name; here that distinction doesn't exist, we only have the interface name, which is what is interesting to us. Also, Nmstate tries to help the user: if the user wants eth1 and eth2 configured and the interfaces are present on the system, Nmstate will automatically create an Ethernet connection for them, manage them, and attach them automatically to the bond, so you don't need to care about how the Ethernet is configured. Obviously, if you want to configure something specific on the Ethernet interfaces, you will need to define it, but if not, you don't need to care; with network_connections you do need to define it, because you need to state that the controller is the bond. More differences: for example, if we create a VLAN, we create a VLAN eth1.100 with type vlan, state up, IPv4 enabled true, and so on. What will happen is, if this VLAN or this bond already exists and has something configured, Nmstate will keep the existing properties and only replace the ones the user specified. For example, if the user specified miimon 100, Nmstate is going to set miimon 100, but the other properties are kept, they are not replaced. With the connection-profile approach it is different: there it will replace the other properties, because in the end it is rewriting the whole configuration. This is possible because of how Nmstate works: it first fetches the existing state, so it fetches the existing configuration translated into a YAML file similar to the ones we use, merges them, and applies them. This allows us to do what we call partial editing, which is quite great. But currently, Nmstate by itself can only be used on your own host, and we wanted a tool that can use Nmstate across multiple hosts at the same time, and then we thought: the network role. Alright, so let's mix them. We introduced the network_state variable. When using the NetworkManager provider, as I mentioned before, you use the network_connections variable; now we have something different, the network_state variable. How does it work? In network_state, the user defines the complete network state they want applied, and they do it using the Nmstate schema (see the example below), so if the user is already familiar with the Nmstate API, they don't need to learn something new, they can use the same schema and apply it. As I mentioned before, thanks to this, Nmstate's partial editing is possible, and it won't replace the existing connections on the system.
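The same bond expressed as a desired state through the network_state variable, following the Nmstate schema as I understand it; we describe the device we want, not NetworkManager profiles, and eth1/eth2 get picked up and attached automatically. Addresses and options are illustrative:

```yaml
network_state:
  interfaces:
    - name: bond0
      type: bond
      state: up
      ipv4:
        enabled: true
        dhcp: false
        address:
          - ip: 192.0.2.10
            prefix-length: 24
      link-aggregation:
        mode: active-backup
        options:
          miimon: 100
        port:
          - eth1
          - eth2
```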
Not mentioned here, but also very important: Nmstate has an extra feature, it does verification, checkpoints and rollbacks. When you define a state, we call it the desired state, Nmstate first checks the current state, says ok, this is what we have configured, and, using the NetworkManager checkpoints feature, saves a checkpoint. Then it tries to apply the settings; if something goes wrong, it reverts. And even if applying went well according to the kernel, Nmstate then fetches the existing configuration again and compares it with the desired one; if they don't match, it means something was not configured properly, so it rolls back. Obviously this can be disabled, but it is a really, really good feature, because if you misconfigure something and you lose your connectivity, it's going to revert back to the state before, and the machine will hopefully have connectivity again. Also important: it simplifies the API a lot for the user, because we focus on the interface instead of the NetworkManager connection. On the other hand, it is true that if you want to configure very specific NetworkManager details, that is not going to be possible through Nmstate, because that's not the scope we want to cover. Alright, so let's see some examples of combining them. As you can see, this is basically an Ethernet interface configuration: we define network_state, and in it the whole state that we want to configure, in this case an Ethernet configured with IPv4 and IPv6, and that's all. In my opinion it's quite clear: just by looking at it, even if you are not familiar with the API and it's the first time you see it, you should be able to recognize what it is doing. Then we can go to a little more complex state: in this case we are configuring routes and an eth1 interface. The same thing applies: we don't define the routes in a specific NetworkManager profile, we just say, I want this route configured, and Nmstate will find out how to do it, so the user never needs to care about how it is done, only about what is going to be configured. Then we have another example with DNS: we can configure the search domains and the server names. Again, none of this needs to be configured per profile, it's configured for the whole networking stack, and this is quite good, because in the past, if you configured DNS on one interface and for some reason you needed to drop that interface, the DNS configuration was dropped as well. This way you don't need to worry about that when you drop an interface: as I said, partial editing is one of the main features of Nmstate, and Nmstate will recognize that DNS was configured in the past, so if the DNS configuration is stored on one interface, it's going to move it to another one, or to a global configuration in case there is no suitable interface. It's quite great, because you don't need to keep moving the DNS, or routes, or routing policy configuration from one connection to another; you configure it once and forget about it until you want to drop it. A sketch of global routes and DNS in this schema follows.
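Illustrative global DNS and routes in the Nmstate schema, configured once for the whole host rather than tied to a single profile; the addresses are documentation examples:

```yaml
network_state:
  dns-resolver:
    config:
      search:
        - example.com
      server:
        - 192.0.2.53
  routes:
    config:
      - destination: 0.0.0.0/0
        next-hop-address: 192.0.2.1
        next-hop-interface: eth1
```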
Alright, so how to navigate the API: the Nmstate API is declarative, and we try to make it as intuitive as possible. It is documented at nmstate.io, and it uses inclusive language; I recommend you take a look at our web page. We have documented the YAML API, we also support JSON, and we also ship libraries, Python, Rust, etc., which are documented as well, so you can check the libraries' documentation. One big benefit we get in the network role by supporting Nmstate is that we don't need to implement things twice. Imagine a situation where NetworkManager has a new feature: if it's supported in Nmstate, it's immediately supported in the network role, because the role uses Nmstate directly. For the developers it's quite cool, because we don't need to implement things again and again, which is quite common when you have a daemon and a set of tools interacting with that daemon. It's also beneficial for the user, because when we fix something in Nmstate, it's immediately fixed in the network role. I recommend you try it out, and I would like to hear your opinions; we have mailing lists, and the projects are on GitHub, Nmstate and the network role, and there should be some more. Ok, so I would like to hear some feedback from you, questions, any comment at all. Yes: you can use Ansible, I mean, you can use this as an Ansible role, so if you are configuring or managing your infrastructure using Ansible, this can be part of your tooling; you apply this directly with Ansible and it's applied to all the hosts you manage with Ansible. Could you speak up, sorry? Yes... yes, but as far as I know, for example, if you use... yes, Nmstate is basically a client of NetworkManager, so... alright, ok, the question for the stream: the question was whether you need to install Nmstate on your host, the way you install nmcli, to make it work. Yes, but as far as I know, the network role is able to install the necessary tools when needed, so when you try to use it, it's going to install Nmstate. Obviously this is not supported on unsupported distributions; I would have to check the list, but I think the supported distributions are RHEL, CentOS, Fedora, and I think some more, but I'm not completely sure. It's fine, thank you. More questions? No? Alright, so I hope you enjoyed it very much.

For those who were here this morning: we have the last session for today, on confidential computing from the host to the workload, with Christophe. Hello, so my name is Christophe de Dinechin, try to say that three times fast; I hope you learned something since this morning. I'm working for Red Hat; nobody knows what I do there, because I don't know myself, vaguely working on confidential computing, trying to show how to go from the confidential computing host all the way to the workload. What we are going to cover today: first, a quick overview of confidential computing, which is a repeat for the folks who were faithful and here this morning; then we are going to show how to access a guest's memory to steal secrets; we are going to set up a confidential host, which is extremely simple, as you'll see; then we'll try to set up a confidential VM, and hopefully everything happens as we expect; then we'll discuss some limits of memory protection; we'll talk very quickly about confidential VMs, for instance on Azure and other places, but Vitaly explained all of that this morning, so there's a better talk for that; and things like confidential clusters, as well as examples of attestation and confidential containers, I'll probably have to skip for time reasons. There is a workshop on Sunday where we'll show you how to do all that in practice, so you're welcome to join the workshop for containers. The problem statement is how to protect data in use, that is, data in memory. Confidentiality is really the essence of
trust, and it's broken right now in the cloud, because your infrastructure essentially sees your data. The software that you have runs on the cloud, also known as someone else's computer, and the hardware resources there are owned by that host; you don't really buy the memory or whatever, you just rent it, and you run various workloads on it: typically things like containers, it can be virtual machines, it can be other things. We have many mechanisms in Linux, cgroups, chroots, memory limits, all these kinds of things, designed to protect the host from the workloads. The other way around, nobody really cared: whoever has root access on the host can fully look at the files of your containers, look inside the memory, and so on. That means that if you have competitors on the same hardware, they tend not to be very at ease putting really confidential stuff there, because nothing tells them that their competitor did not pay the sysadmin more than they did, and so in the end maybe their data is being shared. We have solutions for data on disk, or data in transit, networking, and so on, so that the host essentially cannot know what you have; when you use an encrypted volume, for instance, normally you're safe. Memory is a different problem, and there are tons of secrets that live there. These secrets may be transient, they may be moving fast, but you can still extract pretty much anything from memory if you know how to look. So what confidential computing first added was a way to encrypt memory. The encryption is not necessarily super strong, which is why I'm not using super strong encryption in my examples either, but it's very fast, it happens on every write to memory, and it really offers some serious protection anyway. And, can I take that question at the end... the short answer is that the encryption is normally not visible to the host at all, and it depends on the page, on the location, and on other aspects that make it hard to break in practice. What could work well would be replay attacks on a given page, for instance; that would work better, because the encryption mechanism would be the same for that page, but we can take that later, that's a very good question. And when I say weak, it's not super weak either, it's still relatively serious. There is also integrity protection for the CPU state, essentially to make sure that the hypervisor cannot make your workload do whatever it wants by changing the address of execution or some registers in the CPU state. And one thing that I covered extensively this morning was proving that what you run is what you want, through mechanisms known as attestation, which give you cryptographic guarantees about what you run. Ok, so let's go to something real: this morning was sort of a joke presentation, today we are going to look at a real system. I would have liked to do that live, but I'll explain in a moment why I did not do it live, so I'm going to comment on what I recorded. We have a Fedora 38 system, it has a relatively recent Linux kernel, 6.2.14, and it's running with 16 CPUs; that's an EPYC 16-core, so a relatively serious machine, with SEV and SEV-SNP available on it. That's what I'm going to use for this demo today. In terms of what the machine looks like when you connect remotely, that really depends on the machine, but you typically have web access like this, and you can look, for instance, at BIOS settings when it starts; that's the good old-fashioned way of doing it.
I'm going to skip a little forward in the setup. You can access the BIOS the old way, hitting F12 or F2 or whatever; it depends on the machine. What matters here is that we are going to look at the processor settings and check that we have secure nested paging enabled; you can see that at the bottom. That's essentially the thing that really matters in your configuration there, and of course the way you do it depends on the machine you use. The rest is not especially interesting; you just have to be relatively up to date with the firmware, because these technologies are relatively new and older firmware may have various issues. OK, so nothing really special otherwise. We can boot the machine and look at the virtual console, and you get just your regular text-only-mode Linux; this also supports graphical mode. The more modern way of doing this is to go through the web interface, like this. As always, the exact place depends on where you are, but you have system settings, and you can see again that you have your secure nested paging somewhere in the middle there. I scrolled too fast, so it's a bit above: you see Secure Memory Encryption, and Secure Nested Paging at the bottom of the page. OK, I think that's about it for this part. We can also check that we have host-level SEV support by looking at the kernel message log, and that's essentially all we can do without additional software.

OK, so in order to be able to do some kind of demo, I'll have to use a little bit of extra software, and I'm trying to give a sort of shout-out to KCLI, which is a relatively convenient tool for doing various things. You enable a special Copr repository for it, and then you install what you need; I forgot one of the packages there, which is libvirt. Nothing really complicated, but you essentially need a hypervisor and a manager for it, so you're not very surprised so far, I guess. Now, with sevctl, we can check a little more about what our machine has inside: it has support for SEV, SEV-ES and SEV-SNP, and by now you all know what that means. We also need to record the numbers we get there, the number of reduced physical bits, the C-bit location and so on; we'll need them later, so let's just remember them for now. We can also check that we have a kernel with the AMD support enabled, and we can check that our libvirt also supports these features: if the domain capabilities show an SEV section, then your libvirt supports it too. This is essentially a pre-flight check to be sure that your machine has what you need. And again we get the reduced physical bits and the 51, and you can now notice that the number of reduced physical bits is not the same between the two. So you start scratching your head and say, OK, something here does not cook well; which one will I use? We'll see.
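(For reference, a sketch of that pre-flight check on the host. These are standard commands; the exact output and values such as cbitpos will differ on your machine:)

    # Host-level SEV support in the kernel log
    dmesg | grep -i -e sev -e "memory encryption"

    # Is SEV enabled in the kvm_amd module?
    cat /sys/module/kvm_amd/parameters/sev

    # Platform support details via sevctl
    sevctl ok

    # Does libvirt expose SEV? Look for <sev supported='yes'>,
    # plus the <cbitpos> and <reducedPhysBits> values.
    virsh domcapabilities | grep -A7 "<sev"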
OK, so KCLI. I assume most of you don't really know what KCLI is: it's a sort of quick user interface to create VMs and create clusters, and it has various backends; it can use libvirt, and here I'm going to use libvirt. So I create a default pool for it. I essentially reset the machine before doing all this, except that I forgot to reset the network, so let me just delete the network, reset, create, so that you can see how you get started setting up a dedicated network for your machines. Let's click a little, and now I have no VMs, and these are the images I've played with. One interesting thing with KCLI is that I can list all the available images, and you see there is quite a number; for me it's the best tool to install things like RHCOS or RHEL compared to the others, because it saves you a few steps when deploying these kinds of operating systems. So this is essentially the basics of what you can do with it.

In the first run we are going to create a regular VM and steal data from it. You use kcli create vm: the -i says I want the Fedora 38 image, I want 4 CPUs and 4 gigs of memory, and I'm going to pass a root password, which is one of the secrets I'm going to look for. So please remember that secret, we'll use it later. Now I connect to my console and check that the machine boots. So far it looks good; I'm happy with it. You can see there is some cloud-init magic happening behind my back. I log in, I type "schtroumpf", which is the French for Smurf, and KCLI has already set up SSH and the like for me. That means that instead of going through the serial console I can use a more serious tool and use SSH, because now we really care about secrecy and all that. So I SSH into my test VM, and you see that I'm the user fedora: KCLI has this idea that, depending on the image, it can create a user for you so you don't necessarily log in as root. But you can also log in as root, and we are going to do that to check that there is no SEV inside. No SEV support in this VM, the way I created it; just a regular thing.

Let's now inject a typical C program in there. That's high-quality C code; QE told me, you'd better run that in a VM, please, they told me there are a bug or two in there. I don't know, looks good to me; when I compile, there's barely a warning or two. Let me copy that to my VM; I have the SCP command as well, that's fine. Then I connect to it and run it, and of course it runs perfectly well, like any C program. It's not exactly what I expected, that's not the message I put in there, but close enough: ship it. OK, so now I have this program, and I'm going to do something slightly more complicated. This one I have to look up in the documentation every time I want to use it: I'm going to dump the memory of the guest. That's why there's a pause there; I'm thinking, OK, where is my documentation, where do I copy-paste this stuff, and also because I discovered at that point that reinstalling everything had caused me to lose my shell completions. So I'm going to use the QEMU monitor command and execute a dump-guest-memory. The arguments are very "user friendly": all the quotes and backslashes and curly braces you could want. I'm using a protocol which is file: something, to dump to a file. Why not just the file name? Well, that's how it is; I'm essentially dumping the memory of my guest into that file. You can also dump something like a core file, but if you do that, you're going to get some other stuff besides the guest memory, so it's better to do it this way. And you see this "disconnected from QEMU system"; that doesn't look really good, actually, and it's probably why my machine ended up being a bit screwed, but we'll see later. So now I can grep for the secret that was in my C program, and you see I have a match.
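(A sketch of that whole first experiment, reconstructed from the recording. The VM name, the secret strings and the kcli parameter names are illustrative, so check kcli create vm --help for the exact flags:)

    # Create a plain (non-confidential) VM; parameter names are from memory
    kcli create vm -i fedora38 -P numcpus=4 -P memory=4096 \
        -P rootpassword=schtroumpf testvm

    # A trivial guest program holding an in-memory secret
    cat > secret.c <<'EOF'
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        const char *secret = "really secret stuff";   /* what we will grep for */
        printf("running with a secret in memory\n");
        pause();                                      /* keep it resident */
        return (int)(secret != 0);
    }
    EOF
    gcc -o secret secret.c
    kcli scp secret testvm:
    kcli ssh testvm ./secret &

    # Dump the guest's memory from the host through the QEMU monitor
    virsh qemu-monitor-command testvm \
        '{"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/guest.mem"}}'

    # The secret is right there in clear text
    grep -a "really secret stuff" /tmp/guest.mem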
So let me check with Emacs what I see inside my guest memory. Emacs, which was evidently designed to open 4-gig files, so it takes a few seconds to open, and it does ask me, you know, 4 gigs is big, do you really want to do that? When it displays the content the way it should be displayed: tons of zeros, control characters, and now I search for "secret", and I find my "really secret stuff" right in there. My C program does not emit that message, but at least I see it in memory, so I know I loaded the right program. Now, what if I search for my schtroumpf password? Well, of course I find that as well, and I find it in a nice little snippet of script. So, hey, that's perfectly unsafe, and I'm happy, because that's sort of what I wanted to demonstrate. So: success, for some definition of success.

Now let's try to fix that, and you'll see that it's really, really super easy, don't worry. Let's first try to make the VM compatible with what you need for confidential computing. Unfortunately, KCLI still creates old-style VMs, so I have to change the hardware to say I want the more recent Q35 machine type. Let me do that: I replace the pc-i440fx machine type with Q35, and then libvirt complains, because now I need... OK, let me fix that too: I need to fix my devices. The good thing about libvirt is that it checks tons of stuff for you, and here it complains about the device addresses being wrong. So I just remove them all, and it recomputes a set of addresses that match; that usually works. The next thing I need to do is replace my SATA disk, and normally I should be in more or less good shape after that. What I'm doing here is only making my machine compatible with the kind of hardware I need for a confidential virtual machine. OK, now I forgot something: I mentioned the SCSI stuff and I did not do it, so let me fix that. That's because the devices you are going to pass into your VM need to be protected somewhere, so you need to have the IOMMU active everywhere, and to have the IOMMU active you essentially need to convert to virtio and so on. So that's what I'm doing here, converting to virtio-scsi, and everything should be peachy.
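(Roughly, the kind of edits this means in the domain XML. A sketch with abbreviated elements: the exact machine version string, disk paths and addresses will differ, so treat this as an outline of the changes rather than a copy-paste recipe:)

    virsh edit testvm
    # Before (old-style machine type):
    #   <type arch='x86_64' machine='pc-i440fx-7.2'>hvm</type>
    # After (Q35, needed for the modern hardware a confidential VM wants):
    #   <type arch='x86_64' machine='q35'>hvm</type>
    #
    # Replace the SATA disk with virtio, with the IOMMU flag on the driver:
    #   <disk type='file' device='disk'>
    #     <driver name='qemu' type='qcow2' iommu='on'/>
    #     <target dev='vda' bus='virtio'/>
    #   </disk>
    #
    # Drop the stale <address .../> elements so libvirt recomputes them.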
So let's start my VM and check that it's still alive. Come on, don't make typos... my AI is typing, that's why it sometimes makes typos. OK, the VM looks good and I can still connect. Of course I still don't have any kind of SEV, I have not activated that yet, but at least the VM has the right kind of hardware. So I need to do tons of VM edits like this, and it's well documented on the libvirt.org page; that's where you see, for instance, why I picked the Q35 machine type and things like that. But it also tells me I need to change the bootloader, and I need to make some adjustments to the memory. So we are going to do that. Let me skim through this a bit quickly: you need to boot with OVMF, so UEFI, and you need some adjustments to the memory and some adjustments to the network. It's all relatively well documented. Let's move relatively quickly through this, because it's not necessarily the most interesting part of the talk, but it tells you what I'm spending my days on; who doesn't love that? So I follow these instructions, I keep editing and editing, I try to start it again, and that's when I got my first libvirt hang. So: forget that, restart some daemons, and I can start my VM again, and now it boots in UEFI. That step looks good. I'm still not completely there, of course; I still have not enabled SEV, and you're starting to think: maybe this is taking a bit long, maybe this is not the best way.

Before I can actually activate SEV in the guest, I need to activate it on the host and endorse it. To do that, first let me give you a bit of terminology, because AMD is anything but light on acronyms. ARK stands for AMD Root Key. ASK stands not for "ask" but for AMD SEV Key. CEK is the Chip Endorsement Key. OCA is the Owner's Certificate Authority. PEK is the Platform Endorsement Key. PDH is the Platform Diffie-Hellman key. You need to know all of this, and there will be a quiz at the end; I'll take questions at the end. Oh, and there are two more, the TIK and the TEK, and you can read up on those. So let me do my host setup: let me first reset everything and verify the host, and you see that I get this nice chain. This morning I taught you about the root of trust and maintaining a chain of trust; you get a good example here, with this key signing this one and certifying that one, etc. If I run it twice, I get the same results for the PDH, PEK and OCA, and you remember what those mean. If I do a reset, though, I get different numbers; this time these have changed. And, transparently, sevctl is actually talking to something at AMD, which I learned the hard way when their server stopped responding to requests. Notice also that the AMD root key, fortunately, did not change when I reprovisioned the system. So that's what you do when you reprovision the system.

OK, that looks good; now I'm going to do the guest setup. The guest setup is primarily adding a launchSecurity section to certify the virtual machine as confidential, and you'll see it's really easy. It's all well explained in the documentation here; that's the XML you need to put in, so I need to find a place for it in my domain definition, and that's where I'm going to put it. This is where my cbitpos and friends matter: I need to put them there, and I need to decide whether to put 1 or 5 for the reduced physical bits; apparently both work, and there was some discussion about what exactly is supposed to go there. So I've got my launchSecurity section, but it's not sufficient, because I need actual keys in there. In order to get them, I need to create a security session. First I export my PDH, the platform Diffie-Hellman key, which everyone remembers. Then I check that it works with this setup, and then I generate the files that describe a given session for a VM. Now, this is tied to that specific VM. You would normally use the hostname there, because you can transport this data to some other place; I'm going to keep it simple, just use host.pdh and do everything locally, but you can do this remotely. The policy is helpfully explained in chapter 3 of some document somewhere on the AMD website, and also on a different page of the libvirt documentation, and you essentially have to OR bit masks together: modern, you know, 1980s-style user interface. So now I've created my session, and what this did is create a few files that all begin with the VM name: there is this GODH thing, there is a TIK and a TEK, and the session file, and all that stuff. That looks good; I can probably make some progress from there and put that stuff inside my XML file. I'm going to insert things where the dhCert element goes; "DH" should of course remind you of the PDH, it's the same thing. So you insert a copy of that file there, and then you insert a copy of the session file in the session section. I think that makes sense. I mean, it's simple, it's a great user interface: use Emacs, use vi, whatever, it's just text. There is some remarkably low entropy in this sequence of A's, which I find surprisingly low; I don't know exactly why, that's something I'd like to look into, why there are so many A's. OK, so now that's edited, and I'm going to do it the SEV old-style way: I'm not going to try remote attestation, because that's complicated, so we'll keep to the simple stuff here.
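(A sketch of what that looks like, combining the session setup and the resulting launchSecurity XML. The cbitpos and reducedPhysBits values come from the earlier pre-flight check, and the sevctl subcommand syntax here is from memory, so double-check sevctl help before relying on it:)

    # Export the platform Diffie-Hellman certificate
    sevctl export host.pdh

    # Generate the per-VM session files (policy 1 as a minimal SEV policy);
    # this produces files named after the VM: the GODH, TIK, TEK and session
    sevctl session --name vm host.pdh 1

    # Then, in virsh edit testvm, the launchSecurity section looks like:
    #   <launchSecurity type='sev'>
    #     <policy>0x0001</policy>
    #     <cbitpos>47</cbitpos>              <!-- from virsh domcapabilities -->
    #     <reducedPhysBits>1</reducedPhysBits>
    #     <dhCert>...contents of the GODH file...</dhCert>
    #     <session>...contents of the session file...</session>
    #   </launchSecurity>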
So essentially, what we do now is enter a brave new secure world. To launch a secure VM, the first step is to start the VM, but you start it in paused mode, because reasons. The reason is that you have to check, before you actually launch it, that it passes the test; that's called pre-attestation, remember, I explained that this morning. You check that it passes the test before you let it go. Then you run domlaunchsecinfo on the test VM, and that gives you a measurement of what you are trying to launch. There's a helpful tool where you just have to copy-paste the output of the previous command, but of course with different option names: everything that begins with "sev" in one does not begin with "sev" on the other command line. Other than that, it's really simple: you do that, and you try not to type too much. Oh, and you have to measure the firmware, so you have to tell it where the firmware is. With all that, it looks fairly plausible to me so far. Now I'm going to get my base64 measurement, and I get the matching answer; looks good to me. So now I'm allowed to resume my VM: I can say, hey, resume, console, and let's go. And what happens? It's booting again into UEFI. So what should happen here? I should see SEV, and I should be able to run my application without its content being visible; that's what I expect. So I do that, I connect, I type "schtroumpf", which is no longer so much of a secret, and I have SEV: success. I'm really happy. "Memory Encryption Features active."
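(The launch sequence, roughly. virsh start --paused, domlaunchsecinfo and resume are real libvirt commands; the measurement-verification step is only sketched in a comment, since the exact option names of the verification tool are the part the talk says you have to look up every time:)

    # Start the VM paused, so nothing runs before pre-attestation
    virsh start testvm --paused

    # Ask libvirt for the SEV launch measurement of the paused guest
    virsh domlaunchsecinfo testvm

    # Recompute the expected measurement locally (base64) from the TIK,
    # the firmware image and the launch parameters, and compare the two;
    # if they match, the guest is running what you think it is.

    # Only then let it run
    virsh resume testvm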
Let me SSH into it from another window and run my little test program. I copy the latest version, where I upgraded nothing; I don't know why I needed to copy it, I think because in the meantime I had recreated the VM from scratch. I run it, and it works just as well as before. And now I'm going to do my memory dump like before, but now it's in my shell history, so it's easier, and I grep the encrypted version. What do you expect? The same as before? Let me look with Emacs... something is wrong there; I had to dig to see why it looks bogus. The reason is that we see the word "secret" as part of some file-system stuff, which already has me scratching my head, but my actual secret string is not there. So something is working: I have protected something, somewhere. But if I search for "schtroumpf"... OK, who can tell me what I did wrong? I scratched my head; I thought maybe it was something on the console that was going wrong. I'd say, if nobody in this room can answer, it means I'm not the only one being stupid; you're all as stupid as I am, because I had to scratch my head for a while to figure out where I went wrong, what I did incorrectly. No, it's much simpler than that. I double-check: SEV is there? No, we have a problem. If you have not seen the "wat" talk: go search Google for the "wat" talk right now. OK, so it's time to tell the boss: dear boss, my demo for next week doesn't work, which is why I completely changed it compared to what I was supposed to talk about. Dear boss, I still see copies of my secrets in there and I'm not sure why. That was my secret from the first experiment, while SEV is active. Isn't that a great answer for my boss? Isn't that the punchline I was supposed to produce?

OK, so the reason for this is that only memory is encrypted. What did I do wrong? If I grep for the string from my program, it's not there; that stuff is not visible, that part we succeeded with. However, when you have encrypted memory like this, you cannot send it to disk directly, and you cannot write to the network directly, so all the I/O buffers used by the kernel have to have the famous C-bit clear, saying "this is not encrypted". Which means that you'd better encrypt your disk. And that's the thing I forgot about KCLI: by default it uses, you know, good old cloud-init, and there is no disk encryption set up by default. So we need tools, and we need to build tools, to hide that complexity and reduce the risk of misconfiguration; I'll sketch the manual fix at the end of this part.

So first of all, how did I get there? You can blame Wayner over there; I'm pointing fingers, because I'm also doing a workshop in two days with Wayner. I know it's Q&A time, but it's the last session of the day, so I have a little extra time; I have too many talks going. He did a remarkable series of scripts for this lab, and he was using KCLI all over the place, and I said: that makes things really shorter compared to what I was planning to present, so let me use KCLI. And I switched to KCLI and did not realize the disk wasn't encrypted until the very last step. So: step one, prepare a list of commands for a live demo. Step two, prepare another workshop with Wayner. Step three, notice that this workshop uses KCLI. Step four, think "ha ha, great idea, I'll use that too". Step five, record a movie with all the planned steps, but where you carefully replace all the steps where you were creating encrypted VMs with a fun installation with KCLI, totally forgetting that KCLI does not encrypt disks. Success! That's how you fail spectacularly in public.

Now, there are better ways, and I'm going to show them; that should be for another talk. You can set up confidential VMs in the cloud; that was at 10:30 this morning, with Vitaly. You can use confidential containers; that's the talk we have tomorrow. And you can deploy complete clusters, which I'm going to show on the next slide, with a tool like Constellation for instance, which carefully sets up the VMs the right way for you. It just would not fit in the 30 minutes that are still allotted to me after the end of the talk. So essentially you wait, and you wait, and you wait, and after something like 40 or 50 minutes you get a cluster that is really set up the right way. This is the recommended way to go if you really want to spend your money: if you have too much cash on hand and you are ready to pay for all your VMs, your whole control plane and everything to be encrypted, go with it. It's very easy, and it behaves like your regular cluster. I'm getting a little tired. Then you can connect to a node and check that it's encrypted properly, that the disk is encrypted, etc. And of course we are working on this for Fedora CoreOS: what you see here is the creation of a Fedora CoreOS disk with Ignition, with a secret injected, and the disk gets encrypted on the fly as the Ignition step happens. I have to thank Timothy Ravier for helping me do that this morning; this is relatively new to the talk.
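(The manual fix mentioned above, as a sketch: encrypting a guest data disk from inside the VM with plain LUKS. The device name /dev/vdb is hypothetical, and a production setup would tie the passphrase to attestation rather than typing it interactively:)

    # Inside the guest: format an extra data disk as LUKS
    cryptsetup luksFormat /dev/vdb

    # Open it and put a file system on the cleartext mapping
    cryptsetup open /dev/vdb securedata
    mkfs.xfs /dev/mapper/securedata
    mount /dev/mapper/securedata /mnt

    # Anything written to /mnt is now encrypted before it leaves
    # guest memory, so the host-side image only sees ciphertext.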
Now it's time for questions, if there are any. There is no time for questions, but... So the question is about the cost of memory encryption, mostly. In terms of power usage it's hard to tell, because it's not exactly the same generation of processors: if you make processors with more transistors, but they are smaller, then maybe they don't consume that much. I don't think that using encryption adds much in terms of energy consumption, and it does not add latency in the measurements: in terms of latency, compared to non-encrypted, it's the same speed on the same CPU. However, what does add cost is that you cannot do any kind of deduplication. You cannot deduplicate your disks, because they are encrypted. You cannot deduplicate your container images, because now you want to download them from inside your virtual machines, so you are downloading the same thing over and over again. And you cannot do any kind of KSM or anything like that with encrypted memory anyway. Don't ask me... so your question was: when do we get cheap hardware? To be precise, I think it does consume a little bit of power, but when you see chips that consume 100 or 130 watts of TDP, I'm sure there's half a watt in there for encryption, and I don't think it's more than that. The reason I think it's cheap is that my phone does it on a battery for a whole day: it keeps encrypting stuff all the time when it's communicating, and they accelerated the encryption in hardware, along with the networking and so on, precisely because the hardware could do it relatively cheaply. But to be fair, you've got the wrong Dinshin: my brother is the hardware guy in the family. I have a brother who does FPGAs and stuff like that; he could tell you everything about how much it costs to encrypt stuff. I can't, sorry. And back to your question, when does this become mainstream? As usual, it will probably take a decade. The feedback we've got so far is that a few markets are already interested today, because they have to deal with regulations constraining their responsibility around data leaks: banks, medical, things like that. But also, because this becomes available, the manufacturers are of course doing all they can to make it mandatory, so it may become a law that you have to encrypt your data that way, and in some cases maybe you won't have a choice. Time will tell; we'll see how this works.

No, when I say it's weak, I mean it's typically designed to run in real time. It's still modern algorithms, and the algorithms used depend on the chip; frankly, off the top of my head I don't recall which ones they use, but I think they had AES-128 somewhere, at least for the first-generation chips, and now they are doing more complicated stuff. So it's not something you can break with your computer without paying a lot of energy. And by the way, I suspect one reason that triggered these efforts: I don't know if you recall, almost ten years ago there was this big push about memristors and persistent memory, non-volatile RAM and that kind of thing, and everybody at the time thought we'd have really cheap non-volatile memory; one decade later, that's now. So the problem, for the chip vendors, became: I need to make sure that if someone grabs one of these non-volatile chips, they don't get all the data that was in there. That was probably a primary motivation to get started on this memory encryption initially, and in that space it had to be relatively robust. Anyway, I apologize for the use of the word "weak": I meant it's not as strong as you could make it if you were really pushing the limits of your crypto, but it's pretty good. I'm out of time, I knew it, I knew it. I'm sorry.