[Pre-talk setup: microphone check and room logistics, partly in Czech and largely garbled in the recording.]

[Garbled passage, partly in Czech; the speaker mentions working on ManageIQ and that the session is being recorded.] ...it's the TripleO undercloud, right, or RHEL OSP director, which is the product. OK, so let's start with the OpenStack cloud. The main categories I will talk about are inventory collection, capacity and utilization, smart state analysis (also known as fleecing), control and policies, drift state, and some examples from Automate.

So let's just quickly see inventory collection. The provider listing looks like this. You have some basic properties in here, you have some relations which you want to see in your cloud. That is very similar to Horizon. Let's look at it closely: there is the hostname, what port it is, when it was last refreshed, some relations. You probably know this from Horizon, the basic ones you would expect. So now, the next most important thing are the instances. [The following passage on the instance detail view is garbled in the recording.]
[Garbled passage, partly in Czech, apparently covering capacity and utilization collection.]

Smart state analysis — in a short and simple way I will explain what it is. We don't have access to the VMs; we don't own the private keys. So what we can do to actually check what's in the VM: we can snapshot it, we can mount it on a local disk, we perform smart state analysis of the files, and we can show what we found — users, groups, packages, init processes, files that you configured. Just remember, you need enough disk space on your appliance or on the workers that are doing the smart state analysis. And you don't need smart state analysis with some providers. For example VMware, which is very old and has a mature API, can provide you most of the detailed info through the API, and if it doesn't, we can get it by other means.

Then drift state — what is that? Once you have collected all the information about your provider, about your OpenStack, about your VM, what you can do is actually compare historical samples of the data, right? So I had a list of packages with these versions now, and when I collect them again, I can compare what changed, for example.

Now, control policies — you've heard some of that in the last presentation. You can leverage, again, all the attributes, all the entities, for example for writing alerts, right? When something is happening — like your load is too big, or someone is doing smart state analysis of a VM where that should be forbidden — you can raise an alert, you can send email, SNMP traps, or even internal events; we will show examples of that later. Then compliance: simple pass/fail checks, which can be included in reporting and, again, sent periodically, for example. So the basic stuff like: are you suffering from Heartbleed or Shellshock, right? Which is just simple comparing of package versions, for example (see the sketch below). And you can use the same condition on all your VMs, all your hosts, anything you have in your system. And the last is policy enforcement, where you can actually take the condition and use it to prevent the user from some actions, right? So you will not allow cloning of some VMs because they are in some region and you can't just put them in another region, for example, because of some legal issues.
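To make the pass/fail idea concrete, here is a rough Ruby sketch of the kind of version comparison such a compliance check boils down to. In ManageIQ itself you would express this as a condition in the UI over the collected package data, so the names below (version_lt?, compliant?, FIXED_BASH) are purely illustrative:

```ruby
# Purely illustrative: a check like "patched against Shellshock?" reduces
# to comparing the installed package version against the first fixed one.
FIXED_BASH = '4.3.30' # hypothetical first fixed version

# Compare dotted version strings numerically, segment by segment.
def version_lt?(a, b)
  (a.split('.').map(&:to_i) <=> b.split('.').map(&:to_i)) == -1
end

def compliant?(packages)
  bash = packages.find { |pkg| pkg[:name] == 'bash' }
  return false if bash.nil? # fail when there is no data to prove compliance
  !version_lt?(bash[:version], FIXED_BASH)
end

puts compliant?([{ name: 'bash', version: '4.2.45' }])  # => false
puts compliant?([{ name: 'bash', version: '4.3.30' }])  # => true
```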
Now, Automate. We will show the most used Automate example, which is deploying a VM. This starts in the UI — I will not show screenshots of that, but you probably know Amazon or Horizon; it's all very similar. You just pick the flavor, the network, the number of VMs if you are deploying a collection, right? And after you have picked all the attributes for your VMs, ManageIQ generates a request, right? That has to be approved. And when it's approved, it invokes a provisioning workflow, which is a state machine. And after the workflow is finished, you can actually tie it to another workflow, or the last step of the workflow can be another state machine, which is nested, for example. So you can deploy a collection of VMs and tie it to some action in Ansible Tower, which can be another workflow, and another workflow after that, maybe.

So let's look at the request approval workflow in Automate. You probably saw Automate in the last presentation. This is the default provision request approval; you can find it in here. It's a basic state machine which has two states, validate and approve, and some input attributes. These, for example, are for auto-approval: you say a request for a maximum of 10 VMs will be auto-approved, and you can also put there the maximum memory of the VMs. Then it goes to the first state; on entry it invokes the validate request method, which just checks all the attributes. Then it decides that either it's all OK — on exit there is no method by default, so it goes to the next state (by default the state machine is sequential) and does the approve request action — or it says that there has been an error, and it does the on-error action and ends. The pending request action will then mark the request as pending, and a personal approver has to come and manually approve the request, checking if they have budget or something like that. So this is just a glimpse at the code of the validate method (a sketch follows below). Simply setting the result of the state to error will cause it to go into the on-error branch and invoke another method. So you can control your state machine by these means.
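As promised, a minimal sketch of what such a validate method can look like, assuming a max_vms input attribute on the state and the usual $evm service model; the attribute and option names are approximations, not the exact default-domain code:

```ruby
# Hedged sketch of a provision-request validation method. The 'max_vms'
# state attribute and the option names are illustrative approximations.
max_vms = $evm.object['max_vms'].to_i

request  = $evm.root['miq_provision_request']
vm_count = request.get_option(:number_of_vms).to_i

if vm_count > max_vms
  # Setting the state result to 'error' sends the state machine down the
  # on-error branch, which here invokes the pending-request method.
  $evm.root['ae_result'] = 'error'
  $evm.root['ae_reason'] = "#{vm_count} VMs requested, auto-approval limit is #{max_vms}"
else
  $evm.root['ae_result'] = 'ok' # default: continue to the approve state
end
```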
Now the more complex example is the provision workflow. You can see that you have two basic state machines here: provision from a template, which is an image — and you could do the same around heat; you would not build it around VM and nova calls, but around heat calls. So this is the default state machine for VM provisioning. You can see there are lots and lots of actions that are not used by default, but can be by different providers. For OpenStack we will look at the most important ones: the step that picks the placement, then the actual provisioning, and the post-provision step. And at the finish, you can see the second-to-last step is that you can email the owner that the provisioning has been done, and on finished you can tie it to another state machine that would do the connection to Ansible Tower, for example, so you could actually put something on those VMs.

So let's look at the placement. You already saw that in the last presentation. The behavior is different per provider, so for OpenStack we define it here. And the best placement method here just checks the get_option — this is basically the UI form: you either defined a network where to deploy, or not. If you did not define it, it will just take the first eligible cloud network that is there, and it will log that it picked the network. A very simple default action. For the provision method, you can see here it just executes the internal state machine. So we cannot touch those state machines, but we can drive them by the attributes, the parameters we are sending there. Then we have the post-provision step, which you can find here, and by default it's empty — it's just a placeholder where you can put code, and it will run immediately after the VM has been provisioned.

So let's now look at the ways you can override the default behavior, when a customer comes to you and says: we have a development environment, and we want all the VMs there to always pick the most utilized network in a group of networks, right? Or the least utilized, so it will spread the load. OK, so the admin will come, they will copy the best fit OpenStack placement method to their own domain, and they will override the behavior. The basics are: you just get the provisioning request, there is some check that the image is there. Now, this is important: you can get the tags. When you provision, you can tag your VM — it will be the development environment — and the same way you can tag all the things in ManageIQ. And then you just have a simple case statement with the special behavior: when the environment is development, I will do something special. So, for example, here I will pick all the private networks — the ones that are not external networks in OpenStack — and I will select all that have a public networks count bigger than zero. That means those private networks are actually connected through a router to an external network, so you can get floating IPs from them. In reality you would probably also filter by the tags of the network, right? For development environment VMs you would have development environment networks, and other things. Then you just sort by the IP-addresses-left count. This is a live method, so it actually asks the API at that moment, so you have a fresh count of the IP addresses. And you filter only the free networks, right? Some of them may have been already filled. And you just pick the first, which will be the most utilized. And even if all the networks were full, you can still pick the first one, do some nice logging of what the state of the networks was when you were taking that decision, and still try to deploy — because this is all running in parallel; you can be deploying tens of thousands of VMs in parallel, so conflicts will happen, right? And this is the step where you just set the option for the provisioning, right? You set the option, and the network will be the one network — or if you have multiple interfaces, you can set multiple networks, right? You would put the most utilized network on the port that will be externally facing, and for another port you would pick, I don't know, the least utilized network for your isolated networks, where you have databases or something like that. A sketch of such an override follows below. So we are done with the placement.
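Here is a hedged sketch of such an overridden placement method. The association names (cloud_networks, public_networks, ip_address_left_count) follow what the talk describes, but treat the exact service-model API, the tag category, and the option format as assumptions:

```ruby
# Hedged sketch of an overridden OpenStack placement method; names are
# approximations of the ManageIQ service model, not verified API.
prov     = $evm.root['miq_provision']
template = prov.vm_template
raise 'Image not found' if template.nil?

if prov.get_tags[:environment] == 'development' # assumed tag category
  ems = template.ext_management_system

  # Private (non-external) networks routed to at least one external
  # network, so VMs placed there can still get floating IPs.
  networks = ems.cloud_networks
                .reject(&:external_facing?)
                .select { |cn| cn.public_networks.count > 0 }

  # ip_address_left_count is a live call to the OpenStack API, so query it
  # once per network; ascending sort puts the most utilized network first.
  ranked = networks.map { |cn| [cn, cn.ip_address_left_count] }
                   .sort_by { |_, left| left }
  $evm.log(:info, "Networks at placement time: #{ranked.map { |cn, left| "#{cn.name}=#{left}" }.join(', ')}")

  # Prefer the most utilized network that still has free IPs; if all are
  # full, try the first anyway -- counts can be stale under parallel deploys.
  chosen, _count = ranked.find { |_, left| left > 0 } || ranked.first
  prov.set_option(:cloud_network, [chosen.id, chosen.name]) if chosen
end
```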
But what happens when you deploy into a full network, right? You have plenty of networks, and as I said, it all runs in parallel. When that happens, OpenStack actually just fails, right? The DHCP will fail to assign the IP address, the VM will end up in the error state, and there's nothing you can do with that. OK, so let's try to deal with that behavior. We'll override the OpenStack post-provision action. In post-provision you just get the VM object that was just provisioned, you get its power state, and you check whether it's error. In reality you would probably do more checks, right — what was the actual reason of the error; you want to catch only the case where the private network is full. What we are doing at this point is creating a cycle in the state machine, right? We want to go back. We want to try another network and see if that will succeed. So we will create an internal counter, using these three methods: state_var_exist, set_state_var and get_state_var — just a simple counter that we increase every time we get to this place. We will have some max retries; we can put it in the state attributes, for example — it doesn't have to be in the code. And we just check: if the counter is bigger than three, we raise an exception, which can be caught, and we quit the state machine, saying all the networks are currently full. We can send an email to somebody to fix that, or even wait and try again the next day, for example. Then, if we want to try again, we just log a warning that we placed into a full network, and there is some logic around it: we have to actually destroy the failed VM — in reality you would want to create another state in the state machine for this, and again wait for the VM to get deleted, and you would retry that whole step. When that finishes, you can jump back to placement and try to pick another network. You can do it by simply changing the next state: the next state will be placement, and you are saying you want to restart it. So it will go back to placement and run all that logic again that picks the most utilized network; it should ignore the full one now and pick another. A sketch of this retry cycle follows below.

Now, a different branch will be run in post-provision when it's not the error state, but the VM is active and ready. For example, you want to auto-associate a floating IP under some conditions. That's not very easy to do right now in OpenStack; you have to script it. And this can, again, react to a special service: you could actually add a checkbox in your UI that says "do this", which gives you the same behavior as Amazon or Azure, for example. So for the public networks, we first test whether the VM is actually connected to public networks, which is this condition. We have some special behavior if you have more than one public network — it can happen that you have multiple interfaces, each of them connected to some external network, which should not happen, but you can build it like that — and we just pick the first public network. Again, you could have some logic here, like checking for the most utilized public network if you have more; usually you will have one external network per tenant, if you want to have that separated. And again, some optional action for when the public network is actually full, so you don't have any more floating IP addresses. Then we just run the associate floating IP step. In this case we are not specifying any existing floating IP, so we would need admin rights, because it will be generating a new floating IP — which, from what I found, is the only race-condition-safe action; otherwise you can steal a floating IP. So if you were deploying thousands of VMs for one application, they would be stealing each other's floating IPs if you just did the associate action. You best deal with it by using the create action, or you would have to do several cycles to check that nobody stole the floating IP from any other VM. And some other actions: when you delete VMs, the floating IPs will stay there, so we can try to reuse the free ones, and try to create another one if that fails because the public network is already full.
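And a minimal sketch of that retry cycle in the overridden post-provision method — the state-var key, the retry limit, and the deletion call are illustrative; in a real implementation the VM deletion would be its own state with its own retries:

```ruby
# Hedged sketch of an overridden OpenStack post-provision method that
# cycles back to Placement when the VM landed on a full network.
MAX_RETRIES = 3 # could just as well live in a state attribute

prov = $evm.root['miq_provision']
vm   = prov.vm
raise 'VM not found' if vm.nil?

if vm.power_state == 'error'
  # A simple counter persisted across state-machine steps.
  $evm.set_state_var(:placement_retries, 0) unless $evm.state_var_exist?(:placement_retries)
  retries = $evm.get_state_var(:placement_retries) + 1
  $evm.set_state_var(:placement_retries, retries)

  raise 'All candidate networks are currently full' if retries > MAX_RETRIES

  $evm.log(:warn, "VM landed in a full network, retry #{retries}/#{MAX_RETRIES}")
  vm.remove_from_disk # in reality: a separate state that waits for deletion

  # Jump back to the Placement state and restart from there.
  $evm.root['ae_result']     = 'restart'
  $evm.root['ae_next_state'] = 'Placement'
end
```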
OK, so that was a hard example from the provisioning. Now we can jump from the cloud to the OpenStack infrastructure and see what we can do with the RHEL OSP director. Again, there will be some inventory collection, some smart state analysis, drift state — we can compare nodes — and we can do auto-scaling using Automate, which is a nice example of Automate usage.

OK, let's go to inventory collection. So again, a provider — this is your undercloud — with some basic statistics. Let's look at some aggregate info. It's not that useful, because it's aggregating controllers with storage and compute. This is a relation to deployment roles, right? In this example we have controller, Ceph storage and compute; all the nodes we have there; then all the tenants that are deployed in the overcloud, and the availability zones from the overcloud. So again, this is a connection of undercloud and overcloud.

Let's look at the deployment role inventory. This is the heat resource group, right? We have three of them; let's look at the compute one. We have some basic relationships there, we have aggregates, and some OpenStack status. So let's look at it closely. Relationships, OK: we know there are two hosts in the resource group, we know all the VMs deployed on those nodes, and there is also some drift for the cluster, for the deployment role. Then totals for the nodes: we can actually see what's the total memory of our compute hosts, what's the total CPUs we can use, the aggregate disk capacity, and what is actually being used by all the VMs there, right? So we still have, for example, free memory and CPU — the CPU is actually overcommitted here, and you can see that in the ratio.

All right, and then the OpenStack status. In this example we are actually checking what services are running on each of the hosts, and it's aggregated on the cluster. So now you see the host view: we have two hosts here, and both have all of our services running — nova services, ceilometer and some support services, right? No failures. This is actually obtained from the openstack-status utility, so OpenStack should be healthy on those nodes.

Now let's look at the nodes inventory. We can see a list of all nodes. The green check button says that smart state analysis has been successfully run on them; we haven't done any smart state analysis on the rest, and they are running. So let's look at one of the nodes. Again, some basic info and info from smart state analysis. Let's look closely at what we can collect: the memory of the node, CPU information, device information (I will show that next) and the IP address. For devices — this is actually a virtualized node, it's not real hardware — we have the processor, the CPU type is some virtualized processor, the CPU speed, memory, and what disks it has. We can't yet collect provisioned space; that should come next.

OK, then relationships of the host. We can see under what provider it is, under what deployment role — or resource group; it's even referred to as a cluster in ManageIQ — and we can see what availability zone in the overcloud the node is in, and what cloud tenants are running there. That can be handy when you have alerting like "the node is not healthy" and you need to notify all the tenants on that host to move their stuff out. And then the VMs running on the host, and the drift history. OK, so let's look at the attributes from smart state analysis. We can collect users, groups and patches; firewall rules are in progress and will probably be done next cycle. And then the packages, services and files that you have configured to be collected. And then the OpenStack status.
So: a list of all OpenStack services running, and whether we are collecting the configurations for those services. Now, smart state in detail. The first step for smart state should be to specify what files you will actually collect, right? Back in configuration, in the analysis profiles, you specify — either by name or by some wildcard — what files to collect, depending on the use you have; whether you are doing smart state analysis of a host or of an image, it will collect those files. And you can specify whether to collect the content here, because some files may be big, so it can be enough to collect, for example, just the MD5 sum to check whether the file has changed or not (a small illustration follows below).
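As a tiny illustration of the checksum-only idea (plain Ruby, not ManageIQ code): recording just a digest of a watched file is enough to detect a change on the next collection, without ever storing the file's content.

```ruby
require 'digest'

# Store only the checksum; on the next smart state run, a different value
# means the file changed, even though its content was never collected.
checksum = Digest::MD5.file('/etc/nova/nova.conf').hexdigest
puts checksum # compare against the previously collected value
```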
Now we will show an example of what happens if, for example, the openstack-nova-compute service goes down for some reason on some node, right? In the cluster view — the deployment role view — you will see that you have a failure there. So let's zoom in on it. OK, now it's saying that one node has its nova services running and one has some service failed. When we click on that, we get to the list of hosts with failed nova services. OK, we can click on the host, and in the host detail we can see that the one nova service is failing here, and we are collecting one configuration for it. So let's look at the configuration — it is nova.conf — and let's look at the failing service: it is openstack-nova-compute, it's not running, and the systemd state is failed.

OK, so how can we observe what was happening in the system? The first option is drift. The drift history shows you all the samples — however many times you have run smart state analysis, all the samples are recorded; you can have some cleanup actions for those in a cron job, then. So we have collected a few of them, and you can select multiple of them to compare how they differ, right? I just select two: the one where I know it should have been working at the time, and the one I took right after the failure was seen. OK, and the first thing you should do here is select what to compare, right? The comparison is in real time, or you can build reports from it. Now I want to compare the guest applications — that's actually the installed packages on the host — the running system services, the files, and the file system custom attributes, where we parse the files to show what is actually in them. So in the detail we have different system services: before, openstack-nova-compute was running; now it's not. OK, so we know that — and we have found a state where it was running, so that's good. Now the files: the nova compute config had a different MD5 and a different size before. So we can see that the nova config has changed, and it should not change, right? Unless you are doing some updates or something like that. And some logs have changed, so it's good to know that something was logged, maybe in some special logs. And then the actually parsed attributes. So now we have a changed attribute in the nova compute DEFAULT section. OK, it's there, it's called rpc_backend, and the last value was rabbit, right? You want that as a backend, but somebody changed it to bunny. So that's probably the reason for the failure.

OK, so the next option for figuring out what was happening is to compare nodes, right? We go to the list of nodes and we compare a healthy to a non-healthy node. Just check them and do a compare of the selected items. You can do list actions here, like performance or smart state analysis, on everything. So let's compare them, and you get the same kind of list as with the drift state, right? So again, select what you want to compare. Here you see the percentage of how similar they are — this is 100%, this is 75%, a 35% difference — and again we can look at the details of the differences. Host properties: you would expect that — a different name, a different number of VMs on the host, so that's expected. Host services differ, OK? Some unique DHCP interfaces that are missing, or present but not running, on the other node. But again, we have openstack-nova-compute, right? The healthy node has it running and the non-healthy node has not. Then we can compare the files. For example /etc/hosts: the modification time is different, but the content of the file is the same. Then we have the nova config here again, and we can see that those are actually different; the modification time is also different, but it's like a day's difference, so something could have been happening there. And you can compare attributes of the hosts. The first is what we saw before, right? The nova configuration on the healthy node has the rpc_backend as rabbit, and the non-healthy one has bunny. So you can actually compare — in this example the nodes should have almost the same configuration, right? They are in a compute cluster; they should all be handled the same. We are also doing parsing with interpolation, so, for example, you can check that an IP setting is textually the same variable, but when you interpolate the variable, the actual IP addresses are different.

So that was comparing with the drift state, and the last example is leveraging Automate for auto-scaling of the hosts. The first thing you want to do is define an alert, which will be the driving thing. My alert will be on the compute cluster — the resource group with all the computes of some type: when the average usage of memory is bigger than 50%, I will do something. So I scroll down, and the something is: I want to send a management event, my custom event, so I will name it, right? This will tell me, OK, I am running out of memory — in practice I would put there something like 75%, but let's keep it at 50. Then, once you have the alert, you create a thing called an alert profile, which is like a collection of alerts you want to connect together. So I pick my alert — it will be the director auto-scale — and then you assign the profile to the thing you want to observe, right? It's like the observer pattern, maybe. And we want to observe only the compute cluster, right? This example will scale compute, it's special for compute, so we observe only the compute cluster for the alerts. OK, so let's do that.

And once we have that, the alert will send events into the ManageIQ event queue, right? But now we need the Automate end that will actually catch the events and do something with them. The first thing is to go and copy the system handling for custom events, and create a special class and a special instance named exactly the same as the event we created, which is the director auto-scale processing. The only thing this does — it's a proxy — is follow a relation; the relation doesn't have any conditions or anything, so it just runs this Automate action. So let's look at the Automate action: you have to create a class, an instance and a method for the auto-scaling.

This is very simple auto-scaling; in reality you would probably create another state machine that would do more complex stuff. You could tie it to different events — like, when a new node appears in Ironic, you catch the event, you start a state machine that does some burn-in, so it runs some benchmarks on it; if it's good, it compares the profiles, puts it into the correct cluster, auto-scales it, and then it can run some validations. So you can create your custom state machines that are actually useful for the customer — it will actually be the customer asking you to do this, because they have their own special processes. So the director auto-scale processing action is an instance that has only one method, and it executes this. And the body of the method is very simple (a hedged sketch follows below): you just get the provider — this is the undercloud — you get the current host count, you get the right tags, and you select just the machines that are powered off. This is a very simple example; otherwise you would need to, for example, verify that it's the right profile. You would do auto-tagging, so that the hardware profile of the nodes matches the tag of the cluster — both tags, so you know what it is — and then you would pick from that group what you can auto-scale with, because it has to match the compute cluster, which is tied to only one hardware profile. But in this example the hardware was the same for everything. And the simple auto-scaling is, again, just the host count plus one; you could do more complex logic and decisions about how many to add. And then we just basically call heat to do the auto-scaling for us, so we don't have to do any complex logic here, and we can, again, tie some actions after that.
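A hedged sketch of that method body — how the provider arrives in $evm.root, the 'overcloud' stack name, the ComputeCount parameter, and the stack-update call are all assumptions to illustrate the flow, not verified ManageIQ API:

```ruby
# Hedged sketch of the auto-scale method; names are assumptions.
provider = $evm.root['ext_management_system'] # the undercloud provider
cluster  = $evm.root['ems_cluster']           # the compute deployment role

hosts_count = cluster.hosts.count

# Spare capacity: nodes registered in Ironic but still powered off. A real
# version would also match the hardware-profile tags against the cluster.
spare = provider.hosts.select { |h| h.power_state == 'off' }
raise 'No spare nodes available for scale-out' if spare.empty?

new_count = hosts_count + 1 # trivial policy: scale out by one node
$evm.log(:info, "Auto-scaling compute from #{hosts_count} to #{new_count}")

# Let heat do the heavy lifting: update the overcloud stack so the compute
# role count grows. Method and parameter names here are assumptions.
stack = provider.orchestration_stacks.detect { |s| s.name == 'overcloud' }
stack.raw_update_stack(nil, 'parameters' => { 'ComputeCount' => new_count })
```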
And that's not all — OK, we are running out of time. Once you set it all up, how do you observe what was happening in the system? This is all quite complex logic. So the first thing is that you can see all the events that are happening in the timelines. In this case I've picked alarm status changes and errors, which is the red one, and power activity, which is the green one. Let's look in detail at what it is. I had several alerts before I enabled the domain that was actually doing the auto-scaling. And it's our alert, right? Memory used is bigger than 50%; it's tied to the compute cluster, and it happened around 7 o'clock, 7 p.m. And after that — I'm not showing the heat events here, but we could show them too — this is the nova event, compute.instance.create.end. So this is the actual time when a new host was added by nova, right? And we can click through to the undercloud it was happening on, and we can even click on the host it added, which is already in the ManageIQ system. And again, we can check that the time is 7 p.m.; it's nicely aligned: first you had an alert, then something happened, like the addition of the node. And the same can be observed in the charts. Around 7 p.m. you can see that the maximum available memory rose from 10 gigs to 20 gigs — that's the node you added. There was some disk I/O peak, and the number of hosts in the compute cluster rose from one to two. And you can see that also in here, right? These are the statistics for the number of VMs and the number of hosts; there is a slight rise. OK, so that's all. We have like two minutes for questions. If you have more questions for me, this is my GitHub, you can find me on IRC and on Gitter, and this is the ManageIQ main repository.
So you can find all the other information there if you want to ask other people. And yes, we are hiring. So that's all. Do you have questions? I understand we have some nice cards for the questions. So, yes?

[Question:] Does the auto-scale also go down, or does it only go up?

We are preparing the scale-down. Yes, the scale-down. Right now we do only up, because that's a very easy action, right? The scale-down we will probably have prepared for the next release, just for compute, where we already have all the cleanup actions — it's not that simple. You have to migrate all the VMs out; at the same time you have to disable the node in nova and in Ironic, so it actually doesn't send more stuff there, and then clean it up using heat. So there are several actions you have to do first, and some of them may fail, right? (A purely hypothetical outline of that sequence follows below.)
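For illustration only, the sequence from that answer as a sketch — none of these method names are real ManageIQ or OpenStack APIs; they just name the steps:

```ruby
# Purely hypothetical outline of the scale-down sequence described above.
def scale_down(cluster, node)
  node.vms.each(&:migrate_away) # 1. migrate all VMs off the node
  node.disable_nova_compute!    # 2. stop nova from scheduling onto it
  node.ironic_node.maintenance! # 3. disable it in Ironic as well
  cluster.heat_scale_down(node) # 4. finally remove it via a heat update
  # Each step can fail, so a retrying state machine is the natural fit.
end
```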