Hello, everyone. My name is Tony. I'm an OpenStack technology evangelist, and I work for Mirantis IT as an L2 support engineer. Today I'm going to share with you some thoughts on how to be proactive and how to support your customers. So let's start. Today we're going to talk about some architectural issues created by humans. We'll also discuss some common workflows: how to do support and which operations technologies we have. And we'll take a general look at how to avoid a cloud breakdown, and other things like that.

Let's start from the topic of support and customer care, and we'll start with a stupid issue related to MySQL. The issue appears when you create a small disk partition for your MySQL, for example 20 gigabytes, and unfortunately that is not enough for OpenStack. What you will see in a moment is that your MySQL stops working, and your OpenStack will probably go down. The issue is very simple, but you have to avoid it anyway. If you want MySQL as the backend for your OpenStack services, you have to create a huge partition, a huge disk.

The next issue is actually a common one, because the customers we take care of usually hit it. It's about RabbitMQ and a storage backend; for example, I use Ceph on this slide. It's also about the network. Many of you know the default network topology: you create a management network, a storage network, and a PXE network for your nodes. But sometimes customers try to put all of them inside one VLAN. That is also a stupid architectural issue. You can avoid it in one click: just provide an additional VLAN and everything should be okay. If not, your storage backend will kill your RabbitMQ. As I understand, everybody knows what RabbitMQ does. Everybody knows? Okay.
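To make the MySQL point above concrete, here is a minimal sketch of a free-space check. It is not something shown in the talk: the function names and the 20% threshold are hypothetical, and the path you pass in is whatever data directory your deployment actually uses.

```python
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def needs_attention(path, min_free=0.20):
    """Return True when free space on `path` is below `min_free`.

    `min_free` is a hypothetical threshold; pick one that fits your growth rate.
    """
    return free_fraction(path) < min_free
```

A call such as `needs_attention("/var/lib/mysql")` could run from cron or a monitoring agent, so that the partition is grown before MySQL stops, not after. The path here is only a common default and may differ on your nodes.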
The next issue is also a typical architectural issue we face when we take care of our customers. It's about HA structures. Many of us try to take care of our services and our users, but we don't really know how to avoid situations like split brain. The answer here is to use HA. When you plan and deploy your OpenStack cloud, you probably think, oh, maybe I can use one controller and one compute node and that should be enough. But no, it is not enough. Or, the next case, you use three controller nodes, or maybe two; it's up to you. You had a choice and you did this, but you did it without any cover like Pacemaker or HAProxy. You did it mainly by hand. Then at some point your services will have a split brain, and you will have nothing.

The next issue is an interesting one, because it is caused by reduced footprint. Who knows what that is? Nobody? Let me give you an example. Reduced footprint is when you have a couple of nodes and you deploy your control plane there as VMs. You can also deploy the compute services as VMs. So it is a fully virtual environment for your tests, experiments, and whatever you're going to do there. But some folks build production clouds with this model. That is not a good idea, because if you lose the node that hosts those controller VMs, or have to replace it, you will have nothing: you will just have those computes and that's all. You won't be able to spawn any VM. So this is a strange and interesting architectural issue, and it's really new; we first faced it maybe a few months ago.

The next one is about Zabbix. As we discussed on the first slide about MySQL, if you create a small partition for your database, it will simply be killed by the volume of data. So how do you create that much data?
You just need to put your Zabbix tables inside the OpenStack MySQL, and then Zabbix will kill your MySQL if something goes wrong on the controller. For example, some service like Nova gets switched off, you see some messages in Zabbix, and Zabbix generates a big amount of data. Then you will see how Zabbix kills your MySQL, and your OpenStack as well. A really interesting thing. Fortunately, here in Mirantis operations we have a lot of knowledge and a lot of articles on how to avoid such behavior and how to fix it. If you would like to understand this or get those articles, you can call the Mirantis sales department and we will help you understand how to work with Zabbix. If you already have a support account with us, you can just create a ticket and we'll provide you with an article on how to move Zabbix out of the OpenStack MySQL, if you already have this installation in production.

Okay, let's go on. Let me see. Okay, how to avoid a cloud breakdown. It's an interesting theme, and I would like to split it into three areas: users, services, and stress. These areas will help you prevent the breakdown. The key word here is "prevent", because you can't build proactive support without proactive steps, without preventing something.

Let's talk about users. It's always important to take care of the people who provide your income. Let's say someone asks you for a cloud, and this customer will pay you one dollar. You will have more customers who pay you one dollar each, but you have one who can pay you five. That is a key customer, and you have to take care of his capacity and his VMs more than usual. But you also have to take care of the other customers, and better than they would be cared for on VMware or Amazon.
You have to understand one thing. If you want to build a cloud for your users, you first have to calculate the capacity, the performance you will have, and how to avoid issues. I recommend adding 30% on top of the capacity in your calculations; that should be a good idea. The rule here: huge is never enough. Remember that rule.

I would like to say a couple of words about quotas and policies. Since we have this customer with five bucks and those three customers with one, we also have to understand how to manage them, which quotas and which policies we can grant them, and so which capacity they will utilize. If you have this key customer, he should be really happy. Just grant him five.

The next area is everything related to services. As we know, OpenStack is a service cloud platform. Picture it as dots in some space. They have to be connected to each other somehow, and you have to be proactive in creating stable connections between those services. See, those guys would like to be in HA, by the way, and the next slide is about this. If you have services, and you have stable connections between those services and the nodes, you have to put your services in HA as much as you can, because, as you remember from the previous slide about the problems with HA, this usually happens. Also, the recommendation here for this teamwork is to provide your services with 20 percent of the operational power. So a simple rule here: service HA is very important. Use that rule and your income will be stable.

Okay, the next area in our theme of how to avoid a cloud breakdown is stress. I know it can be painful when you have issues inside your cloud and you don't actually know how to fix them, or how to avoid or prevent them.
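The capacity and quota rules from this part of the talk come down to simple arithmetic. Here is a minimal sketch, assuming whole units such as vCPUs or gigabytes. The function names are hypothetical, and reading the "20 percent" as a share of a node's resources held back for the services is an assumption, not something the talk spells out.

```python
def planned_capacity(estimated_demand, headroom_pct=30):
    """Estimated demand plus headroom (the +30% rule), rounded up to a whole unit."""
    return -(-estimated_demand * (100 + headroom_pct) // 100)


def usable_for_tenants(total, service_reserve_pct=20):
    """Capacity left for tenants after holding back a reserve for the services."""
    return total * (100 - service_reserve_pct) // 100


def split_quota(total, weights):
    """Split `total` units between customers in proportion to integer weights."""
    unit = total // sum(weights.values())
    return {name: unit * weight for name, weight in weights.items()}
```

With the numbers from the talk, one customer paying five and three paying one, `split_quota(80, {"key": 5, "a": 1, "b": 1, "c": 1})` gives the key customer 50 units and each of the others 10. Integer arithmetic is used on purpose so that rounding is predictable.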
Yes, you have some monitoring solutions, you have experience, and you have already built some clouds. But anyway, you got this issue. The customer faces it. Probably it's a known issue, and probably you have a patch or workaround somewhere, and you just need to fix some service to give your customers stable operation. Here is the typical scheme for how you can do it. There is a CI/CD system inside this square. I'm not going to talk about CI/CD systems in depth, because huge enterprises usually use such a system. They have stagings, a lot of them, and they have a lot of production clouds. In any case they have a place where they can check and verify patches and workarounds, and probably they can build tests and packages. That's another story, I would say. But if you're from a small company, you can just create a small staging environment. Let it be one controller and one compute, and check your patches, your workarounds, and whatever you're going to do with services and configuration files there.

Also, you have to move those changes into the production cloud only within a maintenance window. The maintenance window is very important as well. Let's say your customer is going to create a VM, and this customer is going to do it through Horizon. Your workaround or patch fixes something in Horizon, and you have to restart some services, or maybe you change something in the Horizon view, or add new strings or new options for that VM. This customer opens the page and tries to start the VM, but you're pushing buttons at the same time, and boom, he can't do anything. He creates a ticket for you. To avoid this, you have to create a maintenance window and apply your tested patch or workaround to that production service within the maintenance window.

Okay, the last point here is about rollback.
Sometimes, even if you tested something in staging or in the CI/CD system, your patch may not be successful. So what are you going to do about it? You have to be clear and create rollback procedures. What does that mean? Let's say you did all this: you created the workaround, you scheduled a maintenance window, you fixed the service, but something is broken. If you're not using Salt, Ansible, Puppet, or other systems that can give you a rollback in one moment, I recommend you create a short script for how to change the package or the configuration file back; maybe you will create some backups and restore them. This area just shows how to manage and build this process right. And the rule here: be sure that everything is tested before you roll anything out to the production cloud.

Okay, now I'm going to share with you some support best practices, which were learned from really smart folks. Bless you. These best practices are simple, but they can give your cloud a breath of fresh air if you have some issues, or give you some time if you have something you need to fix.

Let me start with a story. Two years ago, maybe three, some folks called me and said, "Hey, Tony, we have an issue with EMC VNX. The volumes are stuck. Please help us." "Okay," I said, "could you provide me with the configuration file and some logs, so I can understand what is going on inside?"
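As an illustration of the backup-and-restore rollback mentioned above, here is a minimal sketch. It is not the procedure from the talk: the function name is hypothetical, `check` stands for whatever health test you trust (it is supplied by the caller), and restarting the affected service is deliberately left out.

```python
import shutil
from pathlib import Path


def apply_with_rollback(config_path, new_text, check):
    """Write `new_text` to `config_path`; restore the backup if `check()` fails.

    Returns True if the change was kept, False if it was rolled back.
    """
    config = Path(config_path)
    backup = config.with_name(config.name + ".bak")
    shutil.copy2(config, backup)        # keep the known-good version
    config.write_text(new_text)         # apply the tested change
    try:
        ok = bool(check())              # caller-supplied health check
    except Exception:
        ok = False                      # a crashing check counts as a failure
    if not ok:
        shutil.copy2(backup, config)    # roll back to the known-good version
    return ok
```

The same pattern applies to packages: record the installed version before the maintenance window so that the step back is one known command and not an investigation.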
Okay, he did this, I got the configuration file, and when I looked inside I saw around 10 backends. That means this guy cares about his customers: if they have an issue with one backend, he can just give them the option to use a different backend. So if you have the option to use NetApp, EMC VNX, LVM, anything related to Cinder, please do it, please use it. If you don't have anything like that but you have some nodes and some capacity, you can create two default backends, like LVM and Ceph, or LVM, Ceph one, Ceph two. Then if you have an issue, you can be proactive: you will be able to offer your customer this option, and you will have time to fix what is going on inside the backend that hit the issue.

The next best practice is about core reservation, as you can see on this slide, and on this one as well. Our customers usually face this issue: around 76% of our customers have faced the issue of high CPU utilization. You can also see the count of incidents here: since we started to teach them how to reserve cores, the count of incidents has been decreasing. We recommend that you reserve one core for the operating system's interrupt needs and one core for each NIC. Why? For example, your customer creates a ticket and says, "My VM is not working." Okay, fine, you say, and you go to look inside, but you can't use SSH. Why? Because you don't have capacity for that; you haven't reserved a core for OS interrupts, so you can't get in and see what is happening inside. Or a different story: you have the connection, you try to ping some address on a controller, but you can't reach it; or maybe you can ping, but you have issues with packets. That is why we recommend doing this, and you can use CPU affinity for this purpose. It's nice software, so you can Google it, and you can do this even in
a few steps, with a few commands.

The next hacky trick is from old times. When I was just a junior engineer in a small telecommunications company, we usually got such issues: customers faced some strange behavior because they had run out of disk capacity. So we did one thing. We created a file, "delete me if something", and we put this file on each disk of the node. What happens if you run out of capacity there? You can just delete this file, and then you can fix something, delete logs and other stuff, whatever you have there. Anyway, the goal here is time. Yes, you will have this time when you use this hacky trick. In addition, you can also use tune2fs to give your nodes a breath of fresh air, because by default 5% of the capacity is reserved, and you can decrease it, for example to 2%, so you will have more capacity. But remember, this works only on ext file systems.

The next best practice also comes from that futuristic thing, reduced footprint, when you build your cloud as VMs on a node. Sometimes you also run out of capacity for your controller; yes, you have some services that use your disk. The best way is to have some commands like the ones on this slide. You create an additional disk and use virsh to attach it. You then check, again using virsh, that the disk was attached to that VM, the controller VM for example. Then you can Google it or use the man pages to understand how to add some settings for that VM. Then you go inside that virtual controller and run some commands to resize the disk capacity. After that you can check that you have the capacity, and your control plane will be saved. Then relax.

Okay, many words were spent talking with our customers about the HA strategy, and on this screen we
have some diagrams that show the incidents registered with non-HA services. Unfortunately, we have around 19 percent of users, sorry, 18 percent of users, who still don't put their services in HA. I don't know how they actually manage. Yes, if they are using Fuel, they are okay: they have Pacemaker inside, so they can work without any issues. But we have some custom installations that still don't want to use HAProxy and don't want to use Pacemaker, so they have issues with split brain and other issues. So the best practice is to use HA. If you like clouds, if you have a lot of services and a lot of nodes, I really can't imagine how you can avoid HA structures at all.

Okay, we have time, so I would like to say a few words about cloud monitoring, because we can't be proactive without monitoring. We have to see what is going on inside the cloud, and we have to understand what's going on inside the VMs and so on. Our company provides customers with two solutions: Zabbix and StackLight. Let's describe Zabbix and StackLight in general terms. I know many of you already know these monitoring systems, but without those pieces our picture wouldn't be complete.

Zabbix is a good tool to monitor the state of your services and the state of your nodes. It's optimized to work with the OpenStack services. If you use Fuel or our other products, Zabbix can be installed there as a plugin. You don't have to do anything: just install it as a plugin with one button and that's all. Zabbix will know about everything that is going on inside your cloud, and it will be happy. And if you have some issues, like this one in red, you will see the behavior, you'll also be able to see the CPU utilization here, and you'll be able to fix or prevent something using this tool. Here are some common hints: if you're
using Fuel, you can use them. If you are not using Fuel, you have Zabbix standalone mode, which is also simple. For any strange behavior in Zabbix, you can just restart it. It's kind of the Windows way, but anyway, it works.

StackLight. StackLight is also known as the Mirantis Logging, Monitoring, and Alerting toolchain. It's a professional health and response monitoring solution. If you would like to understand what is going on in this project, you can go to our website and get a full understanding of its internals. The core components of StackLight are the collector plugins, InfluxDB, the Elasticsearch and Kibana plugins, and the Nagios alerting plugins. Here is a typical scheme of StackLight, so it's easy to understand what it is. It's a good monitoring system that gives you not just a common view of your instances and your nodes; it also provides business KPIs. So if you have something you would like to count, your money for example, or some other metrics, you can use StackLight. Fortunately, this feature is installed in CCP as well, and you can install it for Fuel as a plugin, so no issues here.

We also have to note one more rule here about monitoring: monitor as much as you can. If you have something, it must be monitored, because otherwise you can't be proactive. If you see something, you'll fix it; if not, it will never be fixed.

The last thing I would like to discuss with you is everything related to common support strategies. Right now we have two of them: operational support and managed operations. Let's discuss the first one. Let's say you have a user, and this user creates a ticket because he faced some issue. And there is you, or me, I don't know, this guy from Mirantis. You get the ticket, you fix it, and you provide the user with a patch or workaround, through CI/CD as well.
And what happens if you have three users and they all create tickets for you? Which ticket should you fix first? Which issue should you care about first? You can create a strategy. Let's say you mark those tickets as urgent, high, and low. Urgent is when the control plane of your cloud is down. High is when the customer has an issue but can live with it; probably it's a missing backend or something, or a VM working slowly. And low, let's say, is a kind of feature request: the customer would like to spawn three VMs but you can grant them only two. So you prioritize them, and you go inside each ticket, understand what is going on there, and fix it, following the chain you created for yourself.

You can create your own system and your own strategy for handling tickets. Maybe you will not just prioritize; maybe you will have time frames within which the customer wants the issue fixed, or maybe you have a policy that you must fix an urgent issue in one hour, or 15 minutes. Anyway, you can use this strategy, and you can use this picture, because it's just a common view of operations; real operations are more granular than I'm describing. Or, if you're stuck and would like to understand this strategy better, you can call our Mirantis sales department. They will help you understand which policy you should use and how to set up your operations. So don't hesitate, just call us.

Managed operations. Managed operations is a new era of support for OpenStack clouds. In this scheme we don't have any users. Well, we actually have users, but they are hidden. They're just using our cloud and utilizing our capacity, but they are not asking us to create a ticket or fix something. They don't know anything about what is going on inside the cloud you're providing them with. So you have a
bunch, a group, of OpenStack ninjas. They know everything about the services and about the strategy. Those guys use, in this picture for example, Kibana and StackLight; they can use Zabbix as well. Using these tools they try to prevent incidents. If they see that you're running out of capacity and the cloud will soon go down, they can add capacity. They also create tickets for themselves, and they provide you and your customer with the fix.

So that's all. If you have any questions or any trouble with anything related to OpenStack and how to support it, you can call us, because we have a lot of people who care about our customers, and we have huge expertise in this. We can help, support, and give you some articles and knowledge, and share them with you. If you have any questions you can contact me directly; there are my email addresses, and I also put my business cards here, so you can take them. If you have any questions, you can call our sales department, or you can call me, or drop me an email if you have something to say, or if you have some issue and would like to get an article. So if you have any questions, please ask. Okay, then thank you.