Good evening all. Today I am here to talk about best practices for performance tuning of an OpenStack cloud with Ceph. By the way, how many people are using OpenStack with Ceph? Wow, good number. I think almost everybody is using it. Any newcomers to Ceph? Okay, one, great. Everybody knows about Ceph, good. Okay, so before going to the next slide, we want to introduce ourselves. I am Swami Reddy, working with RJIL. I have been working on OpenStack and Ceph projects for the last three years. My key job responsibility is managing the Ceph clusters which are the back end for our OpenStack clouds. I have around 15 years of open source community experience — linux.org, gnu.org, gcc.org — so I am familiar with open source communities across the world. Now let me introduce my colleague Mr. Pandyan, who is an ATC in OpenStack and a key contributor across projects; he has done a lot of patches. He is one of the core members of the India OpenStack community. Okay, now let's go to the agenda, what we'll talk about today. I'll quickly go through an overview of Ceph, then OpenStack and Ceph integration, then a few recommendations on the OpenStack side and a few recommendations on the Ceph side, followed by questions and answers. Before all my recommendations, let me quickly describe the setup we are using. We have a general purpose cloud with 200 nodes, used for compute, block storage, and object storage. Currently we are running approximately 2500 VMs with 40 TB of RAM and 5000 CPU cores, all backed by raw Ceph storage of four petabytes. On average we use 20 GB Linux volumes and 100 GB Windows volumes as boot volumes, and data volumes of around 200 GB for both Linux and Windows VMs. Below are the compute and storage configurations; just glance through them.
These are the compute configuration and the storage configuration. Okay, now we'll go through a quick overview of Ceph. What is Ceph? I won't spend much time on this; just have a glance at what Ceph provides. Ceph is designed to provide excellent performance, reliability, and scalability. It has basically three client-facing components: the RADOS Gateway (RGW), RBD, and CephFS. RBD is the block storage interface, which we use as the back end for OpenStack Cinder. The RADOS Gateway is the object storage interface, whose client supports APIs compatible with OpenStack Swift as well as Amazon S3. All three sit on top of RADOS, the Ceph cluster itself — whether it is block storage through RBD or object storage through RGW, everything is eventually stored in RADOS. CephFS, the Ceph file system, was not yet a production version and we are not using it, so I won't talk much about it. Now I will talk about the OpenStack-Ceph integration: what are the components on the OpenStack side, what are the Ceph components, and how do the two talk to each other. This is a typical OpenStack diagram — it has Cinder, Glance, Nova, and other components, and we will take them one by one. Cinder is the OpenStack block storage service, which provides persistent block storage for users; it talks to Ceph RBD, and eventually everything lands in RADOS. Glance is the image service, which provides the catalog for all the stored images; it is again backed by RBD, and everything is stored in Ceph RADOS. Next I will talk about Swift.
Swift is the object storage of OpenStack, and its APIs talk to the RADOS Gateway client, which eventually stores all the objects into RADOS. Coming to Nova: Nova is the compute service, which spawns and manages VMs and attaches volumes using the hypervisors. Nova talks to the hypervisor, and the back end is again RBD — all the volumes live in RBD and end up in RADOS. Here is a typical flow for block storage and object storage. For block storage, on the OpenStack side Nova talks to libvirt; the libvirt configuration says how to talk to RBD. The request goes to QEMU, then through librbd and librados down to the lower layer where the mons and OSDs live. Similarly for object storage: it doesn't have a separate front end; requests come in through the S3-compatible and Swift-compatible APIs and go to the RADOS Gateway client. The RADOS Gateway is built on librados, and librados internally goes to Ceph RADOS. That is the typical flow for object storage and block storage. Now I am going to talk about a few recommendations to get the best performance or tuning out of Ceph. These recommendations are based on my experience and what we have done; they depend on the use case, so it is not a one-to-one mapping for every use case. First I'll talk about Glance. What is Glance? We just discussed it: Glance is the OpenStack image service; it provides all the images and the catalog of available images. Here I have a few recommendations. The first one: if you want to use Ceph as the default back end store, you need to set the default store to rbd.
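The Glance-with-Ceph settings discussed in this section typically look something like the following in glance-api.conf (a sketch based on the standard Glance/Ceph integration options; the pool and user names here are assumptions, not values from the slides):

```ini
# glance-api.conf — Ceph RBD as the Glance back end (sketch)
[glance_store]
stores = rbd
default_store = rbd
rbd_store_pool = images              ; assumed pool name
rbd_store_user = glance              ; assumed Ceph client user
rbd_store_ceph_conf = /etc/ceph/ceph.conf

[DEFAULT]
show_image_direct_url = True         ; lets consumers clone instead of download
show_multiple_locations = True
```

With show_image_direct_url enabled, Cinder and Nova can copy-on-write clone the RBD image in place instead of downloading it through the Glance API.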
That is the default one. The second recommendation: disable local caching. If you are spawning VMs with boot-from-volume, the image gets downloaded and kept in the local cache every time. Say you spawn 1000 VMs: the image will be downloaded 1000 times and kept in the local cache, which will eventually exhaust the compute node's local space — 10 GB times 1000 is a big number, and all the space will be gone. Because of that you may impact existing processes: lack of memory, lack of threads, and so on. So I recommend removing the local cache. The third one is very important: enable show_image_direct_url and show_multiple_locations. This helps us in two ways: no download is required if the direct URL is available, and no copy is required, because we know where the image lives. Here is an example of a direct URL — an rbd:// URL containing the image ID and the images pool, the full location. Otherwise we need to go to Glance, ask for the image, download it, and keep it in the local cache (or not). So it saves all that time. For Glance with Ceph, I always recommend using raw images, because Ceph internally supports raw images only; this saves the conversion time. Here are sample test results we ran: with a qcow2 Windows image of about 50 GB, boot time took around 45 minutes; with the raw image, less than a minute, because it saves the conversion and download time — it just picks the image from Ceph itself and boots. Now I will talk about the Cinder recommendations. What is Cinder? We discussed it: Cinder is the block storage service of OpenStack, providing persistent block storage. Here I don't have many recommendations except one.
Since rbd is supported by default, if you want Ceph as the back end, you just enable the Ceph back end in the configuration. And for Cinder backups, I always recommend using Ceph if Ceph is already your main storage. Your backups should also be on Ceph, because Ceph internally supports incremental backups: once your backup back end is Ceph, all your backups are incremental, which eventually saves you a lot of space. Say the first backup is 10 GB and the second is 12 GB — only the 2 GB delta is stored. Ceph knows about this support, so it maintains all the dependencies between backups for you. So this is a strong recommendation: if you use Ceph as your main storage, go with Ceph for Cinder backups as well. It is a simple default configuration, nothing much, but it gives you incremental backups and a lot of space savings. Coming to Nova: I do not have many recommendations beyond the default configuration, except that I recommend using librbd instead of krbd. librbd links with the libvirt configuration, where we need to enable the RBD cache; that gives more read IOPS — read performance will be higher. That covers the recommendations for the basic OpenStack components. Now I will go to the Ceph recommendations; we have a lot of things here — the major recommendations come from this side. Before going to the recommendations, we need to answer the decision factors, the use cases. How much storage is required? There are two numbers here: raw storage and usable storage. Say I have some petabytes of raw storage — can I use all of it at one shot? No, because Ceph uses a replication factor of 2, 3, or more depending on the use case. We have 4 petabytes; with 2 replicas we get only 2 petabytes usable; with 3 replicas, divide by 3. So this again depends on the use case and the requirements.
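The raw-versus-usable arithmetic above is just division by the replication factor; a minimal sketch:

```python
def usable_capacity(raw_tb: float, replicas: int) -> float:
    """Usable capacity of a replica-based Ceph pool, ignoring metadata overhead."""
    return raw_tb / replicas

# 4 PB (4000 TB) raw with 2x replication leaves 2 PB usable;
# with 3x replication, roughly 1.33 PB.
print(usable_capacity(4000, 2))              # 2000.0
print(round(usable_capacity(4000, 3), 1))    # 1333.3
```

In practice you also cannot fill the cluster to 100% — the default full ratio stops writes at around 95% — so plan for usable capacity somewhat below this number.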
Similarly, we need to decide how many IOPS are required. What is my requirement — very high IOPS at low cost? What is the aggregate, and how much per VM? We need answers to these questions before going for any tweaking. And the most important question is: what do we want to optimize — performance or cost? This is a very challenging item, because the two are opponents: if you want to optimize for performance, you need to put in a lot of money; if you want to optimize for cost, you need to compromise on performance. So these are the decision factors before doing any performance tuning. Here is a quick performance optimization criteria table; it is self-explanatory, just glance through it: for IOPS, low cost versus high cost; for throughput, what is optimized; for capacity, what is the cost per TB — these are examples. Now I will talk about OSD considerations. Ceph has OSDs, right? What is an OSD? OSD stands for object storage daemon, which is responsible for storing the objects to the file system. Here I have a few recommendations. What CPU is required? These are the minimums, not maximums — minimum requirements for CPU, RAM, and so on; if you have more, it is better. Then ceph-mon: ceph-mon is the Ceph monitor daemon, which maintains the consistency of the cluster maps across the cluster. It is recommended to use one mon node per 15 to 20 OSDs. If we have more OSDs than that per mon, it may take time to reach consistency — copying the maps from one to another may be delayed. And networks — always networks, right? The aggregate throughput of all OSDs should not exceed the network throughput. Say I have 20 OSDs which all together give some throughput X; we should have a network of at least X or above.
If we have below X, it will throttle the performance, because all the OSD operations will go and block at the network layer. So at minimum, the aggregate OSD throughput should not exceed the network throughput. Then threads: if you are running many OSDs on a single node, each OSD will spawn a lot of threads for its internal operations — backfilling, recovery, scrubbing, and many other things. Be careful that the number of threads does not affect performance: a Linux box will support only some maximum number of threads, so around 20 OSDs per node should be the recommended value. One more thing here: for Ceph OSDs, always go with JBOD mode; never go with any RAID setup. If you add RAID, the RAID controller eventually becomes the performance bottleneck, and Ceph itself maintains the copies anyway. So definitely go with JBOD — no RAID required. Now I will talk about Ceph OSD journaling. What is the purpose of journaling? Why do people use it? Basically, the journal is there for speed and consistency. The Ceph OSD daemon writes to the journal to quickly commit the data to disk — quick operations. And consistency: the OSD takes care of flushing the data from the journal to the file system within a fraction of a second; during that time it won't serve write or read operations, it just flushes the data to the file system. These two behaviors are how OSD journaling helps us get performance. We have multiple types of journaling: on-disk journaling, a separate hard disk as a journal, or SSD journals. If we use SSD journals, we get good performance compared to on-disk. We have test results comparing on-disk journaling with SSD journaling.
With on-disk journals we got around 45 MBps; with SSD journals we got almost double. Our test environment used a 1:11 SSD-to-OSD ratio, but per the Ceph recommendations, 1:4, 1:5, or 1:6 is better for getting the best performance. We ran the same tests for the other patterns — sequential writes and random writes, sequential reads and random reads — and these are the results. Next, operating system considerations: what OS tweaks do we want to set on the Ceph nodes? You can just glance through the slide; it is self-explanatory — CPU tuning, I/O scheduler tuning, disabling NUMA, swappiness, raising the kernel maximums, enabling hyper-threading (HT) and VT in the BIOS. These are all standard things, but they are all required to improve performance from the OS side. Now I will talk about Ceph networking. What is required on the Ceph side? It is always recommended to use two networks: one public (user) network and one cluster network. Ceph does a lot of internal activity — rebalancing, recovery, scrubbing, and many other things — and these operations should not affect the public network. If we have only one network and Ceph occupies all of it, users will see slowness. So the best approach is two networks, one public and one cluster. Two networks means two NICs, and it is better to use 10-gig: one gig is fine if that is all you can afford, but it is advisable to go with 10-gig because Ceph uses the network intensively for all its internal operations. One more point: jumbo frames are recommended across the network — this is how to set jumbo frames. The top-of-rack and spine switches need high bandwidth, and always go with BMC hardware to get alerts and so on.
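Enabling jumbo frames is a one-liner per interface, but every NIC and switch port on the path must agree on the MTU; a sketch (the interface name is an assumption):

```shell
# Enable jumbo frames (MTU 9000) on the cluster-network NIC.
# eth1 is an assumed interface name; every host and switch port along
# the path must also be configured for MTU 9000.
ip link set dev eth1 mtu 9000

# Verify end-to-end with a non-fragmenting ping
# (8972 = 9000 minus 28 bytes of IP + ICMP headers):
ping -M do -s 8972 <peer-cluster-ip>
```

To persist the setting, add the MTU to the distribution's network configuration (e.g. the interface stanza or netplan file) so it survives reboots.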
In our environment we have done NIC bonding: we have two 10-gig NICs bonded in balance-alb mode. Before NIC bonding we got an average speed of about 5 Gbps; after bonding we got almost 8 to 9 Gbps. So I recommend using NIC bonding wherever possible to achieve good performance. Now I will move on to failure domains. What is a failure domain? A failure domain says that even when something fails, we still have access to our data. Ceph supports these failure domains, starting with OSD, then host, chassis, rack, row, PDU, pod, and so on. But there is a cost, right? There is a cost added for the isolation of data: if you go with host as the failure domain, you should have that many hosts to support your environment; similarly, chassis or rack adds cost to the cluster. For example, if you want to go with rack, you should have a minimum of 2 or 3 racks, as per the replication count. So based on this, select chassis or rack as the failure domain for data durability and data availability. Now I will talk about operational recommendations: while Ceph is running, how do we tweak performance, and how do we deal with the items that impact us? A few of these are scrubbing and deep scrubbing. What do scrubbing and deep scrubbing do? They are mechanisms to maintain data integrity across the Ceph cluster. Scrubbing is the light one — it just checks that the object sizes and attributes are fine; it doesn't do much beyond that. Deep scrubbing reads every piece of data and verifies its checksum, so it takes a lot of CPU cycles and impacts the performance of the running cluster.
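The enable/disable toggles for scrubbing are cluster-wide flags set from the ceph CLI; for example:

```shell
# Disable scrubbing cluster-wide before a peak window or maintenance:
ceph osd set noscrub
ceph osd set nodeep-scrub

# Re-enable once the window is over:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```

While the noscrub/nodeep-scrub flags are set, `ceph health` reports a warning state; that is expected and clears once the flags are unset.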
These are the options for disabling and enabling them. Once we disable scrubbing, Ceph health will go to a WARN state, and we can enable it again the same way. For deep scrubbing, Ceph supports a few options, like setting the scrub begin and end hours. Normally we have non-peak times — users will be active from 8 to 8, something like that — so during that window we stop the scrubbing and Ceph won't do any scrubbing in the background. We set the timings for when it begins and when it ends, plus the threshold levels. Similarly for deep scrubbing, these are the main knobs: I can set deep scrubbing to once per two weeks per PG, so only after two weeks will it again read all the objects and verify that everything is perfect. Other performance-related operational items are recovery and backfilling. Say some OSDs or a node has gone down: there are a lot of objects to backfill, and the data has to be recovered, right? These are the parameters which impact those operations. Max backfills has a default of 10; you can increase or decrease it, but there is a trade-off: if you increase it, recovery and backfill go faster; if you decrease it, they go slower. Basically these operations are linked with the user operations — they share the same bandwidth — so we need to tweak these numbers very carefully, deciding when we can afford them and when we cannot. And all of these can be tweaked dynamically: no restarts, nothing else needed; with the ceph command style we can inject the values dynamically and change them. But take these very carefully, as they may impact your user operations: if you increase them, you impact user I/O; if you decrease them, recovery definitely takes longer.
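The dynamic injection just mentioned uses `ceph tell`; a sketch with example values (the hour window and throttle numbers are illustrative, not recommendations from the slides):

```shell
# Confine scrubbing to a nightly off-peak window (22:00-06:00):
ceph tell osd.* injectargs '--osd_scrub_begin_hour 22 --osd_scrub_end_hour 6'

# Deep-scrub each PG at most once every two weeks (value in seconds):
ceph tell osd.* injectargs '--osd_deep_scrub_interval 1209600'

# Throttle recovery/backfill so it competes less with client I/O:
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
```

These take effect immediately without restarting the OSDs, but they are not persistent across daemon restarts unless also written into ceph.conf.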
So finally I have a few guidelines for performance tuning. Always change one option at a time — never play with two options at the same time, or we won't know which change caused what. So I always recommend playing with one option at a time and checking what changes: say you have changed option X, identify what is happening with it and see whether it has an impact or not. And run the performance test related to that option. Say you have changed IOPS-related settings but you are testing the performance of unrelated things — then you cannot identify whether this tweak is really working for you or not. So run the right performance test for the changed option. And repeat: we run each change at least 10 times. Do not assume that because it worked once, you can go with it — never do that. I recommend at least 10 runs; you can take a bigger or smaller number, but it is always good to go with 10. It is important to repeat the change multiple times, see what is happening, identify it, and only then take a decision. During any value or configuration change, check whether you see any errors in the log files; just observe that — it is very important. Then look at the results and estimate — these are the general guidelines. Coming to the Ceph tuning side, the Ceph cluster parameters, that is what I already told you: we can change the performance configuration dynamically, with no service restarts required. But these changes may impact data integrity, may cause more rebalancing, and may reduce the network bandwidth. Tuning should be performed only on a test environment. This is the most important one — people do it directly on production environments and then say, oh, my production is gone, what do I do?
So it is always recommended to do all the tuning and testing on a test environment, confirm that all the results are fine, build an automated process for applying it, and only then go to production. So now, time for questions and answers. Could you please come to the mic so that it is recorded? This side. [Audience] As a brief follow-up to the statement about testing and making sure it's done a number of times: I have worked with disk testing in the past, and there is a group called SNIA which deals with measuring performance metrics on disks, and there are recommendations for the minimum number of times you need to rerun a test in order to determine that it is statistically valid. There's also something very important to note: solid state disks of course perform better when they're new — they have an elevated period of performance that will drop off after a certain period of time. So I'm very glad to see that you put that on there and mentioned it, because that is a big pitfall. It's more of a thank-you for validating that than a question. [Speaker] Thank you very much. [Audience] Just a quick question: one of your slides mentioned that it's highly recommended to have hyper-threading enabled. Why is that? Did you do any benchmarks comparing Ceph performance with and without hyper-threading, or what was the motivation there? [Speaker] It is on one of the slides further on — the BIOS settings I recommended, right? Nowadays the BIOS comes with hyper-threading and virtualization technology disabled by default. But Ceph requires a lot of CPU cycles to perform its operations, because one OSD does all sorts of actions — recovery, rebalancing, scrubbing, deep scrubbing — so much activity is required. So definitely, if you enable hyper-threading, it will add performance.
And by default, nowadays all BIOSes come with it in a disabled state, so I recommend going into the BIOS settings and enabling it. [Audience] All right, thanks. Thank you. [Audience] The real reason you want hyper-threading is that it hides the latency of memory accesses, and Ceph is all about going to storage: when one thread is off doing something in storage — scrubbing, cleaning, whatever, storing — the other thread can make progress. [Speaker] Thank you. We have four minutes left. Any more? [Audience] I have a question about networking: should the cluster network be separated physically, or can it be another VLAN? Do you have any test results on this difference? [Speaker] I don't have any at the moment, but it is recommended to go physical if your infrastructure supports it, because going physical costs, right? You need all the top-of-rack switches, the routers, and everything. But in the worst case, a separate VLAN should be OK. [Audience] OK — because we have one network and we observe packet drops on our switch, and I'm thinking about separating this network physically onto another switch. I don't know if I will see any difference or not. [Speaker] There will be a difference, but again, it depends on the use case — if you are able to provide all the switches, then it's advisable to go with that. [Audience] OK, thank you. [Audience] Hello. You mentioned that you recommend using HBAs for connecting OSD disk devices and do not recommend RAID controllers. Did you run tests on this? Because our experience has shown that RAID controllers with cache enabled give better performance than HBA mode. [Speaker] OK, I didn't have that use case; maybe we can talk later and I'll share the information. [Audience] OK, thank you. [Moderator] One last minute. [Audience] Does the Ceph cluster go to WARN after it couldn't be scrubbed for, I don't know, days or weeks, if the load is too high? [Speaker] OK, it is not like that — it will stay operational. If you disable the options — I'll just walk you through that.
So if we disable the options, then it will go to a WARN state — that is what I disabled here: if you unset the scrub option, it will go to WARN. But if, due to high load, the scrub has not completed, it won't go to WARN just for that, because eventually it will retry, right? So Ceph health will be OK. [Audience] Thank you. [Speaker] Thank you very much for your patience; we are over time. Thank you very much.