Hello, everyone. Did you have a nice breakfast? Some people are eating now. It's the last day of this summit and the first session of today. Thanks for coming to our session. Actually, we are not native speakers, as you know, so you'll need a little patience. Let's start our journey. I'm John Han. I'm a DevOps engineer in the cloud computing team at Netmarble. Hello, I'm Jaesang Lee. I work for SK Telecom in South Korea. I have been working on Cinder for several years. Now I'm responsible for deploying and operating Cinder with Ceph. This is the second summit I have joined. Last time, in Barcelona, I was only watching, but this is the first time I present, and it's such an honor. We have one more presenter, Byungsoo, but unfortunately he can't attend this time for personal reasons, since we came from very far away.

You might not know our companies, so let me introduce my company first. SK Telecom is the number one telecom company in South Korea. Like many other telcos, we are making a lot of effort to virtualize our networks. I belong to the NIC Research Center; NIC stands for Network IT Convergence. We are committed to compute, network, and storage virtualization. Our team is researching compute and network virtualization, and our mission is to virtualize all of our legacy infrastructure. OREO is our project for a cloud platform; it's an acronym for Open Reliable Elastic on OpenStack. We're trying to build an auto-upgradable OpenStack platform on Kubernetes. We are also developing a network virtualization solution called SONA, which consists of a Neutron ML2 driver and an L3 service plugin based on ONOS.

Yeah, I want to show a cartoon before introducing my company. Do you know what this simple four-panel cartoon means? If you do, raise your hand. So many, oh, you know, yes. Many people know that Korea is famous for its gaming skill, like StarCraft or LoL, but Korea also has a lot of game companies. Netmarble is one of those Korean game companies. Over a relatively short span of two years, we have rapidly and constantly moved up the ranks among global mobile game publishers. As of February 2017, we ranked number three in the Apple App Store and Google Play combined; in Google Play alone, we were the number one publisher in the world. We have over 10,000 running instances in service, related to our mobile games and platforms. We have eight Ceph clusters, with total usage of over two petabytes. Our role is DevOps for the services of each of our game companies, and we are concerned with efficient and fast integration of Ceph and Cinder.

As everybody knows, there is no longer any doubt that Ceph is the most typical and powerful back end for Cinder. Ceph is a unified storage system that supports multiple interfaces, including object storage, block storage, and file storage. It also supports most of the storage services in OpenStack, so it enables storage as a service. Also, Ceph's copy-on-write cloning is extremely helpful for cloud services that need to deploy VMs quickly. However, despite these advantages, there are many difficulties in integrating Ceph with OpenStack in a production environment, and that is the purpose of the topics we want to talk to you about. While operating a storage service, it was imperative to address elements such as performance tuning, HA, volume migration, and replication, but the existing documents and guides could not adequately support this. We want to share our experience on these topics.
This presentation is intended for storage service managers and operators who want to operate a cloud service, not for people who are already experts. I think there will be Ceph experts or gurus here, because today is Ceph day and this session is a Ceph-related topic. But if you are already an expert or a guru, I recommend you go to another session. OK, our agenda is largely performance tuning and operation. Mr. John's part is performance tuning, like CRUSH tunables, bucket type, and journal tuning, and I will talk about operation topics, including high availability and some other things.

OK, let's start with performance tuning. What do you think of when you think of the performance of Ceph? IOPS, throughput, latency? These are all right. But I think the performance of Ceph should be considered in two aspects. The first is the simple numeric storage performance, and the second is the overall performance impact during recovery or rebalance. Today, I'm focusing on the second one. As new versions of Ceph were released, the CRUSH algorithm also advanced. The CRUSH algorithm is the set of rules that decides where data is stored. Ceph manages the history of these algorithms under the name of tunables: a series of tunable options that control whether the legacy or the improved variation of the algorithm is used. You can find the CRUSH tunables with the command ceph osd crush show-tunables. Each line means a specific set of algorithms. Ceph has tunable profiles named by release, like legacy, argonaut, bobtail, and so on. The legacy profile is covered by the argonaut version of the algorithm, and the optimal profile is the best set of values for the current version of Ceph.

Let's take a closer look at the CRUSH tunables. The most common profile is firefly, I think. It fixed the internal weight calculation algorithm for the straw bucket. Since the hammer profile, Ceph supports the straw2 bucket type. The jewel tunable profile improves the overall behavior of CRUSH such that significantly fewer mappings change when an OSD is marked out of the cluster. Thus, the more up-to-date the CRUSH algorithm applied, the better the data movement behavior. However, there is something to check before adopting the optimal CRUSH profile: the client must support the features of the Ceph tunables when you use kernel-based RBD. That is the reason the default profile setting is firefly even though we installed the jewel version of Ceph. For instance, if you want to apply the jewel version of the tunables, your kernel should be version 4.5 or above. Bucket types define the types of buckets used in your CRUSH hierarchy. Buckets are a hierarchical aggregation of storage locations like rows, racks, and hosts. Since the hammer version, Ceph supports the straw2 bucket, which fixes several limitations in the original straw bucket. Specifically, the straw2 bucket achieves the original goal of only changing mappings to or from the bucket item whose weight has changed. The new straw2 bucket will apply when you change the Ceph tunables to optimal; however, for existing buckets, the CRUSH rule must be changed to straw2 manually. This is the data movement test for the straw and straw2 buckets. The test environment consists of a total of 6 hosts and 84 OSDs. We chose 6 OSDs randomly and changed their weight to 0.3. In the case of the straw bucket, a total of 8.713% of objects were degraded; in the case of the straw2 bucket, a total of 3.596% of objects were degraded. We figured out that the straw2 bucket type can improve rebalancing performance.
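For reference, here is a minimal sketch of how the tunables can be inspected and how existing buckets can be converted to straw2 by editing the decompiled CRUSH map. The file names are placeholders, and a change like this triggers data movement, so it should be verified on a test cluster first.

    ceph osd crush show-tunables                      # inspect the current tunable profile
    ceph osd crush tunables optimal                   # adopt the optimal profile (causes rebalancing)
    # Existing buckets keep "alg straw"; convert them via the CRUSH map:
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    sed -i 's/alg straw$/alg straw2/' crushmap.txt    # switch bucket algorithm to straw2
    crushtool -c crushmap.txt -o crushmap-new.bin
    ceph osd setcrushmap -i crushmap-new.bin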
The next tuning point is the SSD journal. SSDs can be divided into three types according to read/write priority. Read-intensive SSDs benefit read performance; on the other hand, write-intensive SSDs support applications with heavy write workloads. You can find that the specified performance has a big gap, even within the same model line of SSD devices. For most cloud environments, the write I/O ratio accounts for more than read; in our case, it's almost 9 to 1. The SSD journal's write performance can be a bottleneck during rebalance or recovery. Thus, the performance of rebalance can be improved when you use a write-intensive SSD journal.

This is the test environment to check the performance impact of recovery in Ceph. We deployed OpenStack with Ceph, and 100 VMs are running on five compute nodes. We developed a Python agent that stores the VMs' read/write performance data into InfluxDB via a disk benchmark tool. We store the data every minute and graph it through Grafana. In addition, two OSDs were intentionally removed physically to generate a failure condition and check the effect of recovery. As a result of the test, I/O wait was observed on the SSD journal disk. We figured out that the write performance of the SSD journal can be a bottleneck for recovery performance. The overall throughput after replacing it with a write-intensive SSD improved from 254 MB/s to 460 MB/s. Failover time was also dramatically reduced, from 100 to 48 minutes, in the case of physical removal of two OSDs. In conclusion, you should choose the SSD type that is appropriate for your workload environment.

Yeah, the next topic is high availability. It's a very critical and important part for a system administrator, from service launch to end of life. How about Cinder HA? The Cinder service consists of the API, scheduler, volume, and backup services. The API and scheduler are relatively easy to make highly available through HAProxy and RabbitMQ, but the volume and backup services have many difficulties because they communicate with the volume back ends. Now let me talk about HA for Cinder volume and Cinder backup. Traditionally, the Cinder volume service was started on just one node, and it can be moved and migrated to other controllers as necessary. Because only a single Cinder volume instance is required, there is a single point of failure. We have good and bad news for this problem. In this picture, you can see the spec to support highly available, active-active configurations in Cinder volume. It hasn't been completed yet, after more than a year, but it has made good progress, and you can test the active-active configuration now. The workflow of the Cinder volume active-active feature is very simple, I think. The biggest concept in Cinder volume active-active is the cluster. If you cluster multiple Cinder volume services, Cinder creates a queue for the cluster to use. When the Cinder API is requested, the API call is scheduled to the request through the cluster queue. Let me talk about the proper configuration for the Cinder volume active-active feature. Because this feature is under construction, if you want to test it, the volume driver has to declare active-active support like this, and you add the cluster option in the Cinder configuration file. That is all; it's very simple. This is an example configuration that adds the cluster and host parameters to the Cinder configuration on two hosts. Then you can check the cluster status by running the cluster list command. At first there is nothing, but you can check the cluster status after you run the clustered services. Because the Cinder API supports the cluster commands from microversion 3.11 onward, you should set the volume API version higher than 3.11; in this example, I set 3.29.
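As a rough illustration of that setup (the host and cluster names are placeholders, not the exact values from our slides, and the volume driver also has to report active-active support), the configuration and the check could look like this:

    # /etc/cinder/cinder.conf on host1 and host2 -- same cluster name, different host
    [DEFAULT]
    cluster = cluster-ceph
    host = host1                  # host2 on the second node

    # After starting cinder-volume on both hosts, check the cluster
    cinder --os-volume-api-version 3.29 cluster-list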
This is the cluster result after you activate the cluster, shown in the yellow box. From now on, this cluster owns the lifecycle of the volumes. Then I ran a Rally test for volume create and delete. This scenario creates and deletes 30 volumes with 10 tenants and 20 users, with a concurrency of two. Two Cinder volume services are running while testing this scenario. As expected, all test cases were successful, and the two Cinder volume services were scheduled round-robin internally. I ran the same test with one Cinder volume service down. Although I stopped the Cinder volume service on host two, all volumes were created and deleted successfully. The next scenario creates volumes and attaches them to VMs. It creates five VMs based on the Cirros image, creates volumes with sizes from one to five, and attaches them to the VMs. Two Cinder volume services are running while testing this scenario. All actions were also successful, like the first test case, and the two Cinder volume services were scheduled round-robin internally. I also executed it one more time and stopped one Cinder volume service. It was successful like the first test; all test cases passed. It's a simple test, but I think it covers the very core functions of Cinder volume, like create, delete, and attach. If you're running just one Cinder volume service and you have a SPOF, it's worth considering this feature. But because this feature is under construction, I hope it is finished as soon as possible.

Next is HA for Cinder backup. Cinder backup can already run as multiple services, but strictly speaking it's a scalability feature, not HA. If you use this feature, you can run multiple backup services, and each service handles backup CRUD operations. Even if one backup service dies, the backup service can be maintained through another service. This is also the workflow for Cinder backup, but it doesn't create a new queue, unlike Cinder volume. Each host has a backup queue, and the Cinder API chooses one queue randomly if they are in the same availability zone. Running multiple Cinder backup services doesn't need any difficult options either. Each backup service has the same availability zone and the same backup driver; the only different option is the host. This is the result of cinder service-list: there are two Cinder backup services on two hosts. I also ran the test. This is a create volume and backup test scenario: it creates volumes and backups, and the concurrency is also two. The Cinder backup service was selected randomly; the Cinder API service chooses it. Every request was a success. I also ran the test one more time and stopped one Cinder backup service, but there were two errors, so the success rate is not 100%. It's a timeout exception, because the backup status isn't changed from creating to available. For Cinder backup, it seems difficult to maintain a perfect HA state, because it's a scalable service. But even if one service is down, the rest can work without any interruption, because the Cinder backup services have the same availability zone and the same backup driver: they can delete and restore backups made by others.
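As a hedged sketch of that backup setup (the driver path, pool, zone, and host names are illustrative values, not necessarily the ones we used), the two hosts only differ in the host option:

    # /etc/cinder/cinder.conf on both backup hosts
    [DEFAULT]
    backup_driver = cinder.backup.drivers.ceph
    backup_ceph_pool = backups
    storage_availability_zone = nova
    host = backup-host1           # backup-host2 on the second node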
OK, the next topic is volume migration, and it's my turn again. What kind of feature is it? Volume migration transparently moves a volume's data from its current back end to a new one. How can we take advantage of this feature in a production environment? We can move volumes to another cluster, and then work on the original cluster, when that work would have a large impact on overall performance or stability, such as maintenance or a major upgrade. Volume migration in Cinder can be done in two ways. The first is driver-specific migration, which is supported internally by storage drivers like EMC VMAX and HPE 3PAR. Unfortunately, the RBD driver does not provide its own migration functionality yet. If the back-end driver is not able to perform the migration, Block Storage uses host-assisted migration. In the case of block-based storage using the iSCSI protocol, it mounts both the old volume and the new volume on the Cinder volume node, copies the data through the dd command, and deletes the iSCSI volume. In the case of Ceph, volume migration operates through this host-assisted, file-based copy. Let's look at how the function works and what happens as a result, through four typical use cases with the Ceph back end.

Let's look at the volume migration flow. When the client requests the volume migration, the API's migrate_volume function is called. The target host is specified, but the request still goes through the Cinder scheduler logic. The migrate_volume function in the volume manager calls the same-named function in the RBD driver, but it's not implemented, so the volume manager calls the migrate_volume_generic function in order to perform host-assisted migration. The host-assisted migration function works by creating a new volume on the destination back end, copying the data from the source volume to the new volume, and deleting the source volume. Now let's check the results of actual migrations through four use cases.

Use case one is an available volume, the same RBD back end, and the same type. The volume back end consists of two clusters, ceph-1 and ceph-2, and we migrate a volume from cluster one to cluster two. The force-host-copy argument of the command means that you want to perform host-assisted migration. During migration, a new volume with the same name as the source volume is created on Ceph cluster two, and the data is copied from the source to the destination volume. The volume is migrated to Ceph cluster two, and you can find that the migration status changed to success with the cinder show command. Use case two is an available volume and the same RBD back end, but a different volume type. There are two volume types, one ceph-1 and the other ceph-2, which are mapped to the same-named clusters. The volume migration was also successful; however, the volume keeps its existing volume type. If the volume type is different, it does not change automatically, so you need to make a manual change. You can change the target volume type using the retype command. Use case three is volume migration between Ceph and a block-based volume back end. At first, you should remove the capabilities filter, because the filter blocks the migration when the volume back end differs between the source and destination. We used an EqualLogic volume back end as the block-based driver. We created an EqualLogic volume, attached the volume to a VM, mounted the volume, and made a test file. The volume migration was performed successfully on the Cinder side. We retyped it, the same as in use case two, and attached the volume to the VM. However, we faced an error message when we tried to mount the volume. Yes, we can fix it with some tools, but even though we can fix it, I do not want to create a situation that will cause problems. Use case four is the same as use case one, but the volume is attached to a VM. An error occurred in the Nova compute service when we tried to migrate the volume: a "swap only supports host devices" error occurred when running the swap volume function. There is also a related blueprint for solving that problem, but it has not been started yet.
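For orientation, here is a minimal sketch of the client commands behind the use cases above; the volume ID, destination host string, and volume type name are placeholders and depend on your back-end naming.

    # Host-assisted migration of an available volume to another back end
    cinder migrate --force-host-copy True <volume-id> controller@ceph-2#ceph-2
    # Check the result; migration_status becomes "success" when done
    cinder show <volume-id>
    # If the destination uses a different volume type, change it manually
    cinder retype --migration-policy on-demand <volume-id> ceph-2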
Let's wrap up. Migration between the same RBD back ends with the volume in the available state works very well. However, if the type is different, you must manually retype. Migration between Ceph and other back ends can be done with several manual fixes, but I do not recommend it. Migration of an in-use volume on the RBD back end does not yet work properly.

OK, the last topic is volume replication. What is RBD mirroring? It means RBD images can be asynchronously mirrored between two Ceph clusters. This capability uses the RBD journaling image feature. From the jewel release, Ceph started supporting the RBD mirroring feature. There are two types of mirroring: when applied per pool, all images created in the pool are mirrored automatically, or it can be configured per image only. How can we take advantage of this feature in a production environment? We can utilize it for disaster recovery. Let's talk about Cinder volume replication. It depends on the driver's implementation. There is no automatic failover, since the use case is disaster recovery, and it must be done manually when the primary back end is down. Here are the overall steps for Ceph volume replication in Cinder. Cinder replication with the Ceph back end requires two different Ceph clusters, or more. Configure the Ceph clusters for mirroring and apply mirroring to the pool used by Cinder, in most cases the volumes pool. Copy the configuration file and key of each Ceph cluster to the Cinder volume node, then add replication settings to the Ceph back end in the Cinder configuration file.

Let me explain in detail from the Ceph side. We should install the rbd-mirror package on the Ceph clusters; you can install it easily using the apt or yum command. Enable and start the rbd-mirror service, and enable mirror mode on the volumes pool on all clusters. Copy the configuration file of the other cluster to the primary and secondary node respectively. Note that you must save the name of each cluster's configuration file and key file differently. The rbd mirror pool peer add command is used to peer the different clusters. At this time, since the peer is established based on the host name, the information of the other cluster is registered in the hosts file. In the Cinder configuration file, we need to define the replication_device option. We create a volume type, which is named replicated, and set two extra specs: one is replication_enabled, the other is volume_backend_name. Let's create a volume with the replicated volume type. A while after you request the volume creation, you can see in the volume information that the replication status is enabled. Let's see how the volume is created in the actual Ceph clusters. You can see that the journaling feature has been added to the volume in both Ceph clusters. The mirror status shows up+stopped in the primary cluster and up+replaying in the secondary cluster. Now it's time to make a failover. You can find that the replication status of the Ceph back-end service is enabled before we do the failover. We have a problem with the primary cluster, so we must do a failover: just type the failover-host command. You can find that the value of replication_status is failing-over after we issue the failover. What about the Cinder volumes? A normal volume that was not created with the replicated type will be in an error state, and the replicated-type volumes will fail over normally. If you look at the actual Ceph clusters, you can see that the mirror statuses of the two clusters have swapped.
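Here is a condensed sketch of the steps above, assuming two clusters whose config files are saved on the Cinder node as ceph-primary.conf and ceph-secondary.conf; the names, pool, user, and service unit are placeholders and may differ in your distribution.

    # On both clusters: install and start rbd-mirror, then enable pool mirroring
    yum install -y rbd-mirror
    systemctl enable --now ceph-rbd-mirror@admin      # unit name may vary
    rbd mirror pool enable volumes pool
    rbd mirror pool peer add volumes client.admin@ceph-secondary

    # /etc/cinder/cinder.conf, Ceph back-end section
    [ceph-primary]
    replication_device = backend_id:ceph-secondary, conf:/etc/ceph/ceph-secondary.conf, user:cinder

    # Volume type with replication enabled
    cinder type-create replicated
    cinder type-key replicated set replication_enabled='<is> True' volume_backend_name=ceph-primary

    # Fail over when the primary cluster is lost
    cinder failover-host <cinder-volume-host>@ceph-primary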
But I have one hidden topic. This is a bonus and really the last section: tips and tricks. Let's talk about some of the tips: the number of volumes on a single VM, RBD cache, a TCP congestion control algorithm issue, and OSD host failure. Do you have experience attaching many disks to one single VM until an error occurs? Although KVM supports a great many disks, over 100, in Nova compute the maximum number of disks you can attach is only 26, following alphabetical order from A to Z. So if you try to attach a 27th volume to a single VM, an error occurs; this is the Nova compute log when you attach the 27th volume. But in reality, the number of volumes that we can use on a single VM should be considered separately from the case above. Recently, I got an issue report that a customer's VM hung after attaching 10 volumes. They attached 10 same-size volumes to a single VM and formatted the disks with mkfs. As soon as the eighth volume was formatted, the VM didn't respond. The Ctrl+C break key didn't exit the process, and the mkfs process couldn't be killed by the kill command at the terminal. After that, I looked at the VM kernel log. There is only a message saying that a task was blocked for more than 120 seconds, and no other hint. I didn't know what caused this, so I started to search for this problem on Google, and there were some mailing list threads about it. The reason is the file descriptor limit of QEMU. In KVM, a QEMU process is executed per VM, and the FD limit is 1,024 per VM. The number of file descriptors per VM increases when disks are attached and actual I/O occurs within the VM. Our test showed the same in our environment: every mkfs on an attached volume increases the number of FDs, and the FD limit was exceeded when trying to format the eighth volume. From the moment the FDs exceed the limit, I/O in the VM is no longer executed, and the hang starts.

OK, how to fix it? To solve this problem, change the max_files option in the QEMU configuration, because the default value is 1,024. You should calculate an appropriate value for your environment; I made a simple formula for estimating it. You need three numbers: A is the number of FDs for a new, idle VM, B is the volume quota for a single VM, and C is the number of FDs per volume in your environment. The formula is A plus B multiplied by C. For example, if the FD count for a new VM is 150, the volume limit for a single VM is 10, and the FD count per attached volume is 100, then you should set max_files to more than 1,150. After that, you should restart the libvirtd service.

Next, RBD cache. This is not an issue, just a tip. RBD cache is a very important feature when you use Ceph block devices. The Ceph OpenStack guide, which many of you have seen, says to edit the Ceph configuration on every compute node to set rbd cache like this. But in fact, I found a comment on a blog saying this option isn't required, at least in OpenStack. This is Sébastien Han's blog; I think he is a very famous guy in the Ceph field. He says the cache mode option in nova.conf affects every block device, so you just need to edit nova.conf, not ceph.conf. The option is just disk_cachemodes = "network=writeback" in the [libvirt] section. After restarting nova-compute, it adds a cache parameter to the VM, which you can check in the dumped XML of the VM. What is different? Here is the before and after XML for the VM: you can see the cache option is different. And this is the performance result when you activate the cache option: the difference is about 20 times. This is why RBD cache is highly recommended.
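As a rough compute-node sketch of both tips (the max_files value follows the example formula above, and exact paths or service names may differ by distribution):

    # /etc/libvirt/qemu.conf -- raise the per-VM file descriptor limit
    # A + B x C = 150 + 10 x 100 = 1,150, so round up with some headroom
    max_files = 2048
    # then: systemctl restart libvirtd

    # /etc/nova/nova.conf -- enable writeback caching for network (RBD) disks
    [libvirt]
    disk_cachemodes = "network=writeback"
    # then restart nova-compute and verify with: virsh dumpxml <instance> | grep cache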
Next, we'll discuss a performance issue caused by the TCP congestion control algorithm. We use CentOS in our environment. We upgraded CentOS from 7.1 to 7.2, and the kernel was also upgraded. However, we found that throughput at large block sizes dropped to 40% between CentOS 7.1 and 7.2. So we traced the problem and found that it is influenced by the change of the TCP congestion control algorithm. The cause of the problem appears when a TCP congestion control algorithm containing the stretch ACK patch is used in a situation where a large-packet workload is loaded and the network is busy. Currently, the default TCP congestion control algorithm in the 7.2 version is cubic, and the cubic and reno algorithms actually implement the stretch ACK patch. So we tested the other TCP congestion control algorithms, which do not contain the stretch ACK patch, in CentOS 7.2. The test results showed that the highspeed and reno algorithms are similar to 7.1, or even better for large blocks; these are relative values compared to the cubic algorithm in 7.1. If you want to maximize performance for large blocks, it's worth reconfiguring like that.

Yeah, the last tip is about OSD host failure. Let's imagine an OSD marked-out situation happens. The reasons could be disk failure or file system errors, et cetera. We normally remove the disk to resolve the problem and add a new disk into the Ceph cluster. How about an OSD host failure? The reason could be an electrical problem or a network problem; anyway, it's not a disk failure problem, so we don't have to remove the data from the disks. Therefore, we want to configure Ceph not to proceed with rebalancing. Ceph has a subtree limit concept: the subtree limit is the smallest CRUSH unit type that Ceph will not automatically mark out, and the default value is rack. In order to change the value from rack to host, we can add the configuration like that, or configure it at runtime through that command.

OK, our presentation is finished. Performance, HA, migration, and data replication are very important things for operating a storage service. If you want to operate a storage service with Ceph, there will be a lot of things to pay attention to ahead, but we hope you keep going, and we hope that the community will share better experiences and insights in the future. Thank you. Thank you. Actually, the time is over. If anybody has any questions, please come to us. Thank you.