Good afternoon, everyone. I'm Masahaki Nakagawa, an OpenStack Swift engineer at NTT Data Corporation. Thank you for coming to our presentation. Today we'd like to share our know-how from deploying and operating NTT Docomo's mail cloud system, powered by OpenStack Swift (with some details abstracted). Docomo mail is a 24/7 cloud mail system accessed by over 20 million people. It stores users' mail archives in an OpenStack Swift cluster with petabyte-scale capacity, deployed by NTT Data. We have operated this service successfully since September 2014 without any downtime. In this session, we will present the actual issues and challenges we have faced and overcome.

These are the contents of our presentation. There are four things we'd like to cover today. First, Mr. Kakehi will give the project overview: the changes in the Japanese mobile market and an outline of this project. Second, I'll talk about the process of migrating to Swift in the existing Docomo mail system. Then Mr. Kasai will talk about the Swift technical challenges. Finally, I'll talk about large-scale Swift operation. This presentation will take about 40 minutes, with a Q&A session at the end. If there is no time left, please come to the NTT booth and discuss with us.

Okay, let's start with the project overview. Please, Mr. Kakehi.

Hi, my name is Sosuke Kakehi. I am delighted to have the opportunity to address you today. I'd like to talk briefly about this project, which was a very big challenge for our customer and for us. There are three points I want to make in this project overview, and I hope our session will be helpful to you. First of all, do you know NTT Docomo? NTT Docomo is a leading company in the Japanese mobile phone market, providing a very wide range of stable mobile multimedia services. Our customer, NTT Docomo, has provided a cloud mail service since October 2013.
This cloud mail service supports multi-device access, over 20 million people are using it, and it is backed by OpenStack Swift. The system has two types of storage. One is high-performance, very expensive storage for recent mail data. The other is highly scalable, highly available object storage for archived mail data, and yes, this is Swift. We must keep the archived mail data almost permanently in the cloud mail service, while recent mail data needs fast response times because it affects the user experience. So we choose the storage tier according to the data's characteristics.

This slide shows the system scale. The storage nodes are placed across three sites, and each site is more than 300 kilometers away from the others. The Swift cluster has over 6.4 petabytes of capacity and is composed of hundreds of servers, so it is a super-large-scale, super-wide-area storage system.

We haven't yet told you why we chose OpenStack Swift, so let me explain the project background. Previously, everyone was using feature phones. However, the market changed significantly with the advent of iPhone and Android; now smartphones are popular, and many services support multiple devices. What about mobile mail content? There is no major difference in the types of email content, but the data size of each item has increased, so we need more and more storage capacity for the cloud mail service. So what is our best storage choice? High-end storage has high performance, but unfortunately it has limits on how far it can be extended, and a very high cost. We needed to consider other storage choices with flexible scalability, high reliability, and lower cost. We started studying this system from that background. The customer's requirements were: high availability, disaster recovery, high scalability, low cost, and so on.
So clearly, we needed storage software implemented with OSS on IA (commodity) servers, and we found OpenStack Swift. We confirmed the feasibility of these requirements against the Swift source code, came to understand the Swift architecture, and tested, tested, and tested. Finally, Swift was adopted. That is the project overview. Let's move on to the next section. Thank you very much.

Thank you, Mr. Kakehi. Okay, let's start the migration session. NTT Docomo launched the Docomo mail service in October 2013, and Swift was installed into the Docomo mail system in January 2015. While we migrated to Swift, Docomo mail never stopped serving users. In this session, I'd like to introduce the overall Docomo mail system and the migration process. Sorry, there are many highly confidential things here, so some points are abstracted.

I'd like to start with an overview of the Docomo mail system. User mail is stored in back-end storage. When a front-end server receives a user request, it picks up the user's mail from the back-end storage. There are two types of back-end storage. One is high-speed block storage, which stores recent user mail. The other is large-scale Swift, which stores archived user mail. This Swift cluster has over six petabytes of capacity.

This is the user mail access flow. When a user sends or receives new mail, it is stored in block storage first. By using block storage, we achieve mail sending and receiving with low latency. After a while, some user mail is archived and moved to Swift. When users access their mail, the Docomo mail front-end server selects block storage or Swift as appropriate and returns the mail. That is the system overview.

Next, I'd like to talk about the migration process. Before Swift was installed, both archived and non-archived mail was stored in block storage. When we were deploying Swift, the Docomo mail service had already started, and many users were already accessing it.
This means Docomo mail was already very important IT infrastructure in Japan, so we needed to deploy Swift and move archived user mail to it with no service downtime. To achieve this, we divided the migration into four steps.

In the first step, we deployed Swift and ran system integration tests, failure tests, and parameter tuning. This step required a quick deployment and many test cases, so we made use of Puppet for automated deployment and Tempest for automated testing. Mr. Kasai will introduce the details in the technical session.

In the second step, after deploying and testing Swift, we started copying archived mail from block storage to Swift. In this step, we copied only test users' archived mail; general users' mail was not copied. That means test users accessed Swift to get their archived mail, while general users still accessed block storage. We operated in this state for about five months and ran long-term stabilization tests.

In the third step, after the test users' archive copy and the long-term test operation were done, we started copying general users' archived mail to Swift. As a safeguard against Swift trouble, we decided to keep archived mail in both block storage and Swift during this step. If Swift had failed, we could easily have switched back from Swift to block storage.

The last step was the system durability test and the service launch. Before starting the service, we needed to check that the system could stay in service during the high-traffic season, so we ran the durability test on New Year's Day. Many Japanese people send New Year's greeting mail, which creates very high traffic, so it is a good occasion for a durability test. After New Year's Day, Swift had passed the durability check, so we removed archived user mail from block storage and Swift went into full service. With that, all the migration steps were finished.
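The safeguard in step three, keeping archived mail in both Swift and block storage, makes the read path naturally fault-tolerant. A minimal sketch of that idea in Python follows; the class and method names are hypothetical illustrations, not Docomo mail's actual code:

```python
from typing import Optional


class ArchiveReader:
    """Sketch of a step-3 read path: archived mail exists in both Swift
    (primary) and block storage (safety copy), so a Swift fault can fall
    back transparently. Names here are illustrative, not the real system."""

    def __init__(self, swift, block):
        self.swift = swift    # primary: Swift object storage
        self.block = block    # duplicate kept during migration: block storage

    def get_archived_mail(self, user: str, mail_id: str) -> Optional[bytes]:
        try:
            # Normal path: serve the archive from Swift.
            return self.swift.get(user, mail_id)
        except Exception:
            # Swift trouble: serve the duplicate still kept in block storage.
            return self.block.get(user, mail_id)
```

Because both copies exist during step three, this fallback costs nothing extra; once the durability check passes and the block-storage copies are removed, the except branch simply never has anywhere to fall back to, matching the final migration step.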
To conclude the migration session: originally, Docomo mail had only block storage, and we needed to deploy Swift and migrate to it with no downtime. To achieve that, we divided the migration into four steps: first, deployment and testing; next, copying test users' mail to Swift; then copying general users' mail to Swift while keeping the block-storage copy; and finally, the system durability check. Through this migration, we achieved no service downtime. As I said, along the way we overcame some technical challenges, which Mr. Kasai will introduce next.

I'm Kasai. Let's start the technical session. In this project, there were three big technical challenges, which I show on this slide. I will explain these challenges in more detail, along with our solutions.

The first challenge is assuring the data durability of Swift. Japanese customers are often very sensitive about the quality of their systems. Our system is a mission-critical system, so we are extremely sensitive about it. Everything should be under control: we must design the system's behavior not only in normal situations but also in failure situations. In Swift, however, it is not so easy to design all of its behavior, because Swift is a distributed system in which many components on many servers cooperate to provide the whole function. Still, we had to analyze every behavior before building the system. To solve this problem, we decided to run recovery tests. We made hundreds of test cases based on three aspects. The first is the point of failure: disk, NIC, process, or node. The second is the number of failures: for example, one disk, two disks, three disks. The third is the range of failures: for example, one node, multiple nodes, a zone, a region. As a result of these recovery tests, we confirmed Swift's extreme durability, availability, and recoverability. Swift keeps its data and continues to work well.
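The three aspects above (point of failure, number of failures, range of failures) combine into a matrix of recovery test cases. A minimal sketch of enumerating such a matrix follows; the axis values are illustrative stand-ins for the slide's examples, not our actual test plan:

```python
from itertools import product

# Illustrative axes taken from the talk's examples, not the real test plan.
POINTS = ["disk", "nic", "process", "node"]                 # what fails
COUNTS = [1, 2, 3]                                          # how many fail
RANGES = ["one node", "multiple nodes", "zone", "region"]   # how widely


def recovery_cases():
    """Enumerate every (point, count, range) combination as a named case."""
    return [
        f"fail {count} x {point} across {scope}"
        for point, count, scope in product(POINTS, COUNTS, RANGES)
    ]


cases = recovery_cases()
# 4 points x 3 counts x 4 ranges = 48 combinations per scenario template;
# combined with different workloads, this is how "hundreds of test cases" arise.
```

In practice each combination still needs an expected behavior (data readable? replication catches up? alerts raised?), which is why the analysis before building the system mattered as much as the enumeration.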
Unless there is a sniper who accurately destroys the three hard disk drives storing the same data out of thousands of disks, or a major disaster that takes out all of your data centers at once, Swift keeps your data and keeps working.

Our second challenge is global distribution. Disaster recovery was required in this project, so we decided to distribute Swift's functions and data over multiple data centers, each more than 300 kilometers away from the others. You can keep and access your data even if one of the sites goes down because of an unexpected disaster. With the placement decided, we had to check whether Swift works well in such a distributed configuration. We had two points to evaluate. The first is client requests: when a client sends a request to Swift, the proxy server talks to the storage servers to order the operation and transfer data, and in a global configuration, the latency between proxy and storage may affect that conversation. The second is durability: in Swift, the storage nodes talk to one another to ensure all copies are stored, and in a globally distributed cluster, they have to talk over a network with latency.

To test a globally distributed Swift cluster, we constructed a pseudo-global cluster with simulated network latency. We varied the simulated latency from 10 milliseconds to 200 milliseconds and checked how Swift's behavior changed. With the pseudo-global cluster, we tested two things. First, to ensure that Swift can serve client requests properly, we tested object PUT, GET, and DELETE, checking health by error rate and performance by turnaround time per request and by throughput. Second, to ensure durability, we tested the auto-recovery feature of the object replicator, which recovers objects after disk failures, again checking health by error rate and performance by the turnaround time of one sync process and by throughput. Here are the results of our global cluster testing for client requests.
There were no errors caused by latency; Swift works well on a network with latency. Response time degrades as the latency grows, but you can still get effective throughput with concurrent requests. So we make our applications send concurrent requests to Swift to achieve effective throughput. Next, the results of the object replicator testing. The result is very similar to the client-request result: there were no errors caused by latency, and the performance of a single sync process degrades as the latency grows, but you can achieve effective throughput with concurrent sync processes.

The third challenge is quality. We are never satisfied with saying "everything seems to work well"; we want to say "everything works well." We looked at quality from two aspects: software quality and system quality. First, our solution for software quality is source code analysis. We read all the source code and tested all the processes, and we customized Swift, contributing some patches upstream and maintaining some original patches. Next, our solution for system quality is automated testing, in two aspects: APIs and nodes. We built an automated testing tool based on Tempest. With this tool, we can test all responses from all Swift nodes, including not only normal responses but also error responses, such as client errors and server errors. With the testing tool, we can show that Swift works properly, which assures the quality of the system we built.

So we solved all three challenges. To ensure data durability, we ran recovery tests over a variety of failure patterns. Second, to realize a globally distributed cluster, we ran performance tests of the front end and back end with a pseudo-global Swift cluster. Third, to realize quality, we did source code analysis and automated testing. Okay, next is the operating session.

Thank you, Mr. Kasai. Let's start the operating session. First, an overview.
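The concurrency effect seen in both results, where per-request latency hurts turnaround time but parallel requests recover aggregate throughput, can be illustrated with a toy simulation. The 50 ms "network latency" below is just a sleep standing in for a round trip, not a real Swift cluster:

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.05  # simulated 50 ms round trip per request (illustrative)


def fake_swift_request(obj_id: int) -> int:
    """Stand-in for one PUT/GET where wire latency dominates the cost."""
    time.sleep(LATENCY)
    return obj_id


def run_sequential(n: int) -> float:
    """Issue n requests one after another; total time is about n * LATENCY."""
    start = time.monotonic()
    for i in range(n):
        fake_swift_request(i)
    return time.monotonic() - start


def run_concurrent(n: int, workers: int) -> float:
    """Issue n requests in parallel; the latencies overlap instead of adding up."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_swift_request, range(n)))
    return time.monotonic() - start


seq = run_sequential(8)     # roughly 8 x 50 ms
par = run_concurrent(8, 8)  # roughly 1 x 50 ms: latency overlapped
```

The turnaround time of each individual request is still about 50 ms either way; only the aggregate throughput improves, which matches what the pseudo-global cluster tests showed for both client requests and the replicator.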
The operating scheme of Docomo mail itself is, sorry, highly confidential, so instead I'd like to introduce the operation of NTT Data's Swift solution; the Docomo mail system uses the NTT Data Swift solution with customization. Okay, let's start.

Swift is composed of many nodes. As the number of Swift nodes increases, the operating workload and operating hours increase too. For example, system tuning requires a lot of work to change parameters on many nodes. Another adverse effect of a large-scale system is trouble frequency: operators of a large-scale system have to respond to faults much more frequently, and if you scale Swift out, this problem becomes more serious. System operators want to reduce operating costs as much as possible, but a private Swift cluster tends to work against that need. To solve this problem, NTT Data has built up know-how for reducing Swift operating costs. From the next page, I'd like to give an overview of some of that know-how.

The first point is reducing the operating workload. We use some well-known operating tools; the pictures on the slide are examples. To rebuild failed nodes or build scale-out nodes easily, we use PXE boot and Kickstart. To access many nodes in parallel, we use pssh and pscp. To do system tuning without mistakes, we use SVN for configuration management and Puppet for configuration deployment. These tools are famous and used in many projects, so we can rely on them; I think you know them all.

Next, I'd like to introduce how to reduce work frequency. The tools I have just introduced, like pssh and pscp, can reduce the work amount but cannot reduce the work frequency. You could reduce it with redundancy technology such as RAID, but that comes at a cost. As a countermeasure, we formulated a pending scheme for low-emergency faults: we considered a workaround for every anticipated system fault and assigned each one a priority. If a system fault defined as low priority occurs, the operator can suspend the recovery work.
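That pending scheme can be sketched as a simple triage table: each anticipated fault type has a pre-decided priority, and low-priority faults are queued for the next maintenance window instead of paging the operator immediately. The fault names and priorities below are illustrative, not our actual runbook:

```python
from enum import Enum


class Priority(Enum):
    IMMEDIATE = 1   # page the operator now
    DEFERRED = 2    # queue for the next periodic maintenance window


# Illustrative mapping: every anticipated fault has a decided priority.
FAULT_PRIORITY = {
    "proxy_node_down": Priority.IMMEDIATE,    # client traffic is affected
    "zone_unreachable": Priority.IMMEDIATE,   # durability margin shrinks
    "single_disk_failed": Priority.DEFERRED,  # replicas remain; Swift self-heals
    "one_process_restarted": Priority.DEFERRED,
}


def triage(fault: str) -> Priority:
    """Unknown faults default to IMMEDIATE: only pre-analyzed ones may wait."""
    return FAULT_PRIORITY.get(fault, Priority.IMMEDIATE)
```

The key design choice is that deferral is only safe for faults whose workaround was analyzed in advance, which is exactly why the recovery testing described earlier had to come first.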
This lets operators reduce their work frequency: for example, low-priority system faults are recovered during periodic maintenance operations. But that alone is not enough to reduce the operating workload, because the operator still has to check and decide whether to suspend work every time an alert occurs. To reduce the operators' load further, we changed the checking scheme for some low-priority monitoring items to a periodic performance check. The operator checks service health using process performance information. We can check the running time of background processes using Swift recon information. For example, for the audit process, the operator can check the auditor's run times via recon; if the run time is within the usual range, the operator can judge the process condition green. With this scheme, we have been able to reduce the operators' workload. Of course, the set of monitoring items handled this way can be customized.

To conclude the operating session: Swift is composed of many nodes, so Swift system operating costs tend to be high. NTT Data has know-how to reduce Swift operating costs: parallelized operation tools, customized monitoring priorities, and changing some monitoring items to periodic checks. The Docomo mail service uses the NTT Data Swift solution with customization, and NTT Docomo achieved a 60 to 70 percent reduction in TCO (total cost of ownership) over five years.

Okay, the conclusion of this presentation. We introduced the usage, challenges, and operation of OpenStack Swift in the Docomo mail service system: system migration with no service downtime, three technical achievements, and reduced operating costs. Docomo mail has been in service with no downtime. Okay, that's the end. So, do you have any questions or comments? Sorry, our English skills vary, so please speak slowly.

Okay. My name is Pete Zaitcev, and I work for Red Hat, where I work on Swift as a core reviewer. So, I have three questions, if I may. Okay. Daijoubu? Hi. Ah.
So, in Mr. Kasai's example, there was only one proxy server, but the geographic distribution of object servers was in place. Do you actually distribute proxy servers, or was that just a test example?

Once again? Which one? Hi. Correct.

Okay. So, in this picture, you only have one proxy. If that proxy goes down, what happens?

Yeah, thank you for your question. That diagram shows our test configuration. In the actual system, we have proxy servers at other sites as well, at site three and site four, so if one proxy site goes down, traffic switches to the proxies at the other sites.

Oh, okay. Understood. So, you have a load balancer that does the switching between proxies, right?

Yes, yes.

Okay. So, if I may, second question. Could you provide actual cluster numbers, at least approximately, of how big the cluster is? I know many companies consider it a secret, but at least Rackspace released some approximate numbers so that we know...

I'm sorry, it's highly confidential.

Ah. I'm sorry. No. All right.

So, final question. I noticed the biggest downside you mentioned was the high cost of operators, the human operation of the cluster. Did you consider asking the community for any specific improvements that would help you? If yes, how did that go?

Sorry about the English. Okay. We have had many troubles and many challenges, but...

Well, maybe I can explain a little bit. There was a list of patches, and I recognize some of those patches; I reviewed them. But those patches were relatively small in scope, like adding process-name checking and things like that. If you have any ideas about a bigger change to Swift, that is also welcome. I was wondering if you already have any ideas that would have a bigger impact on reducing the human cost, the operator cost, of Swift.

In my understanding, your question is which ideas I have to reduce operating cost, okay?

Yes. Okay.
Actually, with our scheme, we can reduce operating costs dramatically. There is a trade-off between the number of monitoring items and the operating cost. To reduce... okay, sorry, I can talk about it separately.

I thought maybe you already had some wish-list items that you could share. Okay. Well, thank you very much.

Sorry. Hi, my name is Mariano Cunietti. I'm the CTO at entercloudsuite.com, an Italian cloud service provider. I was very interested to understand, if you can share... am I going too fast? Okay. I would like to understand what kind of software you use to move emails from recent mail to archived mail. Do you use Dovecot as the mail server, and do you use any proprietary plugin, or did you write it yourselves?

Thank you for the question, and sorry, how archived mail is moved to Swift is one of the confidential points.

Okay, so you cannot share anything about this? You cannot share any of the information?

Yes, we cannot share it. Okay.

Thank you so much for sharing your experiences. I'm wondering if you worked with any service providers or consulting firms to help guide you through the deployment process of OpenStack.

As a customer?

I mean, did you work with any consulting firms to help you deploy OpenStack, or did Docomo figure this out by itself?

Thank you for the question, and sorry, time is almost up. We have worked with some companies on deploying OpenStack, such as Kirin. Kirin is a very big, famous company, and there is a Kirin session on Thursday.

Oh, so NTT Data is the consulting firm? Oh, okay. Okay, gotcha. Thank you so much.

Yeah, sorry. Thanks. Time is up, so thank you for coming, and next time we'd like to improve our English!