Hello, let's get started. Does that work? Yeah, it does. Good. Okay, so welcome everyone, and thanks for coming to the session. I'm the CTO of Quobyte; my name is Felix Hupfeld. It's my pleasure to introduce Yusuke and Akira of Yahoo Japan, who will share their experiences and plans in building a large-scale OpenStack infrastructure. You're probably not aware of it, but you may guess it from the name: Yahoo is one of the biggest internet service companies in Japan, basically serving the whole country, and that is also the scale their infrastructure is at. I think you'll get a lot of insight into what's relevant when running infrastructure at that scale. So please help me welcome Yusuke and Akira.

Hello, my name is Yusuke Sato, and I'm an infrastructure engineer at Yahoo Japan. I'm responsible for private cloud compute and storage, and this is Akira Kamio. He is an OpenStack engineer, working especially on hypervisors and software storage. Today we will talk about our infrastructure challenges and software storage.

Let's start with an introduction of our company. Yahoo Japan is one of the largest internet companies in Japan, and we have many services, for example a search engine, e-commerce, mail, news, finance, and more. There are over 100 services, and our services are used by over 80% of Japanese internet users.

These are our data centers. We have several data centers in Japan, and we also have one data center in the United States. We are operating over 75,000 physical servers and over 120,000 virtual machines, and our storage systems have over 60 petabytes of capacity in total. These pictures show our server rooms in Japan and in the United States.

In our environment, many services run on our private cloud systems. Basically, our private cloud systems are OpenStack clusters, and now we are starting to use Kubernetes as one of the platforms for our services. Our Kubernetes clusters run on OpenStack VMs and on bare-metal servers managed by Ironic. We have over 17 OpenStack clusters and, as I mentioned, a great many instances. These clusters are operated by fewer than 20 engineers.

To operate clusters at this scale, we have to reduce operational costs, so we are working toward automation and improved efficiency. For example, cluster provisioning is automated with Chef, and we developed a chat bot so that we can take actions through chat. Also, we have centralized system logs and metrics, which lets us monitor and visualize everything through a single dashboard. This is our ongoing effort to make operations more efficient.
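A chat-ops flow like the one just mentioned can be pictured with a small sketch. Everything here is hypothetical (the command name and the `disable_compute_node` helper are invented for illustration, not Yahoo Japan's actual bot); it only shows the shape of a bot that maps chat commands to infrastructure actions.

```python
# Minimal chat-ops sketch: map chat commands to infrastructure actions.
# Command names and helpers are hypothetical placeholders.
import shlex

def disable_compute_node(cluster: str, node: str) -> str:
    # In a real bot this would call an infrastructure API, e.g. disable
    # the nova-compute service on the node before maintenance.
    return f"nova-compute disabled on {node} in cluster {cluster}"

COMMANDS = {"disable-node": disable_compute_node}

def handle_chat_message(message: str) -> str:
    """Parse a chat message like '!disable-node cluster01 compute-042'."""
    if not message.startswith("!"):
        return ""  # not a bot command
    cmd, *args = shlex.split(message[1:])
    handler = COMMANDS.get(cmd)
    if handler is None:
        return f"unknown command: {cmd}"
    return handler(*args)

print(handle_chat_message("!disable-node cluster01 compute-042"))
```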
Next, I'd like to talk about our storage complexity. We have many private cloud clusters, and we also have many storage systems. These storage systems are dedicated to each cluster and operated independently, which is inefficient. Our challenge is to reduce storage operational costs, and our answer is a transition to software-defined storage.

Today we mainly use storage appliances, because they can withstand heavy workloads and provide high availability, but they are expensive and not flexible. So we are testing software-defined storage. Software storage can run on commodity hardware, so it is more flexible and scalable. We started trying SDS two years ago and tested several software storage products, and then we chose Quobyte as our SDS. Why did we choose Quobyte? The most important reason is reducing operational costs.

Quobyte is a unified storage system: we can use Quobyte as the back end for object storage, Cinder block storage, Manila file storage, and Kubernetes persistent volumes. This reduces the storage complexity of our system, which makes it easier to operate.

How can we use Quobyte as a multi-storage back-end system? Quobyte does not speak multiple storage protocols like block and object; it is file storage only. I will explain how each back end behaves.

First, OpenStack Manila. This one is simple: Quobyte volumes are created on user request via the OpenStack Manila API, and virtual machines mount them directly as file storage. Quobyte provides two methods to mount the volumes: NFS and the Quobyte native client protocol. The Quobyte native client protocol is based on FUSE. If you use NFS, virtual machines mount Quobyte volumes through the Quobyte NFS proxy service.

Next, OpenStack Cinder. This works like the Cinder NFS driver: Quobyte volumes are mounted by the hypervisors as file storage through the Quobyte native client. Files are created on the Quobyte volumes and mapped to VM instances as block devices via QEMU. There is a slight performance overhead compared with a direct block attach, but we can handle these Cinder volumes as files, which is easy to operate.

Next, Quobyte as object storage. Quobyte has an S3 proxy service, which allows hybrid access to the Quobyte file system through the S3 protocol. Each virtual machine accesses the S3 proxy server, and we can use OpenStack Keystone for user authentication.

Finally, I'd like to talk about persistent volumes for Kubernetes. Kubernetes already has a plug-in to integrate Quobyte volumes as persistent volumes. Quobyte volumes are mounted natively by the Kubernetes nodes, and pods can access Quobyte volumes as ReadWriteMany persistent volumes.
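To make the Cinder-style behavior described above concrete, here is a minimal sketch of a file-backed block volume. The mount point is a hypothetical path, and this mirrors what an NFS-type Cinder driver does conceptually, not the actual driver code.

```python
# Sketch of how a file on a mounted share becomes a VM block device,
# in the spirit of Cinder's NFS-style drivers. Paths are hypothetical.
import subprocess

MOUNT_POINT = "/mnt/quobyte/cinder-volumes"  # share mounted by the native client

def create_file_backed_volume(name: str, size_gb: int) -> str:
    path = f"{MOUNT_POINT}/{name}"
    # qemu-img creates a sparse raw file that QEMU can expose as a disk.
    subprocess.run(
        ["qemu-img", "create", "-f", "raw", path, f"{size_gb}G"],
        check=True,
    )
    return path

# The hypervisor then attaches the file to the guest, conceptually:
#   qemu-system-x86_64 ... -drive file=/mnt/quobyte/cinder-volumes/vol1,format=raw
# which is why there is a slight overhead versus a direct block attach.
print(create_file_backed_volume("vol1", 10))
```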
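For the S3 proxy access pattern, a standard S3 client can simply be pointed at the proxy endpoint. In this sketch the endpoint URL and credentials are placeholders; Keystone-issued EC2-style credentials are one common way such authentication is wired up, stated here as an assumption rather than the exact setup.

```python
# Sketch: accessing files through an S3 proxy with a standard S3 client.
# Endpoint and credentials are placeholders, not real values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3-proxy.example.internal",  # the S3 proxy service
    aws_access_key_id="EC2_CREDENTIAL_ACCESS",          # e.g. issued via Keystone
    aws_secret_access_key="EC2_CREDENTIAL_SECRET",
)

s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"hello from a VM")
print(s3.get_object(Bucket="my-bucket", Key="hello.txt")["Body"].read())
```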
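And on the Kubernetes side, requesting a ReadWriteMany persistent volume looks roughly like the following sketch; the storage class name is a placeholder for whatever the Quobyte provisioner is registered as in a given cluster.

```python
# Sketch: requesting a ReadWriteMany persistent volume claim.
# The storage class name is a placeholder, not a real cluster's value.
from kubernetes import client, config

config.load_kube_config()
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "shared-data"},
    "spec": {
        "accessModes": ["ReadWriteMany"],  # many pods can mount read-write
        "storageClassName": "quobyte",     # placeholder provisioner name
        "resources": {"requests": {"storage": "10Gi"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)
```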
Next, I'd like to explain the current status of introducing Quobyte into our private cloud. We have already released a Quobyte cluster for Kubernetes persistent volumes. This Kubernetes cluster is small and in beta; so far, so good. And now we are building two Quobyte clusters for OpenStack Cinder and Manila. These clusters are large-scale and require high performance, so we are designing the system configuration carefully. The configuration details will be described later.

Before getting to our system configurations and the benchmark results of Quobyte, I'd like to talk about the challenges of introducing and operating SDS in general. These challenges are true not only of Quobyte. We have tested and operated several software-defined storage products up to now, but we think SDS is still harder to operate than storage appliances, because SDS is not mature yet. On this point Quobyte also had some bugs, but they were fixed quickly and carefully. That was a good point, and it is one of the reasons why we chose Quobyte.

But we think the difficulty of SDS comes from the fact that SDS is a type of distributed system. In appliance storage, internal traffic such as rebuild and regeneration traffic, as well as the control packets between nodes, goes through a high-speed interconnect or backplane. In a distributed system, that massive traffic occurs on Ethernet as east-west traffic. So we need a high-bandwidth, highly available, low-latency network. Through past experience operating several SDS products, we understood that the network is the most important piece of SDS.

We therefore designed the system configuration and the monitoring carefully. Our configuration is designed for a high-bandwidth network to carry massive east-west traffic; the details will be touched on later. And we set up network monitoring especially suited to distributed systems: we monitor the error counts of the NICs and the ping latency between nodes. We referred to the Pingmesh paper written by Microsoft; it is used for Azure. It is a simple and effective monitoring method for distributed systems. If you are interested, please check the paper.

The figures on this slide show the mesh ping monitoring in normal status and in error status. All nodes ping each other, and each point is the result of a ping latency measurement: a green point is normal status and a red point is high latency. The right figure shows a single-node error. If a network switch were broken, more red points would appear in the figure. This way we can find abnormal network status much faster.

So now I'd like to hand over to Akira Kamio. He will talk about the details of our configurations and the benchmark results.

I will explain the Quobyte cluster we are now building. This is our system overview: the storage nodes and compute nodes are on closed networks. These are our server configurations for storage nodes and compute nodes. Each storage node has a dedicated NVMe JBOF, and the JBOF provides 15 NVMe drives to each storage node. Compute nodes only have a SATA SSD for booting.

This slide shows the network configuration. I will explain how the application traffic, the Quobyte server-to-client traffic, and the server-to-server traffic each flow.

This is the application traffic. This network is used only by the guest processes. Physically, this is a single 25 GbE port, and it is not redundant.

Next, this is the server-to-client traffic of Quobyte storage. Compute nodes have a single 25 GbE path. Storage nodes have 50 Gbps of bandwidth, configured as a LAG of two physical 25 GbE ports. The network bandwidth between compute nodes and storage nodes is almost equal to that of accessing local NVMe drives.

This is the server-to-server traffic of Quobyte storage. Storage nodes have a 50 Gbps link, with 2 x 25 GbE configured as a LAG.

Next, we will touch on the connection between the storage nodes and the JBOF. The network-to-host bandwidth is 50 Gbps, and the host-to-JBOF connection is a 16-lane PCIe Gen3 bus, which means about 31 GB/s of bandwidth. This is good for performance, because the JBOF's bandwidth is much larger than the network bandwidth.

So far I have talked about the system configurations, but I think you are concerned about performance. This is our test environment; the storage nodes and compute nodes are as described on this slide. The test environment has 18 storage nodes and 70 compute nodes. We used FIO as the benchmark tool. The benchmark patterns are sequential and random reads and writes with several block sizes.
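The mesh-ping monitoring described earlier can be sketched in a few lines: every node pings every other node, and high-latency pairs are flagged (the red points on the slide). The hostnames and the latency threshold below are placeholders, and a real deployment would render the results as a heat map rather than print them.

```python
# Minimal mesh-ping sketch: every node pings every other node and we
# flag high-latency pairs. Hostnames and threshold are placeholders.
import subprocess
from itertools import permutations

NODES = ["node01", "node02", "node03"]  # placeholder inventory
THRESHOLD_MS = 1.0                      # placeholder latency threshold

def ping_ms(src: str, dst: str) -> float:
    # Run one ping from src to dst via ssh and parse the round-trip time.
    out = subprocess.run(
        ["ssh", src, "ping", "-c", "1", "-W", "1", dst],
        capture_output=True, text=True,
    ).stdout
    for token in out.split():
        if token.startswith("time="):
            return float(token[len("time="):])
    return float("inf")  # a lost ping counts as worst case

for src, dst in permutations(NODES, 2):
    rtt = ping_ms(src, dst)
    status = "OK" if rtt <= THRESHOLD_MS else "HIGH"  # green vs. red point
    print(f"{src} -> {dst}: {rtt:.2f} ms [{status}]")
```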
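The bandwidth comparison between the LAG network links and the PCIe-attached JBOF can be sanity-checked with a little arithmetic. The PCIe figures below are the standard Gen3 numbers (8 GT/s per lane with 128b/130b encoding), used here as assumptions.

```python
# Back-of-the-envelope check of the link bandwidths mentioned above.
lag_gbps = 2 * 25          # two 25 GbE ports in a LAG = 50 Gbps
lag_gbytes = lag_gbps / 8  # ~6.25 GB/s on the network side

# PCIe Gen3: 8 GT/s per lane, 128b/130b encoding -> ~0.985 GB/s per lane,
# per direction. A x16 link therefore carries ~15.75 GB/s each way,
# ~31.5 GB/s counting both directions.
pcie_per_lane = 8 * 128 / 130 / 8
pcie_x16_one_way = 16 * pcie_per_lane
pcie_x16_bidir = 2 * pcie_x16_one_way

print(f"network (LAG): {lag_gbytes:.2f} GB/s")
print(f"JBOF (PCIe Gen3 x16): {pcie_x16_one_way:.2f} GB/s per direction, "
      f"{pcie_x16_bidir:.1f} GB/s bidirectional")
# The JBOF link is several times wider than the network link,
# so PCIe is not the bottleneck.
```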
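A benchmark sweep like the one described, sequential and random reads and writes over several block sizes, might be driven as in the following sketch; the target path and job parameters are illustrative, not the exact values used for the results in the talk.

```python
# Sketch of an FIO sweep over IO patterns and block sizes.
# Target path, runtime, and queue depth are illustrative only.
import json
import subprocess

TARGET = "/mnt/quobyte/bench/testfile"  # placeholder mount point

for rw in ["read", "write", "randread", "randwrite"]:
    for bs in ["4k", "128k", "1m"]:
        result = subprocess.run(
            ["fio", "--name=bench", f"--filename={TARGET}",
             f"--rw={rw}", f"--bs={bs}", "--size=4g",
             "--ioengine=libaio", "--direct=1", "--iodepth=32",
             "--runtime=30", "--time_based", "--output-format=json"],
            capture_output=True, text=True, check=True,
        )
        job = json.loads(result.stdout)["jobs"][0]
        side = "read" if "read" in rw else "write"
        iops = job[side]["iops"]
        bw_mb = job[side]["bw"] / 1024  # fio reports bandwidth in KiB/s
        print(f"{rw:10s} bs={bs:5s} {iops:10.0f} IOPS {bw_mb:8.1f} MiB/s")
```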
This is the read IO performance from a single compute node. We compared the performance of accessing a Quobyte volume with that of a local NVMe drive. Quobyte's sequential IO is lower than the local NVMe's; this is shown on the right side of the graph. But Quobyte has better performance for random IO at block sizes over 128 KB, because Quobyte is a distributed system. So we think Quobyte has good performance even compared with a local NVMe drive.

This is the write IO. Quobyte's sequential IO is also lower than the local NVMe's, but Quobyte has much better performance in almost all random IO patterns.

This is the multi-node read IO performance of the whole Quobyte cluster. Peak read IOPS is over 1,000K IOPS, and peak read bandwidth is over 40 gigabytes per second. That is much better performance than our legacy storage system. This is the write IO: peak write IOPS is over 300K IOPS, and peak write bandwidth is over 20 gigabytes per second. That is also more than we expected.

This slide shows how much impact a failure has on read and write performance, comparing a healthy Quobyte cluster with a cluster that has a single node down. The graph shows that read IO is 16% down at worst and write IO is 30% down at worst, but in most patterns there is not much difference between the healthy cluster and the degraded cluster.

Next, future work. We are still looking for a better cluster configuration with respect to the network architecture. We think MLAG and LAG are not ideal, because they result in vendor lock-in and a huge L2 network. We will try new configurations, for example a server-side L3 Clos network and a 100 GbE network. Currently the storage nodes have redundant paths using MLAG.

This is what we have in mind for the future cluster network architecture: each storage node runs a BGP client and exchanges routes via BGP. We think this is a simpler architecture than using MLAG. Another way to make the current configuration simpler is to use 100 GbE without redundancy. The advantage of this configuration is simpler cabling; however, it has the disadvantage that, from the storage system's perspective, a node will go down in the case of a single switch failure. As I described, in terms of performance we gained the confidence to use Quobyte in our production, but regarding the cluster network configuration, we are still working on finding out what is best.
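A rough sketch of the per-node BGP idea just described: each storage node announces its own service address to its uplink switches, so ECMP routing can provide the redundancy that MLAG provides today. The syntax rendered below is FRR-style, and all ASNs, addresses, and the helper function are invented for illustration, not the actual design.

```python
# Sketch: render a per-storage-node BGP config (FRR-style syntax).
# ASNs and addresses are invented placeholders for illustration.

def render_bgp_config(node_asn: int, loopback: str,
                      uplinks: list[tuple[str, int]]) -> str:
    lines = [f"router bgp {node_asn}"]
    for neighbor_ip, neighbor_asn in uplinks:
        # one eBGP session per uplink switch; ECMP replaces MLAG redundancy
        lines.append(f" neighbor {neighbor_ip} remote-as {neighbor_asn}")
    lines += [
        " address-family ipv4 unicast",
        f"  network {loopback}/32",  # announce the node's service address
        " exit-address-family",
    ]
    return "\n".join(lines)

print(render_bgp_config(
    node_asn=65101,
    loopback="10.255.0.11",
    uplinks=[("10.0.1.1", 65001), ("10.0.2.1", 65002)],  # two ToR switches
))
```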
Finally, I would like to summarize our presentation. First, Quobyte is unified storage. This means a single Quobyte cluster can be the back end for multiple OpenStack components, namely Cinder, Manila, and object storage. This makes our operational costs much lower. Second, Quobyte has much better performance than our legacy storage system. Finally, regarding support during the evaluation phase, we hit some software bugs, but the support team provided fix patches very fast, and they worked well. This gives us strong confidence to introduce Quobyte into our production systems. Thank you for your attention.

Are there any questions? We'd be happy to take them.

Why didn't you pick Ceph?

We are also using Ceph, in a non-production environment. But I think Ceph is not so stable, and Quobyte is more stable than Ceph.

So, there's a performance graph, and basically with one block size it was above 100%, and he's wondering why it gets faster when there's a node gone. In the end, I guess we have measurement variability anyway, so it's not taking the confidence interval into account. In the end, if you put that into production, you're over-provisioned anyway, so you're having this variability there. And I think the graph shows that it's mostly not taking too much impact.

You said you changed from some legacy storage system, and I asked what it was. Some major storage vendor's. Maybe you can say whether it's block or file; it's not a SAN? File storage. We are mainly using a very major NAS storage.

Do we have more questions? Otherwise, both Yahoo Japan and we are present, and we also have a booth in the exhibition hall, so you can come by anytime with more questions. So thank you very much.