Okay, so I think it's about time to start. Good morning everyone. I'm glad to see so many people attending this session, since it's our first session in the morning and we had a big party last night — thank you, everyone, for joining. I'm Shintaro Mizuno, a senior research engineer from the NTT Software Innovation Center, which is one of the R&D divisions within the NTT Group, and this is one of the engineers from our team, Takashi.

Thank you, Shintaro. Good morning everyone. My name is Takashi Natsume. I also work for NTT Corporation as a software engineer.

Okay, so our session is about our journey with OpenStack. We started our OpenStack project in 2011, four years ago, and we will show you what we did and what we are doing right now, based on the lessons learned during those four years. This is the outline of the presentation: first we will introduce ourselves and the NTT Group and what we are doing, then we will dig into what we did in our early deployment, then what we are doing in the current deployment, what we have given back to the community through our upstream activity, and our next steps.

So, introduction. Most of you in this room seem to be from Japan, so you may already know this, but for those of you who are not familiar with the NTT Group, let me briefly introduce it. The NTT Group is the largest telco in Japan, consisting of various subsidiaries providing many telecommunication services, including regional communications, long-distance communications, and internet service providers; we also have systems integration, mobile communications, and many other businesses. As for OpenStack, we have a variety of deployments across most of the major business companies, including NTT Communications, NTT Docomo, and NTT Resonant, as you may have seen in the keynote. NTT DATA provides system integration services, NTT Smart Connect has been doing a few trials, and NTT Innovation Institute (NTT i3), which is our R&D company in the US, is doing its own R&D. In this talk I'm not going to cover everything; I'm going to focus on our earliest deployments, which were our R&D cloud and our public cloud, and their configuration and architecture. For the rest of the use cases you can watch the videos of the other sessions from our group; I think they are on the summit website already.

We are not only a user of OpenStack — we have been contributing to the community since the Bexar summit, about four years ago, maybe a bit more. Since then we have been contributing to the community, as shown in these numbers. The numbers are not large compared to the biggest contributing companies, but I think they are not bad for a telco like us, and as a user I think we are in the top 20 of contributing companies. We have had more than 60 — 67 — contributors throughout our journey, and we are proud of them; they did a good job upstreaming. I will talk later in the session about why we had to upstream those patches, bug fixes, and features.

Okay, let's go deep into the details of what we have done behind the scenes. Again, I will focus on our R&D cloud and public cloud deployments in this session, which are the earliest deployments in our group.
So this is the timeline of my talk. As I said, we joined the community back in 2010 or so. The first use case is our first deployment, using Folsom, back in 2012 — I will talk about that first — and then we will talk about what we are doing in our current deployment, which uses Juno and Kilo.

Back in 2012, OpenStack was said to be production ready. There were several use cases running OpenStack in production, and the hype around OpenStack had been rising so rapidly that everyone thought it would be ready to use in production environments. But people, including us, were still skeptical: is it really production ready, is it really that good? We were among those who wondered. But Folsom had Quantum, which was the feature we really wanted, so we decided to get our hands on Folsom and test what we could do with OpenStack. Despite the fun features, we were still worried about its quality, so we focused on QA testing of OpenStack.

During this deployment we did QA testing including full API function tests and non-API function tests (covering CLIs and other features), full state transition tests covering all resource state transitions including the network, external system failure tests, race condition tests (a rough sketch of that kind of concurrency test appears at the end of this part), and long-term stability tests, as we usually do for other systems as well. And, as I mentioned, we had really been waiting for the network features to land in OpenStack, so we tested the network as well. At that time we used Nicira NVP, which is now VMware NSX, as our Quantum back end, since it was almost the only product on the market for production use. We tested its functions, capacity, performance, and availability, similar to what we test on our hardware switches. That included throughput from virtual machine to virtual machine, and from virtual machine to an external network via the router; how many networks and how many ports we could build; how many MAC addresses it could learn. Availability was also very important: if a network node fails, how long does it take to switch over? Those were the kinds of tests we did in that deployment.

The testing took more than a few months, I guess, and what we found was: well, it had the features, but it had issues and weaknesses. Race conditions were a weak area — it lacked appropriate locking mechanisms, so if you created ports concurrently, the ports went into the error state very easily. It also lacked internal error handling, so if something happened during API processing, the resource went into an error state and all you could do was delete it. It also left orphaned resources — virtual interfaces, ports, and even instances — that you could not delete through the OpenStack API, so you had to delete them manually. For state transitions there was no workflow management and no rollback mechanism, so if something went wrong you could not roll back to the original state; in migration, for example, if something failed, all you could do was delete the virtual machine itself, which we didn't want to do. API parameter validation was lacking, so you could break the system just by sending a malformed parameter. And HA wasn't so strong at the time: it took a long time to switch over to the standby node.
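To make that race-condition testing a bit more concrete, here is a minimal sketch — not our actual test harness — of the kind of concurrent-request test that exposed these problems. The endpoint, token, and network ID are placeholders; the check simply fires many simultaneous port-create calls at the Quantum/Neutron API and looks for ports that end up in the ERROR state.

```python
# Minimal concurrency test sketch: create many ports at once and count failures.
import concurrent.futures
import requests

NEUTRON_URL = "http://controller:9696/v2.0"  # placeholder endpoint
TOKEN = "<keystone-token>"                   # placeholder auth token
NETWORK_ID = "<network-uuid>"                # placeholder network
HEADERS = {"X-Auth-Token": TOKEN, "Content-Type": "application/json"}

def create_port(i):
    """Create one port and return (HTTP status, port status or error text)."""
    body = {"port": {"network_id": NETWORK_ID, "name": "race-test-%d" % i}}
    resp = requests.post(NEUTRON_URL + "/ports", json=body, headers=HEADERS)
    if resp.ok:
        return resp.status_code, resp.json()["port"]["status"]
    return resp.status_code, resp.text

# Many simultaneous requests are what used to provoke the locking problems.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(create_port, range(100)))

failed = [r for r in results if r[0] >= 400 or r[1] == "ERROR"]
print("total=%d failed_or_error=%d" % (len(results), len(failed)))
```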
So those were the weaknesses and issues that we found in Folsom. Well, that was the quality at the time. Although Folsom had good features — we had Quantum, the networking was great, we liked it — it turned out to be quite fragile. For a public cloud, you know, it was too fragile to put out in the wild for everyone to use. So our conclusion back then was: handle with care.

Okay, so what did we do in our Folsom-based system? I say "based" because we changed a lot. We built a proprietary system on top of, and in front of, OpenStack, so that we could be more gentle to OpenStack on behalf of our users. Our proprietary system consisted of resource management, host management, and user management; we had our own database holding all the resource information; we had our own GUIs, CLIs, and APIs; and, importantly, we had a workflow engine and a transaction management module that controlled the OpenStack APIs — similar to what Heat does in current OpenStack, and also a bit like Mistral. It was like Heat and Mistral combined, and we built it ourselves. We also carried many patches to OpenStack, and we even wrote a Cinder driver ourselves. So that was our first OpenStack system. I know it is not pure OpenStack and people shouldn't do it like this, but in order to use OpenStack in production back then, we had to build those kinds of systems. Looking back, OpenStack now has Heat and Mistral as projects, so I think the direction was not bad; the bad idea was that we did it all by ourselves.

What did we add? As I mentioned, we defined views to hide the OpenStack resources and present a business view of them to the users, and we built a proprietary operational GUI that OpenStack didn't have, host management, monitoring, resource and user management, and transaction management, as I said. We also added a feature called "purge", which rolls back, rolls forward, or cleans up after an API failure — so we added features around OpenStack. We also had the workflow engine, which executes a scenario consisting of multiple OpenStack API calls — create a network, create a port, create a virtual machine, attach them — and builds a virtual environment as a whole, like Heat does (a minimal sketch of this scenario-with-rollback idea appears at the end of this part). We also had to strengthen parameter validation checks before handing requests over to OpenStack, and we wrote our own EMC Cinder driver as well.

That was the technical part. We also had to talk with our business people, since they were skeptical about OpenStack too. The question we had to answer back then was: why should we use OpenStack when we already have vCenter or CloudStack running stably? But we liked OpenStack and we saw its future, so we had to discuss it with them and try to answer those questions. These are some of the discussion points. We compared cost: yes, it's open source, it should be cheap — but that argument didn't turn out so well, because OpenStack is free to use but not free to operate; you have to have engineers to manage it, and you need commercial support behind it. So the cost comparison was fine, but cost alone was not the number one item to convince them. We also did a feature comparison with vCenter — we made a list of what vCenter can do and what OpenStack can do and compared them, yes and no — but it was a bad idea: the concepts and the architectures are, of course, totally different, so it was really no use comparing them that way; you have to commit to the one you are going to use. So that comparison didn't work either.
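Going back to the workflow engine and transaction management just described, here is a small, hypothetical illustration of that scenario-with-rollback idea — not our actual code, and not TaskFlow itself — showing a chain of steps that is undone in reverse order when one of them fails.

```python
# Hypothetical illustration of the "scenario with rollback" idea behind our
# proprietary workflow engine (conceptually close to what Heat/Mistral and
# TaskFlow provide today). Each step pairs an action with a compensating undo.

class Step:
    def __init__(self, name, do, undo):
        self.name, self.do, self.undo = name, do, undo

def run_scenario(steps):
    done = []
    try:
        for step in steps:
            step.do()           # in a real system: an OpenStack API call
            done.append(step)
    except Exception as exc:
        failed = step.name
        # "Purge": undo every step that already succeeded, in reverse order,
        # so that no orphaned resources are left behind.
        for s in reversed(done):
            try:
                s.undo()
            except Exception:
                pass            # best effort; a real engine would log this
        raise RuntimeError("scenario failed at %s: %s" % (failed, exc))

# Example scenario: create network -> create port -> boot server.
# The lambdas stand in for real API calls.
scenario = [
    Step("create_network", lambda: print("create network"),
                           lambda: print("delete network")),
    Step("create_port",    lambda: print("create port"),
                           lambda: print("delete port")),
    Step("boot_server",    lambda: print("boot server"),
                           lambda: print("delete server")),
]
run_scenario(scenario)
```

In a real system the do/undo callables would wrap OpenStack API calls; this compensation pattern is essentially what TaskFlow now provides upstream.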
We also did a network feature comparison, and this one worked. OpenStack had SDN solutions — we used Nicira NVP (now NSX), as I said — and you can create very flexible networks, much like you can in a conventional, non-virtualized environment. We also looked at the ecosystem around OpenStack, especially around Neutron, and saw the various SDN implementations that we could select and use. They liked that, and they also saw the future of OpenStack through the network features. Since we are a network operator, we had a goal of combining existing network services with OpenStack — the data center and the network together — and that kind of roadmap could easily be driven by OpenStack. So this one worked, and now they were half convinced.

The last thing was its future growth: community power. We compared it with CloudStack, for example, which also had a community, but the momentum moving OpenStack forward was much stronger. Even at that time we thought OpenStack could become the de facto standard in open source software. So those were the items; we successfully convinced the business people and got the go-ahead to take OpenStack into production.

Back to the technical items: as I said, we patched OpenStack a lot, and that became our debt to pay. So let me hand over to Takashi to describe more about what we have done upstream for the community.

We developed about 150 patches. The patches can be classified into these categories. The largest number of bug fixes was for Nova live migration; as you know, the quality of the Nova live migration function was not good before, so we fixed many bugs. But nova-conductor was introduced in Grizzly, and the live migration code was then modified significantly in the community. We also added input parameter checks and improved log output. There were many patches, and we considered that it would take a lot of effort to upstream them, so we cooperated with Canonical and did the upstream work together with Canonical engineers.
We worked on upstreaming with Canonical for six months. Some patches were merged successfully into the community code, some were rejected by the community, and some turned out not to need upstreaming. The patches that were upstreamed successfully included adding unit tests, race condition bug fixes, deleting unnecessary Nova console tokens, and adding timeout parameters — for example, adding a Glance timeout parameter to Cinder.

On the other hand, some patches were rejected, for example our multiplicity control function and some of the input parameter checks. Our multiplicity control function limited the number of concurrent operations; it was added to avoid performance degradation due to heavy operations, for example creating a volume from a Glance image, volume clones, and so on. It could not be merged because a similar function, API rate limiting, already existed. Some patches that improved input parameter checks could not be merged because such changes should be added in the next major API version. And then there were many bugs that had already been fixed by other companies in the community code and could not be reproduced on master, so those patches did not need to be upstreamed.

We try to upstream not only bug fixes but also our proprietary functions. We are working on request ID mapping across logs and on TaskFlow upstreaming, in order to upstream our transaction management and workflow engine functions. One of our functions was tracking API calls between components by using one common request ID; in the upstream process, we changed the approach from using one common request ID to mapping each component's request IDs. The cross-project spec has been approved. Currently we are implementing the function for getting the request ID from the HTTP response header, and we will implement a log function to output the mapped request IDs on one line (a small illustrative sketch of that idea appears at the end of this part). In the community, the TaskFlow work is in progress; TaskFlow is needed for our retry, rollback, and API trace functions, so we are also taking part in the TaskFlow implementation. We still have a lot of things to do, such as force-delete in rollback and optimization of error handling. As Shintaro said, we developed the drivers for EMC storage products by ourselves, but we did not upstream them and decided to use the EMC drivers in the community code in our current system.

What we learned from the first release is that "upstream first" is very important. Some of our development and fix work was in vain because the issues had already been fixed by other companies in the community code; our proprietary functions and tools have to be modified because prerequisite functions cannot be merged; and it takes a long time to upstream proprietary functions, since it needs coordination with, and persuasion of, the community — we are still working on that now. Then Shintaro will talk about our current system.

Okay, thank you Takashi. So in short, we still have our debt, and we have to pay it. But our first system is running very well — it has been running for more than two years in our lab and in one of our business companies with no major issues. So Folsom was good if you handled it with care; it is running in production. But we learned a lot from that first deployment, so for the next, current deployment, based on those lessons, we undertook the next deployment, using Juno and Kilo.
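Before moving on to the current deployment: to make the request-ID mapping Takashi described a little more concrete, here is a hedged sketch of the basic idea. The endpoints and tenant ID are placeholders, and the header names vary by service and release; this is an illustration of the mapping-and-logging approach, not the community implementation.

```python
# Illustrative sketch of mapping per-service request IDs: call two services,
# read the request-ID header each one returns, and write both IDs in a single
# log line so operators can correlate logs across components.
import logging
import requests

logging.basicConfig(format="%(message)s", level=logging.INFO)

TOKEN = "<keystone-token>"                        # placeholder
NOVA = "http://controller:8774/v2/<tenant-id>"    # placeholder endpoints
NEUTRON = "http://controller:9696/v2.0"

def request_id(resp):
    # The header name differs per service and release; most services use
    # x-openstack-request-id, Nova historically used x-compute-request-id.
    return (resp.headers.get("x-openstack-request-id")
            or resp.headers.get("x-compute-request-id", "unknown"))

headers = {"X-Auth-Token": TOKEN}
servers = requests.get(NOVA + "/servers", headers=headers)
ports = requests.get(NEUTRON + "/ports", headers=headers)

# One line that maps the IDs of related calls across components.
logging.info("nova_req_id=%s neutron_req_id=%s",
             request_id(servers), request_id(ports))
```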
So how do we do it now? First, we had to change our mindset: don't be greedy; find a way to live with the community code, which means don't change OpenStack, and try to make use of what you can get.

This is our basic strategy, or the workflow of our thinking. For features: first, try to be satisfied with what OpenStack already gives you, or try to get by with what you can get — meaning don't change the code. If you want something added to OpenStack, write a spec or a request for enhancement to realize your ideas; don't write your own code internally. It takes time, but someone will probably help you. If that doesn't work, and if you really, really want it, and if you can afford it — since you will have to maintain it yourself — then think about building it outside of OpenStack, and don't touch the upstream code. That is how we handle the features we want now.

For bugs: first, report the bug to the community as fast as possible and wait for it to be fixed, as a user. If you need it quickly, pick up the bug yourself and fix it. Sometimes the community won't fix it, or the priority will be low, or sometimes they will say, well, it's not a bug, it's the spec. When that happens, try to live with it by writing documentation. We write a lot of documents: workarounds and recovery manuals for operators for when we hit those bugs, and FAQs and known issues for the users, so that we can work around the bugs while waiting for fixes. But sometimes a bug is critical for the system, and what we currently do then is simply close off that API — don't touch it, it's too dangerous. So currently we don't expose all the APIs; some APIs are not even implemented, and we expose only a subset. If all of the above doesn't work, create in-house patches if you really need them, but keep them to a minimum and try to upstream them. We keep only a few in-house patches right now — not the 150 that we would then need to upstream.

Based on those ideas and strategies, this is what we did and didn't do in our current deployment. We dropped everything that needed changes to OpenStack itself: features that would change API behavior, or requirements of the form "make it behave like CloudStack or vCenter". Instead, we create workarounds or leverage equivalent OpenStack features to meet those requirements. But there were mandatory things, especially for operation, and we did a few of them without changing OpenStack. First, we added an API filter to hide immature APIs, as I said, using an Apache proxy in front of the OpenStack APIs (a small illustrative sketch of the idea follows below). We added a notification and API log collection tool — an external, simple tool to collect the information necessary to operate, mainly for billing purposes. We also created a cascaded domain model by using the Keystone APIs; this is done through documented, manual procedures, so we built the cascaded domain model within Keystone. And we developed high availability for virtual machines; it is open source, but it is external to OpenStack — a system we created to recover virtual machines when something happens. This is what we needed for the public cloud, since our virtual machines are pets rather than cattle.

This is the overall system architecture. As I mentioned, we have a reverse proxy in front of the OpenStack APIs to filter them — one for the users and one for the operators.
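We did this filtering with an Apache reverse proxy; purely to illustrate the idea, here is a hypothetical, minimal WSGI-style filter in Python that only lets a whitelisted subset of API calls through — it is not our actual configuration, and the paths listed are examples.

```python
# Hypothetical sketch of the API-filtering idea (we actually used an Apache
# reverse proxy): expose only a whitelisted subset of the OpenStack APIs and
# reject everything else with 403.
ALLOWED = {
    ("GET",  "/v2.0/networks"),
    ("POST", "/v2.0/ports"),
    ("GET",  "/servers"),
    # ... only the calls we are willing to support in production
}

class ApiFilter:
    """WSGI middleware that blocks calls to APIs we do not expose."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        method = environ["REQUEST_METHOD"]
        path = environ["PATH_INFO"]
        if not any(method == m and path.startswith(p) for m, p in ALLOWED):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"This API is not exposed.\n"]
        return self.app(environ, start_response)
```

The same whitelisting can be expressed as Apache Location rules in the reverse proxy; the point is simply that only a curated subset of the APIs is reachable from outside.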
We also have some operational tools: the virtual machine HA tool, which is called Masakari — it was presented in the demo session on Tuesday, so if you missed it, please check the video; it's online, I think — and the notification and API log collection tool, which collects the API logs from Apache and receives notifications from OpenStack via the standard OpenStack notifications. We added an agent on the compute nodes to report error events to Masakari, but we didn't touch OpenStack at all; we used the APIs. We have a few agents, but they run outside of OpenStack. And since we didn't touch OpenStack, it is possible to upgrade as long as the APIs stay consistent — with the Folsom version we had given up on upgrading. So this is our new architecture right now. Let me pass over to Takashi to dig more deeply into how we have configured OpenStack.

This figure represents our current OpenStack configuration. There are controller nodes, storage nodes, compute nodes, network nodes, and back-end nodes. The controller processes — nova-api, cinder-api, and so on — run on the controller nodes and are configured active-active, but the Neutron server and the Nova console services are configured active-passive using Pacemaker. The Glance processes on the storage nodes are configured active-active, and the Cinder volume services are configured with multiple active nodes and one passive node. MySQL and RabbitMQ run on the back-end nodes: MySQL is configured with three active nodes and RabbitMQ with two active nodes. The network nodes are configured with multiple active nodes and one passive node. As for the compute nodes, as Shintaro mentioned, they are given HA by using the open source tools we developed.

Our current system is based on Ubuntu 14.04 and stable Kilo. We configure six host aggregates for VM scheduling: three OS types times two Nova flavor types. The three OS types are for operating systems such as Windows and Red Hat, and the two Nova flavor types are normal flavors and flavors with a lot of memory. Our architecture also supports multiple data centers.

We contributed to the community during the current system's development cycle. In Cinder, we implemented a function that prevents users from uploading a volume that has protected properties; in Glance, we implemented a function that prevents users from downloading an image that has protected properties. These functions are for licensed images, for example Microsoft Windows. We also enabled Glance to use multiple filesystem stores as back ends, and we proposed to the community the function of reloading configuration files on a hang-up signal (SIGHUP); that reloading function has since been implemented in the community. In Neutron, we made it possible to start an L3 agent or DHCP agent excluded from automatic scheduling while still allowing manual scheduling, and we changed Neutron so that it does not stop all services on a DHCP or L3 agent when its admin state is set to down. Then, Shintaro will wrap up our presentation.

Thank you, Takashi. So that was our second, current deployment. We are now aligned with the community: we can use the community code as is and move forward with the community. From the lessons learned — the system model we presented today was for a public cloud — I believe that OpenStack fits best with a cattle model, like web services, where you can spawn a VM and kill it at any time; in those cases the entrance barrier is lower. But for a public cloud you have to care more about each virtual machine.
You have to treat your virtual machines as pets — precious pets — because they are the customers' virtual machines and you make money with them. That was the hard part of using OpenStack for a public cloud. There are discussions about where to apply OpenStack, and at this moment, at least, we are still not confident about using OpenStack for things like core network function virtualization or silo application virtualization, because they may not need OpenStack, or OpenStack may still require more features or enhancements. We are looking at those areas as well, but they would be much more difficult.

So, for our next steps, the focus areas: we are focusing on more upper-level applications, like PaaS or IoT uses of OpenStack, and also on practical use of OpenStack in the NFV area. We are now trying to upstream the following functions to the major projects. In Nova, we are trying to improve shelve performance; there is a blueprint. There is Neutron availability zone support, which is almost merged for Mitaka; there is a blueprint and there is code. We are working on Congress, to be used in OPNFV use cases like Doctor. The big task we have to do is the cross-project work on log request ID mapping, so that you can track logs using request IDs. And for Masakari, which we developed, we would like to upstream it; it is open source, but it is not part of OpenStack, so we will try to upstream it if we can get support from the community.

So that is about it for our presentation. We have some more presentations this afternoon — they are both interesting, and you can learn from them as well. And if you miss any of the NTT group presentations, please check the videos; they are online. Okay, so we are now open for questions. How much time do we have left? I'm not sure. If you have a question, please come up to the front, because they want to record it. By the way, if you want to know what "Masakari" means, please check the video from the demo session; you'll figure it out.

Thank you. I got the impression from one of the slides — it may be correct or not — that you prefer to develop things yourselves rather than finding, implementing, or buying third-party tools, and there are of course a lot of those, and a lot of open source available. Can you speak a little bit more about that? Which do you think is better: taking something that exists, a simple thing like Puppet or Chef, and using it, or implementing your own function and running with it?

Okay. We believe using existing open source tools is the better way — the best way, if you can do it. We use Chef and other open source tools as much as possible, but there are some areas where existing tools don't fit our requirements, and only in those cases do we develop things ourselves. In most cases we use existing tools — for example, in the log collection tool we use Fluentd to collect logs. So we use open source as much as possible. How much time do we have left — one more question? Okay? Thank you.
I'd like to know where the borderline is between the situation where you donate your program to the community and the situation where you decide not to. You mentioned that if you donate, it takes a long time, but there has to be some borderline.

Yes, it is difficult, you know. You try to upstream everything — you write a spec, you file a blueprint, you write code — but sometimes it takes a long time. In those cases, what you can do is wait for it to be implemented, and until then simply not provide those services. And if you really want it, like Masakari, which we developed ourselves, then the decision comes down to: is it mandatory for you to operate or to provide the service to the customer? If it's not mandatory, then wait; if it is mandatory — if it's a critical thing — then try to build it, but try to keep it minimal. That is our strategy.

Okay. So you are the innovation center for NTT, but NTT has all these components — NTT DATA and so on — and there are multiple different workloads that you ultimately plan on running. What is your experience across these workloads? Have you actually started running any production workloads, or is it just a first step?

Okay. Currently we don't have a unified infrastructure throughout the group companies; we have different deployments in different companies. The architecture we talked about today is for the public cloud, and other companies have private clouds using OpenStack with their own architectures, which are much simpler, I think. It would be better to have a unified infrastructure, but at this moment we have not figured out how to do that. Thank you.

So I think we have run out of time.