All right. Welcome to the session on design considerations for the production cloud with OpenStack. I'd like to introduce — it's a tag team effort which we put together; it's not an all-Intel show. I'm Ruchi Bhargava. I work for Intel IT as the hybrid cloud product owner. And I'd like to introduce Xu Kuan Huang. He's from our China IT team. And you want to introduce yourself? Yeah, my name is Xu Kuan Huang. I'm from the Shanghai, China Intel IT team. We are working on the Intel production cloud. Yes. OK. And I'd also like to introduce — I'm Kai from Nanang Cloud. We are a partner of Intel to provide software services. All right. So the agenda today is basically what we went through and what are the big things one should consider when designing a production cloud solution for hosting in an enterprise. It is mostly enterprise-focused; we're not talking about the telco and CSP use cases. What I will go through is: what are the different use cases within the enterprise for which Intel is using our OpenStack cloud? What is our enterprise IT cloud journey? What are the production cloud design considerations, both technical and non-technical? And then we'll also have Kai talk about — if that was the large-enterprise IT viewpoint — how small and medium businesses can deploy a production cloud using OpenStack, and how that is different from what a large enterprise would do. And please feel free to stop us for questions in between. There's a mic there, or just shout out and we'll stop, because we've got material which we put in the backup just in case we run out of time, but I didn't want to fall short on questions. So if you look at this — let me use the pointer here — we've got our OpenStack deployment. The majority of it is the enterprise IT shop: the enterprise applications, the ERP kind of applications, as well as office applications, which are hosted on the Intel cloud. And then we have our lab setup.
Being a big design and manufacturing company, there are several — thousands of — labs all over the country. How can we provide computing solutions to them? Xu Kuan is going to talk about that too; that's the big lab hosting use case. And then we have a new business use case, which is where Intel, like other enterprises, goes into several new business initiatives. You don't want to use the existing infrastructure and hosting solutions and make an investment there, because those initiatives may or may not be viable. So what is the right kind of solution for them? That's what we call an external-facing new business solution. And there are several initiatives. We have something we call an incubator, where for a short term they don't need a lot of the typical IT bureaucratic processes; they can get an environment up and running in a very short time. That's the new business hosting environment, and it is generally external-facing, where they can work with their industry partners and exchange information without the restrictions of working within the enterprise capability. And as we talked about earlier, older companies always have an existing infrastructure. How do you integrate that with a new deployment of OpenStack? So we have an existing infrastructure with proprietary hypervisors and proprietary storage solutions, and then we have a new infrastructure which is mostly KVM and ESX, with open source as well as proprietary storage. So it's a combination, and that comes with challenges, and we'll talk about that. But it also provides a convergence opportunity for all three use cases. There are different organizations — we are a 100,000-person company with different silos of compute infrastructure — but OpenStack provides us a huge opportunity for consolidation. So this is our IT cloud journey.
If you look at 1.0, it was a proprietary orchestration layer provided by a vendor, with a proprietary hypervisor and proprietary storage. And it worked great — mostly a virtualization story for us. Then, a couple of years ago, we started the 1.5 journey, which was OpenStack-based. We wrote our own utility for orchestration; we call it OCU. I don't know if the author is here — he's no longer at Intel. But we deployed that, and it was a hybrid cloud implementation, a pilot with Amazon — I normally don't take names, but it's public cloud. It was based on Ceph for storage, the network was a physical network, and compute was standard compute boxes. We had some great learnings from there. The majority of the learnings were about the huge technical debt we incurred. We decided at that point that we were going to go with a hybrid strategy for 2.0: our hybrid solution would be a tie-up with an OpenStack-based public cloud provider. Our groups which need Amazon and the other cloud service providers that are not OpenStack-based would still have those available for public cloud usage. But our hybrid solution would be an OpenStack solution, because we wanted to go purely native — use upstream code directly without incurring any technical debt from IT cloud's perspective. The only development our engineering team would do would be integrating with the enterprise — the last mile integration with the enterprise applications. And I'm going to talk about that in a little bit. Here we have both open source and proprietary hypervisors. We have SDN, which we implemented with Cloud 2.0. And then we have storage, which is both Ceph and proprietary storage.
And as I've talked about earlier — I would love to find out if anybody else has found solutions for better integration of legacy with open source storage and compute, and what the challenges are. Any questions on this? So this is our current Intel Cloud 2.0 high-level architecture picture. What we have is the open source control plane, and that control plane can be consumed — it has API interactions — in multiple ways. You, of course, have the GUI; we use Horizon, which is the best GUI, and we try to make minimal changes to it, because the more changes we make, the more difficult it becomes to maintain every time we have a new release. Then we have custom automation opportunities. This is a large enterprise IT shop; there are lots of processes, so we've tried to automate a lot of them. We call it Automate IT. There were some security processes where, in order to get firewall access for an external-facing web app, a lot of approvals were required. So we've done self-service approvals based on the criteria, and those use the custom automation APIs through the control plane. Then we have policy- or template-driven orchestration — for example, down-the-wire patch installations can be implemented using those APIs. That's one use case for it. And then we have PaaS automation, using Cloud Foundry. Some of you must have attended the presentation yesterday by Cathy. So the PaaS, Cloud Foundry, uses APIs from the OpenStack control plane. At the back end, we use a multi-backend Cinder for connecting with different kinds of storage solutions. We are using a SAN solution, which is the legacy infrastructure — we've got lots of investment there, so we do want to leverage it, and the whole picture is all about cost. So we have multiple storage solutions, we have a physical network managed by an SDN front end, and then we also have multiple hypervisors.
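The multi-backend Cinder setup described here can be sketched roughly as follows in cinder.conf. This is an illustrative fragment, not Intel's actual configuration — the backend names are made up, and the SAN driver line is a placeholder for whatever vendor driver your array requires:

```ini
[DEFAULT]
# Expose both the legacy SAN array and the new Ceph cluster
enabled_backends = legacy-san,ceph-rbd

[legacy-san]
# Placeholder: substitute the vendor driver for your SAN array
volume_driver = cinder.volume.drivers.san.san.SanISCSIDriver
volume_backend_name = LEGACY_SAN

[ceph-rbd]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = CEPH
rbd_pool = volumes
rbd_user = cinder
```

Volume types are then mapped to a backend via the `volume_backend_name` extra spec, so users pick a type and the scheduler routes the volume to the legacy SAN or to Ceph.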
Multiple hypervisors become necessary mostly because we don't want people to go to two different portals or control planes for whatever they provisioned on our previous Intel Cloud 1.0. So what we are going to do is converge all those VMs with the new VMs, and then they can have a single pane where they manage previously provisioned VMs as well as the new ones. The primary focus and drivers for this were providing more self-service — there's a huge demand for that. And one of the challenges, which I heard in the Expedia presentation earlier in the keynote, is the free banana concept. We call it free bananas because when you give out free bananas, people take them; they don't even use them, they just leave them at the desk. So initially we did face that: people would just consume a VM — oh, it's really easy, let's go provision a VM. But later on, after educating a lot of them that you are using resources, we have come to see that once you consume it, you can't just over-provision. And some of them may be real users. So we have seen some alignment — people are becoming, what's the term Expedia used? — more aware, so that they consume only what they need. And so self-service has become much more of a viable solution. From their perspective it's great, but from an IT cost management perspective we've got to manage it; there is a cost to it. And then, reduce migration impact on ROI. When you have a lot of proprietary solutions — I'm not going to use the example of infrastructure hosting, but ERP solutions: we use one of the major ERP solutions, and we realized that in order to stay current, there were no upgrade cycles without downtime available in that ERP solution. So we were at least 12 generations behind, and then we had to make a huge investment to come up to the current capability.
So we didn't want to face the same scenario in our hosting infrastructure, and that's one of the key drivers for going open source — it basically reduces the migration impact. And then user experience: from the legacy, previously provisioned VMs to the newly provisioned VMs, they've had the same user experience, with the added capability of self-service management. Self-service provisioning is good, but our previous capability did not allow self-service management. They couldn't shut down or restart; they had to put in a ticket to the help desk to restart their VMs, bring them back, or patch them. Now they can do all of it themselves using the control plane which we offer. And of course it gives us good resource utilization, which is, again, a cost play. So let Xu Kuan take the next part. Thank you, Ruchi, for introducing the cloud journey and the strategy. So come with me through the detailed considerations for building a production OpenStack cloud. At the beginning, I think most of you may start from a proof of concept. You have a DevOps team; you build a stable, small cloud for a limited set of customers to use, to prove whether the concept works. And from here, we should consider, from the beginning, a cloud architecture designed for scale-out, because in the future the cloud must scale. Another important thing is that we should consider how to fulfill the customers' requirements with the cloud's functionality. And most important is that we should keep this cloud stable — extremely stable — so that the customers stay in this cloud and use it, and more and more customers will come. So once more and more customers use it, the next problem is how to scale out this cloud. As I just said, at the beginning the design of this cloud is for scale-out, so you can easily scale it out into a bigger cloud and serve many more customers. And as a company, we have the existing investment.
So OpenStack as a control plane, just as you saw in the previous slides, can also help to manage the existing infrastructure. So the cloud can help us to save money — it's cost-effective — and, for both the new and the existing investment, it can help us save human effort. How? OpenStack has a consistent API, and we can use those APIs to automate the operation of this cloud and save a lot of human effort. In this way we can have more customers and a bigger cloud, but with a stable, small DevOps team, and the cloud can stay stable. So we can save money by running such a cloud. When we have a big cloud, we keep gathering customer requirements. Customers may give feedback on this cloud, and we will keep adding more functions, more components, into it. Also, customers may encounter some problems, and we will fix them. But to do that, we also need a good support model and operating model around this cloud. At this stage, we are almost into production — except for one mile. We have to resolve the integration with the authentication system inside your company. We don't want the employees inside your company to have one account to log in to the cloud and a different one to log in to other systems. So we should, for example, integrate Keystone with your own identity system inside your company, so that the cloud is smoothly integrated with your existing systems. After that, I think we can call it a production OpenStack cloud. So that was using the slides to introduce the progression from a small cloud growing up to a production cloud. Here, I have three major technical vectors to show the considerations. One is stability. Stability is very important; we should consider it from the beginning to the end — and maybe there is no end for your cloud. The cloud should be extremely stable so we can keep the customers here. The first element we should consider is redundancy — several levels of redundancy.
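The Keystone-to-corporate-directory integration mentioned here typically means pointing Keystone's identity backend at the company LDAP or Active Directory. A minimal sketch of keystone.conf, with placeholder hostnames and DNs (the exact driver string varies by OpenStack release — newer releases accept simply `driver = ldap`):

```ini
[identity]
driver = keystone.identity.backends.ldap.Identity

[ldap]
url = ldap://dc.example.com
user = CN=svc-keystone,OU=ServiceAccounts,DC=example,DC=com
password = <service-account-password>
suffix = dc=example,dc=com
user_tree_dn = ou=Employees,dc=example,dc=com
user_objectclass = person
user_id_attribute = sAMAccountName
```

With this in place, employees authenticate to the cloud with the same corporate account they use everywhere else, which is exactly the "last mile" being described.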
For example, as we may know, for service redundancy we can build multiple controllers in OpenStack, so we have redundancy for the API services and other services. Also, we should consider server-level redundancy: do you have a redundant NIC, or redundant power supplies on your server? Also at the rack level we should consider redundancy. If we put all the controllers in one rack, there is actually no redundancy — if that rack loses power, then all the controllers go down and your cloud will have a problem. Also, at the data center level, do we have redundancy? And from the network perspective, do you have redundancy on the core switch in your data center? So there are multiple levels of redundancy we should consider. After we launch a cloud, the cloud will run for a long time. Over this long time, how do we keep the cloud stable? I think monitoring and alerting are very important; they help keep the cloud stable during its whole lifecycle. There are also many levels of monitoring and alerting. For example, host-level monitoring and alerting: is the host down? What's the load average of this host? Also service-level monitoring: is some OpenStack service down? Is it running all right? When it's down, how soon will it send you an alert? There is also log-level monitoring and alerting. As an operator of an OpenStack cloud, normally people will check the logs to see if any error happened, to find the root cause. So if you have some tools to monitor the logs — collect the logs and automatically send out error alerts — that will help you a lot to keep this cloud stable. The last thing I want to mention is VM-level monitoring and alerting. As a cloud, you should find a problem before the customer says, hey, this VM is losing network connection, or this VM is slow. So we should have great VM-level monitoring.
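The log-level alerting idea above can be sketched very simply: scan collected service logs for error patterns and emit an alert for each match. This is an illustration, not the tooling the speakers used — the patterns and the print-based alert sink are assumptions standing in for a real pipeline (e.g. shipping matches to Nagios/Shinken):

```python
import re

# Patterns that commonly indicate trouble in OpenStack service logs
ERROR_PATTERNS = [
    re.compile(r"\bERROR\b"),
    re.compile(r"\bTRACE\b"),          # Python tracebacks in older releases
    re.compile(r"Connection refused"),
]

def scan_log(lines):
    """Return the log lines that match any error pattern."""
    return [line for line in lines
            if any(p.search(line) for p in ERROR_PATTERNS)]

if __name__ == "__main__":
    sample = [
        "2014-11-04 10:00:01 INFO nova.compute started",
        "2014-11-04 10:00:05 ERROR nova.compute instance spawn failed",
        "2014-11-04 10:00:06 TRACE nova.compute Traceback (most recent call last):",
    ]
    for hit in scan_log(sample):
        # Stand-in for a real alert sink (mail, Nagios passive check, etc.)
        print("ALERT:", hit)
```

In practice this logic would run continuously over centralized logs, so errors surface as alerts instead of waiting for an operator to grep them by hand.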
We can, for example, leverage Ceilometer, or install some collection agent in the guest in some VMs, to help us monitor those VMs, get the metrics, and find problems before the customers do. There are many tools to help us do that. For example, for monitoring we can use Ganglia to get the metrics from a host, and for alerting we can use Nagios or Shinken to send out the mail. Isolation — how many of you use cells or host aggregates in OpenStack? What isolation means is this: there are several use cases in your cloud. Some people use VMs for a large workload; such a VM requires, for example, 32 cores and maybe 32 gigabytes of memory. That large-size VM will have an impact on other normal VMs if you put them together. So we can use the isolation concept — for example, host aggregates — to separate the different use cases, avoid the noise from some VMs, and keep the cluster stable. And scalability: at the beginning, I said you should design your cloud so that every component is able to scale out. What does that mean? When you use some OpenStack component, you have to do some investigation to see whether that component can be scaled out. Take one example: previously, the Neutron L3 agent, as you know, was not good at scaling out. So when you use Neutron, you have to consider how large your cloud is going to be; if it's large, maybe you do not want to use the L3 agent. So that is for design. Once you have made sure every component in your cloud is able to scale out, the next thing is how to deploy it automatically. If your customers' needs increase, you want to deploy components quickly into your cloud environment. For example, customers want more VMs, and currently your OpenStack compute nodes are under heavy load.
You will want to add a bare metal machine into your cluster: boot it from the network, install the OS automatically, then use Puppet to deploy the OpenStack compute node and add it into the cluster immediately to resolve that kind of issue. The next one is maintenance. How can a small group of people maintain a large-scale cloud? APIs matter. One of the important values of OpenStack is that it provides a consistent API, so you can develop many small tools that invoke the OpenStack APIs to maintain this cloud automatically. OpenStack iterates very quickly and many patches are released, so to better maintain the cloud, it's better for us to build a local CI/CD environment on your site, so that you can test a patch, run it through the CI gates, and then automatically deploy it into your production environment for a smooth upgrade. The last one is about cloud data analysis. As I just mentioned, we have a lot of monitoring tools and we can centralize the logs. A cloud platform generates very big data every day, and there is a lot of information you can get from this data; it is important for the operator of the cloud. So how do we utilize it? In our practice, we centralize, for example, the Ganglia metrics, the log metrics, and the Shinken alerts — they all go into one place, and we can do some analysis on that. For example, at 10 o'clock an operator finds there is an issue in this cloud and wants to find out the root cause. How can he do it using this cloud data analysis? He can search the centralized database and find out, at 10 o'clock, what the monitoring status of the hosts was, what kind of alerts Shinken had raised, and what kind of log errors had happened. By centralizing all the data together, you have all kinds of information in one place to help you find the root cause very quickly. So it helps maintenance very well. So I've finished the technical vectors.
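The host-aggregate isolation described in these vectors — large-workload flavors landing only on powerful E7 hosts, normal VMs elsewhere — can be sketched as a simplified version of the matching logic behind Nova's AggregateInstanceExtraSpecsFilter. This is an illustration of the idea only; the aggregate names, host names, and metadata keys are invented, and real Nova does this inside the scheduler:

```python
# Toy model of host aggregates: each aggregate has member hosts and metadata.
AGGREGATES = {
    "large-workload": {"hosts": {"e7-host-1", "e7-host-2"},
                       "metadata": {"workload": "large"}},
    "general":        {"hosts": {"e5-host-1", "e3-host-1"},
                       "metadata": {"workload": "normal"}},
}

def hosts_for_flavor(extra_specs):
    """Return hosts in aggregates whose metadata matches the flavor's extra specs,
    in the spirit of Nova's AggregateInstanceExtraSpecsFilter."""
    matches = set()
    for agg in AGGREGATES.values():
        if all(agg["metadata"].get(k) == v for k, v in extra_specs.items()):
            matches |= agg["hosts"]
    return matches

# A 32-core flavor tagged for large workloads only lands on the E7 hosts
assert hosts_for_flavor({"workload": "large"}) == {"e7-host-1", "e7-host-2"}
```

In a real deployment you would create the aggregate (`nova aggregate-create`), set its metadata, and give the large flavor a matching extra spec; the scheduler then does this filtering for every boot request.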
Next, Ruchi will introduce the non-technical considerations. So besides the technical benefits from OpenStack, for any cloud implementation there are definitely several non-technical considerations as well, and most of them are about cost. In any enterprise, we are all there to make money, so the total cost of ownership comes into play. At least at Intel, for any new initiative you have to go through a huge ROI analysis. So what we had to do was a POC as a skunkworks effort — find old servers which are really old — and do a tech evaluation. Once the tech evaluation shows it works, then you do a POC where you look at the cost implications and get a forward-looking cost analysis done: what is the cost of integrating with the existing implementation, what is the impact on the user experience, what is the cost of transition and change management? Those are the different vectors which play into the cost picture, which then helps us make a decision. So from a total cost of ownership perspective, you do that analysis. And the approach we took: we have 13 data centers across the company, located globally. We are not going to replace all of them immediately with a greenfield implementation of OpenStack. We have existing deployments of our internal cloud. We use the single control plane strategy, which bridges the existing — it provides a transitionary approach — and we manage both internal Cloud 1.0 and 2.0, orchestrated through the control plane. And then, when we have a 3.0 infrastructure which is ready to be deployed, we can phase out 1.0 and move on. So it provides a framework for moving forward. That was the total cost of ownership perspective. The next big factor is workforce transformation. From our first internal Cloud 1.0, we had a pretty decent GUI.
But what it meant was you'd request, and then the support group — the hosting organization which runs the cloud — was used to pushing buttons on the GUI and managing it. They would do the same thing over and over again, with very little automation, and the sysadmin approach was missing. That transformation, to move from push-button to Automate IT — automate the cloud — from an operations perspective was the biggest transformation our organization had to go through in the last couple of years. It basically meant basic OpenStack training for the entire organization — not only the people designing it, but also those implementing it and managing it from an operations perspective. So we provided basic training. Then we ran bootcamps for OpenStack with, of course, external companies which provided the training; Mirantis was one of them, and there were several local companies in the different locations which we have. But even though there are plenty of companies available, providing a similar level of training everywhere was a big challenge. So you've got to agree on a release: at that time we were on Essex, and Folsom was coming. The consideration was, are we going to jump from Essex to Grizzly, or go from Essex to Folsom to Grizzly? And what kind of training the training companies provide, and which release it is based on, was also important — because for people who've never worked with OpenStack, when they take a training based on — I'm talking about a year and a half, two years ago — well, today, if somebody learned on Havana and you talk about Juno, what are the differences? The training doesn't comprehend that. That's a challenge, and you've got to pay minute attention to that effort also. So that's from a training perspective. Now, from an operational perspective, when you deploy any new capability, the QA team does automated testing.
But another method we employed, which helped us train the organization, was that we provided regression tests to the group of people who were going to support it. Before making any major release, all of them had to run those manual test cases, which basically gave them knowledge of the different use cases the customers would actually exercise. I think it slowed down our releases a little bit, but then our support was excellent — they knew exactly what the customer would really want, or what issues they would face, even if it was not automated. So those are the key workforce transformations which we saw. Then, the last mile integration is on two fronts. One is security. From a security systems perspective, how do you patch all our infrastructure hosts? We use BigFix, say — so how does BigFix integrate with our deployment? Then there is EAM, which is Enterprise Access Management: as Xu Kuan said, you don't want people to log in to Outlook or other enterprise applications with one login and provision with a different one. So what kind of integration did we have to do? The Keystone integration was big, and that's one place where we used some of our engineering resources. And what's the third one? I already talked about security; the next one is service management. To run an operations shop, you have to have good asset management and good incident management on these servers which we are provisioning. So integrate with whatever service management capability your company uses. We use ServiceNow — so how do we integrate with ServiceNow? All configuration items, any event-driven logging. When I provision a VM, it needs to go and make sure that it's registered in ServiceNow, irrespective of which method in that picture was used. We had PaaS provisioning VMs.
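The event-driven service management integration described here — every provisioned VM must end up as a configuration item in ServiceNow, no matter which front end created it — might look roughly like this. The endpoint URL, table name, and field names are placeholders, not Intel's actual integration; the point is that the mapping hangs off the control plane, not any particular GUI:

```python
import json

def build_ci_payload(vm):
    """Map an OpenStack instance record to a ServiceNow-style CI payload."""
    return {
        "name": vm["name"],
        "serial_number": vm["uuid"],        # instance UUID as the unique key
        "category": "Virtual Machine",
        "install_status": "Installed",
        "ip_address": vm["ip"],
    }

def register_ci(vm, post=None):
    """Register the VM as a CI with the service-management system.

    `post` is injectable so the mapping can be exercised without a network;
    in production it would be something like requests.post against the
    CMDB REST API (hypothetical URL below).
    """
    payload = build_ci_payload(vm)
    if post is not None:
        post("https://example.service-now.com/api/now/table/cmdb_ci_vm_instance",
             data=json.dumps(payload))
    return payload

vm = {"name": "web01", "uuid": "abc-123", "ip": "10.0.0.5"}
print(register_ci(vm)["serial_number"])
```

Hooking this into notifications from the control plane (rather than the GUI) is what makes PaaS-driven, policy-driven, and GUI-driven provisioning all land in the CMDB consistently.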
We had policy-driven as well as GUI-driven. So for all of them, it doesn't need to be front-end-driven integration; it needs to be control-plane-driven integration with the service management capability. And then, as I talked about workforce transformation, that basically led to a support model transition. I'm going to actually show this picture. In our first Intel Cloud 1.0, it was more people going directly into one orchestration capability which somebody was manning 24/7. So you had a help desk which was basically doing a call or chat kind of session, but very little self-help and very little event-driven incident management. With this control plane implementation and OpenStack, what we have transformed — or are in the journey of transforming — is our support models: incident management becomes much lighter on people and much more about automation. Our goal is to have 80% of incidents resolved either through self-help or through event automation. And problem management, again — how do we automate that aspect of it? So the actual engineering organization, the DevOps organization, is more focused on problem management rather than incident management. That's our goal. And I'm going to go back to my foil and hand it over to Kai. Okay, thank you. I'm not from Intel — I'm wearing the green badge. We are a software partner with Intel, because Intel supports more at the hardware level, and there are a lot of channels in China, and more and more client customers require value-added services on top of the hardware. So our mission is to bring the best practices, including Intel's, to more private cloud clients, or even public cloud clients. So, if our journey to a private cloud is 100 miles, OpenStack provides 90 miles, and the OpenStack releases, like Red Hat and Mirantis, bring another nine miles. But today's topic is about production best practice — so it's all about the last mile integration.
So our mission is to integrate those last-mile best practices into a product. The first thing is we need to find out what common things we can productize. So we compared the Intel best practices with what the channel needs. First is about size. And there are a lot of other differences — they don't have self-developed authentication systems, monitoring systems, all those things. There are a lot of points we'd like to mention, but because of the time limitation we only focus on two. The first is about standardization. We are not going to create another release of OpenStack — Mirantis, Red Hat, and even HP have already done a lot of work on this. So we only base on those mature releases and add more best-practice configurations and software. The first stage — I saw another talk here about automatic deployment — is that we enable controller HA by script. This is what we have added into our package: after we deploy the machines, we can enable controller HA and also VM HA. Another thing is that even when we use the Mirantis release or the RDO release — RDO is always on CentOS or Red Hat platforms, and Mirantis uses Ubuntu, right? — we found some bugs, so we add some patches on top of that release. That makes our package more mature than RDO or Fuel. We also enable some basic monitoring — we actually use Ceilometer to enable the built-in monitoring functions. And workflow: in China, every organization always requires some approval process. So we enable a workflow, but it's not an OpenStack workflow; it's a very simple self-developed workflow. And we enable the redundancy configuration. One interesting thing we'd like to mention is the hardware standard. If you are familiar with hyper-converged infrastructure, it looks Nutanix-like, right? This is something clients are looking for if we enable OpenStack in a very simple way.
We enable hardware like this, and our OpenStack always has shared storage built in. So another software package we enable is a distributed file system. Intel's practice is built on Ceph; in our product, it's GlusterFS built in, so we can provide such a Nutanix-like OpenStack box to our clients — very simple. Another thing is our methodology for creating such an environment. The difference is that Intel's practice is focused on a production environment, but if we are going to productize our best practices for more clients, our release is a software package, not a production environment. The package is actually an RPM package, so we can release it to our Intel channel — service providers, resellers — and they can use this RPM to install at their channel clients. In the meantime, we provide L2 and L3 support and consulting services, standard services, to meet the last-mile requirements. So that's it. So — questions? No questions? Thank you. Long Fam, Orange Business Services. I have a question about migration — the migration of legacy platforms and data onto OpenStack. How do you manage to do that? Okay, so the question is, how do we migrate existing legacy application data or platforms — the VMs — to this? We actually don't migrate. Unless it's an absolute need, at least from an enterprise IT perspective, most of those applications, if they are running on existing VMs on Intel Cloud 1.0, we will continue to let them run there, but provide the control plane which will manage them — they'll have access through the new control plane. The same hypervisor is managing them, so there is no need for migrating. Anybody else, any questions? Are there any portions of OpenStack that are hard to deploy in a redundant fashion? Are there any portions of OpenStack which are hard to deploy in a redundant fashion?
For example, in the Havana release, the Neutron L3 agent is hard to multi-host, right? Did you find any sort of workaround for that? There are some patches that some people developed — you can enable that — but we don't enable the L3 agent. We just use the L2 agent; we use VLAN mode, so the VLANs can go directly through the physical switch, or outside, to avoid using Neutron's L3 agent. Thank you. Bios? Among the technical vectors, you mentioned isolation — how do you manage isolation? And about scalability, you mentioned automated deployment — which tools do you use to manage automated deployment? For the first question: we use host aggregates in Nova to separate different use cases. For example, we can use the E7 CPUs for the high-workload VMs — we put all those VMs on that kind of host — and we have some E5 or E3 CPU servers which are not so powerful, and we run normal VMs on those. Host aggregates? Host aggregates, yes. Then when you boot a VM, the VM is automatically scheduled to the right host aggregate. The second question was auto deployment: we use Puppet to do the auto deployment. Any other questions? For your OSes, did you manually set the MTU below 1500, or did you do jumbo frames on the Ethernet switches that were in between the hosts? We use VLAN, so we do not have that problem. If you use GRE, maybe you have to set the MTU below 1500, right? Does that answer your question? Okay, thank you. Anybody else? Yeah, okay. So, I talked about technical debt. When we did OCU 1.5, it took us almost six months to develop a particular capability for that hybrid cloud implementation, and we expected at that point to also add capabilities like auto-scaling, because that was a requirement for the particular use case we were working on. We realized that we were not able to meet the timeline — and then, why do that when the community is working on it anyway?
And the decision we did make was that if we do any development, we are going to contribute that code upstream and work in a larger team setting rather than working in a vacuum. So that is when we decided to go all in with OpenStack as far as any engineering goes, and not do any other integration — because we have a very small team, and that small team is focused on how to integrate with our existing infrastructure, and when we do have the time, that's when we go and start contributing externally. What percent of your workloads today run on OpenStack, and how will that look in the future? And then I had another question, but I can't remember it. So, we are in the process of transitioning our production workloads for internal-facing applications to run on OpenStack. Today, 100% of them do not run there from an internal cloud perspective, but for our external-facing workloads, about 80% of them run on OpenStack. Do you remember your second question? No. Okay. You mentioned that you are using SDN alongside the Neutron API. Can you elaborate on that? Don't you think that Neutron isn't ready for production? We only use the L2 agent plus VLAN mode, so I think it's stable enough. Is that all for questions? Okay. Thank you for coming and listening to our story. Thanks. Thank you.