Good afternoon, folks. It's late in the afternoon, but I hope to excite you a little bit. We're going to talk about how to run enterprise workloads in an OpenStack cloud: what technologies it takes, what capabilities it takes, and what challenges exist. My name is Alok Prakash. I'm a product manager in the Datacenter Software Division of Intel.

I was told to put this slide in. We were talking about transforming the business, the ecosystem, and the infrastructure, and I'll hopefully touch a little on all three of those in my talk. This is the legal notice; you can read it at your pleasure later, so I'm going to skip over it.

Let's talk about running enterprise workloads: what's happening right now, and what will it take to get there? For most IT shops, as we've gone around and talked to them, some applications have already gone to the public cloud because there was an urgent need; the line-of-business managers wanted to get it done, so they moved there. Now IT has a shadow-IT problem to solve, that is, regaining some strategic control and governance over it. On the other side, they're virtualized and want to move up from virtualization to IT-as-a-service. As these two threads come together, many enterprises want to collapse those infrastructures so they can run all of those workloads on a single one. They want to start with a private cloud initiative serving internal customers, and run all of their workloads, including enterprise workloads, on that same virtualized cloud infrastructure while meeting their service-level needs.

The key point is that enterprise workloads require much higher service levels than typical cloud-native workloads, because the developers have left the company or the application is just a packaged binary; these workloads cannot handle variations in performance and the like. Last but not least, if you have to serve a line of business and grow really fast, you need to be able to take hardware and software, fully integrated, and deploy very quickly. While that's what IT wants to do, setting up and operating a cloud is difficult, and a new set of cloud services and monitoring and management tools is needed to handle enterprise-level requirements. We'll touch on a few of these.

I'm going to talk about the three biggest challenges we've heard about in running enterprise workloads. The first is trust, and the others are performance and availability. The trust problem I'll characterize as a "nosy" problem, next to a "noisy" problem. The concern is: if I have one important virtual machine running, can I have a nosy neighbor? A cloud is, after all, like an apartment complex: you have multiple little slices, and can one VM peek into the next VM, maybe look at its data or do something to it? That's one concern, especially when going to the public cloud, but it exists to some extent in the private cloud as well, within the enterprise. Is my system running on a trusted node? Can it be compromised? Is my BIOS okay? Is my hypervisor on a trusted whitelist? Has somebody booted up something that shouldn't be there? That's one level of assurance people want from the cloud infrastructure. The other is performance.
Performance is a big challenge because today we size cloud machine instances in terms of vCPUs and memory, but a vCPU from five years ago is not the same as a vCPU on a new processor; there's a huge difference in performance. This is very visible in many public clouds: you can get an m1.large or m1.medium and, depending on the time of day and what hardware you happen to land on, see vastly different performance. Some customers we talked to used to go out and get thousands of VMs, run a little benchmark, see which ones performed well, keep those, and give back all the rest. That's an unusual set of activities for a customer to have to go through just to get their workloads running.

The other performance problem is how to avoid the noisy-neighbor problem. Say one virtual machine is running a streaming-media application that is constantly changing frames of video. It will thrash the CPU cache, and when it thrashes the CPU cache, the next VM doesn't get its pages in the cache and its performance suffers. You cannot detect that by putting an agent in the VM or by any similar means. The symptom is often a customer calling and saying, "Hey, my application is slow sometimes," and these are the hardest problems for IT operators to debug: it might be the application, it might be the cloud infrastructure, it might be a noisy neighbor, it could be somewhere else. What IT is missing is a set of tools to cut through this blamestorming and identify the probable root cause of the problem. And for most IT shops there's a related matter of trust: when I put my workload on that cloud, I don't have access to the telemetry of the platform, so how will I tell whether it was the cloud provider or my application? Those are the kinds of additional details you want the service provider to be able to surface.

It's very clear that for most new workloads, people do want to go to software-defined infrastructure. They want to be able to provision the entire application on virtual network, virtual storage, and virtual compute, and all three of those matter. But while they want to deploy workloads on software-defined infrastructure, they still have to achieve the same service levels they promise their enterprise customers today, and they have to minimize the blamestorming across this whole space: is it the virtual machine, the physical machine, the cloud infrastructure? Where is the root cause of the problem?

So I'm not going to spend too much time on high-level details; let's talk about how you can solve this problem from a usage perspective. What you want to offer the customer, in the cloud services catalog, is machine flavors that carry additional trust and performance attributes. For an important VM, I'll pick a machine flavor that must run in a trusted compute pool, where every node, as it boots, has been verified so that its BIOS, its hypervisor, everything is on a whitelist.
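Coming back to the noisy-neighbor symptom for a moment: detecting a cache-thrashing VM from the host is possible in principle with standard Linux tooling. The following is a minimal, hypothetical sketch, not the product's implementation; it samples last-level-cache misses per QEMU process using the Linux perf tool, and the process match and threshold are assumptions:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: flag a likely cache-thrashing "noisy neighbor"
by sampling LLC misses for each QEMU/KVM process via Linux `perf stat`."""
import re
import subprocess

SAMPLE_SECONDS = 5
MISS_THRESHOLD = 50_000_000  # assumed threshold; tune for your hosts

def qemu_pids():
    """Return PIDs of running QEMU processes (one per VM on KVM hosts)."""
    out = subprocess.run(["pgrep", "-f", "qemu"],
                         capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]

def llc_misses(pid):
    """Count last-level-cache load misses for one process over the window.
    perf writes its counter report to stderr."""
    res = subprocess.run(
        ["perf", "stat", "-e", "LLC-load-misses", "-p", str(pid),
         "--", "sleep", str(SAMPLE_SECONDS)],
        capture_output=True, text=True)
    m = re.search(r"([\d,]+)\s+LLC-load-misses", res.stderr)
    return int(m.group(1).replace(",", "")) if m else 0

if __name__ == "__main__":
    for pid in qemu_pids():
        misses = llc_misses(pid)
        tag = "NOISY?" if misses > MISS_THRESHOLD else "ok"
        print(f"pid={pid} LLC-load-misses={misses} [{tag}]")
```

Event names and availability vary by processor generation, and distinguishing the VM causing the misses from the VM suffering them is subtler than per-process counting, which is exactly why deep platform telemetry is called out here as a gap.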
On the trust side, we have technologies that enable that whitelist verification, and they were talked about in the previous session. So that's one example: in the cloud services catalog, I must be able to see that a flavor carries a trust level. Second, I need to be able to say this VM should have a guaranteed minimum amount of performance, and that it can burst above that if there's capacity available in the system. For that, I need to know the capacity of the physical system, and as VMs are deployed, each should be guaranteed the quota of performance it is expecting. So machine flavors need to be enhanced with extra specs carrying this data.

And if a flavor in the catalog carries that data and somebody tries to create an instance of it, you have to be able to trap that request and deploy it on the right machine. If I said I want it in a trusted compute pool, you have to honor that. So we need a filter in the Nova scheduler that traps the call, looks at the extra specs, and says: I know where the trusted compute nodes are, and I will provision your VM on one of those. And by the way, you asked for a certain amount of performance, so I'll also check the system and make sure there are no noisy neighbors there right now, so that when the VM is deployed it gets the performance it expects. It's dynamic scheduling rather than the static, round-robin-style default scheduling. Those are the requirements.

So one problem is basic resource scheduling. The other is that once the VM is scheduled and running, you have to actively monitor it to make sure nothing bad is happening on the system. You have to collect performance and usage metrics, report them back, help with diagnostics on the probable root cause, and manage your capacity so you know how many more VMs you can deploy. That's the set of problems an operator is expected to manage.

From Intel's perspective, we looked at those gaps and asked where we can help most. There are three areas we've identified and are working on. One is helping people match workload needs to platform capability and capacity: this box has so much capacity and these instruction-set enhancements, and you have to match the workload to that particular box and make sure the VM lands on that node. The second is finding and addressing software-defined-infrastructure issues: what's the root cause, which VM is the noisy neighbor, and which VM is being affected, so you can solve the problem by migrating either the noisy neighbor or the VM being affected. And last but not least, trust attestation of this multi-tenant infrastructure, so you can say there's less likely to be a nosy neighbor here because the BIOS and firmware on every node have been attested. As Intel, we have visibility into our own platform; we have deep insight.
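As a rough illustration of the filter idea, here is a minimal sketch of a Nova scheduler host filter in the style of Nova's filter API of that era. The extra-spec key `trust:trusted_host` follows the convention OpenStack's trusted-compute-pool documentation used; the attestation lookup is a hypothetical stand-in for whatever service verifies a host:

```python
# Minimal sketch of a Nova scheduler host filter, modeled on the
# filter API Nova used around this era. The attestation lookup is a
# hypothetical placeholder, not a real Nova or Intel module.
from nova.scheduler import filters


def host_is_attested(hostname):
    """Hypothetical: ask an attestation service whether this host's
    BIOS/hypervisor measurements are on the whitelist."""
    raise NotImplementedError("site-specific attestation lookup")


class TrustAwareFilter(filters.BaseHostFilter):
    """Pass only hosts that satisfy the flavor's trust extra spec."""

    def host_passes(self, host_state, filter_properties):
        instance_type = filter_properties.get("instance_type") or {}
        extra_specs = instance_type.get("extra_specs", {})

        # Flavor did not ask for trust: any host will do.
        if extra_specs.get("trust:trusted_host") != "trusted":
            return True

        # Flavor demands a trusted compute pool: verify this host.
        return host_is_attested(host_state.host)
```

Nova did ship a TrustedFilter along these lines for trusted compute pools; a performance-aware filter would work the same way, consulting a controller that tracks live capacity and contention on each host before letting it pass.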
With that platform insight, we can do telemetry at the cache level: cache contention, memory bandwidth, all kinds of things, so we can assure these problems aren't happening.

From an enterprise perspective, there are two ways to set up your cloud so it can handle enterprise workloads. If you already have a cloud running, you can enhance it with the three components I mentioned: a plug-in for the Nova scheduler, a controller that collects metrics from all the nodes, and an agent on each platform that extracts the deep platform telemetry used to identify problems. And if you don't have anything yet, you need a turnkey cloud solution that installs OpenStack along with all of these management pieces. These are the areas we're trying to address with products, technology, and contributions to OpenStack as well.

What we heard from enterprise customers is that it's not enough to throw some code out somewhere; you have to have a product, with support and service behind it. So we're launching a product very soon: Intel Datacenter Manager: Service Assurance Administrator. It will have all the elements I've talked about so far: the ability to place VMs on trusted nodes, and the ability to specify performance levels and make sure they're met. Those machine flavors can then be reflected in OpenStack. And if you have a workload definition, like a Heat template, you can reference machine flavors with these additional attributes, perhaps for your database tier or an application server that is more important than the others, and assign higher priority and more resources to those systems.

So what you can do, as I said, is intelligent machine placement with automated provisioning via those machine-flavor enhancements, scheduling of instances onto trusted nodes, scheduling to meet performance requirements, and probable-root-cause analysis. You want to be monitoring all of it. We have seen in our labs and in our experimentation that OpenStack components do fail once in a while, so you have to monitor not just your own applications and infrastructure but OpenStack as well, and be able to root-cause whether it's a noisy neighbor or simply a failed component surfacing the problem. You also need real-time analytics, because the data flowing from each node up to the controller has to be live and continuous. Traditional monitoring tools tend to land everything in a database and look at it later; that model does not work well in a cloud. You need live, continuous monitoring and analytics on the stream. And when you detect a problem, you have to be able to kick off remedial action: move the VM to a different socket if it's a cache-contention problem, or evacuate it to a different node, and if you have live-migration capability, migrate it.
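A remediation hook of that sort can be sketched with python-novaclient as it looked around this era; the detection and host-selection functions below are hypothetical placeholders for the analysis a service-assurance controller would perform:

```python
# Hypothetical remediation sketch using python-novaclient (era API).
# detect_noisy_vm() and pick_quiet_host() stand in for the kind of
# analysis a service-assurance controller would do.
from novaclient import client

def detect_noisy_vm():
    """Hypothetical: return the UUID of a VM flagged by contention telemetry."""
    raise NotImplementedError

def pick_quiet_host():
    """Hypothetical: return a host with spare capacity and no contention."""
    raise NotImplementedError

# Old-style positional auth: version, user, password, project, auth URL.
nova = client.Client("2", "admin", "password", "admin_project",
                     "http://controller:5000/v2.0")

server = nova.servers.get(detect_noisy_vm())
# Live-migrate the noisy (or suffering) VM to a quieter node.
server.live_migrate(host=pick_quiet_host(),
                    block_migration=False,
                    disk_over_commit=False)
```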
And last but not least, what you want at the platform level is this: as a server boots, run a little benchmark that characterizes the capacity of the system, so you know this box has this much compute capacity. Then, as you allocate VMs, you specify in that same unit what is dedicated to each virtual machine and what it can burst up to, so you have clear visibility into the capacity of the nodes and how much of it is consumed, reported out to the operator. Based on that, you can also report on status. For example, suppose your compliance requirement, your IT policy, says these VMs must run on a trusted node. The system would keep a history showing that every time a node booted, its attestation was verified, and you can generate a report from that; it may help with your IT compliance, or, in a regulated environment, with some of your compliance and audit obligations. That's the gist of what we're building and will be announcing soon.

To summarize: to run enterprise workloads in the cloud, people want to use software-defined infrastructure built on OpenStack, and you want to enhance OpenStack to provision and monitor machine flavors that specify target service levels. All of this has to happen in an automated way, without requiring an operator in the middle of it. Integration is just as important: if all of these tools, technologies, and data are available, you have to be able to integrate them with your existing IT monitoring and management systems. So whatever product enables this capability must have a REST API so you can grab the data easily, and it needs a web console for the enterprise customer who doesn't want to write programs and just wants to use the embedded console to do all of this management.

That's a quick summary of what's needed to run enterprise workloads in an OpenStack cloud. Anybody have any questions? We have a demo of the product available as well; we had it on the floor. If you want more information or to see a demo, contact me directly and I can show it to you.

Q: We're at the very beginning of our journey with OpenStack. Our enterprise has a lot of "pets," and they're very latency-sensitive, because we're coming from a traditional virtualized environment where you can allocate resources and define exactly how much a workload will get. What about parity in OpenStack? I know KVM allows you to pin CPUs, but is there visibility through OpenStack to do that, so you can guarantee that an app that needs dedicated access to its cores actually gets it and isn't overprovisioned?

A: Some of those capabilities are available through cgroups, and in our products we use those capabilities. Now, your QoS requirement may go beyond compute; it also comprehends IOPS for your storage and networking QoS, and we are working on those technologies too, on a roadmap basis.
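To make the cgroups reference concrete, here is a minimal, hypothetical sketch of CPU pinning through the cgroup-v1 cpuset controller, one of the mechanisms underneath what KVM/libvirt exposes. The paths and group names are assumptions that vary by distribution and version:

```python
# Minimal sketch: pin a KVM guest's threads to dedicated host cores
# via the cgroup-v1 cpuset controller. Paths follow a common libvirt
# layout but vary by distro/version -- verify on your own host.
import os

CPUSET_ROOT = "/sys/fs/cgroup/cpuset/machine.slice"  # assumed layout

def pin_vm(vm_cgroup, cpus, mems="0"):
    """Restrict every task in the VM's cgroup to the given host CPUs."""
    path = os.path.join(CPUSET_ROOT, vm_cgroup)
    with open(os.path.join(path, "cpuset.cpus"), "w") as f:
        f.write(cpus)   # e.g. "4-7": four dedicated cores
    with open(os.path.join(path, "cpuset.mems"), "w") as f:
        f.write(mems)   # keep memory on the matching NUMA node

# Example: give this guest exclusive use of cores 4-7 (group name is
# hypothetical; list your host's machine.slice to find the real one).
# pin_vm("machine-qemu\\x2d1\\x2dguest1.scope", "4-7")
```

Later OpenStack releases exposed dedicated CPU placement natively through flavor extra specs (hw:cpu_policy=dedicated); at the time of this talk it was largely handled below Nova.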
A (continued): For now, we've focused on the compute side, and I think we can meet the requirements, but maybe we can have a conversation later and I can have an architect walk you through exactly how, and what's possible.

The question was: will our tools give a view of the processes running inside the VM? No. At this point the tool is geared toward the provider, the service provider or the IT operator, who is not allowed to see what's inside the VM, because there may be sensitive data running in the virtual machine. The gist of what this tool allows you to do is this: to run enterprise workloads today, you run on a dedicated machine, and what you want as you move into a shared environment is the same dedicated-machine-like trust and performance from a virtual machine. That's what the tool does. It doesn't know anything about the application running inside, but it knows how that VM is using the machine and whether it's causing issues. It's an operator's and administrator's tool, not an end-user application tool.

Q: I had a couple of questions. One is with regard to security: you said you may have some highly sensitive workloads that you can provision in a more protected, secure way. Is all of that done by the operator, securing certain hosts to a higher level than others, tagging them, and placing workloads accordingly? Or does the software do it?

A: I understand your question: is it a manual process, or is it done by the software? If you saw the sessions before, there's a technology called Intel TXT available in the platform. It's basically a trusted way of executing code that cannot be tampered with by what's running above it, including the hypervisor. This way you can verify, as the system comes up, that the BIOS and hypervisor signatures match known-good values, and we have technology built into our software that helps you provision and set that up. If your system is TXT-capable, we can set it up with the trusted capability. You may end up with two pools, one trusted and one not, and then when you deploy a VM you have the choice to say: run it in the trusted compute pool.

Q: Okay, and is this something enabled at the chip level, or do you write system kernel software to do it?

A: Trusted compute pools require a TPM on the platform, and multiple vendors already ship TPMs; if you want, I can provide you a list. Our product will work automatically, as far as deploying it is concerned, as long as the platform is TXT-enabled. You can go to our GUI console and simply mark it as a trusted node, and we provision it with the right setup.

Q: And a related question: for service performance assurance, is it more reactive, in that you're constantly monitoring because apps can vary and you can have the noisy neighbor? Or are you actually providing isolation at the kernel or chip level?

A: No. With the current generation that's out right now, we're monitoring only.
Future Intel platforms will have additional capabilities that let us ensure no violation occurs, but that's not the current generation. Currently we are monitoring, so you will have ample time to know how to intervene before the problem grows.

Q: Hi. When thinking of target service-level objectives, what kinds of metrics do you plan to support? Just CPU and memory utilization, or also network bandwidth, storage IOPS, and so on?

A: In the current version of the product, we focus on the compute side: trust status, whether a VM is running on a trusted node or not, and performance in terms of a compute metric. We've defined our own compute metric that is portable across processors. What you want to be able to do is characterize the capacity of the system in a way that is aware of the frequency of the processor, the generation of the processor, and how much cache is in the box, because those things all affect your performance on one generic benchmark. Those are the kinds of things we look at, and we produce graphs and charts of what's going on, per VM and aggregated at the host level, so you can say "that box is noisy" or "that VM is suffering." Any other questions or feedback for me, or any requirements, if you want to try this out?

Q: I see that the Datacenter Manager includes a smart placement engine. What is the difference between the smart placement engine and the Nova scheduler?

A: Smart placement, okay. What we have is a plug-in to the Nova filter scheduler. When a user asks for a machine instance of a specific flavor, the extra specs carry these additional attributes. At that point the filter scheduler asks our controller to find the right machine, and our controller has a database of live, streaming, real-time data about what's happening on each of the boxes under its control. It knows where the noise is, where capacity is available, and which nodes are trusted and untrusted, and it can direct the VM to the right node. So it's more dynamic in its VM placement.

Q: Okay. Do you also support runtime policy? I mean, if some hardware's load is very high, we may need to migrate VMs from the high-load host to a low-load one. Does your product support that, some priority per VM?

A: Yeah. Right now there are schemes for doing it, and it's a longer answer, so we can chat offline. If it's an important VM, you can always give it more dedicated capacity; that's the easy answer. But we have thought through the process of assigning higher priority to some VMs over others. It's a longer conversation, not an easy answer, but it is possible to do.

Q: So does the product also support capacity planning?

A: Well, no. You have the normal IT capacity-planning tools; this is capacity planning from a compute-performance perspective. That's the data we're producing, so it's a complement to your existing data, not a replacement for what you, as an IT shop, would already use.
But if you're building a product for, say, a blade or rack system, a cloud-in-a-box type of solution using OpenStack, there we can talk about making something that's more like a capacity planner for that box. So the scope is important.

Q: Does the capacity-planning support tell customers how to deploy a VM before they deploy other VMs?

A: Yes. What you would see on our screen is the capacity, the way I described the product working: it runs a benchmark when the system boots for the first time and characterizes what capacity is available, and we show that for the node, for the host. Then, as you allocate VMs, you can go to the monitoring console and see what portion of the host is taken up and what capacity remains. On a per-VM basis you can allocate a certain dedicated block and a certain burst capacity, and you can see at what level of capacity that particular VM is running. So both views are available.

A: I'll have to connect you to an architect to answer the specifics of that. We look at cycles per instruction, cache contention, things we get out of cgroups; there's a whole bunch of metrics there. There are also performance counters generically available on Intel platforms that most people don't know how to use. They're not proprietary, but they're difficult to use unless you have the knowledge, and we take advantage of some of those platform capabilities as well. But I can get you an architect who can answer more specifically exactly what we use.

A: Yes. Most of it is pushed to our management controller, which looks at the streaming data and does the analysis.

A: Do we upstream it? The plan is to upstream a lot of these capabilities, but the question is which happens first. Right now we're in the mode of: let's implement it, get it running, get it in the hands of people, and then we'll figure out the model for upstreaming it. The first version will be out very soon, in a matter of a few weeks.

A: No, it's independent of the release. The product will work on current-generation processors, so it's not tied to a processor, though when a new processor comes out we will be there, time to market, to take advantage of all the new things in the newer processors. It's a separate product right now, a supported commercial product, though I get the exact opposite reaction from some other people. But yes, I think that's what enterprise customers want: a product with somebody standing behind it.

Any other questions? If not, I hope this was useful and that you'll find this technology useful in your enterprise.
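As a closing illustration of the boot-time capacity characterization discussed above, here is a toy sketch that reads basic processor attributes and runs a tiny benchmark to produce a normalized score. The formula and "compute units" are invented for illustration; the product's actual metric was not disclosed in the talk:

```python
#!/usr/bin/env python3
"""Toy sketch of boot-time capacity characterization: read processor
attributes and time a small benchmark to score the host in made-up
"compute units". Illustrative only; not the product's actual metric."""
import re
import time

def cpu_attributes():
    """Pull model name, logical core count, and cache size from /proc/cpuinfo."""
    info = open("/proc/cpuinfo").read()
    return {
        "model": re.search(r"model name\s*:\s*(.+)", info).group(1),
        "cores": len(re.findall(r"^processor\s*:", info, re.M)),
        "cache_kb": int(re.search(r"cache size\s*:\s*(\d+)", info).group(1)),
    }

def micro_benchmark(iterations=5_000_000):
    """Time a fixed arithmetic loop; lower elapsed time = faster core."""
    start = time.perf_counter()
    acc = 0
    for i in range(iterations):
        acc += i * i
    return time.perf_counter() - start

if __name__ == "__main__":
    attrs = cpu_attributes()
    elapsed = micro_benchmark()
    # Invented normalization: per-core speed times core count.
    units = round((1.0 / elapsed) * attrs["cores"], 1)
    print(f"{attrs['model']}: {attrs['cores']} cores, "
          f"{attrs['cache_kb']} KB cache -> {units} compute units")
```

Scoring in one portable unit like this is what lets a catalog express "dedicated" and "burst" allocations per VM in the same currency as the host's measured capacity.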