My name is Pete Chadwick, and I'm a Senior Product Manager at SUSE. I'm responsible for our cloud infrastructure products, which, not surprisingly, include an OpenStack distribution. Joining me today will be Alok Prakash from Intel. The agenda says "pets versus cattle," but that's not really the intent, because we don't think this is a pets-versus-cattle discussion. We think people want to run pets on cattle, and at some level every application, every workload, is somebody's pet. This image I found really highlights what we're talking about: people want to deploy OpenStack and run applications that are, quite honestly, enterprise workloads. They may or may not be tiered or Web 2.0 applications, and they may not be designed for the cloud. But even if they are, the question is, once you start putting something into OpenStack, how do you make sure you can meet the same kind of enterprise SLAs you've been meeting for the last 15 years with non-cloud or more proprietary solutions?

So when we talk about an SLA, there are really two components to it. The first is that you need reliable infrastructure. One of the things I always talk about is that in the traditional enterprise way of the world, failure is not an option. You talk to a lot of cloud people and they'll say, you're right, failure is not an option, it's a feature. The sense of that is that I'm taking the responsibility for ensuring reliability away from the infrastructure layer and telling the application developer that he now has to make sure his applications are built to assume that failure is going to happen at the infrastructure level. I'll be honest, a lot of enterprise application development teams are scared to death of that, because it's an entirely different way of thinking about how they develop applications. Scale-out developers are fine; they understand how to do that, they've been doing it in the cloud. But there are a lot of teams that just don't want to make that leap. So you need to make sure there's a reliable infrastructure on top of which you can start to build things out.

Clearly applications need to be able to handle compute node failures, but that's really just the minimum stakes to get into the business. More important is how I make sure my cloud is delivering the resources I need to maintain my service level agreements. I need to be aware of the services, I need to be able to detect hot spots, I need to be able to monitor load, and I need to be able to trigger actions, through Heat or something like that, based on what's going on in my infrastructure. So the structure of this conversation this morning is that I'm going to talk a little bit about some of the minimal things you need to do in OpenStack to get a level of reliability at the infrastructure layer, and then Alok is going to talk about some of the things Intel has been looking at and pushing upstream into OpenStack around service awareness.
So first of all, the first question you really have to ask yourself: look at a typical cloud deployment. This is based on the assumption that people are using some kind of a distribution; certainly SUSE's product, and I think most of our competitors' products, include some sort of deployment node. We call ours the administration server. It uses Chef and Crowbar, it does the physical orchestration of the infrastructure, and then it deploys the OpenStack services on top of that infrastructure. So when we looked at this and said we're going to start doing high availability, the first question we wanted to ask ourselves was: what are we trying to protect? Are we trying to protect the administration server? The control plane? The guests? Quite honestly, we looked at the administration server and said that's something you use when you set things up, or maybe when you're upgrading your system. If that goes down, it's not going to affect your users or your day-to-day operations, so it doesn't make a lot of sense to spend a lot of time focusing on it. Guests are a hard problem. We're looking at ways to do that; I'll be honest, we're not there yet, and it's an area where we're working with partners to understand how to do it better. There are ways you can do it with existing technologies, but it's perhaps not as straightforward as your application developers and end users might like.

So we decided to focus on the control plane. What does that mean? If you look at the control node, and that could be multiple control nodes, it's where all the OpenStack services run: the database, the message queue, Nova, Glance, Cinder, all those things. That's what we focused on making highly available. And it's actually a pretty straightforward solution. Since OpenStack runs on Linux, it's just a set of services that are Linux-based applications, so you can use all the traditional Linux high availability technologies to solve the problem. And just so you understand, Linux high availability technologies are doing things like keeping air traffic control systems and manufacturing plants up and running. It's a solid, accepted technology that is being deployed today for mission-critical applications, not some strange, completely leading-edge technology we should be worried about. Pacemaker, Corosync, and HAProxy are the technologies we use, with shared storage for the database and message queue. By deploying those, you can easily get a completely robust and reliable control infrastructure for OpenStack.

Now, a quick question: why is that important? Well, if your control node goes down, pretty much your cloud goes down. In the first releases of OpenStack that was not too painful; your VMs would typically stay up. But now, if you're deploying Neutron, you potentially lose your control node, Neutron goes down, and your VMs lose all connectivity. If you've got 10,000 VMs out there running and all of a sudden they disappear, your customers might want to know why. So at minimum, the simplified structure that we go out with for customers looking at proofs of concept or early pilots is a simple two-node cluster. You set it up with Pacemaker.
You have the two servers talking with each other, you have the resource agents managing all the control services, and that gets you a pretty reliable infrastructure. Now, quite honestly, that's not the way we recommend doing it. Most of our customers who are doing pilots actually run three clusters. The reason is that the database cluster runs best in a two-node environment with shared storage behind it. The shared storage could be something like a NetApp box or some other SAN; it does not have to be any kind of specialized storage. So you put RabbitMQ and Postgres on that. Because of traffic load, we tend to recommend putting Neutron on its own cluster right now. As technologies like DVR become more mature and people start implementing them, that may become less of an issue, but right now Neutron is in the data path for networking, so if you've got a lot of traffic going through, you want a number of servers that can run active-active to handle it. So we recommend putting that on a separate cluster, and, without going into a whole HA discussion, doing it on an odd number of nodes so you can maintain quorum and guard against fencing problems if one node should fail. And then we recommend another three-node cluster for every other OpenStack service. In general, once you've done that, you're pretty well protected: if any single physical server were to go down, the rest of your cloud would stay up and continue functioning as normal.

And just as a plug for SUSE, we've actually automated all of this through our deployment technology. You configure the cluster, then you just install services on the cluster, and they're automatically configured with the appropriate resource agents, so you're up and running in a highly available fashion without having to go through all the plumbing needed to get this stuff working together. We've got a booth downstairs; if you're interested in more on that, we can certainly walk you through it and give you the magic key that can get you set up with a highly available cloud in a few minutes. So with that, in terms of the infrastructure, what I think is really the more interesting part is: what about the workloads? What about the services I want to deliver? And for that I'll turn it over to Alok.

That picture of the dog sitting on top of the cow was very nostalgic for me from my days in India. I'm Alok Prakash from Intel Corporation, and I'm going to talk to you about the problems people face when they're trying to run pet workloads in the cloud. We've been talking to customers over the last two years, and the problems most people were worried about when trying to get pet workloads to run on the cloud, whether it's public cloud or private cloud (and we would encourage people to do private cloud), are these. The first question is trust. It's a multi-tenant, multi-workload environment. Can I trust that the system has not been compromised? The BIOS is okay, there's no rootkit there, the hypervisor is okay. That's one class of problem, because I'm multi-tenant. The other one was performance: can I have a noisy neighbor? I'll illustrate what those two problems are.
And then the third piece is: how can you assure me that those two problems are not happening on the node as I run my workloads there? So let me take them one at a time.

Let's talk about performance assurance. You get a box; the first thing you need to know is what performance you're going to get out of that system. Today people use vCPUs, things like that, but a vCPU on an older-generation processor is not the same as a vCPU on a current-generation system, so it's not a very meaningful number for performance. That's one big problem: how do I create a normalized compute unit? I get a box, I know it has so many service compute units, or normalized compute units, and I can divvy it up into VMs. So that's the first problem: how do I know the capacity? The second performance problem is that you can have a noisy neighbor. What I mean is that, in a system, you're sharing cache. You may have a VM that is, say, streaming video, a media server encoding and decoding something, so it's going to thrash the cache, and the performance of the next VM is going to be affected. Today's monitoring tools usually do not monitor at that level, so you would not detect when you have a noisy neighbor problem, and you would not be able to prevent it or take remedial action. Those are the big problems when running multiple VMs on a node.

Similarly for trust: as I mentioned, how do you know that the system has not been compromised? Can you run a piece of code in such a way that it cannot be tampered with, create a whitelist that says this is the right configuration, and then every time the node boots, make sure it still matches and has not been tampered with? So these are the two big challenges, and we at Intel have been addressing them. For the first one, from a performance perspective, we have capabilities in the platform to monitor the cache. We are able to look at performance in terms of instructions per cycle, correlate that with cache usage and cache misses, and tell you where the contention is: this VM is being aggressive, and those VMs are being affected. That's a key capability, we're making it available, and we have proposed blueprints to make it part of OpenStack. So those are the two problems: the nosy neighbor problem and the noisy neighbor problem. For the nosy neighbor, we have technologies in our platform, Intel Trusted Execution Technology, where we can run a piece of code that measures the BIOS and the core components of the hypervisor, creates that whitelist, and then at runtime you can compare against it and make sure nothing has been compromised.

So we did some experiments and built a system that can plug into OpenStack, and I'll show you how we did the assurance piece. It has essentially three components. You've got an agent that collects deep platform telemetry from the system and does the analysis; it looks at all of the data and is able to detect when that noisy neighbor problem is happening. It also sends the data to an assurance controller, which is a KVM virtual appliance. And we have a plug-in for the Nova scheduler, so the plug-in is able to hook into the scheduler's filter chain.
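To make the agent side concrete, here is a minimal sketch, in Python, of the kind of heuristic such an agent could apply. Everything in it is a hypothetical stand-in: the `VmSample` structure, the `find_noisy_neighbor` function, and the fixed thresholds are illustrative only, while the real agent works from Intel's platform telemetry (per-VM cache occupancy and instructions per cycle) and statistical analysis rather than hard-coded cutoffs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class VmSample:
    """Hypothetical per-VM telemetry sample (e.g. from cache monitoring counters)."""
    name: str
    cache_occupancy_mb: float   # share of the last-level cache this VM currently holds
    ipc: float                  # instructions per cycle observed right now
    baseline_ipc: float         # IPC observed when the VM ran uncontended

def find_noisy_neighbor(samples: List[VmSample],
                        ipc_drop_threshold: float = 0.25,
                        occupancy_share_threshold: float = 0.5
                        ) -> Optional[Tuple[VmSample, VmSample]]:
    """Return an (aggressor, victim) pair if contention is suspected, else None.

    Heuristic: a VM whose IPC has dropped well below its uncontended baseline is a
    candidate victim; the co-resident VM holding the largest share of the shared
    cache is the candidate aggressor.
    """
    total_occupancy = sum(s.cache_occupancy_mb for s in samples) or 1.0
    victims = [s for s in samples
               if s.baseline_ipc > 0
               and (s.baseline_ipc - s.ipc) / s.baseline_ipc > ipc_drop_threshold]
    if not victims:
        return None
    aggressor = max(samples, key=lambda s: s.cache_occupancy_mb)
    if aggressor.cache_occupancy_mb / total_occupancy < occupancy_share_threshold:
        return None
    victim = min(victims, key=lambda s: s.ipc / s.baseline_ipc)
    if aggressor.name == victim.name:
        return None
    return aggressor, victim

# Example: a streaming/encoding VM hogging the cache while another VM's IPC collapses.
if __name__ == "__main__":
    samples = [
        VmSample("media-encoder", cache_occupancy_mb=18.0, ipc=1.4, baseline_ipc=1.5),
        VmSample("pet-database",  cache_occupancy_mb=2.0,  ipc=0.6, baseline_ipc=1.2),
    ]
    result = find_noisy_neighbor(samples)
    if result:
        aggressor, victim = result
        print(f"aggressor={aggressor.name} victim={victim.name}")
```

The point of the heuristic is simply to pair a VM whose IPC has collapsed with the co-resident VM holding most of the shared cache; the remediation, pinning or evacuating one of the two, is described next.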
So when a customer comes in and says, give me a VM, make sure it runs in a trusted compute pool with so much performance in terms of normalized compute units, those extra spec requests are trapped by the plug-in. It's in the filter chain. It will look at all the systems and say these systems do not have the trust attestation, so they're filtered out. It can look at the cache contention on each of the remaining systems, weigh all those nodes, and give you the node that has the least contention, so your VM lands where it gets the best performance possible. That's how we've instrumented it. If you want to see the technology, it's available downstairs in the Intel booth and we can demonstrate it for you.

There are other aspects that come in as well. One example: say a fan has failed on a server, it's running hot, and the temperature is going up. You don't want the next workload landing on that system, so you want to be able to detect when a system is in an unhealthy state and avoid it at scheduling time. We have proposed a blueprint for that as well. Those are the key ways you can enhance OpenStack to run pet workloads and be assured that you have neither a nosy neighbor nor a noisy neighbor.

What I have here are a few screenshots to show how we implemented it in our controller. One piece was extending the flavors with burst capacity. If a host has, say, 20 service compute units, then for each of my pet workloads I may want to say it should have a minimum of one compute unit always guaranteed, and maybe burst up to two, or more. So you can give a range: what's guaranteed and what's burstable. We also have capabilities like core pinning, so you can say, for this VM, pin it to a core, and it's assured there will be no noisy neighbors; we can then monitor the cache across all of the pinned cores. In this picture you can see one of the nodes has a little lock icon to indicate it's a trusted node: it has a TPM, and we have this piece of code that is able to attest that the node doesn't have any issues from a trust perspective. Similarly, from the performance point of view, we can detect how many service compute units are available on the node and how many are being utilized by all of the VMs on it, and we can look at the contention level as well. If you want to avoid the noisy neighbor problem, you may want to pin the cores and continue to monitor the cache, and when you detect a problem you want to be able to tell which is the noise maker and which is the affected VM and then take remedial action: you can move either one of them, evacuate it from the node. And beyond basic VMs and noisy neighbors, you may want to run Ceph, for example, or some other application services on the node. You want to make sure those applications also don't soak up a lot of the resources or affect the performance of your VMs if you're running both, so you have to be able to quarantine both. That's also a capability we've experimented with, and you can see it in the demo downstairs.
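As a rough illustration of how such a plug-in hooks into the scheduling path, here is a minimal, self-contained sketch. The class names, the `trust_attested` and `cache_contention` host attributes, and the `assurance:trusted` extra-spec key are hypothetical stand-ins; a real plug-in would subclass Nova's scheduler filter and weigher base classes rather than these standalone classes.

```python
# Standalone sketch; a real plug-in would subclass nova.scheduler.filters.BaseHostFilter
# and nova.scheduler.weights.BaseHostWeigher instead of these illustrative classes.
from typing import Dict, List

class HostState:
    """Minimal stand-in for Nova's per-host scheduling state."""
    def __init__(self, name: str, trust_attested: bool, cache_contention: float):
        self.name = name
        self.trust_attested = trust_attested      # result of boot-time attestation (TXT/TPM)
        self.cache_contention = cache_contention  # 0.0 (quiet cache) .. 1.0 (saturated)

class TrustFilter:
    """Drop hosts that have not passed attestation when the flavor asks for a trusted pool."""
    def host_passes(self, host: HostState, extra_specs: Dict[str, str]) -> bool:
        if extra_specs.get("assurance:trusted") == "true":   # hypothetical extra-spec key
            return host.trust_attested
        return True

class ContentionWeigher:
    """Prefer hosts with the least shared-cache contention."""
    def weigh(self, host: HostState) -> float:
        return 1.0 - host.cache_contention   # higher weight means a better placement

def schedule(hosts: List[HostState], extra_specs: Dict[str, str]) -> HostState:
    f, w = TrustFilter(), ContentionWeigher()
    candidates = [h for h in hosts if f.host_passes(h, extra_specs)]
    if not candidates:
        raise RuntimeError("no host satisfies the trust requirement")
    return max(candidates, key=w.weigh)

# A flavor asking for a trusted host gets the attested node with the quietest cache.
if __name__ == "__main__":
    hosts = [HostState("node-1", trust_attested=False, cache_contention=0.1),
             HostState("node-2", trust_attested=True,  cache_contention=0.7),
             HostState("node-3", trust_attested=True,  cache_contention=0.2)]
    chosen = schedule(hosts, {"assurance:trusted": "true"})
    print(chosen.name)   # node-3
```

The shape matches what was just described: filter out hosts that fail attestation, then weigh the survivors by cache contention so the VM lands on the quietest attested node.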
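The flavor extension can be sketched the same way. The extra-spec keys below (`scu:guaranteed`, `scu:burst`, `scu:pin_cores`) are hypothetical names for the concepts on the slide, a guaranteed floor and a burst ceiling expressed in normalized service compute units plus optional core pinning; in a real deployment such keys would be set on a flavor (for example with `nova flavor-key <flavor> set ...`) and interpreted by the scheduler plug-in.

```python
from typing import Dict

# Hypothetical extra specs for a "pet" flavor: guaranteed vs. burstable capacity
# in normalized service compute units (SCUs), plus optional core pinning.
pet_flavor_extra_specs: Dict[str, str] = {
    "scu:guaranteed": "1",    # always reserved for this VM
    "scu:burst": "2",         # may burst up to this when the host has headroom
    "scu:pin_cores": "true",  # pin vCPUs to dedicated cores to avoid noisy neighbors
}

def host_can_admit(host_total_scu: float,
                   host_guaranteed_scu_in_use: float,
                   extra_specs: Dict[str, str]) -> bool:
    """Admission check: the sum of guaranteed SCUs must never exceed the host's
    capacity; burst capacity is allowed to oversubscribe."""
    guaranteed = float(extra_specs.get("scu:guaranteed", "0"))
    return host_guaranteed_scu_in_use + guaranteed <= host_total_scu

# A 20-SCU host with 18 guaranteed SCUs in use can still take this pet VM;
# one already carrying 19.5 guaranteed SCUs cannot.
if __name__ == "__main__":
    print(host_can_admit(20.0, 18.0, pet_flavor_extra_specs))   # True
    print(host_can_admit(20.0, 19.5, pet_flavor_extra_specs))   # False
```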
So I'm not going to walk you through all of these screens. My intent here is just to show you the class of problems people are thinking about. General monitoring tools do not have this capability, so we are trying to make sure the nosy neighbor and noisy neighbor problems get addressed and that the capability gets upstreamed into the OpenStack distributions. We have blueprints in place, so you can check those out and help move some of these capabilities forward. I've talked about some of the capabilities we've been working on, detecting the noisy neighbor and the victim VM. We're having some technical difficulty moving the slides, there's one slide that does not want to present, so let's keep moving forward. From an Intel perspective we have a lot more capabilities that we are upstreaming, and we are usually among the top ten contributors, but for this talk the key message I wanted to give is those two big problems, the nosy neighbor problem and the noisy neighbor problem. That's what we are actively working on, and we have blueprints for normalizing the compute unit, so you can just take a VM, say this is what it should get, and it should work the same regardless of which system you run it on.

So if you can look at the blueprints and support us, that would be great. You will see the ones I mentioned were submitted by our architect, Murali Sundar. We have submitted one blueprint for the normalized compute unit, and I'm hoping all of you will support it and make it happen. We've submitted one on platform health: as I mentioned, if a system is not in a good state, we can use IPMI to detect its health and make sure the Nova scheduler and the scheduling algorithms are smart enough to avoid those nodes. And of course there's the contention work I mentioned: how do you know who the noise maker is, who the aggressor and the victim are, and how do you identify those VMs. We do that with statistical algorithms as well as capabilities in the Intel processor that work at that level.

Okay, with that, I think that was what we wanted to share with you. If there are questions for either Alok or myself, feel free to raise your hands, or if you're curious to find out more, the demos are available in the Intel booth, so I invite you to come over and check them out.

What kind of SLAs do you think you're...

I hesitate to declare numbers of nines, but certainly that infrastructure is being used by air traffic control, so in general it's pretty reliable from a physical perspective. In terms of the monitoring, I think it's really a question of, once you detect something, what kind of remediation you want to put in and how fast you want to react to it.
But in general, we're pretty comfortable that you can meet the SLAs you need to meet. And if you're a service provider and you want to offer services, you can say, here's my normal VM if you want to rent a VM, but if you want the VM to run only in our trusted compute pool, with so much guaranteed performance, here are additional SLA attributes you can add on and price higher. So you can use the tools we're talking about to charge customers for that, provide that service, and give them the security and safety they're looking for for their pet workloads.

Okay, we'll repeat the question. The question was: in a regular production environment you may do BIOS updates, things like that, so how does the customer get back the information about when attestation was done and what the trust level was? In our demo, if you come down to the booth, we can show how it's done, but essentially you would create a log of the attestation status every time the system booted, and then generate a report that says: your VM ran on these nodes, and all of these nodes were always attested, meaning they matched a whitelist of known-good BIOS, hypervisor, and other parameters you might have specified. I highly encourage you to come to the Intel booth, and we can spend more time explaining how it works and all the technologies involved.

So right now we have submitted the blueprints, but if you want to talk to us about contributing to them, we can talk offline. I think the question was really, can he get hold of the code to do some testing? Like I mentioned, in the demo all of the code is integrated, but we are in the process of breaking it apart so it can be turned into blueprints. It's not ready right now, but it will be available very shortly. Our architect who proposed these things is here; Murali Sundar is our principal engineer, and he might be able to chat with you and figure out how to get you that code.

So the next question is, are there any specific enterprise workloads that we would discourage anyone from running? I'll be honest, our initial assumption going into discussions with customers and proofs of concept was that there clearly was a class of applications they would not want to migrate, and we have been surprised at how aggressively customers are looking to migrate workloads. Some of that is just that they go through the testing process, see what works and what doesn't, and clearly identify low-hanging fruit initially: workloads they feel comfortable moving, or brand-new greenfield applications they can write to be cloud-aware. But most of them have a vision of migrating their application workloads over at some point, and they assume we're going to evolve the technology rapidly enough, either through these kinds of technologies or through partners working on fault-tolerant capabilities at the guest level, to address almost any workload at that point. I guess the other way I'd answer the question is that there are two ways to look at OpenStack.
The first way to look at OpenStack is that it's a way for me to deploy cloud infrastructure for cloud-aware applications. The other way to look at it is that it's a great way for me to automate the orchestration of my infrastructure in a fairly lightweight fashion. And if you start to take that second view, that the data center of the future is going to be OpenStack APIs and everything below doesn't matter, then you start to say, I'm going to move all my workloads. We have customers that are starting to look at it that way.

Okay, if there are no more questions, I'll give you ten minutes back. We'll stick around here for a little while if you want to come up and ask questions, and obviously come down to either the Intel booth or the SUSE booth; we're more than willing to have further discussions with you. And free USB keys to set up OpenStack in a hurry. In fact, Pete, I'll take that.