Seriously, you can't sit down? There are at least six or seven seats on this side, and a couple up in the front. Shuffle in, squeeze in. So I totally thought this started at 1:40, and there were like four people here. I was like, great, we'll just have a little private Q&A. No, I have to give the talk now. It's all right. How many of you saw me speak two hours ago? Okay, how many of you know who I am? The same people. All right, that's not helpful.

My name is Joshua McKenty. I was the lead architect of the team at NASA that built Nova. The first launch of OpenStack, well, pre-OpenStack, the first launch of some of the OpenStack source code, was on my blog in May of 2010. So I've been doing OpenStack for four years, and it's only two years old. And I'm so over OpenStack now. I'm kidding. I just like to be thinking about what's next. I gave a talk in July at OSCON, which I borrowed some of these slides from, about the future of computing and how it was happening inside OpenStack. But I wasn't very specific, so I thought I'd come up with a few more specifics and talk about them here, where it really matters. This is the strategy track, and I like the strategy track because it means I get to say whatever I want. I don't have to prove it. It's a strategy.

I've started using the term para-cloud to refer to breaking down the wall between infrastructure as a service and the things that happen above it, in the sense that para-virtualization is when the guest is aware that it's inside a hypervisor, so it can do all the right things. But the guest can't know that unless the hypervisor tells it. It's a funny tension: to take advantage of para-virtualization, you need drivers in the guest, but you need support in the host. And to take advantage of para-cloud, you need to change everything in your infrastructure-as-a-service layer so that the things that happen above it can take advantage of those changes. But to know what we need to change, we actually need a conversation between those two communities. That's really the strategy behind this idea: those two communities need to start talking to each other.

When I talk about performance, I am not talking about efficiency. I generally hate the idea of efficiency. Those of you who've heard me speak before know I always use this quote. Russell Ackoff was one of the fathers of systems thinking, and the idea behind systems thinking is that in a complex system, the behavior of the whole system depends much more on the relationships between the components than on the performance of any single component. Put another way: if we spend all of our time optimizing a single component of OpenStack, OpenStack as a whole will get worse, not better. We need to optimize the connections between those components. We need to focus on doing the right things, instead of doing things right.

All right, so this idea of cloud-aware applications, of breaking down the wall between, let's say, platform as a service and infrastructure, or orchestration, or automation: first off, why do we want this at all? Why would it be a good thing? Then, what are the challenges in achieving it, and what is our strategy to overcome those challenges? It's a very boring format for a talk, but this is the only slide that will make any sense, so that's okay. We'll still have a good time.
For those of you who were there this morning, I did emphasize that I always offend somebody any time I open my mouth, so I try to be an equal-opportunity offender. I will try to offend all of you rather than just a few of you.

I use the term elasticity very broadly. When we started talking about infrastructure as a service at NASA, the idea was that elasticity management was any tool that launches VMs for you, basically. So why would you want that? Well, your application is consuming a whole bunch of resources for a brief window of time, so you need more VMs. Or it's not doing anything at all, and those VMs should be scaled back down. It doesn't have to be about load. Elasticity can also be a response to policy concerns, or to resource scarcity. In other words, I might scale your application down so that I can scale somebody else's application up, because it's more important than yours. Elasticity is any characteristic of your cloud infrastructure that deals with scaling the runtime components up and down, not the resource pools. When we talk about the resource pools, we're talking about capacity management. So this is about the running processes.

And this is my equation for elasticity. Told you this wouldn't make any sense. That's a spork, for those of you who are not keenly aware. The best thing about a spork is you only have one: you have taken two utensils, and now you have one utensil. When you're camping, sporks are perfect. What you've done, basically, is taken the best parts of a spoon and the best parts of a fork and given them a single common interface. When we talk about elasticity management across a set of resources, what we're really talking about is generalizing the metaphor for those resources and their allocation, the same way a spork generalizes the metaphor for a utensil.

We take a spork and we combine it with webhooks. That is the webhooks logo. For those of you who are not familiar with that logo, you should memorize it; it is going to become more and more important. Webhooks encapsulate everything we learned about scaling infrastructure in the last 20 years, starting with email. We have email lessons, we have RSS lessons, we have lessons from mobile applications and from large-scale infrastructure. And what we learned is that push is better. Generally speaking, polling is expensive and retarded. I did this at OSCON too, sorry. I will apologize the way I did at OSCON, which is to say I use the term retarded in the mechanical sense: it is slowed down, the throttle is set low. Polling produces a whole bunch of traffic for zero results most of the time. It's like a guaranteed cache miss. With push notification, the notifying instrument knows that something has happened, and that is the only time a notification goes out. Webhooks are based on the idea of using push notification as a basic primitive in systems design. That's what they're for. If you combine this idea of reducing the abstractions, getting to a common interface, with this idea of push-based notification, you end up with the platform for elasticity.

That's a very complicated way of talking about this, so let me come at it with some different metaphors. We'll attack this one statement in a lot of different ways. When I was in Europe, shortly after the OpenStack launch, we were building a large cloud system for modeling earthquakes at a global scale. Earthquake modeling has a problem called spatial correlation.
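Before the earthquake story, here is a minimal sketch of that push primitive in Python. Everything in it is illustrative: the URLs, the event shapes, and the in-process list standing in for a real registration API. It's a sketch of the approach, not of any actual OpenStack interface.

```python
import requests

# Ad hoc subscriptions: (event_type, predicate, callback_url).
# A real service would persist these and expose a registration API.
subscriptions = []

def subscribe(event_type, predicate, callback_url):
    """Register interest in a very specific class of events."""
    subscriptions.append((event_type, predicate, callback_url))

def emit(event_type, event):
    """Called by the infrastructure the moment something happens.
    No polling anywhere: if nothing happens, no traffic is generated."""
    for etype, predicate, url in subscriptions:
        if etype == event_type and predicate(event):
            try:
                requests.post(url, json=event, timeout=2)
            except requests.RequestException:
                pass  # a real system would retry or dead-letter here

# A hypothetical elasticity manager registers: "tell me when any
# instance is over 80% CPU utilization."
subscribe('instance.cpu', lambda e: e['util'] > 0.8,
          'https://elasticity.example.com/hooks/cpu-high')

# And the host, not the guest, reports the spike out of band:
emit('instance.cpu', {'instance_id': 'i-0001', 'util': 0.93})
```

The design point is the asymmetry: the subscriber pays a one-time registration cost, and the notifier only speaks when there is something to say, which is the exact opposite of a guaranteed cache miss.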
Spatial correlation is really a way of figuring out: if there's an earthquake that affects this whole city, and we have a lot of damage on this side of the city, are we going to have less damage on the other side of the city? Are these two spatial relationships correlated, and what variables control that? Earthquake science is chopped into two parts. There's the hazard calculation, which is how much the ground shakes, and the risk calculation, which is how many buildings fall down. Different groups of scientists, different communities, different conferences, different pieces of source code. And the challenge was, in order to do spatial correlation, we had to add features to the hazard calculations that only benefited the folks doing the risk calculations. We had to get those two communities to talk to each other. We were the first people to do it.

We have exactly the same challenge now in OpenStack, and I'll point out some really obvious examples. Let's say you've got an application running and you need to add another web server, because the load has gone too high on one of your web servers and you need to spawn another one. Ideally, the host tells you, tells the elasticity manager, that load is high, because if the load spikes very dramatically, very quickly, the instance itself will die and be unable to report it. If your network is totally saturated, your instance may not be able to tell you that the network is saturated; it has no bandwidth left. So typically we want out-of-band monitoring of our infrastructure. But that basically means OpenStack needs to report these performance characteristics, I/O bandwidth perhaps, to an arbitrary API. And let's say that API is exposed to every orchestration tool: RightScale, Scalr, ServiceMesh, DynamicOps, enStratus, you name it, that tier. We don't have that API, because it's useless to the OpenStack community; we don't consume it. It's not an end-user-facing API. But we need it, and we need to work together with the orchestration community to build it. It's not just about load, but that's an easy example. This tug of war between the folks who need to consume the features and the folks who have to build the features is a challenging thing to overcome, but we have to get there.

I mentioned webhooks earlier; I just wanted to give you the logo and the chance to write down the name in case I was mispronouncing it. Jeff Lindsay was on our team at NASA when we first developed Nova, and this idea of webhooks has always been in the project, but very poorly implemented. There's a notification framework that went into Nova briefly, but we really need this to be part of the entire project from the ground up, preferably in openstack-common, if Mark is in here.

Let me give another example of what you can do with a webhook; we prototyped this at NASA. Let's say you're going to use Swift for enterprise document storage. It's a legitimate use case; we did a fair amount of that. You can use Swift as a backing store for that Microsoft product everybody uses. SharePoint, thank you. You can use it as a blob store for SharePoint. You can use it directly as a FUSE file system, you can just teach people how to use the API, you can use Jungle Disk or whatever else connects to it. And it's pretty trivial to set up Solr and Lucene to index everything that goes into Swift. All of a sudden you have a full document index for your entire enterprise document store. Okay, so far so good.
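A minimal sketch of that indexing step, assuming a stock Solr instance. The URL, core name, and field names are illustrative, and in practice you'd trigger this from Swift middleware or a notification rather than calling it by hand:

```python
import requests

# Illustrative Solr core for the enterprise document store.
SOLR_UPDATE = 'http://solr.example.com:8983/solr/docs/update'

def index_swift_object(container, name, text):
    """Push one Swift object's extracted text into the Solr index."""
    doc = {'id': container + '/' + name, 'body_txt': text}
    resp = requests.post(SOLR_UPDATE, json=[doc],
                         params={'commit': 'true'}, timeout=5)
    resp.raise_for_status()

# e.g. index_swift_object('contracts', 'q3-vendor.docx', extracted_text)
```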
It's actually even more trivial to write a webhook into Solr that notifies you any time a document added to Swift contains a particular word. In other words, rather than searching the document store once a day or once a week or once a month or whenever you happen to want to, you just register a keyword with your document store and say: any time somebody saves a document with the word McKenty in it, I want to know about it. That should be a feature of Swift. In the same sense, any time someone launches an instance from a disk image that I created (maybe I've put together, I don't know, Vision Workbench; I wrote some cool software for processing images and made it accessible to the whole company), I'd love to know when somebody uses it. Callbacks are fun. It's like getting mail. I remember when getting mail was fun, before it was all bills.

In order for this to work, there are a bunch of places where we have to attack this, and I wanted to point out this one in particular, because I think it's a strategy the community at large is working its way toward. The idea of a disk image is a crappy abstraction. I talked a lot at OSCON about how we need new metaphors. We have this idea that the thing that can be run by Nova is only a disk image registered in Glance, and that's somehow different from a snapshot, which needs to be registered separately before you can run it, which is also different from, maybe, a tarball of a disk image sitting in Swift. The end user should be able to say: hey, if it's launchable, I want to launch it. I don't want to think about the rest of that. Take that away. Please hide that from me. Cinder is in a great position to work on this; we talked about it at the Cinder hack day when we were working on the API. This is the strategy: if it's launchable, I want to run it. Please stop giving me configuration options. Software can detect what hypervisor an image was built for. Software can convert that disk image to run on a different hypervisor; it can convert from a raw image to a qcow2 image to whatever. I should not have to think about it.

The reason I have a picture of a cookbook up here is because we don't want to go too far. How many of you cook at home? The rest of you order fast food. And ordering fast food is essentially like launching a disk image. Totally pre-baked. It's got application state perfectly preserved. Ooh, it's got a hard-coded IP address and a hard-coded root password. So it's really only good for one thing: fills me up quick. If you go to the far extreme, you end up as Betty Crocker or Julia Child and you make your own hollandaise sauce. In fact, you go out and cut down the wheat, and then you grind your own flour by hand, and eventually you get this perfect, well, it takes you like six weeks to make breakfast. The goal, the strategy for OpenStack, should be the ten-second gourmet, right?

The other way of thinking about this is that there's a difference between preparation and configuration. If you look at the tools we have in the ecosystem today, we have Puppet, we have Chef, we have Juju; am I missing anyone's configuration management tool of choice? CFEngine? Crowbar, thank you. Crowbar is a little different, and this is where we get into trouble: Crowbar is being used for hosts, not for guests. I'm talking about the guests. What are we running in Nova? And the problem is we have the same set of tools for preparing those images as we do for configuring them at runtime.
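As an aside on that launchable-image point: the detect-and-convert half already exists in the qemu toolchain. A minimal sketch, with illustrative paths; qemu-img really does report an image's format and convert between raw, qcow2, and friends:

```python
import json
import subprocess

def ensure_format(image_path, wanted='qcow2'):
    """Detect an image's on-disk format and convert it if the target
    hypervisor wants something else, so the user never has to care."""
    info = json.loads(subprocess.check_output(
        ['qemu-img', 'info', '--output=json', image_path]))
    if info['format'] == wanted:
        return image_path
    converted = image_path + '.' + wanted
    subprocess.check_call(
        ['qemu-img', 'convert', '-O', wanted, image_path, converted])
    return converted
```

Detecting which hypervisor an image was built for takes more than this, but the point stands: it's software's job, not the user's. Now, back to preparation versus configuration.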
They're two different processes. Getting an application server stood up, with Tomcat on it and a bunch of WAR files downloaded, and all the security patching that has to go into making it something I can run inside my enterprise: that's a different process from the last 10 seconds of boot time, where I really want to make sure it's up to date, make sure it's got the right IP address, and make sure it's connected to the right auth server.

There are a number of people in the community who have been attacking this problem from different angles. Yahoo has blueprints for runtime metadata: how do we actually get a communication channel between the guests and the OpenStack infrastructure that works in both directions and is reasonably secure? We inherited the metadata approach from AWS. It's not secure: it assumes any user who has access to the VM is an administrative user, because it's an unsecured HTTP GET. And it's not read-write, so there's no way to use it to pass information back to OpenStack. But it's got promise. It is HTTP. It is in the right place. It does address the question of how we communicate between guest and host. And we have to be able to do that in a way that's agnostic to the hypervisor we're using. This idea of gourmet I'm going to come back to in a minute.

I can see it in your eyes: everyone's thinking this is a solved problem. They're like, oh, we do that, we have an agent. The problem with a secret agent is they always turn on you. Have you ever seen a 007 movie where somebody doesn't get turned? It happens in every film, right? An agent is a rootkit. My co-founder Christopher MacGown is famous for having said that at the Austin Summit, the first OpenStack Summit, the first technical session; the first thing out of his mouth was that agents are rootkits. And it's true. So if we're going to have rootkits in our cloud, let's be really, really careful about how many of them there are, which vendors they come from, and what parts of the infrastructure they really sit in. So we come back to this: how do we solve this push-pull, this tug-of-war, between the infrastructure and the orchestration layer? We trust the infrastructure in a way that's probably appropriate. There are bare-metal attacks and other things we should be worried about, and we continue to work on them. But we should trust the infrastructure to tell us the state of our guests, in ways that agents are perhaps not the best way to do. Now, we're never going to get to an agentless world. I just want you to think about that. Every time you install an agent in your guest, think of it as 007, right? What are the chances that at some point he will kill you?

On that security front, by the way, DreamHost, if there's anyone from DreamHost in the room, has been working on the other end of that metadata communication challenge. The read-write metadata blueprints are the Yahoo side; the DreamHost side is how you actually create a secure channel at the launch of an instance. And I know Canonical solved this ages ago: it's called a CD, it's called ISO 9660, and at least we have ACLs on that inside the operating system. We can use that, and that blueprint's been done; config drive has landed. But as a strategy, as a community, we haven't looked at how these pieces fit together to solve this problem.

Actors in cloud computing: somebody has to do the job of specifying who the various actors are. We have what we call the five-actor model at Piston, and I just wanted to bring it up.
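Backing up to that metadata channel for a second, here's roughly what the guest side looks like with the pieces we have today: prefer the config drive, since the operating system can put ACLs on a mounted ISO 9660 volume, and fall back to the EC2-style HTTP endpoint. A minimal sketch; the mount point is illustrative, and note that both sources are read-only, which is exactly the limitation being discussed:

```python
import json
import urllib.request

def read_metadata(mount='/mnt/config-2'):
    """Guest-side read of instance metadata: config drive first,
    EC2-style metadata service as the insecure fallback."""
    try:
        # Config drive: a read-only volume the OS can restrict to root.
        with open(mount + '/openstack/latest/meta_data.json') as f:
            return json.load(f)
    except OSError:
        # Metadata service: unauthenticated HTTP GET on the link-local
        # address, readable by any user on the VM.
        url = 'http://169.254.169.254/openstack/latest/meta_data.json'
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.loads(resp.read())
```

Back to the actors.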
Any time we're talking about who's using what piece of infrastructure, we need to remember that the end user of the application is not necessarily the user of the API. We have customers where the folks who are allowed to launch VMs are not allowed to log into them. That's a very legitimate, very common use case. The auditor of any cloud environment has a set of privileges they have to have, about knowing where VMs are running. They're not allowed to log into them. They're probably not allowed to launch them. They're certainly not allowed to read the contents of the disk images. But they need a lot of lower-level access. So we have end users, we have users, we have auditors, and we have operators: the folks who actually create and provision tenants. Operators may or may not have all sorts of privileges on that infrastructure, but most folks operating community clouds (certainly financial services, certainly genomics) would like to see the operators not have access to the guests, if possible. And it is possible. And finally we have vendors. We're a vendor; we provide security patching as a service for folks running private clouds. We have a very complicated Chinese-wall relationship, where customers only see those patches once they've said: yes, we'll accept patches from the bastion hosts. But again, some of these things are being built by vendors when they really should be in OpenStack core. Bastion hosts being one: we did this with CloudPipe, which we submitted to OpenStack, and which, yes, is a horrible implementation. It's still the right idea: administratively controlled VMs as a first-class citizen in your infrastructure. Pretty damn useful.

I want to emphasize here, and I emphasized this at OSCON and want to come back to it, that part of the underlying philosophy of the OpenStack community is a posture of humility. We have not got a single thing right to start with. Not a single thing, right? But what we do is iterate toward good through a series of smaller and smaller mistakes. It is worth pointing out that handguns, tequila, and OpenStack are all, roughly speaking, from Texas.

Platform as a service and infrastructure as a service are first attempts, first approximations, at encapsulating a couple of key ideas about how you would work with infrastructure in a future where you can provision it for yourself, where there are no middlemen; where we've replaced the operators with very small shell scripts. And they are good ideas, but they are insufficient. They are too siloed. They are too separate. This is like when we started with storage and compute, and storage was on a giant tape that we rolled in and out of the other room and connected to our computer. It took us 30 years to get storage all the way down onto the CPU. That's the logical convergence: get the storage and the compute as close together as possible. We have to go through the same convergence now between platform as a service and infrastructure as a service.

We've been working with the folks at VMware to get Cloud Foundry running on OpenStack. Partly because people want both; it's a really common use case: hey, we want app dev, we want infrastructure, and we want platform. And partly because people need both: there is no solution I've ever seen that can be done 100% in platform as a service, and yet I've never seen an infrastructure-as-a-service deployment that wouldn't be better with a platform available.
They're incredibly complementary, but we've got to get the wall down between them. Platform can't get any better without some support from the infrastructure.

So, the ask. There's always an ask; there's always a hook. The ask is for the right people to help. My approach to strategy in OpenStack over the last two years has been to write very, very little code (it always ends up getting rewritten) and to constantly find the right partners to drag into the community. For most of the folks I have badgered into being involved, it took six to twelve months. It takes a while for people to understand what's in it for them when I show up saying: I want you to put at least ten people working full time, for free, on this thing that, by the way, I'm going to sell. But trust me, it's a good idea. We've got a lot of the right partners now. Some of them are engaged; some of them are not deeply enough engaged. So this is a shout-out to the orchestration community specifically: the Chef guys, the Puppet guys, the RightScales, the enStratuses, the ServiceMeshes, the DynamicOps, the Scalrs. Please stop fighting over the host. If we get yet another version of yet another set of recipes for deploying OpenStack on bare metal, okay, you're not contributing value. The guest is lonely out here. Let's focus on the guest. And by the way, there are like 100 times more guests than hosts. I don't know if you did the math, but there are way more of them. So we need this community to work together. We need the orchestration folks to tell us what the API should look like. We know they need one.

And we really need federated auth. Keystone is a great framework; it is not the tool that we need. I know Tim has been talking about what CERN's done with Active Directory, and there are other folks presenting on SAML today. I have been pestering Microsoft and Centrify and Okta and others to get involved, because they have the expertise in auth that OpenStack needs. If we try to reinvent single sign-on, we will shoot ourselves in the foot so badly. It has been done. Those folks are willing to come and help. We need to help them help us.

All right, so again, the strategy in the strategy session is to find the right people who can be press-ganged into building this thing, and then try to communicate not the details of what needs to be built, but the vision, the problem we're trying to solve. The other thing I'd like to point out, again, is that we always make mistakes. I've been wrong in the specifics about every single partner. Take the Quantum project, for instance: a ton of folks are involved, and it's fantastic. When we started, I had no idea (in fact, I don't think anyone else had any idea) that Nicira was going to take such a leadership position. It didn't seem obvious at the time. It seems obvious in retrospect, but it did not seem obvious at the time. My money was on someone else. I was happy to see Nicira involved, but they were not the folks I was really pestering, like, no, you've got to get involved in this. I was wrong about Cinder as well. I did not expect SolidFire to step up and say: no, we've got this, we're going to put the time in, we're going to do the work. So I totally expect to be wrong about who's going to lead the charge in orchestration. And I'm saying this: if you have a play in that space and you want to dive in, please get deeply involved. And for those of you who I know are involved, you could bring a little more to the table.
Okay, so the summary here is webhooks. I didn't really talk about why, but I think I sort of talked about why: if we don't have push-based infrastructure, if the infrastructure is not responsible for telling everyone else what's going on in the infrastructure, polling will kill us. We can't scale it and we can't secure it. Scheduling as a first-class concept. The guest as a first-class concept. We can't build this one hypervisor at a time. We can't build message channels around some hypervisor-specific way of passing messages. We can't build this around a deployment-specific method of scheduling resources or launching more resources. These need to be very general ideas. Data locality I threw in there just because I'm addicted to it: the placement of data, and the relationship between storage and compute, is central to what we're trying to do with OpenStack, and it's still such a second-class citizen. It's a really hard problem, and it would be fun to get some folks who really understand data locality to help out. Let's not do this in Cinder, let's not do this in Swift, let's not do this, weirdly enough, in Nova. Let's do this in openstack-common, so we have one general approach to these problems. And then we launch all the things.

This is going to take a while. Strategy means I'm usually off by about five years in my time estimates. I thought we would be done with OpenStack by now. When we launched OpenStack, Nova was like 6,000 lines of code; I think the whole code base is now one and a quarter million, right? So this will take a while, but I think this is the right direction.

The thing about webhooks that's so powerful conceptually: I'm not advocating a particular protocol, I'm advocating the approach. The approach is that the subscription is ad hoc. In other words, you've got an API where you can say: hey, I want to hear about this and only this. I get to register ahead of time for a very specific class of events. Think about elasticity management: I only want to know when an instance is over 80% CPU utilization. I can specify that in the elasticity manager, let's say Scalr. I can say: just tell me when any instance in the infrastructure is over 80% utilization, or tell me when any network segment in Quantum is over 20% packet loss, or whatever the event is. We don't have to be prescriptive ahead of time in defining those; we can have a very broad spectrum of the kinds of events the infrastructure can generate. Honestly, getting collaboration to happen between the orchestration folks and the infrastructure folks is going to be the hardest thing we do. So if we get this as generic as possible the first time through, and we let both sides build on it for as many iterations as it takes to get to something great, everything later will be easier.

[Audience question.] Sure, so the question's about SAML. I wasn't endorsing it; I was just mentioning there's a talk about it scheduled today, so somebody's done a SAML implementation for Keystone. We worked on it at NASA because it is a U.S. federal government requirement. It is a very heavyweight protocol. Lew Tucker and a bunch of other folks and I had dinner recently and argued about the fact that nobody has done better than X.509, and we've been trying to make better single sign-on, at the protocol level and at the concept level, for 20 years.
SAML, like most of these solutions, is adjacent to the problem; they haven't attacked it squarely. So let's just work with LDAP. It's the best thing we've got; we'll just make it work. I mean, how many of you have dealt with Active Directory in your life? Thank you very much. I know it's right after lunch, so I'm glad you stayed awake.