OK, we're on. We're live. So how's everyone doing? Good. Thanks for being here. My name's Joe D'Andrea. I'm from the AT&T Cloud Services and Technology Research Lab, and I'm also an OpenStack contributor. I'd like to recognize my colleagues on the slide there: Bharath, Matti, Kaustubh, who's here in the audience, and Gueyoung. It's because of their efforts that I'm here to share our story with you. I also want to give a shout-out to Arez from our Tel Aviv team, who came up with the name Valet.

So when we submitted this presentation and it got accepted, we were really happy. And we had no idea until we got here just how much the community was going to invest in advertising it. We're so humbled by this kind of attention. It's really special, thank you.

But seriously, Project Valet is a lot like valet parking at a hotel or restaurant. You bring in your car, the attendant sees how big or small it is, how expensive it is. They might ask when you'll be back to get it. And they figure out exactly where to park it. That's kind of what this Valet does, except it doesn't just park one car at a time. It parks your entire fleet in one shot.

Valet is being developed in collaboration with multiple groups at AT&T, which you see on the screen here. And we're also deploying it as part of AIC, the AT&T Integrated Cloud. The purpose of this talk is to discuss the problems we saw that led us to build Valet, and our approach to addressing them. Perhaps others out there have run into similar issues, and maybe the concepts I'm about to share could prove beneficial to the broader community as well. Of course, we'd love to contribute to that. So before we begin, let's take a trip down memory lane and talk about resource placement.
And when we talk about resources, for the most part we're talking about figuring out where to place, or schedule, things like VMs and volumes. Which hosts do we pick in our cloud for those?

So in the beginning there was Nova, and Nova was good. Boot an image with a flavor and a NIC, and you're good to go. That's our compute service. And the constraints for those boots were pretty simple: how much memory you wanted, how many vCPUs, whether you wanted the VMs next to each other or far apart. Then came Cinder, out of Nova. That's our block storage service. Same idea: create a volume, give it a size. The constraints are still pretty basic. What volume size? What back end do you want to use? And again, affinity and anti-affinity.

Now, that's all well and good, but if you have more complex needs, Nova and Cinder provide scheduler filters, and by using a combination of these filters you can achieve various placement requirements. The way filters work, each filter performs one constraint check, more or less, and a host has to pass all the enabled filters in order to be considered. Then all the candidate hosts are weighed and sorted, and the scheduler picks the top one. It's a bit like the Sorting Hat in Harry Potter. But that's basically how it works. There are 34 of these filters in Nova today, as of Mitaka. The ones in orange are the ones enabled when you install OpenStack out of the box.

I'll give you two examples. On the right side you'll see the server group anti-affinity filter, which lets you place a set of VMs on different hosts. And the availability zone filter on the left lets you schedule VMs to independent failure domains. That's nice. That's great. Now I have a question: what if you wanted to place a VM and a Cinder volume together? Can you do it? Well, it just so happens Cinder has something called an instance locality filter.
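The filter-and-weigh flow is easy to picture in code. Here's a minimal sketch in Python; the `Host` class, filter functions, and weigher are illustrative stand-ins, not Nova's actual implementation.

```python
# Minimal sketch of Nova-style filter scheduling: a host must pass every
# enabled filter, then the survivors are weighed and sorted.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_ram_mb: int
    free_vcpus: int

def ram_filter(host, spec):
    return host.free_ram_mb >= spec["ram_mb"]

def vcpu_filter(host, spec):
    return host.free_vcpus >= spec["vcpus"]

def schedule(hosts, spec, filters, weigher):
    # Every filter must pass for a host to remain a candidate.
    candidates = [h for h in hosts if all(f(h, spec) for f in filters)]
    # Candidates are weighed and sorted; the scheduler takes the top one.
    return sorted(candidates, key=weigher, reverse=True)

hosts = [Host("h1", 4096, 4), Host("h2", 8192, 2), Host("h3", 16384, 8)]
spec = {"ram_mb": 2048, "vcpus": 4}
ranked = schedule(hosts, spec, [ram_filter, vcpu_filter],
                  weigher=lambda h: h.free_ram_mb)
print([h.name for h in ranked])  # h2 is filtered out: not enough vCPUs
```

Note that each filter is a simple pass/fail check; the relative ranking all comes from the weigher at the end.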
You just have to make sure the Nova extended server attributes are not disabled, and that either you or Cinder has the proper privileges. So let's say we have that. Great. But what if the VM hasn't been created yet? Sorry, you have to make sure the VM is there first. OK, fair. Well, the VM is there, but what if I did that and now there's not enough disk space? You want everything, don't you?

And while we're thinking about all this, how about some other problems? These are problems we ran into. We want to find a host with enough capacity to deploy a group of VMs that need to be together, or place two or more VMs so that there's one gigabit per second of bandwidth between each one. See, it turns out that when the locations of resources depend on each other, like in this case, where we have VMs that need a certain amount of bandwidth between them, scheduler filters don't handle things as well as they could. And as you can imagine, being a networking company that's busy transforming into a software company, this ends up being really critical for our apps.

Here's another example: deploy a bandwidth-intensive VNF such that it minimizes the use of oversubscribed spine switches. That's a mouthful. In this case, not only do you have to optimize placement by looking at bandwidth requirements between the VMs of the VNF, you also want to avoid putting those VMs across any oversubscribed spine switches.

I've got one more: replicate a service chain of related VMs on different racks for fault tolerance. Now, it's true that a cloud administrator can define availability zones, or AZs, at the rack level. They can do that. But it turns out that's not the best solution, because the app would then have to specifically pick which racks, which reduces scheduler flexibility. And from the app's point of view, we don't really care which rack each chain goes in. All we care about is that the chains are in different racks that also meet our app requirements. We're not making this up.
So here's an example of a real app we've deployed. This is Ceph. Here it's an erasure-coded storage service, and it has some interesting requirements. The VMs are in green, the volumes are in orange at the bottom, and each VM-volume pair needs to be co-located. Fair enough. But then, because this is a K-out-of-N erasure-coded store, at least K of these servers have to be in different failure domains. Plus, there are large bandwidth requirements between the VMs. And it's not just Ceph. At AT&T, we have other VNFs with similar kinds of requirements.

So as we've seen, you can certainly influence placement with the help of scheduler hints, but they only go so far, because VMs and volumes are ultimately scheduled by Nova and Cinder one at a time. What does this mean for our cloud apps? Well, cloud apps can have a lot of VMs and volumes, and scheduling all of those resources and managing their dependencies can become expensive, error-prone, and brittle. Unless you're a DevOps ninja, which you might be, it becomes challenging. And over time, you run the risk of resource fragmentation. You end up with all these pockets of unused infrastructure sitting around idle, or too small to do anything useful with. And that makes us sad.

So what to do? Well, anybody use Heat? Heat's got to solve this, right? Why not? Heat is the orchestration component of OpenStack. It lets you deploy and update cloud applications holistically using templates, so it's declarative. That's great. I'm a bit biased, because I contribute to Heat when I can, and it's very near and dear to me. You should check it out if you haven't tried it yet. Yay, Heat. And this is a Heat template example, a very contrived one. The way Heat works, instead of creating servers and volumes and so on one at a time, you can build a holistic picture of your whole cloud app. And you do that by declaring all the resources in a template like this one.
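For reference, here is a minimal HOT template along those lines: a server, a volume, and an attachment wired together with get_resource. The image, flavor, and network names are placeholders.

```yaml
heat_template_version: 2015-10-15

resources:
  my_server:
    type: OS::Nova::Server
    properties:
      image: cirros          # placeholder image
      flavor: m1.small       # placeholder flavor
      networks: [{network: private}]

  my_volume:
    type: OS::Cinder::Volume
    properties:
      size: 10               # GB

  my_attachment:
    type: OS::Cinder::VolumeAttachment
    properties:
      instance_uuid: {get_resource: my_server}
      volume_id: {get_resource: my_volume}
```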
And in this case, we have a server at the top, which Nova handles, a volume in the middle, and at the bottom a volume attachment. You can see the get_resource intrinsics there that reference the server and the volume and say to attach them to each other. So: here's what my app should look like, Heat. Go build this. Great, right? It is, except Heat is constrained too. It's still constrained by Nova and Cinder's ability to schedule VMs and volumes, because it relies on those other services. So even though Heat has a holistic view of everything, each service still schedules one thing at a time.

Eventually we went around in circles and got to the point where we said there's got to be a better way. And that's how Valet was born. Think of Valet as a Heat-level scheduler. It works at the application level instead of the resource level. If Valet had a mission statement, we think this is what it would be. Anything Valet does should meet this litmus test of helping to meet cloud app requirements while optimizing the resource usage of a cloud's infrastructure. I know that might look a little sales-pitchy, but it's really an important statement, because it conveys those three highlighted points in a single sentence. And notice: application, not VM, not volume. We want to optimize the whole app against the infrastructure.

Back to our Ceph example again. The gray circles in the middle are hosts, the green and blue are VMs, the orange are volumes. If you were to leave OpenStack to its own devices, this is what you're going to end up with. What we really want is this, where each VM and volume is paired, plus each pair is on its own host, which is exactly what we need for Ceph. So how do you use it? Well, you use Heat, of course, because Heat's awesome. And what we've done is add some new resource types to Heat. You saw some of the existing ones before, for a Nova server, a Cinder volume, and a Cinder volume attachment. And we've added a few more.
And we use these types to express our application-level constraints. This is a pipe. You use pipes to state bandwidth and IOPS requirements, and Valet will use that info to place the resources in a way that minimizes cross-communication cost. So VMs with a high-bandwidth pipe between them might get placed in the same host or rack to avoid oversubscribed spine switches. Remember that slide? That's what we're talking about here. Runtime enforcement is delegated to QoS, so you could use Neutron QoS. At AT&T, we have an open source project for that called Tegu. And for IOPS, we use another one called IO Arbiter, which we hope will become open source as well.

So here's how you declare it. In this case, we want to effectively reserve 5 megabits per second of bandwidth between two VMs, a Ceph monitor and its client VM. And if it were a VM and a volume, you could change bandwidth to IOPS and Valet would do the right thing. Notice that nowhere here are we mentioning any notion of affinity at all. Affinity may be a consequence of our requirements, but the idea is that we want Valet to figure that out.

Here's another one. This is our multi-tasker. We call this a resource group. There may be times when you want to explicitly have resources close together for performance, or far apart for fault tolerance. Nova already has the concepts of AZs, host aggregates, and server groups, which can be used for affinity and anti-affinity. Valet provides a more general app-level grouping as an alternative, and it can do this because Valet can see the whole app. I'll show you what I mean. First up is affinity groups. Our resource affinity groups are really app-defined AZs. And you're not limited to co-locating on a host, either. You can co-locate on the same rack or the same cluster if you want to. So here's how it works. It looks very similar to the pipe.
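Sketched as template snippets, the pipe from a moment ago and this affinity group might look roughly like the following. The `ATT::Valet::Pipe` and `ATT::Valet::GroupAssignment` type names and their properties are assumptions for illustration, not a confirmed schema.

```yaml
resources:
  # Reserve roughly 5 Mbit/s between the Ceph monitor and its client VM.
  monitor_client_pipe:
    type: ATT::Valet::Pipe              # illustrative type name
    properties:
      left: {get_resource: ceph_monitor_vm}
      right: {get_resource: ceph_client_vm}
      bandwidth: 5                      # Mbit/s; a VM-volume pair would use IOPS

  # Co-locate the Ceph monitor VM and its volume on the same host.
  monitor_affinity:
    type: ATT::Valet::GroupAssignment   # illustrative type name
    properties:
      group_type: affinity
      level: host                       # could also be rack or cluster
      resources:
        - {get_resource: ceph_monitor_vm}
        - {get_resource: ceph_monitor_volume}
```

A diversity group would presumably swap `group_type`, with members that can themselves be affinity groups, and an exclusivity group would add a name and member tenants.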
We have our resources at the bottom, a Ceph monitor VM and its volume, and we want to, say, place them at the affinity level of the same host.

Sometimes an app wants to distribute VMs and volumes across multiple domains, for instance to have failure independence across racks. And notice, same as before, you're not limited to hosts. You can distribute across racks or clusters. Plus, you can nest your groups. Here's the example. In this case, we're back to Ceph. Remember, we placed our VMs and volumes on the same host; those are affinities. Now we have a diversity at the rack level, and its members are not VMs or volumes. They're Ceph monitor affinity groups. So we have a diversity of affinities, if you will. We're saying: place these three Ceph monitor affinity groups on different racks.

And we have an exclusivity group as well. Sometimes an app requires exclusive placement, such as core infrastructure, where only VMs from tenants that are members of a group are supposed to be placed on a given host. The way that works is you set up a group with a name and put members in it; tenants can be members. Hosts get marked for use by a group's members in just-in-time fashion. So when a request comes in and there's an available host, Valet takes that host, colors it, and only those group members are allowed to put stuff on it. Again, same idea. Now we have a name property, and we're saying only tenants that are members of that group are allowed to use it. Otherwise, Valet will honk and won't let you do that.

And that's how you use it. Now, how does it work? Glad you asked. It takes five simple steps. Well, one of them is actually kind of hard. It's NP-hard, to be honest. Valet has four main components, actually five, but we'll go through four of them now. There's a front-end command line interface and a REST API, of course. And then there's the piece where all the magic happens.
This is our app placement optimizer component, which we've code-named Ostro. Everything else is done through plugins. We interface with Heat through something called a stack lifecycle plugin, which I'll explain in a minute. And on the Nova and Cinder side, we interface with their scheduler filters; we have plugins there, too. Our motto from the beginning has been to be as minimally invasive as possible with this trial balloon, and plugins are a great way to do that. Thanks, plugins.

So, step one: discovering infrastructure. The optimizer collects baseline information, including the physical DC topology. What do I mean by that? Baseline information could be info about what's already been allocated, the remaining infrastructure, details about AZs, host aggregates, all that stuff. And the DC topology, the topology of the data center, covers your host, rack, and storage arrangement, plus all the network connectivity between everything.

Step two: we get Heat involved. The first thing we do is make an app template, and we put the Valet resources in there to describe our app requirements. We give it to Heat and say, go build this. Now, normally Heat would ask its resource plugins to call out to Nova and Cinder. That's the green square in the middle there. But before it gets to do that, here's where the stack lifecycle plugin comes in. We intercept the request and handle it. We say: Heat, hold on. We call Valet and ask it to create what we call an app plan, which is where all the placements will go.

So let me explain the Heat stack lifecycle. Basically, every time you create a stack, there's a set of lifecycle phases it goes through, and it turns out you can intercept before or after any of those points in the lifecycle and do stuff. That's what we do. Before the create phase happens, Heat invokes the lifecycle plugin.
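As a sketch, a lifecycle plugin of this shape might look like the following. The base-class stub mirrors the general interface of Heat's lifecycle plugins (pre/post operation hooks plus an ordering hint) so the example is self-contained, and the Valet client is a stand-in; none of this is Valet's actual code.

```python
# Sketch of a Heat stack lifecycle plugin in the spirit of Valet's.
# The stub below stands in for Heat's real base class
# (heat.engine.lifecycle_plugin.LifecyclePlugin).

class LifecyclePlugin:
    def do_pre_op(self, cnxt, stack, current_stack=None, action=None):
        pass
    def do_post_op(self, cnxt, stack, current_stack=None, action=None,
                   is_stack_failure=False):
        pass
    def get_ordinal(self):
        return 100

class ValetLifecyclePlugin(LifecyclePlugin):
    """Intercepts stack CREATE before Heat calls out to Nova/Cinder."""

    def __init__(self, valet_client):
        self.valet = valet_client

    def do_pre_op(self, cnxt, stack, current_stack=None, action=None):
        if action != "CREATE":
            return
        # Sanity-check the template, then ask Valet to compute an app plan
        # and hold the placements in escrow, keyed by orchestration ID.
        self.valet.create_plan(stack)

    def get_ordinal(self):
        # Ordering hint among registered lifecycle plugins.
        return 100

class FakeValet:                      # stand-in for the real Valet client
    def __init__(self):
        self.plans = []
    def create_plan(self, stack):
        self.plans.append(stack)

valet = FakeValet()
plugin = ValetLifecyclePlugin(valet)
plugin.do_pre_op(None, "my_stack", action="CREATE")
plugin.do_pre_op(None, "other_stack", action="UPDATE")
print(valet.plans)  # only the CREATE was sent to Valet
```

The key design point is that the plugin runs before Heat's resource plugins fire, so Valet decides placements while the resources still exist only as template entries.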
Our plugin says: oh, I want to do something here. We sanity-check the request, and then we call the optimizer.

So what is the optimizer? Remember that NP-hard step? This is it. Ostro is a mashup of an A* best-first graph search and a greedy algorithm, and the goal is to find a placement that maximizes host and network utilization. We're not going to go into the details here, but there's a link, which you can get later from the video, to a comprehensive evaluation that was presented at the IEEE International Conference on Distributed Computing Systems. That's the paper. I encourage you to read it if you want to see how this works.

Back to the diagram, step three. Now we have our app placements, and we ask Valet to hold them in escrow for now. Each placement decision is tied to an orchestration ID. An orchestration ID, what's that? Glad you asked. Orchestration IDs are unique IDs assigned by Heat to each VM and volume before it's created. This turns out to be very helpful for us, because the resources haven't been made yet, but we need some way to refer to them uniquely, and this is a very convenient way. It's the same type of ID you see when you use Cinder and Nova to schedule resources. Once you enable the setting, Heat sends these IDs to Nova and Cinder using scheduler hints, so when the VM is created, the orchestration ID travels with it. You don't put that in your template; Heat does it for you.

Now, step four: Heat starts building the stack. Heat puts the resource plugins we saw before to work, for Nova, Cinder, and whatnot, and they call out to the services. Normally this is where all the scheduler filters kick in, and they still can. Valet has filters there, too. The Valet filters give Valet the orchestration ID and ask: where does this go? And Valet looks it up.
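To give a flavor of the placement problem, here's a toy greedy placer keyed by orchestration ID. This is a cartoon for illustration only, under invented names; Ostro's real search couples A* best-first exploration with greedy heuristics over hosts and network topology.

```python
# Toy greedy placement: for each VM, pick the feasible host with the most
# remaining capacity, honoring a simple pairwise-diversity constraint.

def place(vms, hosts, diverse=frozenset()):
    """vms: {orch_id: ram_mb}; hosts: {name: free_ram_mb};
    diverse: orch_ids that must land on pairwise different hosts."""
    placements, free = {}, dict(hosts)
    for orch_id, ram in sorted(vms.items(), key=lambda kv: -kv[1]):
        # Hosts already used by other members of the diversity set are banned.
        banned = {h for o, h in placements.items()
                  if orch_id in diverse and o in diverse}
        candidates = [h for h, cap in free.items()
                      if cap >= ram and h not in banned]
        if not candidates:
            return None  # a real solver would backtrack (the A* part)
        best = max(candidates, key=lambda h: free[h])
        placements[orch_id] = best
        free[best] -= ram
    return placements

plan = place({"vm-a": 2048, "vm-b": 2048, "vm-c": 1024},
             {"host1": 8192, "host2": 4096},
             diverse={"vm-a", "vm-b"})
print(plan)  # vm-a and vm-b land on different hosts
```

Even this cartoon shows why whole-app placement beats one-at-a-time scheduling: the placer sees all the VMs and their constraints before committing any of them.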
Now you say: wait a minute, I used Cinder on the command line, I don't have an orchestration ID. That's OK. We'll still place it on the fly for you.

And in step five, now we've got things placed and everything's happy, but we still need to monitor things. So we're always watching the Oslo message bus. We're aware of what's going on in networking, compute, and storage, and the cloud infrastructure view that Valet holds is kept up to date.

Now, this is the fifth component I mentioned before. At AT&T, we also care a lot about high availability, so for AIC we're using a persistent state store that we built, which we call Music. Basically, it combines Apache Cassandra and ZooKeeper in a way that gives us the HA we need. So, a brief look at our implementation. This is basically what it looks like. We have active and passive instances of Valet, a triply replicated persistence service, that's Music, a fault-tolerant service to handle failover, and a load balancer to round it out.

So, we've covered a lot. What's next? Our initial deployment to the AT&T Integrated Cloud is in progress as we speak, and our intention is to release what we have as open source once that's deployed. But more than that, we aspire, someday, to join the Big Tent, hopefully as a new project, working with the community this time. Even if it means starting over, that's OK. That's really important, because we want to do well by the community, and as you know, tossing code over the fence isn't collaboration. We hope that by telling this story and sharing our journey, showing what our trial balloon looks like, we can start a conversation that leads to open collaboration. We'd love to hear your feedback and your story. Is Valet something you'd like to see in OpenStack? Do you think it's a good fit? Would you like to be part of it? Let's talk out in the open. Thank you. This is Kaustubh with the mic.
And we will take questions if anybody has any. I see people walking to the mics. Here we go. Come on up.

Thanks. Very interesting. A small question on your step one, which was the discovery. How can you actually discover the layout of your data center, the switches, the network fabric, and so on? So in the AIC deployment, we're planning to use an AT&T tool called Formation, which was talked about in some of the other AT&T talks, and which documents the exact topology of our data centers. That will then be ingested into this tool as the starting point. Most deployments will have some kind of documentation of the data center topology, and for what Valet needs to do, having a static map is good enough for now. OK, thanks. You're welcome.

Could you talk a little more about your HA solution, what types of failures it can detect, and how it can recover? Sure. Particularly about fencing? Fencing, OK. So it's basically a combination of Cassandra for persistence and ZooKeeper for coordination. Right now, Valet runs in an active-passive manner. There's an active instance that actually does the optimization and spits out the scheduling decisions, and it uses Cassandra for the state store. Cassandra is triply replicated, because we don't want to get into any split-brain situations. And in order to communicate between the two components of Valet, the front end and the optimizer, we also use Cassandra as a simple queue service. In addition to that, there is a high availability service, built using ZooKeeper, that keeps track of who the active node is at any given point in time. Now, why use ZooKeeper? So that we don't have to worry about things like what happens when there's a network partition, who's the primary, who's the standby. What ZooKeeper gives us is a very reliable leader election service, and that's used to figure out who the active Valet node is.
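The leader-election recipe is simple to illustrate. In ZooKeeper, each contender creates an ephemeral sequential znode and the lowest sequence number is the leader; the toy below simulates that rule in plain Python. A real deployment would use ZooKeeper itself via a client library, not this simulation.

```python
# Pure-Python illustration of the ZooKeeper leader-election recipe:
# each node gets an ephemeral sequential "znode", and the node holding
# the lowest sequence number is the leader.

class ElectionSim:
    def __init__(self):
        self.seq = 0
        self.members = {}   # znode sequence number -> node name

    def join(self, name):
        self.seq += 1
        self.members[self.seq] = name
        return self.seq

    def leave(self, seq):
        # An ephemeral znode vanishes when its session dies.
        self.members.pop(seq, None)

    def leader(self):
        return self.members[min(self.members)] if self.members else None

zk = ElectionSim()
active = zk.join("valet-1")
zk.join("valet-2")
assert zk.leader() == "valet-1"
zk.leave(active)        # the active node fails; the standby takes over
assert zk.leader() == "valet-2"
```

Because membership is ordered by sequence number, failover is deterministic: when the active node's session dies, the next-lowest survivor becomes leader without any negotiation.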
And that's all really Valet needs to know, which one is active, which one is not. And that's sort of the implementation as of now. Thanks for your questions. Are there any other questions? No? OK, well, I want to thank you all for being here. I want to point out we have some other talks from folks from AT&T. And I'll leave them on the screen here so you can check those out if you're interested. And feel free to come up and say hi. Thanks again. Thank you.