G'day. My name is Sam. I'm from the University of Melbourne in Australia. I've got two other people with me: Matt here from Rackspace and Belmiro here from CERN. We're here to talk to you about cells, which is a concept in Nova. This talk really came out of one of the operator sessions, when someone said, look, Rackspace is the only one using this, so we're here to give our use cases of cells. For the uninitiated, cells are a way of segregating or sharding your Nova installation into chunks of maybe a couple of hundred compute nodes each, which makes scaling a bit easier.

So, how we use cells in NeCTAR. NeCTAR is an Australian federal government funded project started in 2011. We have eight institutions around the country. We've been running OpenStack in production since early 2012, starting with Diablo. When we upgraded to Essex, we moved to a cells environment. This was before cells was actually in the code base, so we took a bit of a risk there: we worked with some Rackspace developers, found a random GitHub branch somewhere, merged it in, and away we went. Thankfully that code got merged into trunk, and now cells is actually in OpenStack. The idea of the NeCTAR research cloud is to get the compute near the data and near the research tools, while federating it so that, from a user's point of view, there's one single cloud with an easy interface to get into. We've currently got just over 5,000 users, which is getting up there.

We've got eight institutions around Australia, and Australia is quite a big area. The pink dots on the map are our cells, and the other coloured dots are some of the larger HPC and research facilities around the country. We use cells to federate all these research sites. They all have different hardware, different people and different administrative domains, so it is a challenge in that sense, and the main aim is making it easy for the users: there's one interface, one dashboard they can log in to, and they can launch across all the sites around the country. We run a parent cell, along with most of the other OpenStack infrastructure - Cinder, Glance, Ceilometer, Heat - centrally, and then each of the sites is really just tasked with running compute nodes and the per-cell Nova infrastructure, the cell service and scheduler. That means the core OpenStack knowledge sits at one site and the other sites don't need as much OpenStack expertise, so it's been a fairly low operational cost in that sense. Each site tends to run one cell, but some of them have multiple data centres, so they end up with almost a three-tier hierarchy: two data centres, two cells. In terms of how we present cells, we treat them as a behind-the-scenes, operational deployment mechanism, and what a user sees is availability zones. These roughly match up to cells, but they don't necessarily have to (there's a small illustrative example of that user view just below).

So how big are we in terms of scale and architecture? We're the smallest - we're going from smallest to biggest, so you'll get bigger numbers later. We have six sites in production so far, with about four and a half thousand instances running. In the end I think we'll be up to about 40,000 cores and 1,000 hypervisors spread across the sites.
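To make that user-facing view concrete, here is a minimal sketch, using the python-novaclient auth style of that era, of what "one endpoint, availability zones instead of cells" looks like from the user's side. The endpoint, credentials and zone name are placeholders rather than NeCTAR's real ones, and the zone-to-cell mapping is the rough correspondence described above, not anything Nova enforces.

```python
from novaclient import client

# One public endpoint for the whole federation (placeholder URL and credentials;
# older positional-argument novaclient auth is assumed here).
nova = client.Client("2", "demo-user", "demo-pass", "demo-project",
                     "https://keystone.example.org:5000/v2.0")

# The user never sees cells; they see availability zones, which the operators
# have arranged to roughly line up with the cell at each site.
for az in nova.availability_zones.list():
    print(az.zoneName)

# Launching "near the data" is then just a matter of picking a zone.
nova.servers.create(name="analysis-node",
                    image="<image-uuid>",
                    flavor="<flavor-id>",
                    availability_zone="melbourne-qh2")  # hypothetical zone name
```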
We tend to have around 100 to 200 hypervisors per compute cell; that seems to be a nice number in terms of splitting up the RabbitMQ and database connections, and we should have about 10 compute cells in total by the end of this year.

Now for some of the pain points. Cells are still deemed an experimental feature. One issue is scheduling: if you launch an instance and it goes to one cell, and for some reason that cell is broken, it's not going to try a different cell - the request is dead in the water. Also, the information the cell scheduler has isn't really enough to pick the right cell: a cell might be advertising a lot of free memory, whereas what I'm asking for is CPU cores, and it will send the request to the wrong cell. Another thing is that there aren't many people using cells, and that can be tricky when you're trying to get help from the community. That said, we have a pretty good relationship with these guys from Rackspace and CERN, so we have a bit of a cells support group going. We compared notes, what, 17 times in the last two days? Something like that - we're all doing the same thing.

Upgrades: upgrades were very painful for us at the start, but in a cells environment you can now upgrade cells one by one. When we moved from Havana to Icehouse, we upgraded our top-level cell to Icehouse first, with all the other cells underneath still on Havana, and then we could slowly pick a cell, upgrade it to Icehouse, check that it was working, and move on to the next. That was really good for us: we didn't have to do a big-bang approach, and we could use a test cell to upgrade first. So upgrades aren't so much of a burden anymore.

Some of the things we've been working on aren't in the codebase at the moment - Matt will talk a bit more about that later - such as syncing security groups out to the cells. There are the EC2 ID mappings, if you're familiar with them: you need them in the API and in the metadata service, because an instance might want to know its AMI ID. Availability zone and aggregate support - there's still really no support for that yet, but there are ways around it; come and talk to us if you want to know about that. Same with flavor management; that's been on the agenda too. For some of these things we've had to make the assumption that there's only one parent cell.

That's the end for me. I'm going to hand over to Belmiro, and he can tell you about his side.

So, my name is Belmiro Moreira, and I work at CERN on the cloud deployment. What is CERN? CERN is the European Organization for Nuclear Research. It was founded in 1954, and the lab sits on the border between France and Switzerland, near Geneva. The organization has 21 member states, and our mission is to do fundamental research: looking for answers to fundamental questions like where is the antimatter, and how the universe works. Quite ambitious, in fact. For all of this scientific research, CERN provides different infrastructures, like a network of particle accelerators.
CERN runs the biggest particle accelerator in the world, the Large Hadron Collider, which is a ring of 27 kilometres buried about 100 metres underground, and of course other infrastructures such as cryogenic labs, and computational resources that we make available to scientists all around the world. To provide these computational resources to our scientists more efficiently, we started deploying a cloud infrastructure, which has been in production since July 2013, based on OpenStack. At that time it was the Grizzly release. Since then we've done two upgrades, from Grizzly to Havana, and last month we just finished the upgrade to Icehouse. In our cloud infrastructure we run two virtualization technologies: KVM for our Linux compute nodes and Hyper-V for the Windows compute nodes. The infrastructure runs in two geographically separated data centres: one in Geneva, where we have around 94,000 cores and more than 200 petabytes of storage, and a more recent one in Budapest, Hungary, where we have 21,000 cores available. Since we've had this cloud infrastructure we've been migrating all the applications and services we run at CERN onto the cloud, and we're converting those physical servers into OpenStack compute nodes. At the moment our cloud infrastructure has 75,000 cores available, and we're running around 8,000 virtual machines. You can see that the number of virtual machines is not that big; the reason is that these are machines with a lot of cores, to process the LHC data.

So why did CERN choose to deploy cells in the cloud infrastructure? Several reasons. First, we run our infrastructure in two different data centres, but we don't use the concept of regions - we hide this from our users. We only want one endpoint, so the data centres are completely transparent to the users. The second reason is availability and resilience. When you have thousands of resources, it's important to split the infrastructure into smaller chunks, because if something happens to your database infrastructure or your RabbitMQ clusters, you don't want your whole cloud to be affected. The other reason is to isolate different use cases. As a private cloud we have different use cases and heterogeneous hardware, and we want instances from special projects to land on special hardware; cells are one way to isolate those capabilities.

In terms of architecture, we have one API cell and eight compute cells. The size of these cells ranges from 100 compute nodes to 1,600 compute nodes. The reason we have a few very large cells with so many compute nodes is historical; what we're doing now is splitting those large cells into smaller ones. In fact, we believe that for us, cells of between 200 and 400 compute nodes are the ideal fit. We also have an internal concept of shared and private cells. We call a cell private when only a few projects can spawn instances there; this has to do with the capabilities of the hardware. For example, we only want special projects to go to compute nodes that have an SSD cache, say, or that are backed by diesel generators in case of power failure. And then we have the shared cells, where anyone can spawn VMs and there are no scheduling restrictions. Of course, if you use cells, there are some features that are not available.
And these are the ones we miss the most. Security groups: if you're using nova-network, security groups are not available with cells. When we started the deployment of our cloud infrastructure, we initially had a beta cloud - a small cloud with a few hundred nodes - that we opened to our users to get early feedback. At that time our users started creating VMs, and you'd have to tell them: you can't SSH to your virtual machine, you can't reach your web application, because you haven't created the security group rules to allow access - you need to do that. Then when we deployed our production service, security groups of course were not there, and users were already relying on those features and using them, so this was really frustrating for them.

The other missing feature is flavors. As a private cloud we have special use cases, so projects have special flavors, and Nova allows that: you can create a flavor and dedicate it to a project. However, if you use the Nova API you're only interacting with the API cell; the flavor is not propagated to the child cells, which are the ones that actually spawn the instance. Of course you might think, well, I can also deploy nova-api in the child cells - and you can do that. However, if you then run the same API call in a child cell, the flavor ID there will be different from the one in the top cell, so they won't match, and you can create a big mess with that. So what we do now is sync the different databases manually for this to work (a rough sketch of that kind of sync follows below).

Aggregates: in the case of aggregates the problem is a little bit different, because the top API cell is not aware of aggregates - only the compute cells are. This causes some problems, because if the top cell isn't aware of aggregates, the cell scheduler isn't aware of them either. And if you want to have availability zones in your cloud, you can't, because an availability zone is in fact an aggregate, so you need to tweak Nova to get this feature.

Also server groups. This is a feature that landed in Nova in Icehouse and is really interesting for us - it's where you can define affinity policies for groups of servers - and it simply doesn't work if you have cells.

The cell scheduler, as Sam said, is a very limited scheduler; it doesn't have a lot of information. Another problem is that once it selects one cell as the best cell, it will keep selecting that same cell as long as the resource consumption doesn't change. It would be nice to have some randomness, so that if a cell is misbehaving the scheduler can go to a different cell.

Ceilometer integration: Ceilometer is important for us for monitoring and for all our metrics, and getting it to work with cells is challenging. The main reason is that the Ceilometer compute agent needs to query the Nova API in order to know which virtual machines are running on the compute node, and the Nova API runs in the top cell. The problem is that the domains on the compute nodes are named with the instance IDs from the child cells, so the Ceilometer agent is not able to identify the virtual machines running there, and as a consequence you don't get any information. Of course you can work around this, as we did: you need to set up a nova-api in the child cells, but then you also need to set up a Keystone for that, and a Glance as well. So it's a lot of work.
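Picking up the manual flavor sync mentioned above, here is a minimal, hypothetical sketch of copying one flavor row from the API-cell database to a child-cell database while preserving its identifiers, so the top cell and the child cell agree. It assumes SQLAlchemy 1.4+, the legacy `instance_types` table Nova used in that era, and placeholder connection URLs and flavor name; it is not CERN's actual tooling, and extra specs and project-access entries would need the same treatment.

```python
from sqlalchemy import create_engine, MetaData, Table, select

API_CELL_DB = "mysql://nova:secret@api-cell-db/nova"      # placeholder URL
CHILD_CELL_DB = "mysql://nova:secret@child-cell-db/nova"  # placeholder URL

def copy_flavor(name):
    src = create_engine(API_CELL_DB)
    dst = create_engine(CHILD_CELL_DB)
    # Reflect the legacy 'instance_types' table (the flavors table in this era).
    src_t = Table("instance_types", MetaData(), autoload_with=src)
    dst_t = Table("instance_types", MetaData(), autoload_with=dst)
    with src.connect() as s, dst.begin() as d:
        row = s.execute(select(src_t).where(src_t.c.name == name)).mappings().one()
        # Insert the row verbatim, keeping 'id' and 'flavorid' identical so the
        # API cell and the child cell report the same flavor identifiers.
        d.execute(dst_t.insert().values(**dict(row)))

copy_flavor("m1.special-project")  # hypothetical flavor name
```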
It would be nice to improve this integration. So what are CERN's challenges concerning cells over the next months? By the end of the year we're going to receive more hardware deliveries, and we're going to add another 74,000 cores into our cloud - basically doubling its capacity. The problem is that there are so few use cases running cells. How are we going to organize this new hardware? How many cells are we going to have at the end? Considering our expectation of around 200 nodes per cell, we're expecting to have more than 30 cells with all the hardware we'll have. So the question is how we can manage all these cells, given the lack of experience and use cases out there. We'll see. So now Matt will talk about Rackspace.

As they said, my name is Matt Van Winkle. I'm an engineering manager at Rackspace. My team is pretty much charged with keeping the public cloud up and running, which means we don't sleep a lot. For those who don't know much about Rackspace - which you probably do - it's a hosting company that's been around since the late 90s. We cut our teeth on dedicated hosting, web hosting, those kinds of things. We offer private clouds and the public cloud, we let you stitch them together, lots of fun things like that. As far as our cloud itself, our OpenStack-based cloud has been around since about August of 2012. It's currently in six geographic regions - there's a map up there. We do upgrades from trunk fairly regularly; I think we're running on a pull we did around August or September right now, and we've got a pull in testing from mid-October, so we try to stay pretty up to date. All of our compute nodes are Debian-based. However, we run our compute nodes as VMs on our hypervisors. It sounds crazy, it doubles the number of nodes you have to manage, and it does pose challenges, but it actually gives us some flexibility we really enjoy. Just to give you some rough numbers - I don't usually calculate these, so this math is probably off a little - we have tens of thousands of hypervisors spread across those six regions, with something like 330,000 cores, and I calculated just over a petabyte of RAM under OpenStack management in the public cloud. Unlike these guys, who have to go out and get funding for new hardware, I have the reverse problem: because people pay us for it, I have a supply chain department and a finance department continually landing new gear in my data centres, and I have to figure out how to get it online as quickly as possible. And we're running somewhere just north of 150,000 virtual machines - that was as of a couple of months ago, so these numbers change all the time.

So why do we use cells? One of the big reasons, aside from just the natural sharding, is that we offer several classes of flavors, as we call them. If you go to our website you'll find general purpose, high I/O, one called standard, and we just launched some workload-optimized flavors - diskless hypervisors that support either high CPU or high memory. One way we manage the growth of each of those is by grouping them into cells. On top of that, like I said, we have a constant stream of hardware coming in, and so far, based on the way certain things we're testing work - like live migration and some other functions - we like to keep the cells homogeneous.
So while we have multiple flavor classes, we also have multiple suppliers within those flavor classes, and you basically want to have one specific hardware type in a cell. In the future, when host aggregates and some of those things work a little better, that might change for us, but that's a big reason. Network, actually - things outside of OpenStack - ends up being the biggest sizing factor for us. Obviously public IPs are interesting, and we have to make sure those are sized in a way customers can use, but what we find actually drives our sizing the most is our private IP space. We allow customers to connect between Rackspace products over a private network, and a lot of the time we size a cell based on an efficient use of that space, because even though it's private IP space it's still limited, and we want to allocate it as efficiently as possible within each region. We've learned a few other things. The 200-to-400 sweet spot mentioned earlier turned out to be good for us too, because with OVS there are problems with broadcast domains getting out of control when certain tenants start doing bad things. We had some cells get as big as 600 nodes, and we've started paring those down. In general it's a two-level tree: like the others described, we have global APIs and resources in every region, with anywhere from three cells up to 30 or 35 cells in our biggest regions right now.

Since Belmiro brought up the point about private versus public cells: all of our cells are available to all customers, based on the flavors we're trying to build, and we obscure that right now. We are looking at potential availability constructs - I'll use that generic term - down the road. But we have tested and validated the ability to do private cells as well, where we bind a cell to a tenant. We've had a couple of potential customers come in and say they really want that, so we have the ability to do it if necessary.

I'm not going to repeat too much of what they said; we're seeing some of the same problems on the scheduler side. Does anyone in the audience work with cells, besides my guys? When you attach a cell to a region, you basically link it up in a database and it's good to go, and there's really no way to throttle it. There is what they call weighting, where you can say these cells are relatively more popular than those cells, but it's still just a function of math and available RAM (there's a toy sketch of that weighting below). The short version is that if something breaks, it's hard to stop sending builds to the broken cell. So there are a few functions like that we really need in the scheduler. And then, of course, we're running Neutron in a very large deployment, and people ask us how we did that, since it doesn't support cells. The simple answer is lots of duct tape; we had to get in there and make it work. So obviously, getting cells to a fully feature-complete status in Nova puts pressure on the other projects to figure out how to support it, and that's a big driver for us.

Some additional challenges: like I said, we're constantly evolving our product line, and with each new flavor offering comes the potential for new hardware, new cell sizes, those kinds of things. There are the multiple vendor sources I already mentioned. And, to be honest, we're still learning the exact math that goes along with the global services - the regional-level services - as cells scale out.
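To illustrate the "function of math and available RAM" point, here is a toy sketch - not the actual nova-cells scheduler code - of the kind of RAM-driven weighting being described, plus the sort of randomness Belmiro asked for earlier. The cell names, the numbers, and the pick-from-the-top-three tweak are all made up for illustration.

```python
import random

def pick_cell(cells, requested_ram_mb, randomize=False):
    """cells: list of dicts like {'name', 'free_ram_mb', 'offset', 'scale'}."""
    candidates = [c for c in cells if c["free_ram_mb"] >= requested_ram_mb]
    if not candidates:
        raise RuntimeError("no cell can satisfy the request")
    # Weight is essentially free RAM times a per-cell multiplier plus an offset,
    # so the same "best" cell keeps winning until its capacity figures change.
    weighted = [(c["offset"] + c["scale"] * c["free_ram_mb"], c) for c in candidates]
    weighted.sort(key=lambda t: t[0], reverse=True)
    if randomize:
        # One possible mitigation: choose randomly among the top few cells
        # instead of always the single highest-weighted one.
        return random.choice([c for _, c in weighted[:3]])["name"]
    return weighted[0][1]["name"]

cells = [
    {"name": "cell-a", "free_ram_mb": 900_000, "offset": 0.0, "scale": 1.0},
    {"name": "cell-b", "free_ram_mb": 400_000, "offset": 0.0, "scale": 1.0},
]
print(pick_cell(cells, 8192))        # always "cell-a" while its free RAM stays highest
print(pick_cell(cells, 8192, True))  # spreads builds around even if "cell-a" is degraded
```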
What I don't have today is an easy formula that says for every 10 cells I should have X number of API nodes and X number of Glance nodes; we're still learning as we go. And while cells gives you the ability to shard your database and spread the information across all these smaller databases, in its current implementation there's still a master global database that holds a copy of all that information, so you can still end up with a large, unwieldy database if you're not careful. Case in point: we just got through pruning, in a couple of regions, deleted instance records older than 90 days, because Nova likes to hold on to those for a long time, and I still have several hundred thousand instance records in those regions.

The reason I bring that up is that there were two design sessions this morning about cells feature completion - where we're going with cells and how we make it a first-class citizen. The bottom line is that the Nova dev team has looked at it and says it makes a lot of sense: it allows Nova to scale very well and it solves some real problems, but it needs to be brought fully into the mix. So there's a lot of work to be done. I think we're talking several releases - we're on K right now, so I wouldn't be surprised if it's L or M before it's fully done - but the sessions went really well, and there are already some targets for the Kilo release, which include getting the upstream gates fixed and functioning with cells, and then we'll move on from there. The nice thing about it, to tie it back to the database piece, is that right now if you run cells, not only do you have this big global database, you also have a middle RPC layer: there's a cells service that runs at the regional level with its own RabbitMQ and helps pass messages down to the cells, and then each cell has an additional cells service, its own RabbitMQ, and the scheduler we all expect to see. The idea with where we're going now is to get rid of that middle RPC layer, make the database at the global level really just a mapping that says this instance is in this cell, and then let the APIs know how to talk to the levels below. That's the very simplified version of what took place this morning, and I'm sure there'll be some evolution of it, but I think we all walked out feeling pretty good about how it went - there was some worry that we'd get into this thing and it would be decided to just chuck it out the door, and the three of us would be really, really sad if that happened. As a follow-on, the three of us are also involved in a large deployers working group that's meeting tomorrow, so one of the things we're going to try to do is put some specifics around the features we want and take them right back to the devs through that process. I will say, if you're an operator, as the operator movement continues to grow, find a working group that matches what you do and get involved, because we're seeing real turnaround right now in the process of being involved in the design sessions, going back as a working group, coming up with specifics, and turning it right back around to the devs. Really good stuff. I think we actually have time for questions, so go for it.

Can I ask a question of the first speaker? Sure. Yeah, sure. Good presentation, just one question.
You said about 1,000 hypervisors with 40k cores. Does that mean 40 cores per hypervisor? Yeah, it probably works out to that. We have some hypervisors that have about 64 cores and some that have 24, but on average it probably works out to about 40. Okay, so this is real data? Sorry? I mean, I just doubt 40 cores per hypervisor. We've got some brand new gear that has 40-core processors - yeah, they're out there. We also have some AMD hardware with 64 cores per hypervisor. Is it in production? Oh, yes. Okay, cool. Thank you.

I have a question for the first speaker. You said your Cinder is centralized across the different locations. What about the Cinder back end? What is the latency from your cells to your back end, and how are you managing that? Okay, so we have the Cinder API and the Cinder scheduler running centrally, but all the cinder-volume services are distributed where the cells are. Does that make sense? So the actual storage is within the cell itself. Makes sense. So you can't attach a volume from one cell to another cell. Did you end up changing the Cinder scheduler to handle that? No, there are no modifications to Cinder. Cinder already has a notion of availability zones, and we've matched those up with ours.

I was wondering if you have any advice for moving from a cloud that does not use cells and uplifting it to using cells, or was everything you guys did all net new? We've been on cells since the beginning, so what did you guys do? We actually were not on cells at the start. We started on Diablo, and then when we upgraded to Essex we moved to cells. It wasn't actually too much trouble: you build the top-level API cell, sync a few things up to that database, and then you're essentially good to go - not quite that easy, but you know. I think that answer is going to change over time, too, because there's going to be a release of OpenStack in the near future where, when you install it, you've got cells - the default will effectively be one cell, kind of transparently - but we just don't know the timing on that. So if you need it now, then I think you have to go that route; if you think you're going to need it in a year, you might be able to get it without a whole lot of extra effort on your part. And going down that route, there will actually be a migration strategy for people who are just using stock-standard Nova without cells to move over, because it will be the default. Thanks.

Do you know whether Neutron is heading in that direction - have you heard anything? I'm not completely familiar; I don't have as much direct involvement with Neutron. I think the general premise from the last summit - and it's probably carrying over a little this way - is that they still have a lot of base functionality to fix. I think people from Rackspace brought the idea to them, and they're still in the let's-get-the-ship-in-order phase; then they'll come back and talk about the fancy furniture and things like that. From a plumbing perspective, our cells translate to dedicated VLANs, so Neutron can assign information, but the network piece is essentially established for it. And if you want more details, Andy right here in the front row is the right person to talk to about that. Punt. Oh, sorry.

I attended an event of Huawei's.
They spoke about cascading - do you know about it, and do you think OpenStack will go in that direction, or will it continue with cells? So, we were all in that session, and I think the specific need they were describing was really more of an application-level thing - the coordination of clouds. It could be argued it was cascading, but when we really dug into it, I'm not even sure it was that. I think Sam is kind of doing a version of cascading with cells today. But I don't see the two diverging. The Nova devs are pretty dead set on getting cells into a shape where it's the default install. How you chain multiple clouds together after that - there are a number of ways to look at it, but I don't know if you guys have any thoughts.

I have a question for the first speaker. How did you upgrade your data centres - your cells - and keep consistency between all the data centres? Well, for starters we didn't have consistency; we took a staged approach to upgrading OpenStack. Really we just put in a bit of planning: we made sure we could run different versions of OpenStack in different cells, and it took, I think, a week or two. We upgraded the first cell and left it running for a few days to make sure everything was fine, and once that worked we did one cell every couple of days. The only issue we came across was a small five-minute outage on a cell when we did it. And for Rackspace, do you have any problems with floating IPs? No, because we don't use them right now - plain and simple. I mean, we're working on it, so we're going to have to figure that out, but there are some plumbing-related issues that go along with that, around how routing is done for us - again, we're part of a larger company, so we have backbone groups and things like that. That's a TBD, but today we don't do it. Okay, thank you very much.

For us it's by region, so each region has its own API endpoint. You guys? Yeah, so we only have one API endpoint. Then we have some filters on the cell scheduler so that a special user with a role can specify a data centre, but those are very specific cases. We use host aggregates for the cases where we don't want to set up a whole cell for a few nodes that have a specific capability: we put the nodes with that capability into one of the larger cells, the host aggregate basically represents that capability, and then we filter which projects' requests are allowed to go to that cell, and inside the cell the scheduler selects the specific aggregate. That's the way we use aggregates.

Anything else? I think we have... yes, here. How do you guys manage flavors - how do you create them in production? In the case of CERN, we basically don't use the API for flavors: if a request is accepted, we have a CLI that updates the database directly. Yeah, same here - just copying; it gets dumped from one database and popped into the others, things like that. Okay, thanks. I think that's all. Great. All right, thanks. Thank you. Thank you.