Hi, my name's Sam. I'm from Melbourne University down in Australia, and I'm part of the Nectar project. This is a three-part talk: there's myself, we've got Belmiro here from CERN, and Matt here from Rackspace. We're going to give you a bit of a talk about cells and scaling OpenStack. We gave a talk in Paris six months ago, so this is a bit of a follow-on from that one; for people who were there, we'll hopefully have something different to tell you.

Nectar is the project I'm part of in Australia. It's a government-funded project that started in 2011. We started our early cloud on the Cactus release, went into production around Diablo, and we've been slowly upgrading ever since; we're mainly on Juno now, with a few Icehouse nodes around. We're part of a federation, as you can see: we've got eight sites around the country that are geographically dispersed. Luckily, being on the research network we have big pipes between the data centers, so we use cells to join up all these sites around Australia.

The reason we went for cells is that there are really a couple of ways you can scale OpenStack: one of them is regions, and the other is cells. We wanted a single endpoint for our users. Most of the users on our cloud are researchers and scientists, not all IT professionals, so we wanted to give them a nice, simple interface: a single set of security groups, a single set of key pairs, and one central front door for everyone to go to and use our cloud. The other advantage we see in cells is, I guess, the OpenStack expertise, or the lack of OpenStack expertise, that we have. We can have a core team in one location handling all the central APIs and so on, and each of our sites around the country really just needs to deal with the compute nodes and a small subset of OpenStack. So that's another reason we went for cells.

How big is Nectar? There are eight sites around the country, and we're still growing pretty rapidly; we're currently at about our final deployment size. There are around 700 hypervisors, so roughly 100 or so per cell. I guess we're not as big as these guys, but in terms of how we scale our system, I think we're a decent size. In terms of people we're a small team as well: we have three people running everything, maybe less than that, and we have high staff turnover, as everyone does, I think.

Interaction with other services: we have the other OpenStack services involved as well. We try to have a Swift region in each cell, and then local Glance APIs can serve images out of that. Because we're so geographically dispersed, having a local object store, and local images in particular for booting, is quite important to us. We also have Cinder around the cloud; all of our cells will have at least one Cinder volume host, and we're using different back ends at each. All of our cells look the same on the surface but are quite different underneath; it's up to each site to determine which vendors and hardware they use. Ceilometer is another thing we use, with varying success at the moment: we have a large MongoDB cluster and we really just fire everything at it. Some things work, some things don't.
We're a nova-network installation: when we started there was only nova-network, and we're still on it. We do have a big network across all our cells, so in the future we might be able to leverage that when moving to Neutron. In terms of what we actually run in each cell, this is just a list of everything we run: the APIs at the central level, and really just the compute stuff at the cell level.

One of the things we really need to deal with is scheduling in terms of cells: where do we put instances? We have geographical requirements, and we have users who want to schedule to a certain cell, or, for us, an availability zone. We use the cell as an internal thing for scaling, and the availability zone as the user-facing thing. They roughly match up one-to-one, cell to zone, but not necessarily, and users can choose where they want to launch their instances. We have a test cell, I guess we'd call it, that we only let certain tenants onto; we use filters for that. We have some cells with higher-performing nodes, GPUs, fast I/O, things like that, and we use the same things you'd use in a non-cells environment, aggregates and flavors, to determine where those instances are scheduled.

The other thing to do with scheduling is bringing on new cells, which can be quite an issue. We want to test a new cell before we open it up to the public, and we don't want to flood it. Part of cell scheduling is how much RAM is available, so if you bring on a big new cell, everything is just going to pile into it, and if that new cell is faulty, it's even worse. So we use a few techniques to bring a new cell on slowly: we'll start off with it restricted to only certain people, treat it as pre-production with beta users on there, and then slowly open it up to the global scheduler, so we have a nice smooth on-ramp.

We also run an OpenStack cloud to manage all our infrastructure, which is something we did just before Paris, I think. It's been quite important to us and has really made it easier to deploy our infrastructure; we've got so much infrastructure now that we need a cloud to manage it. Our control infrastructure is mainly virtualized: we have physical databases and RabbitMQ servers, but everything else is pretty flexible, and that really helps when we're upgrading. I think we've got a process now for our upgrades. We've almost finished the upgrade to Juno, so about half our cells are on Juno and half are on Icehouse. We've got a process of doing conductors and APIs, and at least the last two upgrades have had no impact on users. We can do a cell at a time, and rolling upgrades are just a lot easier.
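To picture the split Sam describes, with the API-level services in the top cell and just the compute services in each child cell, here is a minimal sketch of a Juno-era cells v1 configuration; the cell name is made up and this is not Nectar's actual config. Each cell is then told about its neighbours, for example with nova-manage cell create or a cells config file.

    # Top (API) cell: runs nova-api and nova-cells, alongside keystone, glance, etc.
    # /etc/nova/nova.conf
    [cells]
    enable = True
    cell_type = api
    name = api

    # Child (compute) cell: runs its own nova-cells, nova-scheduler, nova-conductor,
    # nova-compute and nova-network, with its own database and RabbitMQ.
    # /etc/nova/nova.conf
    [cells]
    enable = True
    cell_type = compute
    name = melbourne-qh2    # hypothetical cell name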
Monitoring between cells: because we're a federation of different organizations, we don't have the same people at each cell, so really the interaction between our RabbitMQ servers and the cells is the firm contract point. We send a message to their RabbitMQ, and it's really up to them to deal with it, how the instance gets scheduled and how the request gets handled, so having strong monitoring around that interaction is quite key for us, and vice versa; there are requirements on both sides. We have to make sure our system is highly reliable, and one good test is really just the console log: spin up an instance in each of your cells and do some monitoring on its console log. Because it's a call that needs a response, it's a good way of actually exercising the functionality that cells provides.

For the future, we're looking at Neutron at the moment. Nova-network has an undefined future, and we think Neutron is the way to go. There are no real differences with cells; there were a few little things we had to make work, but really Neutron can work in a cells environment without any changes. Because our networking is very simple, just a single flat network with public IPs everywhere, it's probably a simpler migration for us to move to Neutron. Once we're on Neutron, we're hoping to get some of the fancier things, tenant networks, load balancing as a service, some of those higher-level services, but that will be slow progress, I think; throughout this coming year we hopefully will get there. We're looking at other services too, and really all the other higher-level services don't need to know about cells, which is quite nice. I think that's all I've got, so I'll pass over to Belmiro. Thanks.

So, my name is Belmiro Moreira and I work for CERN. What is CERN? CERN is the European Organization for Nuclear Research. It was created in 1954, it has 21 member states, and it's located on the border between France and Switzerland, near Geneva. CERN's mission is to do fundamental research, and it is the biggest international scientific collaboration in the world: 10,000 scientists from more than 100 countries work at CERN. For this research, CERN operates a network of particle accelerators and detectors that are used by several experiments. One example is this one, the Large Hadron Collider, the largest and most powerful accelerator in the world. It's a circular accelerator with a circumference of 27 kilometers, and it sits 100 meters underground near Geneva. When in operation, the detectors connected to the Large Hadron Collider can produce one petabyte of data that needs to be filtered, stored, and analyzed. To analyze all this data, CERN provides computing resources to scientists all around the world, and to help with that we now provide a cloud infrastructure based on OpenStack.

Six months ago in Paris I presented CERN's motivation for using cells, and now I'm going to give you a perspective on the current state, what has changed, and some problems we are facing. The cloud infrastructure has been in production since July 2013. We are now running Juno; we finished the upgrade three weeks ago, and we've also started offering it to our users. We have two different virtualization technologies in our cloud, KVM and Microsoft Hyper-V, and we're starting to upgrade our Scientific Linux 6 nodes to CentOS 7. The reason for this is to continue to use the OpenStack distribution packages. Our cloud runs in two different data centers.
One is located in Geneva, the other in Budapest. In terms of numbers, during these six months we added a few more nodes, so now we have 120,000 cores in our cloud, on average 11,000 VMs running, and in total 16 cells.

We also changed how we deploy cells in these six months. Our new cells now have around 200 nodes. Because they are smaller we're going to have a lot of them, but if one fails, the impact on the infrastructure is not so big. It also means that with smaller cells we can have a smaller control plane, which is much easier to bring back in case of failure. For this reason we are now not clustering RabbitMQ at the child cell level; in fact, most of our problems in the past were related to network partitions in our RabbitMQ clusters. Different cell types have different requirements. We have what we call three cell types: compute cells for CPU-intensive tasks, for example LHC analysis; service cells that run web servers, databases and so on; and critical cells that run the really critical applications we have, like the radiation control systems. We now map one availability zone to a cell, meaning a cell is exactly one availability zone, and we don't use aggregates anymore in our infrastructure; it turns out that with thousands of nodes, managing aggregates is really challenging. We still have the big, big cells we set up two years ago: we have cells with eight hundred nodes, and our biggest one has one thousand seven hundred nodes. We are looking at how we can split these cells; possibly a talk for the next summit.

So last summit I told you about all the motivation, problems, and challenges we have at CERN running cells. What I want to do today is go through some of those problems and tell you how we are dealing with them. The first one is pruning the Nova databases. During these two years, more than one million VMs have been created in our cloud. Of course, we don't have the resources to run one million VMs at once, so for new ones to be created, others need to be deleted. On average we have eleven thousand VMs running, with a creation and deletion rate of between 100 and 300 VMs per hour, and over time, of course, the databases grow. This is not only a problem if you are running cells, but if you are running cells you multiply the problem by the number of cells you have, because each one has a database. The problem is that when you delete an instance in Nova, it's only soft-deleted in the database; all the information is preserved there. At CERN we have a policy to preserve this information for at least three months after the instance is deleted; after that we can forget it and delete it completely. So how can we do this? Nova has the ability to archive deleted rows, but for our use case, and using cells, that is not really ideal. So we built a small tool, basically, that goes to the top cell, the parent cell, finds all the instances that were deleted before a specified date, removes all the records related to those instances, and then goes to all the child databases and deletes exactly the same instances there. That way we keep consistency between the databases: when we delete a VM, we really remove it from the top cell and we also remove it from the child. All the code is available on GitHub, so you can have a look, and I also wrote a small blog post explaining how this works in more detail.
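To make the shape of that tool concrete, here is a minimal sketch of the idea (not CERN's actual code): it assumes the Nova schema of that era, an instances table with uuid, deleted and deleted_at columns plus related tables keyed by instance_uuid, and the connection strings and the short table list are placeholders only.

    # purge_deleted_instances.py: illustrative sketch only, not the CERN tool.
    from datetime import datetime, timedelta
    from sqlalchemy import create_engine, text

    API_CELL_DB = "mysql://nova:secret@api-cell-db/nova"          # placeholder DSNs
    CHILD_CELL_DBS = ["mysql://nova:secret@child01-db/nova",
                      "mysql://nova:secret@child02-db/nova"]
    RETENTION = timedelta(days=90)
    # Only a few of the tables that reference an instance are listed here.
    RELATED_TABLES = ["instance_info_caches", "instance_system_metadata",
                      "instance_metadata", "block_device_mapping"]

    def instances_to_purge(engine, cutoff):
        # Soft-deleted rows have a non-zero `deleted` column in Nova's schema.
        with engine.begin() as conn:
            rows = conn.execute(text(
                "SELECT uuid FROM instances "
                "WHERE deleted != 0 AND deleted_at < :cutoff"), {"cutoff": cutoff})
            return [r[0] for r in rows]

    def purge(engine, uuids):
        with engine.begin() as conn:
            for uuid in uuids:
                for table in RELATED_TABLES:
                    conn.execute(text("DELETE FROM %s WHERE instance_uuid = :uuid"
                                      % table), {"uuid": uuid})
                conn.execute(text("DELETE FROM instances WHERE uuid = :uuid"),
                             {"uuid": uuid})

    if __name__ == "__main__":
        cutoff = datetime.utcnow() - RETENTION
        api = create_engine(API_CELL_DB)
        doomed = instances_to_purge(api, cutoff)   # decide in the parent cell...
        purge(api, doomed)
        for dsn in CHILD_CELL_DBS:                 # ...then delete the same UUIDs
            purge(create_engine(dsn), doomed)      # in every child, for consistency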
So next, cell scheduling. We have different cells, and each cell has different requirements: different hardware, a different location, a different network configuration, different hypervisor types, and so on. So how can we make sure that a VM from a project is scheduled to the right cell? Child cells have the ability to expose capabilities to the parent, so we use that feature to inform the nova-cells scheduler of the capabilities each child has, and we also use metadata that the user can provide when booting an instance. The other thing is that the nova-cells scheduler works much like the Nova scheduler, so it supports scheduler filters. What we did was build a set of scheduler filters to help with this: selecting a data center, selecting an availability zone, selecting the hypervisor type, and mapping a project to a cell. The code, again, is available on GitHub; you can use it as a starting point to write your own scheduler filters, and it will probably be a good help there.

Flavor management. Because different cells have different requirements, and the projects that run at CERN all have specific needs, they all need special flavors. So how can we manage different flavors per project in a big infrastructure? Unfortunately we cannot use the Nova API directly, because when you create a flavor using the Nova API connected to the parent cell, the flavor is only created there; it's not propagated to the child cells. Of course you might think, well, we also run the Nova API in the child cells, so I can just use the Nova API individually in each one. That doesn't work, because the record ID you get when you create the flavor in the top cell will be different in all the children, and then you create a big mess: you think you are starting a VM from the flavor 'small', and in fact you are using a completely different flavor. Initially what we did was go through the databases and add all the flavors individually, but as we started to add more and more cells that became risky and difficult to manage. So again we wrote a small script that goes through all the databases, selects a free record ID, and uses it to propagate the flavor to all the cells. The trick here is to add the flavors in the child cells as public; if you do that, all the management can be done in the parent cell. If you remove a flavor you don't need to touch the child cells, and if you want to dedicate a flavor to a project you only need to do that in the parent cell. Again, you can check out the code on GitHub.

So, increasing the number of cells also increases the challenge of keeping everything working. How can we make sure our cells are performing well, before and after being deployed into production? For this we're starting to look into Rally, for benchmarking and also to test functionality. We have multiple scenarios covering the basic functionality, for example creating and deleting a VM using the Nova API and the EC2 API, and we integrated all of this with our Kibana infrastructure. The reason is that the UI Rally provides doesn't give us a historical view of the running tests; you only see one iteration. So we integrated it with Kibana. For now we don't have alarming, but it would be nice in the future to have some kind of notification if a test or a scenario fails; we are looking at doing this, and probably we will use Spark for it.
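Going back to the flavor propagation trick Belmiro described, here is a minimal sketch of the idea (again, not the actual CERN script): it assumes the Juno-era instance_types table with the column list trimmed for brevity, placeholder connection strings, and that the parent cell's database is listed first.

    # propagate_flavor.py: illustrative sketch only, not the CERN script.
    from sqlalchemy import create_engine, text

    ALL_CELL_DBS = ["mysql://nova:secret@api-cell-db/nova",      # parent cell first
                    "mysql://nova:secret@child01-db/nova",
                    "mysql://nova:secret@child02-db/nova"]

    def next_free_id(engines):
        # Pick a record id unused in *every* cell, so the flavor has the same
        # primary key everywhere and instances resolve to the same flavor.
        used = set()
        for engine in engines:
            with engine.begin() as conn:
                used |= {r[0] for r in conn.execute(text("SELECT id FROM instance_types"))}
        return max(used) + 1 if used else 1

    def create_flavor(name, flavorid, memory_mb, vcpus, root_gb, public_in_parent):
        engines = [create_engine(dsn) for dsn in ALL_CELL_DBS]
        record_id = next_free_id(engines)
        for i, engine in enumerate(engines):
            # Trick from the talk: always mark the flavor public in the child
            # cells and manage visibility (projects, deletion) only in the parent.
            is_public = public_in_parent if i == 0 else True
            with engine.begin() as conn:
                conn.execute(text(
                    "INSERT INTO instance_types (created_at, deleted, id, name, "
                    "flavorid, memory_mb, vcpus, root_gb, ephemeral_gb, swap, "
                    "rxtx_factor, is_public) VALUES (UTC_TIMESTAMP(), 0, :id, "
                    ":name, :fid, :mem, :vcpus, :root, 0, 0, 1.0, :pub)"),
                    {"id": record_id, "name": name, "fid": flavorid,
                     "mem": memory_mb, "vcpus": vcpus, "root": root_gb,
                     "pub": is_public})

    # Example (hypothetical flavor): create_flavor("m1.project-x", "px-1", 4096, 2, 40, False)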
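On the Rally side, a scenario like that create-and-delete test can be described in a task file roughly like this; NovaServers.boot_and_delete_server is a stock Rally scenario, and the flavor and image names here are placeholders rather than CERN's actual test set.

    {
        "NovaServers.boot_and_delete_server": [
            {
                "args": {
                    "flavor": {"name": "m1.small"},
                    "image": {"name": "centos-7-test"}
                },
                "runner": {"type": "constant", "times": 10, "concurrency": 2},
                "context": {"users": {"tenants": 1, "users_per_tenant": 1}}
            }
        ]
    }

A run is kicked off with something like "rally task start boot_and_delete.json", and the per-iteration results can then be shipped to Elasticsearch so that Kibana can chart pass/fail and durations per cell over time.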
So this is one example of our integration between Rally and our Kibana infrastructure. You can see that in this example we have three cells and different scenarios, and in the heat map we can immediately identify which scenarios are passing and which are failing, per hour, which is something the Rally UI at the moment doesn't give us. Well, the CERN cloud doesn't really look as bad as it looks here. We can also go deeper and see the duration of a scenario and the duration of each individual operation. For example, in the first case, booting a VM and then deleting it took around 28 or 29 seconds, and you can see that most of the time was spent booting the VM and then deleting it. So I hope this will help you when you are deploying your cloud infrastructure using cells. Next, Matt will talk about cells as Rackspace runs them.

So yeah, just to give you a little background: I work with our fleet management group at Rackspace. I have several engineers who are tasked with automating both the physical and the virtual nodes that make up our cloud, and I'll talk a little more about that as we go through. Quick background: we're a managed hosting company that's been around for a while. At this point we have over 200,000 customers in 120 countries around the world. From a cloud perspective we're still running six regions that span these particular locations; that hasn't changed yet. Tens of thousands of hypervisors, you can read the rest of the details, and it's growing all the time. I have one of the nice luxuries in that I have a constant stream of gear headed my way, because we use it to make money, so while I know a lot of people struggle with capacity, our struggle in a lot of cases is keeping up with the flow of capacity coming in.

Most of our deployments run anywhere from three cells to, I think, about the mid-30s in our largest regions, as far as the number of cells in a region goes, and that's largely dictated by the size of the data center we have in each place and the growth of customers there. We size cells at about 100 to 600 hypervisors, not so much on the 600 side anymore; I think when we first started, some of our original standard-flavor cells got about that high. We actually found some problems with broadcast domains if tenants started doing bad things, so that, plus the sizing of the IP blocks for the internal network that connects different products, means we've brought that down a little. Most cells now run about 100 to 150 with our newer hardware models. We break up cells based on hardware more than anything else. We offer several flavor classes to our customers, and some of the names are up there: general purpose, I/O-optimized, compute-optimized, all those things. We'll have multiple cells of each of those in a particular region, but that's the primary way we break up our hardware. We are working on exposing some additional constructs around that, like maintenance zones, the idea being that I can schedule instances near or far from each other within certain flavor types, so that when we have to go do work we're not necessarily bringing everything down at one time. We do run a separate DB cluster for each cell. The other thing I'll point out, and I think Sam mentioned they do the same thing, is that we run all of our control plane in an OpenStack cloud itself, and we actually use cells within that one as well, for two reasons. One, we have multiple hardware types now available in our internal cloud.
Secondly, we offer some of that space out to other groups at Rackspace to do their own testing or development work, and I want to make sure I can isolate the instances that represent my control plane for the public cloud from, say, an IT group that's testing a new version of the billing system. To do that we use some of the same techniques to isolate cells within this private cloud to individual tenants.

Cell scheduling: like I said, we have multiple cells within each flavor class. We also have, in most cases, multiple vendors across our different hardware types, so we try to keep a cell homogeneous from a vendor perspective, just so there are no quirks. This is especially important for matching the CPUs exactly. We are starting to use live migration more and more for internal maintenance work, and this is block live migration, so not with attached storage. As we try to make that more and more transparent from the customer's standpoint, we just don't want to run into problems with CPUs not matching exactly, so that's why we keep hardware as homogeneous as possible. And then within each flavor class you're going to have different flavors offered, much like Belmiro talked about, where the child offers up the flavors it supports. So, for example, in our general purpose offering we offer one-gig through eight-gig flavors within that class. Tenants are scheduled by flavor class first and then by available RAM; we don't schedule on any CPU information, it's purely on RAM right now, and everyone in a flavor class gets the same vCPU allocation for that class. I would like to have scheduling based on IP availability; I'll talk about how we get around that in a minute, but that's probably the one other thing I would love to see from a scheduler perspective, because sometimes those two things don't line up well. And just to point out, there is a lot of work going on with Cells V2 around the scheduler itself. This is one of the blueprints out there; there are about three, I think, and on Wednesday, in one of the large deployments team working group sessions, we're going to dive into the Cells V2 blueprints, and this is one of them. It has to do with the scheduler updating some information about the instance earlier in the process than it does today, but you can dive in more if you want to read up.

So, deploying a cell: pretty similar. We run everything, like I said, in the cloud, so we lay down the control plane with an Ansible playbook: the DB pair, the cells service, the scheduler, RabbitMQ, the key pieces. We're still testing out conductor, so we don't have those integrated yet. We run a playbook as well that populates the same flavor stuff we talked about earlier, and then we run a separate playbook that bootstraps our hypervisors. One thing I'll point out here, and I'll talk more about it on Thursday in another talk, is that we actually run our compute nodes as VMs. From a node perspective that gives me twice as much to deal with, and extra headaches, but it's a nice isolation piece: worst case, I can blow away my compute node and not affect the underlying instances, except for taking action against them. So the bootstrapper in our case is pretty much responsible for getting that VM created, getting the updated code pushed onto it, laying down the right version of OVS on the hypervisor, those sorts of things, and then pushing the routes and whatnot we need to talk to other Rackspace products.
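As a rough illustration of what that kind of automation can look like, here is a hypothetical playbook layout; the role names and the populate_flavors command are invented for the example and are not Rackspace's actual playbooks.

    # deploy_cell.yml: hypothetical sketch only; role and tool names are invented.
    - name: Lay down the control plane for a new child cell
      hosts: "{{ cell_name }}_control"
      become: true
      roles:
        - mysql_pair        # the cell's own database pair
        - rabbitmq          # cell-local message queue
        - nova_cells        # nova-cells service for this child
        - nova_scheduler
        # nova-conductor is still being tested, per the talk, so not listed here

    - name: Populate the flavors this cell advertises
      hosts: localhost
      tasks:
        - name: Run the flavor population tool
          command: /usr/local/bin/populate_flavors --cell "{{ cell_name }}"   # hypothetical tool

    - name: Bootstrap hypervisors into the cell
      hosts: "{{ cell_name }}_compute"
      become: true
      roles:
        - compute_node_vm   # create the VM that runs nova-compute
        - openvswitch
        - nova_compute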
After that we provision IP blocks, we test, and then we link it up through another playbook. As was mentioned, you can get some of those things backwards: we've actually had one case where Ansible assumed things in a way we didn't realize and linked up one cell as the global cell, which caused some interesting shenanigans as things tried to schedule to the wrong place. That's almost as dangerous as the flavor classes getting advertised wrong.

All right, so we too had to deal with purging the databases. We were a little further along than CERN when we started messing with it, but just to give you an idea, our largest regions right now are running around or over 50,000 VMs, and in some of those regions we have thousands of VMs being created and deleted in an hour; in fact, I'd tell you that y'all's code submissions to OpenStack drive a lot of that in some of our regions. So obviously we start stacking up those deleted instance records pretty heavily. We also have a 90-day retention period for deleted instance information, and I looked this morning: even in my small regions I'm still sitting on about 132,000 deleted instance records, and all the metadata that goes with them, within that 90-day window. So what does that mean? Well, here's an example of the API latency in one of our larger regions. You can see that back in November we kind of reached a tipping point where it was starting to slow down a little, and that sharp drop at the end was us successfully purging all the global deleted instances as well as the cells. I see Jesse sitting out there, so he remembers those days. So anyway, there are benefits beyond just the space that gets taken up: you can see an actual API improvement from getting all that stale data out.

Testing our cells: we also have to figure out how to test cells that are being added into a working environment all the time. The primary way we test new cells is with the bypass URL. Essentially you just add an API node; in our case we use an admin API node that supports a few extra functions. We add that separate API node temporarily, from the time we finish provisioning the cell to the time we link it up, and all of our QE tests run against that API node, so we're not actually putting the cell into production, so to speak, until we're comfortable with it. The trickier part is when we're adding additional capacity to an existing cell, because from a host perspective there's just no concept of "I'm sort of in production but I'm not really ready for prime time." There are some things we can do, targeting filters, those kinds of things. We actually have a change, I wasn't able to find the code, my schedule's been off after all the VENOM shenanigans last week, but a couple of my engineers are working on a filter that modifies the targeted-host filter to still build even if the host is disabled, and our hope is that we can use that to get all of our host provisioning completely hands-off. Our goal, hopefully this year, is that a cabinet can roll in and be turned on, and as long as the network switch is configured, everything from there, from installing the operating system, to all the bootstrapping, to all the testing, to all the linking up, happens hands-free. That's our goal for this year; we'll see how far we get. And we're going to need those host-based hints to work even on disabled hosts for that to happen.
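Here is a rough sketch of that idea, loosely modeled on the Juno-era Nova scheduler host filter interface; the force_host hint name is invented for illustration, and this is not the actual Rackspace change.

    # targeted_build_filter.py: rough sketch, not the actual Rackspace patch.
    from nova.scheduler import filters


    class TargetedBuildFilter(filters.BaseHostFilter):
        """Let a build land on a disabled host when explicitly targeted.

        Normal requests still skip disabled hosts; only requests carrying a
        (hypothetical) 'force_host' scheduler hint naming this host get through,
        which is what lets a freshly bootstrapped, still-disabled host be tested
        and filled without being opened up to everyone.
        """

        def host_passes(self, host_state, filter_properties):
            hints = filter_properties.get('scheduler_hints') or {}
            target = hints.get('force_host')          # hypothetical hint name
            service = host_state.service or {}
            if not service.get('disabled', False):
                return True                           # enabled hosts: business as usual
            # Host is disabled: only pass if it is the one explicitly asked for.
            return target == host_state.host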
The other big thing I thought I'd bring up, in the spirit of sharing pain and agony, is the concept of disabling a cell: it does not exist. You're either linked up or you're not, and when you're unlinked there are a lot of problems for the existing instances that live there. The reason this is important to us is actually IP space, which I mentioned earlier. We've had cases where the available IPs deplete faster than the available RAM, which isn't a big deal, except that right now at Rackspace getting additional IP blocks added requires manual steps by other teams to go update routers and whatnot; I don't have the ability to do that completely programmatically. So what we used to do: we have a resolver service that picks up alerts and handles them for us, and it would just go and add a weight offset of some big number. That works most of the time, but there are still cases where a brand-new cell had so much available capacity that even with an offset, even with the weight that high, it would win the calculations and builds would just keep hammering that cell. So what we did, and this is kind of a dirty hack right now, we're trying to come up with a better way, is we cooked up a special filter where instead of specifying a big value, we specify a specific value. Any time we now put minus 42 on a cell, our scheduler knows: don't send anything there, stop what I'm doing, move on, go somewhere else. I pasted the bulk of that code in here; the slides will be up later, I think we'll post them. Again, it's not the cleanest way to do it, but it lets that same resolver service pick up an IP alert and drop the minus 42 in, or, if we detect any massive anomaly in a cell for any reason, our ops guys can go in and add the weight right there, and then we know that at least no new builds are going to try to go there until we sort it out. And here's a shot from that same management application where you can see a cell weighted minus 42 the other day, based on IPs more than anything.
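Here is a rough sketch of the minus-42 idea, loosely modeled on the Juno-era nova-cells filter interface; it assumes the cell state exposes its database row as db_info, the way the stock weight-offset weigher did, and it is not the Rackspace filter from the slides.

    # mute_cell_filter.py: rough sketch of the "minus 42" trick, not the real code.
    from nova.cells import filters

    DISABLE_SENTINEL = -42.0   # magic weight_offset meaning "send nothing here"


    class MuteWeightOffsetFilter(filters.BaseCellFilter):
        """Drop any cell whose weight_offset is set to the disable sentinel.

        The resolver service (or an operator) sets weight_offset = -42 on a cell
        when, for example, it runs out of IP space; the cells scheduler then
        skips it entirely instead of merely down-weighting it.
        """

        def filter_all(self, cells, filter_properties):
            return [cell for cell in cells
                    if cell.db_info.get('weight_offset') != DISABLE_SENTINEL]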
So, Neutron and cells. This will be the very highest-level overview of it, but essentially we have Neutron running, and we do run it with cells. That's largely based on the fact that we wrote the Quark plugin; the real drivers for Quark had a lot more to do with the fact that we provide multiple provider networks per instance. By default we give instances a public and a private internal network, and we bridge those straight to the outside, so when you get a public IP on an instance it's not going through a firewall or anything, it's bridged to the outside. To pull that off we use the Quark plugin, and I've got the link there for the code; I'll talk a little more about it on Thursday, I think. We borrowed some ideas from back when we had Quantum and Melange: we have a tenant for each cell, each cell becomes a segment from Neutron's perspective, our subnets, which for us represent an IP block, are assigned to those segments, and then Nova requests ports from both the public and the private IP blocks. We actually allocate the MACs dynamically as well, so it pulls all of those things from Neutron based on this setup.

I don't know how much time we have... one minute? One minute, yeah, so one question. Yes, sir.

[Audience question, roughly: what are some of the major issues you've faced, in general?]

So yeah, messaging was probably the most recent one; it would knock over RabbitMQ quickly. From my perspective, we try to pull code quite often, so usually it's finding the things that weren't tested at any kind of large scale and figuring out how to deal with them quickly once we've deployed. I don't know what you guys would add to that. Yeah, so we can talk more offline if you want. Cool, thanks. Thank you. Thanks, guys.