First things first, can everybody hear me? Excellent. One second, technology has let me down. Let's try that.

A few introductions: who am I? I'm James Page. I'm a technical lead in the Ubuntu Server and OpenStack team, and I've had the pleasure of that role for nearly four years now. Prior to that, I worked in UK financial services for a fairly well-known brand, driving open-source adoption and a virtualisation strategy, which was interesting and very cost efficient. In terms of Ubuntu and OpenStack, I've been involved in OpenStack on Ubuntu really since its first release into Ubuntu, which I think was the Bexar release. It seems like a long time ago now, and the project has moved a long way in that period. We've done some really interesting testing with it over that period, which is what I'm going to be talking about.

Let's start with some scale testing history. Last cycle, we partnered with AMD, and we had a number of their SeaMicro chassis, 10 in total available to us, each with 64 servers, all four-core, eight-threaded, with 32 gigabytes of RAM and SATA disks, and we tried to push an Icehouse cloud as far as we could, as quickly as we could. We found lots of bugs: in our deployment tooling and in OpenStack itself. But our headline figure six months ago was that we were able to get a 370 node cloud to 100,000 instances in just under 11 hours, which at the time was a pretty cool figure. We were very proud of that. We managed to move that on to 168,000 instances some time later, on a lot more hypervisors. The cloud by then was grinding pretty hard; we were stressing the infrastructure quite extremely. But we did flush out lots of interesting behaviours in OpenStack. For example, how the OpenStack dashboard behaves when you have that many instances in your cloud: the admin screens start collapsing because they can't get enough data back quickly enough from Nova and Neutron, and things like that. We fed most of those changes back upstream, pushed fixes in for a lot of stuff, and we've improved our own deployment tooling. For example, the 500 nodes took around about 13 hours to deploy six months ago, and we've now repeated that test and things have improved.

This cycle, HP were kind enough to give us two weeks in their Discovery Lab with some Moonshot chassis. For those of you who don't know, these are very high density, small footprint chassis. The cartridges we were working with were the m350s, which are Intel Atom based servers: all four-core, eight-threaded, 16 gigabytes of RAM with micro SSDs attached, so they're fairly good from an IO perspective, with gigabit networking. We had three of these chassis to play with, 540 servers in total available across those three chassis, so a very dense deployment. We repeated our testing: we went back and did an Icehouse test to see where we could get to with that. The first thing I will say is that that 13 hour figure had reduced to two and a half hours. The time from no cloud to a cloud you can run stuff on was just over two and a half hours, so that's a significant improvement. Some of that we can credit to SSDs, because the speed of IO is really important during install. A lot of it was improvement in our deployment tooling, making it much more efficient when you're managing this many servers. I'll talk about that tooling a little bit. Our headline figure with Icehouse was 100,000 instances in just under six hours. That was on more hypervisors, so we had a little bit more bandwidth.
Then we moved forward to the current release and looked at Juno. It turns out that it pretty much behaved the same way. After a little bit of tweaking we found some interesting scaling regressions, specifically around very high instance densities and how many RPC messages flow across the message bus, which I'll touch on in a minute. It re-proved the scalability of OpenStack during the Juno release. We saw some really good improvements, specifically in Keystone, which scales very much better on a single unit compared to Icehouse. It makes much better use of OS processes rather than eventlet threads, so that's given us some really good performance improvements there. We've seen some improvements generally as well. Our 14.04 release kernel, the 3.13 kernel we released with, has improved over six months: we've got six months of stable fixes into that kernel, and the efficiency of running KVM instances has improved a lot. The processor overhead of running instances has greatly reduced: a couple of hundred instances used to max out a node, and now, if they're not doing much, that's less than 5-10% utilisation. A really good story in terms of how things incrementally improve over the life of a long-term support release.

So what does that look like? This was the cloud we built on the HP Moonshots. The big blob in the middle is compute nodes, if you haven't guessed. We ran with eight cloud controllers running Nova services and four Neutron controllers running Neutron API services and back-end RPC. Just a single Keystone and Glance; they're not particularly heavily loaded. Once you've got Glance images out to hypervisors, for example, they're all cached, so it's very efficient, and we didn't need to scale those out and consume resources there. We actually ran in a split broker configuration with RabbitMQ, which I'll talk about in a minute, and which gave us some benefits in terms of message performance and throughput. Just a single MySQL instance; SSD really helps there, and we'll touch on that again in a minute. And we just had a single Neutron gateway node providing Layer 3 routing.

So how do you get to this level of cloud quickly, and what are the things you need to be thinking about when you're designing a large cloud? The first one of those is really messaging. OpenStack is a busy place from a messaging perspective. Creating instances: lots of messages. Periodic tasks: lots of messages. Telling all your hypervisors about the instances running on them and their port and MAC addresses: Neutron gets very busy. So you really need to think about how messaging is delivered to support OpenStack services. The same applies to the database. The backing persistence store for your cloud is critical to its operation. If it disappears, it's going to be bad news; if it's performance constrained, again, it's going to start having problems. Network is also critical. A lot of things consume a lot of bandwidth, both from a back-end administration perspective and from the instances that your tenants are going to be running in your cloud, so you need to consider that in the total picture of your deployment, along with the underlying storage: the performance of the spindles that instances are going to be residing on is critical.
If you're going to be doing Cinder block device services, you need to consider how you achieve performance in the back-ends used with Cinder to deliver service to your tenant instances and your customers. And you need to consider the OpenStack services themselves: how you scale those out, and how you provide HA, resilience and security.

So let's focus in on the first one of those topics: messaging. RabbitMQ has been the default messaging solution in Ubuntu OpenStack since we've had OpenStack in Ubuntu. It's a mature product, it's gone through several major releases since 12.04, and it deals with the load pretty well, including the very heavy load that Neutron generates to support cloud operations. For example, on the 500 node cloud we had on the Moonshot chassis, we often saw 10,000 to 15,000 messages a second running through the RabbitMQ instance that was supporting Neutron. So it's very, very message-intensive with the reference ML2 implementation for SDN. Nova's a little bit quieter: maybe an idle of 300 or 400 messages a second, picking up to maybe 1,200 to 1,500 messages a second. It's not as noisy from a messaging perspective.

We also need to consider how we provide resilience. Fortunately, RabbitMQ is built on Erlang, and Erlang has some nice native resilience features, which we can leverage via RabbitMQ itself. That allows us to mirror queues across brokers, which all communicate in a cluster. If one node dies, then another can pick up the load, all the clients should reconnect, and your cloud should continue to function. However, you're still going to be constrained by the performance of any one of those brokers, and there is an overhead to mirroring, so it's not a cost-neutral thing to move from a single node to a cluster.

What other approaches do we have to dealing with messaging? The other option is to run with split brokers. We know Neutron is particularly busy, so pushing that out onto its own broker makes quite a lot of sense. It means it's not going to be disruptive to other services like Nova, which are less intensive from a messaging perspective. But there are some limitations here. Ceilometer, the metering tool, very much relies on notifications delivered via messaging, and it likes to do that on a single message bus at the moment. So supporting Ceilometer while running split brokers is not currently possible. It would be a nice feature to see in Ceilometer; you might see that, I don't know, but it would be pretty cool.

In terms of security, RabbitMQ supports native SSL, and that's probably the best way to secure it at the moment. I know that the OpenStack project as a whole is looking at how to secure messaging, and other options are being looked at, including encrypting the messages themselves, which then makes it very transport neutral: whether it's RabbitMQ, Qpid or any other option that comes along, that works really well. RabbitMQ does scale quite nicely vertically, so adding more memory and more processors allows you to deal with more messaging and more clients. It also really benefits from SSD: it will aim to get all messages onto disk so that if there's any type of failure they're persisted, so having something underneath that really helps with the IOPS. The deployment running the split brokers is what we did our most recent scale test on, with Icehouse and Juno. This is the Juju GUI view; Juju is the service orchestration tool for Ubuntu.
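Before walking through that view, here's a minimal sketch to make the mirroring and split-broker setup just described concrete, assuming Juno-era oslo.messaging option names; the hostnames and policy name are illustrative, not from the talk:

```
# Mirror all non-amq queues across a RabbitMQ cluster
# (run against each clustered broker; policy name illustrative)
rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode":"all"}'

# Split brokers: point Nova and Neutron at different RabbitMQ hosts.
# nova.conf, on all Nova services:
[DEFAULT]
rabbit_host = rabbit-nova.internal

# neutron.conf, on all Neutron services and agents:
[DEFAULT]
rabbit_host = rabbit-neutron.internal
```

As noted above, Ceilometer can't span both buses, so this layout trades metering support for messaging headroom.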
You can see on the left and right-hand sides there, we have the Nova and Neutron dedicated message brokers, and then the stacks of OpenStack services around them. We also ran clustered cloud controllers and Neutron API servers to provide resilience in the cloud as well.

So what other options are there in addition to RabbitMQ? Some of you may have heard of ZeroMQ. ZeroMQ is a brokerless messaging solution. It provides an abstraction on top of TCP/IP sockets with a lot of semantics that are familiar from a messaging system. It's point-to-point, so if you need to broadcast a message out to a number of nodes, the client is responsible for delivering it to the endpoints. So there's a certain amount of OpenStack deployment topology discovery required: each client needs to know which endpoints need which messages. And it has some real promise. In September, we spent a week or so sprinting on this, just to see how it worked and whether it was worth progressing further. It works fairly well with Nova at the moment, but there are some inefficiencies in the driver implementation which mean it doesn't work well with Neutron: it doesn't deal with massive fanouts at all well, and there are some inefficiencies in message sending. So it's not ready for scale-out just yet, but I think it will be an interesting space to watch in the next cycle.

So, moving on to the database. Again, this holds your system's full state: really important. MySQL is the fully supported database offering in Ubuntu. The traditional approach to clustering is active/passive with a shared block device, something a lot of people are familiar with, using the Corosync/Pacemaker cluster stack on top of that to provide virtual IPs and things like that. But there are some other options: Percona and MariaDB both provide active/active MySQL options. These aren't scale-out solutions, but they do provide improved HA. A write to any node in the cluster is a write to all nodes, and it won't return until that's happened, but you are constrained by the slowest node in your cluster. So you need to make sure it's balanced; otherwise everything is gated by the slowest node you've got, whatever cluster size you work with. We generally work with clusters of three to avoid split brain: two nodes is not great without any type of arbitrator to ensure that network splits don't result in two things trying to do the same thing at the same time. There are also tuning options around MySQL as well. Concurrent connections is a really key thing that a lot of people forget. I think the default in MySQL is something ridiculous, like 150. You don't get far on that; you probably need more like 10,000 for a busy cloud with any number of Neutron and Nova services. Again, native SSL is supported by MySQL if you want to use it, but it has the overhead of SSL on each connection, so you need to consider within your total deployment whether your infrastructure can deal with that.

So there are some things within OpenStack itself that leverage messaging and the database to attempt to provide increased scale, and the one that's been in place for the longest time is Nova cells. It's been in OpenStack for a few cycles now, and I know it has some fairly large users. Basically, what it allows you to do is federate messaging and database within an OpenStack deployment. Rather than all of your hypervisors having to run off a single message broker, it allows you to have multiple message brokers supporting subsets of hypervisors within your cloud.
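Going back to the connection tuning point for a moment, here's a minimal my.cnf sketch; the buffer pool figure is an assumption for illustration, not a number from the talk:

```
# /etc/mysql/my.cnf
[mysqld]
max_connections         = 10000  # the default (~150) is exhausted quickly
                                 # by Nova/Neutron workers on a busy cloud
innodb_buffer_pool_size = 8G     # illustrative; keep the hot working set in RAM
```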
Back to cells: there's connecting tooling, part of Nova, called nova-cells, which does the message routing between API cells and compute cells, and there can be multiple tiers of that. The idea is that this allows you to limit the load on any given underlying database or message broker service. It obviously requires a bit more infrastructure potentially, but it has the potential to build larger clouds. We've done some testing around this on the Moonshots, and we had some success, but we also found some issues. It's still described upstream as experimental, and I think I'd agree with that. The other limitation is that there's no similar construct in other projects. Cinder supports cells in the sense that it knows how to deal with them when Nova is using them, but it doesn't have some of the scaling issues that Nova has. Neutron doesn't have any similar concept with the ML2 reference driver right now. If you want to do cells with Neutron, that is possible, but you're still using a single message broker to provide all your Neutron services across the cloud, so you've still got a potential choke point there.

What about the OpenStack services themselves? Most of these just scale out. A lot of them are lightweight API and RPC services, so with HAProxy on the front end providing load balancing for API requests, it's a pretty easy story to scale most of these things out. A message broker with the topic queue configuration that OpenStack sets up allows RPC workers to be distributed across nodes and scale out horizontally to deal with load coming from compute hypervisors or wherever it might be coming from. Up until Juno, the Neutron network gateway nodes that provide north-south routing of tenant instance traffic in and out of the cloud were a bit of a pinch point. You could do some horizontal scale-out, but there was limited support in terms of resilience. That's changed for Juno: Juno introduced the concept of router HA on the edge, and the distributed virtual router, which pushes east-west traffic routing down into the hypervisor layer rather than it sitting on dedicated infrastructure points. That really helps. It's fairly new; I haven't tested it at any scale yet, but I'm hoping to do that next cycle.

There are pinch points within that infrastructure: during instance creation your Nova controller and Neutron controller nodes are going to get very busy, as they deal with a lot of RPC messages for instance setup. During our testing, we ran with a concurrency of 100 instances being created at any given point in time, and on the cloud controller nodes we had in our Moonshot test, that was driving a CPU load of 60% utilisation across all eight nodes. Combine that with periodic tasks coming back from the hypervisors, including power state sync and things like that, and that easily spikes up to 100% across eight nodes on a regular basis. I think there's some opportunity to do optimisation within OpenStack itself, but it's a busy place; even a cloud not effecting change is busy from a messaging perspective. In our reference implementation architecture we use Apache for SSL termination. It's proven in terms of its scale, and we use it as a proxy to front the underlying OpenStack services. It's also well supported from a security perspective in Ubuntu, which is the other reason, so we know we're going to be up to date in terms of security patches and things like that.
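As a rough sketch of that Apache SSL-termination pattern: the vhost below terminates SSL and proxies to a local API backend. Paths, ports and service choice are illustrative, and it assumes mod_ssl and mod_proxy_http are enabled:

```
# /etc/apache2/sites-available/keystone-ssl.conf (sketch)
# a2enmod ssl proxy_http && a2ensite keystone-ssl
<VirtualHost *:5000>
    SSLEngine on
    SSLCertificateFile    /etc/apache2/ssl/keystone/cert.pem
    SSLCertificateKeyFile /etc/apache2/ssl/keystone/key.pem
    # Terminate SSL here; proxy plain HTTP to the local Keystone API
    # (backend port illustrative)
    ProxyPass        / http://localhost:4990/
    ProxyPassReverse / http://localhost:4990/
</VirtualHost>
```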
This is a graph from the HP Moonshot test which shows the load on a single Nova cloud controller instance, one of those eight, with the background of 60% utilisation while instances are being created, and then the spikes as periodic tasks run across the cloud.

Hypervisors. The default hypervisor in Ubuntu has been KVM for as long as I can remember, as long as I've been involved in Ubuntu anyway. There are other options: Xen's there if you want to use it, but it's not something we particularly test. Following on from earlier announcements today, you also have options with LXC. There's been an LXC driver via libvirt for a while now, and we've been developing a native LXC driver during this cycle, which takes the libvirt layer out for more direct control and leverages things like unprivileged containers to provide increased security.

Other things to think about at your hypervisor layer: your instances are going to require some storage. That storage is typically going to sit on block devices on the server, so having good IO performance can really help with contention. This cycle we've been looking at bcache as an SSD based caching layer on top of spindles to see how that improves performance. It looks pretty promising. We've got that in 14.10, and we're going to do further testing on it this cycle to see if it's a good way forward in terms of not having to put pure SSD in your compute nodes, which is not a great idea. You can also back off to other storage devices: Nova has a native Ceph backend as of this cycle, which looks interesting, but then you're just moving load onto your network, so there are some interesting challenges around that.

The fourth aspect I mentioned on that first slide was network. Typically I've found that gigabit is sufficient in the control plane: for messaging, database and API calls, even in a large cloud, you're not going to be stressing gigabit networking hugely, especially if you start splitting out internal networking, admin networking and public networking. You can split those out and provide different network access points into those services quite easily, and that's supported in OpenStack.
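Returning to bcache for a moment, this is roughly what the setup looks like on a compute node; device names and the mount point are illustrative, and it assumes the bcache-tools package with udev auto-registering the devices:

```
# Create a cache/backing pair in one shot; created together, the two
# devices attach automatically and expose a combined /dev/bcache0
make-bcache -C /dev/sdb -B /dev/sdc   # -C: SSD cache, -B: spinning backing disk
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /var/lib/nova/instances   # instance disks now SSD-cached
```

With that, instance storage gets SSD-like behaviour on spikes without an all-SSD compute node.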
The real pinch points are going to be on tenant instances. If you get up to any level of instance density with any significant workloads, then 10 gigabit on the back end to support tenant network traffic, both between hypervisors and to the outside world, is a pretty good idea. The same applies down at the storage layer: if you have any amount of block IO against a Ceph backend, that's going to start pushing gigabit pretty hard, and you're going to choke on the network before you can actually stress the spindles on the back end to any extent, so having 10 gigabit there is a really good idea.

The other thing we've been looking at this cycle is IPv6 support, and this isn't really IPv6 support for tenants; this is: how do I run an OpenStack cloud on IPv6? We've done quite a bit of testing around this. We found that the support in Nova and Keystone is pretty good, Neutron and Glance require some hacks right now, and Swift is not there. It's tricky; there is a hack, but it doesn't scale particularly well, and for a front end that's going to be potentially serving a lot of publicly accessible API requests, that's not a great solution right now. A lot of this is down to deficiencies in the dependencies of OpenStack, so I think we're going to try and spend some time this cycle to see if we can actually fix the underlying issues rather than working with workarounds all the time; we'll see how that works out. We've also been doing some work in this space with a company called Stam, who have quite an innovative SDN solution, all based on the Linux kernel. It's not upstream in OpenStack yet, and I know they're working towards that. We've been doing this as part of a partnership with a telco we've been working with.

The last element was Ceph storage. We talked about the 10 gigabit client access layer into a Ceph cluster being pretty important; you might also want to consider dedicated high bandwidth for the back-end cluster network. What this really saves you is that if you lose any number of Ceph nodes in your cluster, the resyncs between the remaining nodes to restore replica counts of blocks stored in your Ceph cluster are then performed over the dedicated back-end network rather than impacting client performance on the front end. It's really about risk mitigation: if you can afford to take that impact on the front end, then fine, just run single networking; if you want to offload that to a different network, you've got options there as well. In terms of the underlying storage, Ceph has supported journals on separate block devices for a while, with the intent of using SSDs in front of back-end spindles, and that works pretty well. You'll see some benefit, especially with IO spikes, but Ceph is really about the heavy lift, in my opinion: it's about dealing with thousands of instances accessing thousands of block devices all at the same time and getting good, consistent performance. So SSD journals help smooth the bumps, but they're never going to give you the extreme performance that direct-attached SSD, for example, will give you.

Okay, so how do I tell my OpenStack cloud scales to the extreme? I'm going to talk first about how you deploy a cloud to this level, and the first thing I would say is that deployment repeatability is critical if you ever want to tune a cloud to get this level of performance, so having a fast, repeatable deployment process is really, really important. We do that using a few tools that we've developed over the last few years. We do all our bare metal server provisioning with a tool called MAAS (Metal as a Service).
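Before moving on to the orchestration layer, a rough illustration of the Ceph network split and SSD journals described above; the subnets, paths and journal size are illustrative:

```
# /etc/ceph/ceph.conf (sketch)
[global]
public network  = 10.10.1.0/24   # client-facing 10GbE access layer
cluster network = 10.10.2.0/24   # replication and recovery resyncs stay here

[osd]
# SSD journal partition in front of a spinning OSD to smooth IO spikes
osd journal      = /dev/disk/by-partlabel/osd-journal-$id
osd journal size = 10240         # MB
```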
On top of MAAS, we do our service orchestration with Juju, and then we have a number of OpenStack charms, Ceph charms, MySQL charms and RabbitMQ charms, which are the encapsulation of all the devops knowledge about how to deploy those individual services and how they relate to each other. What that gets us to is a single-line deployment that deploys 500 node clouds. In this line we're using a small tool we have called juju-deployer. We tell it to bootstrap an environment and to use a configuration file called openstack.yaml, which is a YAML file containing charm definitions, configuration options, the relations between those things, numbers of units and that sort of stuff, defining our whole split-broker topology. And if we want to change something, say turn on layer 2 population and switch to VXLAN networking, we can just delta that into the YAML, tear it down and redeploy, and two and a half hours later we have a completely new cloud that we can re-benchmark and retest. You won't get it right first time, so having this is really, really important.

Verification and benchmarking, again, are really important. We've been using Tempest for a while, the upstream functional testing project, and it generally proves pretty reliable. This cycle we've been using Rally for benchmarking, a project started by our friends at Mirantis. We've found it really good for doing general benchmarking, boot-delete cycles in Nova or whatever it might be, and we used Rally to drive the 100,000 instance test as well, a kind of boot-and-forget activity. It also has some pretty good reporting, showing you data, and it provides JSON so you can chop and change it and analyse it how you like.

The last thing you really need is really good monitoring. Understanding how your cloud is behaving is virtually impossible unless you can see the stats of all the servers running within your cloud. We used Ganglia for the latest test. It's quite lightweight from a monitoring perspective, it's very asynchronous, and it allows us to look for patterns. On the right-hand side there are two graphs from the Nova cloud controller nodes, and you can see the effect that a periodic task has on networking throughput on the Nova cloud controllers. That allows us to see patterns, analyse the code, see what event in OpenStack that's happening on a regular basis is causing that, and then deep dive into actually understanding it and trying to resolve those issues. The top graph was going from 0 to 250 nodes in about two and a half hours, so I'm about right somewhere.

Okay, so that's all I've got to say. Have we got any time for questions, or do we need to move on? We can take a couple of them. Any questions?
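Going back to that single-line deployment, here's a minimal sketch of the juju-deployer invocation and a bundle fragment showing the kind of delta mentioned (L2 population, VXLAN). The bundle is heavily abbreviated, and the target name, charm URLs and exact options are illustrative:

```
# Bootstrap an environment and deploy the whole cloud from one bundle
juju-deployer -B -c openstack.yaml trusty-icehouse

# openstack.yaml (fragment)
trusty-icehouse:
  series: trusty
  services:
    nova-cloud-controller:
      charm: cs:trusty/nova-cloud-controller
      num_units: 8
    neutron-api:
      charm: cs:trusty/neutron-api
      num_units: 4
      options:
        l2-population: true
        overlay-network-type: vxlan   # the delta: switch tenant nets to VXLAN
  relations:
    - [ nova-cloud-controller, neutron-api ]
```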
No? Changed your minds, gentlemen?

I have a question regarding the two RabbitMQ brokers. How do you configure the brokers? I guess some components point to one broker and the others point to the rest?

Yeah, basically: all your Nova components are configured to point to one broker, and the Neutron components point to another.

Then how about Ceilometer?

Well, exactly, yeah. You can't configure Ceilometer in that deployment, because it relies on a single broker for notifications, so that's the limitation of running split brokers.

Alright, sorry, before you go: we wanted to finish on a high, with a real live customer story. We've been working with telecommunications providers and financial services for a little while, and we're lucky enough to have two gentlemen here from Sky. So I'd like to introduce Will and Matt from Sky to tell you some of their story with OpenStack at extreme scale.

Thanks, Mark. Good afternoon, guys. Let me do a couple of introductions first and foremost. I'm Will Westwick; I head up Enterprise Technology, infrastructure strategy and enterprise architecture, for Sky. I'm Matt Smith; I work for Will, and I'm the cloud platforms and engineering manager, doing the deployment, implementation and support of our OpenStack implementation.

So Mark asked me a couple of days ago to spend a couple of minutes talking about where Sky is as a business today, why OpenStack is important for us and what we're doing with it, and then to share literally a minute or two of some of the lessons we've learned and our experiences with Ubuntu and Canonical. First of all, for the benefit of everybody in the room: who is Sky? Sky is Britain and Ireland's leading home entertainment and communications company. We have over 10 million subscribers today, which puts us up at the top in pay TV globally. Fundamentally, what we're about is millions of families, and giving them better choice for TV. We offer choice around content, and it's not just about linear streaming; it's also about video on demand, so it's about giving people choice and improving that TV experience. We are also the fastest growing communications company in the UK: we have over 5 million customers now on broadband and telephony services, which is a real fast-growing area for us. And fundamentally, the most important thing for Sky is that content is at the heart of everything we do, so we own some fantastic rights, right across from sports, the English Premier League, Formula One, to entertainment shows such as Game of Thrones and so on.

The interesting thing for us today is that Sky's business is evolving; it's changing rapidly around us. We have a set of emerging connected products and services that are basically taking the focus away from traditional broadcast technologies and into the IT data centre, and that's forcing us to look at how we approach our data centre technologies and to think about an open cloud approach to infrastructure. With open cloud we look for flexibility, innovation and commercial sustainability, and that led us to making a big bet on OpenStack. We made that bet, and from there we hope to gain a lot of benefit from the fast pace of innovation around the OpenStack community, and also flexibility and agility to help us with application support and application deployment. So once we'd made the bet on OpenStack, we had to choose at that point: do we go straight from trunk, do we go for a distro, and if we go for a distro, which distro do we actually choose?
And that's the choice we've made; I think Matt's going to talk a little bit more about that.

So we chose Canonical, and there are several reasons for it, several major factors. One of them was that it was commercially favourable against the competitors out there. We went round to all the competitors and had lots of discussions, and Canonical sort of tipped the edge a little bit. We really admire Canonical's software product philosophy, in that you can have the same product supported, patched and bug-fixed with a subscription, and run exactly the same product, patched and bug-fixed, without a subscription, with no penalty for doing that. So we can have a non-supported development environment and a fully supported, fully patched environment, and both are exactly the same product: a key flexibility for us. We also recognised Canonical's ability to deliver. We see that Canonical has done pretty much the majority of the OpenStack implementations in the world, and some of the largest, and that gives us key leverage in using those skills to get it deployed at Sky with a minimum of fuss. The long-term support that Canonical provide for 14.04 and Icehouse, which is what we're deploying, really fits in with our timescales and the lifecycle of our OpenStack deployment. Another thing is that Ubuntu is recognised as the number one OS for the cloud, and we use it internally for our OS, so leveraging the same skills for the OpenStack deployment fits really well.

I'd just add to that: we saw that very broad deployment breadth as a significant play for helping us de-risk OpenStack. OpenStack was new; some deployments are successful, some aren't so successful. So what we wanted to do was select a distribution and a partner to help us de-risk it as much as possible, to ensure the implementation was successful. And speaking with the Canonical team, it's a good cultural fit with the guys at Sky, so we've got on really well with people and it's gelled very well: a very good partnership with the Canonical team.

In choosing an OpenStack distro, you also need to choose the deployment tooling. James described Juju and MAAS; we looked at the other tools that other distros provide, and we thought Juju and MAAS were the forerunners and would give us the quickest way to deploy. We've had some good experience with Juju and MAAS and with the team. We ran into a bug with Juju and reported it to Canonical; they put it on launchpad.net, and we watched it being fixed around the world: some chap in Australia worked on it for eight hours, then it went on to a chap in Chile, and so it went round the world. The bug was fixed in a point release, and two weeks later we'd deployed it and upgraded, and everything was fixed. That type of flexibility from a technology partner is essential for us today. As we start to build very strategic products around over-the-top video, connected set-top boxes and so on, we need to make sure we're working with technology partners that can respond almost in real time; two weeks is what we're looking at: identify a problem, fix it and drop it in. With the same tools, we removed all HA capability out of OpenStack and redeployed it in a few hours with the guys from Canonical, with no downtime at all. Really, really good stuff. One thing I'd like to mention is thanks to the team that have been helping us: thanks very much, you've done a top job.

Any questions? We've got a few minutes for questions, please.

First: we've got a gen 1 deployment node count of 100, on Canonical's reference architecture.
Just on the networking, we're, I wouldn't say deviating, but we're making some additions. On the node count piece: four availability zones spread across two data centres, basically two regions if you like. Our approach to OpenStack is incremental, generation by generation, so Matt refers to our gen 1 drop, the first footprint if you like. The first configuration is generation 1: 100 nodes spread across those four AZs. What we'll then start to do is build that out and extend it as part of a generation 2 and a generation 3 and so on. So we made sure we had a reference architecture that could expand, but also capture changes in OpenStack as each new release drops and as the technology and the platform emerge.

Any other questions? At the back there, the green hat.

We did some testing like James has done. We're using GRE, and the maximum instances per node, on 128GB, was something like 135 instances, with no errors, so it's been a good experience. One of the biggest challenges at the stage we're at now is actually working with the application teams around application migration onto OpenStack. The design, the architecture and the implementation we've actually found, with Canonical's help, quite easy; that's not the hard bit. The hard bit is getting your applications to adapt, and the architecture of those applications to adapt, so that you can get them onto a cloud platform, be optimised for cloud and get the advantages from it. That's the challenge we have today.

Alrighty, well, I think that's all we have time for. Thank you so much to Will and Matt for giving us a little insight.