Hi, we're here today from Sky to do a presentation on OpenStack in the enterprise. We'll do some introductions, a bit of background about Sky and the multiple businesses it has, what our objective was in delivering cloud, and why we needed it. Being an enterprise, we needed to integrate into the usual enterprise ecosystem, so we'll talk about that for a short while. Then my colleague will talk about upgrading OpenStack, we'll talk about running OpenStack as a BAU support operation, and then some of our multi-tenancy users and their issues.

So, I'm Matt Smith and this is Alan Chevita. We both work at Sky in the OpenStack team, and for the past two years we've architected, delivered, grown and supported the OpenStack platform. Sky is a company of 30,000 employees with something like 22 million subscribers. Our primary business is satellite TV, but we've got a lot of businesses, and I'll talk through a few of them. We're also a telco and ISP, with Sky Talk delivering telephony to our customers' homes, and broadband with DSL and fibre to customers' homes. We do online TV, with Sky Go and NOW TV, delivering both video on demand and live TV to customers' homes over the internet. Sky Media is advertising sales for Sky channels and other terrestrial channels.

So what's the problem we're trying to solve? We're a technology company, and we have around 3,000 developers developing software for our set-top boxes, broadcast processing systems, advertising sales, online TV and the online shop — it's technology that sits between the business and the customers. So we've got lots of development teams and lots of applications, and we need somewhere to host those applications. We needed a single delivery mechanism: instead of having cloud islands delivering different applications on different clouds and different sets of infrastructure, we wanted a single API and a single roadmap for our cloud, and it had to deliver software-defined networking and software-defined storage in a full software-defined data centre that was multi-tenant — and the only product we could find that did that, and it's still the case, is OpenStack. We looked at lots of enterprise vendor products, but you'd have to stitch them together and create a sort of Franken-cloud, whereas OpenStack has one roadmap going forward and it's all fully integrated from the ground up, delivering everything software-defined in the data centre. One of the key drivers is cost — we're always told to keep our costs down — and we compared OpenStack against the public cloud providers and against the enterprise software that was available, and I'm glad to say that OpenStack beat all the others by a really wide margin.

So what have we delivered in our data centres? We've got many data centres, two primary ones, and we've delivered two OpenStack regions with multiple availability zones in each region. There are eighty-plus OpenStack tenants on the platform, with around 400 users creating instances, networks, load balancers, ports and storage on a self-service basis. They deliver their infrastructure themselves, deploy their applications onto it themselves, and they use several different processes for this — one is Heat, which they're quite expert at using, and also Ansible. From a data centre footprint perspective, it's something like 7,000 cores and 400 terabytes of storage.
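To give a flavour of that self-service workflow, here's a minimal sketch of a Heat stack a tenant might launch. The template version is the Juno-era one, and the image, flavor and network names are hypothetical placeholders, not our real resources.

```bash
# Minimal self-service Heat example; "ubuntu-14.04", "m1.small" and
# "tenant-net" are hypothetical names, not Sky's actual resources.
cat > single-server.yaml <<'EOF'
heat_template_version: 2014-10-16
resources:
  server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-14.04
      flavor: m1.small
      networks:
        - network: tenant-net
EOF
# Kilo-era clients spell this "heat stack-create -f single-server.yaml my-app";
# the unified-client equivalent is:
openstack stack create -t single-server.yaml my-app
```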
We initially deployed Icehouse a couple of years ago and went through an upgrade to Kilo earlier in the year — I'll talk about that and the processes we went through to get it done. From an end-user perspective, they see our OpenStack services: Nova, Cinder, Glance, Keystone. Cinder presents through Ceph storage, and we also run the RADOS Gateway for object storage. We have Neutron, with both provider-type VLAN and provider-type overlay networks, and we also have Heat and Ceilometer, so users can use Heat and combine it with Ceilometer to do autoscaling.

To go through a couple of the applications hosted on the platform: we have customer-facing applications like Sky Tickets, where Sky sells event and music tickets to the public. We have the Sky Q box in the customer's home — we push software to the Sky Q box and we also retrieve customer journey information back into our OpenStack cloud, process it, and make the UI journey better for the customers by analysing that data. We also have business-to-business applications: there's a VOD portal through which we transfer video assets between Sky UK, Sky Italia and Sky Deutschland, and we also transfer video assets in and out of studios like Sony Pictures and Paramount. And we've got some high-profile applications, such as a CEO dashboard, which is a collation of information and data from across Sky — sales going through the call centre, people watching Sky Go and NOW TV, people watching Premiership football, how many people are connected to broadband — a full single pane of glass for the chief executive officer and his executive team, who use it to gauge how the business is running. So we're hosting every sort of application: customer-facing, executive-office-facing, customer homes and business-to-business.

When we started with OpenStack we could have just put it in a corner of the data centre and said, "you're on your own", but the users who deploy their applications expect a certain level of service and integration into the enterprise ecosystem. So we started integrating. We integrated Keystone into Active Directory, so the end user has a single-sign-on-type experience: they log into their desktop, and they can log into OpenStack and use the CLI with the same credentials. We integrated OpenStack into ServiceNow, which we use for incident management, change management and the CMDB, so any instance that gets created is populated into our ServiceNow CMDB, and that links into incident management — when an application is running on our cloud, the user can be confident that when they deploy their monitoring, it's connected through the CMDB into incident management and provides the right call-out mechanism. It's also integrated into Opsview and OpenView — all the OpenStack services connect in there, and that feeds back into incident management and the enterprise call-out system. Also for capacity forecasting: we've had quite a lot of growth on the platform, and we use BMC's Capacity Optimizer in the enterprise. That's connected to our Ceilometer data — it drags data out of Ceilometer and we can forecast server replenishment and growth going forward. And from an end-user perspective, we integrated F5 as a load-balancer service provider. We've also got HAProxy, but F5 gives an enterprise-grade, fully resilient load balancer that our end users can use.
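To give an idea of what the Keystone-to-Active-Directory piece looks like, here's a hedged sketch of the LDAP identity backend configuration. Every hostname, DN and attribute here is a placeholder, and the exact driver string varies by release.

```bash
# Hypothetical keystone.conf fragment for AD-backed identity (Kilo era).
cat >> /etc/keystone/keystone.conf <<'EOF'
[identity]
driver = ldap                 # older releases want the full class path

[ldap]
url                 = ldaps://ad.example.com
user                = CN=svc-keystone,OU=Services,DC=example,DC=com
password            = REDACTED
suffix              = DC=example,DC=com
user_tree_dn        = OU=Users,DC=example,DC=com
user_objectclass    = person
user_id_attribute   = sAMAccountName
user_name_attribute = sAMAccountName
EOF
service keystone restart      # Ubuntu 14.04 / upstart
```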
So along with the enterprise ecosystem, we use the change management system in ServiceNow for keeping all our users informed of changes to the system — any upgrades we're doing, any improvements and work happening on the platform. And Alan's going to talk about how we went about upgrading our OpenStack platform.

Hello everyone. As Matt mentioned, we started our implementation of OpenStack with Icehouse. At the beginning, as a deployment system, we relied on Juju, which is a Canonical product, mainly because our infrastructure is based on Ubuntu, but at some point we decided to turn Juju off, mainly because the product was doing some things under the hood, and normally we like to know what is going on. Some things happened — for example, the Nova controller node ended up on a different version from the compute nodes — so we had some issues with that and we decided to turn it off. But Juju is not only used for deployments: it's also used for config changes — you can't make your config changes manually, because Juju will overwrite your modifications — and it's also used to deploy new nodes and to perform upgrades. So because we turned Juju off, we had to find a new way to deploy and manage our infrastructure. At the beginning we started with simple bash scripts pointing to an HTTP server from which we pulled down config files and made our modifications, but we ended up using Ansible, which I think suits our needs perfectly.

This will not be a technical guide on how to perform the upgrade — obviously the time is too short — rather just some suggestions, based on our experience, that might help you solve or prevent some of the issues we had during our upgrade path. The upgrade path itself is quite simple: you modify some config files; if you're using Ubuntu like us, it's an apt-get dist-upgrade; and then you perform a DB schema upgrade on the OpenStack components (a sketch of this follows below). Obviously, using the right options and the right config files can be really challenging, mainly because, as far as I know, there's no official or unofficial documentation on how to perform upgrades. This was the biggest issue we found: if you ever try to perform an upgrade, the documentation is really poor, approaching zero in terms of quantity.

So where to start? Well, there are some places you can look. The first is the installation guide. The installation guide provides some really good examples, so you can compare the examples provided with your own config files and try to find differences and additions — it's quite a good place to start. Another great place is the OpenStack configuration reference manual. Look especially for deprecated and new options: this happens quite often between releases, and sometimes it can be quite a nightmare to find which is the right one. For example, some options have been moved from the [DEFAULT] section to other, specific sections — this happened in the transition from Icehouse to Kilo for the RabbitMQ and Keystone options.
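In practice, the "modify configs, apt-get, DB schema upgrade" dance Alan describes looks roughly like this per service — a sketch using Keystone as the example, with a database dump taken first as the rollback point.

```bash
# One service's upgrade, in the shape described above (Ubuntu).
mysqldump keystone > keystone-pre-upgrade.sql   # rollback point first
service keystone stop
# drop in the new, release-appropriate config files here
apt-get update && apt-get dist-upgrade          # pulls the new packages
keystone-manage db_sync                         # the DB schema upgrade
service keystone start
# then the same pattern with glance-manage db_sync, cinder-manage db sync,
# nova-manage db sync, neutron-db-manage upgrade, and so on.
```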
Just a few words on Keystone. I think with Keystone we made one big mistake, which was using LDAP to authenticate our service accounts — so our default domain driver is LDAP. This was fine in Icehouse, where the domain concept was still not so mature, but with Kilo we noticed some big problems, mainly because in Keystone, for each domain, you can use LDAP as your driver or you can use SQL. There is a big restriction if you try to use the SQL driver and you have multiple domains: Keystone currently allows only one SQL driver to be loaded at a time, and if you try to load more than one, Keystone will raise an exception and it will fail. As far as I know this limitation may be lifted in a future version of Keystone — it's still present in Mitaka, and if I'm not wrong it's still present in Newton as well, so probably in Ocata they will make some changes. So the suggestion is to start your implementation using SQL: if you plan to use multiple domains, use SQL for your default domain, with your service accounts in SQL, and enable LDAP authentication for all the other domains, for your tenants — there's a sketch of this layout below.

Another great place — probably the best place to look if you're performing an upgrade — is the release notes section. For each OpenStack service there's always an upgrade notes section on the release page, with some very useful information on how to perform the upgrade. We found a really good one for Nova: as far as I know, this was the only place where it was mentioned that, when you do the upgrade from Icehouse to Kilo, you should run a sort of background migration of flavor metadata from an old location to a new one. In Kilo this is done automatically, on the fly, by the nova-conductor service, but you must perform this operation immediately after running Kilo, because in Liberty the old location for the metadata will be dropped — and the only place we were able to find this information was in the upgrade notes. It's a rather CPU-intensive process, so if you have a huge number of instances, be aware that you can put quite a CPU load on your system, but the command is quite simple and has an option to limit the number of objects you're transferring, so it's quite handy.
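That flavor-metadata migration is driven by nova-manage; a sketch of the invocation follows, with the batch size chosen arbitrarily here to keep the CPU load manageable on a big cloud (the flag spelling follows the Kilo release notes).

```bash
# Repeat until it reports zero instances left to migrate; --max-number
# bounds each batch (the value 100 is an arbitrary example).
nova-manage db migrate_flavor_data --max-number 100
```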
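And going back to the Keystone advice a moment ago, here's a minimal sketch of that layout — SQL for the default domain holding the service accounts, LDAP per tenant domain. The domain name and LDAP details are hypothetical.

```bash
# Default domain stays SQL; per-domain files switch tenants to LDAP.
cat >> /etc/keystone/keystone.conf <<'EOF'
[identity]
driver = sql
domain_specific_drivers_enabled = true
domain_config_dir = /etc/keystone/domains
EOF
mkdir -p /etc/keystone/domains
cat > /etc/keystone/domains/keystone.TenantsAD.conf <<'EOF'
[identity]
driver = ldap
[ldap]
url    = ldaps://ad.example.com   # hypothetical AD endpoint
suffix = DC=example,DC=com
EOF
```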
For some projects — Designate, for example — I think I spent one month, maybe two, trying to find out how to configure it properly, especially if you have two regions; we're currently using Designate in production. We found out how to configure it by looking at the Gerrit review page, in some comments, probably from the main Designate developer, I think. Sometimes you can find useful information by looking at Launchpad or Git: the majority of the contributors on Launchpad may be the developers themselves — core reviewers, or people deeply involved in the project — so they know what's going on, and you can find useful information there. On Launchpad, just to mention, we had an issue at the beginning doing the Nova DB schema upgrade, and on Launchpad we were able to find a fix. And obviously Google can be your friend — sometimes even if you don't want it to be — as a sort of last resort.

Talking about services, I think it's widely agreed that you should start with Keystone and do some tests before moving on; if all is okay, move on to Glance, Cinder and all the other satellite products — that could be Heat, Ceilometer — and leave Nova and Neutron as the last services. This is especially important when you're moving from Liberty to Mitaka, because there's a sort of bug whereby the Nova compute nodes must be upgraded before the Neutron nodes, to prevent the logs filling up with warnings and errors about a sort of missing VIF interface — and this is not information I was able to find in the upgrade notes section.

Obviously, try to automate as much as possible, to avoid user errors and to minimise the downtime. Especially if you have a huge number of compute nodes, try to parallelise — and if you're also running GRE overlay networks, this can be really helpful. It's also very important to have a development environment that matches your production environment exactly. That's not always possible — there can be data centre restrictions, policy restrictions, or budget restrictions as well — but it matters, because a multi-region environment with different availability zones sometimes requires extra options that weren't required in a single-region environment. This happened to us, because our dev environment doesn't exactly match our production, mainly because of some data centre restrictions: our dev environment lives in only one region. During the upgrade we had two main unexpected problems. One was with Glance and Cinder, and we were able to find the solution in the config reference manual, because a new option had been added — it was simply region_name, and you have to specify the region for the Glance entries in the service config. We had another issue with Horizon: you could log into the main page, but you couldn't select the region. We fixed this problem by looking directly at the Python code — you know, we'd just finished the upgrade, so there was a bit of pressure for everything to work properly — we looked at the Python code and we found how to fix it. So another suggestion: if you've tried all the other things and you don't know what to do, and you have the skills, or you're confident with the Python language, try looking at the Python code. Sometimes it can be really useful.

And last but not least — this may actually be the first thing you should have — have a strong rollback and backup plan. I didn't mention it before, but in our infrastructure our services run on LXC containers: each service has three LXC containers on three different compute nodes, and on top of that we have Corosync, Pacemaker and HAProxy. So for us it's very easy to perform backups — we just use the cp command: we shut down the container and take a copy of it to another location, preferably on another server. And in the same way it's very easy for us to roll back: simply shut down one of the containers, delete it, and copy the backup over. It's quite easy — there's a sketch below.

Some notes on the Ceph upgrade. The Ceph upgrade was really straightforward — we're now running the Jewel version, and it was just an apt-get upgrade. Mainly, if you're moving to Jewel, be aware that you have to change the file ownership on the OSDs from root to the ceph user, which is introduced in this new version, and that can be a long process. Then check the compatibility between the Ceph cluster and the Ceph clients running on the Cinder, Glance and compute nodes, because we encountered some issues running an older version of the Ceph cluster with a newer version of the Ceph clients — they were not compatible.
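For the Jewel ownership change just mentioned, the shape of the operation is roughly this (upstart job names as shipped on pre-systemd Ubuntu); on big OSDs the chown is the slow part.

```bash
# Jewel runs daemons as the new "ceph" user, so the data must change hands.
stop ceph-osd-all                                # upstart, per OSD host
chown -R ceph:ceph /var/lib/ceph /var/log/ceph   # can take a long time
start ceph-osd-all
```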
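And the cp-based container backup and rollback we rely on is as simple as it sounds — a sketch with a hypothetical container name and backup path.

```bash
# Cold backup of one control-plane container before an upgrade.
lxc-stop -n glance-api-1
cp -a /var/lib/lxc/glance-api-1 /backup/lxc/glance-api-1.pre-upgrade
lxc-start -n glance-api-1 -d
# Rollback is the mirror image: stop the broken container, delete it,
# and cp -a the saved copy back into /var/lib/lxc.
```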
To conclude, some notes on the Mitaka upgrade. The upgrade from Icehouse to Kilo was a sort of one-step upgrade: we were able to move directly from Icehouse to Kilo without needing to go through Juno. This is not possible, as far as I know, for the Kilo-to-Mitaka upgrade, because there are some bugs. So the first step is to upgrade your Kilo environment fully to the latest Kilo version, then move to Liberty — all your nodes, compute nodes and all your services must run Liberty code — and then you can move on to Mitaka.

The upgrade was quite smooth. The software itself, I have to say, is really solid — rock solid, no unpredictable behaviours. Obviously there are some bugs, which is normal. The first version of Mitaka was quite bad in terms of the upgrade path, especially Keystone: we found some issues with the DB schema migration — some of the tables were missing, especially if you had been running a long-lived Kilo version, which is quite funny — but it's now fixed in the latest Mitaka release. We had some minor issues with Glance and Cinder; they are, again, fixed. There's only one big issue that is still ongoing, which is a QEMU/libvirt bug. I'm not sure if it's present on distributions other than Ubuntu, but it's still there on Ubuntu. Basically it's due to a bad AppArmor profile: if you try to live-migrate an instance, libvirt and QEMU won't be able to access some portions of memory correctly, and the process will fail. The fix is a sort of temporary fix, and it's quite easy — you need to change some config files, one line in AppArmor — but the problem is that you then have to shut down your instances and power them on again; a simple restart is not enough, which is mean. Because AppArmor takes care of the whole running process, in some way you need to recreate the QEMU process from scratch.
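The exact AppArmor line is environment-specific, so it isn't reproduced here, but the painful part — the full power cycle of every instance so QEMU is recreated under the corrected profile — looks something like this sketch.

```bash
# A soft reboot keeps the same QEMU process, so a stop/start is required.
for id in $(openstack server list --all-projects -f value -c ID); do
    openstack server stop "$id"
    # in a real run, wait for the instance to reach SHUTOFF here
    openstack server start "$id"
done
```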
Okay, that's all from me. Right — so, like I said, we've been running this for a couple of years now and we've picked up quite a lot of skills in running OpenStack. We have a support contract with Canonical, and we triage the problems that come through: we solve probably about 80 percent of the problems we encounter ourselves — configuration files, or the simpler Linux-type things. The other 20 percent of the stuff we've come across over the past couple of years — the more low-level libvirt and kernel problems — we've handed off to Canonical, and over the years we think Canonical have done a great job. And reiterating what Alan said about the OpenStack software: we haven't really had any bugs; it's more configuration issues that we've run into, and we've solved those ourselves, whereas the issues we have had are generally more to do with Linux or KVM or network drivers, etc. I was going to talk through some of the issues that we've had.

Yeah — as I said, the software is solid, so we didn't have any major issues with OpenStack itself, but many of the issues we had caused some sort of performance problem, and most of the performance issues were related to poor disk performance on the underlying host — the host hosting our containers — especially with MySQL. So be aware that bad MySQL performance can affect many services, firstly nova-conductor. At some point our logs started filling up with many, many warnings and errors from nova-conductor, and we couldn't figure out what was going wrong with Nova, but after some debugging we found that MySQL was performing really badly. Because, as I said, we're using LXC containers, we simply moved our MySQL cluster to a more performant compute host, and the issue was gone. It's sometimes really hard to debug performance on OpenStack components, because if one service is doing badly, it can affect many other services which, if you look at them, aren't strictly related to each other.

At some point, someone decided to rename flavors. Don't do that unless strictly necessary, especially if you're running Kilo — we found a sort of bug there. There's no "nova flavor rename" command, but you can do it through Horizon, and basically what Horizon does is delete the flavor and create a new one with the same characteristics but a different name. The problem is that there's then a mismatch between the flavor ID that Nova sees and the flavor ID that's actually in the database, and if you try to resize or live-migrate an instance after renaming a flavor, you'll see error logs like "could not find flavor with ID 67". The problem is fixed now — Canonical did a great job; the problem doesn't exist in Liberty, only in Kilo, so they basically backported the Liberty fix patch to Kilo and provided us with the solution. I don't know if it's publicly available yet, but be aware that you can run into this problem if you try to rename a flavor.

We had another two live-migration issues, in Kilo and in Mitaka. In Icehouse, let's say, live migration was working fine, but as soon as we moved to Kilo we started to see constant failures on live migrations — absolutely unpredictable, quite random; constant and random. We isolated the problem, and we found that we had to disable the tunnelling of the live migration: there's a flag you can remove from live_migration_flag inside the nova.conf file — there's a sketch below — and without the tunnelling, live migration was fine. But this caused another issue, which was bad — critical: we noticed that under heavy disk I/O we had constant data corruption on instances being transferred, and it didn't matter which operating system it was, so it was something inside QEMU. I'm quite sure this problem doesn't exist any more in Liberty; at the moment Canonical are still looking at the differences between the QEMU package of the Kilo version and Liberty, trying to figure out what's causing the problem.

Then we had a fun issue with load balancers. We noticed that some customers had issues if they were using a load balancer — with the HAProxy driver, not the F5 driver; this happened only with HAProxy. They noticed they weren't able to reach more complicated, let's say, web pages: if a web page was quite simple there was no issue, but if the page was a little more complicated, with more code, the page would just time out. After some debugging, we found that the issue was the MTU value of the tap interface inside the load balancer namespace, and the fix is quite easy — you just need to lower the value from 1500 to 1400 (sketched below). But this is not a permanent fix, because as soon as you reboot your Neutron node the value will revert back to 1500. We know that, so we also created some alarms and checks: we're constantly checking the MTU value of the interfaces in our load balancer namespaces, and we try to prevent this situation.

Another issue we had recently was with nf_conntrack. We noticed that under heavy network usage we had consistent packet loss on some of the compute nodes, and the problem was the nf_conntrack table — basically it was our mistake, because we had used too small a value. The fix was quite easy, just an on-the-fly fix: you can modify the nf_conntrack table max value, and we did that on the affected compute nodes and it solved the issue.
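The conntrack fix is a one-liner; the ceiling below is an example figure, not the exact value we settled on.

```bash
# See how close the table is to overflowing, then raise the ceiling.
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
sysctl -w net.netfilter.nf_conntrack_max=524288
echo 'net.netfilter.nf_conntrack_max = 524288' >> /etc/sysctl.conf  # persist
```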
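Going back to the live-migration tunnelling fix: the sketch below shows the Kilo-era nova.conf flag with VIR_MIGRATE_TUNNELLED removed, which is what disabling the tunnelling amounts to.

```bash
# On each compute node: live_migration_flag without VIR_MIGRATE_TUNNELLED.
cat >> /etc/nova/nova.conf <<'EOF'
[libvirt]
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE
EOF
service nova-compute restart
```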
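And the LBaaS MTU workaround is equally small; the namespace and device names below are placeholders you'd read off your own network node.

```bash
# Find the HAProxy load balancer namespace, then lower its tap MTU.
ip netns list | grep qlbaas
ip netns exec qlbaas-<pool-uuid> ip link            # identify the tap device
ip netns exec qlbaas-<pool-uuid> ip link set tap-<id> mtu 1400
# Reverts to 1500 on a network-node reboot, hence our monitoring checks.
```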
Also very important for us — very critical — was the Neutron service restart order. If you have a big number of networks and routers on your Neutron node, then even when the L3 agent is reported as started, the router namespace creation is not complete yet, especially with that big a number. And because the restart sequence is L3 agent, then DHCP agent, then the load balancer agent, the DHCP agent was failing to create its interfaces; so if you then tried to create an instance, your instance wouldn't get an IP address, because the DHCP agent wasn't ready. The fix is quite easy and simple: we just put a sleep command in the upstart scripts — sketched below — so we have the L3 agent starting after 60 seconds, the DHCP agents after 70 and 80 seconds, and the load balancer agent last, after 90 seconds.

So, going back to our users. It's a multi-tenant environment, with around 400 users creating instances and networks and so on, all day, every day, and we have your novice user and your expert user and everyone in between. The first one we get a lot of is "I can't ping my instance", and this is the novice thing, where they haven't configured their security group or they haven't given their instance a floating IP. At Sky we use Slack quite extensively for collaboration, and we find it's really useful: the more experienced users will help the less experienced users along in resolving these simpler issues, and that relieves us of quite a lot of work in training people up. He doesn't know it, but we call this "instant support using Slack", because people write in and we immediately have to respond.

Yeah — and we created some problems ourselves. When we deployed Icehouse originally, the version of Open vSwitch we used wouldn't support provider-type VLAN and provider-type overlay on the same compute node, so we had to have separate availability zones for each of these Neutron network types. People would create their private network, create an instance, put it in the wrong AZ, and get a "No valid host was found" error; then they'd ping everyone on Slack saying this isn't very good, and someone would report back, "oh, you've put it in the wrong AZ". We've resolved that: since upgrading to Kilo we've got a new version of Open vSwitch that supports both network types on compute nodes, and we've collapsed those availability zones down. We still have an issue with AZ confusion among users with Cinder: they'll create a boot volume in AZ1 and try to boot it in AZ2 and get the same problem, and again it's a sort of user education process that Slack is really good at resolving for us — that's ongoing. Same thing again — education — for using Glance and deleting images when you've still got instances running on that image: they try to resize, and it comes back and says it failed. User education. And one of the main things we find is that the majority of the users on the platform aren't very network-centric, or don't know how networks work, so when they create their own networks and routers and DHCP agents and address space, they don't always get it right, and it's a sort of education process to train people up to know how networks work and how to create their own networks.
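The staggered start was done with upstart override files; here's a sketch for the DHCP agent, with the delays as quoted above — treat the exact stanza as an assumption about our setup rather than a recipe.

```bash
# Delay the DHCP agent so the L3 agent (60s) has finished building its
# router namespaces; the LBaaS agent gets 90s in the same fashion.
cat > /etc/init/neutron-dhcp-agent.override <<'EOF'
pre-start script
    sleep 70
end script
EOF
```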
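For the recurring "I can't ping my instance" question, the usual answer is the pair of fixes below — open up the security group and attach a floating IP. "public" and "my-instance" are example names, and the commands are shown with the unified CLI.

```bash
# Allow ICMP and SSH into the project's default security group...
openstack security group rule create --protocol icmp default
openstack security group rule create --protocol tcp --dst-port 22 default
# ...and give the instance a floating IP it can be reached on.
FIP=$(openstack floating ip create public -f value -c floating_ip_address)
openstack server add floating ip my-instance "$FIP"
```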
A common one across the whole of OpenStack over the past couple of years is confusion with the command-line interface. Originally we had all the separate CLIs for Nova, Neutron, Cinder, etc., and then the combined OpenStack CLI came along, and it didn't have everything in it, and our users would go, "you know, this isn't very good — it works here, and it doesn't work there". We mitigated this a little bit by creating a public image in our Glance repository that had everything bundled into it, and the users could just copy their RC file across onto an instance and use all the CLIs, all combined.

So — I've said it many times — the OpenStack software is really good: it never seems to go wrong; it just works. The majority of the issues we receive from our users are "I've deployed my database and my app and my web tier and it doesn't work, so it's your platform that's wrong", and we sort of have to take on the DevOps role and go in and work out why they can't connect to their database in a different data centre, or why the replication doesn't work, or why the web service can't contact the app service. It's generally a problem with the end user not being able to deploy their application quickly. That's about it, isn't it? Yeah. Cool. Okay, if you have any questions, I think we've got a few minutes. Possibly one... one, two... okay, a lot.

[Asked about the size of the team] Four — it's four of us. Well, four plus me, yeah, guiding them in the wrong direction. Everything: everything from deployment to administration to upgrades to new nodes — support, people. And because it's really hard to find, to be honest — I mean, it takes a long time to skill someone up in using all the components of OpenStack, so it's sometimes quite hard to find the right people. And we are the original team that built the platform, so we architected it, we support it, we expand it — we do everything. Yeah.

[Asked about customisation] Okay, there's a good point about that I didn't mention, which is that we haven't modified the OpenStack code. We've just put scripts or programs around it that do the integration; we don't want to modify the OpenStack code, because we want to be able to upgrade without any pain. So we've just done this integration process — we did it a few years ago with VMware products, doing exactly the same thing, and doing it with OpenStack was frankly a lot simpler. It didn't take long — well, we didn't do it all in one day.

[Asked about the Canonical support] The Canonical support is general — it's for the platform — and they also provide support for the Ubuntu instances that users create. But we see that 80 percent of the time it's faster and easier for us to fix problems ourselves — not because we're geniuses, but, you know, sometimes the error logs are quite clear about where the issue is, and especially now, with two years of experience, we have a really good case history of issues, and sometimes we know, "okay, this could be this problem — look here, or look here". Otherwise we do use Canonical — and, you know, if you've got a few hundred nodes and they want an sosreport from every one of those compute nodes, that takes a long time, right? Yeah — for things like QEMU or libvirt, which need really deep knowledge of the kernel and of the package, that's not in our skill set. But the OpenStack software is open, right? So it does make it a lot easier: we log a call with Canonical and they give us hints about where to look, and it isn't like a black box — you can look inside it and see what's going on.
That makes it a lot easier, even when working with Canonical to resolve problems, rather than just sending off an sosreport to your enterprise vendor and having them come back saying "oh, it's this bug, and it will be fixed in six months' time". We can see what's actually going on, so it does make things a lot easier to resolve. Sometimes it's not easy — as I said before, to configure Designate I really spent one or two months just trying to find out how one option works — but during those two months, obviously, I learned something.

[Asked about backups] You know, backing up MySQL and everything — we have a backup of all our configuration files; we keep all the config in git. And we do a backup of our MySQL databases into object storage. That's all you need. We don't keep, like, six years of it — we keep ten days of MySQL — because, you know, it's a cp command and a curl command to push it up to object storage, but all the config is held in git.

No, we can't — okay. So the strategy behind it, going back a couple of years: we looked at different products, right — things like CloudStack and CloudPlatform from Citrix, and OpenStack, and VMware, and Eucalyptus, RightScale, all these things — and I think, going back a few years, OpenStack was the clear winner, but there were still lots of horses in the race. And we felt that having lots of other people using OpenStack — Rackspace, and lots of other smaller companies popping up doing public clouds using the OpenStack API — made it an obvious strategy to stick with, because we can deploy our applications using the same Heat templates to either our OpenStack cloud or to Rackspace's OpenStack cloud without having to recode anything. So we've gone down the OpenStack route to do things. Does that make sense? So if we want to scale out, we'd use the same Heat template to deploy into Rackspace or into another OpenStack cloud. Okay, that's all. Okay, cool — thank you very much. Cheers.