Right. Thank you very much for joining us today. I'm really honored to be here sharing a little bit of our experience in running an OpenStack public cloud with you. This presentation is about all the things I wish someone had told me, and told us, before we started running OpenStack, trying to condense our four years of experience with OpenStack for you. My name is Bruno. I work for a company called Catalyst IT. Catalyst has its headquarters in Wellington, New Zealand, and we also have offices in Australia and in the UK. About three and a half to four years ago we started a private cloud implementation based on OpenStack, which later on became the Catalyst Cloud, the first public cloud based in New Zealand. I feel extremely privileged to work for Catalyst and to work on the Catalyst Cloud. When you see that map the OpenStack Foundation often puts up showing all the OpenStack public clouds, we are that little dot there in the corner. The interesting thing about working in New Zealand is that the scale, the proportion of the country, is different from other countries: it exposes you to the whole problem space. I was there at the beginning when we had the idea of doing an OpenStack private cloud, and I've gone through the design, the implementation, running it in production, and all the different aspects of operating an OpenStack cloud. I'm sure that if I was working back in Brazil, where I'm originally from, or for other companies, I wouldn't have been exposed to that much. I would like to start with a disclaimer, and it's a very important one: I'm not here to sell you a product. At Catalyst everything we do is based on open source software and open source technologies, but we don't sell products; we provide services to customers. I'm not here to sell you an OpenStack distribution, an OpenStack appliance, or hardware that integrates with OpenStack.
I can freely and openly share this information with you, with no conflicts of interest, purely in the interest of potentially helping you in your journey with OpenStack. The first question people often ask me when planning an OpenStack cloud is: how much will it cost? Let's get that out of the way first. I would suggest that a production OpenStack cloud will cost you something around $150,000 one-off for your production hardware and for your pre-production environment. And by the way, you need one. From the beginning, if you're doing OpenStack in production, you need a pre-production environment that resembles your production and allows you to rehearse whatever you're about to do in production, ensuring that you're not introducing regressions when doing a major OpenStack upgrade, for example. So do budget for a pre-production environment from day one; it doesn't need to be brand new hardware, but it needs to be there. The second thing is you'll probably need two to three people to run OpenStack month to month, and in the beginning, if you're not keen to invest there, it's probably worth having a service provider manage it remotely for you. There are many companies that will do this for you. Catalyst is one of them, but I've met other OpenStack service providers at the summits, and I can tell you that on average they will charge you something like six to ten thousand dollars per month to begin with, and then as your cloud grows, as you get more regions and more nodes, they will grow with you. But there is definitely a point where having your own people managing your cloud will be better than having a service provider doing it for you. Just consider that in the beginning.
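To make that arithmetic concrete, here is a minimal first-year budget sketch using the ballpark figures above ($150k one-off for hardware including pre-production, and a managed-service fee of roughly $6k to $10k per month). All of these numbers are illustrative assumptions from the talk, not quotes:

```python
# Rough first-year outlay for a small managed OpenStack cloud.
# Inputs are illustrative ballpark figures, not vendor quotes:
# ~$150k one-off hardware (production + pre-production),
# ~$6k-$10k/month for remote managed services.

def first_year_cost(hardware_once=150_000, managed_per_month=8_000, months=12):
    """One-off hardware cost plus the managed-service fee over the year."""
    return hardware_once + managed_per_month * months

print(first_year_cost())                          # midpoint fee: 246000
print(first_year_cost(managed_per_month=6_000))   # low end:      222000
print(first_year_cost(managed_per_month=10_000))  # high end:     270000
```

The spread between the low and high end is about $50k a year, which is roughly where the "hire your own people versus pay a provider" crossover conversation starts.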
Now, when it comes to selecting your hardware: if you're getting network gear for an OpenStack cloud, you probably want to look at top-of-rack switches, so probably two 10 Gig switches per rack. If you're starting with more than a single rack, then you also want your spine layer; you probably want two 40 Gig switches for your spine. You will have a management network, so a 1 Gig switch for that, and don't forget a switch for the pre-prod cluster; I'll keep coming back to that every now and then. The good news is that not all of these switches are required on day one. You may not need your spine layer or your management switch on day one, because you can overload your 10 Gig top-of-rack switches and put some management ports on them, and then as you grow you take those management ports out and offload them to a dedicated management switch, and later on expand to a spine layer as well to grow beyond a rack. But what you need on day one is a plan: your network topology, your design, so that when you get to the point where you're growing beyond a single rack, you know how you're going to take that next step and it's not a surprise. You don't necessarily buy them up front, but make sure you have a plan for what you'll do when you need the spine layer. When it comes to features, the ones you probably want to look at on your switches are VLAN or VXLAN (probably VXLAN nowadays), MLAG, and it's very interesting to have some form of layer 3 routing with BGP and ECMP. It's alright if you like Cisco, Juniper or Arista, no problem at all, use your preferred switches. But the reality is you don't need those switches anymore.
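One back-of-envelope check worth including in that day-one plan is the leaf-spine oversubscription ratio. This is a minimal sketch with assumed, illustrative port counts (48 server-facing 10G ports per top-of-rack switch, two 40G uplinks into the spine), not a description of any particular product:

```python
# Leaf-spine oversubscription check for the day-one growth plan.
# Port counts are illustrative assumptions: 48x10G server-facing ports
# per top-of-rack switch and 2x40G uplinks into the spine layer.

def oversubscription(server_ports=48, port_gbps=10, uplinks=2, uplink_gbps=40):
    """Ratio of server-facing bandwidth to spine-facing bandwidth per leaf."""
    return (server_ports * port_gbps) / (uplinks * uplink_gbps)

print(oversubscription())           # 480G down / 80G up  -> 6.0 (6:1)
print(oversubscription(uplinks=4))  # doubling the uplinks -> 3.0 (3:1)
```

A ratio like 6:1 may be perfectly fine for general workloads; the point is simply to know the number before you outgrow the rack, not after.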
There are plenty of decent generic switches, white-box hardware, that you can get out there that use exactly the same Broadcom chipset you will find in those brand-name switches, and there are very good open source switch operating systems available nowadays, like Cumulus Linux, and those solutions play really well with OpenStack. So if you're starting now, consider using open source switch operating systems and open hardware from the beginning; it's definitely possible to do that on the network side. What I would suggest is that if you are, for some reason, using Cisco or Juniper, avoid using the vendor-specific Neutron driver for that switch. It's tempting, and in some cases you may want or need to use that driver, but the reality is that an abstraction layer, a virtual switch like Open vSwitch, can do wonders for you, and it gives you the abstraction that allows you to change hardware vendors for your network fabric when you need to. I'll bet that in two or three years' time you'll be doing something different in your network layer than you're doing now. We've gone through that and keep iterating on the network very often, so having that abstraction layer, which allows you to buy the switches that make sense for you at a given point in time, both from a commercial and a technical point of view, is a very interesting advantage when running a public cloud. Now, when it comes to server specifications: by the way, I'm splitting the OpenStack private or public cloud deployment here into four node categories, and I'll explain more about that later. The idea of a hyperconverged infrastructure is great, but I'll tell you some of the pros and cons of doing that later during the presentation.
But assuming you have four different node types: in the beginning you will have a control plane with at least three controller nodes, and they will have something like 12 to 16 cores and 64 to 128 gigs of RAM, but I guess the most important thing on your control plane is that you probably want something like 400 gigabytes of SSD storage in there. That's because there's a lot going on on your control plane: you have your database, your MariaDB or MySQL, you have RabbitMQ, you probably have a MongoDB backing your Ceilometer. So there is a lot of data there, a lot of transactions, and some of those queries are not optimal, as you know; having SSDs on your control plane will make a massive difference to the performance of the APIs. Then for your compute nodes, you could easily start with something like a 12-core node, going all the way up to, nowadays, a one-unit server with 44 cores. On the Catalyst public cloud we currently have nodes with 44 cores and 768 gigabytes of RAM. The more CPU and RAM you can put in a compute node, the better it will be from a financial point of view when you're trying to find your price per vCPU and your price per gigabyte of RAM. But what you need to be aware of there is your failure domain. If your nodes are too dense, then when one of them fails it impacts a lot of customers; a lot of compute instances, a lot of virtual machines, are affected, and right at the beginning you may not have that many nodes to absorb those failures. So what you may want to do is start your cloud with smaller compute nodes and more of them, maybe six, maybe ten if you can, and then as you grow and reach a certain scale, increase the density of your compute nodes to something like that.
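To put that failure-domain trade-off in numbers, here is a minimal sketch. The 2-vCPU flavor size and the 4:1 CPU overcommit ratio are illustrative assumptions for the example, not our actual flavors or ratios:

```python
# How many instances are hit when a single compute node fails?
# Assumes 2-vCPU instances and a 4:1 CPU overcommit ratio (both illustrative).

def vms_affected(node_cores, vcpus_per_vm=2, overcommit=4.0):
    """Rough count of instances lost when one compute node goes down."""
    return int(node_cores * overcommit // vcpus_per_vm)

print(vms_affected(12))  # a small early-days node: ~24 instances affected
print(vms_affected(44))  # a dense 44-core node:   ~88 instances affected
```

Same style of hardware, very different blast radius, which is why starting with more, smaller nodes is attractive until you have enough spare capacity to absorb a dense node failing.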
When it comes to block storage, I'm assuming you're using something like Ceph, an open source distributed storage solution. If you're not using Ceph, I would strongly ask you to reconsider and have a look at it; it's really awesome. One of the biggest things we've learned running Ceph is that you can get one PCIe SSD card with something like 400 gigabytes and use it as the journal for all the OSDs in that server, assuming something like 12 disks per storage node for block storage. That works really well for performance, but also for the price point per gigabyte you will achieve. It's definitely possible to skip the PCIe SSD and go for, you know, four or five disks per SSD journal drive, but the PCIe card ended up working much better for us. When it comes to object storage, I would strongly suggest looking at dense servers. You may have seen the backup guys Backblaze and their chassis with, I think, 60 3.5-inch drives in it. There are other vendors manufacturing servers inspired by their design: Supermicro has an interesting storage server, I think Dell has also released something with 60 drives in it, and you can definitely buy the original 60-drive design from Backblaze; a company called 45 Drives, I think, will sell you the chassis. The interesting thing is, as soon as you have sixty 4 to 6 terabyte hard drives in an object storage node, and you have a few object storage nodes, you reach a price point that is actually cheaper than doing object storage with Amazon, with Azure, with Google Cloud. So density on the object storage side actually matters, and it's how you drive your prices down. Erasure coding is awesome, but what I'm saying here is that even with three full replicas of your data, you can achieve a price point cheaper than Amazon S3 if you have this kind of node density. When it comes to the hardware we are using: in one of our data centers we are using Open Compute hardware; in another data center we can't accommodate the density of Open Compute hardware, so we are using generic Intel servers, nothing special there. We don't buy Dell servers, we don't buy IBM, sorry; at the end of the day the components inside all these servers look the same, so use whatever you're happy with, whatever works for you. Now I would like to address this question. I hear a lot of presentations saying yes, OpenStack can pretty much work with all the hypervisors out there. That's true: OpenStack does work with most hypervisors in the industry, and there are successful deployments with VMware, with Hyper-V and so on. But what we have found is that KVM is by far the most widely adopted and best supported hypervisor in the ecosystem right now. If you go to the community and say, hey, I'm working on this bug, on this issue, on a KVM hypervisor, you get much more support and engagement from the community than you would with other hypervisors. The other thing is, if you look at the support matrix (I think Nova has a wiki page with the matrix showing the features supported by each hypervisor), you will see that KVM and Xen provide the most features compared to other hypervisors. And finally, if you're looking at this from a financial point of view, that's where the numbers stack up. I've done the business model, the financial model, for a cloud running with VMware and with KVM and Xen, and using open source is where the numbers make sense. So do consider that, unless you have a very special deal with someone like VMware or Microsoft. Now let's go back to the topic of node segmentation. One of the reasons you want to do node segmentation, if you are a public cloud provider, is financial, and that's because you end up specializing your nodes so that you achieve an optimal price point for each one of the services you're providing. For object storage, like I said, you're probably trying to jam as many disks as you can into a node; again, there is the size of the failure domain to be careful about, but you do want a high number of disks per node. Whereas for block storage, it's in your interest to optimize for performance; you're probably chasing something like "all my I/O operations will complete under 30 milliseconds, or 10 milliseconds", so the types of disks you would use, the architecture you would use, are likely different. Whereas for compute, you may want a lot of GPUs, depending on your workload; if you are doing research, you may want a lot of GPUs inside your compute nodes. Trying to get all those things into a hyperconverged infrastructure, where you have just one node type, may not be possible. It may be really hard, and especially when it comes to achieving price points that are competitive with global cloud providers, I would say it's probably impossible in a hyperconverged infrastructure. Now, the second part of this presentation is about techniques to drive quality and service levels up, and I'm going to talk again about node segmentation, but this time for a different reason. When it comes to service levels, we have found some potential issues with hyperconvergence. Our first cloud region was actually running compute, block storage and object storage all on the same node; we had a compute-object-block node. And what we found is that we started running into bleeding-edge bugs in the Linux kernel, for example. The more compute instances you put on those servers, the more RAM you are using, and you get to very high memory utilization on those nodes, and then we ended up hitting kernel bugs related to the proc filesystem that only happened in that circumstance of very high memory utilization with Ceph doing something at the same time. I'm talking about the kind of incident and issue that required us to interact with the Linux kernel maintainers, document the bug, understand it, work with the community on a patch and fix, and roll that out to our production. So it does require a significant level of engineering to run that hyperconverged infrastructure, and what we have found is that as soon as we started splitting our services into discrete layers (here's our compute, our object storage, our block storage, our network nodes), our service levels went up. Now let's talk about useful techniques to drive your service levels up. The first one already makes an assumption: that you're deploying your OpenStack cloud using a configuration management system like Puppet, Ansible, Salt, Chef, whatever you prefer. Doing OpenStack in production without such a configuration management system would be very hard. So from the beginning your OpenStack cloud is software, and you're driving it using a configuration management system. One of the first things I would do when deploying a production OpenStack cloud nowadays is to create a test environment that runs inside my cloud, but in the beginning also on someone else's cloud, in case mine is not there when I need it. That test environment will resemble what I have in production: the same number of nodes, a network topology as close as possible to it, so that I can run my automated tests on that infrastructure. And what we've done is integrate that into our CI: every time we change a line of code in OpenStack or in our configuration management system, if we change a simple line in Puppet saying this configuration changes from A to B, that triggers a CI job that spins up a cloud inside our cloud, runs our automated tests, and makes sure we haven't introduced any regressions. Now, an interesting trick we found along the way: we wanted to test our cloud as users would actually use it, and what we found is that the Tempest scenario tests behave and do a lot of operations through the APIs just like your cloud customers would, be they internal or external customers. If you haven't seen the Tempest scenario tests, I'll give you one example. One of the tests will launch a compute instance, attach some block storage to it, write some data to that block storage, detach the volume from the instance, attach it to another instance, and check that the data is there and is what was written. So: a lot of tests and a lot of operations that your customers would do when consuming these services from you. What is interesting about Tempest is that you can use it as a gate in your CI, so every time you change a line of configuration or a line of code, the Tempest scenario tests will run, exercise that functionality, and say: okay, from an end user point of view you haven't introduced any regressions, as far as my test coverage is concerned. But the other interesting thing we've done is that we actually run the Tempest scenario tests every hour in our production cloud, as a form of monitoring, complementing of course the individual service and component checks from Nagios, Icinga and the other services we use for monitoring. What is interesting here is that we have Tempest doing, every hour, every operation our customers are expected to do with our services, and telling us: yes, it's working as expected; people can create a new network, they can attach a compute instance to that network, and so on. The next one I have mentioned already: have a decent pre-production environment. Even though you have your test environment in your cloud or someone else's cloud, you will get to a point where you want to rehearse an operation you're about to do in production in an environment that resembles it for real, with real switches and the real network configuration you have there, and in the context of upgrading OpenStack to the next major release this is really important. The fourth one is: think about communication channels with customers and prepare communication tools ahead of time. What we have found is that when you have an incident in a cloud environment, let's say a compute node went down and a number of compute instances were affected, you want to contact the customers that were affected as quickly as possible and let them know: hey, I know your compute instance is down, we are working on it, it's being restarted on another server that is healthy, we are on top of it, you don't need to worry. Doing this in a cloud environment actually requires some tools that we didn't have before. So what we ended up doing was creating tools that we can point at a compute node or a network node that has failed and say: tell me all the compute instances that were here and the tenants that own those instances, and by the way, go to our CRM and fetch the contacts for those customers, so I can email them or give them a call and say we are on top of this. In terms of providing good service levels, we found this was fundamental. It also allows you to talk to the people that were actually affected and actually required that communication, instead of broadcasting the failure to everyone. We did that in the past, and what happens is you get a lot of calls from people asking: okay, have I been affected, is there something wrong with my servers? No, actually, you haven't. So these tools are very useful to plan in advance. By the way, if you are doing configuration management and you have the ability to say, hey, I've racked a server, this is a compute node, install Nova and all the components I need on a compute node, why not tweak your configuration management manifests or playbooks, depending on the technology you are using, and introduce the monitoring right there, so that every time you rack a new compute node or a new storage node, your monitoring systems are
immediately aware that there is a new node on the network to monitor, and you start monitoring it straight away. Now, in-place upgrades. I have a friend called Sergey, and I meet him at LCA pretty much every year. Last year Sergey was telling me: Bruno, have you ever upgraded the Catalyst Cloud in production, has the cloud team at Catalyst ever upgraded the Catalyst Cloud in production? I said yeah, and he said no, that's not possible, I hear that OpenStack upgrades are impossible, I don't believe you. But you can do rolling upgrades. For some time this was a real problem; in the beginning it was really hard to do major upgrades of OpenStack, but nowadays most services have a backward compatible API that supports the previous version of the API, and what that allows you to do is upgrade one service at a time. You would typically start with your Keystone, then go to something like Nova, and service by service you upgrade. Just check, when you're deploying a newer service, that it has that capability; some of the newer services under the big tent don't necessarily have it, but Nova, Swift, the usual suspects, are all capable of it. The important thing is: test every change you're about to make in CI with your automated tests. The better your automated tests are, the less likely you are to introduce a regression to production, and that means rehearsing every movement in pre-production. We have found it extremely beneficial to work on improving live migration and making sure live migration was as bulletproof as we could make it. In Mitaka there were many features introduced that make live migration much better. The thing is, as cloud operators you still want to have a life; you still want to work as much as possible during business hours, not at 3 a.m. or funny hours during the night. So having the ability to migrate customers away from a hypervisor, to do maintenance or to do an upgrade of Nova, is really useful. And not only at that level: what we've done was to develop, and also improve, some existing community scripts (I'm pretty sure most of that is contributed back upstream) so that we can now migrate routers from one network node to another with minimum downtime, and we can also take network nodes down for maintenance without impacting the network for our customers. Having those tools prepared in advance will put you in a good place to do in-place upgrades. Now, the third part of this presentation is about common deployment mistakes, and I would like to start with the number one mistake, which is GUI-driven OpenStack. I see a lot of people deploying OpenStack and expecting there to be a graphical user interface where I push a button and off you go, you have OpenStack. And it's true, those things exist; look at Mirantis Fuel, for example, a great interface for deploying OpenStack. The point is, the problem is not the deployment itself; getting OpenStack up and running is nowadays relatively straightforward. The thing is, once OpenStack is running, it is a complex distributed system with lots of moving parts that someone needs to understand, and that someone needs to be capable of going down to the network level and inspecting things, you know, running a tcpdump inside a kernel namespace to find out what's happening with a specific router. If you don't have engineers capable of going down to that level, it will be very hard for you to run OpenStack in production with good service levels. I have seen great tools for deploying OpenStack with a nice Juju-like interface, drag and drop and you have OpenStack, but the point is: can you run and maintain it after you have deployed it? And as soon as you get to the point where you have that level of engineering capability, you probably don't care about the GUI to deploy OpenStack anymore, because you would know how to run your own Puppet manifests, Chef or Ansible and get things going. So please don't think that just because you could get OpenStack running with DevStack or something like Fuel, you're ready to go to production. That is a big mistake. And if you take away one thing from this presentation, I would like it to be this: don't carry your own patches unless you have to, unless you must. As a rule of thumb, I would suggest that you never run code in production that hasn't been merged upstream. Every time you develop a patch or a customization to OpenStack that is not committed upstream, you're creating a recurring overhead on the team with every release of OpenStack, and that happens every six months. So every six months your team will need to look at that patch, that little change you've made, and make sure it works with the new version and its changes, unless you're doing something funky like following master and continuously checking your patch against the latest changes in master. We've seen big companies mess this one up. As an example, I know that HP Enterprise was stuck for a long time, I think on the Diablo version of OpenStack, with a lot of patches they had developed themselves, and it took them a lot of time, effort and money to get out of that situation and back to, you know, vanilla OpenStack. So I would say don't do it unless it's absolutely necessary. On the other hand, you need to be prepared to fix bugs and introduce new features upstream. If you are running OpenStack in production, you will probably stumble upon bugs that affect you, and you will need some engineering capability to fix them upstream, or a relationship with a service provider that is capable of patching them upstream for you. And once it's patched upstream, that's fine: backport your patches and apply them to your cloud. There were, and there are, a few situations where we had to carry our own patches on the Catalyst Cloud; we choose very carefully when to use that bullet. The next mistake: the cloud is not a hypervisor. Long story short, if you're looking at OpenStack because you're looking for a VMware replacement, you're looking at the wrong thing. If you just need a replacement for the VMware ESXi hypervisor, look at KVM, look at Xen, and look at the management tools around KVM and Xen. OpenStack is a cloud operating system; it touches pretty much every part of your data center to run your infrastructure. It can do much more for you than what VMware does, but at the same time it's more complex to implement and to operate in production. So: not a VMware hypervisor replacement. The next one: Keystone is not an IdP. In the beginning it's probably alright for you to use Keystone as your back end for identity information, your user credentials, but I would suggest that from the beginning you plan what your IdP will be in the future. You may have OpenLDAP, Active Directory or a SAML-based IdP as your preferred technology there; think about how people will create, terminate and reset accounts, reset their passwords, and how that information will flow to Keystone. The last one I want to address is the idea that all projects are production ready: a project exists, therefore I can run it in production. Well, that's not necessarily the case. There are some projects that have been around for a while, some of them in development for the last two years, and we have decided not to implement them on the Catalyst Cloud yet because we don't consider them production ready. So how would I identify a project that is ready?
First of all, I would suggest you understand your requirements well: what you expect as functional requirements, but more importantly your non-functional requirements. Then validate them in real life: deploy the service, even if it's just in DevStack, and test your requirements against it. The ones people often miss are high availability (can I actually deploy this with high availability?), upgrade procedures (once I deploy this, will I be able to upgrade it in production and keep rolling to the next version, and how easy will that be?) and security standards. Don't take for granted that just because a project exists, it's secure. I won't mention any names, because I don't want to offend any of the projects here, but we've found one specific project in OpenStack, which we are aware a lot of people are using in production, where we have already raised a security issue. That specific security issue hasn't been fixed yet, and yet there are people running it in production, and I would consider that security issue very dangerous. So I would even consider doing a code inspection yourself if you can, or at least looking at the architecture of the system and how it works; sometimes you will find some obvious issues there. If you want to know more about this one, talk to me later; I'm more than happy to share it with you personally. I would like to conclude this presentation with the question: do the numbers stack up? Could an OpenStack private cloud be cheaper than something like Amazon AWS? What I would like to show you here are numbers I got yesterday from the AWS calculator. This is how much it costs to run an m4.large compute instance in the Sydney region; m4.large is 2 vCPUs and 8 GB of RAM. And then you say, okay Bruno, Sydney is not the cheapest AWS region, in the United States hardware is cheaper, so what about the US price? So that's the same compute instance in the United States. And then the next question will be: okay, but that's not the cheapest price I can get out of AWS; I could pay for a reserved instance, maybe reserve my instance for three years and pay everything upfront, and that's when I get my highest discount on AWS for a compute instance. So how about that price? And that is the lowest price you would get right now from AWS for an m4.large compute instance. And this is actually how much that same compute instance would cost on an OpenStack private cloud of a reasonable size. If you want to understand exactly how I arrived at that number, there is a presentation tomorrow called "Can OpenStack beat Amazon AWS in price?". It is at 11.15 or 11.25 tomorrow on this floor; I just don't remember the number of the room. Go and check it out, because in that presentation, myself and another guy called Bruno are showing you the actual total cost of ownership model behind this number and how the prices compare between OpenStack and Amazon AWS for compute, block storage, object storage and network. I am providing numbers for an OpenStack private cloud implementation, and Bruno is providing numbers for an OpenStack public cloud implementation based on their public cloud. So that's it from me today. If you have questions, I'm more than happy. Thank you. Just remember to use the microphone, please. No questions? Thank you very much for your time.