Hi, thank you for attending our 5:30 presentation today. I realize it's late, but be patient and then you can go have drinks outside at the happy hour. Today's presentation is about tuning OpenStack for availability and performance in large production environments. My name is Gabriel, I work for Symantec, and I'm part of the cloud platform engineering team. We're building a consolidated cloud platform that provides platform and infrastructure services for the next generation of Symantec products. A little bit about my background: I have more than 10 years of experience in large scale production environments, I've been doing OpenStack since probably 2011, so about three years, and at Symantec I'm part of a great team and really excited about what we're doing.

This is our agenda. We don't have time to cover all the topics we wanted to, but here are the ones we will cover. We'll start by talking about what large scale means for us and what high availability means for us. We'll talk a little bit about infrastructure lifecycle, how we deploy and provision boxes, about configuration management, and a bit about orchestration. The next topic is how we've integrated Keystone and the rest of the services with our enterprise directory, which is LDAP. Then Keystone, securing Keystone, and PKI tokens. My colleague Raj will come on later to talk about Nova, some of the KVM tunings, the database tunings, and lastly the RabbitMQ cluster. So let's get started.

First of all, what does a large scale production environment mean? What do we consider to be large scale? It has to span multiple data centers. It will have thousands of hypervisors, tens of thousands of VMs, and millions of requests per minute hitting your API endpoints. Obviously these numbers will vary based on the size of your hypervisors and your VMs: if your VMs are very large, your VM count will be lower, and the same goes for hypervisors. A hypervisor with 12 cores and 64 GB is very different from one with 40 cores and 2 TB of memory. So these numbers will vary.

What about availability? We're talking about the control plane here, and we're aiming for four nines of availability for it, which we think we can achieve. The way we've done it: first, our control plane is virtualized and distributed across failure zones, so nothing runs in a single failure domain. We use hardware load balancers, in pairs, in front of all our services. The next rule is about separating your network. I don't know how many people here have had issues with spanning tree; I've had some as well, and I remember those nights. Spanning tree is fine, but the problems that come with it are really ugly. So the rule is: no spanning tree across availability zones, and each of our availability zones is a single L3 domain. Finally, go with redundant power, redundant network connectivity, and pretty much everything redundant, as much as the budget allows.
To give an example, here is how one of our hypervisors is configured. As you can see, we have multiple redundant 10 GbE NICs. 10 GbE today should be pretty much on board on any enterprise class server, the cost per port has come down and it's probably not going back up, so you should have redundancy where you can. We run the NICs active/active with LACP, we run trunk ports, and we send multiple VLANs to the hypervisors. Again, these VLANs do not span between racks; they're completely isolated to the rack. Also very important: keep your management interface completely separate, so if something goes wrong you can always get in and manage the system.

The next thing is the infrastructure lifecycle: how we provision boxes and how we manage these systems. At this time we use Foreman to provision the boxes, and at the end of provisioning we classify the systems into different classes. That classification is then used by Puppet; for example, if it's a compute node or a storage node, Puppet applies a different set of modules. About Puppet: at this time we've switched to a masterless Puppet setup. We used Puppet with regular puppet masters in the past, and although we had some scalability issues with them, that wasn't the reason we switched; there were many reasons, but mostly it gives us more flexibility. Our Puppet code is in Git, we have a very easy way to pull the modules and apply them, and we can do so on a very large number of systems at the same time. Previously, running a Puppet job on, say, 2,000 boxes at once was a problem; the puppet masters couldn't handle it. Without puppet masters that's not a problem anymore. For orchestration we use Salt, and Fabric as well.
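As a rough illustration of what a masterless run can look like (a sketch only, not our exact setup: the repo URL, paths, and the role fact are hypothetical):

    # pull the Puppet code straight from Git on each node
    git clone https://git.example.com/openstack-puppet.git /etc/puppet/production
    cd /etc/puppet/production && git pull

    # apply the role the provisioning step classified this box as
    FACTER_role=compute puppet apply \
        --modulepath=/etc/puppet/production/modules \
        /etc/puppet/production/manifests/site.pp

    # fan the same thing out to thousands of nodes with Salt
    salt 'compute*' cmd.run 'cd /etc/puppet/production && git pull && puppet apply --modulepath=modules manifests/site.pp'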
A little bit about enterprise directories and OpenStack. You build a cloud and you want your users to start using it. Who are these users? In a large production environment your users are already somewhere: either in some sort of Active Directory or in an LDAP directory. In our case we had both, so the question is which one to use. There's nothing wrong with Active Directory; it's just that in most enterprises it's owned by IT, and our IT department is great and does a really good job, but their priorities are sometimes slightly different from ours. Our job is to keep the site up; their job is to keep our computers, our email, and everything else working. So when you have a problem you might have to file a ticket, and you don't want to do that. LDAP was there; LDAP is a standard that has been around for 20-plus years, and it's free. LDAP is perfect because it has built-in replication and synchronization of data, and it's the perfect directory for read-only queries, which is exactly what Keystone does. That's why we went with LDAP. And because you control the infrastructure, you can design it to fit your needs for load and performance as well.

At this time we use LDAP only for identity. We kept assignment in SQL, in MySQL: things like tenants, projects, and domains are still in MySQL (there's a config sketch of this further down). Treat LDAP as a read-only directory. People sometimes make it read-write and think they should create users from Keystone; don't do that. Leave that problem to something like an identity management system; pretty much everybody already has a process for getting users onboarded, so don't get into that. Use it read-only.

The next thing is securing the connection between Keystone and LDAP. Make sure you're using LDAPS. I've sometimes seen people use plain LDAP on port 389 and assume TLS will be negotiated, but the truth is TLS can't always be enforced; if you use LDAPS you know your credentials are sent encrypted. Also, work with your LDAP administrator: create a dedicated user for Keystone to bind to LDAP and set the right permissions so that user can only read what it needs. That gives you a lot of leverage when you have problems with the LDAP directory, because your administrator will be able to pinpoint them.

This diagram shows what I said before, the end-to-end encryption of your credentials. My colleague had a presentation on Keystone yesterday and brought some of these things up already, but make sure all the links in the path of the credentials are addressed. How do people talk to Keystone? One way is through a GUI, through Horizon for example; another is through the Keystone client or direct API calls with curl. So first of all you want your Keystone endpoint to be secure, so use HTTPS there, and HTTPS on your Horizon interface as well, maybe with a redirect rule so people actually use it. We've already talked about the connection between Keystone and LDAP, which is also encrypted. By doing this you're sure the credentials are protected at every hop.

This next picture is a more detailed diagram of how security on Keystone has been implemented. As you see, we use load balancers, and we've split Keystone and Keystone admin into two VIPs, because we wanted to take the ports out of the URL to make it easier for everybody: they just use HTTPS, keystone, and the rest of the domain, and that's the endpoint, no ports. So there's port translation happening on the load balancer, and on the Keystone servers themselves we use self-signed certificates, versus trusted signed certificates on the user-facing side. Somebody asked yesterday why we do this encryption here since the data is already encrypted, and the reason is compliance: if information passes from one security zone to another through different zones, it needs to be encrypted. It was not a big deal. We use Apache with mod_wsgi and mod_ssl and self-signed certificates, then client-SSL and server-SSL profiles on the load balancer. As for the discussion of which certificates you should use, signed or self-signed: remember, it's always about user experience. You want your users to feel no difference between using your cloud and using some other public cloud. I haven't seen anybody complaining about a certificate mismatch warning when they go to use a public cloud.
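To make the identity-in-LDAP, assignment-in-SQL split concrete, here is roughly what it looks like in keystone.conf for the Havana/Icehouse era (a sketch only: DNs, hostname, and the bind password are placeholders):

    [identity]
    driver = keystone.identity.backends.ldap.Identity

    [assignment]
    # tenants/projects, roles and domains stay in MySQL
    driver = keystone.assignment.backends.sql.Assignment

    [ldap]
    # LDAPS so credentials are always encrypted on the wire
    url = ldaps://ldap.example.com:636
    # dedicated read-only bind user, agreed with the LDAP administrator
    user = cn=keystone-bind,ou=ServiceAccounts,dc=example,dc=com
    password = <bind password>
    user_tree_dn = ou=People,dc=example,dc=com
    # treat the directory as read-only: never create/update/delete users from Keystone
    user_allow_create = false
    user_allow_update = false
    user_allow_delete = false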
That's never going to happen there, so why should it happen for yours? So: customer-facing VIPs always use trusted CAs.

Next, Keystone and PKI tokens. I was curious how many people here have been using or are using PKI tokens? Okay, not very many. So the question is, why should I use PKI tokens? Why not just stick with UUID tokens? They work, there's nothing wrong with them. The problem with UUID tokens is that when somebody authenticates against Keystone they get a token, that token gets sent to another component, and that component has to validate it. To do that, it contacts Keystone, Keystone validates the token, and then it replies with an OK plus more information. Every time that happens, Keystone is being queried, and in a really large environment Keystone suffers because of it.

So you switch to PKI. With PKI, Keystone gives you the token in an encoded format: it essentially takes the JSON document with all that information, encodes it, and signs it. The client pulls the certificate that was used to sign the token and can validate the token without contacting Keystone again. That's very important, because it takes that whole load off Keystone; it never has to talk to Keystone again. There's a misconception here: "I switched to PKI tokens, my tokens are very secure, nobody can see inside them." The tokens are not encrypted; they're only signed and encoded. Anybody can take a token, base64-decode it, and look at it. So remember: tokens are signed, not encrypted. There's a certificate expiration, the client can validate the expiration and the signature, and there were some problems with the revocation list before, but they should be fixed by now.

It's not all good with PKI tokens, otherwise I guess more people would be using them. For us, the problems have mostly been related to the Keystone catalog size. For one reason or another our master catalog grew bigger than it should have, and when people request a scoped token, that catalog gets encoded into the token. The token then gets passed in the header of the HTTP request, and a lot of components have issues with large header sizes. So one option you need to make sure is looked at is max_header_line; I believe it has been bumped from the default 8K to 16K in some components, but to be safe we bumped it to 32K. Newer mod_wsgi also has a specific option to deal with this. And, very important, talk to your users and have them use the nocatalog option when they request a token, if they don't need the catalog in it. Often they don't need it and just don't know the option exists.
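A hedged sketch of the knobs just mentioned (exact option names and defaults vary by release, so treat these as illustrative):

    # keystone.conf -- turn on PKI tokens
    # (on Havana this was [signing] token_format = PKI instead)
    [token]
    provider = keystone.token.providers.pki.Provider

    # nova.conf (and other API services that support it) -- allow bigger HTTP
    # headers so a scoped PKI token carrying a large catalog still fits
    max_header_line = 32768

    # clients that don't need the catalog can ask Keystone to leave it out,
    # e.g. with the v3 API:  POST /v3/auth/tokens?nocatalog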
At this point I'd ask you to keep your questions for the Q&A at the end, and I'll hand over to my colleague Raj to talk more about Nova and the rest. Thanks.

Thanks, Gabe. Hi, my name is Raj. I work in Cloud Platform Engineering at Symantec. My background is in operations and infrastructure, and I've been working with OpenStack for the past two and a half to three years. Before I start, one important question: how many people here are running the Nova API behind Apache? Okay, not bad. One of the crucial things we saw, especially related to performance: there's nothing wrong with the Nova API running on the built-in Python eventlet server. But when you get lots of requests, and especially with what my colleague mentioned about PKI tokens and the growing header sizes, you're definitely going to hit performance limits, and the usual reaction is "this is a performance issue, let me scale out the workers." You can keep scaling them, but you're not going to get better throughput out of the API services. Apache has proven itself serving web traffic and handles this better. The other reason we wanted Apache is that we want every layer of OpenStack encrypted, with SSL on the back end too. Today we don't use that SSL for Nova, and if you look at our config we have it commented out, but we're working on it. By putting the Nova API behind Apache, the difference we saw is that what took eight workers with eventlet is served by three Apache processes, so you see a significant difference. These are the config files we use today: one is the Apache configuration that defines the virtual host and enables the ports, the other is the mod_wsgi script that invokes the Nova API service. Feel free to try them in your labs, and you'll definitely see a significant difference.
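Since the slide isn't reproduced here, this is a rough sketch of what the Apache side can look like (paths, user names, and the port are placeholders, and the back-end SSL lines are commented out just as ours are today; the WSGI script itself can be as simple as loading the osapi_compute pipeline from api-paste.ini with paste.deploy):

    # /etc/httpd/conf.d/nova-osapi.conf (hypothetical layout)
    Listen 8774
    <VirtualHost *:8774>
        WSGIDaemonProcess nova-api user=nova group=nova processes=3 threads=10
        WSGIProcessGroup  nova-api
        WSGIScriptAlias   / /var/www/cgi-bin/nova/osapi_compute

        # back-end SSL: planned, not enabled yet
        # SSLEngine on
        # SSLCertificateFile    /etc/pki/tls/certs/nova-api.crt
        # SSLCertificateKeyFile /etc/pki/tls/private/nova-api.key

        ErrorLog  /var/log/httpd/nova_api_error.log
        CustomLog /var/log/httpd/nova_api_access.log combined
    </VirtualHost>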
I'm not going to spend much time on the Nova scheduler, because we use our own custom scheduler, and I'd guess a lot of people also use custom scheduling for their own workloads. So I'll skip to nova-conductor. How many people don't use nova-conductor? I think everyone uses it, and it works for you. One important thing about nova-conductor is that it was added fairly recently, in Grizzly, and it's a good idea: it was added to bring security, so that nova-compute doesn't talk to the database directly. That's the main reason it was introduced. That said, nova-compute's database exposure isn't fully solved, because if you're using the Nova metadata API there you still need MySQL access, and if you're using volumes with Cinder it still needs the database. So it isn't solving everyone's problem yet, but it's a good idea to start using it. However, we saw performance issues after bringing nova-conductor into the picture. People say you can mitigate that by scaling it out, but then I'm scaling out yet another service, on top of scaling my database, for something that isn't giving me an advantage today. Maybe it will tomorrow; maybe I should think about it then. So to mitigate that and get better performance, we actually disabled nova-conductor and went back to the direct flow. If you look at the flow, nova-conductor talks to RabbitMQ and then to the database, so you have one more hop and you can run into serious bottlenecks there. Disabling it is not the default behavior anymore; you have to put a configuration option in nova.conf on the computes, use_local = true, and I'm pretty sure you'll see a significant difference in performance because you've removed a whole layer. For people who don't have MySQL configured on the compute nodes: before you use this option, make sure the MySQL options are set there, the user, password, and database information, so the compute node doesn't error out. This is one of the important things we wanted to share. We're also actively working with the community on the best way to scale this out, and maybe a better approach to securing database access.

The next important topic, since this is about performance tuning for OpenStack, is KVM, which is pretty tightly tied to OpenStack nowadays; when you say OpenStack, people by default also point to KVM for virtualization. We've done a lot of performance tuning on KVM, and I'm not going to talk about everything here, but we want to share the major pieces that add value to an existing cluster and its workloads.

The first one is KSM, kernel same-page merging. It lets identical memory pages be shared among different processes, or among guests on a single virtualization host. This is especially valuable if you're overcommitting resources, increasing vCPUs and memory, especially memory; it will help a lot. It's not a hard thing; it's the kind of thing you might simply forget: install the package and turn the service on. There are services that take care of it: on RHEL-based distributions the ksmtuned process handles it, so make sure it's turned on; on Ubuntu make sure the KSM daemon is turned on, and the process takes care of the rest. I'm not going into much detail there because it's straightforward.

The next important thing is transparent huge pages. Everyone talks about it, and it's important because it benefits not only the host but also the guest. That said, keep in mind that the value in the guest should always be less than in the host; you're not going to gain much if the guest value is cranked up higher. There's a simple command to enable it, but it can be disabled by default: if you cat the sysfs file and you see it set to never, set it to always. Also add the corresponding parameter in the libvirt XML for the guest so the guest operating system can take advantage of it, and there are other parameters you can tune. We're deliberately not sharing the numbers we use, because it's test-and-tune: it depends on your workloads. The default value may be great for some people and not for others.

The next thing is the block I/O scheduler. One of the major changes from the default behavior: we moved from CFQ to deadline. It gave us a significant performance increase, especially with QCOW2 images. It doesn't match raw, but it's still a bit of a performance increase.
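A minimal sketch of those three host-side checks (paths are the standard sysfs locations; package and service names differ a bit per distribution, and the block device name is just an example):

    # KSM -- make sure the tuning daemon is installed and running
    # (RHEL/CentOS: the ksm and ksmtuned services; Ubuntu: the KSM daemon)
    cat /sys/kernel/mm/ksm/run            # 1 means KSM is active

    # Transparent huge pages -- flip from [never]/[madvise] to always
    echo always > /sys/kernel/mm/transparent_hugepage/enabled
    cat /sys/kernel/mm/transparent_hugepage/enabled

    # Block I/O scheduler -- switch from the default cfq to deadline, per device
    echo deadline > /sys/block/sda/queue/scheduler
    cat /sys/block/sda/queue/scheduler    # [deadline] should be selected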
Also keep in mind: if the same team is dealing with both the guest and the host, that's fine, but if two different teams are dealing with them, the first exercise is to coordinate, because if a value is already configured at the host level and you apply it again in the guest, you're going to get worse performance rather than better.

The other thing we've seen, and been bitten by: make sure cache=none is set, because it's not the default in the libvirt XML that OpenStack generates, and io=native adds more value as well. The reason is that you already have a cache on the hardware back end taking care of things, and you don't want to cache again, which gives you worse performance. Those are the three major KVM things we looked at and wanted to share in the configs and the comments.

The next one is the database. It's tricky, because the database is always a pain point. People ask: how do I replicate, do I do multi-master or master-slave, how do I do reads, how do I do writes? The approach we take is a standard Galera replication model. We have a three-node cluster; that isn't always ideal, and based on your size and your workloads you can grow it, but always keep the node count odd for the split-brain (quorum) mechanism. We use wsrep, the write-set replication, behind a load balancer, and all clients connect to a VIP. All three nodes are active, and we initially tried writing to all of them, but we had problems, so what we do now is write to one node, make sure it's committed to the other nodes, and read from all of them; we split the workload between writes and reads to get better performance.

A little about Galera: it takes a data-centric approach, meaning every node has its own unique ID, and the rule of thumb is that the node belongs to the data, not the data to the node; the data isn't tied to one node, it exists everywhere. So there's no chasing of replication positions the way legacy MySQL replication does. We want to share our configuration here too, because it might add value for everyone (a sketch of it is below): these are the values we use for our three-node cluster. We tested with heavy workloads, measured the effect of each individual configuration option, and then tuned them to our workloads. It does depend on your workloads, but this works perfectly for us; feel free to use it, test it, and see.

The other thing is that there are always limitations in software; nothing is perfect. The most important one is that Galera supports only InnoDB, so make sure your database is using InnoDB, and a primary key is a must: without primary keys, please do not even try using a Galera cluster. That can be fixed if you spend some time on your schema, which we have done a few times. Commit latency is another one: when we made all nodes active for writes, that's where we ran into problems with commit latency, because of the mix of workloads and the reads. That's the reason we made one node the master, with persistent connections behind the VIP and the rest standing by.
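A sketch of the Galera-relevant part of my.cnf on each of the three nodes (the IPs, cluster name, provider path, and SST credentials are placeholders, and the buffer/thread sizing that really matters is workload-specific, so it's left out):

    [mysqld]
    # Galera requirements
    default_storage_engine   = InnoDB
    binlog_format            = ROW
    innodb_autoinc_lock_mode = 2

    # write-set replication
    wsrep_provider           = /usr/lib64/galera/libgalera_smm.so
    wsrep_cluster_name       = openstack_db
    wsrep_cluster_address    = gcomm://10.0.0.11,10.0.0.12,10.0.0.13
    wsrep_node_address       = 10.0.0.11
    wsrep_sst_method         = xtrabackup
    wsrep_sst_auth           = sst_user:sst_password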
One master gets all the connections, and only when it fails do the connections go to the second or third node. We have custom scripts behind the load balancer that check whether a node is actually in the cluster or not, and we created a separate MySQL user that we use for those checks.

Other limitations: Galera doesn't like really huge transactions, and deadlocks on commit are always a problem, with all databases, not only MySQL. There are a few more, but those are the major limitations we see from our side.

With that, we're switching to RabbitMQ. This is one of the cool ones, so I just wrote it down: rabbits used to be for cartoons, and now they're used for data queues. So what can RabbitMQ do for you in an OpenStack cluster? One important thing people always forget is that there are a lot of differences between Rabbit 2.x and 3.x. We use 3.x. It has improved a lot; in particular it has a built-in HA clustering mechanism, so you don't need Pacemaker and Corosync anymore, which I still see in a lot of operations guides; we're trying to contribute so that this gets documented properly too. It also has HA queues, which means that by default queues are not replicated across the cluster: you have to enable an HA policy on the queues. Don't forget that; it's one of the most common mistakes. People create the cluster but forget to run that command. It also has the latest AMQP implementations; 0.9 and, recently, 1.0, which has a lot of cool features, especially connection pooling over channels. Federation is another important cool feature: if you have isolated networks but you want queues to flow to different places, you can set a policy that a given exchange or queue is federated to another node, and Rabbit does a good job there. And there's flexible routing for the queues, so when a producer puts a message on an exchange you can choose what kind of routing you want between the exchange and the queue. I'm not going to go into much more detail on Rabbit itself; it's a big subject.

This is what our cluster looks like: three nodes. The cluster automatically replicates all the exchanges and their metadata and maintains that in its internal database, and a queue that gets created on node one gets mirrored to node two and node three; in the same way, if a connection comes in and creates a queue on node two, it gets mirrored to node one and node three. This is what our configuration file looks like, and we'd like to share that too, because in my experience most deployments run commands through their configuration management, but with Rabbit you can put a lot of this statically in the configuration file, so nothing gets run and forgotten in the craziness of configuration management: set it once, and on every restart it reads the configuration file and goes with it.

The other important thing, as I said, is that you have to enable the policy to replicate, that is to mirror, the queues. Here's the command; what we usually do is add it to the startup script itself, and a sketch of it is below.
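A sketch of that mirroring policy for RabbitMQ 3.x (the policy name is arbitrary; this pattern mirrors every non-amq.* queue across all nodes of the cluster):

    # run once on the first node of the cluster; re-running it is harmless
    rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'

    # verify it took effect
    rabbitmqctl list_policies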
Re-running that command does no harm: if the policy is already set it does nothing, and if it isn't, it sets it, so we're just making sure it exists. And usually, when you plan the cluster, you run this command on the first node you create, so that when node two and node three join, the queues are mirrored automatically.

One other important thing I want to share is why we use a VIP in front of the Rabbit cluster. People might ask: OpenStack already has an HA-queues implementation where you can give it a list of hosts, and it implements this with a for loop, always going node one, node two, node three. We didn't want to use that. It's a client-side implementation and it works great, but I get HA without load balancing. I want the load-balancing functionality in front of Rabbit so I get better performance and I'm not choking a single node, because with thousands of hypervisors, assume node one goes down: all the connections land on node two; if that goes down, they all land on node three, and you have a choke point. Some people have their configuration management put in logic like "odd-numbered hosts go to one node, even-numbered to another," which sort of works, but we use load balancers because we already have them, and we also use connection caching on the load balancers for connection pooling, which gives even better performance. Also make sure you have keepalive enabled on your load balancer; it costs a little bit of performance, but it's not that bad.
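Roughly, that means the OpenStack services point at the one VIP instead of enumerating cluster nodes (a sketch; the hostname is made up, and exact option names belong to the Havana-era messaging config):

    # nova.conf / cinder.conf / neutron.conf ...
    rabbit_host = rabbitmq-vip.example.com
    rabbit_port = 5672
    rabbit_ha_queues = true
    # if a component insists on the host-list form, repeating the same VIP
    # keeps the client's retry loop coming back through the load balancer:
    # rabbit_hosts = rabbitmq-vip.example.com:5672,rabbitmq-vip.example.com:5672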
That's about it for RabbitMQ. Thanks for listening, and any questions?

You mean which manufacturer? So the question is which load balancer we're using. At this time we're using A10, but any load balancer will work just fine; you can use F5 or Citrix, it's not vendor specific. To add to that, we have not tried HAProxy; we're planning to, but in an enterprise you usually already have hardware load balancers, so we're taking advantage of them.

On the question about timeouts through the load balancer: we're on Havana and trying to upgrade to Icehouse, and we have not seen anything like that. From day one this cluster has been behind the load balancers, and we make sure keepalive is there; that's part of the reason. We do see some timeouts and lost connections from the consumers, and a lot of that comes down to the client implementation, the loop OpenStack uses to keep checking and reconnecting; that's also why keepalive is enabled. One thing we did notice, and one little hack we did, I'd say: we use that configuration option in the config file of every component, but with the same VIP repeated, so when a connection is lost it just keeps retrying through the VIP. We have not seen any specific RPC timeouts, and this is a single cluster that handles everything, even our SDN. On top of that we use StackTach, which is a Rackspace project, to listen to the notifications; we keep track every minute of what's going on in our cluster through the notifications and we track the flow, so we would see that kind of problem, but we have not seen anything.

Okay, with MySQL: are you using load balancer introspection to split your writes from your reads, and how are you directing your writes specifically to one node? We have tried that in the past: with F5 you can write an iRule that inspects the SQL query and then directs the traffic to specific nodes, which kind of turns the load balancer into a client-server sort of thing. That's why it needs a SQL user, so it can see what kind of query is coming, whether it's an update or a read, and route based on that. Are we still using that on the A10 to separate the writes? No, we don't use that at this time; that's why we've enabled the persistent connection to one node for now, but we're planning to go in that direction.

Okay, cool. So you guys cut out nova-conductor, but the community is really against the computes having DB access; have you looked into increasing the efficiency of conductor versus cutting it out? Good point. Even as a security company we are concerned about that, but conductor is not giving us value, and even the community acknowledges the computes shouldn't have MySQL access, yet my Nova metadata API still needs it, so it's not solving the problem. So rather than spending time there, because it's not ready yet, we just cut it off, and we'll probably go back, because we need that security to be there. Are we still using nova-network? No. And it's not only about that: if you use Cinder or anything else, it still requires a MySQL connection from the compute, so it's still not solving the problem. All right, thanks.

So the question is how we split up the control plane. As I said, our control plane is all virtualized and we have separate failure domains for it, distributed across different racks, running on different hypervisors. We make sure the same service doesn't run in the same rack and doesn't run on the same hypervisor; if there are many services of the same kind in the same rack, they won't be on the same hypervisors. One important thing: our control plane workloads are all virtualized, we don't run on bare metal. It's for scale-out; it's cloud, so we're eating our own dog food.

One more question about RabbitMQ: did you say that you are using the rabbit hosts parameter? Yes, as a tweak, because of the timeouts; we have not fixed that yet, so it still goes to the VIP, just the same VIP repeated, and the load balancer takes care of how the connections go. And one thing Raj didn't mention before, and should have, is that we have health checks for all these services on the load balancer. Services get marked down if they're not performing, so it's not only that the host is up and the TCP port accepts the connection; you also want to make sure the service actually performs, and if it doesn't, the load balancer marks it down and the client never sees it. We make sure the service is enabled and accessible by the user; that's how we mark it up in the load balancer.
Okay, thanks. The question is whether we monitor the queues and the node status. We do monitor the node status. We don't monitor the queues directly, but as I said, by using StackTach we monitor the flow of notifications for each and every request that's going on; the request ID has everything, so if something is missing I can definitely see that the notification wasn't there. We don't go in and look at the queues themselves, but we capture the notifications and save them into a database so we can look at the history: how many requests we served, how they went, and how long they took. I think we can take one more question if you have anything, or we can talk about the beach. We're going to be around, so if you have any more questions, find us. Thanks for staying late.