Good morning, everyone. Good morning. Almost good afternoon. Just want to thank everyone for coming to this session on OpenStack Trove and Message Bus Security. My name is Mariam John, I work for IBM, and I'm a core on the Trove project. I'll be co-presenting with Amrith Kumar, the PTL, and he's from Verizon. So today in this presentation we will be focusing mainly on how we've enhanced security for our message bus, just one of the features we worked on this release. I'll start with a general introduction to Trove and a brief introduction to the message bus architecture within Trove, and then Amrith will talk in more detail about how we've enhanced the security.

So Trove is the database-as-a-service component of OpenStack. Its main goal is to provide a scalable and reliable platform to provision and manage databases, SQL and NoSQL databases, in a uniform and consistent way. The way I see it, one of the main advantages of Trove is that it provides a consistent platform to provision databases, whether SQL or NoSQL. What that means is you have a single interface, whether it's through Horizon or the OpenStack client, to, for example, create databases, provision clusters, and manage replicas with just a single command. It can create a MySQL database or a Mongo database and manage them, so it frees the user from having to know the details of each individual database. In addition to that, the key features are user and database management, resizing instances and volumes to help with scaling, cluster management, replication and failover, backups and recovery, managing configurations, and providing the ability to do database instance upgrades, whether that's upgrading the operating system version or the database server version or applying patches on the database instance itself.

Now, in this chart, we look at how Trove is placed with respect to all the other OpenStack services. As you can see, Trove uses many of the core OpenStack services: Keystone for authentication, Cinder for block storage, Swift for backups, Glance for storing the database guest images, Nova for provisioning, Neutron for networking. And while all the other core OpenStack services can use a shared message bus, Trove has its own separate message bus that it shares between the different Trove services. It does not share that with the rest of the OpenStack services.

In this slide, we talk in a little more detail about the Trove architecture. Trove consists of mainly four components: the Trove API, the Trove Task Manager, the Conductor, and the guest agent. The three services, API, Task Manager, and Conductor, run on the control plane, whereas the guest agent, along with the database server, runs on the compute node which Nova provisions. Now, as you can see on the left side, all these different Trove services share the same message bus and a database over there, and this is separate from what the rest of the OpenStack services use. So when the user makes a REST API call, say to provision or launch an instance, it goes through the Trove API, and from there the different Trove services communicate with each other using RPC over this message bus. In Trove, we use RabbitMQ to implement the messaging service. While oslo.messaging RPC supports many different messaging services, we use RabbitMQ.
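[Editor's note: a minimal sketch of the kind of RPC plumbing being described, using oslo.messaging over a RabbitMQ transport. The endpoint, topic, and credential values here are invented for illustration and are not Trove's actual API.]

```python
# Illustrative only: an oslo.messaging RPC server and client over RabbitMQ,
# in the style Trove services use to talk to each other. Names are invented.
import oslo_messaging as messaging
from oslo_config import cfg

# The transport URL carries the RabbitMQ credentials discussed later.
transport = messaging.get_rpc_transport(
    cfg.CONF, url='rabbit://trove_user:trove_pass@rabbit-host:5672/')

class GuestEndpoint(object):
    """Methods an RPC server (e.g. a guest agent) exposes on the bus."""
    def prepare(self, ctxt, databases):
        # ... install and configure the database server ...
        return 'ACTIVE'

# Server side: listen on a per-service queue (the "topic").
server_target = messaging.Target(topic='guestagent.instance-1',
                                 server='instance-1')
server = messaging.get_rpc_server(transport, server_target, [GuestEndpoint()])
# server.start() would begin consuming from the queue.

# Client side: e.g. the Task Manager posting a message to that queue.
client = messaging.RPCClient(transport,
                             messaging.Target(topic='guestagent.instance-1'))
# client.cast({}, 'prepare', databases=['db1'])  # fire-and-forget RPC
```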
So to talk a little more about the message bus communication: like I mentioned earlier, the different Trove services, the API, the Conductor, the Task Manager, and the guest agent, talk to each other over the message bus, which is implemented using RabbitMQ. Since this is RPC-based communication between the services, there are RPC servers and clients. The Task Manager, the guest agent, and the Conductor are configured as RPC servers, which means they expose a list of endpoints, or methods, which clients can call over the message bus. So for example, when we launch an instance, the API sends that message to the Task Manager. The Task Manager then calls Nova to provision the VM, and once it's provisioned, it sends a message to the guest agent to prepare that instance. What the guest agent does at that point is all the database configuration, whether it has to install and set up the database. That message goes to the message queue that's configured for the guest agent. Each RPC server has a queue defined: the Conductor has a queue, the guest agent has a queue, and the Task Manager has a queue. All these different RPC servers communicate with each other by posting messages to these queues to get things done. Similarly, the guest agent, once it's up and running, sends its status to the Conductor, and that's how the communication between the Conductor and the guest agent happens. These are what we call heartbeat messages, which just report the status of the instances, and this happens every couple of seconds so that we know whether the guest instances are alive or not. Now, all these RPC clients use the same RabbitMQ credentials, and I'll talk in a couple of slides about where these credentials are stored.

So how do these RPC clients communicate over the message bus? Like I mentioned earlier, we use RabbitMQ as the standard for our messaging service. We use the RabbitMQ username, password, and the message queue names, which are all stored in configuration files. Each Trove service has its own configuration files, and these credentials are stored in those files. So how do the credentials get to the guest agent? The guest agents sit on the guest instances, which are separate from the control plane. Once we create these guest instances, how do we get the configuration files over to them? What happens is, when the client requests an instance, the API asks the Task Manager to launch it. The Task Manager tells Nova to create a VM, and once that happens, it prepares two configuration files. One is the guest info file, which contains the guest-specific information, like the guest ID, which identifies the guest instance. The other file is the Trove guest agent configuration file, and that one contains all the RabbitMQ credentials: the username, password, the server name, and the queue name. These get injected into the instance using config drive during the bootstrapping process, and that's how these configurations get into the guest agent. And Amrith will now talk about the security part of it.

Thanks, Mariam. You guys able to hear me? No? OK, now you can. OK. Thanks, Mariam. Thanks for the introduction. So I know there are several people here in the audience who have deployed Trove at reasonable scale and who probably know about some of these problems.
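[Editor's note: an illustrative sketch of the two injected files Mariam just described, rendered as the Task Manager might compose them. The option names, paths, and values are placeholders and vary by release.]

```python
# Illustrative only: the two files injected into the guest via config drive.
# Option names and paths vary by release; every value here is fake.
guest_info = (
    "[DEFAULT]\n"
    "guest_id = 8a6c9f2e-0000-0000-0000-000000000000\n"  # identifies the guest
)
guest_agent_conf = (
    "[DEFAULT]\n"
    "rabbit_host = 10.0.0.5\n"          # the RabbitMQ server
    "rabbit_userid = trove_user\n"      # shared credentials, stored in
    "rabbit_password = trove_pass\n"    # plain text on the guest
)
# Handed to Nova as config-drive content when the VM boots:
injected_files = {
    '/etc/trove/conf.d/guest_info.conf': guest_info,
    '/etc/trove/conf.d/trove-guestagent.conf': guest_agent_conf,
}
```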
So just to make it interesting, throw questions up at any time; we don't have to wait till the end or anything like that. As Mariam said, the fundamental issue, which has long been raised as a security concern, and I'm going to go back a couple of slides to the architecture here, is that when Trove provisions a guest instance for you, it does that by talking to Nova. Whatever Nova provisions as a guest, that could be bare metal, that could be a VM, or if you're fancy and you're using LXC containers, it's a container, you have the guest agent running there, and the guest agent is connecting back to the message bus. Now, whatever credentials you need for the message bus, let's assume it's Rabbit, those credentials are stored on the guest. If for some reason that guest happens to be compromised, and there's any number of ways in which it can be compromised, if you build a guest image where you happen to leave port 22 open and you have a security group which is not tight enough, or somebody is really clever and they manage to break out of the database and land at a shell, those credentials are stored in plain text in a file. Once you have those credentials, you have the ability to connect back to RabbitMQ or whatever your AMQP server is. And at that point, you're talking to any other service which is listening on that bus.

So one of the things Trove has always advised is that you use a different message bus for Trove. Part of the reason for that is Trove is often deployed not as a project which is part of the control plane; rather, it's deployed in a Nova VM by itself. And if you want to install Trove in a Nova VM, the last thing you want is a VM running in tenant space talking back to your control plane's message bus. So great, you have an independent message bus and all that. But still, there's an API service, a Task Manager service, a whole bunch of other guest agents, and a Conductor all listening on that message bus. And now you have the credentials, and you can do a fair number of bad things. So the guest instance is compromised; if you have the guest credentials, now the message bus is compromised. That is the fundamental issue we're trying to solve.

So, pretty pictures; now here are the words to go with them. The important thing to remember is that when you're evaluating the security of a system, you start from the point which says: if a system can be compromised, assume it has been compromised; what happens next? Since the credentials are in plain text, and it's possible in some cases to compromise a guest, now the guest is compromised, what happens? Anybody who is clever enough to connect to your message bus can now send any message they want. So it's really important, if you're deploying Trove, to protect your guest image. And there are any number of things you can do; security, like the proverbial onion, is multiple layers. Prevent port 22 access, no shell access. Have things like AppArmor or SELinux or whatever you want. Make sure your database is configured such that you can't have an escape. All of those are things you can do, but still, if that's not good enough, then what? And it's important to go that far, because there are going to be situations where, even though you do all of these things, for troubleshooting you need to get onto the instance. So you might have port 22 access, even if it's only from a control network.
So at the end of the day, how do you further narrow the damage radius from somebody compromising your guest? I'm going to talk about something which we actually implemented in Ocata to do just this. The solution we came up with was relatively straightforward: we're going to encrypt all the traffic which runs on the message bus, and we're going to do that encryption with unique keys depending on what the traffic is being used for. There's a unique key for every guest. There's a unique key for the control plane. And the key for every guest is stored in the database on the control plane. So is everybody kind of clear with this, or do you want to see a picture which explains this a little bit better? Yes, no. I'm going to go back to the picture just to make it easy, for me at least. There is a database down here, which is the database on the control plane, that one over there. The key which is used for the API service to talk to the Task Manager is stored on the control plane. The key for every single one of these guests is stored in this database, and each one has its own unique key. Everybody okay so far? All right. If you were to do that, then every piece of traffic on the message bus is encrypted using a specific key. On the control plane, we use a key for the control plane. For every guest instance, we use a key which is unique to that guest. And we regenerate that key: we generate it when an instance is created, and we regenerate it when the instance is upgraded.

Okay. So we got some of the OpenStack security folks to take a look at this, and while I was doing that, I came up with this analogy to compare it with a security mechanism which everybody kind of knows and understands. Everybody is familiar with HTTPS, Secure Sockets Layer, and so on. The way SSL works, whether it's with your browser or anything like that, is you rely on PKI, public key cryptography, for one part of it, but the bulk of your actual communication uses reversible ciphers. We can stay away from the discussion of why you do that, the costs involved, and so on. But in standard HTTPS, the three steps are: first, establish a session and securely transfer a key; second, ensure that one person and one person alone can get that key; and third, do everything using that reversible cipher. Those are the standard steps. So you start with a server identifying itself based on a certificate. The client receives the certificate as part of the connection and asks, do I trust who you are? https://www.google.com, are you Google? Well, I use a mechanism of trust, a hierarchy of trust, to decide whether you are or you're not. And in your certificate, I have your public key. Since I have your public key, I as a client generate a session key, encipher it with that public key, and send it back to you. As long as your private key is not compromised, only that server will be able to extract that key. So you've safely transferred a key across. After that, everything is a reversible cipher based on that session key, and on the server side, the session ID is associated with the key being used. So if anybody were to do the classic man-in-the-middle attack, they're going to get a whole bunch of encrypted bits flying across the wire, and they can't do anything with it. The mechanism you rely on is the secure transfer of the key; that's eventually what it all comes down to.
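[Editor's note: a minimal sketch of the key-transfer pattern in that analogy, using Python's cryptography library: a fresh symmetric key is enciphered under the server's RSA public key, and everything afterwards uses a reversible cipher. This illustrates the HTTPS analogy only; Trove itself transfers the key via config drive, as described next.]

```python
# Illustrative only: the TLS-style pattern of transferring a symmetric key
# under a public key, then switching to a reversible cipher for the session.
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Server side: a key pair; the public half ships in the certificate.
server_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
server_public = server_private.public_key()

# Client side: generate a session key and encipher it with the public key.
session_key = Fernet.generate_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped = server_public.encrypt(session_key, oaep)

# Server side: only the private-key holder can unwrap the session key.
unwrapped = server_private.decrypt(wrapped, oaep)

# From here on, both sides use the shared reversible cipher.
cipher = Fernet(unwrapped)
ciphertext = cipher.encrypt(b'the bulk of the communication')
assert Fernet(session_key).decrypt(ciphertext) == b'the bulk of the communication'
```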
So we've done exactly the same thing. We said: I'm going to use some mechanism to securely transfer a key across; once I transfer the key across, there's a shared key in place, and then I use a reversible cipher. Everybody okay so far, or do you want me to go over this one more time? Okay, the mic's probably not working because nobody's saying anything, all right. Okay, good. So every guest has a unique key, and all the communications are now protected. The same situation applies here as with a public transport, the internet. Everybody can snoop on what's going on on the internet; we assume that's the case. Every communication on the message bus can be compromised, people can snoop on it, but all they're going to see is a bunch of encrypted traffic. More importantly, if I decide that I want to put myself in the middle of that banking transaction you're trying to do, I don't have the shared key to do any damage. So even though I know www.yourbankname.com, I can't do anything more, okay?

So, confusing picture. On the control plane, there is a key. It's established when Trove is deployed. What this basically means is that if you have multiple Task Managers and multiple API services on multiple hosts, they should all have the same key; otherwise, bad things will happen. When the Trove API receives a request of any kind and wants to talk to the Task Manager, it's going to encrypt that using the control plane key, a reversible cipher, and send the message over the message bus. So if the message bus is compromised, can you impersonate the Trove API and tell the Task Manager to do something? Nope, you don't have that key. Okay, the Task Manager gets a message; say you tell it to create an instance. It goes and talks to Nova, creates an instance, and uses the config drive mechanism to securely inject a file. In that file, it has the key. Again, I'm relying on file injection to be the secure mechanism here. So if you compromise cloud-init, yep, bad things could happen. So now there's a key stored on the guest, but on the control plane the same key is stored in a database. Whenever the Task Manager wants to communicate with the guest, it will encrypt the message using that key. Whenever the guest wants to reply to the Task Manager, it will encrypt using the same key. When it wants to send a heartbeat, it will encrypt it using its key. Guest one always encrypts using that key. When the Conductor receives a message that's supposedly from guest one, because the envelope says it's guest one, it goes and looks in its database and says, give me the key to decrypt this thing. It uses a key which it trusts, from its own database, and decrypts the message. As long as guest one really is the one who generated the message, it gets a valid message; otherwise it gets an invalid message and throws it away.

So if you were to compromise this guest, you now have access to this key; it's stored in the exact same place where the other passwords are stored. But if you want to send the Task Manager a message saying, hi, I'm the API service, go create me a cluster, which is this ginormous thing, you don't have that key, so you can't do it. Or say I'm Coke and this is Pepsi's database, and I want to tell the Task Manager to go shut this thing down. You try to send the Task Manager a message to kill it, well, you can't do that, because you're talking with guest one's key, and the Task Manager says, sorry, that message doesn't make sense.
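[Editor's note: a minimal sketch, under stated assumptions, of the per-guest keying just described: the sender encrypts with its own key, and the Conductor decrypts with the key it trusts from its own database, discarding anything that doesn't verify. Fernet stands in for whichever reversible cipher Trove actually uses; the table and function names are invented.]

```python
# Illustrative only: per-guest symmetric keys on the message bus.
from cryptography.fernet import Fernet, InvalidToken

# Control-plane database: one unique key per guest (names invented).
guest_keys = {
    'guest-1': Fernet.generate_key(),
    'guest-2': Fernet.generate_key(),
}

def guest_send(guest_id, payload):
    """A guest encrypts with its own key; the envelope names the sender."""
    return guest_id, Fernet(guest_keys[guest_id]).encrypt(payload)

def conductor_receive(envelope_guest_id, ciphertext):
    """The Conductor trusts only its own database, never the envelope."""
    key = guest_keys[envelope_guest_id]   # cache/DB lookup in practice
    try:
        return Fernet(key).decrypt(ciphertext)
    except InvalidToken:
        raise ValueError('does not verify under that guest key; thrown away')

# guest-1's heartbeat decrypts fine...
sender, msg = guest_send('guest-1', b'heartbeat: ACTIVE')
assert conductor_receive(sender, msg) == b'heartbeat: ACTIVE'

# ...but a compromised guest-1 claiming to be guest-2 is rejected:
_, forged = guest_send('guest-1', b'shut down guest-2')
try:
    conductor_receive('guest-2', forged)
except ValueError:
    pass  # wrong key, invalid message, dropped on the floor
```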
So the only thing you can damage if you compromise this key is this guest itself. But guess what, you already damaged it. So you again have a public channel which can be compromised, and you have encryption based on a securely shared key, which guarantees that the damage from compromising one of these guests is localized to that guest. So I know there are at least three people in this room who've thought about this problem for a long time. Is Crystal here? No, he's not. So it's all yours, Matt. Does this work for you?

So, great question. That was the reason I said there are specifically two places where the keys are regenerated. Think through this path: you have a guest which has no key and software which doesn't understand this stuff, and you have a control plane whose software doesn't understand encryption either, so there's no key anywhere. Now you're going through the upgrade process, and the upgrade process is always tied to upgrading a guest. The first thing you're going to do is upgrade the guest, and the upgrade path sends it down a new key, because you're going to regenerate the image. It's also going to come down with one additional flag which says the control plane still doesn't understand encryption. So while the key is down there, the guest is not going to encrypt. Once you upgrade the control plane, oops, wrong button, once you upgrade the control plane and you have this in place, and by the way, setting this up is a control-plane-only thing, once you have a control plane which understands encryption, then there's a message across to the guest which says: now switch to encrypting. So you transition the entire system over with software which understands encryption but does not do encryption, and then you move it forward. The plan was that we would do Ocata with encryption and no encryption coexisting, and in Pike we would turn it to all encrypted. We haven't yet done that, so that is still currently the plan. Works for you?

And I did mention that the keys are stored in the database; only the guest keys are stored in the database. The control plane key is not stored in the database. Yeah, so, for those of you who are not familiar with Barbican, it's the secret service which is part of OpenStack. So let me try to answer the question this way. This is currently the list of services which Trove already depends on. When we tell somebody to go install Trove, having them install those first six services is usually not a hard sell. Telling them that they need to use Horizon, also not a hard sell. Telling them to use Mistral for scheduled backups is an uphill battle. Telling them they need to use Designate, yeah, it's a problem. If I want to make my project more ubiquitously deployed, making it depend on another project which has lower deployment doesn't seem like a winning strategy to me. At some point I would like to go to Barbican, but it's not something I wanted to make a hard dependency at this stage. Absolutely, absolutely. So personally, I will tell you that there's a technical term for this, and that is a hack. It's a horrible hack. The place where this stuff should live is in Keystone, but I can't convince Keystone that they should give me the ability to store stuff on a per-project basis in a secure manner. They say that's not their bailiwick. But the service catalog is, and I don't understand why that's the case.
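[Editor's note: going back to the upgrade path for a moment, here's a sketch of the two-phase rollout logic just described, with hypothetical flag and option names. The idea is that a key can already be present on the guest while encryption stays off, until the control plane signals that it understands it.]

```python
# Illustrative only: mixed-mode operation during the upgrade path.
# The names (encryption_key, control_plane_encrypts) are invented.

def should_encrypt(guest_conf):
    """Encrypt only once a key exists AND the control plane has opted in."""
    has_key = guest_conf.get('encryption_key') is not None
    control_plane_ready = guest_conf.get('control_plane_encrypts', False)
    return has_key and control_plane_ready

# Phase 1: guest upgraded, key injected, control plane still old software.
conf = {'encryption_key': b'per-guest-key', 'control_plane_encrypts': False}
assert should_encrypt(conf) is False      # key present, but send plaintext

# Phase 2: control plane upgraded; it messages the guest to switch over.
conf['control_plane_encrypts'] = True
assert should_encrypt(conf) is True       # now every message is encrypted
```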
Barbican would be the logical place to store it, but this is a starting point. At this point, all I have is a Task Manager creating an instance or upgrading an instance and storing something into a database, and the Conductor and the Task Manager, when they receive a message, looking at an in-memory cache, and if there's a miss on the cache, going to the database and fetching it. Replacing that with Barbican is not going to be a big deal. So at some point I would like to do that, when it's not an impediment to the project. Any other questions about this so far?

The communication between the Trove API and... so the Trove API never actually talks to the database. Oh, sorry, which database? This database? This database is on the control plane, so that is secured by whatever your database server offers. Suppose you're using MySQL; you can turn on TLS in your client, and, okay, different picture: it's TLS to MySQL, and you're all set. So in that particular picture, this communication here, first of all, is entirely on the control plane, and you can encrypt it using whatever settings you want. If you're one of those brave people using Postgres there, knock yourself out, you can do that, or TLS with MySQL, okay.

Okay, so how does this actually address the issue? I'll be very clear that no solution which I've been able to come up with, or the team has been able to come up with, actually prevents you from being able to compromise a guest. So I'm still going to always assume that the guest can be compromised, and ask what happens if it is compromised. All we can do is control the blast crater which comes from such a compromise. If you compromise an instance, all you should be able to do is damage things about that instance. But guess what, if you damaged an instance, you have already damaged it, so you can't do much more. You cannot damage other tenants; that's the important part. And I don't even want you to damage other instances owned by the same user. That's exactly all we can do at this point, okay. And with that, I'm pretty close to the end of all I had to say. If you have specific questions about this, you can ping either Mariam or me on IRC. There's an onboarding session where we can talk more about this. There's also more time where we can go into a demo if you're interested. Or if you have other questions, go ahead.

A separate message bus in what way? We do recommend that. So if you think about the typical way in which people end up using OpenStack, they start by deploying those six services, and then they want to try out Trove. They typically don't put Trove on the same control plane; they typically go get a Nova VM and stick Trove on that. So in that kind of an environment, what's easiest for me to tell somebody who wants to evaluate Trove is: great, spin me up three VMs. One of them is going to have a MySQL server, one is going to have RabbitMQ, and one is going to have the three Trove control plane services. And that's entirely running in tenant space. Sorry, I think I may have misunderstood your question. Did you say one message queue or one message bus? Okay, we're using an independent message bus, but on a per-instance basis, we're also using discrete queues. It has its own topic, correct. It is doable; we did not do that because, with oslo.messaging, you're allowed one RabbitMQ endpoint, or one of whatever your underlying AMQP transport is. oslo.messaging doesn't support having multiple, and we didn't go try to fix that.
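[Editor's note: a sketch of the cache-then-database key lookup described at the start of this answer, with invented names. The point of the design is that swapping the fetch for a Barbican secret lookup would touch only one function.]

```python
# Illustrative only: per-guest key lookup with an in-memory cache in front
# of the control-plane database. Function and variable names are invented.
_key_cache = {}

def _fetch_key_from_db(guest_id):
    # Stand-in for a SQL query against the control-plane database.
    # Replacing this one function with a Barbican call is the migration path.
    raise NotImplementedError

def get_guest_key(guest_id):
    """Return the per-guest key: serve from cache, else go to the DB."""
    key = _key_cache.get(guest_id)
    if key is None:
        key = _fetch_key_from_db(guest_id)   # cache miss: fetch and remember
        _key_cache[guest_id] = key
    return key
```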
And yeah, so Murano uses its own RabbitMQ exactly the same way we do; it keeps the two of them separate. Okay, I should go and chat with them in that case. The simple solution in that case is to put Trove on your control plane. Sure, sure. And the answer to your question is, even from tenant space, you can talk back to the control plane; it's a matter of whether you're willing to open up that security port or not. Zane.

So first of all, congratulations on getting this done. Thank you. That was really awesome. My question is, is there an issue because RabbitMQ is not inherently multi-tenant? Do you also have a potential denial-of-service issue, where you don't have the encryption key but you could just flood the control plane with garbage? Yeah, absolutely, we absolutely have that issue. And we considered the option of how to avoid that, and whether to go to a multi-tenanted messaging system like Zaqar. Well, guess what, I'll do the denial of service on the front-end port for Zaqar; it's the same thing. Okay. So if the issue is a denial of service, Zaqar's not the answer; it's exactly as susceptible. And that's before the earlier point about having dependencies on other projects which are less deployed. But in theory, with a multi-tenant message bus, you could do rate limiting at the API. So rate limiting is another way of saying you have a fancy title for what happens when you have a denial-of-service attack. You can rate limit per tenant, but a denial-of-service attack is still going to overwhelm the service which has to receive each message and say, yeah, it's the same tenant, drop it on the floor. Sure, potentially, yeah. So if you're talking about somebody who wants to do a distributed denial-of-service attack or a denial-of-service attack, the fact that they know the endpoint to go to Rabbit, or they know the endpoint to go to Zaqar, is not a material difference. But we did consider Zaqar, and the reason we didn't... so Zaqar was also presented as the answer because you don't need to store credentials on the guest. Well, you need to store something on the guest for it to at least be able to get back to Keystone and authenticate. So you're not storing the RabbitMQ credentials, you're storing some Keystone credentials, because you need to go get a token. You can store a signed URL as well, but once you have the signed URL, it's exactly the same thing: it's a secret which you're storing on the guest, which is now compromised. Other questions? Yes, sir?

Why not use, oh, why not use RabbitMQ virtual hosts? Okay, so we tried a couple of things with RabbitMQ to go down this path. The first thing we tried, on this message bus here, was per-guest credentials; we tried virtual hosts and all of those things. When we got past more than a dozen guests, closer to 100 or 150, the performance of RabbitMQ went in the toilet, just dealing with multiple virtual hosts and multiple ACLs. The CPU utilization of Erlang was going through the roof, and it pretty much became unusable at that point. The best practice seems to be that RabbitMQ is a protected message bus. We're really using it not as a protected message bus but as something to do RPC. If you want to do RPC, we should really do this as RPC. That's the fundamental problem: oslo.messaging RPC is an abomination.
So the correct solution to this would be that communication to the guest, and communication from the guest to the Conductor and Task Manager, should all be RESTful endpoints. If the guest agent exposed a RESTful endpoint, that's it; the reply from the guest agent should go to a RESTful endpoint. But we're instead using RPC. Other questions? Yes, sir.

So we've done a fair amount of testing of what the actual overhead is, and that's one of the other reasons why we suggest keeping the RabbitMQ bus separate. A guest agent typically sends a heartbeat message every 30 seconds. A reasonably sized Rabbit cluster can happily handle a couple of thousand guest instances before it even breaks a sweat. There is certainly an amount of traffic going over the wire, but it's not a lot. The entire heartbeat message, as far as the payload is concerned, is literally four bytes; whatever RabbitMQ wraps around it is some amount of overhead. It's not a lot of traffic. Anything else? Thank you for coming.