Good morning. In this presentation we're going to talk about something we call hybrid messaging, not to be confused with hybrid cloud. We're actually going to talk about various deployment models for the message bus control plane that you can use with OpenStack. And everybody's thinking, you know, Rabbit. This is Rabbit, everybody knows Rabbit. There are other technologies that we're bringing forward that differ from Rabbit and complement it, and we'll be talking about those today.

So, a quick introduction. My name is Ken Giusti. I'm a member of the Apache Qpid project, I'm also a developer on Oslo messaging, and I work for Red Hat. Mark: I work for Red Hat as well. I'm in the performance and scale group; I'm basically a performance engineer at Red Hat.

So let's get started. There's a lot of information involved here, and we have other presentations if you're interested in spending a little more time on this topic, but we're going to cover it at a pretty high level. We're going to focus on the Oslo messaging library. That's the library that provides the RPC and notification services, basically the traffic you see going across Rabbit. Then we're going to talk about the various back ends. That's the messaging implementation, the message bus; you'll probably hear me use that term. We're going to describe what we call hybrid messaging, talk about these other technologies that are supported in Oslo messaging, show some testing we've done to give you a sense of how these different technologies perform, and talk about future plans.

So, Oslo messaging. Basically, when you see stuff go through Rabbit, it's via Oslo messaging. The Oslo project, of which Oslo messaging is a part, is a collection of utilities shared amongst the projects of OpenStack. It's a set of libraries used to do different things so we don't keep reinventing the wheel. The part we're talking about today is called Oslo messaging, and it provides high-level abstractions for a couple of very common messaging patterns: one being remote procedure call, the other being what we call notifications. Notifications are like eventing or logging, except at the message level. We'll talk a little more about these in a second. But as I said, this provides an abstraction to the projects. The projects aren't speaking directly to Rabbit via Rabbit APIs; they're speaking to Rabbit via Oslo messaging at a high level.

A little bit about these services. As I said, notification is about eventing; think of it like logging. There are two players in this pattern. There's the thing that does the eventing, which we call the notifier, and the thing that consumes these events and processes them, which we call the listener. Now, this pattern is loosely coupled in time. What I mean by that is that the notifier is asynchronously firing off these events, and all it cares about is that the message bus received them, that Rabbit queued them up. It doesn't get any feedback about when they're consumed, or whether they're ever consumed. So these two parties are temporally separated: the listener can come up at any time, even after the notifier has exited, and consume those messages at a later point. They don't depend on both parties being present to complete the transaction. That requires queuing. So we think of Rabbit, we talk about the queue.
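To make the pattern concrete, here is a minimal sketch of a notifier and a listener using the Oslo messaging API. It's illustrative only; the transport URL, publisher ID, and topic name are placeholder assumptions, not anything from a real deployment.

```python
import oslo_messaging
from oslo_config import cfg

conf = cfg.CONF
# Placeholder broker address; any RabbitMQ instance would do here.
transport = oslo_messaging.get_notification_transport(
    conf, url='rabbit://guest:guest@localhost:5672/')

# The notifier fires events asynchronously; all it cares about is
# that the bus accepted (queued) the message.
notifier = oslo_messaging.Notifier(transport,
                                   driver='messaging',
                                   publisher_id='compute.host1',
                                   topics=['notifications'])
notifier.info({}, 'compute.instance.create.end', {'instance_id': '42'})

# The listener can start at any later time and drain the queue.
class NotificationEndpoint(object):
    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        print(event_type, payload)

listener = oslo_messaging.get_notification_listener(
    transport,
    [oslo_messaging.Target(topic='notifications')],
    [NotificationEndpoint()],
    executor='threading')
listener.start()
```

Note how the two halves never reference each other directly; the queue on the bus is the only thing connecting them.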
Notifications, then, are a pattern that requires that type of intermediary to store the message until the listener comes and pulls it off.

Remote procedure call is quite a different pattern. It's more synchronous, right? You have a client that makes a request of a server by sending it a message; the server processes that request and returns a result. The client has to be waiting for that server, and the server has to be present for the transaction to complete. So we're talking about something that is tightly coupled: the two parties have to be present. You'll notice I didn't use the word queue in describing this model. We'll get to that in a second.

If we peel back the covers on what's going on with Rabbit and notifications, what we see is pretty much a standard flow of messages through a queue. We have these notifier clients firing off events. They get acknowledgments from the message bus, from Rabbit, that say: yeah, I got this message, I put it on a queue, you can go off and do whatever it is you want to do. And at any time the notification listeners can come in and consume them from that queue. So it's a unidirectional flow through a queue, and it's asynchronous. Perfect: brokers do this very well; that's what they're designed to do.

However, if we look at how we implement RPC today using Rabbit or any broker, Qpid, ActiveMQ, the RPC transaction takes place across four discrete transfers of ownership of that message. Remember I said that a notifier hands the message to the bus, and the bus says: I got it, don't worry about it, I'll take it from here. That has to happen four times, synchronously, per RPC call, for the RPC to complete. The RPC client has to issue its request and get it queued up, at which point the broker says: I got it. The RPC client can't really do anything at this point; it's still pending on that reply. The message goes through the queue. The RPC server takes it, tells the broker, I've got this, you can get rid of it now, processes it, and sticks the reply on a different queue. Notice we have to use two queues here, so double the resources per client, potentially. Then the RPC client consumes the reply off that queue. Hopefully it's still there.

If the RPC server isn't there, what happens? You get a backup on your queue. Messages are state, that's the thing. The broker is now holding state, and that state can go stale. If the RPC server isn't there, the messages still remain. Same thing with the replies: if the client dies, you have what's called an orphan queue, with orphaned messages. Again, the broker is maintaining state for what is essentially a pretty lightweight, stateless transaction, and there's this artificial barrier because the server and the client aren't communicating directly.

It would be much better if we got rid of the queues and did something like this: the RPC client communicates directly, whether through a proxy or over TCP, with the server, and gets the response directly back. That way, the endpoints maintain the state. They know when the messages have been consumed, and they know when they can release that state and not have to retransmit. Or, if they do have to retransmit, they'll at least know the message didn't get there, via a negative acknowledgement from the message bus, okay? You can't do this with a broker, but you can with other messaging technologies. There are technologies out there, protocols, that don't require queues.
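For reference, here's what that RPC pattern looks like at the Oslo messaging API level, again a minimal sketch with placeholder topic and URL. The point is that the client code is the same whether the transport underneath queues the messages through a broker or delivers them directly.

```python
import oslo_messaging
from oslo_config import cfg

conf = cfg.CONF
transport = oslo_messaging.get_transport(
    conf, url='rabbit://guest:guest@localhost:5672/')

# Server side: expose an endpoint method on a topic.
class ServerEndpoint(object):
    def echo(self, ctxt, msg):
        return msg  # the return value travels back as the RPC reply

server = oslo_messaging.get_rpc_server(
    transport,
    oslo_messaging.Target(topic='demo_rpc', server='server-1'),
    [ServerEndpoint()],
    executor='threading')
server.start()

# Client side: call() blocks until the reply arrives (or times out),
# which is exactly the tight coupling described above.
client = oslo_messaging.RPCClient(
    transport, oslo_messaging.Target(topic='demo_rpc'))
reply = client.call({}, 'echo', msg='ping')
assert reply == 'ping'
```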
You can still use queues with these protocols, but they don't require them. They support peer-to-peer communication paths, and we support that in Oslo messaging.

Here's a simplified, layered look at the components within Oslo messaging. What I want to highlight is the RPC service and the notification service. These abstractions are basically separate: they share some infrastructure, but the APIs are separate and their operations are separate. What they do share is the transport. That's where the logic specific to the message bus technology is contained; it's a driver. So we have calls to the Kombu library for Rabbit in this transport, and of course it connects directly to the message bus.

Since Mitaka, I think, we've had the ability to use dedicated transports for each service: a notification transport for notifications, an RPC transport for RPC. Technically, they could be talking to the same bus, but that doesn't add as much value. We've enhanced the API so the projects can support this. The projects used to just allocate one transport and use it for both; now, optionally, they can allocate a transport per service, which enables this. An example of the configuration change: we've added a new [oslo_messaging_notifications] config section where you define a secondary transport URL, the address of the message bus along with the user credentials. The original URL is still maintained. It's a little quirky; we need to add an RPC-specific URL there, and we'll be doing that in the future. But you can add one specifically for notifications, and if you constrain the original URL to RPC, you can have two different transports.

So what does that give us? Right now, this is what we're used to: notifications and RPC both being handled by the same broker or cluster. As I said, RPC traffic is not very efficient when it's passed through a queue. So let's do this: a broker cluster dedicated to RPC and a broker cluster dedicated to notifications. There's some value in that, but it still doesn't solve the messiness and inefficiency of queue-based RPC.

This is hybrid messaging: using a broker cluster, using Rabbit, for notifications, because that's what Rabbit does well, and using a different Oslo messaging transport, a different messaging technology that does point-to-point, for RPC, which it can do much more efficiently. It's about using the right tool for the right job. Notifications, you want to be able to store. If you're doing something like billing, something critical, you don't want to drop them, or you want to try very hard not to drop them. So you turn on things like queue mirroring and persistence. That slows things down; it has a performance impact which, if applied to RPC, is a complete waste of time. You don't want to persist RPC messages; they're temporary. What you'd rather do is tune each back end you're using to the kind of traffic pattern it's carrying.

But there's also a longer-term benefit, and I think it's going to be important for the NFV space and larger clouds: we no longer have to use the centralized broker, hub-and-spoke deployment model. We can now use a distributed architecture, right?
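The configuration split just described looks roughly like this. A sketch, with placeholder host names and credentials; at the time of this talk the RPC side still uses the plain get_transport() helper, since a dedicated get_rpc_transport() is only planned (see the next steps later in the talk).

```python
# In the service's .conf file, something like:
#
#   [DEFAULT]
#   transport_url = amqp://user:pass@router-host:5672/      # RPC traffic
#
#   [oslo_messaging_notifications]
#   transport_url = rabbit://user:pass@rabbit-host:5672/    # notifications
#
# Project code then allocates one transport per service:

import oslo_messaging
from oslo_config import cfg

conf = cfg.CONF

# Reads the [DEFAULT] transport_url.
rpc_transport = oslo_messaging.get_transport(conf)

# Prefers [oslo_messaging_notifications] transport_url when it is set,
# and falls back to the default URL otherwise.
notification_transport = oslo_messaging.get_notification_transport(conf)
```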
With a distributed architecture, you don't have to run multiple Rabbits under your cloud, like in a cells kind of configuration, talking to each other and keeping that coordinated. There are other protocols, other tools out there, that will do that automatically as part of their normal functioning.

So we're going to talk about two alternative messaging transports that are supported by Oslo messaging. One is based on ZeroMQ. The other is based on the existing AMQP 1.0 protocol transport, the stuff that people have been using off and on with Qpid. But we're not saying talk to qpidd; we've got a broker, Rabbit works fine. What we're talking about is a newer technology called the dispatch router, a lightweight switch, and we'll talk about it in a second, that provides those kinds of point-to-point semantics. I also want to mention there's an experimental driver. It's not conducive to the pattern I'm showing, but it's a notifications-only driver, and its intent is perhaps to provide the ability to scale further. It's an experiment using Kafka, the distributed streaming service. So instead of using Rabbit for notifications, you could use Kafka, but that's still under development.

So let's talk about ZeroMQ. The zero in ZeroMQ supposedly stands for zero brokers or zero overhead, depending on who you talk to. The ZeroMQ protocol is a very lightweight protocol built directly over TCP, and it uses TCP connections to communicate between two parties. It is a point-to-point protocol: you have to have a TCP connection between the two parties, okay? This is going to give you the lowest overhead and the lowest latency of any of the solutions I'm going to talk about today. However, as you can see on the left-hand side, as you add clients and servers, you start increasing the number of TCP connections rather dramatically. To help with that, the ZeroMQ driver has an additional component, a proxy that acts as a connection concentrator. The deployer's guide describes when to use it; the rough rule of thumb is that once you get above 100 nodes, you might want to consider the proxy. The proxy also supplies broadcast, because it's difficult to do broadcast in this point-to-point manner. There's also a matchmaker service, using Redis, that maps message topics, message addresses, to the actual hosts that service those messages. So there are a few components here, and there's a bump on the wire. But this is direct, there are no queues involved, and in the left-hand configuration it's extremely quick in terms of transfers. I recommend looking at the deployer's guide if you have any interest in playing with this.

Now I'm going to talk a little bit about the one that's near and dear to my heart, because I work on Qpid. The message router is not just about being point-to-point. It really is a solution for high availability, high availability through distribution. Think of the Internet. Think of Internet routers. The Internet is run by the intelligence of those routers, determining where addresses land and where consumers and producers, in messaging terms, live. They're connected in a mesh, a redundant configuration, so if you put a bullet through the head of one of them, it recovers automatically for you; the clients don't have to do any work. That's the same thing the message router does. It's a fast message switch.
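For orientation, which back end a service uses is selected by the scheme of its transport URL. This is a hedged sketch with placeholder hosts; in particular, the ZeroMQ driver's peer discovery goes through its Redis matchmaker and optional proxy, which are configured separately rather than in the URL, so the exact zmq URL form here is an assumption.

```python
import oslo_messaging
from oslo_config import cfg

conf = cfg.CONF

# RabbitMQ broker via the Kombu-based driver:
rabbit = oslo_messaging.get_transport(
    conf, url='rabbit://user:pass@broker-host:5672/')

# AMQP 1.0 driver pointed at a dispatch router (or a mesh of them):
amqp10 = oslo_messaging.get_transport(
    conf, url='amqp://user:pass@router-host:5672/')

# ZeroMQ driver; peers are resolved via the Redis matchmaker,
# configured in its own config section (see the deployer's guide).
zmq = oslo_messaging.get_transport(conf, url='zmq://')
```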
The dispatch router is like an IP router, but it's switching at the message level: it looks at message addresses, finds the clients and servers, and determines where messages go. So all the intelligence is in this network, and it is stateless. Like I said, you can deliberately destroy one of these things, and if you've built redundancy into your mesh, there will be failover. You can see that with the dotted lines; that's what I'm highlighting there. We did an entire presentation on this in Barcelona, a deep dive, and talked about the management tools and things like that. I highly recommend watching that video if you're interested in using this. There's also a deployer's guide if you want to play with it, and it has DevStack support, so you can use it in DevStack.

So, a quick overview of what Oslo messaging supports. There's Rabbit, using the Kombu API against a RabbitMQ server: the stuff we all know, which does RPC and notifications. There's AMQP 1.0 and ZeroMQ, which are primarily for RPC; you could technically run notifications over them, but as we know, the listener has to be present, or those messages will get dropped. And then there's Kafka, which is purely notifications, experimental, and shows a different tack on queuing messages.

So what we're going to do here is, Mark and I ran some tests with the objective of benchmarking the throughput of RPC, queue-based versus direct. What we're using is a tool called the Oslo messaging benchmark tool. As I said, I'm a developer on Oslo messaging, and I implemented most of the AMQP 1.0 driver. I had to build a tool that would test a distributed network of routers running Oslo messaging, and this tool is for that. This tool does not simulate an OpenStack project's traffic flow; it's much more punishing. This is a stress tool, a load tool. I use it to break the system, to throw traffic through Oslo messaging while I try destroying nodes, observing failover, loading things like that. So you're not going to see these levels in a typical OpenStack cloud, but we're pushing it to the point where you can see there is definitely a difference between the two approaches to RPC. Think of it as the "will it blend" tool of Oslo messaging.

Now, the test scenarios: we've only run them between Rabbit and the AMQP 1.0 Qpid dispatch router, just for lack of time and resources. We want to do more as we go on. And like I said, it's not a true OpenStack traffic pattern. You couldn't produce this kind of load using an OpenStack deployment; I don't know anybody who has, and if they have, I definitely want to talk to them. So we had to take some leeway in the ratio of notification messages to RPC, and in the producer-to-consumer ratios. We distribute consumers and producers throughout the message bus, pick a payload size, things like that. And when we did this test, we tested single instances. We didn't use a Rabbit cluster, we didn't use a router mesh. We just had single instances: a Rabbit broker and a message switch, a Qpid dispatch router.

And with that, I'm going to turn the podium over to my friend here, Mark, to explain what he saw. Thank you, Ken. You're welcome, Mark. So, as Ken mentioned, the scenarios are as before; I'll run through them real quick. The first scenario is a single broker back end: a single machine, one instance of Rabbit running. For the message mix, you know, RPC to notification traffic, we said 50-50.
Producers and consumers: based on the tool, we have two producers for every consumer, and that's mostly just to keep the tool loaded, the way it was written. And we used a payload size of 1K. Scenario two is basically the same thing, but we make the messages durable, which means they all get written to disk. This is what you'd really want to do for notifications, so that if your broker goes down for some reason, the data isn't lost from memory. It's stored on disk, and when the broker comes back, if you're using notifications for things like billing, that data is still there; that data is important. That was the second test. For the third test, we put Rabbit on two separate machines: one for notifications, where we kept it durable, and the second for non-durable RPC through Rabbit.

And we'll compare some data here. The graph on the left is latencies; the tool measures both throughput and latencies. The first problem we had is that when it's all on a single box, we couldn't run as many clients; the test would just become unreliable and fail. But you can see that with Rabbit, the latencies go up as the number of clients increases. And then for throughput, again, this is for just notifications, the blue line is when Rabbit is on its own machine, just for notifications; you're getting, I think it was about 18,000 calls per second for a single instance. For the single host, the red line at the bottom, you're splitting time between RPCs and notifications, so the red line is pretty low for throughput. The yellow line is when we split, and the RPCs are going through another machine; you can see the improvement there. For the RPC calls, you see the same thing. The most interesting thing here is the blue line for latency: it just goes straight up through the roof. So basically, for RPC calls through Rabbit, the latencies kill you as you scale, and you'll see the same thing for the rates: they all decrease as you go.

The fourth scenario was basically the same as the third, but we switched out RabbitMQ on the RPC server and put in the Qpid dispatch router. The system tunings were the same; we just changed things at the kernel level to allow a lot of connections, things like that. We didn't tune either product. So we're using the direct messaging that Ken talked about, plus the broker back end. For this test, since the broker was on a completely different machine, we didn't actually rerun the notification test; we just did straight RPC calls, because they were going to a distinct machine.

So the blue line on top is the Qpid dispatch router. You can see we got upwards of 33, 34,000 calls per second. And remember that for the tool, a call sends the message through the router and the server echoes the whole payload back, so there are really a lot of messages going through there, and one call is the complete round trip. You can see that Rabbit decayed as it got loaded. Then when we switch to the latencies, you can see the Qpid dispatch router stays pretty much flat. It increases a little bit, but compared to Rabbit it stays really low. And then in the final comparison here, you can see all the graphs put together; the latencies of the Qpid dispatch router really stand out, as well as the throughput. And the final graph that I have is just CPU utilization.
Rabbit uses more CPU, even though it's doing about a third of the throughput. And as I said, I didn't tune either of these, and I work on the performance and scale of the router; I know I could tune it to get higher throughput and lower CPU utilization. But I wanted to do a fair comparison.

So, next steps. You want to handle that? Yeah, sure. Thank you, Mark. As I mentioned before, there is a DevStack plugin that supports a Qpid hybrid mode, where it talks to RabbitMQ for notifications and a single instance of the dispatch router for RPC. There's also a ZeroMQ DevStack plugin, but I don't believe that supports hybrid at this point. So that's something for you folks to play with, if you'd like. But also, we're trying to educate developers to not think of messaging as just Rabbit, but to think at the higher level, in terms of the patterns Oslo messaging provides. And more importantly, make sure your code allows the separation: that it uses get_notification_transport and the soon-to-be-included get_rpc_transport, so folks can deploy this with ease. And of course, we need people on Oslo messaging. I'll put my Oslo messaging hat on: we've got these other interesting technologies we're working on, and if you find any of them interesting, please pitch in. Just come on in and start playing; we can definitely use the help.

Going forward, the CI checks: I've actually got a patch in progress for the OpenStack upstream CI to add a Qpid hybrid-mode gating test for the Oslo messaging tests. They're also being used, I think, by Heat too. What is it, a Neutron integration? Or is it just Tempest? I think it's a Tempest-based test. We'd like to test and measure additional hybrid scenarios. And we've done some work in TripleO. There's a lot of work there when you consider that most of this stuff, TripleO and the other deployment tools, hard-codes Rabbit everywhere. So if you want to do something like ZeroMQ, you have to undo all that; you've got to take special steps. We're doing a lot of work in that area in TripleO, and it should be possible to deploy at least a hybrid back end fairly soon. So that's all I had. Oops. Any questions?

First of all, very good talk, thank you for that. Yes, I have a question: have you considered, or did you look at, using Envoy from Lyft as the message bus? Envoy? Envoy, yes. No. Any reason for that? Would it not be suitable? I wasn't aware of it. What is Envoy? That's probably why. Well, they open-sourced it recently, from what I recall, and today it's like a distributed message routing system which is gaining quite a bit of traction, so I was just wondering. No. Well, these are messaging technologies that are supported by the Oslo messaging library. And is Envoy supported by Oslo messaging? I'm pretty sure it is not. So basically it would have to be added as a driver to Oslo messaging. Oh, okay. Yeah, these tests were based on Oslo messaging, so we couldn't use any back end it doesn't support. But that's an interesting point. If you know developers or people interested in adding that capability to Oslo messaging, it's certainly something you should bring to the community, to the Oslo IRC meetings, which are on Mondays (UTC), and propose it, if you're interested in developing that stuff. All right, thanks.
Sounds cool. Thank you. Question. Thanks a lot. So, with the split that you described, that came in Mitaka, in Oslo, to have two mechanisms for the transport: is this something that needs to be adopted by the various projects in order to use it, or is it already available everywhere? Yes, it has to be adopted by the projects. And what's the status of that? Well, we've shaken out most of the problems, but as we add these CI tests, and I called this out specifically, we have to add more CI tests to get more coverage, to ensure it keeps working going forward so people don't break these things. And we have to put more enforcement into Oslo messaging. So it's a gradual thing. If you find a project that isn't doing this, or where this is broken, that's a bug, and we have to raise it and fix it.

Right. Second question: how much of this would you say is production quality already? If I want to set this up now, say for my Neutron RabbitMQ cluster that gives me headaches from time to time, because we have quite a large number of nodes, is it something where you would say, okay, this is already production quality? Or would you say, be rather careful, try it out, and let us know if you hit some limits? I am very conservative, right? And to be honest, there are human decades of cumulative time that have gone into getting Rabbit tuned. Do you remember how many problems there were initially with the Rabbit back end in Kilo and all those things? These have not been tested to that level, okay? So I would say developer ready, definitely: CI tested, fully tested. But for production, you probably want to play with it first.

Just to speak on behalf of the drivers: we had some developers working on the ZeroMQ stuff who got it developer ready. We're looking for more volunteers there, because there's been some movement around who supports that driver. The AMQP 1.0 driver has been in there for about three years now; I support it, Andy supports it. So the developers are there, and we also work on the Qpid project separate from OpenStack. And the Qpid dispatch router is actually being used in production by a couple of different companies, and it just went GA. Okay, I'm putting my Red Hat hat on: it's called Interconnect, it's a product from Red Hat. So probably the biggest risk would be the driver itself, all right? You're welcome.

If you use RabbitMQ, it already has good support for secure interfaces, and you really have to secure your messaging. But on the other hand, if you try to use ZeroMQ, although it's low latency, I don't see any clear path to security. Do you know of any alternatives with that level of low latency where you can also have secure messaging? Well, yes, good point, thank you. For ZeroMQ, there's limited security support at this time; we've got work items there, and we're looking for people to help with that. I can speak to the Qpid dispatch router: it does support SSL, it supports client authentication via SSL, and it supports SASL and Kerberos. So that's a low-latency, secure solution.

In your implementation, you introduce a new entity, the router. And my question is: could we go with something more symmetric, where you have as many routers as you have clients and servers, so messages just go a little bit further? Why did you choose that router entity? I don't think I understand the question.
Is the concern that there'd be too many routers, too many hops, things like that? Yes. You have that entity that you call a router, okay? Then you have your servers and your clients connected to those routers, because there are many routers. And my question is: can I have as many routers as I have clients and servers? You could have that, yes. Right now I think there's a hard limit, though. Let me preface this: these are fairly heavyweight things, heavyweight in the sense that they're considered comparable to a broker. They're not micro-routers. They don't hold the state a broker holds, and they do the whole routing protocol thing, but they're not something you can really embed anywhere. So, like MQTT, I think, is that what you were talking about? Did you mention that? No. I'm sorry, that's IoT. I know, it's been a long week. What we're recommending at this point is one dispatch router per data center, that kind of scale. Okay. So you'd have multiple clients per router, but you can scale the network out to, I think, as high as 128 routers. Yeah. Okay. All right. So they are somewhat of a connection concentrator, too.

Anything else? Okay. Thank you. Thank you so much for coming, and thank you for the questions. Thank you. Good night. And certainly, if you want to follow up with questions or criticisms or ideas, let me put my contact information up here. Okay. Thank you.